Data Wrangling: Step 2 of the Data Analysis Process
In my previous article, I established the importance of asking the right questions as the first step of the data analysis process.
After asking the right questions and identifying the data needed to answer them, the second step in the data analysis process is Data Wrangling.
Just as not everything in life is easy to work with, the same is true of data. Imagine you’re planning a road trip. Before you hit the road, you need to figure out where you’re going, how to get there, and what you need for the journey.
Wrangling is similar to organising
Similarly, step 2 of the data analysis process is about getting the data ready for the journey ahead: ensuring it is of good quality and structure before you begin the analysis, visualization, or building predictive models using machine learning.
Data wrangling is the process of gathering, assessing, and cleaning your data to make sure it’s in the best quality and structure possible for analysis.
Data wrangling is a three-step process: gather, assess, and clean.
1. Data Gathering
The first step in data wrangling is gathering the data you need to answer your questions.
Depending on the source and format of your data, the steps for gathering it may vary. This could include downloading a file from the internet, scraping a web page, or querying an API. Once the data is obtained, you’ll need to import it into the software you’re working with, such as Power BI or a programming environment like a Jupyter notebook.
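For example, here is a minimal sketch of two common gathering methods in Python with pandas. The file name and API URL below are placeholders for illustration, not real sources:

```python
import pandas as pd
import requests

# 1. Load a local CSV file that has already been downloaded
# ("sales_data.csv" is a hypothetical file name)
df = pd.read_csv("sales_data.csv")

# 2. Query an API and load the JSON response into a DataFrame
# (the URL is a placeholder; assumes the API returns a list of records)
response = requests.get("https://api.example.com/v1/records")
response.raise_for_status()  # stop early if the request failed
api_df = pd.DataFrame(response.json())

print(df.head())
print(api_df.head())
```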
2. Data Assessment
The next step is assessing the data.
This means checking for both quality and tidiness issues.
Quality issues refer to problems with the data content, such as missing values, inconsistent data, incorrect data types, and duplicates.
Tidiness issues refer to problems with the data structure that make it challenging to analyze, such as having multiple variables in one column.
As Hadley Wickham puts it in his paper Tidy Data, data is tidy when:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
To assess the data, you can use visual methods, such as scrolling through the data in a software application and looking for quality and tidiness issues, and programmatic methods, such as using small code snippets and functions to check the data quality.
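For instance, here are a few pandas checks you could run for a programmatic assessment. This assumes the DataFrame df gathered earlier; the category column is just an illustrative name:

```python
# Common programmatic assessment checks with pandas

df.info()                       # column data types and non-null counts
print(df.isnull().sum())        # missing values per column
print(df.duplicated().sum())    # number of fully duplicated rows
print(df.describe())            # summary statistics to spot odd values
print(df["category"].unique())  # inconsistent labels (hypothetical column)
```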
3. Data Cleaning
After identifying any issues with the data, the next step is cleaning it. The goal is to resolve the issues identified during assessment.
Common data cleaning techniques include correcting data types, handling missing values, dropping duplicates, dropping unwanted columns, and merging multiple data frames, all to ensure the data is of the highest quality and well structured.
There are two types of cleaning: manual and programmatic.
Manual cleaning is only recommended if the issues are single occurrences, as it can be time-consuming and prone to errors.
Programmatic cleaning is more efficient and reliable, as it involves converting your defined cleaning tasks into code and then running that code.
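To illustrate, here is a hedged sketch of what those cleaning tasks might look like as pandas code. The column names order_date, price, and notes are hypothetical:

```python
import pandas as pd

# Correct data types
df["order_date"] = pd.to_datetime(df["order_date"])
df["price"] = df["price"].astype(float)

# Handle missing values (here, filling with the column median)
df["price"] = df["price"].fillna(df["price"].median())

# Drop duplicates and unwanted columns
df = df.drop_duplicates()
df = df.drop(columns=["notes"])

# Merge another DataFrame on a shared key (illustrative)
# df = df.merge(customers_df, on="customer_id", how="left")

# Reassess to confirm the fix worked
assert df["price"].isnull().sum() == 0
```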
Summing up,
After cleaning your data, it’s important to reassess it and iterate on any of the steps if necessary. With a bit of effort and patience, you’ll end up with a clean, well-structured dataset that is accurate and reliable for analysis.
Finally, you may choose to store your data for future use in a file or database.
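For example, with pandas you could save the cleaned DataFrame to a CSV file or a SQLite database. The file and table names here are illustrative:

```python
import sqlite3

# Save to a file
df.to_csv("clean_sales_data.csv", index=False)

# Or save to a SQLite database
conn = sqlite3.connect("sales.db")
df.to_sql("clean_sales", conn, if_exists="replace", index=False)
conn.close()
```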
In the next article, I will do a project walkthrough on how to go about data wrangling. Give this article 50 claps and follow for more.