In my previous post, I wrote an introduction to wrangling data — step 2 of the data analysis process.
In this article, we will walk through a data wrangling project step by step.
The first step in data wrangling is to gather data.
The data used is based on a Twitter account, ‘WeRateDogs’ which rates dogs from a humorous point of view.
Let’s see some of the dogs I’m talking about
Yep, dogs like the one above 🥺
And like this 👇
These dog ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. Well, we know why now 😉
The data needed for this analysis is collected from three sources:
- The WeRateDogs Twitter archive file, provided directly by Udacity, containing tweets about random dogs.
- An image predictions file (dog breed predictions for each tweet’s image), programmatically downloaded from the Udacity servers.
- Additional information on each tweet, like favourite and retweet counts, gathered by querying Twitter’s API with the Tweepy library.
Here’s how I loaded each dataset:
First things first, importing the necessary libraries
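The import cell isn’t reproduced above, so here’s a minimal sketch of the libraries the rest of the code in this post leans on:
# Libraries used throughout the gathering, assessing and cleaning steps
import os
import json
import requests
import tweepy
from tweepy import OAuthHandler
import pandas as pd
from timeit import default_timer as timer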
Now, let’s get the first dataset, the WeRateDogs Twitter archive.
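The snippet for this step isn’t shown, so here’s a minimal sketch; the filename twitter-archive-enhanced.csv and the variable name tweet_archive are assumptions (the archive is referred to as tweet_archive later in the post).
# Loading the WeRateDogs Twitter archive provided by Udacity
# (the filename is assumed; adjust it to match your local copy)
tweet_archive = pd.read_csv("twitter-archive-enhanced.csv")
tweet_archive.head()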
The second dataset is hosted online, and we’ll use the requests library to download it.
# Creating a folder to store the downloaded data if it doesn't already exist
folder_name = "image_predictions"
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

# Getting the data from the server using the requests library
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
response = requests.get(url)

# Storing the file locally
with open(os.path.join(folder_name, url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)

# Checking to see if the data is saved in the created folder
os.listdir(folder_name)

# Loading in the tweet image predictions file (it lives inside the folder we just created)
image_predictions = pd.read_csv(os.path.join(folder_name, "image-predictions.tsv"), sep="\t")
Let’s see what the data looks like
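If you’re following along, a quick peek can be taken like this:
# Taking a quick look at the image predictions data
image_predictions.head()
image_predictions.info()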
Moving on to the third dataset, which is gathered using the Twitter (now X) API.
# Querying Twitter's API for each tweet in the Twitter archive and saving the JSON in a text file
# These are hidden to comply with Twitter's API terms and conditions
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)
# Note: the Twitter API code below was sent to me by a Udacity instructor
# because I had mobile verification issues with Twitter
# Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = df_1.tweet_id.values
len(tweet_ids)
# Querying Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Saving each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
            pass
end = timer()
print(end - start)
print(fails_dict)
# Extracting information of interest from the JSON stored in the text file and putting it in a dataframe
tweet_list = []
with open("tweet-json.txt", "r") as file:
    for line in file:
        tweets = json.loads(line)
        tweet_id = tweets["id"]
        retweet_count = tweets["retweet_count"]
        favorite_count = tweets["favorite_count"]
        tweet_list.append({"tweet_id": tweet_id, "retweet_count": retweet_count, "favorite_count": favorite_count})

tweet_status = pd.DataFrame(tweet_list, columns=["tweet_id", "retweet_count", "favorite_count"])
All the data has been loaded up! It’s time to get into it 🤭
The next step is to assess the data.
Remember, this has to do with checking for both quality and tidiness issues. This can be done visually or programmatically.
Quality issues refer to problems with the data content, such as missing values, inconsistent data, incorrect data types, and duplicates.
Tidiness issues refer to problems with the data structure that make it challenging to analyze, such as having multiple variables in one column.
As Hadley Wickham puts it in his paper Tidy Data, data is tidy when:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
Let’s first identify the quality issues.
Looking at the tweet_archive dataset to see if there are any issues, we make use of commands like tweet_archive.info(), plus per-column checks with .unique() and .value_counts() (these two work on individual columns rather than the whole dataframe).
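For anyone following along, here’s a rough sketch of those checks; the name and rating_denominator columns are the ones behind the issues listed next.
# Overview of columns, data types and non-null counts
tweet_archive.info()

# Per-column checks; unique() and value_counts() are Series methods,
# so they are applied to individual columns
tweet_archive.name.unique()
tweet_archive.rating_denominator.value_counts()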
Issues noted from here:
- Retweets are present; we want only the original tweets
- Multiple columns contain null values
- Incorrect data type for the timestamp column
- There are 23 rows where the rating denominator is not equal to 10
- Weird dog names like “None” and “a”
- Null values are represented as the string “None” in the doggo, floofer, pupper and puppo columns
Moving to the image_predictions dataset, the issues noted are:
- Non-descriptive column names (p1, p2, p3, p1_conf, p2_conf, p3_conf)
- Duplicate values present in the jpg_url column
- Inconsistent dog breed name formatting in the p1, p2, p3 columns
- Some tweets do not refer to dog ratings
The tidiness issues are:
- Information on the tweets is in 3 tables
- The text column contains repetitive information
- One variable (dog stage) is spread across 4 columns (doggo, floofer, pupper, puppo)
Having identified all the issues, the next step is to clean.
In this phase, I cleaned all the issues identified during the assessment stage, using the Define-Code-Test approach: define the fix in plain words, write the code, then test that it worked, as shown in the sketch below.
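As an illustration, here’s a hedged sketch of one Define-Code-Test step, removing retweets; the retweeted_status_id column name matches the standard WeRateDogs archive, but treat this as a sketch rather than the exact code from my notebook.
# Define: remove retweets so only original tweets remain

# Code
tweet_archive_clean = tweet_archive.copy()
tweet_archive_clean = tweet_archive_clean[tweet_archive_clean.retweeted_status_id.isnull()]

# Test: no retweets should be left
assert tweet_archive_clean.retweeted_status_id.notnull().sum() == 0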
Now that we’ve addressed all issues, it’s important to store these changes.
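A minimal sketch of that storing step, assuming the three cleaned tables are merged on tweet_id into one master dataframe (the variable and file names here are my assumptions, not necessarily the exact ones from the notebook):
# Merging the three cleaned tables into one master dataframe and saving it
twitter_archive_master = (
    tweet_archive_clean
    .merge(image_predictions, on="tweet_id")
    .merge(tweet_status, on="tweet_id")
)
twitter_archive_master.to_csv("twitter_archive_master.csv", index=False)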
Yay! We’re done. It was definitely a long process but it was worth it 💯
After going through the data wrangling process, you can now query the data and visualize insights. The fun part.
The most common dog breeds are Golden Retriever, Pembroke, Labrador Retriever, Chihuahua, Pug, Pomeranian, Toy Poodle, Malamute, Chow and French Bulldog.
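For context, that list comes from counting the model’s top breed prediction per image; roughly something like this, assuming the p1 and p1_dog columns from the image predictions file carried through to the master dataframe:
# Counting the most common predicted breeds using the top prediction (p1),
# keeping only rows where the top prediction is actually a dog breed
top_breeds = (
    twitter_archive_master[twitter_archive_master.p1_dog]
    .p1.str.replace("_", " ").str.title()
    .value_counts()
    .head(10)
)
print(top_breeds)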
Let’s see a picture of a few
That brings us to the end. The data wrangling project was fun to work on, and I learned a lot from it.
Get the dataset and the code on my GitHub Repo.
I know it’s a long read, but you made it. Look out for the next article on Step 3 of the data analysis process. Go ahead to give this article 50 claps 😊