Using Machine Learning to Predict Black Friday Sales
Machine learning is about helping systems learn from data, identify patterns, and make decisions with minimal human intervention.
We will take a look at the Black Friday Sales hackathon from Analytics Vidhya. We want to predict the purchase amounts of customers for various products.
The steps to go about this:
- Define a model: Here you select the model (decision trees, random forests, XGBRegressor, etc.). It is usually defined like this:
my_model = ModelName()
- Fit the model: This refers to training the model, taking the model we’ve defined and applying it to our dataset so it captures the patterns in the data. It is done like this:
my_model.fit(features, target)
Features are the inputs, the variables that help predict the target, that is, the output.
- Predict: After building and training a model, we apply it to data it hasn’t seen before to make predictions.
my_model.predict(data)
- Validate: This involves determining how accurate the model’s predictions are.
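Putting these four steps together, here is a minimal sketch using scikit-learn, assuming the data has already been split into X_train/y_train and X_valid/y_valid (the model choice is just an example):
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# Define a model
my_model = RandomForestRegressor(random_state=0)
# Fit the model on the training features and target
my_model.fit(X_train, y_train)
# Predict on data the model has not seen before
predictions = my_model.predict(X_valid)
# Validate: how far off are the predictions on average?
print(mean_absolute_error(y_valid, predictions))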
Before we start defining or training our model, we have to understand our dataset. We do this by asking questions. For example: What are the columns in my dataset? How many rows are there? Are there any missing values? What’s the average of a particular numerical column? And so on.
This is why we perform basic data exploration, or Exploratory Data Analysis (EDA). Pandas, a Python library, is the primary tool used for exploring and manipulating data. It is abbreviated in code as pd.
Let’s see how we answer some of these questions on our Black Friday Sales dataset.
Basic Data Exploration
- First we import all the libraries we need.
- Then we read the datasets into data frames using pandas.
The train dataset is the sample of data used to fit the model; it is further split into train and validation sets (the validation set is used to evaluate the model during development). The test dataset is the sample of data used to provide an unbiased evaluation of the final model fit on the training dataset.
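For example, a minimal sketch (assuming the competition files are named train.csv, test.csv and sample_submission.csv):
import pandas as pd
# Read the datasets into data frames
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
sample = pd.read_csv('sample_submission.csv')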
- Let’s see the first few rows of the datasets we just loaded in. We do this using the .head() function.
train.head()
test.head()
sample.head() # this is how our submission file to the competition should look
- To find out which columns we have, we use the .columns attribute.
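For example:
# List the column labels of each data frame
train.columns
test.columns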
- Are there any missing values? Let’s check.
train.isnull().sum().sort_values()
test.isnull().sum().sort_values()
Missing data can cause problems when building our model; we will see how to handle that as we go on.
- Statistics from our data: use the .describe() function.
- Use the .info() function for a concise summary of the dataset.
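For example:
# Count, mean, std, min, quartiles and max of the numerical columns
train.describe()
# Column names, dtypes and non-null counts
train.info()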
The model also cannot handle raw categories, since it doesn’t know how to fit on strings. Keep reading; we will handle that too.
Let’s also gain some insights using visualization.
Which gender purchases more?
Which age group makes the most purchases?
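Here is a sketch of one way to answer these questions visually, assuming seaborn is available and using the competition’s Gender, Age and Purchase columns:
import seaborn as sns
import matplotlib.pyplot as plt
# Total purchase amount by gender
sns.barplot(x='Gender', y='Purchase', data=train, estimator=sum)
plt.show()
# Total purchase amount by age group
sns.barplot(x='Age', y='Purchase', data=train, estimator=sum)
plt.show()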
Preprocessing the dataset, fitting the model and making predictions
Now that we have gotten a feel for the dataset, it’s time to preprocess it.
I like to split the train dataset into training and validation sets first, then perform the preprocessing, so no information from the validation set leaks into the training steps.
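A sketch of that split, assuming the target column is Purchase and using scikit-learn’s train_test_split (the variable names match the snippets that follow):
from sklearn.model_selection import train_test_split
# Separate the target from the features
y = train['Purchase']
X = train.drop('Purchase', axis=1)
# Hold out 20% of the train data for validation
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, random_state=0)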
We start by choosing the columns we want to use as features: all the columns excluding User_ID and Product_ID.
main_cols = X_train_full.columns.difference(['User_ID', 'Product_ID'])
We separate the main_cols into categorical and numerical columns, then keep only the selected columns.
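One way to do that, assuming the object-dtype columns are the categorical ones:
# Categorical columns: stored as strings/objects
categorical_cols = [col for col in main_cols if X_train_full[col].dtype == 'object']
# Numerical columns: integer or float dtypes
numerical_cols = [col for col in main_cols if X_train_full[col].dtype in ['int64', 'float64']]
# Keep only the selected columns
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()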
Let’s move on to handling missing data and categorical variables.
We will do this using pipelines. A pipeline is a simple way to keep data preprocessing and modeling code organized: it bundles the preprocessing and modeling steps together so the whole bundle can be used as if it were a single step.
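Here is a minimal sketch of such a pipeline, assuming scikit-learn’s SimpleImputer and OneHotEncoder for the preprocessing and a RandomForestRegressor as the model (the notebook may use a different model, e.g. XGBRegressor):
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
# Impute missing numerical values with the column median
numerical_transformer = SimpleImputer(strategy='median')
# Impute missing categorical values, then one-hot encode them
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
# Bundle the preprocessing for both column types
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_cols),
    ('cat', categorical_transformer, categorical_cols)])
# Bundle preprocessing and modeling into a single pipeline
my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(n_estimators=100, random_state=0))])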
We fit our model next.
# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)
Predictions are made as follows:
# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)
We then evaluate the model using the mean absolute error (MAE) to see how well it has performed. You can find other types of accuracy metrics here.
from sklearn.metrics import mean_absolute_error
# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)
We can improve the model’s accuracy in a number of ways: using a different model, selecting fewer features, balancing the dataset, etc.
Finally, we apply our model to the test dataset.
# Preprocessing of test data, get predictions
preds_test = my_pipeline.predict(X_test)
Time to submit to the competition!
We will check whether the sample submission file lines up row-for-row with the test dataset. If it does, we will replace the target column of the sample submission file with our predicted test target, as sketched below.
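A sketch of that final step, assuming the submission template has User_ID, Product_ID and Purchase columns as in the competition:
# Check that the identifier columns line up with the test set
assert (sample['User_ID'] == test['User_ID']).all()
assert (sample['Product_ID'] == test['Product_ID']).all()
# Replace the target column with our predictions and save
sample['Purchase'] = preds_test
sample.to_csv('submission.csv', index=False)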
View the whole Jupyter Notebook on my GitHub Repo.
This hackathon and article are the final project of the She Code Africa Mentorship Program Cohort 4, marking its completion. The journey has been a rollercoaster ride over the last 3 months.
There has been immense growth in both technical and soft skills, from SQL, Power BI, and Machine Learning to Communication, Problem-Solving, Time Management, Adaptability…
I got to meet awesome ladies Florence Egwu, Purity Supaki, Tolulope Oladeji, Oluwatosin Lasisi. Special thanks to my mentor Kolawole Precious and She Code Africa — Admin.
I will definitely keep on growing as I am still on the journey.
Do share and clap.