Using Machine Learning to Predict Black Friday Sales

Eniola Ogunmona
5 min read · May 20, 2021


Machine learning is about helping systems learn from data, identify patterns, and make decisions with minimal human intervention.

We will take a look at the Black Friday Sales hackathon from Analytics Vidhya, where the goal is to predict the purchase amount of customers across various products.

The steps to go about this are:

  • Define a model: Here you select the model (decision trees, random forests, XGBRegressor, etc.). It is usually defined like this: my_model = ModelName().
  • Fit the model: This refers to training the model, taking the model we’ve defined and applying it to our dataset so it captures the patterns in the data. It is written as my_model.fit(features, target). Features are the inputs, the variables that help predict the target, that is, the output.
  • Predict: After building and training a model, we apply it to data it hasn’t seen before to make predictions: my_model.predict(data).
  • Validate: This involves determining how accurate the model’s predictions are (a sketch of all four steps follows below).
Steps to building and using a model
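
Here is a minimal sketch of these four steps with scikit-learn, assuming a random forest model and feature/target splits named X_train, y_train, X_valid and y_valid (we will build these for our dataset later in the article):

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Define the model
my_model = RandomForestRegressor(random_state=0)

# Fit the model on the training features and target
my_model.fit(X_train, y_train)

# Predict on data the model has not seen before
preds = my_model.predict(X_valid)

# Validate: compare the predictions against the true values
print(mean_absolute_error(y_valid, preds))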

Before we start defining our model or training it, we have to understand our dataset. We do this by asking questions. For example: What are the columns in my dataset? How many rows does it have? Are there any missing values? What is the average of a particular numerical column? And so on.

This is why we perform basic data exploration or Exploratory Data Analysis (EDA). Pandas (a Python library) is the primary tool used for exploring and manipulating data. It is abbreviated in code as pd.

Let’s see how we answer some of these questions on our Black Friday Sales dataset.

Basic Data Exploration

  • First we import all the libraries we need.
  • Then we read the datasets into data frames using pandas (both steps are sketched below).

The train dataset is the sample of data used to fit the model; it is further split into train and validation sets (the validation set, also called the test set, is used for evaluation). The test dataset is the sample of data used to provide an unbiased evaluation of the final model fit on the training dataset.
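
In code, this is roughly how the two steps above look (the file names here are assumptions; use the names of the files you downloaded from the competition page):

import pandas as pd

# Read the competition files into pandas data frames
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
sample = pd.read_csv('sample_submission.csv')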

  • Let’s see the first few rows of the datasets we just loaded in. We do this using the .head() function.
train.head()
First 5 rows of train dataset
test.head()
First 5 rows of test dataset
sample.head() #this is how our submission file to the competition should look
First 5 rows of sample submission dataset
  • How many rows and columns are in my dataset? We answer this using the .shape attribute.
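For example:
train.shape #returns (number of rows, number of columns)
test.shape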
Number of rows and columns in our datasets
  • To find out which columns we have, we use the .columns attribute.
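For instance:
train.columns #lists all the column names in the train dataset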
Columns in the train dataset
  • Are there any missing values? Let’s check.
train.isnull().sum().sort_values()
Number of missing values in train dataset
test.isnull().sum().sort_values()
Number of missing values in test dataset

Missing data can cause problems when building our model; we will see how to handle that as we go on.

  • What do the statistics of our data look like? Use the .describe() function.
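For example:
train.describe() #summary statistics of the numerical columns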
Descriptive statistics of train dataset
  • Use the .info() function for a concise summary of the dataset.
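Like so:
train.info() #data types and non-null counts for each column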

The model also cannot handle raw categories since it doesn’t know how to fit on strings. Keep reading; we will handle that too.

Let’s also gain some insights using visualization.

Which gender purchases more?

The male gender makes more purchases

Which age group makes the most purchase?

The 18–25 age group makes the most purchases
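
As a rough sketch, plots like the two above can be produced with seaborn and matplotlib, using the Gender and Age columns of the train dataset:

import seaborn as sns
import matplotlib.pyplot as plt

# Number of purchase records for each gender
sns.countplot(x='Gender', data=train)
plt.show()

# Number of purchase records for each age group
sns.countplot(x='Age', data=train)
plt.show()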

Preprocessing the dataset, fitting the model and making predictions

Now that we have gotten a feel of the dataset, it’s time to preprocess the dataset.

I like to split the train dataset first, then perform the preprocessing using the training split.
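
A minimal sketch of that split, assuming Purchase is the target column we want to predict:

from sklearn.model_selection import train_test_split

# Separate the target from the features
X = train.drop(['Purchase'], axis=1)
y = train['Purchase']

# Split into training and validation sets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)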

We start by choosing the columns we want to use as features, that is, all the columns excluding User_ID and Product_ID.

main_cols = X_train_full.columns.difference(['User_ID', 'Product_ID'])

We separate main_cols into categorical and numerical columns, then keep only the selected columns (one way to do this is sketched below).
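
One way to do this, building on the split and main_cols above:

# Categorical columns
categorical_cols = [col for col in main_cols if X_train_full[col].dtype == 'object']

# Numerical columns
numerical_cols = [col for col in main_cols if X_train_full[col].dtype in ['int64', 'float64']]

# Keep only the selected columns
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = test[my_cols].copy()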

Let’s move on to handle missing data and categorical variables.

We will do this using pipelines. A pipeline is a simple way to keep data preprocessing and modeling code organized: it bundles the preprocessing and modeling steps together so the whole bundle can be used as if it were a single step.
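
Here is a sketch of such a pipeline with scikit-learn; the imputation strategies and the random forest model are assumptions you can swap out:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

# Preprocessing for numerical data: fill in missing values
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data: fill in missing values, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle the preprocessing for numerical and categorical columns
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_cols),
    ('cat', categorical_transformer, categorical_cols)
])

# Bundle preprocessing and modeling in one pipeline
my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(n_estimators=100, random_state=0))
])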

We fit our model next.

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

We make predictions as follows:

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

We then evaluate the model using the mean absolute error (MAE) to see how well it has performed. There are other accuracy metrics you can use as well.

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

We can improve the model’s accuracy in a number of ways: using a different model, selecting fewer features, balancing the dataset, etc.

Finally we apply our model on the test dataset.

# Preprocessing of test data, get predictions
preds_test = my_pipeline.predict(X_test)

Time to submit to the competition!

We will check that the rows of the sample submission file line up with the test dataset. If they do, we will replace the target column of the sample submission file with our predicted test target.
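
A sketch of that final step, assuming the target column in the sample submission file is named Purchase:

# Replace the target column with our test predictions and save the submission file
sample['Purchase'] = preds_test
sample.to_csv('submission.csv', index=False)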

View the whole Jupyter Notebook on my GitHub Repo.

This hackathon and article are the final project of She Code Africa’s Mentorship Program Cohort 4, marking its completion. The journey has been a rollercoaster ride over the last 3 months.

There has been immense growth in both technical and soft skills, from SQL, Power BI, and Machine Learning to Communication, Problem-Solving, Time Management, Adaptability…

I got to meet awesome ladies Florence Egwu, Purity Supaki, Tolulope Oladeji, Oluwatosin Lasisi. Special thanks to my mentor Kolawole Precious and She Code Africa — Admin.

I will definitely keep on growing as I am still on the journey.

Do share and clap.
