Retail Analysis with Walmart Data – Project

Estimated read time 7 min read

DESCRIPTION

One of the leading retail stores in the US, Walmart, would like to predict the sales and demand accurately. There are certain events and holidays which impact sales on each day. There are sales data available for 45 stores of Walmart. The business is facing a challenge due to unforeseen demands and runs out of stock some times, due to the inappropriate machine learning algorithm. An ideal ML algorithm will predict demand accurately and ingest factors like economic conditions including CPI, Unemployment Index, etc.

Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of all, which are the Super Bowl, Labour Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data. Historical sales data for 45 Walmart stores located in different regions are available.

Dataset Description

This is the historical data that covers sales from 2010-02-05 to 2012-11-01, in the file Walmart_Store_sales. Within this file you will find the following fields:

  • Store – the store number
  • Date – the week of sales
  • Weekly_Sales –  sales for the given store
  • Holiday_Flag – whether the week is a special holiday week 1 – Holiday week 0 – Non-holiday week
  • Temperature – Temperature on the day of sale
  • Fuel_Price – Cost of fuel in the region
  • CPI – Prevailing consumer price index
  • Unemployment – Prevailing unemployment rate

Holiday Events

Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
Labour Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13

Analysis Tasks

Basic Statistics tasks

  • Which store has maximum sales
  • Which store has maximum standard deviation i.e., the sales vary a lot. Also, find out the coefficient of mean to standard deviation
  • Which store/s has good quarterly growth rate in Q3’2012
  • Some holidays have a negative impact on sales. Find out holidays which have higher sales than the mean sales in non-holiday season for all stores together
  • Provide a monthly and semester view of sales in units and give insights

Statistical Model

For Store 1 – Build  prediction models to forecast demand

  • Linear Regression – Utilize variables like date and restructure dates as 1 for 5 Feb 2010 (starting from the earliest date in order). Hypothesize if CPI, unemployment, and fuel price have any impact on sales.
  • Change dates into days by creating new variable.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from patsy import dmatrices
import sklearn
import seaborn as sns
walmart_data = pd.read_csv("Walmart_Store_sales.csv")
walmart_data.head()
walmart_data_groupby = walmart_data.groupby('Store')['Weekly_Sales'].sum()
print("Store Number {} has maximum Sales. Sum of Total Sales {}".format(walmart_data_groupby.idxmax
(),walmart_data_groupby.max()))

Store Number 20 has maximum Sales. Sum of Total Sales 301397792.46000004

walmart_data_std = walmart_data.groupby('Store').agg({'Weekly_Sales':'std'})
print("Store Number {} has maximum Standard Deviation. STD {}".format(walmart_data_std['Weekly_Sales'
].idxmax(),walmart_data_std['Weekly_Sales'].max()))

Store Number 14 has maximum Standard Deviation. STD 317569.9494755081

walmart_data_std = walmart_data.groupby('Store').agg({'Weekly_Sales':['mean','std']})
walmart_data_std.head()
walmart_data_Q32012 = walmart_data[(pd.to_datetime(walmart_data['Date']) >= pd.to_datetime('07-01-201
2')) & (pd.to_datetime(walmart_data['Date']) <= pd.to_datetime('09-30-2012'))]
walmart_data_growth = walmart_data_Q32012.groupby(['Store'])['Weekly_Sales'].sum()
print("Store Number {} has Good Quartely Growth in Q3'2012 {}".format(walmart_data_growth.idxmax(),w
almart_data_growth.max()))

Store Number 4 has Good Quartely Growth in Q3’2012 25652119.35

# Stores Holiday Sales
stores_holiday_sales = walmart_data[walmart_data['Holiday_Flag'] == 1]

# Stores Weekday Sales
stores_nonholiday_sales = walmart_data[walmart_data['Holiday_Flag'] == 0]

#Stores Sales in Super Bowl Day 
#Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13

stores_holiday_sales_superBowl = stores_holiday_sales[(pd.to_datetime(stores_holiday_sales['Date']) =
= pd.to_datetime('12-02-2010')) |(pd.to_datetime(stores_holiday_sales['Date']) == pd.to_datetime('11-
02-2011'))|(pd.to_datetime(stores_holiday_sales['Date']) == pd.to_datetime('10-02-2012'))|(pd.to_date
time(stores_holiday_sales['Date']) == pd.to_datetime('08-02-2013'))]

#Stores Sales in Labour Day 
#Labour Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
stores_holiday_sales_labourDay = stores_holiday_sales[(pd.to_datetime(stores_holiday_sales['Date']) =
= pd.to_datetime('10-09-2010')) |(pd.to_datetime(stores_holiday_sales['Date']) == pd.to_datetime('09-
09-2011'))|(pd.to_datetime(stores_holiday_sales['Date']) == pd.to_datetime('07-09-2012'))|(pd.to_date
time(stores_holiday_sales['Date']) == pd.to_datetime('06-09-2013'))]

#Stores Sales in Thanks Giving 
#Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13

stores_holiday_sales_thanksgiving = stores_holiday_sales[(pd.to_datetime(stores_holiday_sales['Date'
]) == pd.to_datetime('26-11-2010')) |(pd.to_datetime(stores_holiday_sales['Date']) == pd.to_datetime(
'25-11-2011'))|(pd.to_datetime(stores_holiday_sales['Date']) == pd.to_datetime('23-11-2012'))|(pd.to_
datetime(stores_holiday_sales['Date']) == pd.to_datetime('29-11-2013'))]

#Stores Sales in Christmas
# Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13

stores_holiday_sales_Christmas = stores_holiday_sales[(pd.to_datetime(stores_holiday_sales['Date']) =
= pd.to_datetime('31-12-2010')) |(pd.to_datetime(stores_holiday_sales['Date']) == pd.to_datetime('30-
12-2011'))|(pd.to_datetime(stores_holiday_sales['Date']) == pd.to_datetime('28-12-2012'))|(pd.to_date
time(stores_holiday_sales['Date']) == pd.to_datetime('27-12-2013'))]
stores_nonholiday_sales_mean = stores_nonholiday_sales.groupby(['Date']).agg({'Weekly_Sales':'mean'})
.reset_index()
stores_holiday_sales_sum = stores_holiday_sales.groupby(['Date']).agg({'Weekly_Sales':'sum'}).reset_
index()
for row in stores_holiday_sales_sum.itertuples():
for row1 in stores_nonholiday_sales_mean.itertuples():
if row.Weekly_Sales > row1.Weekly_Sales:
print("On this Date {} Holiday Sales is greater than Non Holiday Sales and the Sales :-
{}".format(row.Date,row.Weekly_Sales))
break;

print("Super Bowl Day Sale",stores_holiday_sales_superBowl['Weekly_Sales'].sum())
print("Labour Day Sale",stores_holiday_sales_labourDay['Weekly_Sales'].sum())
print("Thanksgiving Day Sale",stores_holiday_sales_thanksgiving['Weekly_Sales'].sum())
print("Christmas Day Sale",stores_holiday_sales_Christmas['Weekly_Sales'].sum())
x_features_object = walmart_data[walmart_data['Store'] ==1][['Store','Date']]
date_obj = walmart_data[walmart_data['Store'] ==1][['Date']]
date_obj.index +=1
x_features_object.Date = date_obj.index
x_features_object.head()
y_target = walmart_data[walmart_data['Store'] ==1]['Weekly_Sales']
y_target.head()
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x_features_object,y_target,random_state=1)
#Linear Regression – Utilize variables like date and restructure dates as 1 for 5 Feb #2010 (starting from the earliest date in order). Hypothesize if CPI, unemployment, and #fuel price have any impact on sales.

from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(x_train,y_train)
feature_dataset = walmart_data[walmart_data['Store'] ==1][['Store','CPI','Unemployment','Fuel_Price'
]]
feature_dataset.head()
response_set_cpi = walmart_data[walmart_data['Store'] ==1]['CPI'].astype('int64')
response_set_unemployment = walmart_data[walmart_data['Store'] ==1]['Unemployment'].astype('int64')
from sklearn.model_selection import train_test_split
x_train_cpi,x_test_cpi,y_train_cpi,y_test_cpi = train_test_split(feature_dataset,response_set_cpi,ran
dom_state=1)
x_train_unemp,x_test_unemp,y_train_unemp,y_test_unemp = train_test_split(feature_dataset,response_set
_unemployment,random_state=1)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(max_iter=10000)
logreg.fit(x_train_cpi,y_train_cpi)
y_pred = logreg.predict(x_test_cpi)
logreg.fit(x_train_unemp,y_train__unemp)
y_pred_unemp = logreg.predict(x_test_unemp)
from sklearn import metrics
print(metrics.accuracy_score(y_test_cpi,y_pred))
print(metrics.accuracy_score(y_test_unemp,y_pred_unemp))

0.7222222222222222

0.9444444444444444

print('cpi actual :', y_test_cpi.values[0:30])
print('cpi Predicted :', y_pred[0:30])
print('actual Unemployment :', y_test_unemp.values[0:30])
print('Predicted Unemployment :', y_pred_unemp[0:30])

cpi actual : [215 221 211 211 221 211 210 211 215 217 221 212 216 218 211 210 211 217

215 211 212 217 221 219 214 211 211 219 215 219]

cpi Predicted : [215 221 211 211 221 211 211 211 215 215 221 211 215 218 211 211 211 217

215 211 211 217 221 221 215 211 211 221 215 220]

actual Unemployment : [7 7 7 8 7 7 7 7 7 7 6 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7]

Predicted Unemployment : [7 7 7 7 6 7 7 7 7 7 6 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7]

walmart_data['Day'] = pd.to_datetime(walmart_data['Date']).dt.day_name()
walmart_data.head()

Loading...