Customer Segmentation Using Machine Learning Clustering Algorithms: K-Means
1. Dataset
2. The Model
2.1 Choosing the Right Clusters Using the Elbow Method and the Silhouette Score
2.2 Building the K-Means Model
2.3 Model Evaluation
3. Summary
This project is designed to segment customers based on their spending using K-Means, an unsupervised clustering machine learning algorithm. This kind of analysis is used when we do not have labelled data, or when we want to group data based on certain characteristics. In a business context, we can group customers by their spending patterns and then extract underlying characteristics from those groups, such as age.
The analysis is carried out on data from an anonymous UK e-commerce company. We find 6 distinct clusters based on customers' total spending and transaction frequency.
1. Dataset
There are 541,909 transactions in total, so the data requires some cleaning. First, we drop the unnecessary columns.
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
raw_df = pd.read_csv('data/data.csv', encoding='unicode_escape')
raw_df.drop(['StockCode', 'InvoiceDate', 'Description', 'Country'], axis=1, inplace=True)
There are 135,080 transactions with a missing CustomerID. We drop these rows, as we assume we do not have the information about these customers (age, occupation, gender, etc.) needed to analyse the relevant customer features of each cluster. We also drop transactions with negative quantities; these are likely returns, which we do not analyse because we are only interested in customers' spending.
raw_df["CustomerID"].isna().sum()
df = raw_df.loc[raw_df["Quantity"] >0 ]
df = df.loc[df["UnitPrice"] >0 ]
df["Total"]=df["Quantity"]*df["UnitPrice"]
df.drop(['Quantity', 'UnitPrice'],axis = 1, inplace =True)
df.dropna(axis = 0, inplace=True)
We now compute the total spend and the transaction frequency of each unique customer.
Amount = df.groupby('CustomerID')['Total'].sum()
Amount = Amount.reset_index()
Amount.columns = ['CustomerID', 'Amount']
Frequency = df.groupby('CustomerID')['InvoiceNo'].count()
Frequency = Frequency.reset_index()
Frequency.columns = ['CustomerID', 'Frequency']
df1 = pd.merge(Amount, Frequency, on='CustomerID', how='inner')
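For reference, the same result can be obtained with a single groupby call using named aggregation; this is only a sketch (the name df1_alt is illustrative) and should reproduce df1.
# Equivalent sketch: both aggregations in one groupby call
df1_alt = (df.groupby('CustomerID')
             .agg(Amount=('Total', 'sum'), Frequency=('InvoiceNo', 'count'))
             .reset_index())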
Checking for outliers using box plots:
fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(6,6))
fig.suptitle('Outliers\n', size = 25)
sns.boxplot(ax=axes[0], data=df1['Amount'], palette='Spectral').set_title("Amount")
sns.boxplot(ax=axes[1], data=df1['Frequency'], palette='Spectral').set_title("Frequency")
plt.tight_layout()
There are significant outliers that will skew the results if we do not remove them. Using Isolation Forest, we can identify and remove them.
from sklearn.ensemble import IsolationForest
df2 = df1.copy()
model = IsolationForest(n_estimators=150, max_samples='auto', contamination=0.1, max_features=1.0)
model.fit(df2[['Amount', 'Frequency']])
scores = model.decision_function(df2[['Amount', 'Frequency']])
anomaly = model.predict(df2[['Amount', 'Frequency']])
df2['scores'] = scores
df2['anomaly'] = anomaly   # -1 marks an outlier, 1 an inlier
anomaly = df2.loc[df2['anomaly'] == -1]
anomaly_index = list(anomaly.index)
print('Total number of outliers is:', len(anomaly))
Total number of outliers is: 434
df2 = df2.drop(anomaly_index, axis=0).reset_index(drop=True)
df2.drop(['scores', 'anomaly'], axis=1, inplace=True)
Checking the new dataframe with the same box plots, we see a considerable improvement.
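A sketch of that check, reusing the earlier box-plot code on the cleaned df2:
# Re-draw the box plots on the cleaned data
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(6, 6))
fig.suptitle('After outlier removal\n', size=25)
sns.boxplot(ax=axes[0], data=df2['Amount']).set_title("Amount")
sns.boxplot(ax=axes[1], data=df2['Frequency']).set_title("Frequency")
plt.tight_layout()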
We need to rescale the data so that the K-Means algorithm can cluster it more effectively.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df3 = scaler.fit_transform(df2[['Amount', 'Frequency']])
StandardScaler standardises each feature using the z-score, z = (x - mean) / std, so that both features are on a comparable scale; without this, the feature with the much larger magnitude (Amount) would dominate the distance calculations that K-Means relies on. We keep the scaled values in a new array, df3, because we want to preserve CustomerID in df2, as we will see later. After a quick sanity check of the scaling below, we are ready to build the model.
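This sketch (not part of the pipeline; the names features and manual_z are illustrative) reproduces df3 with the z-score formula directly; note that StandardScaler uses the population standard deviation (ddof=0).
features = df2[['Amount', 'Frequency']]
manual_z = (features - features.mean()) / features.std(ddof=0)   # z = (x - mean) / std
print(np.allclose(df3, manual_z.values))                         # expect True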
2. The Model
2.1 Choosing the Right Clusters Using the Elbow Method and the Silhouette Score
Before building an unsupervised clustering model, we must first specify how many clusters the algorithm should group the data into. There are several metrics that can be used to assess how many clusters there should be. We will use the elbow method, which plots the number of clusters on the x-axis against the inertia on the y-axis, and the silhouette score, which measures how well each point fits its own cluster compared to the other clusters.
The inertia is the sum of squared distances between each point and its cluster center, also known as the Within-Cluster Sum of Squares (WCSS). The lower the inertia, the closer each point is to its center, meaning the clusters are tightly grouped. The more clusters there are, however, the smaller the marginal benefit of each additional cluster, and the plot takes the shape of an elbow.
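To make the metric concrete, here is a minimal sketch of what scikit-learn reports as inertia_ (the helper name compute_wcss is illustrative only):
def compute_wcss(X, labels, centers):
    # Within-Cluster Sum of Squares: squared distance of every point
    # to the center of the cluster it is assigned to
    return sum(np.sum((X[labels == j] - c) ** 2) for j, c in enumerate(centers))
# After fitting a model below, compute_wcss(df3, kmeans.labels_, kmeans.cluster_centers_)
# should match kmeans.inertia_ up to floating-point rounding.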
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(df3)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
We can see that at around 4 to 7 clusters the WCSS starts to plateau, so the optimal number of clusters lies somewhere in this range. We can analyse the candidates further using the silhouette score. Essentially, it measures how similar a point is to its own cluster compared to the other clusters. It ranges from -1 to 1, where 1 is the best score: the higher the score, the closer a point is to its own cluster and the further it is from the other clusters, which indicates how well the clustering separates the data points.
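To illustrate the definition, the silhouette of a single point is s = (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the points in the nearest other cluster. A tiny sketch on made-up 1-D data (X_toy and labels_toy are illustrative, not the project's data):
from sklearn.metrics import silhouette_samples
X_toy = np.array([[0.0], [0.2], [5.0], [5.2]])    # two tight, well-separated clusters
labels_toy = np.array([0, 0, 1, 1])
a = abs(X_toy[0, 0] - X_toy[1, 0])                # mean distance within its own cluster: 0.2
b = np.mean(np.abs(X_toy[0, 0] - X_toy[2:, 0]))   # mean distance to the other cluster: 5.1
print((b - a) / max(a, b))                        # ~0.961
print(silhouette_samples(X_toy, labels_toy)[0])   # same value from scikit-learn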
Note that this is different from the WCSS, which only measures how compact each cluster is, not how well the clusters are separated from one another. We can run a for loop to assess 2 to 8 clusters.
from sklearn.metrics import silhouette_score
for i in range(2, 9):
    km = KMeans(n_clusters=i, random_state=42)
    km.fit_predict(df3)
    score = silhouette_score(df3, km.labels_, metric='euclidean')
    print(f"{i} clusters Silhouette Average Score: {score:.3f}")
2 clusters Silhouette Average Score: 0.629
3 clusters Silhouette Average Score: 0.531
4 clusters Silhouette Average Score: 0.512
5 clusters Silhouette Average Score: 0.477
6 clusters Silhouette Average Score: 0.476
7 clusters Silhouette Average Score: 0.431
8 clusters Silhouette Average Score: 0.415
Note that the silhouette score keeps decreasing as we add clusters here, so we need to evaluate it alongside the elbow method. The score drops noticeably from 6 to 7 clusters, but only marginally from 5 to 6, while the WCSS still improves meaningfully from 4 up to 6 clusters. We will therefore use 6 clusters for our model.
2.2 Building the K-Means Model
The K-Means algorithm starts by choosing random data points as the initial cluster centers. It then assigns every point to its nearest center and calculates how far each point in a cluster is from that center. This process repeats, simultaneously moving each center to the mean of the points assigned to it, until the distance between each point and its cluster center is minimised for the given number of clusters. K-Means++ is a variation of this algorithm that spreads the initial centers far apart from one another, which typically speeds up convergence and improves the result.
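For intuition only, here is a minimal sketch of that assign-and-update loop (Lloyd's algorithm) in NumPy; the project itself uses scikit-learn's KMeans, and this toy helper (lloyd_kmeans is an illustrative name) assumes no cluster ever ends up empty.
def lloyd_kmeans(X, k, n_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                 # converged
            break
        centers = new_centers
    return labels, centers
# e.g. labels, centers = lloyd_kmeans(df3, k=6)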
df_kmeans = df3.copy()                           # the scaled features
kmeans = KMeans(n_clusters=6, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(df_kmeans)
df_kmeans = df2.copy()                           # switch back to the unscaled data, which keeps CustomerID
df_kmeans['Cluster'] = y_kmeans
df_kmeans['Cluster'].value_counts()
We can then plot the clusters to see how well the model has grouped the data.
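A sketch of such a plot, using the unscaled Amount and Frequency columns of df_kmeans:
plt.figure(figsize=(8, 6))
plt.scatter(df_kmeans['Amount'], df_kmeans['Frequency'],
            c=df_kmeans['Cluster'], cmap='Spectral', s=15)   # colour each point by its cluster
plt.xlabel('Amount')
plt.ylabel('Frequency')
plt.title('Customer segments by cluster')
plt.show()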
2.3 Model Evaluation
We can see that the model has done reasonably well in grouping the data points. Note, though, that there are only two features in this model, which makes the visualisation relatively simple. We can check how many members each cluster has:
df2['group']=y_kmeans
df2['group'].value_counts()
0 1957
3 840
5 429
1 317
2 211
4 150
Next, we can filter the dataframe by cluster and analyse each group based on its CustomerIDs:
cluster_0 = df2[df2["group"] == 0]
Say we want to analyse the cluster with the highest spending, in this case cluster 2:
cluster_2 = df2[df2["group"] == 2]
pd.set_option('display.float_format', lambda x: '%.2f' % x)
cluster_2.describe()
We could then analyse features that are relevant to this cluster based on the customer IDs. Unfortunately, no such data is available for this dataset, although the analysis would be straightforward. For example, we might find an average age that is particular to this cluster, such as people aged 40-60 who have been exposed to a certain marketing approach.
If we then want to convert low spenders aged 40-60, i.e. those in cluster 0, we can focus that marketing approach on them and assess whether we can turn them into high spenders.
We can also test different numbers of clusters to see whether other patterns emerge. With more clusters, we may be able to narrow down a range of features, such as age and marketing strategy, and gain a deeper understanding of the customer base.
3. Summary
We have used the K-Means algorithm to analyse customers' spending habits. Six distinct clusters have been identified, which can be used to check whether the customers in each cluster share certain apparent features. By analysing these clusters, we can group customers based on those features and gain a better understanding of the different customer segments.