Data Normalization (Standard Scaler)
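Before clustering, the features should be put on a common scale. Hierarchical clustering relies on
distances between points, so a feature with a large numeric range would otherwise dominate the result.
For example, age ranges from roughly 18 to 70, whereas monthly_spending ranges from 50 to 500.
StandardScaler is used here to standardize each feature by subtracting its mean and dividing by its
standard deviation, so that every feature has zero mean and unit variance and contributes comparably
to the distance calculation.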
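As a minimal sketch of what this standardization does, the snippet below computes z-scores by hand and compares them with the output of StandardScaler. The toy values and the variable names are hypothetical and serve only to show two features on different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values; columns are age and monthly_spending
X_toy = np.array([
    [18.0, 50.0],
    [35.0, 200.0],
    [52.0, 350.0],
    [70.0, 500.0],
])

# Manual z-score: subtract the column mean, divide by the column standard deviation.
# StandardScaler uses the population standard deviation (ddof=0), which is NumPy's default.
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# The same transformation through scikit-learn
scaled = StandardScaler().fit_transform(X_toy)

# Both routes should agree, and each scaled column should have mean 0 and std 1
print(np.allclose(manual, scaled))
print(np.allclose(scaled.mean(axis=0), 0.0), np.allclose(scaled.std(axis=0), 1.0))
```

After scaling, a difference of one unit means "one standard deviation" for every feature, regardless of its original range.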
Clustering Process
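Feature Selection: Only a subset of the scaled features (age, tenure, monthly_spending) is used for
clustering. num_products is left out of the clustering step and appears only when the clusters are
summarized and plotted.
Agglomerative Clustering:
Uses linkage='ward', which at each step merges the pair of clusters whose merger causes the smallest
increase in within-cluster variance.
The n_clusters=4 parameter specifies that the data will be grouped into 4 clusters.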
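To make the Ward criterion concrete, here is a small sketch of the quantity it compares when deciding which two clusters to merge: the increase in the within-cluster sum of squared errors (SSE). The helper names sse and ward_merge_cost and the random toy clusters are hypothetical; this illustrates the criterion only and is not how scikit-learn implements it internally.

```python
import numpy as np

def sse(points):
    """Sum of squared distances from each point to its cluster centroid."""
    centroid = points.mean(axis=0)
    return float(((points - centroid) ** 2).sum())

def ward_merge_cost(a, b):
    """Increase in total within-cluster SSE caused by merging clusters a and b."""
    n_a, n_b = len(a), len(b)
    gap = a.mean(axis=0) - b.mean(axis=0)
    return float(n_a * n_b / (n_a + n_b) * (gap ** 2).sum())

# Hypothetical 2-D clusters: a and b are close together, c is far away
rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, size=(20, 2))
b = rng.normal(loc=0.5, size=(30, 2))
c = rng.normal(loc=5.0, size=(25, 2))

# The closed-form cost equals the SSE of the merged cluster minus the two separate SSEs
direct = sse(np.vstack([a, b])) - sse(a) - sse(b)
print(np.isclose(direct, ward_merge_cost(a, b)))

# Merging the nearby pair (a, b) costs less than merging the distant pair (a, c),
# so a Ward-style procedure would pick (a, b) first
print(ward_merge_cost(a, b) < ward_merge_cost(a, c))
```

Because the cost grows with both the distance between centroids and the cluster sizes, Ward linkage tends to produce compact clusters of similar size.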
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as sch
# Step 1: Data Aggregation
# 1.1 Create a synthetic customer dataset with random values
np.random.seed(42)
n = 500  # Number of customers
# Features: age, tenure, monthly_spending, number of products
# np.random.randint is assumed for the sampling calls; its upper bound is exclusive
data = {
    'age': np.random.randint(18, 70, size=n),  # Age from 18 to 69
    'tenure': np.random.randint(1, 10, size=n),  # Tenure from 1 to 9 years
    # Monthly spending from 50 to 499, stored as float so the column can hold NaN
    'monthly_spending': np.random.randint(50, 500, size=n).astype(float),
    'num_products': np.random.randint(1, 6, size=n)  # Number of products from 1 to 5
}
df = pd.DataFrame(data)
# 1.2 Introduce missing values in 'monthly_spending' column
df.loc[::10, 'monthly_spending'] = np.nan  # Set every 10th value as NaN
# 1.3 Handle missing values by filling them with the mean of the column
# (assign the result back; inplace=True on a selected column is deprecated in recent pandas)
df['monthly_spending'] = df['monthly_spending'].fillna(df['monthly_spending'].mean())
# 1.4 Visualize the distribution of key features
sns.set(style="whitegrid")
plt.figure(figsize=(12, 8))
# Plot the distribution of key features
for i, feature in enumerate(['age', 'tenure', 'monthly_spending', 'num_products'], 1):
    plt.subplot(2, 2, i)
    sns.histplot(df[feature], kde=True, color="teal")
    plt.title(f'Distribution of {feature}')
    plt.xlabel(feature)
    plt.ylabel("Frequency")
plt.tight_layout()
plt.show()
# 1.5 Normalize the numerical columns (age, tenure, monthly_spending, num_products) using
# StandardScaler
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
# Step 2: Clustering Using Hierarchical Clustering
# 2.1 Select features for clustering (scaled data)
X = df_scaled[['age', 'tenure', 'monthly_spending']]
# 2.2 Perform Agglomerative Clustering
agg_clust = AgglomerativeClustering(linkage='ward', n_clusters=4) # We start with 4 clusters
df['cluster'] = agg_clust.fit_predict(X)
# 2.3 Plot a dendrogram to visualize the hierarchical clustering process
plt.figure(figsize=(10, 7))
sch.dendrogram(sch.linkage(X, method='ward'))
plt.title('Dendrogram of Customer Segments')
plt.xlabel('Customers')
plt.ylabel('Euclidean Distance')
plt.show()
# Step 3: Cluster Evaluation
# 3.1 Analyze the characteristics of each cluster using mean, median, and std deviation
cluster_means = df.groupby('cluster')[['age', 'tenure', 'monthly_spending', 'num_products']].mean()
cluster_medians = df.groupby('cluster')[['age', 'tenure', 'monthly_spending',
                                         'num_products']].median()
cluster_std = df.groupby('cluster')[['age', 'tenure', 'monthly_spending', 'num_products']].std()
# 3.2 Print out the cluster statistics (mean, median, std)
print("Cluster Means:\n", cluster_means)
print("\nCluster Medians:\n", cluster_medians)
print("\nCluster Standard Deviations:\n", cluster_std)
# Step 4: Cluster Profiling
# 4.1 Visualize the clusters using a pairplot, color-coded by cluster labels
sns.pairplot(df[['age', 'tenure', 'monthly_spending', 'num_products', 'cluster']], hue='cluster',
             palette='Set2')
plt.suptitle("Pairplot of Customer Features by Cluster", y=1.02)
plt.show()
# 4.2 Visualize clusters using scatter plots
# Scatter plot of Age vs Monthly Spending
plt.figure(figsize=(10, 6))
sns.scatterplot(x='age', y='monthly_spending', hue='cluster', data=df, palette='Set2', s=100,
                alpha=0.7)
plt.title('Customer Segments: Age vs Monthly Spending')
plt.xlabel('Age')
plt.ylabel('Monthly Spending')
plt.show()
# Scatter plot of Tenure vs Number of Products
plt.figure(figsize=(10, 6))
sns.scatterplot(x='tenure', y='num_products', hue='cluster', data=df, palette='Set2', s=100,
                alpha=0.7)
plt.title('Customer Segments: Tenure vs Number of Products')
plt.xlabel('Tenure (Years)')
plt.ylabel('Number of Products')
plt.show()
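The dendrogram and the n_clusters=4 setting describe the same tree: choosing 4 clusters amounts to cutting the Ward dendrogram at the height that leaves 4 branches. The sketch below illustrates this with SciPy's fcluster on a hypothetical stand-in for the scaled feature matrix (X_demo and its random values are assumptions, not the dataset above). On data like this, the two partitions are expected to agree up to label numbering.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the scaled features (age, tenure, monthly_spending)
rng = np.random.default_rng(42)
raw = np.column_stack([
    rng.uniform(18, 70, size=200),   # age
    rng.uniform(1, 10, size=200),    # tenure
    rng.uniform(50, 500, size=200),  # monthly_spending
])
X_demo = StandardScaler().fit_transform(raw)

# Build the Ward tree once, then cut it so that at most 4 clusters remain
Z = linkage(X_demo, method='ward')
tree_labels = fcluster(Z, t=4, criterion='maxclust')  # labels run from 1 to 4

# The same grouping requested directly from scikit-learn (labels run from 0 to 3)
sk_labels = AgglomerativeClustering(n_clusters=4, linkage='ward').fit_predict(X_demo)

# The adjusted Rand index ignores label numbering; 1.0 means identical partitions
n_found = len(np.unique(tree_labels))
agreement = adjusted_rand_score(tree_labels, sk_labels)
print("clusters from tree cut:", n_found)
print("agreement with AgglomerativeClustering:", agreement)
```

Changing t in fcluster is a cheap way to compare different numbers of clusters without refitting, since the linkage matrix Z is computed only once.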