Data Normalization and Clustering Guide

The document outlines a data normalization and clustering process using the Standard Scaler and Agglomerative Clustering. It details the creation of a synthetic customer dataset, the normalization of features, and the clustering of customers into four groups based on selected features. Additionally, it includes steps for visualizing the data distributions and cluster characteristics through various plots.

Uploaded by

skandapmwork2003

Data Normalization (Standard Scaler)

Before clustering, the features should be normalized, because distance-based methods are otherwise
dominated by whichever feature has the largest numeric range. In this dataset, for example, age
ranges from about 18 to 70, whereas monthly_spending ranges from about 50 to 500.
StandardScaler standardizes each feature by subtracting its mean and dividing by its standard
deviation, so every feature has zero mean and unit variance and contributes comparably to the
distances.
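
As a minimal sketch of what this standardization does, the snippet below applies the formula by hand to a small made-up array (the values are illustrative only, not taken from the dataset) and checks that the result matches StandardScaler. It assumes the population standard deviation (ddof=0), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for illustration: two features on very different scales
# (think age and monthly_spending).
X_toy = np.array([[18.0, 50.0],
                  [35.0, 200.0],
                  [52.0, 350.0],
                  [70.0, 500.0]])

# Manual standardization: subtract each column's mean, then divide by the
# column's population standard deviation (ddof=0).
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# The same transformation through scikit-learn.
scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(manual, scaled))   # both approaches agree
print(scaled.mean(axis=0).round(6))  # column means are (numerically) zero
print(scaled.std(axis=0).round(6))   # column standard deviations are one
```

After scaling, both columns are expressed in standard deviations from their own mean, so neither one dominates a Euclidean distance.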

Clustering Process
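Feature Selection: Only a subset of the scaled features (age, tenure, monthly_spending) is used
for clustering; num_products is kept only for describing the resulting clusters.
Agglomerative Clustering:
Uses linkage='ward', which at each step merges the two clusters whose merger causes the smallest
increase in within-cluster variance.
The n_clusters=4 parameter specifies that the data will be grouped into 4 clusters.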
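
The sketch below illustrates these two settings on random stand-in data (X_demo is hypothetical, not the customer dataset). It fits AgglomerativeClustering with Ward linkage and 4 clusters, then cuts a SciPy Ward tree into 4 groups and compares the two partitions with the adjusted Rand index. With continuous data and no tied merge heights, the two partitions should agree, which is why a SciPy dendrogram is a reasonable picture of what the scikit-learn model does.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Hypothetical stand-in for three already-scaled features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))

# Ward linkage, stopping when 4 clusters remain.
labels_sk = AgglomerativeClustering(n_clusters=4, linkage='ward').fit_predict(X_demo)

# Build the full Ward merge tree with SciPy and cut it into at most 4 clusters.
Z = linkage(X_demo, method='ward')
labels_sp = fcluster(Z, t=4, criterion='maxclust')

# The label numbers differ (0-3 vs 1-4), so compare the partitions,
# not the raw labels.
ari = adjusted_rand_score(labels_sk, labels_sp)
print(len(set(labels_sk)), len(set(labels_sp)), ari)
```

An adjusted Rand index of 1.0 means the two labelings describe the same grouping up to a renaming of the cluster ids.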

Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as sch

# Step 1: Data Aggregation

# 1.1 Create a synthetic customer dataset with random values
np.random.seed(42)
n = 500  # Number of customers

# Features: age, tenure, monthly_spending, number of products
# Note: the upper bound of np.random.randint is exclusive.
data = {
    'age': np.random.randint(18, 70, size=n),  # Age from 18 to 69
    'tenure': np.random.randint(1, 10, size=n),  # Tenure from 1 to 9 years
    # Stored as float so the column can hold NaN in step 1.2
    'monthly_spending': np.random.randint(50, 500, size=n).astype(float),  # 50 to 499
    'num_products': np.random.randint(1, 6, size=n)  # Number of products from 1 to 5
}

df = pd.DataFrame(data)

# 1.2 Introduce missing values in 'monthly_spending' column
df.loc[::10, 'monthly_spending'] = np.nan  # Set every 10th value as NaN

# 1.3 Handle missing values by filling them with the mean of the column
# (assign the result back; fillna(..., inplace=True) on a selected column is deprecated in recent pandas)
df['monthly_spending'] = df['monthly_spending'].fillna(df['monthly_spending'].mean())

# 1.4 Visualize the distribution of key features
sns.set(style="whitegrid")
plt.figure(figsize=(12, 8))

# Plot the distribution of key features
for i, feature in enumerate(['age', 'tenure', 'monthly_spending', 'num_products'], 1):
    plt.subplot(2, 2, i)
    sns.histplot(df[feature], kde=True, color="teal")
    plt.title(f'Distribution of {feature}')
    plt.xlabel(feature)
    plt.ylabel("Frequency")

plt.tight_layout()
plt.show()

# 1.5 Normalize the numerical columns (age, tenure, monthly_spending, num_products) using
# StandardScaler
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Step 2: Clustering Using Hierarchical Clustering
# 2.1 Select features for clustering (scaled data)
X = df_scaled[['age', 'tenure', 'monthly_spending']]

# 2.2 Perform Agglomerative Clustering
agg_clust = AgglomerativeClustering(linkage='ward', n_clusters=4)  # We start with 4 clusters
df['cluster'] = agg_clust.fit_predict(X)

# 2.3 Plot a dendrogram to visualize the hierarchical clustering process
plt.figure(figsize=(10, 7))
sch.dendrogram(sch.linkage(X, method='ward'))
plt.title('Dendrogram of Customer Segments')
plt.xlabel('Customers')
plt.ylabel('Euclidean Distance')
plt.show()

# Step 3: Cluster Evaluation

# 3.1 Analyze the characteristics of each cluster using mean, median, and std deviation
cluster_means = df.groupby('cluster')[['age', 'tenure', 'monthly_spending', 'num_products']].mean()
cluster_medians = df.groupby('cluster')[['age', 'tenure', 'monthly_spending', 'num_products']].median()
cluster_std = df.groupby('cluster')[['age', 'tenure', 'monthly_spending', 'num_products']].std()

# 3.2 Print out the cluster statistics (mean, median, std)
print("Cluster Means:\n", cluster_means)
print("\nCluster Medians:\n", cluster_medians)
print("\nCluster Standard Deviations:\n", cluster_std)

# Step 4: Cluster Profiling

# 4.1 Visualize the clusters using a pairplot, color-coded by cluster labels
sns.pairplot(df[['age', 'tenure', 'monthly_spending', 'num_products', 'cluster']], hue='cluster',
             palette='Set2')
plt.suptitle("Pairplot of Customer Features by Cluster", y=1.02)
plt.show()

# 4.2 Visualize clusters using scatter plots

# Scatter plot of Age vs Monthly Spending
plt.figure(figsize=(10, 6))
sns.scatterplot(x='age', y='monthly_spending', hue='cluster', data=df, palette='Set2', s=100,
                alpha=0.7)
plt.title('Customer Segments: Age vs Monthly Spending')
plt.xlabel('Age')
plt.ylabel('Monthly Spending')
plt.show()

# Scatter plot of Tenure vs Number of Products
plt.figure(figsize=(10, 6))
sns.scatterplot(x='tenure', y='num_products', hue='cluster', data=df, palette='Set2', s=100,
                alpha=0.7)
plt.title('Customer Segments: Tenure vs Number of Products')
plt.xlabel('Tenure (Years)')
plt.ylabel('Number of Products')
plt.show()
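
The per-cluster statistics from Step 3 can also be gathered into a single table. The sketch below is one possible way to do that with groupby and agg; it uses a tiny hypothetical frame (demo) with made-up values and pre-assigned cluster labels so that it runs on its own, and the same call should apply to the df built above.

```python
import pandas as pd

# Hypothetical mini dataset with cluster labels already assigned
# (values are made up for illustration).
demo = pd.DataFrame({
    'cluster': [0, 0, 1, 1, 1],
    'age': [20, 30, 50, 60, 70],
    'monthly_spending': [100.0, 200.0, 300.0, 400.0, 500.0],
})

# One table with mean, median and sample standard deviation per cluster.
summary = demo.groupby('cluster')[['age', 'monthly_spending']].agg(['mean', 'median', 'std'])

# Cluster sizes help judge how much weight each row of the summary deserves.
sizes = demo['cluster'].value_counts().sort_index()

print(summary)
print(sizes)
```

The result has one row per cluster and a two-level column index (feature, statistic), so a single value can be read with, for example, summary.loc[1, ('age', 'median')].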
