7/5/25, 2:40 PM about:blank
Cheat Sheet: Building Unsupervised Learning Models
Unsupervised learning models
Model Name Brief Description Code Syntax
UMAP
UMAP (Uniform Manifold Approximation and Projection) is used for dimensionality reduction.
Pros: High performance, preserves global structure.
Cons: Sensitive to parameters.
Applications: Data visualization, feature extraction.
Key hyperparameters:
  n_neighbors: Controls the local neighborhood size (default = 15).
  min_dist: Controls the minimum distance between points in the embedded space (default = 0.1).
  n_components: The dimensionality of the embedding (default = 2).
Code Syntax:
  from umap.umap_ import UMAP
  umap = UMAP(n_neighbors=15, min_dist=0.1, n_components=2)
t-SNE
t-SNE (t-Distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction technique.
Pros: Good for visualizing high-dimensional data.
Cons: Computationally expensive; results are sensitive to hyperparameters such as perplexity, so an embedding can suggest structure that is not in the data.
Applications: Data visualization, anomaly detection.
Key hyperparameters:
  n_components: The number of dimensions for the output (default = 2).
  perplexity: Balances attention between local and global aspects of the data (default = 30). Must be smaller than the number of samples.
  learning_rate: Controls the step size during optimization (default = 'auto' since scikit-learn 1.2; 200 in earlier versions).
Code Syntax:
  from sklearn.manifold import TSNE
  tsne = TSNE(n_components=2, perplexity=30, learning_rate=200)
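As a minimal sketch of how the t-SNE hyperparameters above are used in practice, the example below embeds a small synthetic dataset. The data, the variable names (`X_tsne`, `embedding`), and the choice of `perplexity=10` are illustrative assumptions, not part of the original cheat sheet.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical toy data: 60 points in 10 dimensions from two shifted Gaussians.
rng = np.random.default_rng(0)
X_tsne = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(30, 10)),
    rng.normal(loc=6.0, scale=1.0, size=(30, 10)),
])

# perplexity must be smaller than the number of samples (60 here).
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
embedding = tsne.fit_transform(X_tsne)

print(embedding.shape)  # (60, 2)
```

Note that t-SNE has no separate `transform` step: `fit_transform` computes the embedding for the data it is given, so new points cannot be projected afterwards.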
PCA
PCA (principal component analysis) is used for linear dimensionality reduction.
Pros: Easy to interpret, reduces noise.
Cons: Linear, may lose information in nonlinear data.
Applications: Feature extraction, compression.
Key hyperparameters:
  n_components: Number of principal components to retain (default = None, which keeps all components).
  whiten: Whether to rescale the projected components to unit variance (default = False).
  svd_solver: The algorithm used to compute the components (default = 'auto').
Code Syntax:
  from sklearn.decomposition import PCA
  pca = PCA(n_components=2)
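To illustrate what `whiten` changes, here is a small sketch that fits PCA twice on the same synthetic data, once with and once without whitening. The dataset and the names `X_pca`, `Z_plain`, and `Z_white` are assumptions made for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 5 features with very different scales.
X_pca = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

pca_plain = PCA(n_components=2).fit(X_pca)
pca_white = PCA(n_components=2, whiten=True).fit(X_pca)

Z_plain = pca_plain.transform(X_pca)
Z_white = pca_white.transform(X_pca)

# Without whitening, each projected column keeps the variance of its component.
print(Z_plain.var(axis=0, ddof=1))
# With whitening, each projected column is rescaled to unit variance.
print(Z_white.var(axis=0, ddof=1))
```

Whitening discards the relative scale of the components, which can help downstream models that assume features of comparable variance.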
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm.
Pros: Identifies outliers, does not require the number of clusters.
Cons: Difficult with varying density clusters.
Applications: Anomaly detection, spatial data clustering.
Key hyperparameters:
  eps: The maximum distance between two points for them to be considered neighbors (default = 0.5).
  min_samples: Minimum number of samples in a neighborhood, including the point itself, for a point to count as a core point (default = 5).
Code Syntax:
  from sklearn.cluster import DBSCAN
  dbscan = DBSCAN(eps=0.5, min_samples=5)
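The sketch below shows how `eps` and `min_samples` interact on a deliberately simple dataset: two dense grids of points and one isolated point. The data layout and the names `X_db` and `labels` are assumptions chosen so that the outcome is easy to check by hand.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: two dense 5x5 grids (spacing 0.2) plus one isolated point.
offsets = np.arange(5) * 0.2
grid = np.array([(x, y) for x in offsets for y in offsets])
X_db = np.vstack([grid, grid + 5.0, [[20.0, 20.0]]])

dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_db)

# Noise points get the label -1 and are not counted as a cluster.
n_clusters = len(set(labels.tolist()) - {-1})
print(n_clusters)   # 2
print(labels[-1])   # -1: the isolated point is flagged as noise
```

Every grid point has at least five neighbors within 0.5, so each grid forms one cluster; the isolated point has none and is left as noise.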
HDBSCAN
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) improves on DBSCAN by handling varying density clusters.
Pros: Better handling of varying densities.
Cons: Can be slower than DBSCAN.
Applications: Large datasets, complex clustering problems.
Key hyperparameters:
  min_cluster_size: The minimum size of clusters (default = 5).
  min_samples: Number of samples in a neighborhood for a point to count as a core point (default = None, which falls back to min_cluster_size).
Code Syntax:
  import hdbscan
  clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
K-Means clustering
K-Means is a centroid-based clustering algorithm that groups data into k clusters.
Pros: Efficient, simple to implement.
Cons: Sensitive to initial cluster centroids.
Applications: Customer segmentation, pattern recognition.
Key hyperparameters:
  n_clusters: Number of clusters (default = 8).
  init: Method for initializing the centroids ('k-means++' or 'random', default = 'k-means++').
  n_init: Number of times the algorithm will run with different centroid seeds (default = 'auto' since scikit-learn 1.4; 10 in earlier versions).
Code Syntax:
  from sklearn.cluster import KMeans
  kmeans = KMeans(n_clusters=3)
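A short sketch of K-Means on synthetic data follows. The three cluster centers, the names `X_km` and `cluster_labels`, and the fixed `random_state` are assumptions made for illustration; `n_init` is passed explicitly because its default depends on the scikit-learn version.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical toy data: 150 points around three well-separated centers.
X_km, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)

# Run 10 initializations and keep the solution with the lowest inertia.
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_km)

print(kmeans.cluster_centers_.shape)  # (3, 2)
print(kmeans.inertia_)                # within-cluster sum of squared distances
```

Running several initializations is the usual way to reduce the sensitivity to starting centroids listed under Cons.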
Associated functions used
Method Brief Description Code Syntax
make_blobs
Generates isotropic Gaussian blobs for clustering.
Code Syntax:
  from sklearn.datasets import make_blobs
  X, y = make_blobs(n_samples=100, centers=2, random_state=42)
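As a quick check of what `make_blobs` returns, the sketch below inspects the shapes and the label counts. The explicit `n_features=2` is an assumption added for clarity; it matches the function's default.

```python
import numpy as np
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=42)

print(X.shape)                 # (100, 2): one row per sample, one column per feature
print(sorted(set(y.tolist()))) # [0, 1]: integer label of the blob each sample came from
print(np.bincount(y))          # [50 50]: samples are split evenly across centers
```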
multivariate_normal
Generates samples from a multivariate normal distribution.
Code Syntax:
  from numpy.random import multivariate_normal
  samples = multivariate_normal(mean=[0, 0], cov=[[1, 0], [0, 1]], size=100)
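The sketch below draws correlated samples and compares the empirical covariance with the requested one. It uses NumPy's `Generator` interface (`np.random.default_rng`) rather than the legacy module-level function; that choice, the covariance values, and the sample size are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(42)
mean = [0.0, 0.0]
cov = [[1.0, 0.8], [0.8, 1.0]]  # hypothetical covariance with correlated features

# Each row of `samples` is one draw from the 2-D normal distribution.
samples = rng.multivariate_normal(mean, cov, size=1000)

print(samples.shape)                  # (1000, 2)
print(np.cov(samples, rowvar=False))  # empirical covariance, close to `cov`
```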
plotly.express.scatter_3d
Creates a 3D scatter plot using Plotly Express.
Code Syntax:
  import plotly.express as px
  fig = px.scatter_3d(df, x='x', y='y', z='z')
  fig.show()
geopandas.GeoDataFrame
Creates a GeoDataFrame from a Pandas DataFrame.
Code Syntax:
  import geopandas as gpd
  gdf = gpd.GeoDataFrame(df, geometry='geometry')
GeoDataFrame.to_crs
Transforms the coordinate reference system of a GeoDataFrame.
Code Syntax:
  gdf = gdf.to_crs(epsg=3857)
contextily.add_basemap
Adds a basemap to a GeoDataFrame plot for context.
Code Syntax:
  import contextily as ctx
  ax = gdf.plot(figsize=(10, 10))
  ctx.add_basemap(ax)
pca.explained_variance_ratio_
Returns the proportion of variance explained by each principal component.
Code Syntax:
  from sklearn.decomposition import PCA
  pca = PCA(n_components=2)
  pca.fit(X)
  variance_ratio = pca.explained_variance_ratio_
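To make the meaning of the explained variance ratio concrete, the sketch below reproduces it by hand with NumPy: center the data, take the singular values, and divide each component's variance by the total. The synthetic data and the names `X_evr` and `explained_ratio` are assumptions for illustration; the calculation is intended to mirror what a fitted PCA reports when all components are kept.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 samples, 3 features with decreasing spread.
X_evr = rng.normal(size=(100, 3)) * np.array([3.0, 2.0, 1.0])

# Center the data, then take the singular values of the centered matrix.
X_centered = X_evr - X_evr.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)

# Variance along each principal axis, and its share of the total variance.
explained_variance = singular_values**2 / (X_evr.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

print(explained_ratio)        # one value per component, in decreasing order
print(explained_ratio.sum())  # the shares add up to 1 (up to rounding)
```

A common use is to pick the smallest number of components whose cumulative ratio passes a chosen threshold, for example with `np.cumsum(explained_ratio)`.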
Author
Jeff Grossman
Abhishek Gagneja