
DS ML Machine Learning I

This document is a refresher guide for Data Science and Machine Learning interviews, covering essential mathematical and technical concepts across various algorithms. It includes topics such as data types, exploratory data analysis, data preprocessing, feature engineering, and different machine learning algorithms. The material is designed for individuals with a foundational knowledge in the field and serves as a revision tool for key concepts tested in interviews.


Machine Learning: Data Science and ML Refresher

Shubhankar Agrawal

Abstract
This document serves as a quick refresher for Data Science and Machine Learning interviews. It covers mathematical and technical concepts
across a range of algorithms. It assumes foundational knowledge of the field, typically from tertiary education. It
contains material for revising the key concepts that are tested in interviews.

Contents

1 Machine Learning
  1.1 Key Concepts
2 Exploratory Data Analysis
  2.1 General Statistics
  2.2 Univariate Analyses
      Target Variable Distribution • Feature Distributions
  2.3 Bi-variate Analyses
  2.4 Multi-variate Analyses
3 Data Pre-Processing
  3.1 Data Cleaning
      Missing Values • Outliers • Erroneous Values • Others
  3.2 Feature Scaling
      Standardization • Normalization • Box Cox Transform
  3.3 Target Scaling
4 Feature Engineering
  4.1 Extraction
      Text • Pixels • Temporal
  4.2 Transformation
      Temporal • Complexity
  4.3 Encoding
  4.4 Sampling
  4.5 Selection
      Iterative • Model Based • Statistical Tests
  4.6 Dimensionality Reduction
5 Algorithms
  5.1 Regression
      OLS • Gradient Descent • Metrics • Regularization • Assumptions
  5.2 Classification
      Logistic Regression • Naive Bayes • Metrics • Assumptions
  5.3 More Methods
      K Nearest Neighbours • Support Vector Machines • Decision Trees
  5.4 Ensemble Methods
      Bagging • Boosting • Stacking
  5.5 Clustering
      K-Means • DB Scan • GMM • Metrics
6 More Techniques
  6.1 Anomaly Detection
  6.2 Reinforcement Learning
      Policy Learning • Q-Learning • Exploration Exploitation
7 Nuances
  7.1 Imbalanced Data
  7.2 Biases
  7.3 More Data Unhelpful

1. Machine Learning

Table 1. Data Types
Data     | Type        | Description
Nominal  | Categorical | No order
Ordinal  | Categorical | Ordered
Interval | Numerical   | Can be negative
Ratio    | Numerical   | Has a defined 0

1.1. Key Concepts
Data Splits:
Train: The model learns from it.
Valid: Tune hyper-parameters, prevent over-fitting.
Test: Unseen data, evaluate performance.

Table 2. Bias Variance Trade-Off
            | Bias       | Variance
What?       | Error      | Prediction Variability
Complexity  | Too Simple | Too Complex
Fitting     | Under      | Over
Train Error | High       | Low
Test Error  | High       | High
Formula     | $\mathrm{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta$ | $\mathrm{Var}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2]$

Types of Learning:
Supervised: Labelled data
Unsupervised: Unlabelled data
Reinforcement: Learn with feedback from environment

Other terms:
Parameters: Weights the model learns
Hyper-parameters: Settings chosen before training to adjust performance (not learned from data)
Cross Validation: Expose all data (K Fold, LOOCV, Temporal)

Figure 1. Bias Variance Trade-off [2]
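The K Fold idea listed under cross validation can be sketched in a few lines. This is a minimal illustration, not a library implementation: the helper name `k_fold_indices` is hypothetical, rows are assumed to be exchangeable (no shuffling, no temporal ordering), and LOOCV is simply the case where `k` equals the number of samples.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs so every sample is validated exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, valid_idx in folds:
    print("train:", train_idx, "valid:", valid_idx)
```

For temporal data this sketch would leak future rows into training; a forward-chaining split is the usual alternative.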


2. Exploratory Data Analysis
2.1. General Statistics
Pandas operations to analyse dataframes:
info: Information on null values, data types, and memory
describe: Descriptive statistics such as mean, median and quartiles (IQR)
value_counts: Counts for categorical columns

2.2. Univariate Analyses
2.2.1. Target Variable Distribution
Continuous: Plot distribution, identify outliers
Discrete: Value counts to check for imbalanced data

2.2.2. Feature Distributions
Continuous: Box plots, Histograms
Discrete: Bar charts of value counts

2.3. Bi-variate Analyses

Table 3. Bi-variate Analyses
Comparing  | Compared   | Plots
Continuous | Continuous | Scatter
Continuous | Discrete   | Violin, Bar, Line
Discrete   | Discrete   | Grouped Bars

Pair plots from Seaborn can plot different types of features against each other in a single graphic.

Figure 2. Pair Plots [4]

2.4. Multi-variate Analyses
• Heat maps
• Violin Plots (Binary categories)
• Scatter Plots (with Sizes)
• Correlation matrices

3. Data Pre-Processing
3.1. Data Cleaning
NOTE: Use median instead of mean when the data is skewed.

3.1.1. Missing Values
Remove: Drop rows if there are not too many (especially when the Target is missing)
Impute: Fill values
Interpolate: Estimate the line
Ways to fill values:
• Fill with Mean / Median by groups (or) nearest neighbours
• Analyse temporal patterns to fill by dates
• Fit a regression line to fill missing values

3.1.2. Outliers
Identify by Box Plots, Anomaly Detection
• Remove if not too many.
• Winsorize (clip by IQR)

3.1.3. Erroneous Values
Some values might be errors and can be fixed
• Amounts scaled by 10 / 100
• Negative values in a non-negative space
• Invalid values (such as co-ordinates)

3.1.4. Others
• Drop duplicates
• Fix entities (NYC vs New York City)
• Date formats

3.2. Feature Scaling
3.2.1. Standardization
• Calculate Z Scores
• Useful to bring features to similar distributions
• Less sensitive to outliers than MinMax scaling
• Better for SVM, Linear Regression
Can also use IQR to scale more robustly.

3.2.2. Normalization
• MinMax Scaling
• Good for data without outliers
• Better for KNN, Neural Networks

3.2.3. Box Cox Transform
Log (or) Power transform maximizing normality

$$\max_{\lambda}\ \ell(\lambda) = -\frac{n}{2}\log(\sigma^2) + (\lambda - 1)\sum_{i=1}^{n}\log(y_i) \tag{1}$$

Table 4. Feature Scaling
Method            | Formula
Standardization   | $x_{\text{std}} = \dfrac{x - \mu}{\sigma}$
Normalization     | $x_{\text{norm}} = \dfrac{x - \min(x)}{\max(x) - \min(x)}$
Box Cox Transform | $y(\lambda) = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(x) & \text{if } \lambda = 0 \end{cases}$

3.3. Target Scaling
Log Scale: When extremely skewed, but reduces interpretability (with co-efficients)

4. Feature Engineering
4.1. Extraction
4.1.1. Text
• TF-IDF scores
• Sentence Embeddings
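As a rough illustration of TF-IDF scores, the sketch below uses the plain textbook form: term frequency times log(N / document frequency). The function name `tf_idf`, whitespace tokenization, and the toy documents are assumptions made for the example. Libraries such as scikit-learn apply smoothed variants by default, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = ["the cat sat", "the dog sat", "the cat ran fast"]  # toy corpus
scores = tf_idf(docs)
print(scores[2])
```

A term that appears in every document ("the") scores zero, while a term unique to one document ("fast") scores highest within it.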


4.1.2. Pixels
• Intensity
• Hue
• Brightness

4.1.3. Temporal
• Year, Month, Day
• Part of day, Weekday
• Fourier Transforms (sin, cos) for periodicity

4.2. Transformation
4.2.1. Temporal
Add temporal changes and history ($x'$ denotes the new feature)

$$\begin{aligned}
\text{Lags:}\quad x'_t &= x_{t-7} \\
\text{Rolling Mean:}\quad x'_t &= (x_{t-2} + x_{t-3} + x_{t-4})/3 \\
\text{Difference:}\quad x'_t &= x_t - x_{t-1}
\end{aligned} \tag{2}$$

Differencing also brings stationarity

4.2.2. Complexity
Introduce non-linearity

$$\begin{aligned}
\text{Polynomial:}\quad x'_i &= x_i^2 + x_i^3 \\
\text{Interaction:}\quad x'_{ij} &= x_i x_j + x_i x_j^2
\end{aligned} \tag{3}$$

4.3. Encoding
Converting categorical to numerical variables. Some models can handle categorical data directly, in which case encoding is not necessary.

Table 5. Encoding
Type    | How           | Cardinality | Columns
One Hot | 0/1 Dummies   | Low         | # distinct values
Ordinal | Order + Scale | Any         | -
Target  | Group Mean    | High        | -

NOTE: Encodings should be fitted using train data only
• Identify data types
• Generate lags, averages, aggregate features
• Feature Pre-processing

4.4. Sampling
Usually performed when data is imbalanced.
Downsample: Reduce instances of majority class
Upsample: Increase instances of minority class
Upsampling involves creating instances:
• SMOTE - Interpolate points in space
• Variational Auto Encoder - Learn Distribution

4.5. Selection
4.5.1. Iterative
Sequentially add or remove features one by one to optimize performance

Table 6. Iterative Feature Selection
Step  | Forward        | Backward
Start | 0 features     | All features
Step  | Feature to add | Feature to remove

4.5.2. Model Based
Use model capabilities to identify important features

Table 7. Model based Feature Selection
Model         | Identification
Lasso (L1)    | 0 co-efficients - Remove
Random Forest | Feature Importances - Descending

4.5.3. Statistical Tests
Statistical tests can provide comparison of significance by looking at:
• Test Statistic (direction of influence)
• P-value (acceptance or rejection of the null hypothesis)

Table 8. Statistical Test
Feature    | Target     | Test
Continuous | Continuous | T-Test / Z-Test
Continuous | Discrete   | ANOVA (F-Test)
Discrete   | Continuous | ANOVA (F-Test)
Discrete   | Discrete   | Chi Square

Multicollinearity can also be removed
• Correlation matrix (Identify highly correlated features)
• Variance Inflation Factor (VIF > 5 or 10)
VIF is calculated by regressing each feature on the other features

$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2} \tag{4}$$

4.6. Dimensionality Reduction
Principal Components Analysis:
• Standardize data (Z Score)
• Compute covariance matrix
• Eigenvalue Decomposition

Linear Discriminant Analysis:
• Group by classes
• SSW (Sum of Squares Within classes)
• SSB (Sum of Squares Between classes) [Multiply # class samples]
• SSB / SSW
• Eigenvalue Decomposition

Table 9. Dimensionality Reduction
          | PCA             | LDA      | t-SNE     | UMAP
Labels?   | No              | Yes      | No        | Both
Linear    | Yes             | Yes      | No        | No
Preserves | Global          | Classes  | Local     | Global
Best For  | Linear Patterns | Classify | Visualize | Clustering Embeddings
Issues    | Non-linear      | Labels   | Local     | Tuning
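The three PCA steps (standardize, covariance matrix, eigenvalue decomposition) can be sketched without external libraries. As a simplifying assumption, the example below recovers only the leading component, using power iteration in place of a full eigen-decomposition. All function names and the toy data are illustrative.

```python
import math


def standardize(columns):
    """Z-score each feature column (population standard deviation)."""
    result = []
    for col in columns:
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        result.append([(v - mean) / std for v in col])
    return result


def covariance_matrix(columns):
    """Covariance matrix of columns that are already centred."""
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in columns] for ci in columns]


def first_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue via power iteration.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    vec = [1.0] + [0.0] * (len(cov) - 1)
    for _ in range(iterations):
        product = [sum(row[j] * vec[j] for j in range(len(vec))) for row in cov]
        norm = math.sqrt(sum(p * p for p in product))
        vec = [p / norm for p in product]
    # The Rayleigh quotient gives the eigenvalue, i.e. the variance along the component.
    cov_vec = [sum(row[j] * vec[j] for j in range(len(vec))) for row in cov]
    eigenvalue = sum(v * cv for v, cv in zip(vec, cov_vec))
    return vec, eigenvalue


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 4.0, 6.0, 8.0, 10.0]   # perfectly correlated with x1
z = standardize([x1, x2])
cov = covariance_matrix(z)
component, variance = first_component(cov)
explained_ratio = variance / sum(cov[i][i] for i in range(len(cov)))
print(component, explained_ratio)
```

Because the two toy features are perfectly correlated, one component carries essentially all of the variance, which is the situation in which PCA compresses best.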


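5. Algorithms
Examples of different types:
Supervised: Most models with labelled data
Unsupervised: Clustering, Variational Auto-Encoders (VAE)
Parametric: Most models with weights to learn
Non-Parametric: K Nearest Neighbours, Decision Trees
Discriminative: Most models that predict
Generative: Naive Bayes, Latent Dirichlet Allocation, VAEs

5.1. Regression
Predicting a continuous variable.

$$y = X\beta + \varepsilon \tag{5}$$

Interpretation: A change in X by 1 unit, increases / decreases y by $\beta$ units.

5.1.1. OLS
Ordinary Least Squares, closed form

$$\begin{aligned}
\hat{\beta} &= \arg\min_{\beta}\ (y - X\beta)'(y - X\beta) \\
\hat{\beta} &= (X'X)^{-1}X'y
\end{aligned} \tag{6}$$

5.1.2. Gradient Descent
Iterative Solution

Table 10. Gradient Descent Terminology
Variable           | Symbol   | Description
Loss/Cost Function | $J$      | Penalizes predictions
Learning Rate      | $\alpha$ | Learning step size

$$\begin{aligned}
\text{MSE:}\quad J(\beta) &= \frac{1}{2n}\sum_{i=1}^{n}(y_i - X_i\beta)^2 \\
\beta^{(t+1)} &= \beta^{(t)} - \alpha\nabla J(\beta^{(t)}) \\
\nabla J(\beta) &= -\frac{1}{n}X'(y - X\beta) \\
\beta^{(t+1)} &= \beta^{(t)} + \alpha\cdot\frac{1}{n}X'(y - X\beta^{(t)})
\end{aligned} \tag{7}$$

Standard Error

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - k - 1} \tag{8}$$

Confidence Interval (Co-efficients)

$$\begin{aligned}
\text{Conf Int}_{\beta_j} &= \hat{\beta}_j \pm t_{\alpha/2,\,n-k-1}\cdot \mathrm{SE}(\hat{\beta}_j) \\
\mathrm{SE}(\hat{\beta}_j) &= \sqrt{\hat{\sigma}^2\,\big[(\mathbf{X}^\top\mathbf{X})^{-1}\big]_{jj}}
\end{aligned} \tag{9}$$

Confidence Interval
Range of mean predicted value

$$\begin{aligned}
\text{Conf Int} &= \hat{y}_* \pm t_{\alpha/2,\,n-k-1}\cdot \mathrm{SE}(\hat{y}_*) \\
\mathrm{SE}(\hat{y}_*) &= \sqrt{\hat{\sigma}^2\cdot\big(\mathbf{x}_*^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_*\big)}
\end{aligned} \tag{10}$$

Prediction Interval
Range of newly predicted value, includes observation noise.

$$\begin{aligned}
\text{Pred Int} &= \hat{y}_* \pm t_{\alpha/2,\,n-k-1}\cdot \mathrm{SE}(\hat{y}_*) \\
\mathrm{SE}(\hat{y}_*) &= \sqrt{\hat{\sigma}^2\cdot\big(1 + \mathbf{x}_*^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_*\big)}
\end{aligned} \tag{11}$$

5.1.3. Metrics

$$\begin{aligned}
\text{MAE} &= \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i| \\
\text{MAPE} &= \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|\cdot 100 \\
\text{MSE} &= \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \\
\text{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \\
R^2 &= 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \\
R^2_{\text{adj}} &= 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}
\end{aligned} \tag{12}$$

Notes:
• $R^2$ increases with variables; use adjusted $R^2$
• Use RMSE instead of MSE to stay in the same scale

5.1.4. Regularization
Control co-efficients, preventing them from getting too large

Table 11. Regression Regularization
Type        | Term                                                          | Gradient
Lasso (L1)  | $\lambda\sum|\beta_j|$                                        | $\lambda\cdot\mathrm{sign}(\beta_j)$
Ridge (L2)  | $\frac{\lambda}{2}\sum\beta_j^2$                              | $\lambda\beta_j$
Elastic Net | $\frac{\lambda}{2}\sum\beta_j^2 + \lambda_1\sum|\beta_j|$     | $\lambda\beta_j + \lambda_1\cdot\mathrm{sign}(\beta_j)$

5.1.5. Assumptions
• Data follows linear relationship
• Errors are normally distributed
• Errors are homo-skedastic (Constant variance)
• No multicollinearity of features
• No auto-correlation of errors
Fixes: Log transforms, outlier removals

5.2. Classification
Predicting a discrete variable, either binary or multi-class.

5.2.1. Logistic Regression
Fits a Linear Regression to the log odds
Terminology

$$\begin{aligned}
\text{Odds} &= \frac{p}{1 - p} \\
\text{Log-Odds} &= \ln\left(\frac{p}{1 - p}\right) = \beta_0 + \dots + \beta_k x_k \\
\text{Odds Ratio (OR)} &= e^{\beta_j}
\end{aligned} \tag{13}$$

Interpretation: A change in X by 1 unit, increases / decreases the Log Odds by $\beta$ units.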
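The log-odds interpretation can be checked numerically. In this small sketch the coefficients `beta_0` and `beta_1` are made-up values, and the sigmoid is used as the inverse of the log-odds. Raising `x` by one unit shifts the log-odds by `beta_1` and multiplies the odds by `exp(beta_1)`.

```python
import math


def sigmoid(z):
    """Map log-odds z back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    return p / (1.0 - p)


beta_0, beta_1 = -1.0, 0.7   # hypothetical intercept and coefficient

p_before = sigmoid(beta_0 + beta_1 * 2.0)   # x = 2
p_after = sigmoid(beta_0 + beta_1 * 3.0)    # x = 3, one unit higher

log_odds_change = math.log(odds(p_after)) - math.log(odds(p_before))
odds_ratio = odds(p_after) / odds(p_before)

print(log_odds_change, odds_ratio, math.exp(beta_1))
```

The change in probability itself depends on where `x` starts, which is why the coefficient is read on the log-odds (or odds ratio) scale rather than as a change in `p`.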


$$\begin{aligned}
z &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k \\
p &= \sigma(z) = \frac{1}{1 + e^{-z}} \quad \text{[Sigmoid]}
\end{aligned} \tag{14}$$

Table 12. Logistic Regression Co-efficients
Beta | Odds Ratio | Effect
0    | 1          | No effect on $p$
<0   | <1         | Decreases $p$
>0   | >1         | Increases $p$

$$\begin{aligned}
\text{Log Loss} &= -\frac{1}{n}\sum_{i=1}^{n}\big[y_i\log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\big] \\
\nabla_{\beta_j}\text{Log Loss} &= \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)\,x_{ij} \\
\beta_j &= \beta_j - \alpha\cdot\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)\,x_{ij}
\end{aligned} \tag{15}$$

Multi-class Classification

$$\begin{aligned}
p_{ik} &= \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} \\
\text{Cross Entropy Loss} &= -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik}\log(p_{ik}) \\
\nabla_{z_k}\text{Loss} &= p_{ik} - y_{ik} \\
\beta_j^{(k)} &= \beta_j^{(k)} - \alpha\cdot\frac{1}{n}\sum_{i=1}^{n}(p_{ik} - y_{ik})\,x_{ij}
\end{aligned} \tag{16}$$

Table 13. Multi Class Classification
           | Binary          | Multiclass
Loss       | Binary Log Loss | Cross Entropy
Activation | Sigmoid         | Softmax

5.2.2. Naive Bayes
Key Assumptions:
• Features are independent given the class
• Continuous features follow Gaussian distribution

$$\begin{aligned}
P(C_k \mid X_1, X_2, \dots, X_n) &= \frac{P(C_k)\prod_{i=1}^{n}P(X_i \mid C_k)}{P(X_1, X_2, \dots, X_n)} \\
\hat{C} &= \arg\max_{C_k}\ P(C_k)\prod_{i=1}^{n}P(X_i \mid C_k)
\end{aligned} \tag{17}$$

5.2.3. Metrics

Table 14. Confusion Matrix
Real / Pred | True                | False
True        | True Positive (TP)  | False Negative (FN)
False       | False Positive (FP) | True Negative (TN)

$$\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall / Sensitivity (TPR)} &= \frac{TP}{TP + FN} \\
\text{Specificity (TNR)} &= \frac{TN}{TN + FP} \\
\text{F1 Score} &= 2\cdot\frac{\text{Precision}\cdot\text{Recall}}{\text{Precision} + \text{Recall}} \\
F_{\beta} &= (1 + \beta^2)\cdot\frac{\text{Precision}\cdot\text{Recall}}{(\beta^2\cdot\text{Precision}) + \text{Recall}}
\end{aligned} \tag{18}$$

Figure 3. AUC ROC (and) AUC PR [1]

5.2.4. Assumptions
Similar assumptions as linear regression on fit
• Logits follow linear relationship
• Classes can be separated by a (roughly) linear decision boundary
• Categories are mutually exclusive

5.3. More Methods
Methods that can be used for both regression and classification

5.3.1. K Nearest Neighbours
• Non-parametric method
• No explicit training phase (lazy learner)
• Evaluation done by aggregating nearest points
• Cross-validate to get best K value

5.3.2. Support Vector Machines
Fits a hyperplane
Support Vectors: Points on margin

$$\begin{aligned}
f(x) &= w^T x + b \\
\text{Minimize:}\quad &\frac{1}{2}\|w\|^2
\end{aligned} \tag{19}$$

Classification
Constraints

$$y_i(w^T x_i + b) \geq 1,\quad \forall i \tag{20}$$

Hinge Loss

$$L = \frac{1}{n}\sum_{i=1}^{n}\max(0,\ 1 - y_i\hat{y}_i) \tag{21}$$
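The hinge loss of a linear SVM can be evaluated directly from the decision function f(x) = wᵀx + b. The weights, bias, and points below are made-up values for illustration, and labels are assumed to be encoded as +1 / -1. Points beyond the margin contribute zero, points inside the margin contribute a small penalty, and misclassified points contribute more than 1.

```python
def decision_function(w, b, x):
    """Linear decision value w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def hinge_loss(w, b, samples, labels):
    """Mean of max(0, 1 - y * f(x)) over all samples (labels in {-1, +1})."""
    losses = [max(0.0, 1.0 - y * decision_function(w, b, x))
              for x, y in zip(samples, labels)]
    return sum(losses) / len(losses)


w, b = [1.0, -1.0], 0.0                 # hypothetical hyperplane
samples = [[2.0, 0.0],                  # correct, outside the margin -> loss 0
           [0.5, 0.0],                  # correct, inside the margin  -> loss 0.5
           [0.0, 1.0],                  # correct, on the margin      -> loss 0
           [1.0, 0.0]]                  # misclassified               -> loss 2
labels = [1, 1, -1, -1]

loss = hinge_loss(w, b, samples, labels)
print(loss)
```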
Multi-class classification
One vs One (OvO): Fit $\binom{n}{2} = n(n-1)/2$ models
One vs Rest (OvR): Fit $n$ models

Regression
Constraints

$$\begin{aligned}
y_i - (w^T x_i + b) &\leq \epsilon + \xi_i^{+}, \\
(w^T x_i + b) - y_i &\leq \epsilon + \xi_i^{-}
\end{aligned} \tag{22}$$

Figure 4. Support Vectors [5]

Kernels
• Linear
• Polynomial
• Radial Basis Function (RBF): Exponential equation

NOTE: RBF Non-parametric (dimensions ∝ samples)

Table 15. Regularization with C in SVM
C         | E.g.   | Margin      | Slack / Errors
Low       | 0.1    | Wide        | High Slack
Moderate  | 1.0    | Balanced    | Moderate Slack
High      | 10     | Narrow      | Low Slack
Very High | $10^6$ | Very Narrow | Very Low Slack

5.3.3. Decision Trees
Tree-based structure to perform splits.
Can be considered a non-parametric model because it makes no assumptions about the data distribution.
• Identify best feature to split on
• Recursively continue splits
• Calculate predictions by average of node values

Figure 5. Decision Tree [3]

Split Calculations
• Maximize Information Gain
• Minimize Entropy
• Minimize Gini
• Minimize MSE (Regression)

Discrete: Evaluate each categorical value vs others
Continuous: Evaluate boundaries where predictions change

$$\begin{aligned}
\text{Gini}(t) &= 1 - \sum_{i=1}^{k} p_i^2 \\
\text{Entropy}(t) &= -\sum_{i=1}^{k} p_i\log_2(p_i) \\
\text{Information Gain}(t, A) &= \text{Entropy}(t) - \sum_{v\in\text{Values}(A)}\frac{|t_v|}{|t|}\,\text{Entropy}(t_v) \\
\text{Intrinsic Information}(A) &= -\sum_{v\in\text{Values}(A)}\frac{|t_v|}{|t|}\log_2\left(\frac{|t_v|}{|t|}\right) \\
\text{Gain Ratio}(A) &= \frac{\text{Information Gain}(A)}{\text{Intrinsic Information}(A)}
\end{aligned} \tag{23}$$

C 5.0 Algorithm
• Use Gain Ratio for optimized splits
• Winnowing (Remove features least used)
• Prune with Cost Complexity (# Leaf Nodes, Entropy)

Hyper-parameters
Can be used for Regularization (with Pruning)

Table 16. Hyper-parameters
Hyper-parameter   | Use
max_depth         | Maximum depth of tree
min_samples_split | Minimum samples to split
min_samples_leaf  | Minimum samples at leaf

5.4. Ensemble Methods
5.4.1. Bagging
Bootstrap Aggregating models
• Run parallel
• Minimize variance

$$\mathrm{Var}(\hat{y}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n}\hat{y}_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(\hat{y}_i) = \frac{1}{n^2}\cdot n\cdot\mathrm{Var}(\hat{y}_i) = \frac{\mathrm{Var}(\hat{y}_i)}{n} \tag{24}$$

(assuming independent models with equal variance)

Random Forest
• Randomly sample (bootstrap) rows for each tree
• Randomly subset features to test at each node

Out of Bag: Samples left out of a tree's bootstrap sample, used for a validation score

5.4.2. Boosting
Sequentially built models
• Run iteratively
• Minimize bias
• Gradient Descent
• Sum predictions from weighted models
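The boosting bullets can be sketched for squared-error loss, where the negative gradient is simply the residual. This is a toy version under stated assumptions: one feature, decision stumps as weak learners, and a fixed learning rate. The helper names (`fit_stump`, `boost`, `mse`) and the data are made up for the example.

```python
def fit_stump(xs, targets):
    """Best single-threshold split on one feature, minimizing squared error."""
    best = None
    candidates = sorted(set(xs))
    for left_max, right_min in zip(candidates, candidates[1:]):
        threshold = (left_max + right_min) / 2.0
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)                 # initial prediction: the mean
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    # Final model: base prediction plus the shrunken sum of all stumps.
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]
weak = boost(xs, ys, n_rounds=1)
strong = boost(xs, ys, n_rounds=50)
print(mse(weak, xs, ys), mse(strong, xs, ys))
```

Each round reduces the training error a little, which is the bias reduction noted above; the learning rate trades speed of fit against the risk of over-fitting.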


Adaptive Boosting (AdaBoost)
Weight samples higher for misclassification

$$\begin{aligned}
F(x) &= \sum_{m=1}^{M}\alpha_m h_m(x) \\
\alpha_m &= \frac{1}{2}\ln\left(\frac{1 - \epsilon_m}{\epsilon_m}\right) \\
D_m(x) &= D_{m-1}(x)\cdot\exp\big(-\alpha_m\,y\,h_m(x)\big)
\end{aligned} \tag{25}$$

where
$F(x)$ = Final Prediction
$h_m(x)$ = Prediction from the $m$th model
$\alpha_m$ = Weight for the $m$th model, based on its accuracy
$\epsilon_m$ = Weighted error rate of the $m$th model
$D_m(x)$ = Weight of sample $x$ after the $m$th model update
$y$ = True label of sample $x$ (±1)

Gradient Boosting
Add learners on residuals from previous models

$$\begin{aligned}
F_M(x) &= F_{M-1}(x) + \eta h_M(x) \\
r_i &= -\nabla L(F_{M-1}(x_i)) \\
F(x) &= \sum_{m=1}^{M}\eta h_m(x)
\end{aligned} \tag{26}$$

where
$F_M(x)$ = Prediction from the $M$th iteration of the model
$F_{M-1}(x)$ = Prediction from the $(M-1)$th iteration of the model
$h_m(x)$ = Model at the $m$th iteration (fit on residuals)
$r_i$ = Residual (negative gradient of the loss function)
$L$ = Loss function, used to compute the residuals
$\eta$ = Learning rate

Table 17. Gradient Boosting
            | LightGBM  | XGBoost
Growth      | Leaf Wise | Depth Wise
Categorical | Direct    | Encoding
Memory      | Efficient | Not so much

LightGBM:
• Histogram based approach to optimize splits
• Gradient Based One Sided Sampling (GOSS)
• Exclusive Feature Bundling (EFB)

$$\text{Leaf Value}\quad w_j = -\frac{\sum_{i\in I_j} g_i}{\sum_{i\in I_j} h_i + \lambda} \tag{27}$$

5.4.3. Stacking
Combine advantages of many models
• Train several base models
• Train meta model on cross validated predictions of base models

Hard Voting: Majority prediction
Soft Voting: Weighted average (accuracies)

5.5. Clustering
5.5.1. K-Means
Unsupervised approach to group data
• Assumes spherical clusters
• Results depend on initialization
• Follows Expectation (assign cluster) Maximization (recalculate centroid)

$$J(K) = \sum_{k=1}^{K}\sum_{x_i\in C_k}\|x_i - \mu_k\|^2 \tag{28}$$

NOTE: KMeans++ can be used to spread the initial centroids far apart

5.5.2. DB Scan
Density based clustering
• Needs minimum # points (minPts) and neighbourhood distance ($\epsilon$)
• Does not need number of clusters
• Identifies outliers

$$\text{core point: Number of points within distance } \epsilon \geq \text{minPts} \tag{29}$$

5.5.3. GMM
Gaussian Mixture Models: Soft clustering
• Assume several Gaussian distributions
• Model latent parameters

$$P(x) = \sum_{k=1}^{K}\pi_k\,\mathcal{N}(x \mid \mu_k, \Sigma_k) \tag{30}$$

5.5.4. Metrics
Silhouette Score: Intra-cluster vs inter-cluster distance.
Range: -1 to 1

$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \tag{31}$$

Where:
• $s(i)$: Silhouette score for point $i$,
• $a(i)$: Mean intra-cluster distance (average distance of $i$ to all other points in the same cluster),
• $b(i)$: Mean nearest-cluster distance (average distance of $i$ to points in the nearest cluster).

Adjusted Rand Index: Similarities by pairwise points
Range: -0.5 to 1 (about 0 for random assignment, 1 for identical clusterings)

$$ARI = \frac{\text{Index} - \text{Expected Index}}{\text{Max Index} - \text{Expected Index}} \tag{32}$$

Where:
• Index: Number of point pairs assigned to the same or different clusters in both ground truth and predicted clusters
• Expected Index: The expected value of the Index if clusters were randomly assigned
• Max Index: The maximum possible value of the Index

6. More Techniques
6.1. Anomaly Detection
• Isolation Forest (Random Forest - Average depth)
• DB Scan

6.2. Reinforcement Learning
Environment-based algorithm that learns from feedback


6.2.1. Policy Learning
Bellman Equation: Model-based recursive equation for state value updates.

$$\begin{aligned}
V(s) &\leftarrow \max_{a}\sum_{s'}P(s' \mid s, a)\,\big[R(s, a, s') + \gamma V(s')\big] \\
V^{\pi}(s) &= \sum_{a}\pi(a \mid s)\sum_{s'}P(s' \mid s, a)\,\big[R(s, a, s') + \gamma V^{\pi}(s')\big]
\end{aligned} \tag{33}$$

NOTE: Requires known probabilities and rewards
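Repeating the first Bellman update until the values stop changing is usually called value iteration. The sketch below runs it on a made-up two-state problem. The `transitions` table, state names, rewards, and discount are all illustrative assumptions, and the probabilities are known up front, as the note requires.

```python
# transitions[state][action] = list of (probability, next_state, reward)
transitions = {
    "A": {"stay": [(1.0, "A", 0.0)],
          "go": [(0.8, "B", 1.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)],
          "go": [(1.0, "A", 0.0)]},
}


def value_iteration(transitions, gamma=0.9, tolerance=1e-10):
    """Apply V(s) <- max_a sum_s' P(s'|s,a) [R + gamma V(s')] until convergence."""
    values = {state: 0.0 for state in transitions}
    while True:
        delta = 0.0
        for state, actions in transitions.items():
            best = max(
                sum(p * (reward + gamma * values[next_state])
                    for p, next_state, reward in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(best - values[state]))
            values[state] = best          # in-place update of the state value
        if delta < tolerance:
            return values


values = value_iteration(transitions)
print(values)
```

In this toy problem the best plan is to move from A to B and then stay in B, so B's value is the discounted sum of its repeated reward, 2 / (1 - 0.9).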


6.2.2. Q-Learning
Model-free algorithm to decide best actions from trial and error.

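$$\begin{aligned}
Q(s, a) &\leftarrow Q(s, a) + \alpha\,\big[R(s, a, s') + \gamma\max_{a'}Q(s', a') - Q(s, a)\big] \\
Q^{\pi}(s, a) &= \sum_{s'}P(s' \mid s, a)\,\Big[R(s, a, s') + \gamma\sum_{a'}\pi(a' \mid s')\,Q^{\pi}(s', a')\Big] \\
\pi(s) &= \arg\max_{a}Q(s, a)
\end{aligned} \tag{34}$$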

6.2.3. Exploration Exploitation


Balance exploration (to search new paths) and exploitation (capitalize on high rewards)

$\epsilon$-Greedy

$$a = \begin{cases}
\text{random action} & \text{with probability } \epsilon, \\
\arg\max_{a}Q(s, a) & \text{with probability } 1 - \epsilon.
\end{cases} \tag{35}$$
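The Q-learning update and the ε-greedy rule can be combined in a short tabular sketch. The environment is a made-up five-state chain in which only reaching the last state pays a reward. The state layout, hyper-parameter values, and random tie-breaking are assumptions chosen for illustration.

```python
import random

N_STATES = 5            # states 0..4; state 4 is terminal
ACTIONS = (-1, +1)      # move left or move right


def step(state, action):
    """Hypothetical chain environment: reward 1 only on reaching the last state."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def epsilon_greedy(q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)                        # explore
    best = max(q[(state, a)] for a in ACTIONS)
    # Exploit; break ties at random so early episodes do not get stuck.
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            action = epsilon_greedy(q, state, epsilon, rng)
            next_state, reward = step(state, action)
            # Temporal-difference target: reward plus discounted best next value.
            target = reward + gamma * max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state
    return q


q = q_learning()
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

No transition probabilities are supplied to the learner; it only sees sampled transitions, which is what makes the method model-free.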

7. Nuances
7.1. Imbalanced Data
• Data Sampling
• Weighted Loss Functions
• Tree Based Methods (Robust)
• Precision-Recall curves instead of ROC (ROC can look optimistic because false positives are diluted by the many true negatives)

7.2. Biases
• Identify difference in distributions for features
• Up-sample data across biased attributes
• Normalize with respect to groups
• Mask / Group together data
• Embed to lower dimension with VAE

7.3. More Data Unhelpful


• Data is white noise
• Model is restrictive (cannot learn more)
• A second round of data collection might introduce biases
• New data might increase the class imbalance

References
[1] Area Under Curves. [Online]. Available: [Link] [Link]/how-and-why-i-switched-from-the-roc-curve-to-the-precision-recall-curve-to-analyze-my-imbalanced-6171da91c6b8.
[2] Bias Variance Tradeoff. [Online]. Available: https://en.[Link]/wiki/Bias%E2%80%93variance_tradeoff.
[3] Decision Trees. [Online]. Available: [Link]com/machine-learning-decision-tree-classification-algorithm.
[4] Pair Plots. [Online]. Available: [Link]python-seaborn-pairplot-method/.
[5] Support Vectors. [Online]. Available: [Link]net/figure/Overview-of-SVM-algorithm-a-SVM-for-classification-b-SVM-for-regression_fig1_347831458.

