DS ML Machine Learning I
Shubhankar Agrawal
Abstract
This document serves as a quick refresher for Data Science and Machine Learning interviews. It covers mathematical and technical concepts
across a range of algorithms. It assumes the reader has foundational knowledge of the field from tertiary education. This PDF
contains material for revising key concepts that are tested in interviews.
Contents
4.4 Sampling
4.5 Selection
    Iterative • Model Based • Statistical Tests
4.6 Dimensionality Reduction
5 Algorithms
5.1 Regression
    OLS • Gradient Descent • Metrics • Regularization • Assumptions
5.2 Classification
    Logistic Regression • Naive Bayes • Metrics • Assumptions
5.5 Clustering
    K-Means • DBSCAN • GMM • Metrics
6 More Techniques
6.1 Anomaly Detection
6.2 Reinforcement Learning
    Policy Learning • Q-Learning • Exploration Exploitation
7 Nuances
7.1 Imbalanced Data
7.2 Biases
7.3 More Data Unhelpful

Table 1. Data Types

Types of Learning:
Supervised: Labelled data
Unsupervised: Unlabelled data
Reinforcement: Learn with feedback from environment

Other terms:
Parameters: Weights the model learns
Hyper-parameters: Settings chosen before training to adjust performance
Cross Validation: Rotate the validation split so all data is used (K Fold, LOOCV, Temporal)

Figure 1. Bias Variance Trade-off [2]
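The cross-validation idea listed under "Other terms" can be sketched in a few lines. This is a minimal illustration using only the standard library; the helper name `k_fold_indices` is hypothetical and the folds are taken in order without shuffling, which is an assumption made for readability.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs so that every sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "val:", val_idx)
```

Setting `k` equal to `n_samples` gives LOOCV. A temporal split would instead train only on indices that come before the validation fold, so that no future data leaks into training.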
Machine Learning: Data Science and ML Refresher
2. Exploratory Data Analysis
2.1. General Statistics
Pandas operations to analyse dataframes:
info: Information on null values, data types, and memory
describe: Descriptive statistics of mean, median and IQR
value_counts: Counts for categorical columns

2.2. Univariate Analyses
2.2.1. Target Variable Distribution
Continuous: Plot distribution, identify outliers
Discrete: Value counts to check for imbalanced data
2.2.2. Feature Distributions
Continuous: Box plots, Histograms
Discrete: Bar charts of value counts

2.3. Bi-variate Analyses
Table 3. Bi-variate Analyses
Comparing    Compared     Plots
Continuous   Continuous   Scatter
Continuous   Discrete     Violin, Bar, Line
Discrete     Discrete     Grouped Bars

Pair plots from Seaborn can plot different types of features against
each other in a single graphic.

• Fill with Mean / Median by groups (or) nearest neighbours
• Analyse temporal patterns to fill by dates
• Fit a regression line to fill missing values

3.1.2. Outliers
Identify by Box Plots, Anomaly Detection
• Remove if not too many.
• Winsorize (clip by IQR)

3.1.3. Erroneous Values
Some values might be errors and can be fixed
• Amounts scaled by 10 / 100
• Negative values in a non-negative space
• Invalid values (such as co-ordinates)

3.1.4. Others
• Drop duplicates
• Fix entities (NYC vs New York City)
• Date formats

3.2. Feature Scaling
3.2.1. Standardization
• Calculate Z Scores
• Useful to bring features to a similar scale (zero mean, unit variance)
• More robust to outliers than MinMax scaling
• Better for SVM, Linear Regression
Can also use IQR to scale more robustly.

3.2.2. Normalization
• MinMax Scaling
• Good for data without outliers
• Better for KNN, Neural Networks
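The three scaling options above (Z scores, MinMax, and IQR-based scaling) can be compared on a small sample that contains one outlier. This is a rough sketch with the standard library only; the function names and the sample values are illustrative assumptions, not part of the original material.

```python
import statistics


def standardize(values):
    """Z scores: subtract the mean, divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max(values):
    """MinMax scaling to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def robust_scale(values):
    """Centre on the median and divide by the inter-quartile range (IQR)."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return [(v - median) / (q3 - q1) for v in values]


# Nine ordinary values plus one outlier (100.0).
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
z = standardize(data)
mm = min_max(data)
rs = robust_scale(data)

print("z-scores:", [round(v, 2) for v in z])
print("min-max :", [round(v, 2) for v in mm])
print("robust  :", [round(v, 2) for v in rs])
```

With MinMax scaling the single outlier squeezes the nine ordinary values into a narrow band near zero, which is why it suits data without outliers. The IQR-based version keeps the ordinary values spread out because the quartiles barely move.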
$$\max_{\lambda}\ \ell(\lambda) = -\frac{n}{2}\log(\sigma^2) + (\lambda - 1)\sum_{i=1}^{n}\log(y_i) \tag{1}$$
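Equation (1) has the form of the Box-Cox profile log-likelihood, where σ² is the variance of the transformed values. Under that assumption (the surrounding text does not name the transform), a grid search over λ can be sketched as follows. The grid, the simulated log-normal sample, and the function names are all illustrative choices.

```python
import math
import random


def box_cox(y, lam):
    """Box-Cox transform; lam = 0 is treated as the log transform."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]


def log_likelihood(y, lam):
    """Evaluate -(n/2) * log(sigma^2) + (lam - 1) * sum(log(y_i)) for one lambda."""
    n = len(y)
    transformed = box_cox(y, lam)
    mean = sum(transformed) / n
    sigma2 = sum((v - mean) ** 2 for v in transformed) / n
    return -n / 2 * math.log(sigma2) + (lam - 1) * sum(math.log(v) for v in y)


# Simulated log-normal data: the log transform (lambda near 0) should fit best.
rng = random.Random(0)
y = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(500)]

grid = [i / 10 for i in range(-20, 21)]  # candidate lambdas from -2.0 to 2.0
best_lam = max(grid, key=lambda lam: log_likelihood(y, lam))
print("best lambda on the grid:", best_lam)
```

A finer grid or a one-dimensional optimiser would give a more precise λ; the coarse grid is only meant to show how the objective is evaluated.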
4.1.2. Pixels
• Intensity
• Hue
• Brightness

4.1.3. Temporal
• Year, Month, Day
• Part of day, Weekday
• Fourier Transforms (sin, cos) for periodicity

4.2. Transformation
4.2.1. Temporal
Add temporal changes and history

4.5. Selection
4.5.1. Iterative
Sequentially add or remove features one by one to optimize performance

Table 6. Iterative Feature Selection
Step    Forward          Backward
Start   0 features       All features
Step    Feature to add   Feature to remove

4.5.2. Model Based
Use model capabilities to identify important features

Table 7. Model based Feature Selection
Model           Identification
Lasso (L1)      0 co-efficients - Remove
Random Forest   Feature Importances - Descending

4.5.3. Statistical Tests
Statistical tests can provide comparison of significance by looking at:

Table 9. Dimensionality Reduction
            PCA               LDA        t-SNE       UMAP
Labels?     No                Yes        No          Both
Linear      Yes               Yes        No          No
Preserves   Global            Classes    Local       Global
Best For    Linear Patterns   Classify   Visualize   Clustering Embeddings
Issues      Non-linear        Labels     Local       Tuning
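The sin / cos encoding mentioned under 4.1.3 for periodic temporal features can be illustrated with the hour of day. This is a small sketch; `cyclical_encode` is a hypothetical helper name, and the 24-hour period is just one example of a cycle length.

```python
import math


def cyclical_encode(value, period):
    """Map a periodic value (e.g. hour of day) onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


h23 = cyclical_encode(23, period=24)
h0 = cyclical_encode(0, period=24)
h12 = cyclical_encode(12, period=24)

# 23:00 and 00:00 are one hour apart, so they should be close after encoding,
# while 00:00 and 12:00 sit on opposite sides of the circle.
print("distance 23h-0h :", round(math.dist(h23, h0), 3))
print("distance 0h-12h :", round(math.dist(h0, h12), 3))
```

Using the raw hour as a number would place 23 and 0 far apart; the pair of sin and cos features removes that artificial jump at the end of each cycle.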
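5. Algorithms
Examples of different types:
Supervised: Most models with labelled data
Unsupervised: Clustering, Variational Auto-Encoders (VAE)
Parametric: Most models with weights to learn
Non-Parametric: K Nearest Neighbours, Decision Trees
Discriminative: Most models that predict
Generative: Naive Bayes, Latent Dirichlet Allocation, VAEs

5.1. Regression
Predicting a continuous variable.
$$y = X\beta + \varepsilon \tag{5}$$
Interpretation: A 1-unit change in X increases / decreases y by β units, holding the other features fixed.

5.1.1. OLS
Ordinary Least Squares, closed form
$$\begin{aligned}
\hat{\beta} &= \arg\min_{\beta}\,(y - X\beta)'(y - X\beta) \\
\hat{\beta} &= (X'X)^{-1}X'y
\end{aligned} \tag{6}$$
Notes:

Table 10. Gradient Descent Terminology
Variable             Symbol   Description
Loss/Cost Function   J        Penalizes predictions
Learning Rate        α        Learning step size

MSE
$$J(\beta) = \frac{1}{2n}\sum_{i=1}^{n}(y_i - X_i\beta)^2$$

$$\begin{aligned}
\text{Pred Int} &= \hat{y}_{*} \pm t_{\alpha/2,\,n-k-1}\cdot \mathrm{SE}(\hat{y}_{*}) \\
\mathrm{SE}(\hat{y}_{*}) &= \sqrt{\hat{\sigma}^2\cdot\left(1 + \mathbf{x}_{*}^{\top}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{x}_{*}\right)}
\end{aligned} \tag{11}$$

5.1.3. Metrics
$$\begin{aligned}
\text{MAE} &= \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i| \\
\text{MAPE} &= \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|\cdot 100 \\
\text{MSE} &= \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \\
\text{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \\
R^2 &= 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \\
R^2_{\text{adj}} &= 1 - \left(\frac{(1 - R^2)(n - 1)}{n - p - 1}\right)
\end{aligned} \tag{12}$$

5.1.4. Regularization
Control co-efficients, preventing them from getting too large

Table 11. Regression Regularization
Type          Term                                                        Gradient
Lasso (L1)    $\lambda\sum|\beta_j|$                                      $\lambda\cdot\mathrm{sign}(\beta_j)$
Ridge (L2)    $\frac{\lambda}{2}\sum\beta_j^2$                            $\lambda\beta_j$
Elastic Net   $\frac{\lambda}{2}\sum\beta_j^2 + \lambda_1\sum|\beta_j|$   $\lambda\beta_j + \lambda_1\cdot\mathrm{sign}(\beta_j)$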
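The closed-form OLS solution in Eq. (6) and gradient descent on the cost J(β) can be checked against each other. The sketch below assumes a single feature plus an intercept, for which (X'X)⁻¹X'y reduces to the familiar slope and intercept formulas. The toy data, learning rate, and step count are illustrative assumptions.

```python
def ols_closed_form(x, y):
    """Normal-equation solution for y = b0 + b1 * x (one feature plus intercept)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def gradient_descent(x, y, alpha=0.05, steps=5000):
    """Minimise J = 1/(2n) * sum((y_i - b0 - b1 * x_i)^2) with step size alpha."""
    n = len(x)
    b0 = b1 = 0.0
    for _ in range(steps):
        residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
        grad_b0 = -sum(residuals) / n
        grad_b1 = -sum(r * xi for r, xi in zip(residuals, x)) / n
        b0 -= alpha * grad_b0
        b1 -= alpha * grad_b1
    return b0, b1


x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 7.1, 8.8]

exact = ols_closed_form(x, y)
approx = gradient_descent(x, y)
print("closed form     :", exact)
print("gradient descent:", approx)
```

The closed form is exact but needs a matrix inverse in the general case; gradient descent avoids the inverse at the cost of choosing a learning rate and a number of steps.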
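The regression metrics in Eq. (12) translate directly into code. The following is a minimal sketch; the function name, the argument `n_features` (the p in the adjusted R² formula), and the sample values are assumptions for illustration. MAPE is undefined when a true value is zero, which this sketch does not guard against.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Compute MAE, MAPE, MSE, RMSE, R^2 and adjusted R^2 for paired values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n * 100
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    y_bar = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"mae": mae, "mape": mape, "mse": mse, "rmse": rmse, "r2": r2, "r2_adj": r2_adj}


metrics = regression_metrics(
    y_true=[2.0, 4.0, 6.0, 8.0],
    y_pred=[2.5, 3.5, 6.5, 7.5],
    n_features=1,
)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```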
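$$\begin{aligned}
z &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k \\
p &= \sigma(z) = \frac{1}{1 + e^{-z}} \quad \text{[Sigmoid]}
\end{aligned} \tag{14}$$

Table 12. Logistic Regression Co-efficients
Beta   Odds Ratio   Effect
0      1            No effect (p = 1 − p)
<0     <1           Decreases p
>0     >1           Increases p

$$\begin{aligned}
\text{Log Loss} &= -\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right] \\
\nabla_{\beta_j}\text{Log Loss} &= \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)\,x_{ij} \\
\beta_j &= \beta_j - \eta\cdot\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)\,x_{ij}
\end{aligned} \tag{15}$$

Multi-class Classification
$$\begin{aligned}
p_{ik} &= \frac{e^{z_k}}{\sum_{j=1}^{K}e^{z_j}} \\
\text{Cross Entropy Loss} &= -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}y_{ik}\log(p_{ik})
\end{aligned} \tag{16}$$

$$\begin{aligned}
P(C_k \mid X_1, X_2, \dots, X_n) &= \frac{P(C_k)\prod_{i=1}^{n}P(X_i \mid C_k)}{P(X_1, X_2, \dots, X_n)} \\
\hat{C} &= \arg\max_{C_k}\ P(C_k)\prod_{i=1}^{n}P(X_i \mid C_k)
\end{aligned} \tag{17}$$

5.2.3. Metrics
Table 14. Confusion Matrix
Real / Pred   True                  False
True          True Positive (TP)    False Negative (FN)
False         False Positive (FP)   True Negative (TN)

$$\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall / Sensitivity (TPR)} &= \frac{TP}{TP + FN} \\
\text{Specificity (TNR)} &= \frac{TN}{TN + FP} \\
\text{F1 Score} &= 2\cdot\frac{\text{Precision}\cdot\text{Recall}}{\text{Precision} + \text{Recall}} \\
F_{\beta} &= (1 + \beta^2)\cdot\frac{\text{Precision}\cdot\text{Recall}}{(\beta^2\cdot\text{Precision}) + \text{Recall}}
\end{aligned} \tag{18}$$

Figure 3. AUC ROC (and) AUC PR [1]

$$\begin{aligned}
f(x) &= w^{T}x + b \\
\text{Minimize: } &\frac{1}{2}\|w\|^2
\end{aligned} \tag{19}$$

Classification Constraints
$$y_i(w^{T}x_i + b) \geq 1,\quad \forall i \tag{20}$$

Hinge Loss
$$L = \frac{1}{n}\sum_{i=1}^{n}\max(0,\ 1 - y_i\hat{y}_i) \tag{21}$$

Multi-class classification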
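Returning to logistic regression, the sigmoid in Eq. (14) and the gradient update in Eq. (15) can be combined into a small training loop. This sketch assumes one feature plus an intercept; the toy data (with two overlapping points so the classes are not perfectly separable), the learning rate η = 0.5, and the step count are illustrative choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(xs, ys, b0, b1):
    """Average binary cross-entropy for the model p = sigmoid(b0 + b1 * x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(xs)


def fit_logistic(xs, ys, eta=0.5, steps=2000):
    """Gradient descent using the update beta_j -= eta * mean((y_hat - y) * x_j)."""
    n = len(xs)
    b0 = b1 = 0.0
    for _ in range(steps):
        preds = [sigmoid(b0 + b1 * x) for x in xs]
        grad_b0 = sum(p - y for p, y in zip(preds, ys)) / n
        grad_b1 = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        b0 -= eta * grad_b0
        b1 -= eta * grad_b1
    return b0, b1


xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, -0.2, 0.3]
ys = [0, 0, 0, 1, 1, 1, 1, 0]

initial_loss = log_loss(xs, ys, 0.0, 0.0)  # equals log(2) when all weights are zero
b0, b1 = fit_logistic(xs, ys)
final_loss = log_loss(xs, ys, b0, b1)
print("weights:", round(b0, 3), round(b1, 3))
print("loss before / after:", round(initial_loss, 4), round(final_loss, 4))
```

A positive fitted `b1` corresponds to the ">0" row of Table 12: larger x raises the predicted probability p.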
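The classification metrics in Eq. (18) follow directly from the four confusion-matrix counts in Table 14. The sketch below is illustrative; the function name and the example counts are assumptions, and it does not guard against empty classes (zero denominators).

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Compute the metrics of Eq. (18) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
        "f_beta": (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall),
    }


m1 = classification_metrics(tp=40, fp=10, fn=20, tn=30)
m2 = classification_metrics(tp=40, fp=10, fn=20, tn=30, beta=2.0)
for name, value in m1.items():
    print(f"{name}: {value:.4f}")
print(f"f_beta with beta=2: {m2['f_beta']:.4f}")
```

With β = 1 the F-beta score reduces to F1; β > 1 weights recall more heavily, which is why the β = 2 score moves toward the (lower) recall in this example.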
Table 15. Regularization with C in SVM
C           E.g.   Margin        Slack / Errors
Low         0.1    Wide          High Slack
Moderate    1.0    Balanced      Moderate Slack
High        10     Narrow        Low Slack
Very High   10^6   Very Narrow   Very Low Slack

5.3.3. Decision Trees
Tree-based structure to perform splits.
Can be considered a non-parametric model because it makes no assumptions about the data distribution.
• Identify best feature to split on
• Recursively continue splits
• Calculate predictions by average of node values

Hyper-parameters
Can be used for Regularization (with Pruning)

Table 16. Hyper-parameters
Hyper-parameter     Use
max_depth           Maximum depth of tree
min_samples_split   Minimum samples to split
min_samples_leaf    Minimum samples at leaf

5.4. Ensemble Methods
5.4.1. Bagging
Bootstrap Aggregating models
• Run parallel
• Minimize variance
$$\mathrm{Var}(\hat{y}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n}\hat{y}_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(\hat{y}_i) = \frac{1}{n^2}\cdot n\cdot\mathrm{Var}(\hat{y}_i) \tag{24}$$
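The variance identity in Eq. (24) says that averaging n predictions divides the variance by n, provided the individual predictions are independent with equal variance. A quick simulation can illustrate this. The sketch stands in for each model's prediction error with independent Gaussian noise, which is an idealising assumption: models trained on bootstrap samples of the same data are correlated, so the reduction seen in practice is smaller.

```python
import random
import statistics

rng = random.Random(42)


def ensemble_prediction(n_models, noise_std=1.0):
    """Average the (simulated, independent) prediction errors of n_models models."""
    return statistics.fmean(rng.gauss(0.0, noise_std) for _ in range(n_models))


trials = 20000
n_models = 25

single = [ensemble_prediction(1) for _ in range(trials)]
bagged = [ensemble_prediction(n_models) for _ in range(trials)]

var_single = statistics.pvariance(single)
var_bagged = statistics.pvariance(bagged)
print("variance of one model       :", round(var_single, 4))
print("variance of 25-model average:", round(var_bagged, 4))
print("theoretical value (1 / n)   :", 1 / n_models)
```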
Random Forest
5.4.2. Boosting
Sequentially built models
Gradient Boosting
Add learners on residuals from previous models
where
$F_M(x)$ = Prediction from the $M$th iteration of the model
$F_{M-1}(x)$ = Prediction from the $(M-1)$th iteration of the model
$$\text{Leaf Value}\quad w_j = -\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i + \lambda} \tag{27}$$

5.4.3. Stacking
Combine advantages of many models
• Train several base models
• Train meta model - cross validated predictions of base models

core point: Number of points within distance $\epsilon \geq \text{minPts}$ (29)

$$P(x) = \sum_{k=1}^{K}\pi_k\,\mathcal{N}(x \mid \mu_k, \Sigma_k) \tag{30}$$

Where:
• Index: Number of point pairs assigned to the same or different clusters in both ground truth and predicted clusters
• Expected Index: The expected value of the Index if clusters were randomly assigned
• Max Index: The maximum possible value of the Index

6. More Techniques
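Returning to the DBSCAN core-point rule in (29), a point is a core point when at least minPts points lie within distance ε of it. The brute-force check below is a sketch only; counting the point itself as one of its own neighbours is an assumed convention, and the helper name and sample points are illustrative.

```python
import math


def core_points(points, eps, min_pts):
    """Return the points that have at least min_pts neighbours within eps (self included)."""
    core = []
    for p in points:
        neighbours = sum(1 for q in points if math.dist(p, q) <= eps)
        if neighbours >= min_pts:
            core.append(p)
    return core


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
core = core_points(points, eps=0.2, min_pts=4)
print("core points:", core)
```

The four tightly packed points qualify as core points, while the isolated point at (5, 5) does not and would be treated as noise unless it falls within ε of some core point.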
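$$\begin{aligned}
V(s) &\leftarrow \max_{a}\sum_{s'}P(s' \mid s, a)\left[R(s, a, s') + \gamma V^{*}(s')\right] \\
V^{\pi}(s) &= \sum_{a}\pi(a \mid s)\sum_{s'}P(s' \mid s, a)\left[R(s, a, s') + \gamma V^{\pi}(s')\right]
\end{aligned} \tag{33}$$

$$Q^{\pi}(s, a) = \sum_{s'}P(s' \mid s, a)\left[R(s, a, s') + \gamma\sum_{a'}\pi(a' \mid s')\,Q^{\pi}(s', a')\right] \tag{34}$$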
𝜖-Greedy
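A common way to implement ε-greedy action selection is to explore with probability ε (pick a uniformly random action) and otherwise exploit the action with the highest estimated Q-value. The sketch below follows that reading; the Q-values, ε = 0.1, and the function name are illustrative assumptions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the highest-valued action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit


rng = random.Random(0)
q_values = [0.1, 0.9, 0.4]  # action 1 has the highest estimated value
epsilon = 0.1

choices = [epsilon_greedy(q_values, epsilon, rng) for _ in range(10000)]
greedy_share = choices.count(1) / len(choices)
print("share of greedy action:", round(greedy_share, 3))
```

Because the random branch can also land on the greedy action, its long-run share is 1 − ε + ε / (number of actions) rather than exactly 1 − ε.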
7. Nuances
7.1. Imbalanced Data
• Data Sampling
• Weighted Loss Functions
• Tree Based Methods (Robust)
• Precision-Recall instead of ROC (False Positives)
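One simple way to build a weighted loss function is to weight each class inversely to its frequency. The sketch below uses the heuristic weight = n / (k · count), where n is the number of samples and k the number of classes; this particular formula and the helper name are assumptions chosen for illustration, and other weighting schemes are equally valid.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


labels = [0] * 90 + [1] * 10  # 90% majority class, 10% minority class
weights = balanced_class_weights(labels)
print("class weights:", weights)

# Each class now contributes the same total weight to the loss.
total_per_class = {cls: weights[cls] * labels.count(cls) for cls in weights}
print("total weight per class:", total_per_class)
```

Multiplying each sample's loss term by its class weight makes errors on the minority class cost more, which has a similar effect to up-sampling that class.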
7.2. Biases
• Identify difference in distributions for features
• Up-sample data across biased attributes
• Normalize with respect to groups
• Mask / Group together data
• Embed to lower dimension with VAE
References
[1] Area Under Curves. [Online]. Available: [Link][Link]/how-and-why-i-switched-from-the-roc-curve-to-the-precision-recall-curve-to-analyze-my-imbalanced-6171da91c6b8.
[2] Bias Variance Tradeoff. [Online]. Available: https://en.[Link]/wiki/Bias%E2%80%93variance_tradeoff.
[3] Decision Trees. [Online]. Available: [Link]com/machine-learning-decision-tree-classification-algorithm.
[4] Pair Plots. [Online]. Available: [Link]python-seaborn-pairplot-method/.