COMPREHENSIVE AI/ML GUIDE
Level 1: Machine Learning Fundamentals
Level 2: Data Preprocessing
Level 3: Evaluation & Metrics
Level 4: Probability & Statistics
Level 5: Model Improvement Techniques
Generated on: April 06, 2026
TABLE OF CONTENTS
Level 1: Machine Learning Algorithms
• Regression Models
• Classification Algorithms
• Tree-Based Models
• Support Vector Machines
Level 2: Data Preprocessing
• Missing Values
• Encoding Categorical Data
• Feature Scaling
• Train-Test Split & Cross-Validation
Level 3: Evaluation & Metrics
• Confusion Matrix
• Classification Metrics
• ROC Curve & AUC
Level 4: Probability & Statistics
• Descriptive Statistics
• Probability Concepts
• Distributions
Level 5: Model Improvement
• Bias vs Variance
• Hyperparameter Tuning
• Feature Selection
LEVEL 1: MACHINE LEARNING ALGORITHMS
Objective: Understand various algorithms and when to use them.
1.1 LINEAR REGRESSION
What is it?
Linear Regression is a supervised learning algorithm that models the relationship between one
independent variable (X) and a dependent variable (Y) by fitting a straight line.
Why is it?
It's the simplest and most interpretable model for understanding the relationship between
variables, and it is the foundation for understanding more complex algorithms.
Where is it used?
House price prediction, stock price forecasting, sales forecasting, and any continuous variable
prediction.
Mathematical Formula:
y = mx + c
OR
y = β₀ + β₁x + ε
Where:
• y = dependent variable (output)
• x = independent variable (input)
• m (or β₁) = slope (rate of change)
• c (or β₀) = y-intercept
• ε = error term
Cost Function (Mean Squared Error):
MSE = (1/n) Σ(yᵢ - ŷᵢ)²
Algorithm:
1. Initialize parameters m and c to 0
2. Calculate predictions: ŷ = mx + c
3. Calculate error: MSE
4. Update m and c using gradient descent
5. Repeat until convergence
Gradient Descent Update:
m = m - α × (∂MSE/∂m)
c = c - α × (∂MSE/∂c)
α = learning rate (typically 0.01)
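The update rule above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the toy data (points lying on y = 2x + 1), the learning rate of 0.05, and the step count are assumptions chosen so the loop converges quickly.

```python
def fit_line(xs, ys, lr=0.05, steps=5000):
    """Fit y = m*x + c by gradient descent on the mean squared error."""
    m, c = 0.0, 0.0                              # step 1: initialize parameters
    n = len(xs)
    for _ in range(steps):
        # Prediction minus target for every sample.
        errors = [(m * x + c) - y for x, y in zip(xs, ys)]
        grad_m = (2 / n) * sum(e * x for e, x in zip(errors, xs))  # dMSE/dm
        grad_c = (2 / n) * sum(errors)                             # dMSE/dc
        m -= lr * grad_m                         # step 4: gradient descent update
        c -= lr * grad_c
    return m, c

# Toy data on the line y = 2x + 1 (assumed for illustration).
m, c = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(round(m, 3), round(c, 3))
```

With unscaled inputs such as areas in the thousands of square feet, a learning rate this large would likely diverge, which is one reason feature scaling helps gradient-based training.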
Example:
Predicting house prices based on area in sq ft:
• Area = 2000 sq ft → Price = $300,000
• Area = 3000 sq ft → Price = $450,000
• Fitted line: Price = 150×Area + 0
Types:
• Simple Linear Regression (1 independent variable)
• Limitation: it cannot handle multiple features directly; use Multiple Linear Regression (1.2)
1.2 MULTIPLE LINEAR REGRESSION
What is it?
Extension of Linear Regression with multiple independent variables affecting one dependent
variable.
Formula:
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
Example Prediction:
House Price = 50,000 + 150×(Area) + 5,000×(Bedrooms) + 3,000×(Age)
When to use:
• When multiple factors affect the outcome
• More realistic real-world scenarios
• Often better predictions than simple regression when the extra features are informative
1.3 POLYNOMIAL REGRESSION
What is it?
Extends linear regression by fitting a polynomial curve instead of a straight line.
Formula:
y = β₀ + β₁x + β₂x² + β₃x³ + ... + βₙxⁿ + ε
Degree of Polynomial:
• Degree 2 (Quadratic): y = β₀ + β₁x + β₂x²
• Degree 3 (Cubic): y = β₀ + β₁x + β₂x² + β₃x³
Example:
Modeling growth that accelerates (exponential-like):
• Stock portfolio value over years (non-linear growth)
• Acceleration in physics
Pros & Cons:
✓ Captures non-linear relationships
✗ Risk of overfitting with high degree
✗ Computationally more expensive
1.4 LOGISTIC REGRESSION (VERY IMPORTANT)
What is it?
Classification algorithm that predicts probability of binary outcome (0 or 1, True or False).
Why is it important?
• Foundation of neural networks
• Used in credit card fraud detection
• Email spam classification
• Medical diagnosis
Key Concept:
Uses sigmoid function to convert linear output to probability.
Sigmoid(z) = 1 / (1 + e^(-z))
where z = β₀ + β₁x
Sigmoid Properties:
• Output always between 0 and 1
• z = 0 → sigmoid = 0.5
• z → ∞ → sigmoid → 1
• z → -∞ → sigmoid → 0
Decision Boundary:
If probability > 0.5 → Predict Class 1
If probability ≤ 0.5 → Predict Class 0
Cost Function (Log Loss / Binary Cross-Entropy):
J(β) = -(1/n) × Σ[yᵢ×log(ŷᵢ) + (1-yᵢ)×log(1-ŷᵢ)]
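The sigmoid, the 0.5 decision boundary, and the log loss can be checked with a few lines of Python. This is a small sketch; the function names and the sample probabilities are illustrative assumptions.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(probability, threshold=0.5):
    """Decision boundary: class 1 only if probability > threshold."""
    return 1 if probability > threshold else 0

def log_loss(y_true, y_prob):
    """Binary cross-entropy averaged over all samples."""
    n = len(y_true)
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_prob))
    return -total / n

print(sigmoid(0))                          # 0.5
print(predict_class(sigmoid(2.0)))         # 1
print(log_loss([1, 0], [0.9, 0.1]))        # confident and correct: small loss
print(log_loss([1, 0], [0.1, 0.9]))        # confident and wrong: large loss
```

The last two lines illustrate why log loss is used: confident wrong predictions are penalized far more heavily than confident correct ones are rewarded.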
Example:
Email Spam Classification:
• Input: 50 features (word frequency, sender info)
• Output: Probability of being spam
• If P(spam) > 0.5 → Mark as spam
Limitations:
• Cannot handle non-linear decision boundaries well
• Works best with linearly separable data
1.5 K-NEAREST NEIGHBORS (KNN)
What is it?
Non-parametric algorithm that classifies based on K nearest data points.
Algorithm:
1. Choose K (number of neighbors)
2. Calculate distance to all training points
3. Find K nearest points
4. Classification: majority vote among K neighbors
Distance Metrics:
Euclidean: d = √[(x₁-x₂)² + (y₁-y₂)²]
Manhattan: d = |x₁-x₂| + |y₁-y₂|
Example with K=3:
Predicting iris flower type based on petal length & width:
• Find 3 nearest flowers
• If 2 are Iris Setosa, 1 is Versicolor → Predict Setosa
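The four algorithm steps fit in one short function. The sketch below uses Euclidean distance and made-up petal measurements (not the real Iris dataset), so the numbers are assumptions for illustration only.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = [
        (math.dist(point, query), label)         # Euclidean distance
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up (petal length, petal width) values in cm.
points = [(1.4, 0.2), (1.5, 0.3), (4.5, 1.5), (4.7, 1.4), (5.0, 1.7)]
labels = ["setosa", "setosa", "versicolor", "versicolor", "versicolor"]

# The 3 nearest neighbors are 2 setosa and 1 versicolor.
print(knn_predict(points, labels, query=(2.0, 0.5)))   # setosa
```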
Choosing K:
• K too small → Noise sensitive
• K too large → Over-smoothing
• Rule of thumb: K = √(n) where n = training samples
Pros & Cons:
✓ Simple to understand
✓ No training phase
✗ Slow during prediction (O(n) complexity)
✗ Memory intensive
✗ Sensitive to feature scaling
1.6 NAIVE BAYES
What is it?
Probabilistic classifier based on Bayes' theorem with assumption that features are independent.
Bayes' Theorem:
P(A|B) = P(B|A) × P(A) / P(B)
Where:
• P(A|B) = Posterior probability
• P(B|A) = Likelihood
• P(A) = Prior probability
• P(B) = Evidence
For Classification:
P(Class|Features) ∝ P(Features|Class) × P(Class)
Example: Email Spam Detection
P(Spam|'Click here') = P('Click here'|Spam) × P(Spam) / P('Click here')
• P(Spam) = 0.3 (prior: 30% emails are spam)
• P('Click here'|Spam) = 0.8 (80% spam emails contain this)
• Calculate posterior probability → Classify
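To show how the independence assumption turns into a product of per-feature likelihoods, here is a small sketch. Only P(Spam) = 0.3 and P('click here'|Spam) = 0.8 come from the example; every other probability, and the second feature 'meeting', are assumed values for illustration.

```python
import math

def naive_bayes_log_scores(words, priors, likelihoods):
    """log P(class) + sum of log P(word | class), for each class."""
    scores = {}
    for cls, prior in priors.items():
        log_score = math.log(prior)
        for word in words:
            # Independence assumption: likelihoods simply multiply (logs add).
            log_score += math.log(likelihoods[cls][word])
        scores[cls] = log_score
    return scores

def normalize(log_scores):
    """Turn log scores into posterior probabilities that sum to 1."""
    total = sum(math.exp(s) for s in log_scores.values())
    return {cls: math.exp(s) / total for cls, s in log_scores.items()}

priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {   # assumed values, except 0.8
    "spam": {"click here": 0.8, "meeting": 0.1},
    "ham": {"click here": 0.1, "meeting": 0.4},
}

one_word = normalize(naive_bayes_log_scores(["click here"], priors, likelihoods))
two_words = normalize(naive_bayes_log_scores(["click here", "meeting"], priors, likelihoods))
print(round(one_word["spam"], 3), round(two_words["spam"], 3))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.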
Types:
• Multinomial Naive Bayes: For text/word counts
• Gaussian Naive Bayes: For continuous features
• Bernoulli Naive Bayes: For binary features
Pros & Cons:
✓ Fast training & prediction
✓ Works well with text classification
✓ Low memory requirement
✗ Assumes feature independence (often violated)
✗ Zero frequency problem (an unseen feature value zeroes the whole product; Laplace smoothing mitigates this)
1.7 DECISION TREES (VERY IMPORTANT)
What is it?
Tree-based model that makes decisions by splitting data based on feature values, similar to a
flowchart.
Tree Structure:
• Root Node: Initial split
• Internal Nodes: Decision points
• Leaf Nodes: Final predictions
How it works:
1. Start with all samples at root
2. Find feature and threshold that best splits data
3. Recursively repeat for each subset
4. Stop when pure or max depth reached
Splitting Criteria (Information Gain):
Entropy(S) = -Σ pᵢ × log₂(pᵢ)
where pᵢ = proportion of class i
Information Gain:
IG = Entropy(Parent) - Σ(Entropy(Child) × weight(Child))
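Entropy and information gain are easy to compute by hand for small label lists. The sketch below is illustrative; the label values and the two candidate splits are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["approve", "approve", "deny", "deny"]
perfect_split = [["approve", "approve"], ["deny", "deny"]]
useless_split = [["approve", "deny"], ["approve", "deny"]]

print(information_gain(parent, perfect_split))   # pure children: maximal gain
print(information_gain(parent, useless_split))   # children as mixed as parent: no gain
```

A tree-building algorithm would evaluate this gain for every candidate feature and threshold and keep the split with the highest value.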
Example: Loan Approval Decision Tree
Credit Score > 700?
├── Yes → Income > 50k?
│         ├── Yes → Approve
│         └── No  → Deny
└── No  → Deny Loan
Advantages:
✓ Highly interpretable
✓ Handles non-linear relationships
✓ No feature scaling needed
✓ Works with mixed feature types
Disadvantages:
✗ Prone to overfitting
✗ Unstable (small data changes → big tree changes)
✗ Biased toward high-cardinality features
1.8 RANDOM FOREST (VERY IMPORTANT)
What is it?
Ensemble method that combines multiple decision trees to make better predictions than individual
trees.
How it works:
1. Create multiple bootstrap samples (random samples with replacement) from data
2. Train a decision tree on each sample
3. At each node, consider only random subset of features
4. Final prediction = average (regression) or majority vote (classification)
Why it works better:
• Reduces overfitting through averaging
• Each tree sees different data/features
• Combines many high-variance trees → Lower-variance ensemble
Hyperparameters:
• n_estimators: Number of trees (100-1000 typical)
• max_depth: Maximum depth of each tree
• min_samples_split: Minimum samples to split
• max_features: Features to consider at each split
Feature Importance:
Importance = Σ(gain from splits using feature) / total_gain
Example Prediction:
3 trees vote: Tree1=Approved, Tree2=Approved, Tree3=Denied
→ Final = Approved (majority)
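Two building blocks of the method, bootstrap sampling and vote aggregation, can be sketched without any tree code. The helper names below are hypothetical, and the tree training itself is deliberately left out.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement (step 1 of the algorithm)."""
    return rng.choices(rows, k=len(rows))

def majority_vote(predictions):
    """Combine per-tree class predictions into one final label."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Combine per-tree numeric predictions (regression case)."""
    return sum(predictions) / len(predictions)

rng = random.Random(0)                  # fixed seed for repeatability
rows = list(range(10))
sample = bootstrap_sample(rows, rng)
print(sample)                           # usually contains repeated rows
print(majority_vote(["Approved", "Approved", "Denied"]))   # Approved
```

Because each bootstrap sample repeats some rows and omits others, every tree sees a slightly different dataset, which is what makes averaging their outputs useful.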
Pros & Cons:
✓ Excellent performance on most datasets
✓ Handles non-linear relationships
✓ Feature importance extraction
✓ Robust to outliers
✗ Less interpretable than single trees
✗ Computationally expensive
✗ Memory intensive
1.9 GRADIENT BOOSTING (Basic Idea)
What is it?
Sequential ensemble method where each tree corrects errors of previous trees.
How it differs from Random Forest:
• Random Forest: Parallel trees (independent)
• Gradient Boosting: Sequential trees (dependent)
Algorithm:
1. Fit first tree to data
2. Calculate residuals (errors)
3. Fit new tree to residuals
4. Update predictions = old + v × new tree predictions (v = learning rate)
5. Repeat
Mathematical Idea:
F(x) = F₀(x) + v×T₁(x) + v×T₂(x) + ... + v×Tₘ(x)
v = learning rate (0.01-0.1)
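The residual-fitting loop can be sketched for one input feature using depth-1 trees (stumps). This is a simplified illustration under stated assumptions: squared-error loss, a constant mean as the initial model F₀ instead of a full first tree, and made-up step-shaped data. Library implementations such as XGBoost are considerably more elaborate.

```python
def fit_stump(xs, residuals):
    """Depth-1 regression tree: the threshold split with the lowest squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:               # candidate thresholds
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean

def boost(xs, ys, n_trees=50, v=0.1):
    """Return a predictor F(x) = F0 + v*T1(x) + ... + v*Tm(x)."""
    f0 = sum(ys) / len(ys)                       # initial constant model
    trees, preds = [], [f0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)          # new tree fits the current errors
        trees.append(tree)
        preds = [p + v * tree(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + v * sum(tree(x) for tree in trees)

def mse(model, xs, ys):
    """Mean squared error of `model` on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]    # made-up data
baseline = lambda x: sum(ys) / len(ys)
model = boost(xs, ys)
print(mse(baseline, xs, ys), mse(model, xs, ys))
```

Each round shrinks the training error a little; the small learning rate v is what forces the ensemble to need many trees.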
Key Concept:
Each tree learns from mistakes of ensemble so far
Popular Implementations:
• Gradient Boosting Machines (GBM)
• XGBoost (eXtreme Gradient Boosting)
• LightGBM
• CatBoost
When to use:
• Competitions (Kaggle)
• High-performance requirements
• Complex non-linear patterns
Pros & Cons:
✓ Often best performance
✓ Lower learning rate often improves generalization (at the cost of more trees)
✗ Slower training than Random Forest
✗ More hyperparameters to tune
1.10 SUPPORT VECTOR MACHINE (SVM)
What is it?
Algorithm that finds optimal hyperplane maximizing margin between two classes.
Core Concept: Maximum Margin
Distance between hyperplane and closest points (support vectors) is maximized.
Mathematical Formula:
Decision Boundary: wᵀx + b = 0
Prediction: sign(wᵀx + b)
Where:
• w = weight vector (defines hyperplane orientation)
• x = input features
• b = bias term (shifts hyperplane)
Margin Maximization:
Maximize: 2/||w||
Subject to: yᵢ(wᵀxᵢ + b) ≥ 1
Key Concept: Support Vectors
Points on margin boundaries are support vectors
Only these matter for final model
Handling Non-Linear Data: Kernel Trick
• Linear Kernel: For linearly separable data
• RBF (Radial Basis Function) Kernel: For non-linear data
• Polynomial Kernel: For polynomial separability
Maps data to higher dimension without computing it explicitly
Example: RBF Kernel
K(x, y) = exp(-γ||x - y||²)
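The RBF kernel and the linear decision rule are both one-liners in code. In the sketch below, the value gamma = 0.5 and the example weights are arbitrary assumptions, not tuned settings.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def linear_svm_predict(w, b, x):
    """Linear decision rule sign(w^T x + b); returns +1 or -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

print(rbf_kernel((1.0, 2.0), (1.0, 2.0)))    # identical points: similarity 1.0
print(rbf_kernel((0.0, 0.0), (3.0, 4.0)))    # distant points: close to 0
print(linear_svm_predict(w=(1.0, -1.0), b=0.0, x=(2.0, 0.5)))   # 1
```

The kernel behaves like a similarity score: larger gamma makes the similarity fall off faster with distance, which gives a more flexible but more overfitting-prone boundary.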
When to use SVM:
• Binary classification
• High-dimensional data
• Sparse data
• When training set is small to medium
Pros & Cons:
✓ Effective in high dimensions
✓ Memory efficient (only support vectors matter)
✓ Versatile (different kernel functions)
✗ Slow for large datasets
✗ Hard to interpret
✗ Hyperparameter tuning critical
LEVEL 2: DATA PREPROCESSING
Objective: Prepare raw data for machine learning models. This is crucial - garbage in = garbage
out!
Interview Question: "What will you do before training a model?"
2.1 HANDLING MISSING VALUES
What is it?
Process of dealing with incomplete data (NaN, null values).
Why it matters?
• Most algorithms cannot handle missing values
• Biases results if ignored
• Can indicate data quality issues
Methods to Handle Missing Values:
1. Deletion (Removal):
• Remove rows with missing values
• Use when: <5% data missing, data is redundant
• Pros: Simple; introduces no bias if data is MCAR (Missing Completely At Random)
• Cons: Loss of information
2. Mean/Median Imputation:
• Replace with mean (continuous) or median (robust to outliers)
Missing value = mean(column)
• Use when: Data is MCAR (Missing Completely At Random)
• Pros: Simple, fast
• Cons: Reduces variance, biased estimates
3. Forward Fill / Backward Fill (For Time Series):
Forward Fill: Use previous value
Backward Fill: Use next value
4. Machine Learning Imputation:
• Train model on non-missing values
• Predict missing values
• Better accuracy but computationally expensive
Example:
Missing Age in customer data:
• Option 1: Delete rows (lose 20% data)
• Option 2: Fill with mean age (30 years)
• Option 3: Use KNN to predict based on similar customers
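Mean imputation and forward fill can be sketched on plain lists, with None standing in for a missing value. The ages below are made up for illustration.

```python
def mean_impute(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def forward_fill(values):
    """Replace None with the most recent observed value (time series)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)      # stays None if nothing was observed yet
    return filled

ages = [25, None, 35, 30, None]
print(mean_impute(ages))         # [25, 30.0, 35, 30, 30.0]
print(forward_fill(ages))        # [25, 25, 35, 30, 30]
```

In a real pipeline the imputation value should be computed from the training set only and then reused on the test set, so that no test information leaks into training.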
2.2 ENCODING CATEGORICAL DATA
What is it?
Converting categorical (non-numeric) variables into numeric format.
Why is it needed?
Algorithms work with numbers, not text/categories.
Types:
1. Label Encoding:
Assigns integer to each category.
Color: Red=0, Green=1, Blue=2
• Pros: Simple, low memory
• Cons: Implies order (model might think Blue > Green)
• Use when: Tree-based models, or ordinal categories
2. One-Hot Encoding:
Creates binary column for each category.
Original: Color=[Red, Green, Blue]
One-Hot: Red=[1,0,0], Green=[0,1,0], Blue=[0,0,1]
• Pros: No false ordering
• Cons: Increases dimensions (curse of dimensionality)
• Use when: Linear models (Logistic Regression, SVM)
3. Target Encoding (Mean Encoding):
Replace category with mean target value of that category.
For category C: encoded_value = mean(target | category = C)
Example: Location encoding for house prices
• Downtown mean price: $500k → encode as 500
• Suburb mean price: $300k → encode as 300
4. Frequency Encoding:
Replace with frequency of category
• Red appears 100 times → encode as 100
• Green appears 50 times → encode as 50
Example Comparison:
Original   Label   One-Hot    Target   Frequency
Red        0       [1,0,0]    500k     100
Green      1       [0,1,0]    450k     50
Blue       2       [0,0,1]    300k     30
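All four encodings can be computed in a few lines of plain Python. The sketch below uses a tiny made-up column and hypothetical target values (in $k), so its numbers differ from the comparison table.

```python
from collections import Counter, defaultdict

colors = ["Red", "Green", "Blue", "Red"]
prices = [500, 450, 300, 520]            # hypothetical target values

# Label encoding: one integer per category, in order of first appearance.
label_map = {}
for c in colors:
    label_map.setdefault(c, len(label_map))
label_encoded = [label_map[c] for c in colors]

# One-hot encoding: one binary column per category.
categories = list(label_map)
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

# Frequency encoding: how often each category occurs.
counts = Counter(colors)
frequency_encoded = [counts[c] for c in colors]

# Target encoding: mean target value per category.
targets_by_category = defaultdict(list)
for c, p in zip(colors, prices):
    targets_by_category[c].append(p)
target_encoded = [sum(targets_by_category[c]) / len(targets_by_category[c])
                  for c in colors]

print(label_encoded)        # [0, 1, 2, 0]
print(one_hot)              # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
print(frequency_encoded)    # [2, 1, 1, 2]
print(target_encoded)       # [510.0, 450.0, 300.0, 510.0]
```

Target encoding uses the label itself, so it should be fitted on training data only; otherwise the model sees leaked target information.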
2.3 FEATURE SCALING
What is it?
Transforming numerical features to similar scales.
Why is it important?
• KNN and SVM are distance-based; neural networks train by gradient descent, which is sensitive to feature scale
• Features with larger ranges dominate
• Example: Age (0-100) vs Income (0-1,000,000)
• Income dominates, age becomes irrelevant!
1. Normalization (Min-Max Scaling):
X_scaled = (X - X_min) / (X_max - X_min)
• Range: [0, 1]
• Use when: Need bounded range and the data has no significant outliers (min and max are outlier-sensitive)
Example:
Temperature: [10°C, 30°C]
For 20°C: (20 - 10) / (30 - 10) = 10 / 20 = 0.5
2. Standardization (Z-score Normalization):
X_scaled = (X - mean) / standard_deviation
• Range: unbounded; roughly [-3, 3] for approximately normal data
• Use when: Data is normally distributed
• Less sensitive to outliers than min-max normalization
Example:
Age: mean=40, std=10, value=50
Z-score = (50 - 40) / 10 = 1
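Both scaling formulas can be applied to a whole column at once. A minimal sketch, assuming the column is a plain list with at least two distinct values (otherwise both functions would divide by zero):

```python
def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

print(min_max_scale([10, 20, 30]))     # [0.0, 0.5, 1.0]
z = standardize([30, 40, 50])
print([round(value, 3) for value in z])
```

After standardization the column has mean 0 and standard deviation 1, which puts features such as age and income on a comparable footing.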
Algorithms Requiring Scaling:
✓ KNN (distance-based)
✓ SVM (distance-based)
✓ Neural Networks (gradient descent sensitive)
✓ Linear/Logistic Regression (faster convergence)
Algorithms NOT Requiring Scaling:
✗ Decision Trees (splits on individual features)
✗ Random Forest (ensemble of trees)
✗ Gradient Boosting (tree-based)
2.4 TRAIN-TEST SPLIT
What is it?
Dividing data into training set (model learns) and test set (model evaluation).
Why is it crucial?
• Test accuracy estimates real-world performance
• Training accuracy is biased (model saw same data)
Typical Split:
Training: 70-80%
Testing: 20-30%
(For large datasets: 90/10 acceptable)
Example:
1000 samples → 800 train, 200 test
Stratified Split (For Imbalanced Data):
Maintains class distribution in both sets.
Example: 95% Negative (Not Fraud), 5% Positive (Fraud)
• Normal split might have: Train 96% negative, Test 90% negative
• Stratified split maintains: Train 95% negative, Test 95% negative
Why it matters:
Without stratification: Test set may over- or under-represent the minority class
Measured performance can then look better or worse than it actually is
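A stratified split can be sketched by shuffling and splitting each class separately. The helper name, the 20% test fraction, and the fixed seed are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                         # shuffle within each class
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test

labels = ["fraud"] * 50 + ["ok"] * 950               # 5% positive class
train, test = stratified_split(labels)
fraud_in_test = sum(labels[i] == "fraud" for i in test)
print(len(train), len(test), fraud_in_test)          # 800 200 10
```

The test set ends up with exactly 5% fraud cases, matching the full dataset.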
2.5 CROSS-VALIDATION
What is it?
Splitting data into multiple folds for robust performance estimation.
Problem with Train-Test Split:
• Result depends on random split
• Single test fold might not be representative
• Different splits → Different results
K-Fold Cross-Validation Solution:
1. Divide data into K equal folds (typically K=5)
2. For each fold i:
- Use fold i as test
- Use remaining K-1 folds as train
3. Average results across K iterations
CV_Score = (Score₁ + Score₂ + ... + Score_K) / K
Example (K=5):
Iteration 1: Train on [2,3,4,5], Test on [1]
Iteration 2: Train on [1,3,4,5], Test on [2]
Iteration 3: Train on [1,2,4,5], Test on [3]
Iteration 4: Train on [1,2,3,5], Test on [4]
Iteration 5: Train on [1,2,3,4], Test on [5]
Final Score = Average of 5 scores
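The fold bookkeeping can be sketched as a generator of index lists. This minimal version uses contiguous folds without shuffling, which is an assumption made for readability; in practice the data is usually shuffled first unless it is a time series.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples)
                 if i < start or i >= start + size]
        yield train, test
        start += size

def cross_val_score(scores):
    """Average the per-fold scores."""
    return sum(scores) / len(scores)

for train, test in k_fold_indices(10, k=5):
    print("train:", train, "test:", test)
```

Every sample appears in exactly one test fold, so each one is used for evaluation once and for training K-1 times.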
Advantages:
✓ Better statistical estimate
✓ Uses all data for both training & testing
✓ Less dependent on one particular random split
Types:
• K-Fold (regular)
• Stratified K-Fold (maintains class distribution)
• Leave-One-Out (LOO): K = number of samples (expensive)
• Time Series Split (respects temporal ordering)
LEVEL 3: EVALUATION & METRICS
Objective: Measure model performance accurately.
Key Principle: Never trust a single metric!
3.1 CONFUSION MATRIX
What is it?
Table showing True Positives, False Positives, True Negatives, False Negatives.
                   Predicted Negative     Predicted Positive
Actual Negative    True Negative (TN)     False Positive (FP)
Actual Positive    False Negative (FN)    True Positive (TP)
Definitions:
• TP (True Positive): Correctly predicted positive
• TN (True Negative): Correctly predicted negative
• FP (False Positive): Incorrectly predicted positive (Type I error)
• FN (False Negative): Incorrectly predicted negative (Type II error)
Medical Example:
Disease Test Results (TP, FN are disease cases):
• TP=95: Correctly identified 95 sick people
• FN=5: Missed 5 sick people (dangerous!)
• TN=1000: Correctly identified 1000 healthy people
• FP=50: Incorrectly flagged 50 healthy people
3.2 CLASSIFICATION METRICS
1. Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Overall correctness
• PROBLEM: Misleading for imbalanced data!
• Example: 99% Negative class
Predicting always "No" → 99% accuracy (but useless!)
2. Precision
Precision = TP / (TP + FP)
• "Of all predictions I made positive, how many were correct?"
• False alarms measure
• Email spam detection: Precision = Actual spam emails among predicted spam
• High precision = Few false alarms
3. Recall (Sensitivity)
Recall = TP / (TP + FN)
• "Of all actual positives, how many did I find?"
• Coverage measure
• Medical test: Recall = Detected cases among all sick people
• High recall = Catch most positive cases
4. F1 Score (Harmonic Mean)
F1 = 2 × (Precision × Recall) / (Precision + Recall)
• Balances precision-recall tradeoff
• Use when both errors are costly
5. Specificity
Specificity = TN / (TN + FP)
• "Of all actual negatives, how many did I find?"
• True negative rate
When to use which metric:
• Balanced data, similar costs → Accuracy
• False positives costly (spam filter) → Precision
• False negatives costly (disease test) → Recall
• Both errors costly → F1 Score
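The five formulas can be evaluated together from the four confusion-matrix counts. The sketch below plugs in the counts from the medical example in section 3.1; the function name is an illustrative assumption.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, F1 and specificity from counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }

# Counts from the medical example: TP=95, FN=5, TN=1000, FP=50.
metrics = classification_metrics(tp=95, tn=1000, fp=50, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy comes out near 0.952 while precision is only about 0.655, which shows why a single metric should not be trusted on its own.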
3.3 ROC CURVE & AUC SCORE
What is ROC Curve?
Plot showing Recall (True Positive Rate) vs False Positive Rate at different thresholds.
True Positive Rate (TPR) = Recall:
TPR = TP / (TP + FN)
False Positive Rate (FPR):
FPR = FP / (FP + TN)
How Threshold Affects:
• Lower threshold → More predictions positive
TPR increases (catch more positives)
FPR increases (more false alarms)
• Higher threshold → Fewer predictions positive
TPR decreases (miss more positives)
FPR decreases (fewer false alarms)
Interpreting ROC Curve:
• Perfect classifier: Curve goes up, then right (top-left corner)
• Random classifier: Diagonal line (45 degrees)
• Better models: Curve bows toward top-left
AUC (Area Under Curve):
AUC ∈ [0, 1]
AUC = 0.5: Random (no discrimination)
AUC = 1.0: Perfect
AUC = 0.7-0.8: Good
Interpretation:
AUC = Probability that model ranks random positive higher than random negative
Fraud Detection Example:
AUC = 0.9 means: 90% of the time, model ranks fraudulent transaction higher
than legitimate one
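The ranking interpretation gives a direct way to compute AUC on small data: count the positive-negative pairs that are ordered correctly. This brute-force sketch is meant for intuition only; the labels and scores are made up.

```python
def auc_by_ranking(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win. Cost is O(P * N), fine for small examples.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]             # made-up model scores
print(auc_by_ranking(labels, scores))     # 0.75
```

Three of the four positive-negative pairs are ranked correctly, so the AUC is 0.75. No threshold appears anywhere in the computation.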
When to use:
• For ranking quality
• Different thresholds
• Imbalanced datasets
• When you don't know optimal threshold
LEVEL 4: PROBABILITY & STATISTICS (BASICS ONLY)
Objective: Understand probabilistic thinking for ML.
Scope: Intuition only, not deep mathematics.
4.1 DESCRIPTIVE STATISTICS
Mean (Average):
µ = Σx / n
Example: [1, 2, 3, 4, 5] → µ = 15/5 = 3
Variance (Spread from mean):
σ² = Σ(x - µ)² / n
High variance = Data spread out
Low variance = Data clustered
Standard Deviation (Square root of variance):
σ = √(σ²)
Same units as data (more interpretable)
Example: If σ=2 and the data is roughly normal, about 68% of values lie within ±2 units of the mean
68-95-99.7 Rule (Normal Distribution):
68% within 1σ of mean
95% within 2σ of mean
99.7% within 3σ of mean
Example:
Heights: µ=170cm, σ=5cm
• 68% of people: 165-175 cm
• 95% of people: 160-180 cm
• Person with height 185cm is very rare (3σ above the mean)
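The formulas for mean, variance, and standard deviation translate directly into code. A minimal sketch using the population formulas (division by n), as in the definitions above:

```python
import math

def describe(values):
    """Return (mean, population variance, population standard deviation)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    return mean, variance, math.sqrt(variance)

mean, variance, std = describe([1, 2, 3, 4, 5])
print(mean, variance, round(std, 3))      # 3.0 2.0 1.414

# z-score of a 185 cm height when mu = 170 cm and sigma = 5 cm.
z = (185 - 170) / 5
print(z)                                  # 3.0
```

A z-score of 3.0 places the 185 cm height three standard deviations above the mean, outside the range that covers 99.7% of values.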
4.2 PROBABILITY BASICS
Probability Definition:
P(A) = Number of favorable outcomes / Total outcomes
Range: [0, 1]
• P(A) = 0: Impossible
• P(A) = 0.5: Equally likely
• P(A) = 1: Certain
Example: Fair Coin Flip
P(Heads) = 1/2 = 0.5
Independent Events Multiplication Rule:
P(A and B) = P(A) × P(B)
Example: Two coin flips
P(both heads) = 0.5 × 0.5 = 0.25
Mutually Exclusive Events Addition Rule:
P(A or B) = P(A) + P(B)
Example: Rolling die (1 or 2)
P(1 or 2) = 1/6 + 1/6 = 2/6 ≈ 0.33
4.3 CONDITIONAL PROBABILITY
What is it?
Probability of event given another event already happened.
Mathematical Definition:
P(A|B) = P(A and B) / P(B)
• P(A|B) is read as: "Probability of A given B"
• Defined only when P(B) > 0
Example: Medical Test
• Disease prevalence: P(Disease) = 0.01 (1%)
• Test sensitivity: P(Positive|Disease) = 0.99
• False positive rate: P(Positive|No Disease) = 0.05
Question: If test positive, what's probability of disease?
Answer: Use Bayes Theorem!
4.4 BAYES' THEOREM (VERY IMPORTANT)
What is it?
Mathematical framework for updating beliefs based on new evidence.
Formula:
P(A|B) = P(B|A) × P(A) / P(B)
Components:
• P(A|B) = Posterior (what we want, updated belief)
• P(B|A) = Likelihood (evidence given hypothesis)
• P(A) = Prior (initial belief before evidence)
• P(B) = Evidence (probability of observing data)
Intuition:
Posterior = Likelihood × Prior / Evidence
(Updated belief = How likely evidence × Initial belief)
Medical Test Example:
• Prior: P(Disease) = 0.01 (1% have disease)
• Likelihood: P(Positive|Disease) = 0.99 (test catches 99% of sick)
• False positive: P(Positive|Healthy) = 0.05
Calculate P(Positive):
P(Positive) = P(Positive|Disease)×P(Disease) +
P(Positive|Healthy)×P(Healthy)
= 0.99×0.01 + 0.05×0.99
= 0.0099 + 0.0495 = 0.0594
Calculate Posterior:
P(Disease|Positive) = 0.99 × 0.01 / 0.0594
= 0.0099 / 0.0594
= 0.167 (16.7%)
Interpretation:
Despite positive test, only 16.7% chance of actual disease!
Reason: Disease is rare, so most positives are false alarms
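The calculation above can be wrapped in a small function to see how strongly the prior drives the result. The function and parameter names are illustrative, and the 20% prevalence in the second call is an assumed value for comparison.

```python
def bayes_posterior(prior, sensitivity, false_positive_rate):
    """P(Disease | Positive) from the prior and the two conditional rates."""
    # Evidence P(Positive) via the law of total probability.
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

posterior = bayes_posterior(prior=0.01, sensitivity=0.99, false_positive_rate=0.05)
print(f"{posterior:.3f}")                 # 0.167

# Same test, but an assumed prevalence of 20% instead of 1%.
common = bayes_posterior(prior=0.20, sensitivity=0.99, false_positive_rate=0.05)
print(f"{common:.3f}")                    # 0.832
```

With identical test quality, the posterior rises from about 17% to about 83% purely because the disease is more common.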
Why it matters in ML:
• Foundation of Bayesian inference
• Naive Bayes classifier uses this
• Uncertainty quantification
4.5 NORMAL DISTRIBUTION
What is it?
Symmetric bell-shaped probability distribution.
Mathematical Formula:
P(x) = (1 / (σ√(2π))) × e^(-(x-µ)²/(2σ²))
Key Properties:
• Symmetric around mean (µ)
• Peak at mean
• Spread controlled by standard deviation (σ)
• Entire area under curve = 1 (total probability)
Why it's important:
• Many real-world phenomena approximately normal
• Central Limit Theorem: Sum of many independent random variables ≈ normal
• Foundation for statistical tests
• Gaussian Naive Bayes assumes normal distribution
Examples of Approximately Normal Distributions:
• Heights in population
• Test scores
• Measurement errors
• IQ scores
Standard Normal Distribution:
• µ = 0, σ = 1
• Used as reference (z-scores)
Real-World Application:
Factory produces light bulbs (µ=1000hrs, σ=50hrs)
• What % last > 1100 hours? (2σ above mean)
• Answer: ~2.5% (using 68-95-99.7 rule; the exact value is about 2.3%)
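The 68-95-99.7 rule is a rounded approximation. The exact tail probability for the light-bulb example can be computed from the normal CDF, written here with the standard library's error function; the helper name is an illustrative assumption.

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Light-bulb example: mu = 1000 h, sigma = 50 h, threshold 1100 h (2 sigma).
p_over_1100 = 1.0 - normal_cdf(1100, mu=1000, sigma=50)
print(f"{p_over_1100:.4f}")               # 0.0228
```

The rule of thumb puts 5% of values outside ±2σ, so 2.5% in each tail; the exact upper tail is about 2.28%.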
LEVEL 5: MODEL IMPROVEMENT TECHNIQUES
Objective: Make models perform better.
5.1 BIAS VS VARIANCE
What is it?
Two sources of error that limit model performance.
Bias:
Error from oversimplified assumptions (underfitting)
High Bias = Model too simple
• Cannot capture true relationship
• Example: Using linear model for non-linear data
• Result: Poor training AND test performance
Variance:
Error from being too sensitive to training data (overfitting)
High Variance = Model too complex
• Memorizes training data
• Example: Decision tree with no depth limit
• Result: Good training, poor test performance
Bias-Variance Tradeoff:
Total Error = Bias² + Variance + Irreducible Error
Visualizing Tradeoff:
Model Complexity (→)
• Very simple: High bias, low variance (underfitting)
• Medium: Balanced (sweet spot)
• Very complex: Low bias, high variance (overfitting)
How to detect Bias vs Variance:
• High bias: Training error high AND test error high
→ Solution: Use a more complex model or add features
• High variance: Training error low BUT test error high
→ Solution: Regularization, more data, simpler model
5.2 HYPERPARAMETER TUNING
What is it?
Choosing model settings that are fixed before training (hyperparameters are not learned from data).
Examples of Hyperparameters:
• Learning rate (α) in gradient descent
• Number of trees (n_estimators) in Random Forest
• Tree depth (max_depth)
• K in K-Nearest Neighbors
• Regularization strength (λ) in linear regression
• Batch size in neural networks
Why is it critical?
Same algorithm with different hyperparameters = different performance
Example: Decision Tree
max_depth = 3 → Simple model (high bias)
max_depth = 20 → Complex model (high variance)
max_depth = 8 → Balanced (often near-optimal; the best value depends on the data)
5.3 GRID SEARCH
What is it?
Exhaustive search over defined hyperparameter ranges.
Algorithm:
1. Define ranges for each hyperparameter
2. Create all combinations
3. Train model for each combination
4. Evaluate on validation set
5. Return best combination
Example:
Hyperparameters:
max_depth ∈ [3, 5, 7, 10]
min_samples_split ∈ [2, 5, 10]
Total combinations: 4 × 3 = 12
Train 12 models
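The five steps map onto a loop over itertools.product. In the sketch below, toy_score is a hypothetical stand-in for "train a model and return its validation score", so the winning combination is an artifact of that made-up function.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Score every combination in the grid; return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

param_grid = {"max_depth": [3, 5, 7, 10], "min_samples_split": [2, 5, 10]}

def toy_score(params):
    """Hypothetical validation score that peaks at max_depth=7, min_samples_split=5."""
    return -abs(params["max_depth"] - 7) - 0.1 * abs(params["min_samples_split"] - 5)

n_combinations = len(list(product(*param_grid.values())))
print(n_combinations)                     # 12
best_params, best_score = grid_search(param_grid, toy_score)
print(best_params)
```

Adding a third hyperparameter with four values would raise the count from 12 to 48, which is the combinatorial growth that makes grid search expensive.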
Pros & Cons:
✓ Systematic, guaranteed to find the best combination within the grid
✓ Easy to implement
✗ Computationally expensive (exponential combinations)
✗ Slow for many hyperparameters
5.4 RANDOM SEARCH
What is it?
Random sampling from hyperparameter space (faster than Grid).
Algorithm:
1. Define ranges for hyperparameters
2. Sample N random combinations
3. Train model for each
4. Return best
Comparison with Grid Search:
Grid Search: 4 × 4 × 4 = 64 combinations
Random Search: Sample 20 random combinations
When Random is Better:
Some hyperparameters might not matter much
Grid would waste time on non-important ones
Random finds important combinations with less computation
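Random search replaces the exhaustive loop with a fixed budget of random draws. In this sketch, the search space, the seed, and toy_score (a hypothetical score in which only max_depth matters) are all assumptions for illustration.

```python
import random

def random_search(param_space, score_fn, n_samples=20, seed=0):
    """Score n_samples random combinations; return (best_params, best_score)."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_samples):
        # Draw one value per hyperparameter, independently.
        params = {name: rng.choice(values)
                  for name, values in param_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Three hyperparameters with four values each: 4 x 4 x 4 = 64 grid points.
param_space = {
    "max_depth": [3, 5, 7, 10],
    "min_samples_split": [2, 5, 10, 20],
    "n_estimators": [100, 200, 500, 1000],
}

def toy_score(params):
    """Hypothetical validation score that depends only on max_depth."""
    return -abs(params["max_depth"] - 7)

best_params, best_score = random_search(param_space, toy_score)
print(best_params, best_score)
```

Because each draw picks a fresh value for every hyperparameter, 20 draws test the important one many times, whereas a grid spends most of its 64 runs varying settings that do not matter.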
Pros & Cons:
✓ Faster than grid search
✓ Handles many hyperparameters better
✗ Not guaranteed to find best
✗ Less predictable coverage
5.5 FEATURE SELECTION
What is it?
Selecting most important features and removing irrelevant ones.
Why is it important?
• Improves model interpretability
• Reduces overfitting
• Faster training
• Reduces memory requirements
• Mitigates curse of dimensionality
1. Filter Methods (Statistical):
Rank features by statistical properties independently.
Correlation with target: r ∈ [-1, 1]
Select features with |r| > threshold
Pros: Fast, scalable
Cons: Ignores feature interactions
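A correlation filter can be sketched in a few lines. The data, the 0.5 threshold, and the feature names below are made up; color_code is constructed to have zero correlation with price.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def select_features(features, target, threshold=0.5):
    """Keep the names of features whose |r| with the target exceeds threshold."""
    return [name for name, values in features.items()
            if abs(pearson_r(values, target)) > threshold]

# Made-up data: price tracks area closely but not the color code.
price = [200, 300, 400, 500]
features = {"area": [1000, 1500, 2100, 2500], "color_code": [1, 3, 3, 1]}
print(select_features(features, price))   # ['area']
```

Each feature is scored on its own, which is why this method is fast but blind to features that only matter in combination.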
2. Wrapper Methods (Model-Based):
Train model with different feature subsets.
Exhaustive/Greedy optimization of feature subsets
• Forward Selection: Start with 0 features, add best
• Backward Elimination: Start with all, remove worst
• Recursive Feature Elimination (RFE)
Pros: Considers interactions
Cons: Computationally expensive
3. Embedded Methods (During Training):
Features selected as part of model training
Coefficient magnitude |β| in Linear Regression
Feature importance in Tree-based models
Examples:
• Lasso Regression (L1 regularization)
• Tree feature importance (Random Forest)
• Permutation importance
Example: Predicting House Price
Available features:
• Area (important) → Keep
• Number of bedrooms (important) → Keep
• Owner's favorite color (useless) → Remove
• Paint brand (barely matters) → Remove
Result: 2 features instead of 100+ → Simpler, often better model
QUICK REFERENCE GUIDE
Problem Type      Algorithm         When to Use
Regression        Linear Reg.       Simple, interpretable relationships
Regression        Polynomial Reg.   Non-linear patterns
Classification    Logistic Reg.     Binary, large datasets
Classification    KNN               Small-medium data, non-linear
Classification    Naive Bayes       Text classification, fast needed
Classification    Decision Tree     Interpretability important
Classification    Random Forest     Best accuracy, medium complexity
Classification    SVM               High dimensions, large margin needed
Classification    Gradient Boost    Competition, highest accuracy
When to Use Which Metric
• Balanced Data, Similar Consequences → Accuracy
• False Positives Costly → Precision
• False Negatives Costly → Recall
• Balance Both → F1 Score
• Different Thresholds → ROC/AUC
Data Preprocessing Checklist
☐ Handle missing values
☐ Encode categorical features
☐ Scale numerical features
☐ Check for outliers
☐ Create train-test split
☐ Use K-fold cross-validation
☐ Feature selection if many features
Model Improvement Workflow
1. Train baseline model
2. Check bias vs variance
3. If High Bias: Use a more complex model
4. If High Variance: Regularize, add data, simplify
5. Tune hyperparameters (Grid/Random Search)
6. Feature selection/engineering
7. Ensemble methods (combine models)
KEY FORMULAS SUMMARY
Linear Regression
y = β₀ + β₁x + ε
MSE = (1/n) Σ(yᵢ - ŷᵢ)²
Logistic Regression
P(y=1|x) = 1 / (1 + e^(-z))
Log Loss = -(1/n) × Σ[y×log(ŷ) + (1-y)×log(1-ŷ)]
Classification Metrics
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Feature Scaling
Normalization: X_scaled = (X - min) / (max - min)
Standardization: X_scaled = (X - mean) / std
Statistics
Mean: µ = Σx / n
Variance: σ² = Σ(x - µ)² / n
Std Dev: σ = √(σ²)
Probability
P(A and B) = P(A) × P(B) [if independent]
P(A|B) = P(A and B) / P(B)
Bayes: P(A|B) = P(B|A) × P(A) / P(B)
Cross-Validation
CV_Score = (Score₁ + Score₂ + ... + Score_K) / K