ML AI Complete Guide

The document serves as a comprehensive guide to AI and machine learning, covering fundamental concepts, data preprocessing techniques, evaluation metrics, and various algorithms including regression, classification, and ensemble methods. It details the mathematical foundations, applications, and pros and cons of each algorithm, as well as essential preprocessing steps like handling missing values, encoding categorical data, and feature scaling. The guide is structured in levels, progressing from basic concepts to more advanced techniques, making it suitable for learners at different stages.

Uploaded by

shaktithatinati
Copyright
© All Rights Reserved

COMPREHENSIVE AI/ML GUIDE

Level 1: Machine Learning Algorithms

Level 2: Data Preprocessing

Level 3: Evaluation & Metrics

Level 4: Probability & Statistics

Level 5: Model Improvement Techniques

Generated on: April 06, 2026


TABLE OF CONTENTS

Level 1: Machine Learning Algorithms

• Regression Models

• Classification Algorithms

• Tree-Based Models

• Support Vector Machines

Level 2: Data Preprocessing

• Missing Values

• Encoding Categorical Data

• Feature Scaling

• Train-Test Split & Cross-Validation

Level 3: Evaluation & Metrics

• Confusion Matrix

• Classification Metrics

• ROC Curve & AUC

Level 4: Probability & Statistics

• Descriptive Statistics

• Probability Concepts

• Distributions

Level 5: Model Improvement

• Bias vs Variance

• Hyperparameter Tuning

• Feature Selection
LEVEL 1: MACHINE LEARNING ALGORITHMS

Objective: Understand various algorithms and when to use them.

1.1 LINEAR REGRESSION

What is it?
Linear Regression is a supervised learning algorithm that models the relationship between one
independent variable (X) and a dependent variable (Y) by fitting a straight line.

Why is it?
It's the simplest and most interpretable model for understanding the relationship between variables. It's
the foundation for understanding more complex algorithms.

Where is it used?
House price prediction, stock price forecasting, sales forecasting, and any continuous variable
prediction.

Mathematical Formula:

y = mx + c
OR
y = β₀ + β₁x + ε

Where:
• y = dependent variable (output)
• x = independent variable (input)
• m (or β₁) = slope (rate of change)
• c (or β₀) = y-intercept
• ε = error term

Cost Function (Mean Squared Error):

MSE = (1/n) Σ(yᵢ - ŷᵢ)²

Algorithm:
1. Initialize parameters m and c to 0
2. Calculate predictions: ŷ = mx + c
3. Calculate error: MSE
4. Update m and c using gradient descent
5. Repeat until convergence
Gradient Descent Update:

m = m - α × (∂MSE/∂m)
c = c - α × (∂MSE/∂c)

α = learning rate (typically 0.01)
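
The algorithm steps and update rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation. It assumes area is expressed in thousands of sq ft and price in thousands of dollars, so that a fixed learning rate converges. The function name fit_line, the learning rate of 0.05 and the 5000 epochs are choices made for this tiny dataset, not values from the guide.

```python
# Gradient descent for simple linear regression (y = m*x + c).
# Assumption: area is in thousands of sq ft and price in thousands of dollars,
# so that a fixed learning rate converges without extra feature scaling.
areas = [2.0, 3.0]        # 2000 and 3000 sq ft
prices = [300.0, 450.0]   # $300,000 and $450,000


def fit_line(xs, ys, alpha=0.05, epochs=5000):
    """Minimize MSE with batch gradient descent; returns (m, c)."""
    m, c = 0.0, 0.0                      # step 1: initialize parameters
    n = len(xs)
    for _ in range(epochs):
        preds = [m * x + c for x in xs]  # step 2: predictions
        errors = [y - p for y, p in zip(ys, preds)]
        # Partial derivatives of MSE = (1/n) * sum((y - y_hat)^2)
        grad_m = (-2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_c = (-2.0 / n) * sum(errors)
        m -= alpha * grad_m              # step 4: update both parameters
        c -= alpha * grad_c
    return m, c


m, c = fit_line(areas, prices)
print(round(m, 3), round(c, 3))
```

With these units the fitted slope approaches 150 (thousand dollars per thousand sq ft) and the intercept approaches 0. On raw, unscaled areas the same learning rate would diverge, which is one reason feature scaling matters for gradient descent.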

Example:
Predicting house prices based on area in sq ft:
• Area = 2000 sq ft → Price = $300,000
• Area = 3000 sq ft → Price = $450,000
• Fitted line: Price = 150×Area + 0

Type:
• Simple Linear Regression (1 independent variable)

Limitation:
• Cannot handle multiple features directly

1.2 MULTIPLE LINEAR REGRESSION

What is it?
Extension of Linear Regression with multiple independent variables affecting one dependent
variable.

Formula:

y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε

Example Prediction:
House Price = 50,000 + 150×(Area) + 5,000×(Bedrooms) + 3,000×(Age)

When to use:
• When multiple factors affect the outcome
• More realistic real-world scenarios
• Often better predictions than simple regression, when the added features are informative

1.3 POLYNOMIAL REGRESSION

What is it?
Extends linear regression by fitting a polynomial curve instead of a straight line.

Formula:

y = β₀ + β₁x + β₂x² + β₃x³ + ... + βₙxⁿ + ε

Degree of Polynomial:
• Degree 2 (Quadratic): y = β₀ + β₁x + β₂x²
• Degree 3 (Cubic): y = β₀ + β₁x + β₂x² + β₃x³

Example:
Modeling growth that accelerates (exponential-like):
• Stock portfolio value over years (non-linear growth)
• Acceleration in physics

Pros & Cons:


✓ Captures non-linear relationships
✗ Risk of overfitting with high degree
✗ Computationally more expensive

1.4 LOGISTIC REGRESSION (VERY IMPORTANT)

What is it?
Classification algorithm that predicts probability of binary outcome (0 or 1, True or False).

Why is it important?
• Foundation of neural networks
• Used in credit card fraud detection
• Email spam classification
• Medical diagnosis

Key Concept:
Uses sigmoid function to convert linear output to probability.

Sigmoid(z) = 1 / (1 + e^(-z))
where z = β₀ + β₁x

Sigmoid Properties:
• Output always between 0 and 1
• z = 0 → sigmoid = 0.5
• z → ∞ → sigmoid → 1
• z → -∞ → sigmoid → 0

Decision Boundary:
If probability > 0.5 → Predict Class 1
If probability ≤ 0.5 → Predict Class 0
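
The sigmoid and the 0.5 decision rule can be sketched as follows. The coefficients passed to predict_class are hypothetical, chosen only to show the mechanics, and the function names are illustrative.

```python
import math


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(x, beta0, beta1, threshold=0.5):
    """Return (probability, predicted class) for a single feature x."""
    p = sigmoid(beta0 + beta1 * x)
    return p, 1 if p > threshold else 0


# Hypothetical coefficients, chosen only for illustration.
prob, label = predict_class(x=2.0, beta0=-1.0, beta1=1.5)
```

Here z = -1.0 + 1.5 × 2.0 = 2.0, so the probability is about 0.88 and the predicted class is 1. For very large negative z, math.exp can overflow; a numerically careful version would handle that case separately.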

Cost Function (Log Loss / Binary Cross-Entropy):

J(β) = -(1/n) × Σ[yᵢ×log(ŷᵢ) + (1-yᵢ)×log(1-ŷᵢ)]
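
A small sketch of the log loss formula, showing how confident wrong predictions are penalized far more than confident correct ones. The labels and probabilities are made up, and the clipping constant eps is an assumption added to avoid log(0).

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy averaged over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Hypothetical labels and predicted probabilities.
confident_right = log_loss([1, 0], [0.9, 0.1])
confident_wrong = log_loss([1, 0], [0.1, 0.9])
```

The first call gives roughly 0.105 and the second roughly 2.303, so the loss grows sharply as the model becomes confidently wrong.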

Example:
Email Spam Classification:
• Input: 50 features (word frequency, sender info)
• Output: Probability of being spam
• If P(spam) > 0.5 → Mark as spam

Limitations:
• Cannot handle non-linear decision boundaries well
• Works best with linearly separable data

1.5 K-NEAREST NEIGHBORS (KNN)


What is it?
Non-parametric algorithm that classifies based on K nearest data points.

Algorithm:
1. Choose K (number of neighbors)
2. Calculate distance to all training points
3. Find K nearest points
4. Classification: majority vote among K neighbors

Distance Metrics:

Euclidean: d = √[(x₂-x₁)² + (y₂-y₁)²]

Manhattan: d = |x₂-x₁| + |y₂-y₁|

Example with K=3:


Predicting iris flower type based on petal length & width:
• Find 3 nearest flowers
• If 2 are Iris Setosa, 1 is Versicolor → Predict Setosa
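
The four algorithm steps can be sketched with Euclidean distance and a majority vote. The measurements below are made up for illustration and are not taken from the real iris dataset; knn_predict is an illustrative name.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = [
        (math.dist(point, query), label)          # Euclidean distance
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up (petal length, petal width) measurements in cm, for illustration only.
points = [(1.4, 0.2), (1.5, 0.2), (4.5, 1.5), (4.7, 1.4), (1.3, 0.3)]
labels = ["setosa", "setosa", "versicolor", "versicolor", "setosa"]

prediction = knn_predict(points, labels, query=(1.6, 0.3), k=3)
```

Every prediction scans the whole training set, which is why KNN has no training phase but is slow at prediction time.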

Choosing K:
• K too small → Noise sensitive
• K too large → Over-smoothing
• Rule of thumb: K = √(n) where n = training samples

Pros & Cons:


✓ Simple to understand
✓ No training phase
✗ Slow during prediction (each query is compared against all n training points)
✗ Memory intensive
✗ Sensitive to feature scaling

1.6 NAIVE BAYES

What is it?
Probabilistic classifier based on Bayes' theorem with assumption that features are independent.

Bayes' Theorem:

P(A|B) = P(B|A) × P(A) / P(B)


Where:
• P(A|B) = Posterior probability
• P(B|A) = Likelihood
• P(A) = Prior probability
• P(B) = Evidence

For Classification:

P(Class|Features) ∝ P(Features|Class) × P(Class)

Example: Email Spam Detection


P(Spam|'Click here') = P('Click here'|Spam) × P(Spam) / P('Click here')
• P(Spam) = 0.3 (prior: 30% emails are spam)
• P('Click here'|Spam) = 0.8 (80% spam emails contain this)
• Calculate posterior probability → Classify
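
The "naive" part of the classifier is that per-word likelihoods are simply multiplied together. A minimal sketch, where the prior 0.3 and the likelihood 0.8 come from the example above and every other number (including the second word, 'meeting') is an assumed value for illustration:

```python
def naive_bayes_posterior(words, priors, likelihoods):
    """Return P(class | words), treating words as independent given the class."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for word in words:
            score *= likelihoods[cls][word]   # naive independence assumption
        scores[cls] = score
    evidence = sum(scores.values())           # P(words), the normalizer
    return {cls: score / evidence for cls, score in scores.items()}


priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {
    "spam": {"click here": 0.8, "meeting": 0.05},   # 0.8 is from the example above
    "ham": {"click here": 0.1, "meeting": 0.40},    # assumed values
}

one_word = naive_bayes_posterior(["click here"], priors, likelihoods)
two_words = naive_bayes_posterior(["click here", "meeting"], priors, likelihoods)
```

Under these assumed numbers, 'click here' alone gives P(spam) of about 0.77, while adding 'meeting' pulls it down to 0.30.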

Types:
• Multinomial Naive Bayes: For text/word counts
• Gaussian Naive Bayes: For continuous features
• Bernoulli Naive Bayes: For binary features

Pros & Cons:


✓ Fast training & prediction
✓ Works well with text classification
✓ Low memory requirement
✗ Assumes feature independence (often violated)
✗ Zero frequency problem

1.7 DECISION TREES (VERY IMPORTANT)

What is it?
Tree-based model that makes decisions by splitting data based on feature values, similar to a
flowchart.

Tree Structure:
• Root Node: Initial split
• Internal Nodes: Decision points
• Leaf Nodes: Final predictions

How it works:
1. Start with all samples at root
2. Find feature and threshold that best splits data
3. Recursively repeat for each subset
4. Stop when pure or max depth reached

Splitting Criteria (Information Gain):

Entropy(S) = -Σ pᵢ × log₂(pᵢ)
where pᵢ = proportion of class i

Information Gain:

IG = Entropy(Parent) - Σ(Entropy(Child) × weight(Child))
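
Entropy and information gain translate directly into code. The loan outcomes below are hypothetical; they compare a split that separates the classes perfectly with one that barely helps.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical loan outcomes before and after a candidate split.
parent = ["approve"] * 5 + ["deny"] * 5
perfect_split = [["approve"] * 5, ["deny"] * 5]
weak_split = [["approve"] * 3 + ["deny"] * 2, ["approve"] * 2 + ["deny"] * 3]
```

The 50/50 parent has entropy 1 bit. The perfect split recovers all of it (gain of 1), while the weak split gains only about 0.03, so a tree builder would prefer the first.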

Example: Loan Approval Decision Tree


              Credit Score > 700?
                 /           \
               Yes            No
               /               \
       Income > 50k?        Deny Loan
          /      \
        Yes       No
        /          \
    Approve       Deny

Advantages:
✓ Highly interpretable
✓ Handles non-linear relationships
✓ No feature scaling needed
✓ Works with mixed feature types
Disadvantages:
✗ Prone to overfitting
✗ Unstable (small data changes → big tree changes)
✗ Biased toward high-cardinality features

1.8 RANDOM FOREST (VERY IMPORTANT)

What is it?
Ensemble method that combines multiple decision trees to make better predictions than individual
trees.

How it works:
1. Create multiple bootstrap samples (random samples with replacement) from data
2. Train a decision tree on each sample
3. At each node, consider only random subset of features
4. Final prediction = average (regression) or majority vote (classification)

Why it works better:


• Reduces overfitting through averaging
• Each tree sees different data/features
• Combines many decorrelated trees → Stronger, more stable learner

Hyperparameters:
• n_estimators: Number of trees (100-1000 typical)
• max_depth: Maximum depth of each tree
• min_samples_split: Minimum samples to split
• max_features: Features to consider at each split

Feature Importance:

Importance = Σ(gain from splits using feature) / total_gain

Example Prediction:
3 trees vote: Tree1=Approved, Tree2=Approved, Tree3=Denied
→ Final = Approved (majority)
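
Two building blocks of the method, bootstrap sampling and majority voting, can be sketched without any tree code. The helper names are illustrative; a full Random Forest would also train a tree on each sample and restrict the features considered at each split.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the most common prediction among the trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)                      # fixed seed for repeatability
sample = bootstrap_sample(list(range(10)), rng)

final = majority_vote(["Approved", "Approved", "Denied"])
```

Because sampling is with replacement, a bootstrap sample usually repeats some rows and omits others, which is what makes each tree see slightly different data.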

Pros & Cons:


✓ Excellent performance on most datasets
✓ Handles non-linear relationships
✓ Feature importance extraction
✓ Robust to outliers
✗ Less interpretable than single trees
✗ Computationally expensive
✗ Memory intensive

1.9 GRADIENT BOOSTING (Basic Idea)

What is it?
Sequential ensemble method where each tree corrects errors of previous trees.

How it differs from Random Forest:


• Random Forest: Parallel trees (independent)
• Gradient Boosting: Sequential trees (dependent)

Algorithm:
1. Fit first tree to data
2. Calculate residuals (errors)
3. Fit new tree to residuals
4. Update predictions = old + new tree predictions
5. Repeat

Mathematical Idea:

F(x) = F₀(x) + v×T₁(x) + v×T₂(x) + ... + v×Tₘ(x)


v = learning rate (0.01-0.1)
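
The residual-fitting loop can be sketched for squared-error regression on one feature, using two-leaf trees (stumps). This is a simplified illustration under stated assumptions: the initial model F₀ is the mean of y rather than a fitted tree, the data is made up, and the names fit_stump and boost are hypothetical. Real libraries such as XGBoost add many refinements.

```python
def fit_stump(xs, residuals):
    """Find the threshold on x that minimizes squared error of a two-leaf tree.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv


def boost(xs, ys, n_trees=50, v=0.1):
    """Return the training-set predictions after each boosting round."""
    preds = [sum(ys) / len(ys)] * len(xs)          # F0: constant prediction
    history = [list(preds)]
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)            # fit new tree to residuals
        preds = [p + v * tree(x) for p, x in zip(preds, xs)]
        history.append(list(preds))
    return history


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Made-up 1-D data with a non-linear pattern.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 3.9, 4.1, 9.0, 9.2]
history = boost(xs, ys)
```

Comparing mse(ys, history[0]) with mse(ys, history[-1]) shows the training error shrinking as trees are added. A smaller v makes each step more cautious, so more trees are needed to reach the same fit.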

Key Concept:
Each tree learns from mistakes of ensemble so far

Popular Implementations:
• Gradient Boosting Machines (GBM)
• XGBoost (eXtreme Gradient Boosting)
• LightGBM
• CatBoost

When to use:
• Competitions (Kaggle)
• High-performance requirements
• Complex non-linear patterns

Pros & Cons:


✓ Often best performance
✓ Lower learning rate often generalizes better (but needs more trees)
✗ Slower training than Random Forest
✗ More hyperparameters to tune
1.10 SUPPORT VECTOR MACHINE (SVM)

What is it?
Algorithm that finds optimal hyperplane maximizing margin between two classes.

Core Concept: Maximum Margin


Distance between hyperplane and closest points (support vectors) is maximized.

Mathematical Formula:

Decision Boundary: wᵀx + b = 0

Prediction: sign(wᵀx + b)

Where:
• w = weight vector (defines hyperplane orientation)
• x = input features
• b = bias term (shifts hyperplane)

Margin Maximization:

Maximize: 2/||w||
Subject to: yᵢ(wᵀxᵢ + b) ≥ 1 for every training point i

Key Concept: Support Vectors


Points on margin boundaries are support vectors
Only these matter for final model

Handling Non-Linear Data: Kernel Trick


• Linear Kernel: For linearly separable data
• RBF (Radial Basis Function) Kernel: For non-linear data
• Polynomial Kernel: For polynomial separability

Maps data to higher dimension without computing it explicitly

Example: RBF Kernel

K(x, y) = exp(-γ||x - y||²)
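
The RBF kernel is a similarity score that decays with squared distance. A short sketch, where gamma=0.5 and the sample points are arbitrary choices for illustration:

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


same = rbf_kernel([1.0, 2.0], [1.0, 2.0])      # identical points
near = rbf_kernel([1.0, 2.0], [1.5, 2.0])
far = rbf_kernel([1.0, 2.0], [4.0, 6.0])
```

Identical points score exactly 1, nearby points score close to 1, and distant points score close to 0. A larger gamma makes the similarity fall off faster, giving a more flexible (and more overfitting-prone) decision boundary.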

When to use SVM:


• Binary classification
• High-dimensional data
• Sparse data
• When training set is small to medium
Pros & Cons:
✓ Effective in high dimensions
✓ Memory efficient (only support vectors matter)
✓ Versatile (different kernel functions)
✗ Slow for large datasets
✗ Hard to interpret
✗ Hyperparameter tuning critical
LEVEL 2: DATA PREPROCESSING

Objective: Prepare raw data for machine learning models. This is crucial - garbage in = garbage
out!
Interview Question: "What will you do before training a model?"

2.1 HANDLING MISSING VALUES

What is it?
Process of dealing with incomplete data (NaN, null values).

Why it matters?
• Most algorithms cannot handle missing values
• Biases results if ignored
• Can indicate data quality issues

Methods to Handle Missing Values:

1. Deletion (Removal):
• Remove rows with missing values
• Use when: <5% data missing, data is redundant
• Pros: Simple; introduces no bias if values are missing completely at random
• Cons: Loss of information

2. Mean/Median Imputation:
• Replace with mean (continuous) or median (robust to outliers)

Missing value = mean(column)

• Use when: Data is MCAR (Missing Completely At Random)


• Pros: Simple, fast
• Cons: Reduces variance, biased estimates

3. Forward Fill / Backward Fill (For Time Series):

Forward Fill: Use previous value


Backward Fill: Use next value

4. Machine Learning Imputation:


• Train model on non-missing values
• Predict missing values
• Better accuracy but computationally expensive
Example:
Missing Age in customer data:
• Option 1: Delete rows (lose 20% data)
• Option 2: Fill with mean age (30 years)
• Option 3: Use KNN to predict based on similar customers
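
Mean imputation and forward fill are simple enough to sketch in plain Python, with None standing in for a missing value. The ages are made up; in practice a library such as pandas would typically handle this.

```python
def mean_impute(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def forward_fill(values):
    """Replace None with the most recent observed value (leading gaps stay None)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


ages = [25, None, 35, 30, None]          # made-up customer ages
imputed = mean_impute(ages)
filled = forward_fill(ages)
```

Mean imputation fills both gaps with 30.0, the mean of the three observed ages, while forward fill copies 25 and 30 from the preceding rows. Forward fill only makes sense when row order carries meaning, as in a time series.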

2.2 ENCODING CATEGORICAL DATA

What is it?
Converting categorical (non-numeric) variables into numeric format.

Why is it needed?
Algorithms work with numbers, not text/categories.

Types:

1. Label Encoding:
Assigns integer to each category.

Color: Red=0, Green=1, Blue=2

• Pros: Simple, low memory


• Cons: Implies order (model might think Blue > Green)
• Use when: Tree-based models, or ordinal categories

2. One-Hot Encoding:
Creates binary column for each category.

Original: Color=[Red, Green, Blue]


One-Hot: Red=[1,0,0], Green=[0,1,0], Blue=[0,0,1]

• Pros: No false ordering


• Cons: Increases dimensions (curse of dimensionality)
• Use when: Linear models (Logistic Regression, SVM)

3. Target Encoding (Mean Encoding):


Replace category with mean target value of that category.

For category C: encoded_value = mean(target | category = C)


Example: Location encoding for house prices
• Downtown mean price: $500k → encode as 500
• Suburb mean price: $300k → encode as 300

4. Frequency Encoding:

Replace with frequency of category

• Red appears 100 times → encode as 100


• Green appears 50 times → encode as 50

Example Comparison:

Original   Label   One-Hot   Target   Frequency

Red        0       [1,0,0]   500k     100
Green      1       [0,1,0]   450k     50
Blue       2       [0,0,1]   300k     30
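
Three of the encodings in the table can be reproduced with dictionaries and list comprehensions. The color column is made up and far smaller than the counts in the table. Target encoding is left out because it needs a target column and care to avoid leaking test labels.

```python
from collections import Counter

colors = ["Red", "Green", "Blue", "Red", "Red", "Green"]   # made-up column

categories = list(dict.fromkeys(colors))          # unique values, first-seen order

# Label encoding: one integer per category.
label_map = {cat: i for i, cat in enumerate(categories)}
label_encoded = [label_map[c] for c in colors]

# One-hot encoding: one binary column per category.
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

# Frequency encoding: replace each category with its count.
counts = Counter(colors)
frequency_encoded = [counts[c] for c in colors]
```

Label encoding produces a single integer column, one-hot produces three binary columns, and frequency encoding maps Red, Green and Blue to 3, 2 and 1.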

2.3 FEATURE SCALING

What is it?
Transforming numerical features to similar scales.

Why is it important?
• Algorithms like KNN and SVM are distance-based; Neural Networks train with gradient descent, which is also scale-sensitive
• Features with larger ranges dominate
• Example: Age (0-100) vs Income (0-1,000,000)

• Income dominates, age becomes irrelevant!

1. Normalization (Min-Max Scaling):

X_scaled = (X - X_min) / (X_max - X_min)

• Range: [0, 1]
• Use when: Need bounded range and data has no extreme outliers (a single outlier stretches the range and compresses all other values)

Example:
Temperature: [10°C, 30°C]
For 20°C: (20 - 10) / (30 - 10) = 10 / 20 = 0.5

2. Standardization (Z-score Normalization):


X_scaled = (X - mean) / standard_deviation

• Range: Unbounded; most values fall in roughly [-3, 3] when data is normally distributed
• Use when: Data is approximately normally distributed
• Less distorted by outliers than min-max normalization

Example:
Age: mean=40, std=10, value=50
Z-score = (50 - 40) / 10 = 1
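
Both scalers fit in a few lines using the standard library. The temperature list matches the example range above; the ages list is made up and has mean 40 but a different standard deviation from the worked example.

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


temps = [10, 20, 30]                      # degrees Celsius
scaled_temps = min_max_scale(temps)

ages = [30, 40, 50]                       # made-up ages, mean 40
z_scores = standardize(ages)
```

The scaled temperatures are 0.0, 0.5 and 1.0, and the standardized ages have mean 0 and standard deviation 1. In a real pipeline the minimum, maximum, mean and standard deviation should be computed on the training set only and then reused for the test set.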

Algorithms Requiring Scaling:


✓ KNN (distance-based)
✓ SVM (distance-based)
✓ Neural Networks (gradient descent sensitive)
✓ Linear/Logistic Regression (faster convergence)

Algorithms NOT Requiring Scaling:


✗ Decision Trees (splits on individual features)
✗ Random Forest (ensemble of trees)
✗ Gradient Boosting (tree-based)
2.4 TRAIN-TEST SPLIT

What is it?
Dividing data into training set (model learns) and test set (model evaluation).

Why is it crucial?
• Test accuracy estimates real-world performance
• Training accuracy is biased (model saw same data)

Typical Split:

Training: 70-80%
Testing: 20-30%
(For large datasets: 90/10 acceptable)

Example:
1000 samples → 800 train, 200 test

Stratified Split (For Imbalanced Data):


Maintains class distribution in both sets.

Example: 95% Negative (Not Fraud), 5% Positive (Fraud)


• Normal split might have: Train 96% negative, Test 90% negative
• Stratified split maintains: Train 95% negative, Test 95% negative

Why it matters:
Without stratification: Test set can over- or under-represent the minority class
Measured performance then differs from what the model would achieve on the real class mix
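
A stratified split can be sketched by splitting each class separately and then combining the pieces. This is a simplified illustration with made-up labels; the function name, the seed and the rounding of per-class test sizes are assumptions, not a specific library's behavior.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) preserving class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within each class
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test


# 1000 made-up samples: 95% negative (0), 5% positive (1).
labels = [0] * 950 + [1] * 50
train_idx, test_idx = stratified_split(labels)
test_positive_rate = sum(labels[i] for i in test_idx) / len(test_idx)
```

The test set ends up with 190 negatives and 10 positives, so its positive rate is 5%, the same as in the full data.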

2.5 CROSS-VALIDATION

What is it?
Splitting data into multiple folds for robust performance estimation.

Problem with Train-Test Split:


• Result depends on random split
• Single test fold might not be representative
• Different splits → Different results

K-Fold Cross-Validation Solution:


1. Divide data into K equal folds (typically K=5)
2. For each fold i:
- Use fold i as test
- Use remaining K-1 folds as train
3. Average results across K iterations

CV_Score = (Score₁ + Score₂ + ... + Scoreₖ) / K

Example (K=5):
Iteration 1: Train on [2,3,4,5], Test on [1]
Iteration 2: Train on [1,3,4,5], Test on [2]
Iteration 3: Train on [1,2,4,5], Test on [3]
Iteration 4: Train on [1,2,3,5], Test on [4]
Iteration 5: Train on [1,2,3,4], Test on [5]
Final Score = Average of 5 scores
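
The fold rotation above can be generated programmatically. This sketch splits indices in order without shuffling, and the per-fold accuracies are hypothetical numbers used only to show the averaging step.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_val_score(scores):
    """CV score is the plain average of the per-fold scores."""
    return sum(scores) / len(scores)


folds = list(k_fold_indices(10, k=5))
# Hypothetical per-fold accuracies, for illustration only.
cv = cross_val_score([0.80, 0.82, 0.78, 0.81, 0.79])
```

Each sample appears in exactly one test fold. If the data is sorted (for example by class or by date), the indices should be shuffled first, or a time-series split used instead.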

Advantages:
✓ Better statistical estimate
✓ Uses all data for both training & testing
✓ Less sensitive to how the data happens to be split

Types:
• K-Fold (regular)
• Stratified K-Fold (maintains class distribution)
• Leave-One-Out (LOO): K = number of samples (expensive)
• Time Series Split (respects temporal ordering)
LEVEL 3: EVALUATION & METRICS

Objective: Measure model performance accurately.


Key Principle: Never trust a single metric!

3.1 CONFUSION MATRIX

What is it?
Table showing True Positives, False Positives, True Negatives, False Negatives.

                    Predicted Negative     Predicted Positive

Actual Negative     True Negative (TN)     False Positive (FP)

Actual Positive     False Negative (FN)    True Positive (TP)

Definitions:
• TP (True Positive): Correctly predicted positive
• TN (True Negative): Correctly predicted negative
• FP (False Positive): Incorrectly predicted positive (Type I error)
• FN (False Negative): Incorrectly predicted negative (Type II error)

Medical Example:
Disease Test Results (TP, FN are disease cases):
• TP=95: Correctly identified 95 sick people
• FN=5: Missed 5 sick people (dangerous!)
• TN=1000: Correctly identified 1000 healthy people
• FP=50: Incorrectly flagged 50 healthy people

3.2 CLASSIFICATION METRICS

1. Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)

• Overall correctness
• PROBLEM: Misleading for imbalanced data!
• Example: 99% Negative class
Predicting always "No" → 99% accuracy (but useless!)

2. Precision

Precision = TP / (TP + FP)


• "Of all predictions I made positive, how many were correct?"
• False alarms measure
• Email spam detection: Precision = Actual spam emails among predicted spam
• High precision = Few false alarms

3. Recall (Sensitivity)

Recall = TP / (TP + FN)

• "Of all actual positives, how many did I find?"


• Coverage measure
• Medical test: Recall = Detected cases among all sick people
• High recall = Few positive cases missed

4. F1 Score (Harmonic Mean)

F1 = 2 × (Precision × Recall) / (Precision + Recall)

• Balances precision-recall tradeoff


• Use when both errors are costly

5. Specificity

Specificity = TN / (TN + FP)

• "Of all actual negatives, how many did I correctly identify?"


• True negative rate

When to use which metric:


• Balanced data, similar costs → Accuracy
• False positives costly (spam filter) → Precision
• False negatives costly (disease test) → Recall
• Both errors costly → F1 Score
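
All five metrics follow from the four confusion-matrix counts. The sketch below plugs in the disease-test counts from section 3.1 (TP=95, FP=50, TN=1000, FN=5); the function name is illustrative.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the five metrics of this section from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }


# Counts from the disease-test example in section 3.1.
metrics = classification_metrics(tp=95, fp=50, tn=1000, fn=5)
```

Accuracy is about 0.95 and recall is 0.95, but precision is only about 0.66 because 50 of the 145 positive predictions are false alarms, giving an F1 of about 0.78. This is the kind of gap a single metric would hide.
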
3.3 ROC CURVE & AUC SCORE

What is ROC Curve?


Plot showing Recall (True Positive Rate) vs False Positive Rate at different thresholds.

True Positive Rate (TPR) = Recall:

TPR = TP / (TP + FN)

False Positive Rate (FPR):

FPR = FP / (FP + TN)

How Threshold Affects:


• Lower threshold → More predictions positive
TPR increases (catch more positives)
FPR increases (more false alarms)

• Higher threshold → Fewer predictions positive


TPR decreases (miss more positives)
FPR decreases (fewer false alarms)

Interpreting ROC Curve:


• Perfect classifier: Curve goes up, then right (top-left corner)
• Random classifier: Diagonal line (45 degrees)
• Better models: Curve bows toward top-left

AUC (Area Under Curve):

AUC ∈ [0, 1]
AUC = 0.5: Random (no discrimination)
AUC = 1.0: Perfect
AUC = 0.7-0.8: Good

Interpretation:
AUC = Probability that model ranks random positive higher than random negative

Fraud Detection Example:


AUC = 0.9 means: 90% of the time, model ranks fraudulent transaction higher
than legitimate one
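
The ranking interpretation gives a direct, if slow, way to compute AUC: count how often a positive outscores a negative over all pairs. The scores below are made up, and counting ties as half a win is a common convention adopted here as an assumption.

```python
def auc_by_ranking(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up model scores for fraudulent (positive) and legitimate (negative) cases.
fraud_scores = [0.9, 0.8, 0.4]
legit_scores = [0.7, 0.3, 0.2]
auc = auc_by_ranking(fraud_scores, legit_scores)
```

Eight of the nine pairs are ranked correctly (only the fraud case scored 0.4 falls below the legitimate case scored 0.7), so the AUC is 8/9, about 0.89. The pairwise loop is quadratic; libraries compute the same quantity from sorted scores.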

When to use:
• For ranking quality
• Different thresholds
• Imbalanced datasets
• When you don't know optimal threshold
LEVEL 4: PROBABILITY & STATISTICS (BASICS ONLY)

Objective: Understand probabilistic thinking for ML.


Scope: Intuition only, not deep mathematics.

4.1 DESCRIPTIVE STATISTICS

Mean (Average):

µ = Σx / n

Example: [1, 2, 3, 4, 5] → µ = 15/5 = 3

Variance (Spread from mean):

σ² = Σ(x - µ)² / n

High variance = Data spread out


Low variance = Data clustered

Standard Deviation (Square root of variance):

σ = √(σ²)

Same units as data (more interpretable)


Example: If σ=2 and the data is roughly normal, about 68% of values lie within ±2 units of the mean

68-95-99.7 Rule (Normal Distribution):

68% within 1σ of mean


95% within 2σ of mean
99.7% within 3σ of mean

Example:
Heights: µ=170cm, σ=5cm
• 68% of people: 165-175 cm
• 95% of people: 160-180 cm
• Person with height 185cm is very rare (3σ above the mean)
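
Python's statistics module provides the population versions of these formulas (dividing by n, as in the definitions above). A short check on the [1, 2, 3, 4, 5] example, plus the z-score of the 185 cm height:

```python
import statistics

data = [1, 2, 3, 4, 5]

mean = statistics.fmean(data)            # sum(x) / n
variance = statistics.pvariance(data)    # population variance, divides by n
std_dev = statistics.pstdev(data)        # square root of the variance

# Heights example: how many standard deviations is 185 cm from the mean?
z_185 = (185 - 170) / 5
```

The mean is 3, the variance is 2 and the standard deviation is about 1.41. The functions statistics.variance and statistics.stdev divide by n-1 instead, which is the usual choice when the data is a sample from a larger population.
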
4.2 PROBABILITY BASICS

Probability Definition:

P(A) = Number of favorable outcomes / Total outcomes

Range: [0, 1]
• P(A) = 0: Impossible
• P(A) = 0.5: As likely as not
• P(A) = 1: Certain

Example: Fair Coin Flip


P(Heads) = 1/2 = 0.5

Independent Events Multiplication Rule:

P(A and B) = P(A) × P(B)

Example: Two coin flips


P(both heads) = 0.5 × 0.5 = 0.25

Mutually Exclusive Events Addition Rule:

P(A or B) = P(A) + P(B)

Example: Rolling die (1 or 2)


P(1 or 2) = 1/6 + 1/6 = 2/6 ≈ 0.33

4.3 CONDITIONAL PROBABILITY

What is it?
Probability of event given another event already happened.

Mathematical Definition:

P(A|B) = P(A and B) / P(B)


• P(A|B) is read as: "Probability of A given B"
• Defined only when P(B) > 0

Example: Medical Test


• Disease prevalence: P(Disease) = 0.01 (1%)
• Test sensitivity: P(Positive|Disease) = 0.99
• False positive rate: P(Positive|No Disease) = 0.05

Question: If test positive, what's probability of disease?


Answer: Use Bayes Theorem!

4.4 BAYES' THEOREM (VERY IMPORTANT)

What is it?
Mathematical framework for updating beliefs based on new evidence.

Formula:

P(A|B) = P(B|A) × P(A) / P(B)

Components:
• P(A|B) = Posterior (what we want, updated belief)
• P(B|A) = Likelihood (evidence given hypothesis)
• P(A) = Prior (initial belief before evidence)
• P(B) = Evidence (probability of observing data)

Intuition:
Posterior = Likelihood × Prior / Evidence
(Updated belief = How likely the evidence is under the hypothesis × Initial belief, rescaled by how likely the evidence is overall)

Medical Test Example:


• Prior: P(Disease) = 0.01 (1% have disease)
• Likelihood: P(Positive|Disease) = 0.99 (test catches 99% of sick)
• False positive: P(Positive|Healthy) = 0.05

Calculate P(Positive):

P(Positive) = P(Positive|Disease)×P(Disease) +
P(Positive|Healthy)×P(Healthy)
= 0.99×0.01 + 0.05×0.99
= 0.0099 + 0.0495 = 0.0594

Calculate Posterior:
P(Disease|Positive) = 0.99 × 0.01 / 0.0594
= 0.0099 / 0.0594
= 0.167 (16.7%)

Interpretation:
Despite positive test, only 16.7% chance of actual disease!
Reason: Disease is rare, so most positives are false alarms
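
The medical-test calculation can be wrapped in a small function to see how strongly the prior drives the result. The 1% case uses the numbers above; the 20% prevalence for a higher-risk group is an assumed value for comparison.

```python
def bayes_posterior(prior, sensitivity, false_positive_rate):
    """P(Disease | Positive) via Bayes' theorem and the law of total probability."""
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive


posterior = bayes_posterior(prior=0.01, sensitivity=0.99, false_positive_rate=0.05)

# The same test applied to a higher-risk group (assumed 20% prevalence).
posterior_high_risk = bayes_posterior(prior=0.20, sensitivity=0.99, false_positive_rate=0.05)
```

With 1% prevalence the posterior is about 0.167, matching the hand calculation. With the assumed 20% prevalence the same positive result implies about 0.83, so the test is only as informative as the prior allows.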

Why it matters in ML:


• Foundation of Bayesian inference
• Naive Bayes classifier uses this
• Uncertainty quantification
4.5 NORMAL DISTRIBUTION

What is it?
Symmetric bell-shaped probability distribution.

Mathematical Formula:

P(x) = (1 / (σ√(2π))) × e^(-(x-µ)²/(2σ²))

Key Properties:
• Symmetric around mean (µ)
• Peak at mean
• Spread controlled by standard deviation (σ)
• Entire area under curve = 1 (total probability)

Why it's important:


• Many real-world phenomena approximately normal
• Central Limit Theorem: Sum (or mean) of many independent random variables ≈ normal
• Foundation for statistical tests
• Gaussian Naive Bayes assumes normal distribution

Examples of Approximately Normal Distributions:


• Heights in population
• Test scores
• Measurement errors
• IQ scores

Standard Normal Distribution:


• µ = 0, σ = 1
• Used as reference (z-scores)

Real-World Application:
Factory produces light bulbs (µ=1000hrs, σ=50hrs)
• What % last > 1100 hours? (2σ above mean)
• Answer: ~2.5% by the 68-95-99.7 rule (5% lies outside ±2σ, half of it above); the exact value is ~2.3%
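
The light-bulb question can be answered exactly with the normal CDF. The 68-95-99.7 rule gives roughly 2.5% for a 2σ upper tail, while the exact figure is about 2.3%. The sketch uses statistics.NormalDist from the standard library (Python 3.8+).

```python
from statistics import NormalDist

bulb_life = NormalDist(mu=1000, sigma=50)      # hours

# P(X > 1100) = 1 - CDF(1100); 1100 is two standard deviations above the mean.
p_over_1100 = 1 - bulb_life.cdf(1100)

# Share of bulbs within one standard deviation of the mean.
p_within_1_sigma = bulb_life.cdf(1050) - bulb_life.cdf(950)
```

p_over_1100 comes out near 0.0228 and p_within_1_sigma near 0.683, which is where the "68" in the rule comes from.
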
LEVEL 5: MODEL IMPROVEMENT TECHNIQUES

Objective: Make models perform better.

5.1 BIAS VS VARIANCE

What is it?
Two sources of error that limit model performance.

Bias:
Error from oversimplified assumptions (underfitting)

High Bias = Model too simple

• Cannot capture true relationship


• Example: Using linear model for non-linear data
• Result: Poor training AND test performance

Variance:
Error from being too sensitive to training data (overfitting)

High Variance = Model too complex

• Memorizes training data


• Example: Decision tree with no depth limit
• Result: Good training, poor test performance

Bias-Variance Tradeoff:

Total Error = Bias² + Variance + Irreducible Error

Visualizing Tradeoff:
Model Complexity (→)
• Very simple: High bias, low variance (underfitting)
• Medium: Balanced (sweet spot)
• Very complex: Low bias, high variance (overfitting)

How to detect Bias vs Variance:


• High bias: Training error high AND test error high
→ Solution: Use a more complex model or add informative features
• High variance: Training error low BUT test error high
→ Solution: Regularization, more data, simpler model

5.2 HYPERPARAMETER TUNING

What is it?
Choosing model settings that are fixed before training (not learned from data).

Examples of Hyperparameters:
• Learning rate (α) in gradient descent
• Number of trees (n_estimators) in Random Forest
• Tree depth (max_depth)
• K in K-Nearest Neighbors
• Regularization strength (λ) in regularized linear regression (Ridge, Lasso)
• Batch size in neural networks

Why is it critical?
Same algorithm with different hyperparameters = different performance

Example: Decision Tree

max_depth = 3 → Simple model (risk of high bias)
max_depth = 20 → Complex model (risk of high variance)
max_depth = 8 → Often a reasonable balance (the best value depends on the data)

5.3 GRID SEARCH

What is it?
Exhaustive search over defined hyperparameter ranges.

Algorithm:
1. Define ranges for each hyperparameter
2. Create all combinations
3. Train model for each combination
4. Evaluate on validation set
5. Return best combination

Example:

Hyperparameters:
max_depth ∈ [3, 5, 7, 10]
min_samples_split ∈ [2, 5, 10]
Total combinations: 4 × 3 = 12
Train 12 models
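
Enumerating the grid is a one-liner with itertools.product. In this sketch validation_score is a hypothetical stand-in for "train a model and evaluate it on the validation set"; its formula is invented so the example runs without a dataset.

```python
from itertools import product

param_grid = {
    "max_depth": [3, 5, 7, 10],
    "min_samples_split": [2, 5, 10],
}


def validation_score(params):
    """Stand-in for 'train a model and score it on the validation set'.

    Hypothetical formula so the example runs without a dataset; it peaks at
    max_depth=7, min_samples_split=5.
    """
    return (1.0
            - 0.01 * abs(params["max_depth"] - 7)
            - 0.005 * abs(params["min_samples_split"] - 5))


names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]
best_params = max(combinations, key=validation_score)
```

The list holds all 4 × 3 = 12 combinations, and the search returns the one with the highest score. Adding a third hyperparameter with 4 values would grow the list to 48, which is the exponential cost noted in this section.
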
Pros & Cons:
✓ Systematic, guaranteed to find the best combination within the grid
✓ Easy to implement
✗ Computationally expensive (exponential combinations)
✗ Slow for many hyperparameters

5.4 RANDOM SEARCH

What is it?
Random sampling from hyperparameter space (faster than Grid).

Algorithm:
1. Define ranges for hyperparameters
2. Sample N random combinations
3. Train model for each
4. Return best

Comparison with Grid Search:

Grid Search: 4 × 4 × 4 = 64 combinations


Random Search: Sample 20 random combinations

When Random is Better:


Some hyperparameters might not matter much
Grid would waste time on non-important ones
Random finds important combinations with less computation
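
Random search replaces the full enumeration with a fixed number of random draws. The search space below is an assumed example with 4 × 4 × 4 = 64 combinations, and sample_params is an illustrative helper name.

```python
import random

search_space = {
    "max_depth": [3, 5, 7, 10],
    "min_samples_split": [2, 5, 10, 20],
    "n_estimators": [100, 200, 500, 1000],
}   # 4 x 4 x 4 = 64 possible combinations


def sample_params(space, n_samples, seed=0):
    """Draw n_samples random combinations (duplicates are possible)."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(options) for name, options in space.items()}
        for _ in range(n_samples)
    ]


candidates = sample_params(search_space, n_samples=20)
```

Only 20 models would be trained instead of 64. Each candidate would then be scored on a validation set exactly as in grid search, and the best one kept.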

Pros & Cons:


✓ Faster than grid search
✓ Handles many hyperparameters better
✗ Not guaranteed to find best
✗ Less predictable coverage

5.5 FEATURE SELECTION

What is it?
Selecting most important features and removing irrelevant ones.

Why is it important?
• Improves model interpretability
• Reduces overfitting
• Faster training
• Reduces memory requirements
• Mitigates curse of dimensionality

1. Filter Methods (Statistical):


Rank features by statistical properties independently.

Correlation with target: r ∈ [-1, 1]


Select features with |r| > threshold

Pros: Fast, scalable


Cons: Ignores feature interactions
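
A correlation filter can be sketched with a hand-written Pearson coefficient. The house data is made up so that price tracks area closely and is unrelated to an arbitrary paint-brand code; the threshold of 0.5 is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def select_features(features, target, threshold=0.5):
    """Keep the names of features whose |r| with the target exceeds threshold."""
    return [name for name, values in features.items()
            if abs(pearson_r(values, target)) > threshold]


# Made-up house data: price tracks area closely, not the paint-brand code.
features = {
    "area": [1000, 1500, 2000, 2500, 3000],
    "paint_brand_code": [2, 1, 3, 1, 2],
}
price = [150, 230, 300, 370, 450]
selected = select_features(features, price)
```

Only "area" passes the filter. Because each feature is scored on its own, a feature that is useful only in combination with another would be dropped, which is the limitation listed under Cons.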

2. Wrapper Methods (Model-Based):


Train model with different feature subsets.

Exhaustive/Greedy optimization of feature subsets

• Forward Selection: Start with 0 features, add best


• Backward Elimination: Start with all, remove worst
• Recursive Feature Elimination (RFE)

Pros: Considers interactions


Cons: Computationally expensive

3. Embedded Methods (During Training):


Features selected as part of model training

Coefficient magnitude |β| in Linear Regression


Feature importance in Tree-based models

Examples:
• Lasso Regression (L1 regularization)
• Tree feature importance (Random Forest)
• Permutation importance (computed after training, so strictly a model-agnostic method rather than an embedded one)

Example: Predicting House Price


Available features:
• Area (important) → Keep
• Number of bedrooms (important) → Keep
• Owner's favorite color (useless) → Remove
• Paint brand (barely matters) → Remove

Result: 2 useful features kept out of the 4 listed (real datasets may start with 100+) → Simpler, often better model


QUICK REFERENCE GUIDE

Problem Type     Algorithm         When to Use

Regression       Linear Reg.       Simple, interpretable relationships
Regression       Polynomial Reg.   Non-linear patterns
Classification   Logistic Reg.     Binary, large datasets
Classification   KNN               Small-medium data, non-linear
Classification   Naive Bayes       Text classification, fast needed
Classification   Decision Tree     Interpretability important
Classification   Random Forest     Best accuracy, medium complexity
Classification   SVM               High dimensions, large margin needed
Classification   Gradient Boost    Competition, highest accuracy

When to Use Which Metric

• Balanced Data, Similar Consequences → Accuracy


• False Positives Costly → Precision
• False Negatives Costly → Recall
• Balance Both → F1 Score
• Different Thresholds → ROC/AUC

Data Preprocessing Checklist

☐ Handle missing values
☐ Encode categorical features
☐ Scale numerical features
☐ Check for outliers
☐ Create train-test split
☐ Use K-fold cross-validation
☐ Feature selection if many features

Model Improvement Workflow

1. Train baseline model


2. Check bias vs variance
3. If High Bias: Use a more complex model or add features
4. If High Variance: Regularize, add data, simplify
5. Tune hyperparameters (Grid/Random Search)
6. Feature selection/engineering
7. Ensemble methods (combine models)
KEY FORMULAS SUMMARY

Linear Regression
y = β₀ + β₁x + ε
MSE = (1/n) Σ(yᵢ - ŷᵢ)²

Logistic Regression
P(y=1|x) = 1 / (1 + e^(-z))
Log Loss = -(1/n) × Σ[y×log(ŷ) + (1-y)×log(1-ŷ)]

Classification Metrics
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)

Feature Scaling
Normalization: X_scaled = (X - min) / (max - min)
Standardization: X_scaled = (X - mean) / std

Statistics
Mean: µ = Σx / n
Variance: σ² = Σ(x - µ)² / n
Std Dev: σ = √(σ²)

Probability
P(A and B) = P(A) × P(B) [if independent]
P(A|B) = P(A and B) / P(B)
Bayes: P(A|B) = P(B|A) × P(A) / P(B)

Cross-Validation
CV_Score = (Score₁ + Score₂ + ... + Scoreₖ) / K
