
ML Module 2

The document covers foundational topics in machine learning, focusing on classification techniques and linear regression. It explains various classification types, linear regression methods, evaluation metrics, regularization techniques, and optimization methods like gradient descent. Additionally, it introduces core classification models such as logistic regression, Naive Bayes, and KNN, along with advanced models like SVM, decision trees, and ensemble learning strategies.

Uploaded by

haseebvlog5580
Copyright
© All Rights Reserved

ML_Module 2

Part 1: Foundational Topics (Classification & Linear Regression)

1. Introduction to Classification

Classification is a supervised learning technique in which the model
learns to assign a category label to an input based on labeled
historical data.

How it Works: The model identifies specific characteristics
(features), such as shape or color, to distinguish between groups.

The Decision Boundary: This is a conceptual line or region that
separates different classes. If a data point falls on one side, it is
labeled Class A; if it falls on the other, Class B.

Types of Classification:

Binary Classification: Choosing between exactly two options
(e.g., an email is either "Spam" or "Not Spam").

Multi-class Classification: Categorizing data into more than
two distinct groups, where each point belongs to only one
(e.g., classifying an image as a "Dog," "Cat," or "Bird").

Multi-Label Classification: A single data point can have
multiple labels simultaneously (e.g., a movie tagged as both
"Action" and "Comedy").

Real-World Example: In Medical Diagnosis, a model analyzes
test results and symptoms to classify whether a patient has a
specific condition, helping doctors make faster decisions.

2. Linear Regression

Linear Regression predicts a continuous numerical value (like
price or weight) by assuming a straight-line relationship between
input and output.

Hypothesis Function: This is the equation used to predict the
dependent variable from the independent variables. It represents the
relationship between the input features and the target output. For a
single input it is:

$y = mx + b$

Here, m (slope) shows how much y changes when x increases by one
unit, and b (intercept) is the value of y when x is zero.

Simple vs. Multiple Regression:

Simple: Uses one input to predict one output (e.g., predicting
Salary based on Years of Experience).

Multiple: Uses several inputs to predict one output (e.g.,
predicting House Price using size, location, and age).

The "Best Fit Line": The goal is to find the line that minimizes the
vertical distances (residuals) between each actual point and the
predicted line. The algorithm adjusts the slope and intercept of the
line to minimize the sum of squared residuals (errors).
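
The slope and intercept that minimize the sum of squared residuals
can be computed directly for simple regression. The sketch below
assumes a single input feature; the function name `fit_line` and the
salary figures are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Hypothetical data: years of experience vs. salary (in thousands).
years = [1, 2, 3, 4, 5]
salary = [35, 40, 45, 50, 55]
m, b = fit_line(years, salary)
print(f"slope={m:.2f}, intercept={b:.2f}")
```

On this perfectly linear toy data the fitted line passes through
every point, so all residuals are zero.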

Handling Categorical Data: If your data has non-numeric labels
(like "Urban" or "Rural"), you must use Dummy Variables (binary
1s and 0s) so the math-based model can process them.

Real-World Example: Real Estate companies use multiple
linear regression to estimate property values based on square
footage, neighborhood safety, and proximity to schools.

3. Evaluation Metrics for Regression

To know if your regression model is "good," you must measure its
error.

Mean Squared Error (MSE): Calculates the average of the
squared differences between predicted and actual values.
Squaring ensures that negative and positive errors don't cancel
out, and it penalizes large errors more heavily.

Mean Absolute Error (MAE): Takes the average of the absolute
differences, without squaring them. Because the errors are not
squared, it is less sensitive to outliers than MSE.

R-Squared ($R^2$): Measures the proportion of the variance in the
actual values that the model explains:
$R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of
squared residuals and $SS_{tot}$ is the sum of squared differences
between the actual values and their mean. It usually falls between
0 and 1, and higher is better (it can go negative when a model fits
worse than simply predicting the mean).

Adjusted R-Squared: Standard R-squared never decreases when a
feature is added, even if that feature does not help. Adjusted
R-squared penalizes the model for extra input features that don't
improve prediction, which discourages needless complexity.
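
The four metrics above can be written out in a few lines. This is a
minimal sketch in plain Python; the function names and the sample
values are illustrative assumptions, and `k` stands for the number of
input features.

```python
def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(actual, predicted, k):
    # k = number of input features; requires len(actual) > k + 1.
    n = len(actual)
    return 1 - (1 - r_squared(actual, predicted)) * (n - 1) / (n - k - 1)


# Hypothetical values for illustration.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 9.0]
print(mse(actual, predicted), mae(actual, predicted))
print(r_squared(actual, predicted), adjusted_r_squared(actual, predicted, k=1))
```

Note that the adjusted value is lower than the plain $R^2$ here,
reflecting the penalty for the one feature used.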

4. Regularization Techniques

When a model is too complex, it might "overfit," meaning it
memorizes the training data but fails on new data. Regularization
adds a penalty to the cost function to prevent the model's
coefficients from becoming too large.

Lasso Regression (L1): Adds a penalty based on the absolute
value of the coefficients. It can shrink some coefficients to exactly
zero, effectively performing feature selection by removing
useless inputs.

Ridge Regression (L2): Adds a penalty based on the square of the
coefficients. It shrinks coefficients toward zero but generally does
not make them exactly zero. It is well suited to correlated
features (when input features are highly related to each other).

Elastic Net: A hybrid that combines both L1 and L2 penalties to
get the benefits of both.
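
The different behavior of the two penalties shows up even with a
single feature. The sketch below assumes a model y = w*x with no
intercept, a ridge objective of 0.5*sum((y - w*x)^2) + 0.5*lam*w^2,
and a lasso objective of 0.5*sum((y - w*x)^2) + lam*|w|; under those
assumptions both have closed-form solutions. The data and the value
of `lam` are made up.

```python
import math


def _sums(xs, ys):
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy, sxx


def ridge_weight(xs, ys, lam):
    # L2 penalty: the weight is scaled down but stays non-zero.
    sxy, sxx = _sums(xs, ys)
    return sxy / (sxx + lam)


def lasso_weight(xs, ys, lam):
    # L1 penalty: soft-thresholding sets weak weights to exactly zero.
    sxy, sxx = _sums(xs, ys)
    shrunk = max(abs(sxy) - lam, 0.0)
    return math.copysign(shrunk, sxy) / sxx


xs = [1.0, 2.0, 3.0]
useful = [2.0, 4.0, 6.0]      # strongly related to xs (true w = 2)
useless = [0.1, -0.1, 0.1]    # almost unrelated to xs

for name, ys in [("useful", useful), ("useless", useless)]:
    print(name, ridge_weight(xs, ys, lam=1.0), lasso_weight(xs, ys, lam=1.0))
```

For the weakly related feature, lasso returns a weight of exactly
zero while ridge returns a small non-zero weight, which is the
feature-selection effect described above.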

5. Optimization: Gradient Descent

Gradient Descent is the "engine" that trains models by iteratively
adjusting parameters to find the point where the error (Cost
Function) is lowest.

How it Works: Imagine you are on a hill in thick fog. To find
the valley (lowest error), you take small steps in the steepest
downward direction.

Learning Rate ($\alpha$): This controls the size of each step. If
the step is too big, you might overshoot the valley; if too small, it
will take too long to get there.

Why use it? For massive datasets, solving the equations directly can
be too computationally expensive. Gradient Descent is often faster
and more memory-efficient in that setting.
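
As a minimal sketch of the idea, the loop below runs gradient descent
on the MSE cost of a one-feature linear model y = m*x + b. The
learning rate, step count, and data are illustrative assumptions;
other data would likely need a different learning rate.

```python
def gradient_descent(xs, ys, alpha=0.05, steps=2000):
    """Fit y = m*x + b by repeatedly stepping against the MSE gradient."""
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(m * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE with respect to m and b.
        grad_m = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / n) * sum(errors)
        # Step downhill; alpha is the learning rate.
        m -= alpha * grad_m
        b -= alpha * grad_b
    return m, b


# Hypothetical data generated from y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
m, b = gradient_descent(xs, ys)
print(f"m={m:.4f}, b={b:.4f}")
```

With this data, a noticeably larger `alpha` (for example 0.1) makes
the updates overshoot and diverge, which is the "step too big" case
described above.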

Part 2: Core Classification Models (Logistic, Naive Bayes, & KNN)

1. Logistic Regression
Despite its name, Logistic Regression is used for classification, not
regression. It predicts the probability that an input belongs to a
specific category.

The Sigmoid Function: This is the heart of the model. It takes
any number and squashes it into the range between 0 and 1:
$\sigma(z) = 1 / (1 + e^{-z})$. If the result is at least 0.5, the
input is usually assigned to Class 1; if lower, to Class 0.

The logistic regression model transforms the continuous output of
the linear regression function into a categorical output using the
sigmoid function.

Some Terminology:

Independent Variables: These are the input features or
predictor variables used to make predictions about the
dependent variable.

Dependent Variable: This is the target variable that we aim to
predict. In logistic regression, the dependent variable is
categorical.

Odds: The ratio of the probability of an event happening
to the probability of it not happening, p / (1 - p). It differs from
probability, which is the ratio of occurrences to total
possibilities.

Cost Function (Log Loss): We use Log Loss instead of MSE
because, combined with the sigmoid, it yields a smooth, bowl-shaped
(convex) curve, allowing Gradient Descent to find the single best
solution.
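
The sigmoid, odds, and Log Loss described above can be sketched as
follows. The function names and the example probabilities are
illustrative assumptions, not part of any particular library.

```python
import math


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Probability of the event divided by probability of no event."""
    return p / (1.0 - p)


def log_loss(labels, probs):
    """Average Log Loss for binary labels (0 or 1) and predicted P(class 1)."""
    total = sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(labels, probs)
    )
    return -total / len(labels)


print(sigmoid(0.0))   # a linear score of 0 sits exactly on the 0.5 threshold
print(odds(0.8))      # an 80% chance corresponds to odds of about 4 to 1
print(log_loss([1, 0], [0.9, 0.1]))  # confident and correct: small loss
print(log_loss([1, 0], [0.1, 0.9]))  # confident and wrong: large loss
```

Log Loss grows sharply when the model is confident and wrong, which
is what pushes the predicted probabilities toward the true labels
during training.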

Types of Logistic Regression:

Binomial: Two categories only (e.g., Pass/Fail).

Multinomial: More than two categories with no specific order
(e.g., Cat, Dog, Bird).

Ordinal: More than two categories with a natural ranking
(e.g., Rating a movie: Low, Medium, High).

Real-World Example: Bank Fraud Detection. The model
calculates the probability (e.g., 85%) that a transaction is
fraudulent. If that probability exceeds a certain threshold, the
bank blocks the card.

2. Naive Bayes Classifiers

Naive Bayes is a fast, probabilistic classifier based on Bayes'
Theorem.

The "Naive" Assumption: It is called "naive" because it
assumes that every feature is independent of every other
feature, given the class. For example, in a fruit model, "roundness"
and "red color" are treated as if they have nothing to do with each
other, even though they both suggest an apple.

Common Variants:

Gaussian Naive Bayes (for continuous data):

Used when features are continuous real numbers (e.g.,
height, temperature, or petal length).

Mechanism: It assumes that the features for each class follow
a Normal (Gaussian) distribution (a bell curve).

Example: Predicting flower species from exact petal
measurements.

Multinomial Naive Bayes:

Assumes that features represent discrete counts (non-negative
integers rather than fractional values). Ideal for text
classification where you care about word frequencies.

Mechanism: It looks at how many times a word appears in a
document.

Example: A spam filter that flags an email because the word
"Free" appears five times.

Bernoulli Naive Bayes:

Best for binary data: used when features are 1s and 0s (e.g.,
"Yes/No" or "Present/Absent").

Mechanism: It only cares whether a word is in a document or not,
not how many times it shows up.

Example: Sentiment Analysis where you check whether words like
"excellent" or "terrible" simply exist in a review.

Complement Naive Bayes (CNB):

Specifically designed for imbalanced datasets where one
category is much larger than the others.

Mechanism: Instead of estimating probabilities from the
target class, it uses the complement (all other classes) to
reduce bias toward the majority group.

Example: Medical diagnosis where "Healthy" cases vastly
outnumber "Disease" cases. CNB helps prevent the model from
just guessing "Healthy" every time to get high accuracy.

Real-World Example: Email Spam Filtering. If the word "Winner"
appears frequently in known spam, the model increases the
probability that a new email containing "Winner" is spam.
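
To make the word-count mechanism concrete, here is a small sketch of
a Multinomial Naive Bayes spam filter. It assumes add-one (Laplace)
smoothing, ignores words never seen in training, and uses a tiny
made-up corpus; the function names `train` and `predict` are
illustrative.

```python
import math
from collections import Counter


def train(docs):
    """docs is a list of (words, label) pairs."""
    docs_per_class = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in docs_per_class}
    for words, label in docs:
        word_counts[label].update(words)
    vocab = {word for words, _ in docs for word in words}
    priors = {c: n / len(docs) for c, n in docs_per_class.items()}
    return priors, word_counts, vocab


def predict(words, priors, word_counts, vocab):
    scores = {}
    for label, prior in priors.items():
        total_words = sum(word_counts[label].values())
        score = math.log(prior)  # work in log space to avoid underflow
        for word in words:
            if word not in vocab:
                continue  # assumption: skip words unseen in training
            # Add-one smoothing keeps unseen (word, class) pairs above zero.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training emails, already split into words.
docs = [
    (["winner", "free", "prize"], "spam"),
    (["free", "winner", "click"], "spam"),
    (["meeting", "tomorrow", "agenda"], "ham"),
    (["lunch", "tomorrow", "free"], "ham"),
]
model = train(docs)
print(predict(["winner", "free"], *model))
print(predict(["meeting", "agenda"], *model))
```

Each word contributes its own independent factor to the score, which
is exactly the "naive" independence assumption.
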
3. K-Nearest Neighbors (KNN)

KNN is a simple, "lazy" algorithm that doesn't build
a mathematical model during training; it just stores the data and
waits until it needs to make a prediction.

The "K" Factor: To classify a new point, the model looks at the
K closest points (neighbors). If K=3, it checks the 3 nearest
neighbors and assigns the new point to the category that has the
most votes.

Distance Metrics: Proximity is usually measured using
Euclidean Distance (a straight line, "as the crow flies") or
Manhattan Distance (moving along a grid, like a taxi in a city).

Impact of K Value:

Small K (e.g., K=1): The model is very sensitive to noise and
outliers (overfitting).

Large K: The model becomes too simple and might ignore
small, important patterns (underfitting).

Decision Boundaries: For a 1-Nearest Neighbor model, the
boundary follows a Voronoi diagram: a web of regions where every
spot inside a region is closer to one specific training point than
to any other.

Real-World Example: Personalized Recommendations. Netflix
might suggest a movie to you because "K" users who have
similar tastes to yours (your neighbors in the data) also enjoyed
that movie.
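
The voting procedure and the two distance metrics can be sketched in
a few lines. The training points and labels below are made up, and
`knn_predict` is an illustrative name.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train, query, k=3, distance=euclidean):
    """train is a list of (point, label) pairs; returns the majority label."""
    # "Lazy" learning: all the work happens here, at prediction time.
    neighbors = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical 2D training data with two well-separated groups.
train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((6, 7), "B"), ((7, 6), "B"),
]
print(knn_predict(train, (2, 2), k=3))
print(knn_predict(train, (6, 5), k=3, distance=manhattan))
```

Choosing an odd K for two-class problems is a common way to avoid
tied votes.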

Analogy to solidify understanding:

Logistic Regression is like a judge deciding if a person is guilty
(1) or innocent (0) based on the evidence.

Naive Bayes is like a detective who lists clues (features) and
calculates the odds of a crime based on how often those clues
appeared in past cases.

KNN is like a chameleon: it simply looks at the people standing
closest to it and decides to act exactly like the majority of them.

Part 3: Advanced Boundary & Hierarchical Models

1. Support Vector Machines (SVM)

SVM is a supervised learning algorithm that finds the best possible
boundary (hyperplane) to separate different classes by maximizing
the gap between them.

Key Concepts:

Hyperplane: The decision boundary that divides classes (e.g., a
line in 2D or a flat surface in 3D).

Support Vectors: The data points closest to the boundary.
These are the "pillars" that actually hold up and define the
boundary.

Margin: The distance between the hyperplane and the
support vectors. SVM tries to maximize the margin so that
the model is stable on new data.

Hard vs. Soft Margin: A "Hard Margin" requires every point
to be perfectly separated, which fails if the classes overlap or
there are outliers. A "Soft Margin" allows a few mistakes to create
a smoother, more useful boundary.

The Kernel Trick (Non-Linear SVM):

When data can't be separated by a straight line, SVM uses a kernel
to work as if the data had been moved into a higher-dimensional
space where a straight cut is possible, without computing that
mapping explicitly.

Radial Basis Function (RBF): Excellent for circular or
complex patterns.

Polynomial Kernel: Useful for curved relationships.
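
The "trick" in the kernel trick is that a kernel returns the same
number as a dot product in the higher-dimensional space without ever
building that space. The sketch below checks this for a degree-2
polynomial kernel on 2D points; it assumes the homogeneous form
(a . b)^2 with no constant term, and `phi` is the explicit mapping
used only for comparison. An RBF kernel with an assumed `gamma` is
included alongside.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def polynomial_kernel(a, b, degree=2):
    # Homogeneous polynomial kernel (assumption: no constant offset).
    return dot(a, b) ** degree


def rbf_kernel(a, b, gamma=0.5):
    # Similarity decays with squared distance; gamma is an assumed setting.
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


def phi(v):
    """Explicit degree-2 feature map for a 2D point (x1, x2)."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


a, b = (1.0, 2.0), (3.0, 4.0)
implicit = polynomial_kernel(a, b)   # computed in the original 2D space
explicit = dot(phi(a), phi(b))       # computed in the mapped 3D space
print(implicit, explicit)
print(rbf_kernel(a, a), rbf_kernel(a, b))
```

Both routes give the same value (up to floating-point rounding), but
the kernel never constructs the 3D vectors.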

Real-World Example: Handwritten Digit Recognition. SVM can
distinguish between a "2" and a "3" by finding the most distinct
"borderline" features of how those numbers are drawn.

2. Decision Trees

A Decision Tree acts like a flowchart for making decisions. It breaks
down a complex dataset into smaller, simpler groups by asking a
series of "Yes/No" questions.

Tree Structure:

Root Node: The very first question that splits the entire
dataset.

Internal Nodes: Intermediate questions based on specific
features.

Leaf Nodes: The final decision or category (e.g., "Purchase"
or "No Purchase").

How the Tree "Decides" to Split:

Entropy: A measure of uncertainty or "messiness." If a
group is 50% apples and 50% oranges, entropy is high. If it's
100% apples, entropy is zero.

Information Gain: The reduction in entropy after a split. The
tree chooses the question that reduces messiness the most.

Gini Index: Measures impurity as the probability that a randomly
chosen element would be labeled incorrectly if it were labeled at
random according to the class proportions in the group. It is
slightly cheaper to compute than entropy.
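
These three split measures are short enough to compute by hand or in
code. The sketch below uses the apples and oranges example from the
text; the function names are illustrative.

```python
import math
from collections import Counter


def proportions(labels):
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def entropy(labels):
    return sum(-p * math.log2(p) for p in proportions(labels))


def gini(labels):
    return 1.0 - sum(p * p for p in proportions(labels))


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


mixed = ["apple"] * 5 + ["orange"] * 5
pure = ["apple"] * 10
print(entropy(mixed), gini(mixed))   # maximally messy two-class group
print(entropy(pure), gini(pure))     # perfectly pure group

# A split that sends all apples left and all oranges right.
print(information_gain(mixed, [["apple"] * 5, ["orange"] * 5]))
```

A split that leaves both children as mixed as the parent would have
an information gain of zero, so the tree would not prefer it.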

Common Algorithms:

ID3: Uses Information Gain; cannot handle continuous numeric
features directly.

C4.5: An improved ID3 that handles continuous numbers and
uses pruning to reduce overfitting.

CART: Used for both Classification and Regression. It
creates simple binary (2-way) splits.

Real-World Example: Loan Approval. A bank uses a tree to
ask: "Is income > $50k?" then "Is credit score > 700?" and finally
reaches "Approve/Reject".

3. Ensemble Learning

Ensemble Learning is the practice of combining multiple models to
get a single, more accurate prediction. It follows the "wisdom of the
crowd" principle.

Bagging (Bootstrap Aggregating):

Trains many models independently and in parallel on
random subsets of data.

Bootstrap Sampling: This involves creating new datasets by
randomly picking rows from the original data with
replacement (meaning one row can appear multiple times in a
subset).

Out-of-Bag (OOB) Evaluation: About 1/3 of the rows (roughly 37%)
are left out of each bootstrap sample. This "extra" data
can be used to test the model's accuracy without needing a
separate validation set.

Random Forest: The most famous bagging algorithm. It
builds hundreds of decision trees and takes a majority vote
for the final answer. It is highly robust to noise, and some
implementations can also cope with missing data.
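
Bootstrap sampling and the out-of-bag fraction are easy to simulate.
This sketch draws one bootstrap sample of row indices (the row count
and seed are arbitrary choices) and compares the observed out-of-bag
share with the theoretical value (1 - 1/n)^n, which approaches
1/e, about 0.368, for large n.

```python
import random


def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


rng = random.Random(0)   # fixed seed so the run is repeatable
n_rows = 1000
sample = bootstrap_sample(n_rows, rng)

# Rows that were never drawn are "out of bag" for this sample.
out_of_bag = set(range(n_rows)) - set(sample)
oob_fraction = len(out_of_bag) / n_rows
expected_fraction = (1 - 1 / n_rows) ** n_rows

print(f"observed OOB fraction: {oob_fraction:.3f}")
print(f"expected OOB fraction: {expected_fraction:.3f}")
```

In a bagging ensemble, each model would be trained on its own sample
and could be scored on its own out-of-bag rows.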

Boosting:

Trains models sequentially (one after another). Each new
model focuses specifically on the "weakest students" (the
errors) of the previous model.

AdaBoost: Short for Adaptive Boosting. It starts by giving
equal weight to all data points. After each round, it increases
the weight of the misclassified points so the next model works
harder to get them right.

Gradient Boosting (GBM):

Instead of adjusting weights, it fits each new model to the
"Residual Error" (the gap between actual and predicted values)
left by the previous models, which amounts to Gradient Descent
on the loss function.

XGBoost (Extreme Gradient Boosting):

An optimized version of GBM that is faster and
uses Regularization to prevent the model from becoming
too complex and overfitting.

CatBoost:

Specifically built to handle Categorical Data (like city
names or colors) without needing heavy preprocessing like
one-hot encoding.

Real-World Example: Stock Market Forecasting. Instead of
relying on one financial indicator, an ensemble model combines
dozens of different strategies to make a safer, more accurate
prediction.
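
The AdaBoost reweighting step described above can be sketched for a
single round. This follows the classic binary AdaBoost update and
assumes the weak model's weighted error is strictly between 0 and 1;
the five data points and their correctness flags are made up.

```python
import math


def adaboost_round(weights, correct):
    """One reweighting round.

    weights: current point weights (summing to 1).
    correct: True where the weak model classified the point correctly.
    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    # alpha is the weak model's say in the final vote.
    alpha = 0.5 * math.log((1 - error) / error)
    # Shrink weights of correct points, grow weights of mistakes.
    updated = [
        w * math.exp(-alpha if ok else alpha)
        for w, ok in zip(weights, correct)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


weights = [0.2] * 5                       # equal weights to start
correct = [True, True, True, True, False]  # the model missed the last point
alpha, new_weights = adaboost_round(weights, correct)
print(alpha)
print(new_weights)
```

After the update the single misclassified point carries half of the
total weight, so the next weak model cannot ignore it.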

4. Hyperparameter Tuning

To make these advanced models work well, we must "tune" their
settings (hyperparameters).

Grid Search CV: An exhaustive search that tries every
combination of the settings you provide and keeps the one with the
best cross-validated score.

Randomized Search CV: Tries a fixed number of random combinations
of settings. It is usually much faster than Grid Search and often
finds a good solution with less effort.
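
The difference between the two search strategies can be shown without
any ML library. In this sketch, `toy_score` is a hypothetical
stand-in for a cross-validated score (a real workflow would train and
validate a model for each candidate), and the grid values are
arbitrary.

```python
import itertools
import random


def toy_score(max_depth, min_samples_split):
    # Hypothetical score surface that peaks at max_depth=5, min_samples_split=4.
    return 1.0 - 0.01 * (max_depth - 5) ** 2 - 0.005 * (min_samples_split - 4) ** 2


grid = {"max_depth": [3, 5, 7, 9], "min_samples_split": [2, 4, 8]}


def grid_search(grid, score):
    """Try every combination in the grid."""
    names = list(grid)
    candidates = [
        dict(zip(names, values)) for values in itertools.product(*grid.values())
    ]
    return max(candidates, key=lambda params: score(**params)), len(candidates)


def random_search(grid, score, n_iter, rng):
    """Try only n_iter randomly drawn combinations."""
    candidates = [
        {name: rng.choice(values) for name, values in grid.items()}
        for _ in range(n_iter)
    ]
    return max(candidates, key=lambda params: score(**params)), len(candidates)


best_grid, tried_grid = grid_search(grid, toy_score)
best_random, tried_random = random_search(grid, toy_score, 5, random.Random(0))
print(best_grid, tried_grid)
print(best_random, tried_random)
```

Grid search evaluates all 12 combinations here, while random search
evaluates only 5 and may or may not land on the best one.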

Analogy to solidify understanding:

SVM is like a security guard at a club who draws the widest
possible line on the floor to keep two rival groups as far apart as
possible.

Decision Trees are like a game of 20 Questions, where each
question narrows down the possibilities until the answer is
obvious.

Random Forest is like a jury: instead of one judge deciding a
case, 12 people vote, which reduces the chance of one person's
bias ruining the result.

Part 4: Deep Dive into Decision Tree Algorithms & Selection Measures

A decision tree isn't just one algorithm; it is a family of algorithms
that differ in how they choose to split data and what kind of data
they can handle.

1. Key Attribute Selection Measures

To build a tree, the model must decide which "question" is the best
to ask at each node. It uses two primary mathematical tools:

Information Gain (Used by ID3): It measures how much the
"uncertainty" (Entropy) decreases after a split.

Entropy is the measure of "messiness." If a group is a 50/50
mix of two classes, entropy is at its maximum (1.0); if it's 100%
one class, entropy is zero.

High Information Gain means the question successfully
separated the data into clearer, purer groups.

Gini Index (Used by CART): This measures how often a randomly
chosen element would be incorrectly labeled if it were labeled at
random according to the class proportions in the node.

A Gini of 0 means the group is "pure" (all one class).

Why use it? It is computationally cheaper than entropy because
it doesn't involve logarithms.

Real-World Example: If you are sorting healthy vs. sick
patients, a "pure" node would contain only healthy patients (or
only sick patients).

Feature | Gini Index | Entropy
Definition | Measures impurity based on squared probabilities | Measures impurity using information theory (logarithmic)
Formula | $Gini = 1 - \sum_i p_i^2$ | $Entropy = -\sum_i p_i \log_2(p_i)$
Range | 0 (pure) to 0.5 (max impurity for binary) | 0 (pure) to 1 (max impurity for binary)
Computational Cost | Lower (no logarithms) | Higher (uses logarithms)
Bias Toward Splits | Prefers larger class separation | More sensitive to class imbalance

2. Specialized Decision Tree Algorithms


Six specific types of trees are commonly distinguished, each with
unique advantages:

1. ID3 (Iterative Dichotomiser 3): The classic version. It uses
Information Gain but cannot handle continuous numbers (like
age or height) and is prone to overfitting (memorizing data too
closely).

2. C4.5: An upgrade to ID3. It uses a Gain Ratio to prevent bias
toward features with too many options. Crucially, it can handle
continuous data by sorting values and finding the best split
point.

3. CART (Classification and Regression Trees): A highly flexible
model that works for both labels and numbers. It builds binary
trees (only two branches per node) and uses Gini Impurity for
classification.

4. CHAID (Chi-Square Automatic Interaction Detection): Best for
large datasets with many categories. It uses statistical
Chi-Square tests to find the strongest relationship between
features and the target.

5. MARS (Multivariate Adaptive Regression Splines): An
extension of CART designed for non-linear relationships. It
uses "knots" to connect different linear segments, allowing the
model to follow complex curves in data.

6. Conditional Inference Trees: These use statistical hypothesis
testing to choose splits, which helps minimize bias and prevent
the model from picking features just because they have many
categories.
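
The "sort values and find the best split point" idea for continuous
features can be sketched as below. Note the swap: this example scores
candidate thresholds with weighted Gini impurity (the CART criterion)
rather than C4.5's Gain Ratio, simply because it is shorter to write.
The ages and labels are made up.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        # Candidate threshold: midpoint between neighboring sorted values.
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Hypothetical data: client age vs. whether the client defaulted.
ages = [22, 25, 30, 35, 40, 45]
defaulted = ["yes", "yes", "yes", "no", "no", "no"]
threshold, impurity = best_threshold(ages, defaulted)
print(threshold, impurity)
```

A full tree builder would repeat this search for every feature at
every node and split on the overall winner.
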
3. Critical Training Concepts

Pruning: After a tree is fully grown, it often becomes too
complex. Pruning involves cutting off branches that don't
significantly improve accuracy, which helps the model generalize
better to new data.

Hyperparameter Tuning (GridSearchCV): This is used to find
good settings, such as max_depth (how deep the tree
goes) or min_samples_split (how many items must be in a
group before it can be split again).

Real-World Example: In Credit Risk Assessment, a bank might
use a CART model to predict if a client will default. The tree splits
by Income, then by Credit Score, finally reaching a "Leaf Node" that
outputs a specific risk percentage.
