Supervised Machine Learning
Bagging
Random Forest
Prepared By
Archana
Bias
• The error from incorrect or overly simplistic assumptions in the learning algorithm.
• Impact: High bias can cause a model to miss relevant relationships between features and
the target, leading to underfitting.
• Example: Using a simple linear regression model to fit a highly complex, non-linear
dataset.
Variance
• The model's sensitivity to small fluctuations in the training data.
• Impact: High variance means the model is too complex, fitting the training data (and its
noise) too closely, which results in overfitting and poor performance on new data.
• Example: A very deep neural network that is trained on a small dataset and learns the
noise instead of the underlying patterns.
The bias-variance tradeoff
• The relationship: As you try to decrease bias (make the model more complex), variance
tends to increase. Conversely, as you decrease variance (simplify the model), bias tends
to increase.
• The goal: To find an optimal balance between bias and variance to build a model that has
the lowest total error and generalizes well to unseen data.
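For squared-error loss, the "total error" mentioned above is commonly written as the decomposition below. The symbols are assumptions introduced here for illustration: f is the true function, \hat{f} is the model fitted on a randomly drawn training set, and \sigma^2 is the variance of the noise that no model can remove.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Making the model more flexible usually shrinks the first term and grows the second, which is the tradeoff described above.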
Overfitting
• What it is: A model that is too complex and has learned the training data's noise and outliers.
• Performance: Performs very well on the training data but poorly on new, unseen data.
• Characteristics: High variance, low bias.
• Cause: Overly complex model, not enough training data.
• Analogy: A student who memorizes every answer to a practice test but cannot answer questions
on the actual exam because they didn't learn the underlying concepts.
• Example: A stock prediction model that fits historical data perfectly but fails to predict future
trends because it learned random fluctuations.
Underfitting
• What it is: A model that is too simple and fails to capture the underlying trend of the data.
• Performance: Performs poorly on both the training data and new, unseen data.
• Characteristics: High bias, low variance.
• Cause: Model is too simple, not enough training.
• Analogy: A student who only learns one or two very basic concepts and cannot solve any complex
problems.
• Example: An image recognition model that is too simple to distinguish between cats and dogs.
How to balance
• Use a more complex model if the model is underfitting.
• Use a simpler model or regularization if the model is overfitting.
• Increase training data to help the model learn more general patterns.
• Use early stopping to stop training before the model starts to learn
the noise.
• Increase the size and diversity of the training data to address overfitting.
• Adjust the number of features used in the model.
• Use cross-validation to get a more robust evaluation of the model's
performance.
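A minimal sketch of how underfitting and overfitting show up in practice, assuming scikit-learn is available. The synthetic dataset, the noise level and the tree depths are illustrative choices, not values from these slides.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y) so that memorizing hurts.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scores = {}
for depth in (1, 4, None):  # very simple, moderate, unrestricted
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={scores[depth][0]:.2f}, "
          f"test={scores[depth][1]:.2f}")
```

A large gap between the train and test score for the unrestricted tree is the typical sign of overfitting, while low scores on both for the depth-1 tree point to underfitting.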
Cross Validation in Machine Learning
Ensemble Learning
Cross-validation
It is a technique used to check how well a machine learning model performs on
unseen data. It splits the data into several parts, trains the model on some parts and
tests it on the remaining part, repeating this process multiple times. Finally, the results
from each validation step are averaged to produce a more reliable estimate of the
model's performance.
The main purpose of cross-validation is to detect overfitting and to estimate how well
the model generalizes. It does not prevent overfitting by itself, but if you want to make
sure your machine learning model is not just memorizing the training data and is
capable of adapting to real-world data, cross-validation is a commonly used technique.
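The split, train, test and average procedure described above can be sketched in a few lines, assuming scikit-learn. The Iris dataset and the decision tree are placeholders chosen only to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# cv=5: split into 5 parts, train on 4, test on the remaining 1, five times.
fold_scores = cross_val_score(model, X, y, cv=5)
mean_score = fold_scores.mean()

print("Score per fold:", fold_scores)
print("Average score:", round(mean_score, 3))
```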
Types of Cross-Validation
Holdout Validation
LOOCV (Leave One Out Cross Validation)
Stratified Cross-Validation
K-Fold Cross Validation
1. Holdout Validation
The Holdout Method is a fundamental validation technique in machine learning used to
evaluate the performance of a predictive model. In this method, the available dataset is split
into two mutually exclusive subsets, commonly a training set and a test set.
Typical split ratios include 70:30, 80:20 or 60:40 depending on dataset size.
A larger training set helps the model learn better patterns.
A larger test set provides a more reliable estimate of performance.
The holdout method is the simplest form of validation. It is simpler and faster than the other methods because the model is trained only once.
It is most effective when the dataset is large enough to allow meaningful splitting.
Random shuffling before splitting is often applied to reduce bias caused by the order of the records.
How it works
Split the data: The most common approach is to split the dataset into a training set and a testing set, with a
typical split being 70-80% for training and 20-30% for testing. Sometimes, a third "validation" set is created,
which is used to refine the model during iterative testing, while the test set is only used for the final,
one-time performance evaluation.
Train the model: The model learns patterns and trends from the training data.
Evaluate performance: The model's performance is then measured on the unseen holdout data by
comparing its predictions to the actual values.
Refine and finalize: If a validation set is used, the model can be iteratively refined using the validation data.
The final performance is then reported based on the final model's performance on the test set.
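The four steps above could look like this with scikit-learn's train_test_split. This is a sketch under assumptions: the 60/20/20 split, the breast cancer dataset and the candidate depths are illustrative and not prescribed by the method.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% as the final test set, then carve a validation set
# out of the remainder (0.25 of 80% = 20% of the full data).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

# Refine: keep the depth with the best validation score.
best_depth, best_val = None, -1.0
for depth in (2, 4, 6):
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=0)
    candidate.fit(X_train, y_train)
    val_score = candidate.score(X_val, y_val)
    if val_score > best_val:
        best_depth, best_val = depth, val_score

# Finalize: one-time evaluation on the untouched test set.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)
print(f"best depth={best_depth}, validation={best_val:.3f}, test={test_score:.3f}")
```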
Pros and cons
Pros:
1) Simple and fast to implement.
2) Easy to understand and execute.
Cons:
Dependence on the split: The performance estimate can be highly dependent on the specific random split. An
unrepresentative split can give an error estimate that is too high or too low.
Inefficient for small datasets: If the dataset is small, splitting it may leave the model with too little data to
train on, and the holdout set may not be representative.
Single evaluation: The model is only evaluated on one test set, which may not be a reliable indicator of
future performance.
2. LOOCV (Leave One Out Cross Validation)
In this method we train the model on the whole dataset except
for one data point, test on that held-out point, and then
repeat this for each data point.
In LOOCV the model is trained on n−1 samples and tested on
the one omitted sample, repeating this process for each data
point in the dataset (n rounds in total).
An advantage of this method is that almost all data points are
used for training in every round, so the estimate has low bias.
The major drawback of this method is that it leads to higher
variance in the performance estimate, as each round tests against
one data point. If that data point is an outlier, the result for
that round can be far off.
Another drawback is that it takes a lot of execution time, as the
model is retrained once for every data point we have.
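A short LOOCV sketch, assuming scikit-learn's LeaveOneOut splitter. The Iris data and the k-nearest-neighbours model are illustrative choices made because they train quickly.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

loo = LeaveOneOut()
n_rounds = loo.get_n_splits(X)  # one round per data point

# Each round trains on n-1 samples and tests on the single omitted one,
# so every individual score is either 0.0 (wrong) or 1.0 (right).
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=loo)

print("Rounds:", n_rounds)
print("LOOCV accuracy:", round(scores.mean(), 3))
```

Because every round refits the model, the cost grows with the number of samples, which is the execution-time drawback noted above.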
3. K-Fold Cross Validation
In K-Fold Cross Validation we split the dataset into k
subsets known as folds. We then train on k-1 of the folds and
leave the remaining one fold for the evaluation of the trained
model.
In this method, we iterate k times with a different fold reserved for
testing each time.
A value of k = 10 (or 5) is commonly suggested: a very low value
of k moves toward simple holdout validation, while a very high value
of k approaches the LOOCV method (k = n).
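To make the fold rotation concrete, the sketch below prints which sample indices land in the training and test part of each iteration. It assumes scikit-learn's KFold; the ten toy samples and k = 5 are chosen only so that the output stays readable.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # ten toy samples, indices 0..9
kf = KFold(n_splits=5, shuffle=True, random_state=0)

test_folds = []
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    test_folds.append(test_idx)
    print(f"Fold {i}: train={train_idx}, test={test_idx}")

# Every sample is used for testing exactly once across the k iterations.
tested = sorted(np.concatenate(test_folds).tolist())
```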
Fig: Example of K-Fold Cross Validation
4. Stratified Cross-Validation
It is a technique used in machine learning to ensure that each fold of the
cross-validation process maintains the same class distribution as the
entire dataset.
This is particularly important when dealing with imbalanced datasets
where certain classes may be under represented.
Here:
The dataset is divided into k folds while maintaining the proportion of
classes in each fold.
During each iteration, one fold is used for testing and the remaining
folds are used for training.
The process is repeated k times with each fold serving as the test set
exactly once.
Stratified Cross-Validation is essential when dealing with classification
problems where maintaining the balance of class distribution is crucial
for the model to generalize well to unseen data.
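The effect of stratification can be checked directly on the fold contents. The sketch below assumes scikit-learn's StratifiedKFold and uses a made-up 90/10 imbalanced label vector.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 90 samples of class 0, 10 samples of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # feature values do not matter for the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

minority_counts = []
for train_idx, test_idx in skf.split(X, y):
    minority_counts.append(int(y[test_idx].sum()))
    print(f"test size={len(test_idx)}, class-1 samples={minority_counts[-1]}")
```

Each test fold keeps the 10% share of class 1. A plain KFold without stratification gives no such guarantee, so some folds could end up with few or no minority samples.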
Ensemble Techniques in Machine Learning
Ensemble learning is a method where we use many small models
instead of just one.
Each of these models may not be very strong on its own, but when we
put their results together, we get a better and more accurate answer.
It's like asking a group of people for advice instead of just one person—
each one might be a little wrong, but together, they usually give a better
answer.
Types of Ensemble Learning in Machine Learning
Bagging (Bootstrap Aggregating)
Boosting
Stacking (Stacked Generalization)
Voting
Bagging or Bootstrap Aggregating
It works by training multiple base models independently and in parallel on
different random subsets of the training data.
These subsets are created using bootstrap sampling, where data points are
randomly selected with replacement, allowing some samples to appear multiple
times while others may be excluded.
•In classification tasks, the final prediction is decided by majority voting, the class
chosen by most base models.
•For regression tasks, predictions are averaged across all base models, known as
bagging regression.
•Bagging is versatile and can be applied with various base learners such as decision
trees, support vector machines or neural networks.
Bagging Algorithm
Bagging can be used for both classification and regression tasks.
Overview of the bagging classifier algorithm:
•Bootstrap Sampling: Creates ‘N’ new training sets by randomly
drawing rows from the original training data with replacement, so some
rows appear several times in a sample and others not at all. This step
ensures that the base models are trained on diverse subsets of the data.
It does not by itself correct class imbalance.
•Base Model Training: For each bootstrapped sample we train a
base model independently on that subset of data. These base
models can be trained in parallel to increase computational efficiency
and reduce time consumption. The base learner can be any ML model
(decision trees are the most common choice). Standard bagging uses
the same type of learner for every sample, so the variety and
robustness come from the different bootstrap samples.
•Prediction Aggregation: To make a prediction on testing data,
combine the predictions of all base models. For classification tasks
this can be majority voting or weighted majority voting, while for
regression it involves averaging the predictions.
•Out-of-Bag (OOB) Evaluation: Some samples are excluded from the
training subset of particular base models during the bootstrapping
method. These “out-of-bag” samples can be used to estimate the
model’s performance without the need for cross-validation.
•Final Prediction: After aggregating the predictions from all the base
models, Bagging produces a final prediction for each instance.
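The steps above can be sketched from scratch to show what happens inside a bagging classifier. This is an illustration under assumptions, not a production implementation: it uses NumPy and scikit-learn decision trees, a synthetic binary dataset, and 25 base models, all chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)
n_models, n = 25, len(X)

models = []
in_bag = np.zeros((n_models, n), dtype=bool)
for m in range(n_models):
    idx = rng.integers(0, n, size=n)  # bootstrap: draw n rows with replacement
    in_bag[m, idx] = True
    models.append(DecisionTreeClassifier(random_state=m).fit(X[idx], y[idx]))

# Aggregation: majority vote (labels are 0/1, so the mean vote is enough).
all_preds = np.array([model.predict(X) for model in models])  # (n_models, n)
ensemble_pred = (all_preds.mean(axis=0) > 0.5).astype(int)

# OOB evaluation: each sample is judged only by trees that never saw it.
# Ties (possible when a sample has an even number of OOB trees) count as 0.
oob_mask = ~in_bag
oob_counts = oob_mask.sum(axis=0)
has_oob = oob_counts > 0
oob_votes = (all_preds * oob_mask).sum(axis=0)
oob_pred = (oob_votes[has_oob] / oob_counts[has_oob] > 0.5).astype(int)
oob_accuracy = (oob_pred == y[has_oob]).mean()

print(f"Share of rows inside an average bootstrap sample: {in_bag.mean():.3f}")
print(f"OOB accuracy: {oob_accuracy:.3f}")
```

In practice, scikit-learn's BaggingClassifier with oob_score=True performs the same bookkeeping internally.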
Fig: Bagging Classifier explained
The following are some of the reasons for using bagging classifiers:
1. Reduces overfitting: Bagging is particularly effective in reducing overfitting, which is a common
problem in machine learning models. By training on various subsets of the data and then aggregating
the results, the bagging classifier creates a more generalized model.
2. Improves stability: The method enhances the stability of the machine learning models. Even if a
part of the data is noisy, the overall model is less affected because of the averaging or voting
process.
3. Handles high variance: It is especially beneficial for algorithms that have high variance. The
averaging of predictions across various models reduces this variance, leading to more reliable
predictions.
4. Parallel computation: The training of individual models in a bagging classifier can be done in
parallel, which speeds up the training process. This is particularly useful for large datasets.
5. Improves accuracy: By combining the strengths of multiple models, bagging often leads to an
improvement in prediction accuracy compared to individual models.
6. Robust to outliers: Since bagging involves training on different subsets of data, the overall model is
less sensitive to outliers than individual models might be.
Sampling techniques for bagging
A bagging classifier goes by one of the following names, depending on the sampling
technique used for creating training samples:
Pasting: When the random subsets of data are drawn
without replacement (bootstrap = False), the algorithm is called
Pasting.
Bagging: When the random subsets of data are drawn with replacement
(bootstrap = True), the algorithm is called Bagging. It is also called
bootstrap aggregation.
Random Subspace: When random subsets of features are drawn, the
algorithm is termed Random Subspace.
Random Patches: When random subsets of both samples and features are
drawn, the algorithm is termed Random Patches.
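The four variants map onto parameters of scikit-learn's BaggingClassifier, as sketched below. The fractions (0.7 for samples, 0.5 for features), the dataset and the number of estimators are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Keyword arguments that turn one estimator class into each variant.
variants = {
    "Pasting": dict(max_samples=0.7, bootstrap=False),
    "Bagging": dict(max_samples=1.0, bootstrap=True),
    "Random Subspace": dict(max_samples=1.0, bootstrap=False, max_features=0.5),
    "Random Patches": dict(max_samples=0.7, bootstrap=True, max_features=0.5),
}

results = {}
for name, params in variants.items():
    clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=20,
                            random_state=0, **params)
    results[name] = cross_val_score(clf, X, y, cv=3).mean()
    print(f"{name:16s} mean CV accuracy = {results[name]:.3f}")
```

Which variant scores best depends on the dataset, so the printed numbers should not be read as a general ranking.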
Boosting Algorithm
Models are trained one after another. Each new model focuses on fixing
the errors made by the previous ones. The final prediction is a weighted
combination of all models, which helps reduce bias and improve
accuracy.
Boosting is an ensemble technique that combines multiple weak
learners to create a strong learner. Weak models are trained in series
such that each next model tries to correct errors of the previous model,
until the training data is predicted correctly or a maximum number of
models is reached. One of the most
well-known boosting algorithms is AdaBoost (Adaptive Boosting).
Overview of Boosting algorithm:
•Initialize Weights: Begin with a single weak learner and assign
equal weights to all training examples.
•Train Weak Learner: Train the weak learner on this weighted dataset.
•Sequential Learning: Boosting works by training models sequentially
where each model focuses on correcting the errors of its predecessor.
Boosting typically uses a single type of weak learner like decision trees.
•Weight Adjustment: Boosting assigns weights to training datapoints.
Misclassified examples receive higher weights in the next iteration so
that next models pay more attention to them.
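The weighting loop above can be sketched as a small AdaBoost-style routine. This follows the classic discrete AdaBoost update as an assumption, since the slides do not spell out the formulas. The decision stumps, the 10 rounds and the synthetic data are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=200, n_features=5, random_state=1)
y = np.where(y01 == 1, 1, -1)  # AdaBoost convention: labels in {-1, +1}

n = len(y)
weights = np.full(n, 1.0 / n)  # start with equal weights for all examples
stumps, alphas = [], []

for _ in range(10):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)  # weak learner
    stump.fit(X, y, sample_weight=weights)
    miss = stump.predict(X) != y
    err = weights[miss].sum()  # weighted error of this learner
    if err == 0 or err >= 0.5:  # perfect, or no better than chance: stop
        break
    alpha = 0.5 * np.log((1 - err) / err)  # vote weight of this learner
    # Misclassified examples get heavier, correct ones get lighter.
    weights *= np.exp(np.where(miss, alpha, -alpha))
    weights /= weights.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the weighted sum of the weak learners' votes.
votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
train_accuracy = (np.sign(votes) == y).mean()
print(f"{len(stumps)} weak learners, training accuracy = {train_accuracy:.3f}")
```

For real use, scikit-learn provides AdaBoostClassifier, which wraps this kind of loop.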
Decision trees are a great starting point in machine learning — they’re clear and
make sense.
But there’s a catch: they often don’t work well when dealing with new data. The
predictions can be inconsistent and unreliable, which is a real problem when
you’re trying to build something useful.
This is where Random Forest comes in. It takes what’s good about decision trees
and makes them work better by combining multiple trees together.
It is a popular machine learning algorithm that merges the outputs of numerous decision
trees to produce a single outcome.
One of the most important features of the Random Forest Algorithm is that it can
handle the data set containing continuous variables, as in the case of regression,
and categorical variables, as in the case of classification.
The algorithm’s strength lies in its ability to handle complex datasets and mitigate
overfitting, making it a valuable tool for various predictive tasks in machine
learning.
Understanding of Random forest
Random Forest is built on the notion of ensemble learning.
As the name implies, Random Forest is a model that comprises a number of
decision trees trained on various subsets of the provided dataset and combines
their outputs to enhance predictive accuracy.
Instead of depending on a single decision tree, the random forest collects the
predictions from each tree and predicts the final output based on the majority
vote of predictions.
How does the Random Forest Algorithm Work?
Create Many Decision Trees: The algorithm makes many decision trees each using
a random part of the data. So every tree is a bit different.
Pick Random Features: Each individual decision tree is constructed using
random subsets of the features available in the dataset. This mechanism is known
as feature randomness and is closely related to the random subspace method. That is,
when a tree decides how to split the data at a node, it doesn’t look at all the
features (columns) at once. It picks a few at random and chooses the best split
among them. This helps the trees stay different from each other.
Each Tree Makes a Prediction: Every tree gives its own answer or prediction
based on what it learned from its part of the data.
Combine the Predictions: For classification, the final answer is the category
that most trees agree on, i.e. majority voting. For regression, the final
answer is the average of all the trees’ predictions.
Why It Works Well: Using random data and features for each tree helps avoid
overfitting and makes the overall prediction more accurate and trustworthy.
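The workflow above, sketched with scikit-learn's RandomForestClassifier. The Wine dataset and the parameter values are illustrative assumptions. Note that scikit-learn combines trees by averaging their predicted class probabilities, which usually, but not always, agrees with a hard majority vote.

```python
from collections import Counter

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # create many decision trees
    bootstrap=True,       # each tree sees a random sample of the rows
    max_features="sqrt",  # each split considers a random subset of features
    random_state=0,
)
forest.fit(X_train, y_train)

# Every tree makes its own prediction for the first test sample.
# (Sub-trees return class indices, which equal the labels 0, 1, 2 here.)
per_tree = [int(tree.predict(X_test[:1])[0]) for tree in forest.estimators_]
vote_counts = Counter(per_tree)

forest_pred = int(forest.predict(X_test[:1])[0])
test_accuracy = forest.score(X_test, y_test)

print("Votes per class:", dict(vote_counts))
print("Forest prediction:", forest_pred)
print(f"Test accuracy: {test_accuracy:.3f}")
```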
Key Features of Random Forest
•Handles Missing Data: Some implementations can work even if some data
is missing, so you don’t always need to fill in the gaps yourself. Others
require missing values to be filled in first.
•Shows Feature Importance: It tells you which features (columns) are
most useful for making predictions which helps you understand your
data better.
•Works Well with Big and Complex Data: It can handle large datasets
with many features, although training time and memory use grow with the
number of trees.
•Used for Different Tasks: You can use it for both classification like
predicting types or labels and regression like predicting numbers or
amounts.
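As a sketch of the feature importance point above, a fitted scikit-learn forest exposes impurity-based importances through its feature_importances_ attribute. The Iris dataset and the 200 trees are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(data.data, data.target)

# One score per feature (column); the scores sum to 1.
importances = forest.feature_importances_
ranked = sorted(zip(data.feature_names, importances),
                key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print(f"{name:20s} {score:.3f}")
```

These scores are relative to the fitted model, so they indicate which columns the forest relied on rather than proving any causal effect.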
Assumptions of Random Forest
•Each tree makes its own decisions: Every tree in the forest makes its
own predictions without relying on others.
•Random parts of the data are used: Each tree is built using random
samples and features to reduce mistakes.
•Enough data is needed: Sufficient data ensures the trees are different
and learn a variety of patterns.
•Different predictions improve accuracy: Combining the predictions
from different trees leads to a more accurate final result.
Bagging Vs Random Forest
Bagging
Bootstrap Aggregation (Bagging) is a technique that uses the concept of
bootstrap sampling to reduce the variance of a model by combining multiple
predictions.
In bagging, we create multiple subsets of the original dataset by sampling with
replacement.
Each subset is used to train a decision tree model, and the predictions of all
trees are averaged to produce the final output.
The main idea behind bagging is to reduce the variance of a model by
introducing randomness in the dataset.
By sampling with replacement, each subset has a slightly different distribution
than the original dataset, which introduces diversity in the decision trees.
Therefore, bagging can reduce overfitting and improve the accuracy of a model.
Reduce the variance of a model
• Reducing a model's variance means making its predictions more consistent
across different training datasets, which helps it to generalize better to new,
unseen data and prevents overfitting.
• High variance models are overly sensitive to the training data's noise, leading
to unstable predictions and poor performance on new examples. Techniques to
reduce variance include increasing the training data, simplifying the model,
applying regularization, and using ensemble methods.
• What is model variance?
• High Variance: A model with high variance produces very different predictions
when trained on different subsets of the data. It has likely "memorized" the
noise and specific details of the training data instead of the underlying
patterns.
• Low Variance: A model with low variance is more stable. Its predictions are
similar even if the training data changes slightly, indicating it has learned a
more general representation of the data.
• Why reduce variance?
• Prevent overfitting: High variance often indicates a model is overfitting the
training data. Reducing variance makes the model more robust and less likely to
perform poorly on new data.
• Improve generalization: A model with low variance is better at generalizing its
knowledge to make accurate predictions on data it has never seen before.
• Increase stability: By reducing variance, you create a more stable and reliable
model that is less sensitive to the fluctuations in the training set.
• How to reduce variance
• Increase the amount of training data: More data helps the model learn the
underlying structure rather than specific noise.
• Simplify the model: Use a less complex model, or prune complex ones like a
deep decision tree. This forces the model to find more general patterns.
• Use regularization: Techniques like L1 or L2 regularization add a penalty to the
model's complexity, discouraging it from becoming too sensitive to the training
data.
• Use ensemble methods: Combine the predictions of multiple models (e.g.,
bagging, random forests) to create a more stable and robust prediction
• In machine learning, being "overly sensitive to training data noise" means
that a model has learned the irrelevant details, random fluctuations, or
errors (noise) in its training data as if they were actual, meaningful
patterns or signals. This condition is known as overfitting and is
associated with high variance.
Key Implications
• Poor Generalization: The model performs very well on the specific
training data it has "memorized" but performs poorly when exposed to
new, unseen, real-world data because the "patterns" it learned from the
noise do not exist in the real world.
• Modeling Errors as Patterns: The model mistakes random errors or
outliers in the data for true underlying relationships.
• High Variability: The model's predictions would change significantly if
it were trained on a slightly different subset of the data because it's
tightly coupled to the specific quirks of the initial dataset.
Causes
• This usually happens when:
• The model is too complex for the amount or nature of the training
data.
• The training data set is too small or not diverse enough to represent
all possible scenarios.
• The model is trained for too long, eventually starting to learn the
noise instead of the generalizable signal
Random Forest
Random forest is a modification of bagging that further improves the
performance of the model by introducing randomness in the feature selection
process.
In random forest, we create multiple decision trees, each trained on a bootstrap
sample of the data. At each node of a tree, instead of considering all the features, we
randomly select a subset of features and choose the best split among them. This process
is repeated for each node, so every split is chosen from a different random subset of the features.
By using a subset of the features at each node, random forest introduces
diversity in the decision trees, which further reduces the variance of the
model.
Moreover, the feature selection process keeps the trees from being highly
correlated, which is a problem in plain bagging when a few strong features dominate every tree.
Therefore, the random forest can improve the accuracy of a model by reducing
overfitting and increasing the diversity of the trees.
Difference between Bagging and Random Forest
The main difference between bagging and random forest
lies in the way they introduce randomness.
Bagging introduces randomness only by sampling the rows with
replacement, while random forest does this and also
considers only a random subset of features at each split.
Bagging is a simple and effective technique that can
improve the accuracy of a model by reducing the variance.
However, bagging does not address the problem of highly
correlated trees, which can reduce the diversity of the
model.
Random forest, on the other hand, addresses the problem of
correlated trees by using a subset of features at each node.
This results in a diverse set of trees that can improve the accuracy
of a model by reducing overfitting and increasing the diversity of
the model.
Another practical difference between bagging and random forests is the
number of trees typically used in the model.
Both can use any number of trees, but random forest is
typically run with a large number of trees (hundreds or thousands) to
achieve better accuracy.
One reason is that the feature selection process makes each individual
tree less accurate, so a random forest may need more trees than bagging
before the combined prediction stabilizes.
Random Forest Hyperparameter Tuning in Python
While Random Forest is a robust model, fine-tuning its hyperparameters such as the
number of trees, maximum depth and feature selection can improve its predictive
performance.
1. n_estimators: It defines the number of trees in the forest. More trees typically
improve model performance but increase computational cost.
By default: n_estimators=100
2. max_features: Limits the number of features to consider when splitting a node.
This helps control overfitting.
By default: max_features="sqrt" for RandomForestClassifier (available: "sqrt", "log2", None, or an int or float)
sqrt: Selects the square root of the total features. This is a common setting to reduce
overfitting and speed up the model.
log2: Selects the base-2 logarithm of the total number of features. For datasets with
more than 16 features this is fewer features per split than the square root option, which
gives more randomness and can reduce overfitting further.
None: If None is chosen the model uses all available features for splitting each node.
This increases the model's complexity and may cause overfitting, especially with many features.
3. max_depth: Controls the maximum depth of each tree. A shallow tree may
underfit while a deep tree may overfit, so choosing the right value is important.
By default: max_depth=None
max_depth=None means that there is no predefined limit on the maximum
depth the tree can grow. The tree will expand until one of the following
conditions is met:
All leaves are pure: This means that every leaf node in the tree contains only
samples belonging to a single class (in classification) or all samples in the leaf
have the same target value (in regression).
All leaves contain less than min_samples_split samples: The min_samples_split
parameter defines the minimum number of samples required to split an internal
node. If a node contains fewer samples than this threshold, it will not be split
further, and it becomes a leaf node.
4. max_leaf_nodes: Limits the number of leaf nodes in the tree, hence
controlling its size and complexity. None means an unlimited number of
leaf nodes. By default: max_leaf_nodes = None
5. max_samples: Apart from the features, we have a large set of training
data. max_samples determines how much of the dataset is given to each
individual tree when bootstrap sampling is used. None means X.shape[0] samples are drawn.
By default: max_samples = None
6. min_samples_split: Specifies the minimum number of samples required to
split an internal node.
By default: min_samples_split = 2
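A minimal tuning sketch that searches over several of the hyperparameters listed above, assuming scikit-learn's GridSearchCV. The candidate values, the fixed max_samples=0.8, the dataset and the 3-fold setting are illustrative and should be adapted to the problem at hand.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Candidate values for a few of the hyperparameters described above.
param_grid = {
    "n_estimators": [25, 50],
    "max_features": ["sqrt", "log2"],
    "max_depth": [None, 5],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    RandomForestClassifier(max_samples=0.8, random_state=0),
    param_grid,
    cv=3,  # every combination is scored with 3-fold cross-validation
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.3f}")
```

For larger grids, RandomizedSearchCV samples a fixed number of combinations instead of trying all of them, which keeps the run time bounded.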
THANK YOU