UNIT-6
Ensemble Learning
Study Guide
Dr. Vinod Patidar
Associate Professor
CSE Department, PIT
Parul University
Introduction to Ensemble Learning-
Ensemble learning is a machine learning technique where, instead of relying on a single model, you
combine the predictions from multiple models to arrive at a final, more accurate, and more stable
prediction.
The core principle is rooted in the idea of the "wisdom of the crowd." A single expert might be
wrong, but the collective answer from a large, diverse group of experts is likely to be closer to the
truth. In machine learning, these "experts" are individual models, often called "weak learners." A
weak learner is a model that performs only slightly better than random guessing (e.g., a simple
decision tree with only one or two splits).
By combining hundreds or thousands of these weak learners in a sophisticated way, they can form
an incredibly powerful "strong learner." This approach is effective because different models make
different types of errors; by combining their outputs, the individual errors tend to cancel each other
out. This enhances the model's ability to generalize—to make accurate predictions on new, unseen
data—and makes the final model more robust against noise.
This technique is a versatile framework that can be applied to:
● Classification: Predicting a category (e.g., spam or not spam).
● Regression: Predicting a continuous value (e.g., house price).
● Anomaly Detection: Identifying unusual data points (e.g., fraudulent transactions).
Why Ensemble Learning is Important
Ensemble methods are a cornerstone of modern machine learning for several compelling reasons:
● Improved Predictive Performance: This is the primary benefit. Ensembles are specifically
designed to reduce the two main sources of error in machine learning:
o Bias: Error from overly simplistic assumptions (underfitting).
o Variance: Error from sensitivity to small fluctuations in the training data (overfitting).
● Reduction of Overfitting & Robustness: A single complex model can easily overfit by learning
the noise in the training data. Ensembles combat this by averaging out the predictions of many
models. This smooths out individual peculiarities, making the overall model more stable, reliable,
and less sensitive to outliers.
● Handling Model Diversity (Versatility): No single algorithm is best for every problem.
Ensembles allow you to combine fundamentally different types of models (e.g., a decision tree, a
support vector machine, and a neural network) in a process called stacking. This leverages the
unique strengths of each algorithm.
● Risk Mitigation: In high-stakes fields like medical diagnosis or financial trading, relying on a
single model is risky. An ensemble acts like a "committee" of models, making the final decision more
democratic, dependable, and less prone to failure from a single point of weakness.
● State-of-the-Art Performance: Ensemble methods are frequently used to win data science
competitions and build top-tier, state-of-the-art systems.
Common Ensemble Learning Methods
There are three main families of ensemble techniques:
1. Bagging (Bootstrap Aggregating): Focuses on reducing variance. It involves training multiple
models in parallel on different random subsets of the data and then averaging their predictions.
The Random Forest is the most famous example.
2. Boosting: Focuses on reducing bias. It builds models sequentially, where each new model is
trained to correct the errors made by the previous ones. AdaBoost and Gradient Boosting (like
XGBoost) are the most well-known examples.
3. Stacking (Stacked Generalization): This advanced technique combines diverse models. It
trains several different base models and then uses a "meta-model" to learn how to best combine
their predictions.
1. Bagging (Bootstrap Aggregating)
Bagging is an ensemble method designed primarily to combat overfitting by reducing the variance
of a model.
How Bagging Works
1. Bootstrapping: From an original dataset of 'N' samples, you create many new datasets, also
of size 'N', by sampling with replacement. This means some data points may be chosen multiple
times, and some (on average, 36.8%) may not be chosen at all. The data points left out are called
"Out-of-Bag" (OOB) samples and can be used for validation.
2. Parallel Training: A base learning algorithm (e.g., a decision tree) is trained independently on
each of these bootstrap samples. Since each model sees slightly different data, each will learn
slightly different patterns.
3. Aggregation: Once all models are trained, their predictions are combined.
o Classification: A majority vote is taken.
o Regression: The average of all predictions is taken.
A classic example of bagging is the Random Forest algorithm, which combines the predictions from
multiple decision trees.
Applications of Bagging
● IT: Improving network intrusion detection systems.
● Environment: Mapping wetland patterns in remote sensing.
● Finance: Fraud detection and credit risk assessment.
● Healthcare: Predicting medical conditions and for bioinformatics.
MACHINE LEARNING 2
Advantages and Disadvantages of Bagging
Advantages Disadvantages
Variance Reduction: Highly effective at
Increased Model Complexity: The ensemble can
reducing variance and preventing
be complex and difficult to interpret.
overfitting.
Limited Improvement for Stable Models: Offers
Easier Implementation: Libraries like
little benefit if the base model already has low
Scikit-learn make it simple to implement.
variance.
Robustness: Averaging makes the model Slower Training: Training multiple models can be
more stable and robust to noise. computationally expensive.
2. Boosting
Boosting trains models sequentially, with each model in the sequence focusing on fixing the
mistakes of the one before it. The goal is to convert a collection of weak learners into a single, highly
accurate strong learner.
How Boosting Works
1. Initial Model: A simple weak learner is trained on the original dataset. Initially, all data points
are given equal importance or "weight."
2. Identify Errors and Re-weigh: The model's predictions are compared to the actual outcomes.
The data points that the model misclassified are given increased weights.
3. Train the Next Model: A second weak learner is trained, but this time the training process is
influenced by the updated weights, forcing it to focus on the difficult, misclassified examples.
4. Repeat: This process is repeated for a specified number of iterations, with each new model
focusing on the errors left over by the ensemble so far.
5. Final Aggregation: The final prediction is a weighted sum of the predictions from all the
models. Models that performed better are given more say in the final decision.
Examples of boosting algorithms include AdaBoost (Adaptive Boosting) and Gradient Boosting
(e.g., XGBoost, LightGBM).
Applications of Boosting
● Healthcare: Identifying patients at risk for diseases or predicting cancer survival rates.
● IT: Used by search engines for page ranking and in image retrieval (e.g., the Viola-Jones
algorithm).
● Finance: Deployed for automated fraud detection and pricing analysis.
MACHINE LEARNING 3
Advantages and Disadvantages of Boosting
Advantages Disadvantages
Improved Accuracy: Combines weak models to Complex: The process of re-weighting and
create a final model with high accuracy, sequential training can be algorithmically
effectively reducing bias. complex.
Better Handling of Imbalanced Data: It Dependency: Each model is dependent on
naturally focuses on the harder-to-classify the previous one, which can cause errors to
(often minority class) data points. propagate.
Robustness to Overfitting (with care): While it Computationally Intensive: The sequential
can overfit if run for too many iterations, it's nature makes it difficult to parallelize, which
often more robust than a single model. can slow down training.
3. Stacking (Stacked Generalization)
Stacking is an advanced ensemble technique that combines the predictions from multiple different
types of models.
How Stacking Works
1. Train Base Models: Several different models (e.g., a Random Forest, a neural network, a
support vector machine) are trained on the full training dataset.
2. Create Meta-Dataset: The predictions from these base models are then used as input features
to train a new, final model.
3. Train Meta-Model: This final model, called the "meta-model" or "blender," learns how to best
combine the predictions from the base models to make the final prediction.
This approach leverages the unique strengths of each base algorithm, allowing the ensemble to
capture a much richer set of patterns.
Bagging vs. Boosting: A Comparison
Feature Bagging (e.g., Random Forest) Boosting (e.g., AdaBoost, XGBoost)
Sequential: Each model is trained
Training Parallel: All models are trained
one after another, as it depends on
Process independently and simultaneously.
the previous one's errors.
Reduce Variance: Solves the
Primary Reduce Bias: Solves the underfitting
overfitting problem by averaging
Goal problem by focusing on errors.
predictions.
Performance-Based Weight: Models
Model Equal Weight: Every base model gets
are weighted based on their
Weighting an equal vote or say.
accuracy.
Data Bootstrap Sampling: Each model is Weighted Sampling: The entire
Sampling trained on a random subset of data. dataset is used, but the weights of
MACHINE LEARNING 4
Feature Bagging (e.g., Random Forest) Boosting (e.g., AdaBoost, XGBoost)
samples are adjusted at each step.
Best for complex, high-variance
Best for simple, high-bias models (like
When to Use models (like deep decision trees)
shallow decision trees or "stumps").
that are prone to overfitting.
Deep Dive into Specific Algorithms
Random Forest
Random Forest is the most popular implementation of the bagging technique. It's an ensemble of
decision trees with an extra twist to ensure the trees are diverse.
● How it Works: Like standard bagging, it builds many trees on bootstrap samples. However, it
also introduces randomness into the tree-building process itself. At each node, when deciding which
feature to split on, a regular decision tree considers all features. A Random Forest only considers a
random subset of the features.
● Why it Works: This dual randomness (in both samples and features) is crucial. It ensures the
trees in the forest are not highly correlated. By forcing each tree to be different, the Random Forest
maximizes the variance-reduction benefits of bagging.
● Key Hyperparameters:
o n_estimators: The number of trees in the forest.
o max_features: The size of the random subset of features to consider at each split.
o max_depth: The maximum depth of each tree.
o oob_score: Uses the "out-of-bag" samples for cross-validation to evaluate performance.
AdaBoost (Adaptive Boosting)
AdaBoost is the original boosting algorithm. Its name stands for Adaptive Boosting because it
adaptively adjusts the weights of the data samples at each iteration.
● The Algorithm:
1. Initialize Weights: Start by assigning an equal weight to every sample.
2. Iterate:
a. Train a weak learner (e.g., a decision stump) on the weighted data.
b. Calculate the model's error rate (the sum of weights of misclassified samples).
c. Calculate the model's "say" or weight in the final ensemble (lower error = higher weight).
d. Update Sample Weights: Increase the weights of misclassified samples and decrease the weights
of correct ones.
3. Final Prediction: The final prediction is a weighted vote of all the weak learners.
MACHINE LEARNING 5
● Key Vulnerability: AdaBoost is sensitive to noisy data and outliers. Because it aggressively
increases the weight of misclassified samples, an outlier can attract a huge amount of weight,
distorting subsequent models.
XGBoost (Extreme Gradient Boosting)
XGBoost is a modern, highly optimized implementation of the gradient boosting framework. At each
step, a new model is trained to predict the residual errors (the difference between the actual values
and the current ensemble's predictions) of the previous ensemble.
● Key Features:
o Regularization: Includes L1 (Lasso) and L2 (Ridge) regularization terms to discourage
complexity and prevent overfitting.
o Hardware Optimization (Parallel Processing): While the boosting process is sequential, the
training of each individual tree can be parallelized, making it exceptionally fast.
o Built-in Handling of Missing Values: Can learn a default direction for missing values during
training, so you don't need to impute them first.
o Tree Pruning: Employs advanced pruning techniques to remove splits that don't provide a
positive gain, further controlling complexity.
o Built-in Cross-Validation: Allows you to run cross-validation at each iteration to find the
optimal number of trees.
o Feature Importance: Can provide scores indicating the relative importance of each feature.
MACHINE LEARNING 6
MACHINE LEARNING 1