Ensemble Methods and Random Forest

Uploaded by

Ashu Kumar
Introduction
• Random Forest is a widely-used machine learning
algorithm developed by Leo Breiman and Adele
Cutler, which combines the output of multiple
decision trees to reach a single result.
• It handles both classification and regression
problems.
• A Random Forest is like a group decision-making
team in machine learning. It combines the
opinions of many “trees” (individual models) to
make better predictions, creating a more robust
and accurate overall model.
Ensemble Learning
• Ensemble simply means combining multiple
models.
• Thus a collection of models is used to make
predictions rather than an individual model.
Types of Ensemble
• Two widely used families of ensemble methods are:
– Bagging
– Boosting
• Bagging: It draws different training subsets from the
training data, sampling with replacement, trains one
model per subset, and bases the final output on
majority voting. For example, Random Forest.
• Boosting: It combines weak learners into a strong
learner by building models sequentially, with each new
model trying to correct the errors of the previous
ones. For example, AdaBoost, XGBoost.
Bagging
• Bagging, also known as Bootstrap Aggregation, serves as the ensemble technique
in the Random Forest algorithm. Here are the steps involved in Bagging:
• Selection of Subset: Bagging starts by choosing a random sample, or subset, from
the entire dataset.
• Bootstrap Sampling: Each model is then created from these samples, called
Bootstrap Samples, which are taken from the original data with replacement. This
process is known as row sampling.
• Bootstrapping: The step of row sampling with replacement is referred to as
bootstrapping.
• Independent Model Training: Each model is trained independently on its
corresponding Bootstrap Sample. This training process generates results for each
model.
• Majority Voting: The final output is determined by combining the results of all
models through majority voting. The most commonly predicted outcome among
the models is selected (for regression, the results are averaged instead).
• Aggregation: This step, which involves combining all the results and generating the
final output based on majority voting, is known as aggregation.
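The steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the tiny data set, the nearest-neighbour base model, and all function names are assumptions made for the example, not part of the original slides.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    # Row sampling with replacement; the sample has as many rows as the data.
    return [rng.choice(rows) for _ in range(len(rows))]

def train_nearest_neighbour(sample):
    # Toy base model (an assumption for illustration): predict the label
    # of the closest training point in this model's bootstrap sample.
    def predict(x):
        nearest = min(sample, key=lambda row: abs(row[0] - x))
        return nearest[1]
    return predict

def bagging_predict(models, x):
    # Aggregation: majority vote over the independently trained models.
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
# Made-up data: (feature value, label).
data = [(1.0, "sad"), (1.5, "sad"), (2.0, "sad"),
        (6.0, "happy"), (6.5, "happy"), (7.0, "happy")]

samples = [bootstrap_sample(data, rng) for _ in range(3)]
models = [train_nearest_neighbour(s) for s in samples]
print(bagging_predict(models, 6.2))
```

Because rows are drawn with replacement, a bootstrap sample usually repeats some rows and leaves others out, which is what makes the three models differ from one another.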
Bagging Example
• Now let’s look at an example based on the following
figure. The bootstrap samples (Bootstrap sample 01,
Bootstrap sample 02, and Bootstrap sample 03) are
drawn from the actual data with replacement, which
means each sample is likely to contain repeated
records. The models (Model 01, Model 02, and Model 03)
are each trained independently on their own bootstrap
sample. Each model generates a result as shown. The
Happy emoji receives more votes than the Sad emoji,
so by majority voting the final output is the Happy
emoji.
Figure: Bagging (bootstrap samples, independently trained models, majority vote)
Boosting

• Boosting is one of the techniques that use the
concept of ensemble learning.
• A boosting algorithm combines multiple
simple models (also known as weak learners
or base estimators) to generate the final
output.
• The weak models are built in series, with each
new model trying to correct the errors made by
the models before it.
AdaBoost
• There are several boosting algorithms;
AdaBoost was the first really successful
boosting algorithm and was developed for
binary classification. AdaBoost is short for
Adaptive Boosting and is a widely used
boosting technique that combines multiple
“weak classifiers” into a single
“strong classifier.”
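To make the "adaptive" part concrete, here is a hedged sketch of a single reweighting round in the standard discrete AdaBoost formulation with labels in {-1, +1}. The function name, the example labels, and the weak classifier's predictions are made up for illustration; the slides do not specify these details.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost reweighting round for labels in {-1, +1}."""
    # Weighted error of the weak classifier (weights are assumed to sum to 1).
    error = sum(w for w, y, p in zip(weights, y_true, y_pred) if y != p)
    # Vote strength of this weak classifier; assumes 0 < error < 1.
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified points (y * p == -1) gain weight, correct ones lose weight.
    updated = [w * math.exp(-alpha * y * p)
               for w, y, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]   # hypothetical weak classifier: one mistake
weights = [0.2] * 5           # start with uniform weights

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha, new_weights)
```

In this example the single misclassified point ends up carrying about half of the total weight, so the next weak classifier in the series is pushed to get that point right.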
Other Boosting Algorithms
• GBM
• LightGBM
• XGBoost
• CatBoost
Working of Random Forest Algorithm

• Step 1: In the Random Forest model, a subset of data
points and a subset of features is selected for
constructing each decision tree. Simply put, n random
records (drawn with replacement) and m features are
taken from a data set having k records. In Breiman's
formulation, the random feature subset is drawn again
at every split rather than once per tree.
• Step 2: An individual decision tree is constructed for
each sample.
• Step 3: Each decision tree generates an output.
• Step 4: The final output is based on majority voting
for classification and on averaging for regression.
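The sampling in Step 1 and the aggregation in Step 4 can be sketched as follows. This follows the simplified description in the steps (one feature subset per tree) and leaves out the tree building itself; the function names and the sizes k = 10, m = 2 are illustrative assumptions.

```python
import random
from collections import Counter

def draw_tree_inputs(n_records, n_features, n, m, rng):
    """Pick n row indices (with replacement) and m feature indices (without)."""
    rows = [rng.randrange(n_records) for _ in range(n)]
    features = rng.sample(range(n_features), m)
    return rows, features

def aggregate(outputs, task):
    """Step 4: majority vote for classification, mean for regression."""
    if task == "classification":
        return Counter(outputs).most_common(1)[0][0]
    return sum(outputs) / len(outputs)

rng = random.Random(42)
k, total_features = 10, 5            # k records, 5 features (made-up sizes)
for tree_id in range(3):
    rows, features = draw_tree_inputs(k, total_features, n=10, m=2, rng=rng)
    print(tree_id, sorted(rows), sorted(features))

print(aggregate(["apple", "banana", "apple"], "classification"))  # apple
print(aggregate([2.0, 3.0, 4.0], "regression"))                   # 3.0
```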
Random Forest Example
• Consider the fruit basket as the data, as shown in
the figure below. A number of samples are taken
from the fruit basket, and an individual decision
tree is constructed for each sample. Each decision
tree generates an output, as shown in the figure.
The final output is based on majority voting. In
the figure, most of the decision trees output
apple rather than banana, so the final output is
apple.
Important Features of Random Forest

• Diversity: Not all attributes/variables/features are considered while
making an individual tree, so each tree is different.
• Less affected by the curse of dimensionality: Since each tree does not
consider all the features, the feature space each tree works in is reduced.
• Parallelization: Each tree is created independently from different
data and attributes, so tree building can be spread across all
available CPU cores.
• Train-Test split: Each tree is trained on a bootstrap sample, so roughly
one-third of the data (about 37%) is not seen by that tree. These
out-of-bag records can be used to estimate performance without setting
aside a separate validation set, although a held-out test set is still
advisable for a final check.
• Stability: The result is based on majority voting/averaging over many
trees, so it is less sensitive to any single tree.
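The share of data that a single tree never sees can be checked numerically. The sketch below assumes each tree draws a bootstrap sample of the same size as the data set (n draws with replacement from n rows); under that assumption the unseen share is (1 - 1/n)^n, which approaches 1/e, or about 0.368, as n grows. The function names are illustrative.

```python
import random

def oob_fraction_exact(n):
    # Probability that a given row is never drawn in n draws with replacement.
    return (1.0 - 1.0 / n) ** n

def oob_fraction_simulated(n, rng):
    # Draw one bootstrap sample and measure the share of rows left out.
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1.0 - len(drawn) / n

rng = random.Random(0)
for n in (10, 100, 10_000):
    print(n, round(oob_fraction_exact(n), 4),
          round(oob_fraction_simulated(n, rng), 4))
```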
Difference Between Decision Tree and Random Forest

Decision trees
1. A decision tree normally suffers from overfitting if it is allowed to
grow without any control.
2. A single decision tree is faster in computation.
3. When a data set with features is taken as input by a decision tree, it
formulates a set of rules to make predictions.

Random Forest
1. Random forests are built from subsets of the data, and the final output
is based on averaging or majority voting; hence overfitting is reduced.
2. It is comparatively slower.
3. A random forest randomly selects observations, builds a decision tree on
each selection, and takes the average (or majority) result.
Important Hyperparameters in Random Forest
• Hyperparameters are used in random forests
either to enhance the performance and
predictive power of the model or to make the
model faster.
Hyperparameters to Increase the Predictive Power
• n_estimators: Number of trees the algorithm
builds before voting on or averaging the predictions.
• max_features: Maximum number of features the
random forest considers when splitting a node.
• min_samples_leaf: Minimum number of samples
that must remain in a leaf node; a split is only
allowed if each resulting leaf has at least this many.
• criterion: The measure used to judge the quality
of a split in each tree (Entropy/Gini impurity/Log Loss).
• max_leaf_nodes: Maximum number of leaf nodes in each
tree.
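As an illustration of what the criterion option measures, the sketch below computes Gini impurity and entropy for the class labels at a node, using the standard textbook definitions. The helper names are made up for the example.

```python
import math
from collections import Counter

def class_proportions(labels):
    # Share of each class among the labels at a node.
    counts = Counter(labels)
    return [c / len(labels) for c in counts.values()]

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    # Shannon entropy in bits.
    return -sum(p * math.log2(p) for p in class_proportions(labels))

print(gini(["a", "a", "b", "b"]))     # 0.5
print(entropy(["a", "a", "b", "b"]))  # 1.0
print(gini(["a", "a", "a", "a"]))     # 0.0
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, so a split is preferred when it lowers them in the child nodes.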
Hyperparameters to Increase the Speed
• n_jobs: Tells the engine how many processors it is
allowed to use. If the value is 1, it can use only one
processor; if the value is -1, it uses all available processors.
• random_state: Controls the randomness of the sampling.
The model will always produce the same results if it
is given a fixed random_state value together with the
same hyperparameters and training data.
• oob_score: OOB means out of bag. It is a built-in
validation method for random forests. Roughly one-third
of the samples are left out of each tree's bootstrap
sample and are used to evaluate performance instead of
training that tree. These samples are called
out-of-bag samples.
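The hyperparameter names above match those of scikit-learn's RandomForestClassifier, although the slides do not name a library. Assuming scikit-learn is the intended library, a minimal sketch of setting them might look like the following; the synthetic data set and the specific values are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_features="sqrt",   # features considered at each split
    min_samples_leaf=1,    # minimum samples allowed in a leaf
    criterion="gini",      # split quality measure
    max_leaf_nodes=None,   # no cap on leaves per tree
    n_jobs=-1,             # use all available processors
    random_state=42,       # reproducible results
    oob_score=True,        # estimate accuracy from out-of-bag samples
)
forest.fit(X, y)

# Accuracy estimated only from rows each tree did not see during training.
print(forest.oob_score_)
```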