GATE Question Bank: ML Decision Trees & Bias

This document is a comprehensive question bank for advanced preparation for the GATE Data Science and AI exam, focusing on Decision Trees, Bias-Variance Trade-off, and Cross-Validation. It includes multiple-choice, multiple-select, and numerical answer type questions designed to deepen understanding and application of key concepts, along with detailed explanations in the answer key. The structure aims to facilitate self-assessment and reinforce learning through a variety of question formats reflecting the actual exam pattern.

Uploaded by Manish Kanwa
Copyright: © All Rights Reserved

GATE-Level Question Bank: Decision Trees, Bias-Variance, and Cross-Validation

Preface: A Guide to Using This Question Bank

This document is a professional-grade question bank designed for advanced preparation for
competitive examinations such as the GATE (Graduate Aptitude Test in Engineering) Data
Science and Artificial Intelligence paper. The questions are meticulously crafted to test not
just rote memorization of concepts but a deeper, more nuanced understanding of their
application, limitations, and interconnections. The topics covered—Decision Trees, the
Bias-Variance Trade-off, and Cross-Validation—are foundational pillars of machine learning
and are frequently assessed at a high level of detail.

The structure is a direct response to the requirements of serious aspirants. The questions are
presented as a comprehensive workbook, allowing for a focused self-assessment session.
This is followed by a detailed answer key. The importance of the answer key cannot be
overstated; it is not merely a list of correct options. Each solution provides an in-depth
explanation, often including the underlying theory, mathematical derivations for numerical
problems, and a discussion of common pitfalls. This approach is intended to transform this
workbook from a simple test into a powerful learning tool, helping to diagnose weaknesses
and reinforce a robust understanding.

The questions are classified into three types, mirroring the GATE exam pattern: Multiple
Choice Questions (MCQs), Multiple Select Questions (MSQs), and Numerical Answer Type
(NAT) questions. MSQs require the selection of one or more correct options, and NATs require
a numerical response. This format is crucial for preparing for the actual examination
environment.[1]

Section 1: The Bias-Variance Trade-off


This section focuses on the fundamental concepts of model error, exploring the relationship
between model complexity and performance on both training and unseen data. The questions
progress from foundational definitions to advanced conceptual and mathematical problems.
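
As a warm-up for this section, the following is a minimal simulation sketch, not part of the original workbook. It assumes a hypothetical estimator (a sample mean shrunk by a constant factor) and hypothetical values for the true parameter, noise level, and sample size. It estimates bias and variance empirically and checks that the squared error around the true value splits into squared bias plus variance. No irreducible-noise term appears because the target here is a fixed parameter rather than a noisy observation.

```python
import random
import statistics


def simulate_shrunk_mean(theta=10.0, sigma=2.0, n=5, shrink=0.9,
                         trials=20000, seed=0):
    """Empirically estimate bias, variance and MSE of shrink * sample_mean.

    All default values are illustrative assumptions, not workbook data.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        # Draw a fresh training sample and compute the shrunk estimate.
        sample = [rng.gauss(theta, sigma) for _ in range(n)]
        estimates.append(shrink * statistics.fmean(sample))

    bias = statistics.fmean(estimates) - theta       # E[estimate] - truth
    variance = statistics.pvariance(estimates)       # spread across samples
    mse = statistics.fmean([(e - theta) ** 2 for e in estimates])
    return bias, variance, mse


bias, variance, mse = simulate_shrunk_mean()
print(f"bias^2 + variance = {bias ** 2 + variance:.4f}")
print(f"mse               = {mse:.4f}")
```

Changing `shrink` toward 1.0 in this sketch lowers the bias and raises the variance, which is the trade-off the questions below examine.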

Question Bank 1.1: MCQs

Q1.1.1 Which of the following statements most accurately defines the concept of bias in a
machine learning model?
(A) The model's sensitivity to small fluctuations in the training data.
(B) The systematic error caused by overly simplistic assumptions about the data.
(C) The error caused by inherent randomness or noise in the data.
(D) The difference between a model's predictions and the true values on a test set.
Q1.1.2 A linear regression model is applied to a dataset where the true relationship between
features and the target is a complex, non-linear sine curve. This is a classic example of a
model with:
(A) Low bias and high variance.
(B) High bias and low variance.
(C) Low bias and low variance.
(D) High bias and high variance.
Q1.1.3 A machine learning model's training error is high, and its test error is also high and
nearly identical to the training error. This pattern is indicative of:
(A) An overfit model.
(B) An underfit model.
(C) A well-generalized model.
(D) A model with low bias and low variance.
Q1.1.4 Which of the following is a direct consequence of a model having very high variance?
(A) The model will consistently make predictions far from the true values.
(B) The model will fail to capture the underlying patterns in the data.
(C) The model will perform poorly on unseen data despite performing well on training data.
(D) The model's predictions will be stable across different training datasets.
Q1.1.5 In the context of the Mean Squared Error (MSE) decomposition, which component is
considered the irreducible error?
(A) The error due to model complexity.
(B) The squared bias of the model.
(C) The variance of the model.
(D) The inherent noise in the data that no model can eliminate.
Q1.1.6 To mitigate a high bias problem, a data scientist should generally:
(A) Use an L2 regularization penalty.
(B) Decrease the number of features.
(C) Use a more complex model architecture.
(D) Prune a deep decision tree.
Q1.1.7 Which of the following is a primary method for reducing a model's variance?
(A) Increasing the complexity of the model.
(B) Decreasing the number of training examples.
(C) Increasing the number of features.
(D) Applying regularization techniques.
Q1.1.8 A model with a very large number of parameters and a flexible architecture is likely to
have:
(A) High bias and high variance.
(B) Low bias and high variance.
(C) High bias and low variance.
(D) Low bias and low variance.
Q1.1.9 Consider a model A that is a simple linear classifier and a model B that is a complex,
deep neural network. If both are trained on the same dataset, which one is more likely to
exhibit a larger gap between its training error and test error?
(A) Model A.
(B) Model B.
(C) Both models will have a similar gap.
(D) Neither model will have a significant gap.
Q1.1.10 In the bias-variance trade-off, what happens to bias and variance as a model becomes
increasingly complex?
(A) Both bias and variance increase.
(B) Bias increases, and variance decreases.
(C) Both bias and variance decrease.
(D) Bias decreases, and variance increases.
Q1.1.11 What is a common way to visualize the relationship between model complexity and the
bias-variance trade-off?
(A) A confusion matrix.
(B) A receiver operating characteristic (ROC) curve.
(C) A learning curve.
(D) A scatter plot of training data.

Question Bank 1.2: MSQs

Q1.2.1 A model is trained on a dataset, and its learning curve shows that both the training and
validation errors are high and have converged to a similar value. Which of the following
conclusions can be drawn?
(A) The model is suffering from high bias.
(B) The model is suffering from high variance.
(C) The model is underfitting the training data.
(D) The model is overfitting the training data.
(E) The model is too complex for the given problem.
Q1.2.2 Which of the following are valid strategies to reduce high variance in a machine
learning model?
(A) Adding more features to the dataset.
(B) Using a simpler model with fewer parameters.
(C) Applying L1 or L2 regularization.
(D) Increasing the training dataset size.
(E) Using ensemble methods like Bagging or Random Forest.
Q1.2.3 Consider a model with high bias. Which of the following actions could potentially
improve its performance?
(A) Using a more flexible model architecture.
(B) Adding interaction terms between features.
(C) Decreasing the number of training data points.
(D) Increasing the strength of a regularization penalty.
(E) Using a polynomial regression model with a higher degree.
Q1.2.4 The expected test error of a model can be decomposed into the sum of three
components. Which of the following contribute to this error?
(A) Bias squared.
(B) Irreducible error.
(C) The model's training error.
(D) The variance of the model.
(E) The model's validation error.
Q1.2.5 Which of the following statements about the Mean Squared Error (MSE) decomposition
are true?
(A) As model complexity increases, the bias component of MSE generally decreases.
(B) As model complexity increases, the variance component of MSE generally decreases.
(C) The irreducible error is a function of the chosen model.
(D) The total expected error is minimized at the point where bias equals variance.
(E) The Mean Squared Error is a direct measure of a model's underfitting.
Q1.2.6 A data scientist trains a model and observes that its performance on the training set is
significantly better than its performance on a separate test set. Which of the following are
likely causes for this observation?
(A) High model variance.
(B) High model bias.
(C) The model is overfitting the training data.
(D) The training dataset is too small.
(E) The model is too simple for the given data.
Q1.2.7 In the context of a learning curve, which of the following scenarios indicate a
high-variance problem?
(A) Training and validation errors converge to a high value.
(B) Training error is low, and validation error is high.
(C) There is a large gap between the training and validation error curves.
(D) Adding more training data significantly improves the validation error.
(E) The model performance plateaus early as training data is increased.
Q1.2.8 A statistical model's predictions are highly consistent across different random subsets
of the training data. Which of the following can be concluded about the model?
(A) It has high variance.
(B) It has low variance.
(C) It is likely suffering from underfitting.
(D) It is likely suffering from overfitting.
(E) It may have high bias.
Q1.2.9 Consider a ridge regression model. As the regularization parameter, λ, is increased,
which of the following effects are expected?
(A) Model variance increases.
(B) Model bias increases.
(C) The magnitude of the model's coefficients decreases.
(D) The model becomes more flexible.
(E) The model becomes less sensitive to fluctuations in the training data.
Q1.2.10 In the context of the bias-variance trade-off, which of the following are examples of
ensemble learning methods that can be used to reduce variance?
(A) Decision Trees.
(B) Bagging (e.g., Random Forest).
(C) Boosting (e.g., AdaBoost).
(D) Support Vector Machines (SVM).
(E) Principal Component Analysis (PCA).
Q1.2.11 Which of the following scenarios are most likely to lead to high bias?
(A) Using a polynomial regression model with a high degree.
(B) Using a simple linear model to capture a complex non-linear relationship.
(C) A model with a large number of parameters.
(D) A model that fails to converge on the training data.
(E) A model that makes strong, incorrect assumptions about the form of the data.

Question Bank 1.3: NATs

Q1.3.1 A predictive model has a bias of 2.0 and a variance of 3.0. If the irreducible error
(noise) is 1.5, what is the expected Mean Squared Error (MSE) on an unseen test set? (Answer
must be in decimal form, rounded to two decimal places).

Q1.3.2 A model's expected MSE on an unseen test set is 12.0. The bias of the model is 2.5,
and the variance is 5.25. What is the value of the irreducible error? (Answer must be in
decimal form, rounded to two decimal places).
Q1.3.3 A data scientist trains an estimator $\hat{y}$ such that its expected value is
$E[\hat{y}] = 0.9\,y_{\text{true}}$. If the variance of this estimator is given as
$\mathrm{Var}(\hat{y}) = 0.5$, and the true value is $y_{\text{true}} = 10$, what is the
Mean Squared Error of the estimator? (Answer must be in decimal form, rounded to two
decimal places).

Q1.3.4 An estimator is scaled down by a factor of 0.8. If the original estimator had a bias of
2.0 and a variance of 4.0, what is the new variance of the scaled estimator? (Answer must be
in decimal form).

Q1.3.5 An estimator is scaled down by a factor of 0.8. If the original estimator had a bias of
2.0 and a variance of 4.0, what is the new squared bias of the scaled estimator? (Answer must
be in decimal form).

Q1.3.6 A model with a certain complexity has a test MSE of 4.5. When a hyperparameter is
adjusted to make the model simpler, the bias increases by 1.0, and the variance decreases by
1.5. Assuming the irreducible error remains constant, what is the new test MSE of the simpler
model? (Answer must be in decimal form).

Q1.3.7 A model's expected squared error is given by
$\mathrm{MSE} = \mathrm{Bias}^2 + \mathrm{Variance} + \sigma^2$. If the Bias and
Variance are equal, and the irreducible error is 2, what must the value of the Bias be for the
total MSE to be 10? (Answer must be in decimal form, rounded to two decimal places).

Q1.3.8 A model's test error is decomposed as 14.5 = Bias^2 + 8.5 + 2.0. What is the value of
the model's bias? (Answer must be an integer or in decimal form, rounded to two decimal
places).

Q1.3.9 A set of predictions has an expected value $E[\hat{y}] = 5$ and a variance
$\mathrm{Var}(\hat{y}) = 2$. The true value is $y_{\text{true}} = 7$. What is the Mean
Squared Error (MSE) of this set of predictions? (Answer must be an integer).

Q1.3.10 Consider a simple model with a bias of 3 and a complex model with a bias of 1. The
variance of the simple model is 2, while the variance of the complex model is 10. Assuming the
irreducible error is 1.0 for both, what is the difference in their total Mean Squared Error
(MSE)? (Answer must be an integer).

Q1.3.11 An estimator $\hat{\theta}$ for a true parameter $\theta$ has a bias
$E[\hat{\theta}] - \theta = -1.5$ and a variance $\mathrm{Var}(\hat{\theta}) = 3.0$. What is
the expected Mean Squared Error, $E[(\hat{\theta} - \theta)^2]$? (Answer must be an
integer or in decimal form, rounded to two decimal places).
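
The decomposition used throughout the numerical questions above can be justified in a few lines. The sketch below relies on assumptions the workbook does not state explicitly: the target is $y = f + \varepsilon$ with $E[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, and the noise is independent of the fitted predictor $\hat{f}$.

```latex
% Assumed setup (not stated in the workbook):
%   y = f + \varepsilon,  E[\varepsilon] = 0,  Var(\varepsilon) = \sigma^2,
%   \varepsilon independent of \hat{f}.
\begin{aligned}
E\big[(y - \hat{f})^2\big]
  &= E\Big[\big((f - E[\hat{f}]) + (E[\hat{f}] - \hat{f}) + \varepsilon\big)^2\Big] \\
  &= \underbrace{\big(E[\hat{f}] - f\big)^2}_{\mathrm{Bias}^2}
   + \underbrace{E\big[(\hat{f} - E[\hat{f}])^2\big]}_{\mathrm{Variance}}
   + \underbrace{E[\varepsilon^2]}_{\sigma^2}
\end{aligned}
% The three cross terms vanish because E[\hat{f} - E[\hat{f}]] = 0,
% E[\varepsilon] = 0, and \varepsilon is independent of \hat{f}.
```

When the target is a fixed parameter rather than a noisy observation, the $\sigma^2$ term is absent and the error reduces to squared bias plus variance.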

Section 2: Cross-Validation

This section delves into model validation techniques. The questions explore the theoretical
purpose of cross-validation, the differences between popular methods like k-fold and LOOCV,
and the practical challenges of applying them, such as data leakage.
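
The mechanics of k-fold splitting, and one way to keep preprocessing inside the loop, can be sketched as follows. This is an illustrative sketch only: the helper names `k_fold_splits` and `fit_min_max`, the toy feature values, and the choice of 12 samples with 4 folds are all assumptions made for the example, not part of the workbook.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i
                     for idx in fold]
        yield train_idx, val_idx


def fit_min_max(values):
    """Return a scaling function fitted on the given values only."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # guard against a constant feature
    return lambda v: (v - lo) / span


# Hypothetical toy feature; real data would replace this list.
feature = [float(3 * i + 1) for i in range(12)]

for fold, (train_idx, val_idx) in enumerate(k_fold_splits(len(feature), k=4)):
    # The scaler sees the training fold only, then is applied to validation.
    scale = fit_min_max([feature[i] for i in train_idx])
    val_scaled = [round(scale(feature[i]), 3) for i in val_idx]
    print(f"fold {fold}: train={len(train_idx)} val={len(val_idx)} "
          f"scaled_val={val_scaled}")
```

Because the scaler is fitted on the training fold alone, validation values can fall outside the range [0, 1]. That is expected in this sketch and reflects what the model would face on unseen data.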

Question Bank 2.1: MCQs

Q2.1.1 What is the primary purpose of cross-validation?
(A) To train a final, production-ready model on the entire dataset.
(B) To select the single best-performing model from a set of candidates.
(C) To estimate a model's generalization performance on unseen data.
(D) To reduce the bias of a predictive model.
Q2.1.2 In k-fold cross-validation with a dataset of size N, how many models are trained and
how many times is each observation used for validation?
(A) k models are trained, and each observation is used for validation k times.
(B) k-1 models are trained, and each observation is used for validation once.
(C) k models are trained, and each observation is used for validation exactly once.
(D) k-1 models are trained, and each observation is used for validation k-1 times.
Q2.1.3 Leave-one-out cross-validation (LOOCV) is a special case of k-fold cross-validation.
What is the value of k in this case?
(A) k=1
(B) k=2
(C) k=N, where N is the number of samples.
(D) k=N−1.
Q2.1.4 A machine learning engineer has a very small dataset of 50 samples for a classification
task. They want to use a cross-validation method that wastes as little data as possible for the
training process. Which method would be the most suitable?
(A) Simple Holdout Method (e.g., 80/20 split).
(B) k-fold cross-validation with k=5.
(C) Leave-one-out cross-validation (LOOCV).
(D) Stratified k-fold cross-validation.
Q2.1.5 A data scientist is working with a time-series dataset where observations are
dependent on previous ones. Which of the following cross-validation strategies is most
appropriate to prevent data leakage from the future into the past?
(A) A standard k-fold cross-validation.
(B) A stratified k-fold cross-validation.
(C) A blocking or forward-chaining time series split.
(D) A Leave-one-out cross-validation.
Q2.1.6 What is a major disadvantage of Leave-one-out cross-validation (LOOCV)?
(A) It results in a highly biased estimate of model performance.
(B) It leads to models that are too simple.
(C) It is computationally very expensive for large datasets.
(D) It does not use all observations for validation.
Q2.1.7 Which of the following is an example of an "exhaustive" cross-validation method?
(A) k-fold cross-validation.
(B) Holdout method.
(C) Repeated random sub-sampling validation.
(D) Leave-p-out cross-validation.
Q2.1.8 When comparing a model with a low mean cross-validation error and a model with a
high mean cross-validation error, which model is generally preferred?
(A) The one with the low mean error, as it indicates better average performance.
(B) The one with the high mean error, as it indicates a more complex model.
(C) The one with the low standard deviation, regardless of the mean.
(D) The one with the high standard deviation, as it shows more flexibility.
Q2.1.9 When performing cross-validation to compare two models, what is the purpose of also
calculating the standard deviation of the performance metric across the folds?
(A) To determine which model is computationally cheaper.
(B) To identify the model's bias on the training data.
(C) To measure the stability and consistency of the model's performance.
(D) To select the single best-performing model instance.
Q2.1.10 A company wants to use a machine learning model for a fraud detection system. The
historical data contains a massive class imbalance (99% non-fraudulent, 1% fraudulent).
Which cross-validation strategy is essential for properly evaluating the model?
(A) Simple k-fold cross-validation.
(B) Leave-one-out cross-validation.
(C) Stratified k-fold cross-validation.
(D) Repeated random sub-sampling validation.
Q2.1.11 What is a key assumption that can break down in k-fold cross-validation, leading to a
"pessimistic bias"?
(A) That the dataset is large enough.
(B) That the surrogate models trained on k-1 folds are equivalent to the model trained on the
whole dataset.
(C) That the data is perfectly balanced.
(D) That the model is not suffering from high variance.

Question Bank 2.2: MSQs

Q2.2.1 A data science team is preparing a dataset for a classification task. The data has a
class imbalance, and the team intends to use k-fold cross-validation. Which of the following
steps are crucial to ensure a robust and reliable model evaluation?
(A) Use a standard k-fold cross-validation without stratification.
(B) Use a stratified k-fold cross-validation.
(C) Apply a data-scaling technique (e.g., Min-Max Scaling) to the entire dataset before
splitting.
(D) Randomly shuffle the data before splitting it into folds.
(E) Use a separate, untouched test set after the cross-validation process.
Q2.2.2 Which of the following scenarios are clear examples of information leakage in a
cross-validation setup?
(A) Normalizing the entire dataset before splitting it into training and validation folds.
(B) Using the training data to select the most important features.
(C) Using a separate, holdout test set for final model evaluation.
(D) Tuning hyperparameters on the cross-validation folds.
(E) Calculating feature correlation with the target variable on the entire dataset and then
using only the top 10 features for training.
Q2.2.3 A data scientist is using a nested cross-validation approach to tune a model's
hyperparameters and evaluate its performance. What are the key characteristics of this
process?
(A) The outer loop is used to estimate the generalization performance of the chosen model.
(B) The inner loop is used for hyperparameter tuning.
(C) A single model is trained and evaluated in the entire process.
(D) It is computationally cheaper than standard k-fold cross-validation.
(E) It is considered a more robust method for hyperparameter tuning.
Q2.2.4 Which of the following are valid reasons for using cross-validation over a simple
train-test split?
(A) It provides a single, definitive performance metric.
(B) It helps to identify if a model is underfitting.
(C) It ensures all data points are used for both training and validation.
(D) It provides a more robust estimate of model performance by averaging results across
multiple splits.
(E) It prevents data leakage by default.
Q2.2.5 The statement "The purpose of cross-validation is for model checking, not model
building" implies which of the following?
(A) The final, production-ready model should be one of the models trained during the
cross-validation process.
(B) The models trained during cross-validation are "surrogate models" for evaluating a
procedure.
(C) A new, final model should be trained on the entire dataset after cross-validation is
complete.
(D) The results of cross-validation are not indicative of the model's future performance.
(E) Cross-validation helps in selecting the best model-building procedure (e.g., algorithm and
hyperparameters).
Q2.2.6 A data scientist is performing 5-fold cross-validation. They observe that the model's
performance (e.g., accuracy) varies significantly across the 5 folds. Which of the following can
be inferred?
(A) The model has low variance.
(B) The model is suffering from high bias.
(C) The model may be unstable.
(D) The model's performance estimate is unreliable.
(E) The data splitting may not be ideal (e.g., not stratified for imbalanced data).
Q2.2.7 Which of the following statements about Leave-one-out cross-validation (LOOCV) are
true?
(A) It is computationally efficient for large datasets.
(B) It provides a nearly unbiased estimate of model performance.
(C) It has higher variance than k-fold cross-validation.
(D) It is a specific type of k-fold cross-validation where k=N.
(E) It is particularly useful for small datasets.
Q2.2.8 A data scientist uses a simple k-fold cross-validation on a dataset with multiple groups
(e.g., data from different hospitals). They find that some training and validation folds contain
data from the same group. This is problematic if:
(A) The data is time-series dependent.
(B) The observations within a group are not independent of each other.
(C) The model needs to generalize to completely new, unseen groups.
(D) The class distribution is imbalanced.
(E) The model is a decision tree.
Q2.2.9 A data scientist is using cross-validation for a regression task. They notice that the
training and test sets have vastly different value ranges (e.g., train set has values from 0-1,
test set from -1000 to -100). Which of the following are potential solutions to this problem?
(A) Apply a simple k-fold cross-validation.
(B) Use a stratified k-fold cross-validation on binned data.
(C) Use a blocking time series split.
(D) Discard the test set and use only the training set for evaluation.
(E) Utilize a proper data scaling strategy within the cross-validation loop.
Q2.2.10 Consider a nested cross-validation process with an outer loop of 5 folds and an inner
loop of 3 folds. What is the total number of models trained in this process for a single set of
hyperparameters?
(A) 3
(B) 5
(C) 8
(D) 15
(E) It cannot be determined without knowing the number of samples.
Q2.2.11 Which of the following statements correctly describe the differences between k-fold
cross-validation and the Holdout method?
(A) K-fold cross-validation trains the model multiple times, while Holdout trains it once.
(B) The Holdout method uses all data for training, while k-fold does not.
(C) K-fold cross-validation is generally preferred for small datasets.
(D) The Holdout method provides a more robust performance estimate.
(E) K-fold cross-validation is more prone to data leakage.

Question Bank 2.3: NATs

Q2.3.1 In a 10-fold cross-validation on a dataset with 500 samples, how many samples are in
the training set for each fold? (Answer must be an integer).

Q2.3.2 A data scientist performs 5-fold cross-validation and obtains the following R-squared
values for a regression model: [0.85, 0.88, 0.79, 0.82, 0.86]. What is the mean R-squared
value for this model? (Answer must be in decimal form, rounded to two decimal places).

Q2.3.3 In a Leave-one-out cross-validation (LOOCV) on a dataset with 200 samples, what is
the number of models that will be trained? (Answer must be an integer).

Q2.3.4 A dataset has 800 samples. A data scientist uses k-fold cross-validation with k=4. For
each of the four folds, a model is trained, and its accuracy is calculated. The accuracies are
[0.92, 0.94, 0.90, 0.96]. What is the average accuracy of the model? (Answer must be in
decimal form, rounded to two decimal places).

Q2.3.5 An engineer is conducting a nested cross-validation with an outer loop of 5 folds and
an inner loop of 4 folds. The inner loop is used to compare 3 different hyperparameter
combinations. What is the total number of models trained in this entire process? (Answer must
be an integer).

Q2.3.6 A dataset has 100 samples. When using Leave-one-out cross-validation (LOOCV),
how many samples are in the training set for each iteration? (Answer must be an integer).

Q2.3.7 A data scientist evaluates a model using 5-fold cross-validation and gets the following
Mean Absolute Error (MAE) values for each fold: [15.2, 16.5, 14.8, 17.1, 15.9]. What is the
average MAE for the model? (Answer must be in decimal form, rounded to two decimal
places).

Q2.3.8 A dataset for a classification problem has 1000 samples, with a class distribution of
800 samples for Class A and 200 samples for Class B. If a stratified 5-fold cross-validation is
used, how many samples of Class A will be in each validation fold? (Answer must be an
integer).

Q2.3.9 A dataset for a classification problem has 1000 samples, with a class distribution of
800 samples for Class A and 200 samples for Class B. If a stratified 5-fold cross-validation is
used, what is the ratio of Class A to Class B samples in each training fold? (Answer must be in
decimal form, rounded to two decimal places).

Q2.3.10 A standard 10-fold cross-validation is performed on a dataset with 100 samples.
What is the total number of unique samples used in the training sets across all 10 iterations?
(Answer must be an integer).

Q2.3.11 In a nested cross-validation with an outer loop of 10 folds and an inner loop of 5 folds,
how many models are trained in the inner loop for a single outer fold? (Answer must be an
integer).

Section 3: Decision Trees

This section explores the core concepts of decision trees, from their structural components
and algorithmic foundations to their inherent limitations and the ensemble methods used to
improve them.
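
For reference while working through this section, the standard impurity measures can be written in a few lines. The sketch below uses hypothetical labels ("yes"/"no") and helper names chosen for illustration. It shows the usual textbook definitions of entropy, Gini impurity, and information gain, not any particular library's implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical node with a 50/50 class mix and two candidate splits.
parent = ["yes", "yes", "no", "no"]
split_a = [["yes", "yes"], ["no", "no"]]   # separates the classes
split_b = [["yes", "no"], ["yes", "no"]]   # leaves both children mixed

print("entropy(parent):", entropy(parent))
print("gini(parent):   ", gini(parent))
print("gain(split_a):  ", information_gain(parent, split_a))
print("gain(split_b):  ", information_gain(parent, split_b))
```

A greedy tree learner would evaluate candidate splits in this way and keep the one with the largest gain (or largest impurity reduction) at each node.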

Question Bank 3.1: MCQs

Q3.1.1 In the context of a decision tree, which of the following best describes the purpose of a
leaf node?
(A) To represent a test on an attribute.
(B) To make a final prediction or classification.
(C) To act as a parent for other nodes.
(D) To store the conditions for splitting the data.
Q3.1.2 Which of the following is an advantage of using decision trees as a machine learning
model?
(A) They are robust to small changes in the training data.
(B) They are easily interpretable and can be visualized.
(C) They are known for their high computational efficiency.
(D) They are guaranteed to find the globally optimal solution.
Q3.1.3 Which of the following decision tree algorithms can be used for both classification and
regression tasks?
(A) ID3
(B) C4.5
(C) CART
(D) All of the above.
Q3.1.4 The ID3 algorithm uses which of the following metrics to select the best attribute for
splitting a node?
(A) Gini Impurity.
(B) Variance.
(C) Information Gain.
(D) Mean Squared Error.
Q3.1.5 In a decision tree, what does entropy measure?
(A) The overall accuracy of the model.
(B) The level of disorder or impurity in a set of data.
(C) The computational complexity of the tree.
(D) The number of attributes used for splitting.
Q3.1.6 What is the term for the technique of simplifying a decision tree by removing nodes to
prevent overfitting?
(A) Pruning.
(B) Boosting.
(C) Bagging.
(D) Splitting.
Q3.1.7 A decision stump is a decision tree with a maximum depth of:
(A) 0
(B) 1
(C) 2
(D) 3
Q3.1.8 Which of the following is a disadvantage of using decision trees?
(A) They are unable to handle categorical features.
(B) They cannot model non-linear relationships.
(C) They can be unstable, as small changes in data can lead to a completely different tree.
(D) They require extensive data pre-processing like normalization.
Q3.1.9 How do decision trees typically handle continuous variables?
(A) By using a simple one-hot encoding.
(B) By discretizing the variable into intervals or using a binary split at a specific threshold.
(C) By ignoring them in the splitting process.
(D) By calculating the mean and standard deviation of the variable.
Q3.1.10 The problem of learning an optimal decision tree is known to be NP-complete.
Consequently, practical decision-tree learning algorithms are based on:
(A) Exhaustive search algorithms.
(B) Greedy, heuristic approaches.
(C) Backpropagation.
(D) Gradient descent.
Q3.1.11 The prediction of a regression tree is a:
(A) Continuous approximation.
(B) Piecewise constant approximation.
(C) Linear approximation.
(D) Non-linear approximation.
Q3.1.12 Which of the following is a technique to reduce the variance of a decision tree model?
(A) Using a simpler model.
(B) Bagging.
(C) Boosting.
(D) All of the above.

Question Bank 3.2: MSQs

Q3.2.1 A decision tree is grown to its maximum depth without any pruning. Which of the
following are likely consequences of this action?
(A) The tree is likely to have low bias.
(B) The tree is likely to have high variance.
(C) The tree will underfit the training data.
(D) The tree will perform poorly on unseen data.
(E) The tree will be a good example of a decision stump.
Q3.2.2 Which of the following are inherent disadvantages of a single, unpruned decision tree?
(A) It can be unstable; small variations in data can produce a completely different tree.
(B) It is computationally expensive for prediction once trained.
(C) It is not guaranteed to find the globally optimal decision tree.
(D) It can create a biased tree if some classes dominate the dataset.
(E) It cannot handle categorical data.
Q3.2.3 Consider an ensemble of decision trees. Which of the following statements about
ensemble methods are true?
(A) Bagging reduces variance by combining multiple strong learners.
(B) Boosting reduces bias by combining multiple weak learners.
(C) Random Forest is an example of a boosting algorithm.
(D) Ensemble methods generally improve the interpretability of the model.
(E) They are a direct way to manage the bias-variance trade-off.
Q3.2.4 Which of the following are common stopping criteria for a greedy decision tree
algorithm?
(A) The tree reaches a maximum predefined depth.
(B) A node contains a number of samples below a minimum threshold.
(C) The information gain (or the reduction in Gini impurity) from a potential split is below a certain threshold.
(D) The model's training accuracy reaches 100%.
(E) The model's test accuracy starts to decrease.
Q3.2.5 Which of the following statements about the Gini Impurity are true?
(A) It is used as a splitting criterion in the CART algorithm.
(B) It measures the probability of misclassifying a randomly chosen element from a set.
(C) A value of 0.0 indicates a perfectly pure node (all samples belong to the same class).
(D) A value of 1.0 indicates a perfectly pure node.
(E) It is generally used for regression tasks.
Q3.2.6 A data scientist trains a decision tree on a classification problem where one class
dominates the dataset. Which of the following are appropriate strategies to address the
potential issues?
(A) Balance the dataset prior to fitting the model.
(B) Use a simple k-fold cross-validation.
(C) Use a class-weighted decision tree algorithm.
(D) Use a stratified k-fold cross-validation.
(E) Prune the tree heavily to increase bias.
Q3.2.7 Decision trees are often said to have a preference for certain types of splits. What are
these preferences based on?
(A) Splits that maximize the number of nodes.
(B) Splits that are made closer to the root of the tree.
(C) Splits that maximize impurity.
(D) Splits that result in the highest information gain.
(E) Splits that create a balanced tree structure.
Q3.2.8 A decision tree is a piecewise constant approximation of a function. Which of the
following are true consequences of this property?
(A) They are not good at extrapolation.
(B) They are effective at modeling complex non-linear relationships.
(C) Their predictions are smooth and continuous.
(D) They are particularly good for regression problems.
(E) They can be used to represent any boolean function.
Q3.2.9 Which of the following statements correctly describe the relationship between a
decision tree's depth and its performance?
(A) A very deep tree is prone to high bias.
(B) A very shallow tree is prone to high variance.
(C) Increasing the maximum depth beyond a certain point will likely lead to overfitting.
(D) A tree with a large maximum depth tends to have low bias.
(E) Limiting the tree's depth is a form of regularization.
Q3.2.10 Which of the following are examples of how a decision tree's variance can be
controlled?
(A) Setting the minimum number of samples required at a leaf node.
(B) Setting the maximum depth of the tree.
(C) Training multiple trees in an ensemble.
(D) Using a larger training dataset.
(E) Using a more complex model.
Q3.2.11 What are the major differences between the ID3 and C4.5 algorithms?
(A) ID3 can handle continuous features, while C4.5 cannot.
(B) ID3 uses Information Gain, while C4.5 uses a Gain Ratio to account for splitting on many
values.
(C) ID3 cannot handle missing values, while C4.5 can.
(D) ID3 is only for classification, while C4.5 can handle both classification and regression.
(E) ID3 is a greedy algorithm, while C4.5 is not.
Q3.2.12 The problem of learning an optimal decision tree is considered NP-complete. This has
which of the following implications for practical algorithms?
(A) They are guaranteed to find the best possible tree.
(B) They are based on heuristic, greedy approaches.
(C) They are highly sensitive to initial conditions.
(D) They cannot guarantee a globally optimal decision tree.
(E) They can handle XOR problems easily.

Question Bank 3.3: NATs

Q3.3.1 A node in a binary classification tree has 20 samples from Class A and 10 samples
from Class B. What is the entropy of this node? (Use log2, and round the answer to two
decimal places).

Q3.3.2 A node in a classification tree has 50 samples from Class X and 50 samples from Class
Y. What is the Gini Impurity of this node? (Answer must be in decimal form, rounded to two
decimal places).

Q3.3.3 A node has 100 samples (50 positive, 50 negative). A split on feature F1 creates two
child nodes: Node A with 30 positive and 10 negative samples, and Node B with 20 positive
and 40 negative samples. What is the Gini Impurity of Node A? (Answer must be in decimal
form, rounded to two decimal places).

Q3.3.4 A parent node has 10 samples (5 positive, 5 negative). A split on feature X creates two
child nodes: Node 1 with 4 positive and 1 negative samples, and Node 2 with 1 positive and 4
negative samples. What is the Information Gain of this split? (Use log2 and round the answer
to three decimal places).

Q3.3.5 A node in a regression tree has 10 samples. The target values for these samples are: .
A split is made on a feature, creating two child nodes. Node A has values , and Node B has
values . What is the total reduction in Mean Squared Error (MSE) achieved by this split?
(Answer must be an integer or in decimal form, rounded to two decimal places).

Q3.3.6 A decision tree has 3 internal nodes and 4 leaf nodes. The number of splits in this tree
is: (Answer must be an integer).

Q3.3.7 For a binary classification problem, if a node is perfectly pure, what is its entropy?
(Answer must be an integer).

Q3.3.8 A node has 50 samples from Class A and 50 samples from Class B. A split on a
continuous feature at value v creates two child nodes: Node 1 with 40 samples from Class A
and 10 from Class B, and Node 2 with 10 samples from Class A and 40 from Class B. What is
the Information Gain from this split? (Use log2, and round the answer to two decimal places).

Q3.3.9 Consider a truth table for an XOR function with two boolean inputs. How many nodes
would a simple decision tree require to represent this function without any pruning? (Answer
must be an integer).

Q3.3.10 A decision tree is used for a regression task. A leaf node contains 5 data points with
target values: 10, 12, 11, 13, 14. What is the predicted value for any new data point that reaches
this leaf? (Answer must be an integer).

Section 4: Integrated Advanced Problems: The GATE Simulation

This section presents a curated set of questions that require the synthesis of concepts from
all three topics, simulating the multi-faceted nature of a real GATE examination.

Question Bank 4.1: MCQs, MSQs, and NATs

Q4.1.1 (MCQ) A data scientist is building a model using a deep decision tree on a small, noisy
dataset. They decide to apply k-fold cross-validation to assess its performance. The
cross-validation yields a high average test error, but the model performs almost perfectly on
its training data. This suggests that the model is suffering from:
(A) High bias.
(B) High variance.
(C) Low bias and low variance.
(D) Irreducible error.
Q4.1.2 (MSQ) A machine learning pipeline uses a standard scaler on the entire dataset,
followed by k-fold cross-validation with hyperparameter tuning. What are the potential
consequences of this approach?
(A) The model will suffer from high bias.
(B) The performance estimate will be optimistically biased due to data leakage.
(C) The model will have low variance.
(D) The model will be overfit to the validation data.
(E) The cross-validation process is fundamentally flawed and should be re-architected.
Q4.1.3 (NAT) A data scientist uses a 5-fold cross-validation to compare a simple model and a
complex model. The simple model has an average bias of 3.0 and an average variance of 2.0.
The complex model has an average bias of 1.0 and an average variance of 10.0. If the
irreducible error is 1.0, what is the total Mean Squared Error (MSE) of the model that should
be chosen based on the bias-variance trade-off? (Answer must be an integer).

Q4.1.4 (MSQ) A data scientist trains a Random Forest model with a max_depth
hyperparameter set to a small value (e.g., 2). They use a nested cross-validation to tune this
hyperparameter. Which of the following statements about this scenario are true?
(A) The base decision trees of the Random Forest are likely to have high bias.
(B) The ensemble (Random Forest) is likely to have low variance.
(C) The nested cross-validation is essential to avoid information leakage during
hyperparameter tuning.
(D) The final model's performance will be estimated by the outer loop of the nested
cross-validation.
(E) The number of base learners (trees) should be kept very low to avoid overfitting.
Q4.1.5 (NAT) A classification problem uses a decision tree. A split is made, and the Gini
Impurity of the parent node is 0.4. The split creates two child nodes. Child node A has a Gini
Impurity of 0.2 and contains 40% of the parent's data. Child node B has a Gini Impurity of 0.5
and contains 60% of the parent's data. What is the weighted Gini Impurity (Gini Index) of this
split? (Answer must be in decimal form, rounded to two decimal places).

Q4.1.6 (MSQ) A researcher conducts an experiment to test a new model. They split their
dataset into a training set and a test set. After training the model, they find the training
accuracy is 98% and the test accuracy is 65%. They want to use cross-validation to confirm
this result. What should they expect from the cross-validation analysis?
(A) The average cross-validation accuracy will be close to 65%.
(B) The cross-validation accuracy will have a high standard deviation.
(C) The average cross-validation accuracy will be close to 98%.
(D) The model is suffering from high bias.
(E) The model is overfitting the training data.
Q4.1.7 (NAT) A data scientist is training a decision tree model and using 5-fold
cross-validation to determine the optimal max_depth. The average accuracy and standard
deviation for max_depth = 5 are 0.85 ± 0.05, and for max_depth = 10 are 0.87 ± 0.12. Based
solely on this information, what is the difference in the average test accuracy between the two
models? (Answer must be in decimal form, rounded to two decimal places).

Q4.1.8 (MSQ) A data scientist decides to use a decision tree and a Ridge Regression model to
solve a regression problem. They use 10-fold cross-validation to select the best model. Which
of the following statements are likely to be true about the model selected?
(A) The decision tree will be a more stable model.
(B) The Ridge Regression model's predictions will be a piecewise constant approximation.
(C) The decision tree's performance can be evaluated by averaging the Mean Squared Error
(MSE) across all folds.
(D) The final selected model should be one of the 10 models trained during cross-validation.
(E) The Ridge Regression model is likely to have lower variance than a deep, unpruned
decision tree.
Q4.1.9 (NAT) A data scientist performs a 10-fold cross-validation and records the following
Mean Absolute Errors (MAE) for a model: [5.1, 5.5, 4.8, 5.2, 5.0, 5.4, 4.9, 5.3, 5.6, 5.7]. The data
scientist then uses this information to train a final model on the entire dataset. What is the
estimated generalization error of this final model? (Answer must be in decimal form, rounded
to two decimal places).

Q4.1.10 (MSQ) A data scientist is tuning a decision tree classifier on a dataset with a known
class imbalance. They use a standard k-fold cross-validation and find that the model performs
poorly on the test set. Which of the following are the most likely reasons for this outcome?
(A) The model has high bias, as it is too simple.
(B) The model is overfitting to the majority class.
(C) The cross-validation splits may have resulted in some folds with skewed class
distributions.
(D) The model's low variance is causing it to fail on the minority class.
(E) A stratified k-fold cross-validation should have been used.

Answer Key with Detailed Explanations

This section provides the solutions to all questions, complete with comprehensive
explanations and relevant mathematical derivations.

Section 1: The Bias-Variance Trade-off

Q1.1.1 (B) The most accurate definition of bias is the systematic error that arises from
incorrect or overly simplistic assumptions in the learning algorithm. It is the measure of how
far off a model's predictions are from the true values on average.3

Q1.1.2 (B) A linear model applied to a non-linear problem is a quintessential example of high
bias. The model makes a strong, incorrect assumption (linearity) and therefore fails to capture
the true underlying pattern, leading to high error on both training and test data. Since the
model is simple and inflexible, its predictions will not change much with different training
datasets, indicating low variance.3

Q1.1.3 (B) When both the training error and the test error are high and similar, it is a clear
indication of an underfit model. The model is too simple to capture the underlying patterns in
the data, a condition known as high bias. This can be clearly seen on a learning curve where
the two error curves converge at a high value.3

Q1.1.4 (C) High variance is defined as the model's sensitivity to fluctuations in the training
data. This leads to a model that performs very well by "memorizing" the training data but fails
to generalize to new, unseen data, resulting in poor test performance.3
Q1.1.5 (D) In the decomposition of a model's expected test error, the irreducible error
represents the inherent noise in the data itself. It is a source of error that cannot be reduced
by any model, regardless of its complexity or sophistication.5

Q1.1.6 (C) High bias is a symptom of a model that is too simple. The remedy is to increase the
model's complexity, for example, by adding more features or using a more flexible algorithm.
Conversely, regularization and pruning are techniques used to combat high variance.4

Q1.1.7 (D) Regularization techniques, such as L1 (Lasso) or L2 (Ridge) penalties, work by
penalizing the model for complexity, thereby reducing its variance and preventing it from
overfitting to the training data. Ensemble methods are also highly effective for this purpose.3

Q1.1.8 (B) A model with a large number of parameters and a flexible architecture, such as a
deep neural network, has the capacity to fit the training data very well, leading to low bias.
However, this flexibility makes it highly sensitive to noise and specific data points in the
training set, causing its predictions to fluctuate significantly, which indicates high variance.3

Q1.1.9 (B) The more complex model (Model B, the deep neural network) has a greater
capacity to overfit the training data. This would lead to a low training error but a high test
error, resulting in a large gap between the two. The simpler linear model (Model A) is less likely
to overfit and would exhibit a smaller, or similar, gap between its errors.3

Q1.1.10 (D) The bias-variance trade-off is a fundamental concept where as a model becomes
more complex, its bias decreases (it can capture more complex patterns), but its variance
increases (it becomes more sensitive to training data fluctuations).3

Q1.1.11 (C) A learning curve plots the training and validation errors against the size of the
training set. This plot is an effective way to visually diagnose whether a model is suffering
from high bias or high variance.3

Section 1: MSQs

Q1.2.1 (A), (C) When both training and validation errors are high and have converged, it
signifies a high bias problem. The model is too simple to learn the underlying patterns, leading
to underfitting. This is a common symptom of a model that lacks sufficient complexity.3

Q1.2.2 (B), (C), (D), (E) High variance is the symptom of an overfit model. This can be
addressed by simplifying the model (B), applying regularization (C), increasing the dataset
size (D), or using ensemble methods that average out the predictions of multiple models (E).3
Adding more features (A) would typically increase complexity and thus, variance.

Q1.2.3 (A), (B), (E) High bias, or underfitting, occurs when a model is too simple. The
appropriate remedies involve increasing model complexity. Using a more flexible model (A),
adding interaction terms (B), or using a higher-degree polynomial regression (E) all increase
the model's capacity to fit the data.4 Decreasing data (C) would not help, and increasing
regularization (D) would make the model even simpler.

Q1.2.4 (A), (B), (D) The expected test error, often measured by Mean Squared Error, is
mathematically decomposed into three non-negative components: the squared bias, the
variance, and the irreducible error.5

Q1.2.5 (A), (D) As model complexity increases, the bias typically decreases. However, the
variance increases, leading to the well-known trade-off. The irreducible error is a property of
the data, not the model (C). The point where the total expected error is minimized is often
close to where bias and variance are balanced, but not necessarily where they are equal. MSE
is not a direct measure of underfitting, but a tool for analyzing error sources.5

Q1.2.6 (A), (C), (D) The described scenario is the classic hallmark of overfitting: low training
error and high test error. Overfitting is caused by high model variance, where the model has
"memorized" the training data, including its noise, and cannot generalize to new data. This is
often exacerbated by having a training dataset that is too small for a complex model.5

Q1.2.7 (B), (C), (D) High variance is characterized by a low training error and a high validation
error, creating a large gap between the curves. A large gap suggests that the model is fitting
the training data well but performing poorly on unseen data. Adding more data can help the
model generalize better and reduce this gap.3

Q1.2.8 (B), (E) Consistency across different training subsets means the model is not very
sensitive to fluctuations in the data. This is the definition of low variance. Such a model,
however, may be too simple to capture complex patterns, which is a symptom of high bias.
Therefore, it is likely to have low variance and may also have high bias.3

Q1.2.9 (B), (C), (E) Ridge regression uses an L2 penalty on the coefficients. Increasing the
regularization parameter, λ, penalizes larger coefficients, forcing them to shrink towards zero.
This makes the model less flexible, which increases its bias and decreases its variance,
thereby making it less sensitive to noise in the training data.3
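
The shrinkage described in Q1.2.9 can be seen in a minimal one-feature sketch. It assumes a no-intercept model y ≈ w·x, for which the ridge solution has the closed form w = Σxy / (Σx² + λ). The toy data and the helper name `ridge_coefficient` are made up for illustration.

```python
def ridge_coefficient(xs, ys, lam):
    """Closed-form ridge solution for the one-feature, no-intercept model y ~ w * x."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Hypothetical toy data, roughly y = 2x with a little noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Larger lambda pulls the coefficient towards zero (more bias, less variance).
lambdas = (0.0, 1.0, 10.0, 100.0)
coefficients = [ridge_coefficient(xs, ys, lam) for lam in lambdas]
for lam, w in zip(lambdas, coefficients):
    print(f"lambda={lam}: w={w:.3f}")
```

With λ = 0 the result is the ordinary least-squares coefficient; each larger λ gives a strictly smaller coefficient.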

Q1.2.10 (B), (C) Ensemble methods are a powerful strategy to manage the bias-variance
trade-off. Bagging (Bootstrap Aggregating) and Random Forest are methods that combine
multiple strong, high-variance learners to reduce the overall variance of the ensemble.
Boosting, on the other hand, combines multiple weak, high-bias learners to reduce the overall
bias of the ensemble.6
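
The variance-reduction claim for bagging can be illustrated with a small simulation. This is only a sketch under a strong assumption: each "model" is an independent noisy prediction around the same true value. Real bagged trees are correlated, so the reduction in practice is smaller. The function name and parameters are hypothetical.

```python
import random
import statistics

def variance_of_average(n_models, n_trials=2000, noise_sd=1.0, seed=0):
    """Empirical variance of the mean of n_models independent noisy predictions."""
    rng = random.Random(seed)
    averages = []
    for _ in range(n_trials):
        preds = [rng.gauss(0.0, noise_sd) for _ in range(n_models)]
        averages.append(sum(preds) / n_models)
    return statistics.pvariance(averages)

single = variance_of_average(1)    # one high-variance learner
bagged = variance_of_average(25)   # average of 25 such learners
print(f"variance of one model: {single:.3f}")
print(f"variance of 25-model average: {bagged:.3f}")
```
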
Q1.2.11 (B), (E) High bias is a result of a model being too simplistic. A simple linear model (B) is
a classic example when the data is non-linear. This occurs when the model makes strong,
often erroneous, assumptions (E) about the underlying data distribution.3 Using a high-degree
polynomial model (A) or a model with many parameters (C) are classic causes of high
variance, not bias.

Section 1: NATs

Q1.3.1
The Mean Squared Error (MSE) is given by the formula:
$\text{MSE} = \text{Bias}^2 + \text{Variance} + \text{IrreducibleError}$
Given: $\text{Bias} = 2.0$, $\text{Variance} = 3.0$, $\text{IrreducibleError} = 1.5$
$\text{MSE} = 2.0^2 + 3.0 + 1.5 = 4.0 + 3.0 + 1.5 = 8.5$
Answer: 8.50
Q1.3.2
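Given: $\text{MSE} = 12.0$, $\text{Bias} = 2.5$, $\text{Variance} = 5.25$
We use the same formula: $\text{MSE} = \text{Bias}^2 + \text{Variance} + \text{IrreducibleError}$
$12.0 = 2.5^2 + 5.25 + \text{IrreducibleError}$
$12.0 = 6.25 + 5.25 + \text{IrreducibleError}$
$12.0 = 11.5 + \text{IrreducibleError}$
$\text{IrreducibleError} = 12.0 - 11.5 = 0.5$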
Answer: 0.50
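
The arithmetic in Q1.3.1 and Q1.3.2 can be checked with a short sketch of the decomposition. The function names are illustrative only.

```python
def total_mse(bias, variance, irreducible_error=0.0):
    """Expected squared error: bias^2 + variance + irreducible error."""
    return bias ** 2 + variance + irreducible_error

def irreducible_from_mse(mse, bias, variance):
    """Solve the same identity for the irreducible error."""
    return mse - bias ** 2 - variance

q131 = total_mse(2.0, 3.0, 1.5)                # Q1.3.1
q132 = irreducible_from_mse(12.0, 2.5, 5.25)   # Q1.3.2
print(q131, q132)
```
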
Q1.3.3
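The Mean Squared Error of an estimator is defined as $\text{MSE} = \text{Bias}^2 + \text{Variance}$.
Bias is defined as $\text{Bias} = E[\hat{y}] - y_{true}$.
Given: $E[\hat{y}] = 0.9\,y_{true}$ and $y_{true} = 10$.
$E[\hat{y}] = 0.9 \times 10 = 9$.
$\text{Bias} = 9 - 10 = -1$.
$\text{Bias}^2 = (-1)^2 = 1$.
Given: $\text{Variance} = 0.5$.
$\text{MSE} = 1 + 0.5 = 1.5$.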
Answer: 1.50
Q1.3.4
The variance of a scaled estimator $c\hat{\theta}$ is $\text{Var}(c\hat{\theta}) = c^2\,\text{Var}(\hat{\theta})$.
Given: $c = 0.8$ and $\text{Var}(\hat{\theta}) = 4.0$.
New variance $= 0.8^2 \times 4.0 = 0.64 \times 4.0 = 2.56$.
Answer: 2.56
Q1.3.5
The bias of a scaled estimator $c\hat{\theta}$ is
$\text{Bias}(c\hat{\theta}) = E[c\hat{\theta}] - \theta = c(E[\hat{\theta}] - \theta) + (c - 1)\theta = c \cdot \text{Bias}(\hat{\theta}) + (c - 1)\theta$.
The answer uses the simplified form $\text{Bias}(c\hat{\theta}) = c \cdot \text{Bias}(\hat{\theta})$, which is exact when $\theta = 0$.
The original squared bias is $\text{Bias}^2 = 2.0^2 = 4.0$. After scaling:
$\text{NewBias} = 0.8 \times 2.0 = 1.6$.
$\text{NewBias}^2 = 1.6^2 = 2.56$.
Answer: 2.56
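
The scaling rules used in Q1.3.4 and Q1.3.5 can be checked numerically. The sketch below uses a made-up list of estimates with mean 10 and population variance 4, and an assumed true value θ = 8 so that the unscaled bias is 2.0. All of these values are illustrative assumptions.

```python
import statistics

def bias(estimates, theta):
    """Bias approximated as the mean of the estimates minus the true value."""
    return statistics.mean(estimates) - theta

estimates = [8.0, 12.0, 8.0, 12.0]   # hypothetical: mean 10, population variance 4
theta = 8.0                          # assumed true value, so the bias is 2.0
c = 0.8

scaled = [c * e for e in estimates]

var_old = statistics.pvariance(estimates)
var_new = statistics.pvariance(scaled)
bias_old = bias(estimates, theta)
bias_new = bias(scaled, theta)

print(var_new, c ** 2 * var_old)                  # variance scales by c^2
print(bias_new, c * bias_old + (c - 1) * theta)   # general bias formula
```

With θ = 8 the scaled bias is close to 0 rather than 1.6, which shows that the shortcut c · Bias holds only when the (c − 1)θ term vanishes.
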
Q1.3.6
The original MSE is 4.5, and $\text{MSE} = \text{Bias}^2 + \text{Variance} + \text{IrreducibleError}$.
Let the initial bias be $B_{old}$, the initial variance be $V_{old}$, and the irreducible error be $\text{Noise}$:
$4.5 = B_{old}^2 + V_{old} + \text{Noise}$.
If the bias itself increases by 1.0, then $B_{new} = B_{old} + 1.0$ and $V_{new} = V_{old} - 1.5$, so
$\text{MSE}_{new} = (B_{old} + 1.0)^2 + (V_{old} - 1.5) + \text{Noise} = (B_{old}^2 + V_{old} + \text{Noise}) + 2B_{old} - 0.5 = 4.0 + 2B_{old}$.
This depends on $B_{old}$, which is not given, so that reading yields no specific number.
The intended interpretation is that the stated changes apply directly to the terms of the
decomposition, a common simplification in introductory problems: the $\text{Bias}^2$ term increases
by 1.0 and the Variance term decreases by 1.5.
Net change in MSE $= 1.0 - 1.5 = -0.5$.
$\text{MSE}_{new} = 4.5 - 0.5 = 4.0$.
Answer: 4.00
Q1.3.7
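Given: $\text{MSE} = 10$, $\text{Bias} = \text{Variance}$, $\text{IrreducibleError} = 2$.
$\text{MSE} = \text{Bias}^2 + \text{Variance} + \text{IrreducibleError}$
$10 = \text{Bias}^2 + \text{Bias} + 2$
$0 = \text{Bias}^2 + \text{Bias} - 8$
Using the quadratic formula, $x = (-b \pm \sqrt{b^2 - 4ac}) / (2a)$:
$\text{Bias} = (-1 \pm \sqrt{1^2 - 4(1)(-8)}) / (2(1))$
$\text{Bias} = (-1 \pm \sqrt{1 + 32}) / 2$
$\text{Bias} = (-1 \pm \sqrt{33}) / 2$
Since Bias must be positive, we take the positive root.
$\text{Bias} \approx (-1 + 5.744) / 2 = 4.744 / 2 = 2.372$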
Answer: 2.37
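
The quadratic in Q1.3.7 can be solved programmatically as a check. This is a minimal sketch; the function name is hypothetical and it assumes, as the question does, that the bias and the variance are numerically equal.

```python
import math

def bias_when_equal_to_variance(mse, irreducible_error):
    """Positive root of b^2 + b - (mse - irreducible_error) = 0."""
    target = mse - irreducible_error
    return (-1.0 + math.sqrt(1.0 + 4.0 * target)) / 2.0

b = bias_when_equal_to_variance(10.0, 2.0)
print(round(b, 2))
```
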
Q1.3.8
Given: $14.5 = \text{Bias}^2 + 8.5 + 2.0$.
$14.5 = \text{Bias}^2 + 10.5$.
$\text{Bias}^2 = 14.5 - 10.5 = 4.0$.
$\text{Bias} = \sqrt{4.0} = 2.0$.
Answer: 2.00
Q1.3.9
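The Mean Squared Error (MSE) is defined as $E[(\hat{y} - y_{true})^2]$. Using the bias-variance
decomposition:
$\text{MSE} = \text{Bias}^2 + \text{Variance}$.
$\text{Bias} = E[\hat{y}] - y_{true} = 5 - 7 = -2$.
$\text{Bias}^2 = (-2)^2 = 4$.
$\text{Variance} = \text{Var}(\hat{y}) = 2$.
$\text{MSE} = 4 + 2 = 6$.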
Answer: 6
Q1.3.10
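$\text{MSE}_{simple} = \text{Bias}^2 + \text{Variance} + \text{Noise} = 3^2 + 2 + 1.0 = 9 + 2 + 1 = 12$.
$\text{MSE}_{complex} = \text{Bias}^2 + \text{Variance} + \text{Noise} = 1^2 + 10 + 1.0 = 1 + 10 + 1 = 12$.
The difference in MSE is $12 - 12 = 0$.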
Answer: 0
Q1.3.11
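The Mean Squared Error is given by the formula $E[(\hat{\theta} - \theta)^2] = (E[\hat{\theta}] - \theta)^2 + \text{Var}(\hat{\theta})$.
$\text{Bias} = E[\hat{\theta}] - \theta = -1.5$.
$\text{Bias}^2 = (-1.5)^2 = 2.25$.
$\text{Var}(\hat{\theta}) = 3.0$.
$\text{MSE} = 2.25 + 3.0 = 5.25$.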
Answer: 5.25

Section 2: Cross-Validation

Q2.1.1 (C) The main objective of cross-validation is to estimate how well a predictive model
will perform in practice on new, unseen data, a metric known as generalization performance.
This helps to flag problems like overfitting or selection bias before deploying the model.11

Q2.1.2 (C) In a standard k-fold cross-validation, the dataset is partitioned into k equal-sized
folds. The process is then repeated k times. In each iteration, one fold is used for validation,
and the remaining k-1 folds are used for training. Therefore, k different models are trained,
and each observation is used for validation exactly once.11
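
The partitioning described in Q2.1.2 can be sketched without any library. This is a simplified illustration rather than any particular library's implementation: folds are contiguous and unshuffled, and the helper name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in splits:
    print("train:", train_idx, "val:", val_idx)
```

Calling `k_fold_indices(n, n)` gives leave-one-out cross-validation, since every validation fold then holds a single sample.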

Q2.1.3 (C) Leave-one-out cross-validation (LOOCV) is a specific type of k-fold
cross-validation where the number of folds, k, is equal to the number of observations, N. In
this method, a model is trained on N-1 data points and validated on the single left-out point.
This is repeated N times.11

Q2.1.4 (C) LOOCV is the most suitable method when the dataset is very small because it
trains the model on N−1 samples in each iteration. This maximizes the amount of data used for
training at each step, thereby wasting as little data as possible, and providing a nearly
unbiased estimate of the model's performance.12

Q2.1.5 (C) In time-series data, observations are not independent. A standard k-fold split
would randomly partition the data, potentially allowing future information to "leak" into the
training set for predicting the past. A blocking or forward-chaining split is the correct
strategy, as it respects the temporal order of the data, ensuring the validation set always
consists of future data points.12
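
A forward-chaining split as described in Q2.1.5 might look like the sketch below. It assumes the samples are already sorted by time and uses an expanding training window. The function name and the equal block size are illustrative choices, not a reference implementation.

```python
def forward_chaining_splits(n_samples, n_splits):
    """Expanding-window splits: train on the past, validate on the next block.

    Samples left over when n_samples is not divisible by (n_splits + 1)
    are simply not used in this simplified version.
    """
    block = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_idx = list(range(0, i * block))
        val_idx = list(range(i * block, (i + 1) * block))
        yield train_idx, val_idx

ts_splits = list(forward_chaining_splits(n_samples=12, n_splits=3))
for train_idx, val_idx in ts_splits:
    print("train:", train_idx, "val:", val_idx)
```

Every validation index is later than every training index, so no future observation leaks into training.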

Q2.1.6 (C) The primary disadvantage of LOOCV is its high computational cost. Since a model
must be trained and evaluated N times (where N is the number of samples), it becomes
computationally prohibitive for large datasets. While it provides a low-bias estimate of
performance, its variance can be high.12

Q2.1.7 (D) Exhaustive cross-validation methods evaluate the model by training and testing on
all possible combinations of training and testing data from the original sample. Leave-p-out
cross-validation, which uses every possible subset of p samples for validation, is an example
of an exhaustive method. In contrast, k-fold cross-validation is a non-exhaustive method as it
uses a random partition.11

Q2.1.8 (A) A low mean cross-validation error indicates that the model has performed well on
average across all the different validation folds. This is a strong indicator of good
generalization performance, which is the main goal of cross-validation.15

Q2.1.9 (C) The standard deviation of the performance metric across folds measures the
stability of the model. A high standard deviation indicates that the model's performance varies
significantly depending on the specific subset of training data it receives. A low standard
deviation suggests the model is more robust and consistent.15
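
Both summary statistics discussed in Q2.1.8 and Q2.1.9 are easy to compute from per-fold scores. The sketch below reuses the five R-squared values from Q2.3.2 and reports the sample standard deviation, which is one reasonable convention among several.

```python
import statistics

fold_scores = [0.85, 0.88, 0.79, 0.82, 0.86]   # R-squared values from Q2.3.2

mean_score = statistics.mean(fold_scores)
std_score = statistics.stdev(fold_scores)      # sample standard deviation (n - 1)

print(f"{mean_score:.2f} +/- {std_score:.2f}")
```
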

Q2.1.10 (C) In datasets with a severe class imbalance, a simple k-fold split may result in some
folds containing a very small number of samples from the minority class, or even none at all. A
stratified k-fold cross-validation is essential because it ensures that each fold maintains the
same proportion of class labels as the original dataset, leading to a more reliable performance
estimate.13
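
The idea behind stratification can be sketched by dealing each class's samples round-robin across the folds. This is a simplified stand-in for a library routine: it does not shuffle, and the helper name is hypothetical. The class counts reuse the 800/200 split from Q2.3.8 and Q2.3.9.

```python
from collections import defaultdict

def stratified_k_fold(labels, k):
    """Assign sample indices to k folds so each fold keeps the class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class's samples round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

labels = ["A"] * 800 + ["B"] * 200
folds = stratified_k_fold(labels, k=5)
for fold in folds:
    n_a = sum(labels[i] == "A" for i in fold)
    print(len(fold), n_a, len(fold) - n_a)
```
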

Q2.1.11 (B) The "pessimistic bias" of k-fold cross-validation arises because the surrogate
models are trained on a subset of the total data (k-1 folds), whereas the final model will be
trained on the entire dataset. If the learning curve has a positive slope, the surrogate models'
performance will be an underestimation (pessimistic) of the final model's true performance.15
Section 2: MSQs

Q2.2.1 (B), (D), (E) For an imbalanced dataset, stratified k-fold cross-validation is necessary
to ensure each fold has a representative proportion of classes (B). Randomly shuffling the
data (D) is a standard prerequisite for creating random, representative folds. A separate,
untouched test set is crucial for the final, unbiased evaluation after the cross-validation
process is complete and a final model has been chosen (E).12 Scaling the entire dataset before
splitting (C) would cause information leakage.

Q2.2.2 (A), (E) Information leakage occurs when information about the validation or test set is
inadvertently "leaked" into the training process. Applying normalization or feature selection
techniques on the entire dataset before splitting it is a classic example. This allows the model
to learn about the distribution of the validation data during training, leading to an overly
optimistic and unreliable performance estimate.14
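
The leakage described in Q2.2.2 comes down to where the scaling statistics are computed. The sketch below contrasts the two options on a made-up fold. The helper names and data are hypothetical.

```python
import statistics

def fit_scaler(values):
    """Return (mean, std) estimated from the given values only."""
    return statistics.mean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Standardize values with previously fitted statistics."""
    return [(v - mean) / std for v in values]

train = [1.0, 2.0, 3.0, 4.0]
val = [10.0, 12.0]   # hypothetical validation fold with a different range

# Leaky: statistics computed on training and validation data together.
leaky_mean, leaky_std = fit_scaler(train + val)

# Correct: statistics computed on the training fold only, then reused.
mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
val_scaled = transform(val, mean, std)

print("leaky mean:", leaky_mean, "train-only mean:", mean)
```

In a k-fold loop the fit step would be repeated inside each iteration, on that iteration's training folds only.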

Q2.2.3 (A), (B), (E) Nested cross-validation consists of an outer and an inner loop. The inner
loop is dedicated to hyperparameter tuning for a model, while the outer loop provides an
unbiased estimate of the generalization performance of the chosen model/hyperparameter
combination. It is a more robust method but is also more computationally expensive, as it
trains many models.12

Q2.2.4 (B), (C), (D) Cross-validation is superior to a simple train-test split because it uses all
data for both training and validation, providing a more robust performance estimate by
averaging results across multiple splits. It can help identify if a model is underfitting (high
bias), as the validation error will be high.11

Q2.2.5 (B), (C), (E) The statement emphasizes that the models trained during cross-validation
are merely "surrogate models" whose collective performance estimates the skill of the
model-building procedure (E). After using CV to select a procedure, the fitted surrogate
models are discarded (B). The final, production-ready model is then trained on the entire
dataset using the selected procedure (C).15

Q2.2.6 (C), (D), (E) A significant variance in performance across different folds suggests that
the model is unstable and highly sensitive to the specific training data it receives (C), which is
a symptom of high variance. This makes the single average performance metric less reliable. It
may also indicate that the data was not shuffled or stratified properly, leading to inconsistent
partitions (E).14

Q2.2.7 (B), (D), (E) LOOCV is a specific case of k-fold where k=N. By training on N−1 samples,
it provides a low-bias estimate of model performance because the training set is almost as
large as the entire dataset. It is computationally expensive for large datasets and is therefore
particularly useful for small ones.11

Q2.2.8 (B), (C) A standard k-fold split assumes independence of observations. If observations
within a group are not independent (e.g., medical data from a single patient or financial data
from a single company), using a standard split can lead to data leakage and an overly
optimistic performance estimate. The appropriate solution is to use a GroupKFold or
StratifiedGroupKFold to ensure data from a single group is not spread across training and
validation folds.12
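
The grouping constraint in Q2.2.8 can be illustrated with a minimal sketch that assigns whole groups to folds. It is not the scikit-learn implementation of GroupKFold (which also balances fold sizes); the function name and the patient IDs are made up.

```python
def group_k_fold(groups, k):
    """Yield (train_idx, val_idx) so that no group appears on both sides."""
    unique_groups = sorted(set(groups))
    fold_of_group = {g: i % k for i, g in enumerate(unique_groups)}
    for fold in range(k):
        val_idx = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        yield train_idx, val_idx

groups = ["p1", "p1", "p2", "p2", "p2", "p3", "p4", "p4"]   # hypothetical patient IDs
group_splits = list(group_k_fold(groups, k=2))
for train_idx, val_idx in group_splits:
    print("train:", train_idx, "val:", val_idx)
```
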

Q2.2.9 (B), (E) The scenario describes a mismatch in data distribution between the training
and test sets. This can be addressed by using a stratified k-fold cross-validation on binned
data (B) to ensure each fold has a representative range of values. A proper data scaling
strategy (E), where the scaler is fit only on the training data and then applied to the validation
data, is also crucial to prevent information leakage and ensure the model learns a
representative distribution.12

Q2.2.10 (E) The question is flawed because it cannot be answered with the information provided.
With an outer loop of 5 folds and an inner loop of 3 folds, the inner loop trains 3 models for
each of the 5 outer folds, totaling 5×3=15 models for a single hyperparameter combination.
However, the inner loop is normally used to compare several hyperparameter combinations, each
requiring a separate model on every inner fold, so the total is 15 multiplied by the number of
combinations tested, which is not given.12

Q2.2.11 (A), (C) The primary difference is the number of times the model is trained. The
Holdout method trains a single model, whereas k-fold cross-validation trains k models. This
allows k-fold to provide a more robust performance estimate. Because every observation is
used for both training and validation across the k iterations, k-fold is generally preferred for
smaller datasets, where a single Holdout split leaves too little data for reliable training or
evaluation.12

Section 2: NATs

Q2.3.1
Total samples = 500.
k=10.
Samples in validation set = 500/10=50.
Samples in training set = 500−50=450.
Answer: 450
Q2.3.2
The mean R-squared value is the sum of the values divided by the number of folds.
Mean = (0.85+0.88+0.79+0.82+0.86)/5=4.2/5=0.84.
Answer: 0.84
Q2.3.3
In LOOCV, the number of models trained is equal to the number of samples in the dataset.
Number of models = 200.
Answer: 200
Q2.3.4
Average accuracy = (0.92+0.94+0.90+0.96)/4=3.72/4=0.93.
Answer: 0.93
Q2.3.5
The outer loop has 5 folds. The inner loop has 4 folds. The inner loop tests 3 hyperparameter
combinations.
For each of the 5 outer folds, the inner loop runs 4 folds to test each of the 3 combinations.
This means 4×3=12 models are trained in the inner loop for each outer fold.
Total models trained = (Number of outer folds) × (Number of inner folds) × (Number of
hyperparameter combos)
=5×4×3=60.
Answer: 60
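
The count in Q2.3.5 can be written as a small helper. The optional flag is an addition for illustration: in practice the winning combination is often refit on each outer training set, which the answer above does not count.

```python
def nested_cv_fits(outer_folds, inner_folds, n_combinations, refit_per_outer_fold=False):
    """Count model fits in nested cross-validation.

    The inner loop fits one model per (inner fold, hyperparameter combination)
    for every outer fold. With refit_per_outer_fold=True, one extra fit per
    outer fold is added for refitting the selected combination.
    """
    fits = outer_folds * inner_folds * n_combinations
    if refit_per_outer_fold:
        fits += outer_folds
    return fits

print(nested_cv_fits(5, 4, 3))
print(nested_cv_fits(5, 4, 3, refit_per_outer_fold=True))
```
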
Q2.3.6
In LOOCV, for each iteration, one sample is left out for validation, and the rest are used for
training.
Training set size = 100−1=99.
Answer: 99
Q2.3.7
Average MAE = (15.2+16.5+14.8+17.1+15.9)/5=79.5/5=15.9.
Answer: 15.90
Q2.3.8
Total samples of Class A = 800.
Number of folds = 5.
Samples of Class A in each validation fold = 800/5=160.
Answer: 160
Q2.3.9
Total training samples for each fold = 1000×(4/5)=800.
Samples of Class A in training set = 800×(800/1000)=640.
Samples of Class B in training set = 800×(200/1000)=160.
Ratio = 640/160=4.0.
Answer: 4.00
Q2.3.10
In each of the 10 iterations, 90 samples are used for training (9 folds of 10 samples each). The
total count of samples used across all training sets is 10×90=900. However, the question asks
for the number of unique samples. Since each sample is used for training in 9 out of 10
iterations, all 100 unique samples are used in the training sets.
Answer: 100
Q2.3.11
In a nested cross-validation, the inner loop is dedicated to hyperparameter tuning and model
selection within a single outer fold. Therefore, for a single outer fold, the inner loop will train
models on its 5 folds.
Answer: 5

Section 3: Decision Trees

Q3.1.1 (B) A decision tree consists of internal nodes, which represent a test on a feature, and
leaf nodes, which represent the final class label or a predicted value. The leaf nodes are the
terminal points of the tree where a prediction is made.9

Q3.1.2 (B) Decision trees are often called "white box" models because their decision-making
process is easy to follow and interpret. The tree structure can be visualized, and the rules
from the root to the leaf are simple if-then-else statements that are easily understandable by
humans.18

Q3.1.3 (C) The CART (Classification and Regression Trees) algorithm is unique in that it is
capable of handling both classification and regression tasks. ID3 and C4.5 are primarily used
for classification.10

Q3.1.4 (C) The ID3 algorithm is a greedy algorithm that builds the tree top-down by selecting
the attribute that provides the maximum Information Gain at each step. Information Gain is a
measure of the reduction in entropy from a given split.10

Q3.1.5 (B) Entropy is a concept from information theory that measures the degree of disorder,
randomness, or impurity in a set of data. A perfectly pure node (all samples from the same
class) has an entropy of 0. A node with an equal mix of classes has maximum entropy.9

Q3.1.6 (A) Pruning is a critical technique used to simplify an over-complex decision tree. It
involves removing branches or nodes that provide little predictive power, thereby reducing the
model's complexity and preventing overfitting.9

Q3.1.7 (B) A decision stump is a single-level decision tree that consists of only a root node
and its immediate children (leaf nodes). It makes a decision based on just a single feature
test.9

Q3.1.8 (C) A major drawback of decision trees is their instability. Small changes in the training
data, such as adding or removing a few data points, can result in a completely different tree
structure. This problem is often mitigated by using ensemble methods.9
Q3.1.9 (B) Decision trees handle continuous variables by creating binary splits at a specific
threshold. For example, a split might be if age > 45, where age is a continuous variable. This
discretizes the data into two or more intervals based on the split point.9

Q3.1.10 (B) Because finding a globally optimal decision tree is an NP-complete problem,
practical algorithms like ID3, C4.5, and CART rely on greedy, heuristic approaches. They make
locally optimal decisions at each node without guaranteeing a globally optimal solution.18

Q3.1.11 (B) A regression decision tree makes predictions by averaging the target values of the
samples that fall into a specific leaf node. This results in a "piecewise constant" prediction
map, as the predicted value remains constant within each region defined by the tree's
boundaries.18

Q3.1.12 (D) Bagging, which is an ensemble technique, is used specifically to reduce variance
by averaging the predictions of multiple decision trees. Boosting, on the other hand, is
primarily used to reduce bias. Both are methods to manage the bias-variance trade-off.6

Section 3: MSQs

Q3.2.1 (A), (B), (D) A maximally deep, unpruned decision tree has a very high capacity to fit
the training data. This leads to low bias (A), as it can perfectly model the training set, but high
variance (B) because it is sensitive to noise and will not generalize well. This results in poor
performance on unseen data (D).5

Q3.2.2 (A), (C), (D) A single decision tree can be highly unstable and sensitive to small data
variations (A). Since practical algorithms are greedy, they cannot guarantee a globally optimal
tree (C). Decision trees can also produce biased trees if the dataset is not balanced and one
class dominates (D). They can, however, handle categorical data and are computationally
cheap for prediction.18

Q3.2.3 (A), (B), (E) Ensemble methods are a powerful strategy for managing the bias-variance
trade-off. Bagging, as used in Random Forest, reduces the variance of an ensemble by
averaging the predictions of multiple high-variance learners (A). Boosting, as in AdaBoost,
reduces the bias of the model by combining multiple weak, high-bias learners sequentially (B).
They are a key tool for managing the trade-off (E).6

Q3.2.4 (A), (B), (C) Common stopping criteria for a decision tree algorithm include: stopping
when the tree reaches a predefined maximum depth (A); stopping when a node contains
fewer than a minimum number of samples (B); or stopping when the information gain or Gini
impurity for a split is below a certain threshold (C).18

Q3.2.5 (A), (B), (C) Gini Impurity is a splitting criterion used in the CART algorithm (A). It
measures the probability of misclassifying a randomly chosen element if it were labeled
according to the distribution in the set (B). A Gini Impurity of 0.0 indicates a perfectly pure
node with all samples belonging to the same class (C).10

Q3.2.6 (A), (C), (D) When one class dominates, a decision tree may become biased towards
the majority class. This can be addressed by balancing the dataset before training (A) or using
an algorithm that adjusts the class weights to give more importance to the minority class (C).
Stratified k-fold cross-validation is also crucial (D) to ensure that the class distribution is
maintained in each training and validation fold.14

Q3.2.7 (B), (D) Decision tree algorithms are greedy and prefer to make splits that are "good"
at the top of the tree. This is because they use a metric like Information Gain to make locally
optimal decisions that maximally separate the data at each step. This preference for good
splits at the top results in a more efficient search of the hypothesis space.19

Q3.2.8 (A), (B), (E) A decision tree approximates a function with a series of if-then-else rules,
which results in a piecewise constant function. This makes it effective at modeling complex
non-linear relationships (B) but also means it is not good at extrapolation beyond the range of
the training data (A). A decision tree can represent any boolean function (E).18

Q3.2.9 (C), (D), (E) A deep tree has a high capacity to fit complex patterns, which leads to low
bias (D). However, if the depth is too large, the tree may begin to model noise in the training
data, leading to overfitting (C). Therefore, limiting the depth is a crucial form of regularization
to control the model's variance and prevent overfitting (E).9

Q3.2.10 (A), (B), (C) Variance in a decision tree is a measure of its sensitivity to the training
data. It can be controlled by limiting its complexity through mechanisms like setting a
maximum depth (B) or a minimum number of samples per leaf node (A). A more robust
approach is to use ensemble methods, like bagging, which trains multiple trees and averages
their predictions, thereby reducing the overall variance (C).6

Q3.2.11 (B), (C) C4.5 is an extension of the ID3 algorithm. C4.5 improves upon ID3 by
addressing some of its limitations. Specifically, C4.5 uses a Gain Ratio to prevent a bias
towards attributes with a large number of values (B). It can also handle both continuous
features and missing values, which ID3 cannot (C).10

Q3.2.12 (B), (D) The NP-complete nature of learning an optimal decision tree means that it is
not feasible to exhaustively search for the best tree. As a result, practical algorithms use
heuristic, greedy approaches (B) that make locally optimal decisions at each node. This
means that they cannot guarantee that the resulting tree will be the globally optimal solution
(D).18

Section 3: NATs

Q3.3.1
Entropy is calculated using the formula: $H(S) = -p_1 \log_2(p_1) - p_2 \log_2(p_2) - \dots$
Total samples = 20 + 10 = 30.
$p_{\text{ClassA}} = 20/30 = 2/3$.
$p_{\text{ClassB}} = 10/30 = 1/3$.
$H(S) = -(2/3)\log_2(2/3) - (1/3)\log_2(1/3)$
$H(S) = -(0.667)(-0.585) - (0.333)(-1.585) = 0.390 + 0.528 = 0.918$.
Answer: 0.92
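The entropy calculation for Q3.3.1 can be verified numerically. This is a minimal sketch using only the standard library; the function name `entropy` and its count-based interface are illustrative choices, not part of the question.

```python
import math


def entropy(counts: list[int]) -> float:
    """Shannon entropy (in bits) of a node given its per-class sample counts."""
    total = sum(counts)
    # Classes with zero samples contribute nothing (0 * log 0 is taken as 0).
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


h = entropy([20, 10])
print(round(h, 3))  # 0.918
```

A node with counts such as `[15, 15]` gives the maximum binary entropy of 1.0, and a pure node gives 0.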
Q3.3.2
Gini Impurity is calculated as $\text{Gini}(S) = 1 - \sum_{i=1}^{c} p_i^2$.
$p_{\text{ClassX}} = 50/100 = 0.5$.
$p_{\text{ClassY}} = 50/100 = 0.5$.
$\text{Gini}(S) = 1 - (0.5^2 + 0.5^2) = 1 - (0.25 + 0.25) = 1 - 0.5 = 0.5$.
Answer: 0.50
Q3.3.3
Gini Impurity of Node A:
Total samples in Node A = 30+10=40.
$p_{\text{positive}} = 30/40 = 0.75$.
$p_{\text{negative}} = 10/40 = 0.25$.
$\text{Gini}(A) = 1 - (0.75^2 + 0.25^2) = 1 - (0.5625 + 0.0625) = 1 - 0.625 = 0.375$.
Answer: 0.38
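Both Gini calculations (Q3.3.2 and Q3.3.3) follow the same pattern and can be checked with one helper. A minimal sketch, with `gini` as an illustrative function name operating on per-class counts:

```python
def gini(counts: list[int]) -> float:
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


gini_q2 = gini([50, 50])   # balanced node from Q3.3.2
gini_q3 = gini([30, 10])   # Node A from Q3.3.3
print(gini_q2, gini_q3)  # 0.5 0.375
```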
Q3.3.4
Information Gain is calculated as $IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$.
Parent Entropy $H(S) = -0.5\log_2(0.5) - 0.5\log_2(0.5) = 1.0$.
Child 1 Entropy
$H(S_1) = -(4/5)\log_2(4/5) - (1/5)\log_2(1/5) = -(0.8)(-0.322) - (0.2)(-2.322) = 0.258 + 0.464 = 0.722$.
Child 2 Entropy $H(S_2) = -(1/5)\log_2(1/5) - (4/5)\log_2(4/5) = 0.722$.
Information Gain = $H(S) - \left(\frac{5}{10}H(S_1) + \frac{5}{10}H(S_2)\right)$
$IG = 1.0 - (0.5 \times 0.722 + 0.5 \times 0.722) = 1.0 - 0.722 = 0.278$.
Answer: 0.278
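The information gain for Q3.3.4 can be recomputed from class counts. The sketch below reads the counts off the worked solution (a balanced parent of 10 samples split into children of 4/1 and 1/4); the function names are illustrative. The same code applied to a 50/50 parent with children 40/10 and 10/40 gives the same value, which is why Q3.3.8 arrives at 0.28.

```python
import math


def entropy(counts: list[int]) -> float:
    """Shannon entropy (in bits) of a node given its per-class sample counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent: list[int], children: list[list[int]]) -> float:
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


ig = information_gain([5, 5], [[4, 1], [1, 4]])
print(round(ig, 3))  # 0.278
```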
Q3.3.5
The reduction in MSE is the difference between the MSE of the parent node and the weighted
average MSE of the child nodes.
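MSE of parent node: $\bar{y} = 8.0$. $\text{MSE} = \frac{1}{10}\sum (y_i - \bar{y})^2 = \frac{1}{10}(36+25+9+9+4+1+0+1+4+49) = \frac{138}{10} = 13.8$.
MSE of Node A: $\bar{y}_A = 3.75$.
$\text{MSE}_A = \frac{1}{4}\sum (y_i - \bar{y}_A)^2 = \frac{1}{4}(3.0625+0.5625+1.5625+1.5625) = \frac{6.75}{4} = 1.6875$.
MSE of Node B: $\bar{y}_B = 9.167$.
$\text{MSE}_B = \frac{1}{6}\sum (y_i - \bar{y}_B)^2 = \frac{1}{6}(9.914+4.694+1.361+0.111+0.140+34.028) = \frac{50.25}{6} = 8.375$.
Weighted average MSE of child nodes =
$\frac{4}{10}\text{MSE}_A + \frac{6}{10}\text{MSE}_B = 0.4(1.6875) + 0.6(8.375) = 0.675 + 5.025 = 5.7$.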
Reduction in MSE = 13.8−5.7=8.1.
Answer: 8.10
Q3.3.6
In a binary decision tree, where every internal node has exactly two children, the number of
internal nodes is always one less than the number of leaf nodes. The number of splits is equal
to the number of internal nodes.
Number of splits = 3.
Answer: 3
Q3.3.7
Entropy is a measure of impurity. For a binary classification, it is 0 when a node is perfectly
pure (e.g., all samples belong to one class).
Answer: 0
Q3.3.8
Initial Entropy: Since the node has 50 samples of each class, its entropy is maximal at 1.0.
Child Node 1: $p_{\text{ClassA}} = 40/50 = 0.8$, $p_{\text{ClassB}} = 10/50 = 0.2$.
$H(S_1) = -0.8\log_2(0.8) - 0.2\log_2(0.2) = -(0.8)(-0.322) - (0.2)(-2.322) = 0.258 + 0.464 = 0.722$.
Child Node 2: $p_{\text{ClassA}} = 10/50 = 0.2$, $p_{\text{ClassB}} = 40/50 = 0.8$.
$H(S_2) = 0.722$.
Information Gain = $H(S) - \left(\frac{50}{100}H(S_1) + \frac{50}{100}H(S_2)\right)$
$IG = 1.0 - (0.5 \times 0.722 + 0.5 \times 0.722) = 1.0 - 0.722 = 0.278$.
Answer: 0.28
Q3.3.9
The XOR problem is non-linearly separable. A simple decision tree, which uses axis-aligned
splits, would require multiple nodes to solve it. It would split on the first feature, and then for
each child node, it would need to split on the second feature. This would require 3 internal
nodes and 4 leaf nodes.
Answer: 3
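The tree described for Q3.3.9 can be written out as nested axis-aligned tests. This is a hand-built sketch rather than a learned tree; it assumes binary inputs in {0, 1}, and the function name `xor_tree` is illustrative. Each `if` corresponds to one internal node and each `return` to one leaf.

```python
def xor_tree(x1: int, x2: int) -> int:
    """XOR expressed as a depth-2 decision tree with axis-aligned splits."""
    if x1 == 0:        # internal node 1 (root): test the first feature
        if x2 == 0:    # internal node 2: test the second feature
            return 0   # leaf 1
        return 1       # leaf 2
    if x2 == 0:        # internal node 3: test the second feature
        return 1       # leaf 3
    return 0           # leaf 4


predictions = {(a, b): xor_tree(a, b) for a in (0, 1) for b in (0, 1)}
print(predictions)
```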
Q3.3.10
In a regression tree, the prediction at a leaf node is the mean of the target values of all the
samples that fall into that node.
Predicted value = (10+12+11+13+14)/5=60/5=12.
Answer: 12
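As a quick check of Q3.3.10, the leaf prediction is just the arithmetic mean of the targets in that leaf. A one-line sketch with the standard library (the variable names are illustrative):

```python
from statistics import fmean

leaf_targets = [10, 12, 11, 13, 14]   # target values of the samples in the leaf
leaf_prediction = fmean(leaf_targets)
print(leaf_prediction)  # 12.0
```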

Section 4: Integrated Advanced Problems

Q4.1.1 (B) The model performs perfectly on training data (low training error) but poorly on
unseen data (high test error). This is a classic symptom of a model with high variance, which
has overfit the training data.3

Q4.1.2 (B), (D), (E) Applying a data-scaling technique on the entire dataset before splitting it
is a critical mistake that leads to information leakage. The scaling process uses information
from the entire dataset, including the validation set, to determine the scaling parameters (e.g.,
mean and standard deviation). The model is then implicitly trained on this leaked information,
resulting in an overly optimistic performance estimate. The cross-validation process should be
re-architected to perform the scaling within each fold, using only the training data of that fold
to fit the scaler.12
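The fix described for Q4.1.2 can be sketched in a few lines: the scaling parameters are computed from the training portion of a fold only and then applied unchanged to the validation portion. The example below is a minimal standard-library illustration on a single hypothetical feature; the helper name `standardize_within_fold` and the toy values are assumptions, not part of the question.

```python
from statistics import fmean, pstdev


def standardize_within_fold(
    train: list[float], valid: list[float]
) -> tuple[list[float], list[float]]:
    """Fit mean and standard deviation on the training split only, then apply to both."""
    mu = fmean(train)
    sigma = pstdev(train)   # assumes the training values are not all identical

    def scale(xs: list[float]) -> list[float]:
        return [(x - mu) / sigma for x in xs]

    return scale(train), scale(valid)


values = [1.0, 2.0, 3.0, 4.0, 100.0]   # hypothetical feature; last value is held out
train_raw, valid_raw = values[:4], values[4:]
train_scaled, valid_scaled = standardize_within_fold(train_raw, valid_raw)

# The held-out value never influenced mu or sigma, so it stays an obvious outlier.
print(train_scaled, valid_scaled)
```

Had the mean and standard deviation been computed on all five values, the held-out point would have shifted both parameters, which is the leakage the explanation warns about.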

Q4.1.3
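$\text{MSE}_{\text{simple}} = \text{Bias}^2 + \text{Variance} + \text{Noise} = 3.0^2 + 2.0 + 1.0 = 9 + 2 + 1 = 12$.
$\text{MSE}_{\text{complex}} = \text{Bias}^2 + \text{Variance} + \text{Noise} = 1.0^2 + 10.0 + 1.0 = 1 + 10 + 1 = 12$.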
The simple model and the complex model have the same expected MSE of 12, so the
bias-variance decomposition alone does not favor either one. In practice the simpler model is
generally preferred for its interpretability and robustness unless performance is the sole
criterion. The numerical answer is 12 (equivalently 12.0).
Answer: 12
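The decomposition used in Q4.1.3 is simple enough to check directly. A minimal sketch, with `expected_mse` as an illustrative helper name:

```python
def expected_mse(bias: float, variance: float, noise: float) -> float:
    """Expected squared error as bias squared plus variance plus irreducible noise."""
    return bias ** 2 + variance + noise


mse_simple = expected_mse(bias=3.0, variance=2.0, noise=1.0)
mse_complex = expected_mse(bias=1.0, variance=10.0, noise=1.0)
print(mse_simple, mse_complex)  # 12.0 12.0
```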
Q4.1.4 (A), (B), (C), (D) By setting a small max_depth (e.g., 2), the base decision trees are kept
simple and are likely to underfit, giving them high bias (A). However, the bagging process of
Random Forest averages the predictions of these high-bias trees, which significantly reduces
the overall variance of the ensemble (B), thereby improving its generalization performance.
Nested cross-validation is a robust technique to tune hyperparameters without data leakage
and to provide an unbiased estimate of the final model's performance from the outer loop (C,
D).6

Q4.1.5
The Gini Index of a split is the weighted average of the Gini Impurities of the child nodes.
Gini Index = (0.40×0.2)+(0.60×0.5)=0.08+0.30=0.38.
Answer: 0.38
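The weighted average in Q4.1.5 can be checked as follows. This is a small sketch; `weighted_impurity` is an illustrative name, and the weights are the fractions of samples sent to each child.

```python
def weighted_impurity(weights: list[float], impurities: list[float]) -> float:
    """Size-weighted average of the child-node impurities for one split."""
    return sum(w * g for w, g in zip(weights, impurities))


split_gini = weighted_impurity([0.40, 0.60], [0.2, 0.5])
print(round(split_gini, 2))  # 0.38
```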
Q4.1.6 (A), (B), (E) The initial result (98% training accuracy vs. 65% test accuracy) is a
textbook case of overfitting, which is caused by high variance (E). When a cross-validation is
performed, it should confirm this finding. The average cross-validation accuracy, which
estimates the generalization performance, should be close to the test accuracy (A).
Furthermore, because the model is sensitive to data fluctuations, the accuracy across the
different folds is likely to have a high standard deviation (B).5

Q4.1.7
For max_depth = 5, the average accuracy is 0.85.
For max_depth = 10, the average accuracy is 0.87.
Difference = 0.87−0.85=0.02.
Answer: 0.02
Q4.1.8 (C), (E) The purpose of cross-validation is to estimate the performance of a
model-building procedure (C). The decision tree is likely to be a highly flexible, high-variance
model, while the Ridge Regression model, by its nature, is a simpler, more biased model that is
regularized to reduce variance. Therefore, the Ridge Regression model is likely to have lower
variance than an unpruned decision tree (E). The final model should be trained on the entire
dataset after the selection process is complete.5

Q4.1.9
The estimated generalization error is the mean of the performance metric across all folds.
Mean MAE = (5.1+5.5+4.8+5.2+5.0+5.4+4.9+5.3+5.6+5.7)/10=52.5/10=5.25.
This average value serves as the best estimate of the final model's performance on unseen
data.15

Answer: 5.25
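The fold average for Q4.1.9 can be reproduced with the standard library. The sketch below also computes the sample standard deviation across folds, which is not asked for in the question but is a common companion statistic for judging how stable the estimate is.

```python
from statistics import fmean, stdev

fold_mae = [5.1, 5.5, 4.8, 5.2, 5.0, 5.4, 4.9, 5.3, 5.6, 5.7]

mean_mae = fmean(fold_mae)     # estimated generalization error
spread_mae = stdev(fold_mae)   # fold-to-fold variability (sample standard deviation)
print(round(mean_mae, 2))  # 5.25
```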
Q4.1.10 (B), (C), (E) In a standard k-fold cross-validation on an imbalanced dataset, the splits
may not be representative of the true class distribution, leading to some folds with a
disproportionately low number of minority class samples (C). This causes the model to learn a
skewed distribution and overfit to the majority class (B), resulting in poor performance on
unseen data. The appropriate solution is to use a stratified k-fold cross-validation, which
preserves the class proportions in each fold (E).12

Conclusion: Summary and Strategic Synthesis

This question bank provides a comprehensive framework for mastering the concepts of
Decision Trees, the Bias-Variance Trade-off, and Cross-Validation. The questions are designed
to challenge aspirants to move beyond surface-level definitions and engage with the material
at a deeper, more analytical level.

The interconnections between these topics are not accidental; they represent the core
challenges of building robust machine learning models. A decision tree, for instance,
serves as a perfect case study for the bias-variance trade-off, where an unpruned tree
exhibits high variance and a pruned tree represents a more balanced model. Similarly,
cross-validation is the essential tool for diagnosing and managing this trade-off, providing a
reliable estimate of generalization performance. The most complex problems in machine
learning, and thus the most challenging questions in a competitive exam, are often those that
require a synthesis of these fundamental principles.

The answers provided in this document are more than just solutions; they are mini-tutorials
that deconstruct complex problems and reveal the underlying logic. A thorough review of
these explanations will not only solidify an aspirant's knowledge but will also cultivate a more
strategic, expert-level understanding of machine learning model development.

Works cited

1.​ GATE 2024 Data Science and Artificial Intelligence Question Paper ..., accessed
August 29, 2025,
[Link]
gence-question-paper
2.​ GATE DA 2024 Question Paper and Answer Key | Download PDF ..., accessed
August 29, 2025,
[Link]
3.​ What is Bias-Variance Tradeoff? - IBM, accessed August 29, 2025,
[Link]
4.​ Bias and Variance in Machine Learning - GeeksforGeeks, accessed August 29,
2025,
[Link]
earning/
5.​ Devinterview-io/bias-and-variance-interview-questions ... - GitHub, accessed
August 29, 2025,
[Link]
6.​ Bias–variance tradeoff - Wikipedia, accessed August 29, 2025,
[Link]
7.​ (PDF) The Bias-Variance Tradeoff: How Data Science Can Inform Educational
Debates - ResearchGate, accessed August 29, 2025,
[Link]
_How_Data_Science_Can_Inform_Educational_Debates
8.​ IEOR 165 – Lecture 15 Bias-Variance Tradeoff, accessed August 29, 2025,
[Link]
df
9.​ Decision Trees Quiz Questions | Aionlinecourse, accessed August 29, 2025,
[Link]
es
10.​MCQs | Decision Tree Algorithm for Classification and Regression ..., accessed
August 29, 2025, [Link]
11.​ Cross-validation (statistics) - Wikipedia, accessed August 29, 2025,
[Link]
12.​Cross-Validation in Machine Learning: How to Do It Right - [Link], accessed
August 29, 2025,
[Link]
13.​Cross Validation in Machine Learning - GeeksforGeeks, accessed August 29,
2025,
[Link]
ing/
14.​Solving 9 Common Cross-Validation Mistakes | by Jan Marcel ..., accessed August
29, 2025,
[Link]
mistakes-ac8a6a6944e7
15.​How to choose a predictive model after k-fold cross-validation ..., accessed
August 29, 2025,
[Link]
model-after-k-fold-cross-validation
16.​317 questions with answers in CROSS-VALIDATION | Science topic -
ResearchGate, accessed August 29, 2025,
[Link]
17.​decision tree University Quiz | Wayground (formerly Quizizz), accessed August 29,
2025,
[Link]
18.​1.10. Decision Trees - Scikit-learn, accessed August 29, 2025,
[Link]
19.​Decision Trees Flashcards | Quizlet, accessed August 29, 2025,
[Link]
20.​Decision Trees for Classification: A Machine Learning Algorithm - Xoriant,
accessed August 29, 2025,
[Link]
ng-algorithm
21.​1 Decision Trees (13 pts) - courses, accessed August 29, 2025,
[Link]
22.​Decision Tree Tutorials & Notes | Machine Learning - HackerEarth, accessed
August 29, 2025,
[Link]
thms/ml-decision-tree/tutorial/
23.​Decision Tree qUIZE | PDF | Algorithms | Artificial Intelligence - Scribd, accessed
August 29, 2025,
[Link]
