DATA SCIENCE ASSIGNMENT
1. Perform Exploratory Data Analysis (EDA) on Iris dataset
Ans:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset bundled with seaborn
iris = sns.load_dataset('iris')
# Display the first few rows
print(iris.head())
# Pairplot to visualize pairwise relationships, colored by species
sns.pairplot(iris, hue='species')
plt.show()
# Boxplot for each numeric feature
plt.figure(figsize=(10, 6))
sns.boxplot(data=iris)
plt.xticks(rotation=45)
plt.show()
Output:
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
2. What is a decision tree? Draw a decision tree by taking the example of
Play Tennis.
Ans: A decision tree is a flowchart-like model used to make decisions in machine learning. It
consists of nodes, which represent decisions (tests on a feature); branches, which represent the
possible outcomes of those tests; and leaves, which represent the final results. The tree is
constructed from data, which gives a systematic approach to decision-making and prediction.
OUTLOOK
|-- RAINY --> WINDY
|   |-- HIGH --> CANNOT PLAY
|   |-- NORMAL --> CAN PLAY
|-- CLOUDY --> CAN PLAY
|-- SUNNY --> HUMIDITY
    |-- HIGH --> CANNOT PLAY
    |-- NORMAL --> CAN PLAY
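The Play Tennis tree can also be read as nested if/else rules. The sketch below is one possible
encoding, assuming the branch layout of the diagram (WINDY under RAINY, HUMIDITY under SUNNY); the
function name can_play and the lowercase string values are illustrative choices, not part of the
assignment.

```python
def can_play(outlook, windy="normal", humidity="normal"):
    """Follow the Play Tennis tree from the root (outlook) down to a leaf."""
    if outlook == "cloudy":
        return True                    # leaf: CAN PLAY
    if outlook == "rainy":
        return windy == "normal"       # HIGH wind -> CANNOT PLAY
    if outlook == "sunny":
        return humidity == "normal"    # HIGH humidity -> CANNOT PLAY
    raise ValueError(f"unknown outlook: {outlook!r}")


print(can_play("sunny", humidity="high"))   # False
print(can_play("rainy", windy="normal"))    # True
```

Each return statement corresponds to one leaf, and each if corresponds to one branch out of a node.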
3. In k-means or KNN, we use Euclidean distance to calculate the
distance between nearest neighbors. Why not Manhattan distance?
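Ans: Euclidean and Manhattan distance are both valid distance metrics; they suit different
scenarios. Euclidean distance is the straight-line distance between two points, whereas
Manhattan distance, also known as L1 distance, is the sum of the absolute differences between
the coordinates of the points.
Euclidean distance squares each coordinate difference, so a large difference in a single
coordinate dominates the result; it works best when all dimensions are on comparable scales.
Manhattan distance grows only linearly with each difference, so it is less sensitive to a large
deviation in any one dimension and may be more suitable when the data contain outliers or have
many dimensions. KNN can use either metric. Standard k-means is tied to (squared) Euclidean
distance because its centroid update, the mean, is the point that minimizes the sum of squared
Euclidean distances; with Manhattan distance the minimizer is the coordinate-wise median, which
gives the k-medians variant.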
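A small sketch of the two metrics, written with the standard library only. The function names and
the sample points are chosen for illustration.

```python
import math


def euclidean(p, q):
    """Straight-line (L2) distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))


p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

For the same pair of points the Manhattan distance is never smaller than the Euclidean one, so the
two metrics can rank neighbors differently.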
4. How to test and know whether or not we have overfitting problem?
Ans: Overfitting is a common problem in machine learning where a model learns the training
data too well, including its noise and outliers, and performs poorly on new, unseen data.
HOLDOUT VALIDATION:
Split your dataset into training and validation sets.
Train your model on the training set and evaluate its performance on the validation set. A
training score that is much higher than the validation score indicates overfitting.
LEARNING CURVES:
Plot learning curves that show the model’s performance on both the training and validation sets
over time (for example, per epoch or per training-set size).
If the training performance keeps improving while the validation performance stalls or worsens,
the model is overfitting.
PERFORMANCE METRICS:
Monitor performance metrics such as accuracy, recall, precision etc., on both training and
validation sets.
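DATA AUGMENTATION:
Use data augmentation to artificially increase the size of your training set. This is a remedy
rather than a test: it is applied once overfitting has been detected.
FEATURE SELECTION:
Evaluate whether all the features used in the model are necessary; if not, remove the irrelevant
features. This also reduces overfitting rather than detecting it.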
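The holdout check can be sketched in a few lines. This is a minimal illustration, assuming a toy
one-dimensional dataset invented for the example (two training labels are deliberately noisy) and
a 1-nearest-neighbor model, which memorizes its training data.

```python
def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]


def accuracy(train, data):
    """Fraction of (x, label) pairs in data that the 1-NN model gets right."""
    correct = sum(predict_1nn(train, x) == y for x, y in data)
    return correct / len(data)


# Made-up data: the underlying rule is "label 1 when x > 5";
# the points at x=3.0 and x=8.0 carry noisy labels.
train = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0),
         (6.0, 1), (7.0, 1), (8.0, 0), (9.0, 1)]
val = [(1.4, 0), (2.9, 0), (4.2, 0), (5.8, 1), (7.9, 1), (9.2, 1)]

train_acc = accuracy(train, train)
val_acc = accuracy(train, val)
print(f"train={train_acc:.2f} val={val_acc:.2f} gap={train_acc - val_acc:.2f}")
# train=1.00 val=0.67 gap=0.33
```

The model scores perfectly on the data it memorized but noticeably worse on the held-out points,
which is the signature of overfitting described above.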
5. How is KNN different from k-means clustering?
Ans: KNN (k-Nearest Neighbors) and k-means clustering are two different machine learning
techniques used for different purposes:
1. KNN (k-Nearest Neighbors):
Type: Supervised learning algorithm.
Purpose: Used for classification and regression tasks.
Operation: Predicts the class or value of a data point based on the majority class or average
value of its k-nearest neighbors in the feature space.
Training: No explicit training step (lazy learning); the entire training dataset is stored in
memory and used at prediction time.
Usage: Commonly used for pattern recognition and predictive modeling.
2. k-means Clustering:
Type: Unsupervised learning algorithm.
Purpose: Used for clustering or grouping similar data points together.
Operation: Divides the dataset into k clusters based on similarity in feature space, with each
cluster represented by its centroid.
Training: Iteratively updates cluster centroids until convergence.
Usage: Commonly used for segmentation and identifying natural groupings in data.
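The contrast can be sketched side by side on one-dimensional data. This is a simplified
illustration, not a production implementation; the function names, the toy points, and the
starting centroids are assumptions made for the example.

```python
from collections import Counter


def knn_predict(labeled, x, k=3):
    """Supervised: majority label among the k labeled points nearest to x."""
    nearest = sorted(labeled, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def kmeans_1d(points, centroids, iterations=10):
    """Unsupervised: alternate cluster assignment and centroid update."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for x in points:
            idx = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[idx].append(x)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


labeled = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (8.0, "b"), (8.5, "b"), (9.0, "b")]
points = [x for x, _ in labeled]          # k-means never sees the labels

print(knn_predict(labeled, 2.5))          # a
print(kmeans_1d(points, [0.0, 10.0]))     # [1.5, 8.5]
```

KNN needs the labels to answer a query, while k-means receives only the raw points and returns
cluster centers.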
6. Can you explain the difference between a Test Set and a Validation Set?
Ans:
1. Validation Set:
Purpose: The validation set is used to tune the model during the training phase.
Usage: After the model is trained on the training set, it is evaluated on the validation set. This
evaluation guides hyperparameter adjustment and decisions about the model architecture.
Detects Overfitting: Because the model is not fitted on it, the validation set reveals overfitting
to the training data and provides a separate dataset for model adjustment.
Data Separation: Typically, the dataset is split into training, validation, and test sets, with
only the training set used for actual model fitting.
2. Test Set:
Purpose: The test set is used to assess the generalization performance of the model after it has
been trained and tuned.
Usage: Once the model is trained and tuned using the training and validation sets, it is evaluated
on the test set to provide an unbiased estimate of its performance on new, unseen data.
Guards Against Overfitting to the Validation Set: It reveals whether the model has overfit to the
validation set by providing a completely independent dataset for evaluation.
Data Separation: The test set is kept entirely separate from the training and validation sets.
7. How can you avoid overfitting in KNN?
Ans:
Choose an optimal value for k (a very small k, such as k = 1, tends to overfit)
Select relevant features
Standardize input features
Employ data augmentation
Use cross-validation
Use an appropriate distance metric
8. What is precision?
Ans: Precision is a metric used in machine learning to measure the accuracy of positive
predictions made by a model. It is the ratio of true positive predictions to the total number of
positive predictions (true positives + false positives).
9. Explain How a ROC Curve works.
Ans:
Evaluate Model Performance: ROC curve assesses how well a classification model works.
Threshold Adjustment: Varies the decision threshold to see how the model performs at different
levels of sensitivity and specificity.
True Positive Rate (TPR): The proportion of actual positive cases that are correctly identified,
TP / (TP + FN).
False Positive Rate (FPR): The proportion of actual negative cases that are incorrectly classified
as positive, FP / (FP + TN).
Graphical Representation: Plots TPR against FPR to visualize the trade-off.
Ideal vs. Random: A perfect model's curve hugs the top-left corner, while a random guess forms
a diagonal line.
Area Under the Curve (AUC): A single number summarizing overall model performance.
Interpretation: A higher AUC suggests better model discrimination.
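The threshold sweep and the area computation can be sketched without any library. The labels and
scores below are made up for illustration, and the trapezoidal rule is one common way to compute
the area, assumed here.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by lowering the threshold through each distinct score."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the piecewise-linear curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up labels (1 = positive) and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(y_true, scores)
print(curve)
print(round(auc(curve), 3))  # 0.889
```

The result, 8/9, equals the fraction of positive/negative pairs in which the positive example
receives the higher score, which is another way to interpret AUC.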
10. What is Accuracy?
Ans: Accuracy measures how often a classification model makes correct predictions overall. It is
the ratio of correct predictions to the total number of predictions. However, it can be misleading
when the dataset is imbalanced, because a model that always predicts the majority class still
scores high.
11. What is F1 Score?
Ans: The F1 Score is a single metric that balances precision and recall in a classification model.
It is their harmonic mean, F1 = 2 * Precision * Recall / (Precision + Recall), and ranges from 0
to 1; it is high only when both precision and recall are high.
12. What is Recall?
Ans: Recall measures how well a model finds all the relevant instances of a class. It's the ratio of
correctly identified positive cases to all actual positives. High recall means the model is good at
capturing positives, but it doesn't tell us about false positives.
13. What is a Confusion Matrix, and why do we need it?
Ans: A confusion matrix is a table that shows how well a classification model is performing. It
compares the predicted values to the actual values and breaks them down into categories: true
positives, true negatives, false positives, and false negatives.
Why we need it:
- It provides a clear summary of a model's performance.
- Helps identify where the model is making mistakes.
- Useful for calculating various metrics like accuracy, precision, recall, and F1 score, which
give a more nuanced understanding of the model's strengths and weaknesses.
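As a worked example, the four cells of a binary confusion matrix and the metrics derived from them
can be computed as follows. The label vectors are invented for illustration; 1 is taken to be the
positive class.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


# Made-up actual and predicted labels.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)                    # 3 5 1 1
print(accuracy, precision, recall, f1)   # 0.8 0.75 0.75 0.75
```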
14. What do you mean by AUC curve?
Ans: AUC stands for Area Under the (Receiver Operating Characteristic) Curve. It is not a curve
itself but a single number that summarizes the ROC curve of a binary classification model. It
measures the model's ability to distinguish between the classes: a value close to 1 indicates
strong classification performance, while 0.5 corresponds to random guessing.
15. What is Precision-Recall Trade-Off?
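Ans: The Precision-Recall trade-off refers to the balance between precision (the fraction of true
positives among all positive predictions) and recall (the fraction of actual positives that are
correctly identified). Raising the decision threshold usually increases precision but lowers
recall, and lowering it does the opposite, so the threshold is chosen according to which type of
error is more costly.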
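The trade-off becomes visible when the decision threshold is moved over a fixed set of scores. The
sketch below uses made-up labels and scores; returning a precision of 1.0 when nothing is
predicted positive is a convention assumed here.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
    recall = tp / (tp + fn)
    return precision, recall


# Made-up labels (1 = positive) and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

for threshold in (0.3, 0.5, 0.75):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
# threshold=0.3: precision=0.60 recall=1.00
# threshold=0.5: precision=0.75 recall=1.00
# threshold=0.75: precision=1.00 recall=0.67
```

As the threshold rises, precision climbs from 0.60 to 1.00 while recall eventually drops from 1.00
to 0.67.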
16. What are Decision Trees?
Ans: Decision Trees are supervised machine learning algorithms that make decisions by splitting
data based on features to predict outcomes.
17. Explain the structure of a Decision Tree
Ans: A Decision Tree has a hierarchical structure with the following parts:
1. Root Node: Represents the entire dataset and is the starting point for the tree.
2. Internal Nodes: Represent features in the dataset. Each internal node tests a specific feature's
value.
3. Branches: Connect nodes and represent the outcome of a feature test.
4. Leaf Nodes: Terminal nodes that represent the final outcome or decision. Each leaf node
corresponds to a class label or a numerical value.
The tree is built by recursively splitting the dataset based on features until a stopping
criterion is met or no further improvement is observed in the data's homogeneity.
18. What are some advantages of using Decision Trees?
Ans: Advantages of using Decision Trees:
1. Interpretability: Easy to understand and visualize.
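2. No Distributional Assumptions: Non-parametric, so it does not assume a particular data
distribution and needs no feature scaling.
3. Non-linearity: Captures non-linear relationships.
4. Handles Missing Values: Some implementations can work with missing values directly (for
example, through surrogate splits in CART).
5. Efficient: Trains and predicts quickly, especially on small to medium-sized datasets.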
6. Handles Mixed Data: Suitable for both numerical and categorical data.
7. Interaction Effects: Captures interaction effects between features.
19. How is a Random Forest related to Decision Trees?
Ans: A Random Forest is an ensemble (collection) of decision trees. It combines the predictions of
many trees to improve accuracy and reduce overfitting. Each tree is trained on a random bootstrap
sample of the data, and each split considers only a random subset of the features.
20. How are the different nodes of decision trees represented?
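Ans: Nodes in a decision tree represent conditions or decisions. The types of nodes in a decision
tree are:
1. Root node: the topmost node, holding the first test.
2. Internal nodes/Decision nodes: each tests a feature.
3. Leaf nodes/Terminal nodes: each holds a final prediction.
Branches are not nodes; they are the edges that connect nodes and carry the outcome of each test.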
21. What type of node is considered Pure?
Ans: A node in a decision tree is considered "pure" if all the data points (samples) at that node
belong to the same class or category.
22. How would you deal with an Overfitted Decision Tree?
Ans: To address overfitting in a decision tree:
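1. Prune the tree.
2. Limit the tree depth.
3. Increase the minimum number of samples required to split a node (or to form a leaf).
4. Regularize the model (for example, with a cost-complexity penalty).
5. Validate with a separate set or cross-validation.
6. Consider feature selection.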
23. What are some disadvantages of using Decision Trees and how would you
solve them?
Ans: Disadvantages of Decision Trees:
1. Prone to overfitting.
2. Can be unstable: small variations in the data can result in a very different tree.
3. May not capture some relationships well (for example, smooth or linear trends need many
axis-aligned splits).
Solutions:
1. Prune the tree or limit its depth.
2. Use bagging-based ensembles such as Random Forest to average out the instability.
3. Use boosting or combine trees with other algorithms.
24. What is Gini Index and how is it used in Decision Trees?
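Ans: The Gini Index measures the impurity of a dataset: Gini = 1 - sum of p_i^2 over the classes,
where p_i is the fraction of samples in class i. It is 0 for a pure node. In decision trees, it is
used to find the best splits (the split with the largest impurity reduction) and to determine
feature importance based on that reduction.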
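A short sketch of the Gini computation for a node and for a candidate split. The label lists are
invented for illustration, and weighted_gini is a hypothetical helper name.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(groups):
    """Impurity after a split: child impurities weighted by child size."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups)


# Made-up node with 5 "yes" and 5 "no", and one candidate split of it.
parent = ["yes"] * 5 + ["no"] * 5
split = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]

print(gini(parent))                    # 0.5
print(round(weighted_gini(split), 2))  # 0.32
```

The split lowers the impurity from 0.5 to about 0.32; a tree builder would compare this reduction
across all candidate splits and keep the largest.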
25. How would you define the Stopping Criteria for decision trees?
Ans: Stopping criteria for decision trees are:
1. Minimum samples per node.
2. Maximum tree depth.
3. Minimum node impurity.
4. Maximum leaf nodes.
5. Threshold for improvement in impurity.
26. What is entropy?
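Ans: Entropy measures the uncertainty (impurity) of a dataset: H = -sum of p_i * log2(p_i) over
the classes, where p_i is the fraction of samples in class i. It is used in decision trees to find
the best feature splits, aiming to reduce entropy and classify data more effectively.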
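Entropy in bits can be computed directly from class counts. This is a minimal sketch; the label
lists are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of the class distribution in labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


print(entropy(["yes", "no"]))   # 1.0 (maximum uncertainty for two classes)
print(entropy(["yes"] * 4))     # 0.0 (pure node)
print(round(entropy(["yes"] * 9 + ["no"] * 5), 3))  # 0.94
```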
27. How do we measure the Information?
Ans: Information is measured in bits (binary digits). The basic measure is Shannon entropy;
related measures include Kullback-Leibler divergence, mutual information, data compression
effectiveness, and perplexity. The appropriate measure depends on the context and type of
information; in decision trees it is the reduction in entropy (information gain).
28. What is the difference between Post-pruning and Pre-pruning?
Ans:
1. Pre-pruning:
Definition: Pre-pruning involves setting constraints on the tree-building process before the tree
is fully grown. These constraints dictate when to stop growing the tree.
Strategy: Common pre-pruning techniques include setting a maximum depth for the tree, setting
a minimum number of samples required to split an internal node, or setting a minimum number
of samples required to be at a leaf node.
Advantage: Pre-pruning can be computationally more efficient as it avoids building an overly
complex tree in the first place.
2. Post-pruning:
Definition: Post-pruning, also known as "pruning by error estimation," involves growing a full
decision tree first and then removing (or collapsing) parts of the tree that do not provide
significant improvement in prediction accuracy.
Strategy: After the tree is fully grown, each subtree is evaluated on held-out data (e.g., a
validation set or cross-validation). If removing a subtree does not significantly decrease
the tree's predictive performance, that subtree (or branch) is pruned (removed).
Advantage: Post-pruning can potentially result in more accurate trees since the full tree is first
constructed, and then unnecessary branches are pruned based on actual performance.
29. Compare Linear Regression and Decision Trees
Ans:
Linear Regression:
Model Type: Supervised learning for regression tasks.
Model Nature: Assumes a linear relationship between independent and dependent variables.
Fitted Function: Produces a straight line (in simple linear regression) or hyperplane (in
multiple linear regression) to make predictions.
Interpretability: Easy to interpret, especially in simple cases.
Overfitting: Can overfit when there are many features relative to the number of samples;
regularization (e.g., ridge or lasso) reduces this.
Decision Trees:
Model Type: Supervised learning for both regression and classification tasks.
Model Nature: Makes decisions based on feature splits at internal nodes.
Fitted Function: Produces a piecewise constant approximation to the target variable.
Interpretability: Can be easy to interpret, but can also grow complex trees that are harder to
understand.
Overfitting: Can overfit, especially if the tree is allowed to grow too deep.
30. What is the relationship between Information Gain and Information
Gain Ratio?
Ans: Information Gain (IG) measures how much splitting on a feature reduces uncertainty.
Information Gain Ratio (IGR) normalizes IG by the split information of the feature, which
penalizes features with many distinct values. IG is the raw reduction in uncertainty, while IGR
adjusts for how finely the feature partitions the data.
Information Gain (IG) formula:
IG(D, A) = H(D) - H(D|A)
Information Gain Ratio (IGR) formula:
IGR(D, A) = IG(D, A) / SplitInfo(A)
where SplitInfo(A) = -sum over values v of A of (|D_v| / |D|) * log2(|D_v| / |D|).
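The two formulas can be checked on a tiny example. In this sketch the split information is taken
to be the entropy of the attribute's own value distribution (the usual C4.5 definition, assumed
here), and the data and the name info_gain_and_ratio are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy in bits of the distribution of values."""
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())


def info_gain_and_ratio(attribute_values, labels):
    """Return (IG, IGR) for splitting labels by attribute_values."""
    n = len(labels)
    groups = defaultdict(list)
    for value, label in zip(attribute_values, labels):
        groups[value].append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())  # H(D|A)
    gain = entropy(labels) - conditional                                 # IG(D, A)
    split_info = entropy(attribute_values)                               # Split Info (A)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


labels = ["yes", "yes", "no", "no"]
binary = ["a", "a", "b", "b"]    # separates the classes using 2 values
row_id = ["1", "2", "3", "4"]    # also separates them, but with 4 unique values

print(info_gain_and_ratio(binary, labels))  # (1.0, 1.0)
print(info_gain_and_ratio(row_id, labels))  # (1.0, 0.5)
```

Both attributes reach the same information gain, but the ratio halves for the ID-like attribute,
which is the correction IGR is meant to provide.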
31. Compare Decision Trees and k-Nearest Neighbors
Ans:
Decision Trees:
Structured decision-making via tree splits.
Piecewise approximations for classification/regression.
Interpretability varies; can overfit.
k-Nearest Neighbors (k-NN):
Makes predictions based on nearest training examples.
Produces nonlinear decision boundaries.
Less interpretable; memorizes data; can be computationally intensive.
32. While building a Decision Tree how do you choose which attribute
to split at each node?
Ans: To choose which attribute to split at each node in a Decision Tree:
1. Calculate the Information Gain (or a similar criterion) for each attribute based on the dataset's
current state.
2. Select the attribute with the highest Information Gain (or similar metric) as the splitting
criterion for the node.
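The two steps can be sketched on the widely used 14-row Play Tennis table. That table is not
given in this assignment; it is assumed here as sample data, and its value names (overcast, rain,
weak, strong) differ from the labels used in the diagram for Question 2.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target="play"):
    """Entropy of the target minus its weighted entropy after splitting on attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


COLUMNS = ("outlook", "temperature", "humidity", "wind", "play")
RAW = """
sunny hot high weak no
sunny hot high strong no
overcast hot high weak yes
rain mild high weak yes
rain cool normal weak yes
rain cool normal strong no
overcast cool normal strong yes
sunny mild high weak no
sunny cool normal weak yes
rain mild normal weak yes
sunny mild normal strong yes
overcast mild high strong yes
overcast hot normal weak yes
rain mild high strong no
"""
rows = [dict(zip(COLUMNS, line.split())) for line in RAW.strip().splitlines()]

# Step 1: score every attribute. Step 2: pick the one with the highest gain.
gains = {attr: information_gain(rows, attr) for attr in COLUMNS[:-1]}
best = max(gains, key=gains.get)
print(best, round(gains[best], 3))
```

On this table the outlook attribute has the highest gain, so it becomes the root; the same
procedure is then repeated inside each branch on the remaining rows.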
33. How would you compare different Algorithms to build Decision Trees?
Ans:
1. ID3
Uses entropy and information gain.
Categorical data only.
Prone to overfitting.
2. C4.5
Extension of ID3.
Handles continuous and categorical data.
Uses gain ratio.
Prunes to avoid overfitting.
3. CART
Handles classification and regression.
Uses Gini impurity for classification and variance (squared-error) reduction for regression.
Binary splits only.
4. Random Forest
Ensemble of trees.
Combats overfitting.
Uses bootstrapping and random features.
5. Gradient Boosted Trees
Sequential tree building.
Corrects previous trees' errors.
High accuracy but can overfit.
6. CHAID
Based on chi-square test.
For categorical target variables.
7. M5
Model tree for regression, by the author of C4.5 (Quinlan).
Linear regression at leaf nodes.
Handles both data types.
34. How do you build Gradient Boosted decision trees?
Ans:
1. Initialization: Start with a basic model (e.g., a constant prediction such as the mean of the
target, or a simple decision tree).
2. Residual Calculation: Find the difference between actual values and current predictions.
3. Sequential Tree Building:
Train new trees on residuals.
Each tree corrects errors of the previous one.
4. Learning Rate: Hyperparameter that scales each tree's contribution.
5. Stopping: End boosting after a set number of trees or desired performance.
6. Prediction: Aggregate predictions from all trees for final output.
7. Regularization: Techniques like tree depth limits or subsampling prevent overfitting.
8. Popular Libraries: Tools like XGBoost, LightGBM, and CatBoost implement GBDT.
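The steps above can be sketched from scratch for squared-error regression. This is a simplified
illustration, not how XGBoost, LightGBM, or CatBoost are implemented: it assumes one-dimensional
inputs, depth-1 trees (stumps), the target mean as the initial model, and made-up data.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump (one threshold) on 1-D inputs."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # skip the max so the right side is non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_trees=20, learning_rate=0.5):
    base = sum(ys) / len(ys)                                   # 1. initialization
    trees = []
    predictions = [base] * len(xs)
    for _ in range(n_trees):                                   # 5. stop after n_trees
        residuals = [y - p for y, p in zip(ys, predictions)]   # 2. residuals
        tree = fit_stump(xs, residuals)                        # 3. fit tree to residuals
        trees.append(tree)
        predictions = [p + learning_rate * tree(x)             # 4. scaled contribution
                       for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)  # 6. aggregate


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up, roughly linear data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]

baseline = gradient_boost(xs, ys, n_trees=0)   # mean-only model
boosted = gradient_boost(xs, ys, n_trees=20)
print(mse(baseline, xs, ys), mse(boosted, xs, ys))
```

The boosted model's training error is lower than that of the mean-only baseline; how well it
generalizes still has to be checked on held-out data, which is where the regularization in step 7
comes in.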
35. What are the differences between Decision Trees and Neural Networks?
Ans:
1. Structure:
Decision Trees: Hierarchical nodes and branches.
Neural Networks: Interconnected layers of nodes.
2. Training:
Decision Trees: Greedy splitting algorithms.
Neural Networks: Backpropagation for weight adjustments.
3. Complexity:
Decision Trees: Simple and interpretable.
Neural Networks: Can capture complex patterns but less interpretable.
4. Data Type:
Decision Trees: Handles both numerical and categorical.
Neural Networks: Primarily for numerical; can use encoding for categorical.
5. Overfitting:
Decision Trees: Increasingly prone as depth grows.
Neural Networks: Risk grows with many layers and parameters; regularized with techniques such as
dropout.
6. Applications:
Decision Trees: Classification, regression, structured data.
Neural Networks: Image recognition, NLP, complex patterns.
7. Interpretability:
Decision Trees: Direct decisions from inputs.
Neural Networks: Less direct interpretation due to complexity.
8. Training Speed:
Decision Trees: Faster, especially for smaller datasets.
Neural Networks: Slower with multiple layers or large data.