
Machine Learning Assignment 4 Guide

This document outlines Assignment-4 for the Machine Learning course, detailing instructions for submission, including individual work requirements and plagiarism policies. It consists of three main tasks involving Decision Trees, Random Forests, Ensemble Methods, and K-Means clustering, with specific points allocated for each section. Students must submit their work in a specified format, including a .zip file containing code and reports, by the deadline of April 7, 2024.

CSE343/CSE543/ECE363/ECE563: Machine Learning

Winter 2025

Assignment-4 (30 points)

Release: April 1, 2024 (Tuesday) Submission: 11:59 pm April 7, 2024 (Monday)

Instructions
• Institute Plagiarism Policy Applicable. Both programming and theoretical questions will be subject
to a strict plagiarism check.
• This assignment should be attempted individually. All questions are compulsory.
• Theory [T]: For theory questions, only handwritten solutions are acceptable. Attempt each question on a
different sheet and staple the sheets together (for ease of checking). Do not start a new question on the back of
the previous one. Mention the page number (bottom centre) and your credentials (bottom right)
on each sheet. Submit the solutions in the assignment submission box kept in class during class time.
• Programming [P]: For programming questions, only the Python programming language is allowed.
You must submit a single .py file named A4 [Link]. Make sure the submission is self-contained and
replicable, i.e., you can reproduce your results with the submitted files only. Use a random seed wherever
applicable to ensure reproducibility.
• [Link] : Create a .pdf report for the programming questions that contains your approach,
preprocessing, assumptions, analysis, visualizations, etc. Anything not in the report will not be evaluated.
Alternatively, a well-documented .ipynb file (in addition to the single .py file mentioned in the previous bullet) with
answers to all the questions may be submitted as the report. The report must be named A4 RollNo [Link]
or A4 RollNo [Link].
• File Submission: Submit a .zip file named A4 [Link] (e.g., A4 [Link]) containing the report and
code files.
• Submission Policy: Turn in your submission as early as possible to avoid a late submission. In case of
multiple submissions, the latest submission will be evaluated. Late submissions will not be evaluated
and hence will be awarded zero marks.
• Clarifications: Symbols have their usual meaning. Assume any missing information and mention it in the
report. You are allowed to use any machine learning library unless the question explicitly states that
it must be done from scratch. You can always use basic Python libraries such as numpy, pandas,
and matplotlib, unless specified otherwise. Use Google Classroom for any queries. To keep it fair for
all, no email queries will be entertained. You may attend office/TA hours for personal resolutions. Start your
assignment early. No queries will be answered in Google Classroom comments at the last minute.
• Compliance: The questions in this assignment are structured to meet the Course Outcomes CO1, CO2,
CO3, and CO4, as described in the course directory.

• There could be multiple ways to approach a question. Please justify your answers mathematically in the
theory questions, and via commented text in the programming questions appropriately. Questions without
justification will get zero marks.
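As an illustration of the random-seed instruction above, the following is a minimal sketch of one way to fix seeds in a Python script. It assumes NumPy is used for all random operations; the seed value and the helper name `set_seed` are hypothetical and not prescribed by the assignment.

```python
import random

import numpy as np

SEED = 42  # arbitrary illustrative value, not prescribed by the assignment


def set_seed(seed: int = SEED) -> np.random.Generator:
    """Seed Python's built-in RNG and return a seeded NumPy Generator."""
    random.seed(seed)
    return np.random.default_rng(seed)


# Two generators created from the same seed produce identical draws,
# so a train/test shuffle based on them is repeatable across runs.
rng_a = set_seed()
rng_b = set_seed()
sample_a = rng_a.permutation(10)
sample_b = rng_b.permutation(10)
```

Passing the returned generator explicitly to every function that samples or shuffles keeps all randomness traceable to a single seed.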
1. Decision Trees and Random Forest (10 points)
[P ∥ CO1, CO2, CO3, CO4] Bank Marketing Data
(a) Preprocess the “Bank Marketing” dataset: handle missing values, encode categorical
variables, and split the dataset into training and testing sets. (1 point)
(b) Decision Tree Classifier:
i. Implement a Decision Tree classifier from scratch (without using high-level libraries like scikit-learn).
Your implementation should include:
• Building the decision tree using appropriate splitting criteria (e.g., information gain, Gini impurity).
• Pruning techniques to prevent overfitting.
• Handling continuous and categorical features.
(3 points)
ii. Evaluate the performance of your Decision Tree classifier on the test set using appropriate metrics.
You should report accuracy, precision, recall, F1-score, and area under the ROC curve. (1 point)
(c) Random Forest Classifier:
i. Implement a Random Forest classifier from scratch (without using high-level libraries). Your
implementation should include:
• Building individual decision trees on bootstrapped samples of the training data.
• Combining the predictions of multiple decision trees using techniques like majority voting (for
classification) or averaging (for regression).
• Tuning hyperparameters like the number of trees, maximum depth, and others.
(3 points)
ii. Evaluate the performance of your Random Forest classifier on the test set using appropriate metrics.
You should report accuracy, precision, recall, F1-score, and area under the ROC curve. (1 point)
(d) Compare the performance of your Decision Tree and Random Forest classifiers, and discuss the strengths
and weaknesses of each approach. (1 point)
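The splitting criteria named in part (b) can be illustrated with a short sketch. The code below is one possible formulation of Gini impurity, entropy, and information gain for a threshold split on a continuous feature; it assumes NumPy, and all function names and the toy data are hypothetical. It is not a complete tree, only the scoring step.

```python
import numpy as np


def gini(y):
    """Gini impurity of a label array: 1 - sum_k p_k^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def entropy(y):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))


def split_gain(y, left_mask, impurity=entropy):
    """Impurity decrease from splitting y into left_mask / ~left_mask."""
    n = len(y)
    n_left = int(left_mask.sum())
    if n_left == 0 or n_left == n:
        return 0.0  # degenerate split, nothing gained
    weighted_child = (n_left / n) * impurity(y[left_mask]) \
        + ((n - n_left) / n) * impurity(y[~left_mask])
    return impurity(y) - weighted_child


def best_threshold(x, y, impurity=entropy):
    """Scan midpoints between sorted unique values of a continuous feature."""
    values = np.unique(x)
    best_t, best_gain = None, 0.0
    for t in (values[:-1] + values[1:]) / 2.0:
        g = split_gain(y, x <= t, impurity)
        if g > best_gain:
            best_t, best_gain = t, g
    return best_t, best_gain


# Toy data (hypothetical): the classes separate cleanly at x = 2.5.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0, 0, 1, 1])
threshold, gain = best_threshold(x, y)
```

A categorical feature could be scored with the same `split_gain` by passing a mask such as `x == category` in place of `x <= t`.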

2. Ensemble Methods for Credit Card Default Prediction (13 points)


[P ∥ CO1, CO2, CO3, CO4] Credit Card Default Data
(a) Preprocess the “Credit Card Default” dataset: handle missing values, encode categorical
variables, and split the data into training and testing sets. (1 mark)
(b) Individual Classifiers:
i. Implement the following individual classifiers from scratch (without using high-level libraries like
scikit-learn):
• Logistic Regression
• Decision Tree Classifier
• K-Nearest Neighbors Classifier
(3 marks)
ii. Evaluate the performance of each individual classifier on the test set using appropriate metrics (e.g.,
accuracy, precision, recall, F1-score, area under the ROC curve). (1 mark)
(c) Ensemble Methods:
i. Implement the following ensemble methods from scratch:
• Bagging (Bootstrap Aggregating) with Decision Trees
• Boosting (AdaBoost) with Decision Trees
(3 marks)
ii. Evaluate the performance of your ensemble models on the test set using the same metrics as in part
(b). You should report accuracy, precision, recall, F1-score, and area under the ROC curve. (1.5 marks)
(d) Compare the performance of your ensemble models with the individual classifiers, and discuss the
strengths and weaknesses of the ensemble approaches. (1 mark)
(e) Experiment with five different sets of hyperparameters (e.g., number of estimators, maximum depth of
trees, splitting criteria) for your ensemble models and analyze their impact
on performance. (2.5 marks)
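The two ensemble ideas in part (c) can be sketched in a few lines. The code below shows bootstrap sampling and majority voting (the core of bagging) and a single AdaBoost reweighting round. It assumes NumPy and binary labels encoded as -1 and +1; the function names and toy arrays are hypothetical, and the base learner itself is left out.

```python
import numpy as np


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one bagging resample)."""
    return rng.integers(0, n, size=n)


def majority_vote(preds):
    """Combine predictions of shape (n_estimators, n_samples), labels in {-1, +1}.

    Ties are broken in favour of +1 (an arbitrary choice for this sketch).
    """
    return np.where(preds.sum(axis=0) >= 0, 1, -1)


def adaboost_round(y_true, y_pred, weights, eps=1e-10):
    """One AdaBoost step: weighted error, learner weight alpha, new sample weights."""
    miss = y_pred != y_true
    err = np.sum(weights[miss]) / np.sum(weights)
    err = np.clip(err, eps, 1 - eps)  # avoid division by zero / log of zero
    alpha = 0.5 * np.log((1 - err) / err)
    # Misclassified samples (y_true * y_pred = -1) are up-weighted.
    new_w = weights * np.exp(-alpha * y_true * y_pred)
    return alpha, new_w / new_w.sum()


# Toy example (hypothetical): the weak learner gets one of four samples wrong.
y_true = np.array([1, 1, -1, -1])
y_pred = np.array([1, -1, -1, -1])
w0 = np.full(4, 0.25)
alpha, w1 = adaboost_round(y_true, y_pred, w0)

idx = bootstrap_indices(5, np.random.default_rng(0))
votes = majority_vote(np.array([[1, -1], [1, -1], [-1, 1]]))
```

After the update, the misclassified sample carries half of the total weight, which is what forces the next weak learner to focus on it.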
3. K-Means and Gaussian Mixture Models on the Iris Dataset (7 points)
[P ∥ CO1, CO2, CO3, CO4] Iris Dataset Analysis
(a) Load and preprocess the Iris dataset, handling any missing values and scaling the features as necessary.
(1 mark)
(b) K-Means Clustering Algorithm:
i. Implement the K-Means clustering algorithm from scratch:
• Implement the algorithm using random initialization and the k-means++ initialization method.
• Experiment with different values of K (number of clusters) and determine the optimal value
using evaluation metrics like the Silhouette score or the Elbow method.
• Visualize the clusters using scatter plots and t-SNE plots.
(5 marks)
(c) Compare the performance of K-Means with random initialization and with k-means++ initialization on the
Iris dataset using appropriate evaluation metrics. (1 mark)
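To make the k-means++ initialization mentioned in part (b) concrete, here is a minimal sketch of the seeding step followed by standard Lloyd iterations. It assumes NumPy; the function names, the toy data, and the stopping tolerance are hypothetical choices, and the sketch does not cover the evaluation or visualization parts of the question.

```python
import numpy as np


def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: each new center is drawn with probability
    proportional to the squared distance to the nearest chosen center."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    for _ in range(1, k):
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        total = d2.sum()
        # Fall back to uniform sampling if every point coincides with a center.
        probs = d2 / total if total > 0 else np.full(n, 1.0 / n)
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)


def kmeans(X, k, rng, n_iter=100, tol=1e-8):
    """Lloyd's algorithm starting from k-means++ centers."""
    centers = kmeans_pp_init(X, k, rng)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Keep the old center if a cluster ends up empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    # Final assignment and within-cluster sum of squares (inertia).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = d2.min(axis=1).sum()
    return centers, labels, inertia


# Toy data (hypothetical): two well-separated groups of three points.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)
centers, labels, inertia = kmeans(X, k=2, rng=np.random.default_rng(0))
```

Replacing `kmeans_pp_init` with a uniform random choice of `k` rows gives the random-initialization variant, so the two can be compared on the same inertia value.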
