
Data Modeling & Evaluation - Practical - List

The document outlines the course framework for 'Data Modeling and Evaluation' as part of the MCA program, detailing practical assignments and topics covered over 60 hours. Key areas include data normalization, feature engineering, various machine learning models, clustering techniques, classification metrics, and model validation. The course emphasizes hands-on experience with real-world datasets and the application of different machine learning techniques and evaluation methods.

Uploaded by shimtone21

DATA MODELING AND EVALUATION

L-T-P: 0-0-2

COURSE FRAMEWORK
PROGRAM: MCA
COURSE CODE/TITLE: 23MCDS403 DATA MODELING AND EVALUATION
SEMESTER: IV SEMESTER
Credits for the Course: 2 Credits

SYLLABUS OF THE COURSE


Course Code: 23MCAL401
Name of the Course: Lab – III DATA MODELING AND EVALUATION
Total hours: 60 hours

Assignment No., topic, and practical hours are listed for each assignment below.

1 Data Model Design, ER Model
Practical hours: 5

1. Normalize a given dataset up to 3NF.
2. Perform denormalization on a normalized dataset and compare results.
3. Draw the ER diagram of an Online Shopping Management System.
4. Convert the table into First Normal Form (1NF).
5. Draw the ER diagram of a University Management System.
6. Consider the following unnormalized table (UNF) for a Student Course Registration System:

Student_ID | Student_Name | Course_ID | Course_Name | Instructor  | Instructor_Phone
101        | Alice        | CSE 101   | Database    | Prof. John  | 123-456-7890
101        | Alice        | CSE 102   | Algorithms  | Prof. Smith | 987-654-3210
102        | Bob          | CSE 101   | Database    | Prof. John  | 123-456-7890
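
One possible decomposition of the registration table into 3NF is sketched below with pandas. It assumes that each course has exactly one instructor and that the instructor name can act as a key (a real schema would more likely use an Instructor_ID); the table names `students`, `courses`, `instructors`, and `enrollments` are illustrative and do not come from the assignment.

```python
import pandas as pd

# Rows of the unnormalized Student Course Registration table.
unf = pd.DataFrame(
    [
        (101, "Alice", "CSE 101", "Database", "Prof. John", "123-456-7890"),
        (101, "Alice", "CSE 102", "Algorithms", "Prof. Smith", "987-654-3210"),
        (102, "Bob", "CSE 101", "Database", "Prof. John", "123-456-7890"),
    ],
    columns=[
        "Student_ID", "Student_Name", "Course_ID",
        "Course_Name", "Instructor", "Instructor_Phone",
    ],
)

# 2NF: remove partial dependencies on the composite key (Student_ID, Course_ID).
students = unf[["Student_ID", "Student_Name"]].drop_duplicates().reset_index(drop=True)
enrollments = unf[["Student_ID", "Course_ID"]].drop_duplicates().reset_index(drop=True)
courses_2nf = (
    unf[["Course_ID", "Course_Name", "Instructor", "Instructor_Phone"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

# 3NF: Instructor_Phone depends on Instructor, not directly on Course_ID
# (assumption: one phone number per instructor), so it moves to its own table.
instructors = (
    courses_2nf[["Instructor", "Instructor_Phone"]].drop_duplicates().reset_index(drop=True)
)
courses = courses_2nf[["Course_ID", "Course_Name", "Instructor"]]

# Joining the 3NF tables back together should reproduce the original rows.
rebuilt = (
    enrollments.merge(students, on="Student_ID")
    .merge(courses, on="Course_ID")
    .merge(instructors, on="Instructor")
)
print(rebuilt[list(unf.columns)])
```

The final join doubles as a denormalization step: comparing `rebuilt` with the original rows is a quick check that the decomposition lost no information.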

2 Data Preparation and Feature Engineering
Practical hours: 5

1. Data Cleaning & Preprocessing:
a) Identify and handle any missing values in the
Purchase_Frequency column.
b) Detect and treat potential outliers in the
Annual_Income and Spending_Score columns.
2. Encoding Categorical Variables:
a) Encode the Gender and Membership Type columns
appropriately for machine learning models.
3. Feature Engineering:
a) Perform feature selection to identify the most
relevant features for predicting spending behavior.
b) Apply Principal Component Analysis (PCA) and
explain how it helps in dimensionality reduction.
c) Normalize or scale the Annual_Income and Spending_Score columns using an appropriate technique.
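
The cleaning, encoding, scaling, and PCA tasks above could be approached roughly as follows. The data is randomly generated for illustration; only the column names come from the assignment, and `Membership_Type` (with an underscore), the median imputation, the IQR clipping rule, and the helper `clip_iqr` are assumptions of this sketch rather than prescribed choices.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Hypothetical customer data with some missing values and one extreme income.
df = pd.DataFrame({
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Membership_Type": rng.choice(["Basic", "Silver", "Gold"], size=n),
    "Annual_Income": rng.normal(60_000, 15_000, size=n),
    "Spending_Score": rng.uniform(1, 100, size=n),
    "Purchase_Frequency": rng.poisson(5, size=n).astype(float),
})
df.loc[rng.choice(n, size=10, replace=False), "Purchase_Frequency"] = np.nan
df.loc[0, "Annual_Income"] = 1_000_000  # injected outlier

# 1a) Impute missing values with the column median.
df["Purchase_Frequency"] = df["Purchase_Frequency"].fillna(df["Purchase_Frequency"].median())

# 1b) Treat outliers by clipping to the 1.5 * IQR fences.
def clip_iqr(s, k=1.5):
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

for col in ["Annual_Income", "Spending_Score"]:
    df[col] = clip_iqr(df[col])

# 2a) Binary mapping for Gender, one-hot encoding for Membership_Type.
df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})
df = pd.get_dummies(df, columns=["Membership_Type"], dtype=int)

# 3c) Standardize the numeric columns (zero mean, unit variance).
num_cols = ["Annual_Income", "Spending_Score", "Purchase_Frequency"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# 3b) PCA on the predictors, keeping two components.
pca = PCA(n_components=2)
components = pca.fit_transform(df.drop(columns=["Spending_Score"]))
print(pca.explained_variance_ratio_)
```

Feature selection (task 3a) is not covered here; on random data it would not produce a meaningful ranking.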

3 Linear Regression and Its Applications
Practical hours: 5

1. Load a dataset (e.g., housing prices dataset).
2. Preprocess the data (handle missing values,
normalize features if necessary).
3. Split the data into training and testing sets.
4. Train a linear regression model using libraries like
Scikit-learn.
5. Evaluate the model using metrics like Mean Squared
Error (MSE).
6. Visualize the regression line and residuals.
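
A minimal sketch of steps 1 to 5 with Scikit-learn is given below. It assumes a synthetic dataset from `make_regression` as a stand-in for a housing prices dataset, so the numbers it prints carry no real-world meaning.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a housing prices dataset (assumption of this sketch).
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)

y_pred = model.predict(scaler.transform(X_test))
mse = mean_squared_error(y_test, y_pred)
residuals = y_test - y_pred  # these can be plotted against y_pred for step 6

print(f"Test MSE: {mse:.2f}")
```

Fitting the scaler on the training split alone keeps information from the test set out of the preprocessing step.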

4 Decision Trees and Random Forest


1. Load a dataset (e.g., Titanic survival dataset).
2. Preprocess categorical and numerical features.
3. Split into training and testing sets.
4. Train a logistic regression model.
5. Evaluate using accuracy and confusion matrix.
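
The steps listed above train a logistic regression classifier, which could look roughly like this. The Titanic data is not bundled with Scikit-learn, so the sketch assumes a small synthetic table with Titanic-like columns (`sex`, `pclass`, `age`, `fare`, `survived`); the values and the label rule are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 400

# Synthetic, Titanic-like data (hypothetical values).
df = pd.DataFrame({
    "sex": rng.choice(["male", "female"], size=n),
    "pclass": rng.choice([1, 2, 3], size=n),
    "age": rng.normal(30, 12, size=n).clip(1, 80),
    "fare": rng.exponential(30, size=n),
})
logit = 1.5 * (df["sex"] == "female") - 0.5 * (df["pclass"] - 2) - 0.01 * (df["age"] - 30)
df["survived"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = df.drop(columns="survived")
y = df["survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# One-hot encode the categorical columns and standardize the numeric ones.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "pclass"]),
    ("num", StandardScaler(), ["age", "fare"]),
])
clf = Pipeline([("prep", preprocess), ("model", LogisticRegression(max_iter=1000))])
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class

print(f"Accuracy: {acc:.3f}")
print(cm)
```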

5 Tree-Based Models
Practical hours: 5

1. Load a classification dataset.
2. Train a Decision Tree and Random Forest model.
3. Compare their accuracies.
4. Load a dataset (e.g., Titanic, Iris, or a custom
dataset).
5. Preprocess data (handle missing values, encode
categorical variables, and normalize if needed).
6. Split data into training and testing sets (80/20 split).
7. Train a Decision Tree classifier with different depths
and criteria (Gini vs. Entropy).
8. Visualize the tree and analyze the decision paths.
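
The comparison could be set up as in the following sketch, which assumes the Iris dataset and uses `export_text` as a text-only substitute for a graphical tree plot. The depth values tried here are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, stratify=iris.target, random_state=0
)

# Decision Tree: vary the split criterion and the maximum depth.
results = {}
for criterion in ("gini", "entropy"):
    for depth in (1, 2, 3, None):
        tree = DecisionTreeClassifier(criterion=criterion, max_depth=depth, random_state=0)
        tree.fit(X_train, y_train)
        results[(criterion, depth)] = accuracy_score(y_test, tree.predict(X_test))

# Random Forest for comparison.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
forest_acc = accuracy_score(y_test, forest.predict(X_test))

# Text rendering of a shallow tree, to read off the decision paths.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
tree_rules = export_text(small_tree, feature_names=list(iris.feature_names))

for key, acc in results.items():
    print(key, f"{acc:.3f}")
print("random forest", f"{forest_acc:.3f}")
print(tree_rules)
```

A depth-1 tree has only two leaves, so on three balanced classes it cannot exceed two-thirds accuracy; this makes the effect of the depth setting easy to see.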

6 Gradient Boosting Machines (GBM, XGBoost, LightGBM)
Practical hours: 5

1. Implement GBM and compare it with other tree models.
2. Train an XGBoost model using different values
of learning rate (0.01, 0.1, 0.3, 0.5). How does
accuracy change?
3. What happens if you increase the number of
boosting rounds (e.g., n_estimators=500
instead of 100) in XGBoost and LightGBM?
4. Implement early stopping in XGBoost and
LightGBM. How does it affect performance?
5. Try using GPU acceleration for XGBoost and
LightGBM. How much does training time
improve?
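
The learning-rate and early-stopping experiments can be prototyped before installing XGBoost or LightGBM. The sketch below swaps in Scikit-learn's `GradientBoostingClassifier` as a stand-in, on the bundled breast cancer dataset; it does not use the XGBoost or LightGBM APIs, and GPU acceleration is not covered.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Test accuracy for the learning rates named in the exercise.
accuracy_by_lr = {}
for lr in (0.01, 0.1, 0.3, 0.5):
    gbm = GradientBoostingClassifier(learning_rate=lr, n_estimators=100, random_state=0)
    accuracy_by_lr[lr] = gbm.fit(X_train, y_train).score(X_test, y_test)

# Early stopping: allow up to 500 rounds, but stop once the score on an
# internal 10% validation split has not improved for 10 consecutive rounds.
early = GradientBoostingClassifier(
    learning_rate=0.1,
    n_estimators=500,
    validation_fraction=0.1,
    n_iter_no_change=10,
    random_state=0,
).fit(X_train, y_train)
rounds_used = early.n_estimators_  # number of boosting rounds actually fitted

print(accuracy_by_lr)
print("rounds used:", rounds_used, "test accuracy:", early.score(X_test, y_test))
```

The scikit-learn style wrappers of XGBoost and LightGBM accept similarly named `learning_rate` and `n_estimators` parameters, so the loop structure should carry over with little change.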

7 Clustering (K-Means, Hierarchical Clustering) and Dimensionality Reduction (PCA, t-SNE)
Practical hours: 5

1. Data Preprocessing
   a) Load a real-world dataset (e.g., Iris, MNIST, Wine, or any suitable dataset).
   b) Perform data cleaning and preprocessing:
      i) Handle missing values (if any).
      ii) Normalize or standardize the data (if needed).
2. Clustering Algorithms
   a) Apply K-Means Clustering:
      i) Use the Elbow Method to determine the optimal number of clusters.
      ii) Compute and visualize the clustering results.
   b) Apply Hierarchical Clustering:
      i) Use both Agglomerative and Divisive clustering.
      ii) Generate a dendrogram and analyze the cluster structure.
3. Dimensionality Reduction
   a) Apply Principal Component Analysis (PCA):
      i) Reduce the dataset to 2 or 3 dimensions for visualization.
      ii) Analyze the explained variance of principal components.
   b) Apply t-Distributed Stochastic Neighbor Embedding (t-SNE):
      i) Visualize the high-dimensional data in a 2D space.
      ii) Compare t-SNE vs. PCA results.
4. Analysis & Interpretation
   a) Compare the performance of K-Means vs. Hierarchical Clustering.
   b) Discuss the effectiveness of PCA vs. t-SNE for visualization.
   c) Provide insights into clustering quality and visualization results.
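
A compact sketch of the elbow computation, the two clustering methods, and PCA is shown below on the Iris dataset. The silhouette score is used here as one possible way to compare clustering quality; it is an assumption of this sketch, not a metric named in the assignment. Dendrograms (`scipy.cluster.hierarchy.dendrogram`), t-SNE (`sklearn.manifold.TSNE`), and divisive clustering are left out to keep it short.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Standardize so that every feature contributes equally to distances.
X = StandardScaler().fit_transform(load_iris().data)

# Elbow Method: within-cluster sum of squares (inertia) for k = 1..8.
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 9)
}

# Cluster with k = 3 using both methods.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# Silhouette score as one way to compare the two clusterings.
scores = {
    "kmeans": silhouette_score(X, kmeans_labels),
    "agglomerative": silhouette_score(X, agglo_labels),
}

# PCA down to two dimensions for plotting, plus the explained variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
explained = pca.explained_variance_ratio_

print(inertias)
print(scores)
print("explained variance ratio:", explained)
```

Plotting `inertias` against k and looking for the bend in the curve gives the elbow; `X_2d` can be scattered with either label set as the color.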

8 Classification Metrics: Confusion Matrix, Accuracy, Precision, Recall, F1 Score
Practical hours: 5

1. Train a classification model on a real-world dataset and compute the confusion matrix. What do the values in each cell represent?
2. Given a highly imbalanced dataset, why might
accuracy not be a reliable metric? Compute
precision, recall, and F1 score to support your
argument.
3. How does changing the decision threshold of a
classifier affect precision and recall? Demonstrate
with an experiment.
4. Plot the ROC curve for a classification model and
compute the AUC score. What does the AUC value
tell you about the model’s performance?
5. Compare the ROC curves of two different models.
How do you determine which model is better? Can a
model have high accuracy but a low AUC score?
Explain with an example.
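
The questions above can be explored with a small experiment such as the following. It assumes a synthetic dataset with roughly 5% positives from `make_classification` in place of a real-world one, and the thresholds 0.2, 0.5, and 0.8 are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score,
    precision_score, recall_score, roc_auc_score, roc_curve,
)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: about 95% negatives, 5% positives.
X, y = make_classification(n_samples=4000, n_features=10, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# Confusion matrix layout for binary labels: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_test, (proba >= 0.5).astype(int)).ravel()

# A classifier that always predicts the majority class still scores high accuracy.
baseline_acc = accuracy_score(y_test, np.zeros_like(y_test))

# Precision, recall, and F1 at several decision thresholds.
metrics_by_threshold = {}
for threshold in (0.2, 0.5, 0.8):
    pred = (proba >= threshold).astype(int)
    metrics_by_threshold[threshold] = {
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
    }

# ROC curve points and the area under the curve.
fpr, tpr, thresholds = roc_curve(y_test, proba)
auc = roc_auc_score(y_test, proba)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("always-negative baseline accuracy:", round(baseline_acc, 3))
print(metrics_by_threshold)
print("AUC:", round(auc, 3))
```

Raising the threshold can only shrink the set of predicted positives, so recall never increases as the threshold goes up, while precision usually (but not always) rises.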

9 Time Series Modeling: ARIMA, SARIMA, and Prophet
Practical hours: 5

Building an MLP for Structured Data
1. Load a structured dataset (e.g.,
[Link].load_diabetes).
2. Normalize the data and split into training/testing
sets.
3. Build a Multi-Layer Perceptron (MLP) using
TensorFlow/Keras.
4. Train the model and evaluate performance using
MSE/RMSE.
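
The same workflow can be tried without a TensorFlow installation. The sketch below substitutes Scikit-learn's `MLPRegressor` for a Keras model, so it is not the TensorFlow/Keras implementation the assignment asks for; the layer sizes `(64, 32)` and the use of `sklearn.datasets.load_diabetes` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Normalize inputs and target using training-set statistics only.
x_scaler = StandardScaler().fit(X_train)
y_mean, y_std = y_train.mean(), y_train.std()

# Two hidden layers with ReLU activations (sizes are arbitrary).
mlp = MLPRegressor(
    hidden_layer_sizes=(64, 32),
    activation="relu",
    early_stopping=True,
    max_iter=2000,
    random_state=0,
)
mlp.fit(x_scaler.transform(X_train), (y_train - y_mean) / y_std)

# Undo the target scaling before computing the error in original units.
y_pred = mlp.predict(x_scaler.transform(X_test)) * y_std + y_mean
mse = mean_squared_error(y_test, y_pred)
rmse = float(np.sqrt(mse))

print(f"MSE: {mse:.1f}  RMSE: {rmse:.1f}")
```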

10 Model Validation Techniques
Practical hours: 5

Implementing Stratified K-Fold Cross-Validation
1. Load an imbalanced classification dataset (e.g.,
breast cancer dataset).
2. Apply Stratified K-Fold so that each fold preserves the overall class distribution.
3. Compare results with standard K-Fold.
Implementing Time-Based Split for Time Series Data

1. Load a time series dataset (e.g., stock prices, weather data).
2. Perform a time-based split where training data is
from past timestamps, and testing data from future
timestamps.
3. Compare results with a random split.
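
A time-based split and a random split can be compared as in the sketch below. It assumes a synthetic random-walk series in place of real stock or weather data, and a simple lag-feature linear model; both are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
dates = pd.date_range("2023-01-01", periods=365, freq="D")

# Synthetic random walk standing in for a daily price series.
series = pd.Series(100 + rng.normal(0, 1, size=365).cumsum(), index=dates, name="price")

# Predict today's value from the previous three days.
frame = pd.DataFrame({f"lag_{k}": series.shift(k) for k in (1, 2, 3)})
frame["target"] = series
frame = frame.dropna()
X, y = frame.drop(columns="target"), frame["target"]

# Time-based split: the first 80% of timestamps train, the last 20% test.
cut = int(len(frame) * 0.8)
X_train_t, X_test_t = X.iloc[:cut], X.iloc[cut:]
y_train_t, y_test_t = y.iloc[:cut], y.iloc[cut:]
mse_time = mean_squared_error(
    y_test_t, LinearRegression().fit(X_train_t, y_train_t).predict(X_test_t)
)

# Random split: test rows are scattered among training rows in time.
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X, y, test_size=0.2, random_state=0
)
mse_random = mean_squared_error(
    y_test_r, LinearRegression().fit(X_train_r, y_train_r).predict(X_test_r)
)

print(f"time-based MSE: {mse_time:.3f}  random-split MSE: {mse_random:.3f}")
```

With a random split, the model is trained on rows that come after some test rows, which leaks future information; the time-based split avoids that by construction.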

Ensuring Reproducibility in Machine Learning Experiments
1. Set random seeds for NumPy, TensorFlow, and
Scikit-learn.
2. Use DVC (Data Version Control) for dataset tracking.
3. Store model training scripts in Git for version
control.

11 Overfitting and Underfitting
Practical hours: 5

Compare and Interpret Results:
1. Evaluate how L1, L2, and Dropout affect model performance.
2. Discuss which regularization technique works best and why.
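
The effect of L1 and L2 penalties can be seen on a deliberately overfitting linear model, as in the sketch below. It assumes synthetic data with more features than training rows and uses Scikit-learn's `Lasso` and `Ridge` with an arbitrary `alpha=1.0`; Dropout is not covered because it requires a neural-network library.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Few rows and many features, so the unregularized model can memorize the data.
X, y = make_regression(
    n_samples=120, n_features=100, n_informative=10, noise=15.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

models = {
    "none": LinearRegression(),
    "L2 (Ridge)": Ridge(alpha=1.0),
    "L1 (Lasso)": Lasso(alpha=1.0, max_iter=10_000),
}

report = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    report[name] = {
        "train_r2": model.score(X_train, y_train),
        "test_r2": model.score(X_test, y_test),
        # L1 drives some coefficients exactly to zero; L2 only shrinks them.
        "zero_coefs": int(np.sum(model.coef_ == 0)),
    }

for name, row in report.items():
    print(name, row)
```

A large gap between `train_r2` and `test_r2` is the overfitting signal to look for when interpreting the table.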

Setting Up Random Seeds:


1. Implement a script where all necessary libraries
(NumPy, TensorFlow, Scikit-Learn, etc.) have fixed
random seeds.
2. Run the same model multiple times and verify that the results are consistent.
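
A seed-fixing script might look like the following sketch. The helper names `set_seeds` and `run_experiment` are hypothetical, and TensorFlow is only mentioned in a comment so that the example runs without it.

```python
import random

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 42

def set_seeds(seed=SEED):
    """Fix the global random number generators used by the experiment."""
    random.seed(seed)
    np.random.seed(seed)
    # With TensorFlow installed, tf.random.set_seed(seed) would be called here too.

def run_experiment(seed=SEED):
    """Train a small model end to end and return its test accuracy."""
    set_seeds(seed)
    X, y = make_classification(n_samples=300, n_features=8, random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # Scikit-learn estimators take their seed through random_state.
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    return model.fit(X_train, y_train).score(X_test, y_test)

# Three runs with the same seed should give identical scores.
scores = [run_experiment() for _ in range(3)]
print(scores)
```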

Version Control for Machine Learning Projects:


1. Use Git to track changes in your model.
2. Create different branches to experiment with
different regularization techniques.
3. Document model performance results for
comparison.

Using Reproducibility Tools:


1. Utilize tools like MLflow or DVC (Data Version
Control) to track experiments.
2. Save hyperparameters and evaluation metrics for
future reference.
3. Demonstrate how to reproduce previous
experiments exactly.
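
The idea behind experiment tracking can be illustrated without MLflow or DVC. The sketch below is a plain-JSON stand-in, not the API of either tool: the functions `run_and_log` and `reproduce`, the file naming scheme, and the choice of Iris with logistic regression are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def run_and_log(params, log_dir):
    """Train with the given hyperparameters and save them with the metrics."""
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=params["seed"]
    )
    model = LogisticRegression(C=params["C"], max_iter=1000).fit(X_train, y_train)
    record = {
        "params": params,
        "metrics": {"accuracy": float(model.score(X_test, y_test))},
    }
    path = Path(log_dir) / f"run_C{params['C']}_seed{params['seed']}.json"
    path.write_text(json.dumps(record, indent=2))
    return path

def reproduce(path):
    """Re-run an experiment from its saved parameters and return both metric sets."""
    saved = json.loads(Path(path).read_text())
    rerun_path = run_and_log(saved["params"], Path(path).parent)
    replayed = json.loads(rerun_path.read_text())
    return saved["metrics"], replayed["metrics"]

log_dir = tempfile.mkdtemp()
run_path = run_and_log({"C": 0.5, "seed": 7}, log_dir)
original, replayed = reproduce(run_path)
print(original, replayed)
```

Because the saved record contains every input that affects the result (here the seed and `C`), replaying it yields the same metrics; dedicated tools add dataset versioning and a browsable run history on top of this idea.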
