REGRESSION ASSIGNMENT
Objective:
Select any regression dataset (from Kaggle or any open-source repository) and apply
the complete ML pipeline covered in class, including data understanding, cleaning,
visualization, outlier detection, encoding, scaling, model training, evaluation, and
interpretation.
Instructions:
Choose a regression dataset with:
- At least 1000 rows
- One numerical target variable
- Mixed numerical and categorical features
Required Steps:
1) Load & Inspect the Dataset
- Load dataset using pandas
- Show first 10 rows
- Print .info() and .describe()
- Identify numerical vs categorical features
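A minimal sketch of Step 1, assuming a small inline CSV in place of a real file. The column names (area, rooms, city, price) are hypothetical and only illustrate the calls; with a real dataset, pass the file path to pd.read_csv instead.

```python
import io
import pandas as pd

# Hypothetical inline CSV standing in for a real dataset file.
csv_text = """area,rooms,city,price
50.0,2,Cairo,100000
80.0,3,Giza,150000
65.5,2,Cairo,120000
120.0,4,Alexandria,260000
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(10))    # first 10 rows (fewer here, the sample is tiny)
df.info()             # dtypes and non-null counts
print(df.describe())  # summary statistics for numerical columns

# Split column names by dtype: numeric vs. everything else.
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

Note that dtype-based detection is only a first pass: integer-coded categories (for example a zip code) would be reported as numerical and need manual review.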
2) Handle Missing Values
- Check for null values
- Remove or impute the missing values
- Explain your choice
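One possible way to approach Step 2 is sketched below on a hypothetical toy frame. Median imputation for numerical columns and mode imputation for categorical columns are assumptions chosen for illustration, not the required method; dropping rows with df.dropna() is an equally valid choice when few rows are affected.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per feature column.
df = pd.DataFrame({
    "area": [50.0, np.nan, 65.5, 120.0],
    "city": ["Cairo", "Giza", None, "Cairo"],
    "price": [100000, 150000, 120000, 260000],
})

# Count nulls per column before doing anything.
missing_before = df.isnull().sum()
print(missing_before)

# Numerical column: fill with the median (robust to outliers).
df["area"] = df["area"].fillna(df["area"].median())
# Categorical column: fill with the most frequent value.
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df.isnull().sum())
```

In a stricter workflow the imputation statistics would be computed on the training split only, to avoid leaking test information.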
3) Remove Duplicates
- Detect duplicated rows
- Remove them
- Report how many were removed
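The duplicate check in Step 3 can be sketched as follows, again on hypothetical toy data. It counts fully identical rows, drops them, and reports the difference in row count.

```python
import pandas as pd

# Hypothetical toy data: rows 2 and 3 repeat rows 0 and 1.
df = pd.DataFrame({
    "area": [50.0, 80.0, 50.0, 80.0, 65.5],
    "city": ["Cairo", "Giza", "Cairo", "Giza", "Cairo"],
    "price": [100000, 150000, 100000, 150000, 120000],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(df.duplicated().sum())
rows_before = len(df)

df = df.drop_duplicates().reset_index(drop=True)
removed = rows_before - len(df)

print(f"Duplicated rows detected: {n_duplicates}")
print(f"Rows removed: {removed}")
```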
4) Outlier Analysis
- Boxplots for numerical features
- Identify outliers using IQR
- Decide whether to keep or remove them, and justify the decision
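The IQR rule from Step 4 is sketched below for a single hypothetical column. The 1.5 multiplier is the conventional choice and is an assumption here; plotting is omitted, though a boxplot of the same column would show the same point beyond the whiskers.

```python
import pandas as pd

# Hypothetical numerical column with one obviously extreme value.
prices = pd.Series([10, 12, 11, 13, 12, 11, 100], name="price")

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = prices[(prices < lower) | (prices > upper)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, fences=({lower}, {upper})")
print(outliers)
```

Whether to drop, cap, or keep the flagged values depends on whether they are data errors or genuine extreme observations, which is the justification the step asks for.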
5) EDA (Exploratory Data Analysis)
- Histograms for numerical features
- Countplots for categorical features
- Correlation matrix + heatmap
- Interpret the two strongest correlations
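For the correlation part of Step 5, a minimal sketch that computes the matrix and ranks feature pairs by absolute correlation is shown below. The data and column names are hypothetical, and the plots themselves (histograms, countplots, heatmap) are left out; the corr matrix produced here is what a heatmap function would take as input.

```python
from itertools import combinations
import pandas as pd

# Hypothetical numerical features.
df = pd.DataFrame({
    "area": [50, 80, 65, 120, 95, 70],
    "rooms": [2, 3, 2, 4, 3, 3],
    "age": [30, 5, 20, 2, 10, 15],
    "price": [100, 150, 120, 260, 190, 135],
})

# Pearson correlation between all numerical columns.
corr = df.corr(numeric_only=True)
print(corr.round(2))

# One entry per unordered pair, so each correlation appears once
# and the diagonal (always 1.0) is excluded.
pairs = pd.Series(
    {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
)

# Rank by absolute value so strong negative correlations count too.
strongest = pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(2)
print(strongest)
```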
6) Encode Categorical Variables
- Use Label Encoding or One-Hot Encoding
- Explain why chosen method is appropriate
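As an illustration of Step 6, the sketch below applies one-hot encoding with pandas to a hypothetical nominal column. One-hot is assumed here because the categories have no natural order; label (ordinal) encoding would be the alternative for ordered categories such as low/medium/high.

```python
import pandas as pd

# Hypothetical data with one nominal categorical feature.
df = pd.DataFrame({
    "city": ["Cairo", "Giza", "Cairo", "Alexandria"],
    "area": [50.0, 80.0, 65.5, 120.0],
})

# drop_first=True removes one dummy column per feature to avoid
# perfect collinearity in a linear model with an intercept.
encoded = pd.get_dummies(df, columns=["city"], drop_first=True, dtype=int)
print(encoded)
```

Here "Alexandria" becomes the reference category, so a row with zeros in both dummy columns represents it.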
7) Feature Scaling
- Apply StandardScaler to the numerical features
- Fit the scaler only on the training set (so perform the split from Step 8 first), then transform both the training and test sets
8) Train/Test Split
- 80% Train / 20% Test
- random_state=42
- Show shapes
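Steps 7 and 8 depend on each other: the scaler can only be fitted on the training set once the split exists. A minimal sketch of that order, using synthetic data as a stand-in for a real feature matrix, might look like this.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset: 100 rows, 3 numerical features.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=100)

# Step 8: 80/20 split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# Step 7: learn mean and std from the training data only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# ...and reuse those statistics on the test data (no refitting).
X_test_scaled = scaler.transform(X_test)
```

Fitting on the full dataset before splitting would let test-set statistics influence training, which is the leakage the "fit only on training set" instruction guards against.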
9) Train Regression Model
- Use Linear Regression
- Fit and show coefficients
10) Predictions
- Predict on test set
- Show the first 10 actual vs. predicted values
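Steps 9 and 10 can be sketched together as below. The data is synthetic with a known linear relationship (true coefficients 3 and -2, intercept 5), and the feature names feat_a and feat_b are hypothetical, so the fitted coefficients can be checked against the values used to generate it.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y = 3*feat_a - 2*feat_b + 5 + small noise.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 2)), columns=["feat_a", "feat_b"])
y = 3.0 * X["feat_a"] - 2.0 * X["feat_b"] + 5.0 + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 9: fit and show one coefficient per feature plus the intercept.
model = LinearRegression().fit(X_train, y_train)
coefficients = pd.Series(model.coef_, index=X.columns)
print(coefficients)
print("Intercept:", model.intercept_)

# Step 10: predict on the test set and compare side by side.
y_pred = model.predict(X_test)
comparison = pd.DataFrame(
    {"actual": y_test.to_numpy(), "predicted": y_pred}
).head(10)
print(comparison)
```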
11) Model Evaluation
Compute:
- MAE
- MSE
- RMSE
- R² Score
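The four metrics can be computed as in the sketch below. The y_true and y_pred arrays are made-up values chosen so the arithmetic is easy to follow by hand; RMSE is taken as the square root of MSE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values; absolute errors are 0.5, 0.0, 0.5, 1.0.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = np.sqrt(mse)                        # same units as the target
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(f"MAE:  {mae:.4f}")
print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R2:   {r2:.4f}")
```

MAE and RMSE are in the units of the target, which makes them easier to interpret than MSE; R² is unitless and measures the share of variance explained.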
Bonus (Optional):
Model Interpretation. Explain:
- Which feature has the strongest effect and why
- Whether the model is underfitting or overfitting
- Whether scaling improved the model's performance
Additional Analysis:
- Learning Curve
- Residual Plot
- Compare Linear Regression with Ridge/Lasso
- Polynomial Regression
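For the Ridge/Lasso comparison, one possible setup is sketched below on synthetic data. The alpha values are arbitrary assumptions chosen for illustration, not tuned settings; each model is wrapped in a pipeline so the scaler is fitted on the training data only.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 5 features, two of which have no effect on y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([4.0, 0.0, -3.0, 0.0, 1.5]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# alpha values are illustrative assumptions, not tuned.
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
}

scores = {}
for name, estimator in models.items():
    # Scaling inside the pipeline is fitted on X_train only.
    pipe = make_pipeline(StandardScaler(), estimator)
    pipe.fit(X_train, y_train)
    scores[name] = r2_score(y_test, pipe.predict(X_test))

for name, score in scores.items():
    print(f"{name}: R2 = {score:.4f}")
```

On a real dataset the regularization strength would normally be chosen by cross-validation, and the same loop could be extended with a PolynomialFeatures step for the polynomial regression item.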