1/17/2024 Machine Learning
Project Report
Muhammad Wasim
SUBMITTED TO DR ISRAR ALI KHAN
1 DATA PREPROCESSING:
1.1 ABOUT DATA:
The dataset comprises eight columns, each offering valuable information about residential properties. The 'beds'
column denotes the number of bedrooms in a property, providing insights into its spatial layout. The 'baths'
column represents the count of bathrooms, a crucial factor influencing property value and functionality. The
'size' column signifies the total size of the property, offering an understanding of its overall spatial capacity.
The 'size_units' column gives the unit of measurement for 'size', making the scale explicit.
Similarly, the 'lot_size' column, together with 'lot_size_units', describes the size of the land associated with
the property, which is vital for assessing outdoor space. The 'zip_code' column contains the zip code
of the property location, offering geographical context. Finally, the 'price' column represents the monetary value
associated with each property, serving as the target variable for predictive modeling. Together, these columns
encompass key features for comprehensive analysis and modeling within the real estate domain.
1.2 IMPORTING DATA
I began the data preprocessing by importing the training and testing datasets using the Pandas library.
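The report's own import code is not reproduced here. A minimal sketch of this step, assuming the data are stored
as CSV files; the helper name load_datasets and the file names train.csv and test.csv are hypothetical:

```python
import pandas as pd


def load_datasets(train_path, test_path):
    """Read the training and testing CSV files into two DataFrames."""
    train_df = pd.read_csv(train_path)
    test_df = pd.read_csv(test_path)
    return train_df, test_df


# Hypothetical usage; the real file names are not given in the report.
# train_df, test_df = load_datasets("train.csv", "test.csv")
```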
1.3 MERGING DATASETS
I merged the datasets to create a unified dataset for comprehensive analysis and preprocessing. The concat
function from Pandas was used to concatenate the training and testing datasets along the rows (axis=0).
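A minimal sketch of the row-wise concatenation, using two small stand-in frames in place of the real data. The
ignore_index=True argument is an assumption added here so that the merged frame gets a fresh 0..n-1 index; the
report only states that axis=0 was used.

```python
import pandas as pd

# Small stand-in frames; the real ones come from the import step.
train_df = pd.DataFrame({"beds": [3, 2], "price": [500000.0, 350000.0]})
test_df = pd.DataFrame({"beds": [4], "price": [720000.0]})

# Stack the rows of both frames. ignore_index (an assumption) renumbers
# the rows so that labels from the two sources do not collide.
df = pd.concat([train_df, test_df], axis=0, ignore_index=True)

print(df.shape)  # (3, 2)
```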
2 EXPLORATORY DATA ANALYSIS (EDA)
2.1 CHECK DATA TYPES
I inspected the data types of each feature to ensure proper understanding of the dataset. This step is crucial for
identifying categorical variables and numerical features.
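As an illustration, the data types can be inspected and the columns separated into numerical and categorical
groups as follows. The sample frame is a stand-in, and the variable names numeric_cols and categorical_cols are
hypothetical.

```python
import pandas as pd

# Stand-in frame with a subset of the columns described in the report.
df = pd.DataFrame({
    "beds": [3, 2],
    "size": [1500.0, 900.0],
    "size_units": ["sqft", "sqft"],
})

# One dtype per column; non-numeric columns are candidates for encoding.
print(df.dtypes)

numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
```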
2.2 EXAMINE MISSING VALUES AND DUPLICATES
I conducted a thorough examination of missing values and duplicate entries within the dataset. Handling missing
values is crucial for maintaining data integrity and ensuring reliable model performance.
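A short sketch of how such a check can be done with Pandas, again on stand-in data:

```python
import numpy as np
import pandas as pd

# Stand-in data: two missing lot sizes, and row 2 repeats row 0 exactly.
df = pd.DataFrame({
    "beds": [3, 2, 3, 4],
    "lot_size": [0.25, np.nan, 0.25, np.nan],
    "price": [500000.0, 350000.0, 500000.0, 720000.0],
})

# Count of missing values in each column.
missing_per_column = df.isnull().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(df.duplicated().sum())

print(missing_per_column)
print(duplicate_rows)  # 1
```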
2.3 HANDLE MISSING VALUES
I implemented appropriate techniques to handle missing values, ensuring a complete and clean dataset for
subsequent analysis and modeling.
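The report does not state which imputation technique was used. One common choice, shown here purely as an
assumption, is to fill numerical columns with the median and categorical columns with the most frequent value:

```python
import numpy as np
import pandas as pd

# Stand-in data with one missing value in each column.
df = pd.DataFrame({
    "lot_size": [0.25, np.nan, 0.75, 0.5],
    "lot_size_units": ["acre", None, "acre", "sqft"],
})

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        # Assumed strategy: median imputation for numerical columns.
        df[col] = df[col].fillna(df[col].median())
    else:
        # Assumed strategy: most frequent value for categorical columns.
        df[col] = df[col].fillna(df[col].mode().iloc[0])
```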
2.4 DETECT AND HANDLE OUTLIERS USING INTERQUARTILE RANGE (IQR)
Outliers were detected and addressed using the Interquartile Range (IQR) method. This ensures that extreme
values do not unduly influence the analysis and model performance.
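A minimal sketch of the IQR rule on a single column. The conventional 1.5 x IQR fences and the choice to drop
(rather than cap) the flagged rows are assumptions; the report does not say how the outliers were addressed.

```python
import pandas as pd

# Stand-in data with one obvious extreme value.
df = pd.DataFrame({"price": [100.0, 110.0, 120.0, 130.0, 140.0, 1000.0]})

q1 = df["price"].quantile(0.25)
q3 = df["price"].quantile(0.75)
iqr = q3 - q1

# Conventional fences at 1.5 * IQR beyond the quartiles (assumed multiplier).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the rows inside the fences.
inliers = df[df["price"].between(lower, upper)]
```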
2.5 APPLY LOG TRANSFORMATION FOR RESCALING
To address skewed data distributions and improve model performance, I applied log transformation to rescale
the dataset. This transformation is particularly useful for features with right-skewed distributions.
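For illustration, a log transformation of a right-skewed column might look like this. The use of log1p (log of
1 + x, which is safe for zeros) and the column name price_log are assumptions; the report only states that a log
transformation was applied.

```python
import numpy as np
import pandas as pd

# Stand-in right-skewed values.
df = pd.DataFrame({"price": [100000.0, 250000.0, 2000000.0]})

# log1p compresses large values and is defined at zero.
df["price_log"] = np.log1p(df["price"])

# expm1 inverts log1p, which is needed to report predictions in original units.
recovered = np.expm1(df["price_log"])
```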
2.6 ENCODING CATEGORICAL VARIABLES
Categorical variables were encoded using appropriate techniques to convert them into a numerical format,
making them suitable for machine learning algorithms.
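The encoding technique is not named in the report. As one possibility, one-hot encoding with Pandas is sketched
below on a stand-in frame; label encoding would be an equally plausible reading.

```python
import pandas as pd

# Stand-in frame with one categorical column.
df = pd.DataFrame({
    "beds": [3, 2, 4],
    "lot_size_units": ["acre", "sqft", "acre"],
})

# One indicator column per category (assumed technique: one-hot encoding).
encoded = pd.get_dummies(df, columns=["lot_size_units"], dtype=int)

print(encoded.columns.tolist())
```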
The completion of these EDA and preprocessing steps sets the foundation for building and evaluating regression
and classification models. Moving forward, the data will be split into training and testing sets for model
development.
3 DATA SPLITTING
I have successfully split the dataset into training and testing sets. This step is crucial for training machine
learning models on a subset of the data and evaluating their performance on unseen data.
The data splitting was performed using a 70:30 ratio, with 70% of the data allocated to the training set and 30%
to the testing set. This division ensures that the models are trained on a sufficiently large portion of the data
while allowing for robust evaluation on a separate set.
The training set will be utilized for developing and training regression and classification models, while the testing
set remains untouched until the evaluation phase. This separation helps gauge the models' generalization
performance on new, unseen data.
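A 70:30 split of this kind can be sketched with scikit-learn as follows. The arrays are stand-ins, and
random_state=42 is an assumption added only to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in features and target: 10 samples, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.3 reserves 30% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```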
Next, I will proceed with developing and evaluating regression models on the training set. Afterward,
classification models will be constructed and assessed using the same split data.
4 REGRESSION MODEL DEVELOPMENT AND EVALUATION:
4.1 LINEAR REGRESSION METRICS:
4.2 MEAN SQUARED ERROR (MSE): 0.0842
The MSE is a measure of the average squared difference between the actual and predicted values. A lower MSE
indicates better model performance.
4.3 R2 SCORE: 0.6768
The R2 score, also known as the coefficient of determination, measures the proportion of the variance in the
dependent variable that is predictable from the independent variables. An R2 score of 0.6768 suggests a
moderate level of explanatory power in the model.
These metrics provide insights into the accuracy and goodness of fit of the Linear Regression model.
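To make the two metrics concrete, the sketch below fits a Linear Regression model on synthetic data (not the
housing data) and computes MSE and R2 both with scikit-learn and directly from their definitions. All variable
names and the synthetic coefficients are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data: a linear signal plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Train on the first 70% of the rows, evaluate on the remaining 30%.
model = LinearRegression().fit(X[:140], y[:140])
y_true = y[140:]
y_pred = model.predict(X[140:])

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# The same quantities from their definitions.
residual_ss = np.sum((y_true - y_pred) ** 2)
total_ss = np.sum((y_true - y_true.mean()) ** 2)
mse_manual = residual_ss / len(y_true)
r2_manual = 1 - residual_ss / total_ss
```

The same two calls apply unchanged to the predictions of the KNN, Decision Tree, and Random Forest regressors.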
4.4 DRAW DENSITY PLOT BETWEEN ORIGINAL AND PREDICTED PRICES.
4.5 KNN REGRESSION METRICS:
4.6 MEAN SQUARED ERROR (MSE): 0.0825
The MSE measures the average squared difference between the actual and predicted values. In the case of KNN,
this indicates the average squared distance between the true and predicted values for the target variable.
4.7 R2 SCORE: 0.6835
The R2 score represents the proportion of the variance in the dependent variable that is predictable from the
independent variables. A higher R2 score indicates a better fit of the model to the data.
These metrics provide insights into the accuracy and goodness of fit of the KNN regression model. The
comparison with the Linear Regression model's metrics allows for a comprehensive assessment of their relative
performance.
4.8 DRAW DENSITY PLOT.
4.9 DECISION TREE REGRESSION METRICS:
4.10 MEAN SQUARED ERROR (MSE): 0.1194
The MSE measures the average squared difference between the actual and predicted values. For the Decision
Tree model, this indicates the average squared distance between the true and predicted values for the target
variable.
4.11 R2 SCORE: 0.5421
The R2 score represents the proportion of the variance in the dependent variable that is predictable from the
independent variables. A higher R2 score indicates a better fit of the model to the data.
These metrics provide insights into the accuracy and goodness of fit of the Decision Tree regression model. The
comparison with the metrics from the Linear Regression and KNN models allows for a comprehensive
assessment of their relative performance.
4.12 DRAW DENSITY PLOT.
4.13 RANDOM FOREST REGRESSION METRICS:
4.14 MEAN SQUARED ERROR (MSE): 0.0680
The MSE measures the average squared difference between the actual and predicted values. For the Random
Forest model, this indicates the average squared distance between the true and predicted values for the target
variable.
4.15 R2 SCORE: 0.7391
The R2 score represents the proportion of the variance in the dependent variable that is predictable from the
independent variables. A higher R2 score indicates a better fit of the model to the data.
These metrics provide insights into the accuracy and goodness of fit of the Random Forest regression model. The
comparison with the metrics from the Linear Regression, KNN, and Decision Tree models allows for a
comprehensive assessment of their relative performance.
The Random Forest model demonstrates promising results, achieving a lower MSE and a higher R2 score
compared to the other regression models.
4.16 DRAW DENSITY PLOT.
In the evaluation of regression models, the Linear Regression model served as a baseline, providing a moderate
fit with a mean squared error (MSE) of 0.0842 and an R2 score of 0.6768. The K-Nearest Neighbors (KNN) model
improved on this only marginally, with an MSE of 0.0825 and an R2 score of 0.6835. The Decision Tree model
performed worst, with an MSE of 0.1194 and an R2 score of 0.5421, which may reflect overfitting by a single tree
or sensitivity to remaining outliers. The Random Forest model outperformed the others with the lowest MSE
(0.0680) and the highest R2 score (0.7391), indicating that averaging many trees captures the relationships in
the data more reliably and gives the most accurate predictions. The Random Forest model is thus selected as the
preferred regression model.
5 CLASSIFICATION MODEL DEVELOPMENT AND EVALUATION:
5.1 DATA CATEGORIZATION
Before delving into the implementation and evaluation of classification models, a critical step involved the
creation of price categories based on a predefined logic. The dataset was enriched with a new column,
'price_category', determined by binning the 'price' feature into four categories:
1. Low: Values ranging from the minimum to the first quartile (min-q1).
2. Medium: Values between the first quartile (q1) and the median.
3. High: Values between the median and the third quartile (median-q3).
4. Very High: Values from the third quartile (q3) to the maximum.
This categorization provides a structured representation of the target variable 'price' and serves as the basis for
training and evaluating the classification models. It allows for a multi-class classification problem, enabling the
models to predict the price category each instance falls into.
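The quartile-based binning described above can be sketched with pd.qcut, which cuts a column at its quantiles.
Whether the original code used qcut or explicit quartile thresholds is not stated, so this is an assumption; the
data are stand-ins.

```python
import pandas as pd

# Stand-in prices.
df = pd.DataFrame(
    {"price": [100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0, 800.0]}
)

labels = ["Low", "Medium", "High", "Very High"]

# q=4 places the bin edges at the minimum, q1, median, q3, and maximum.
df["price_category"] = pd.qcut(df["price"], q=4, labels=labels)

print(df["price_category"].value_counts())
```

Because the bins are quartiles, the four classes are roughly equal in size, so a classifier that guesses at
random would be expected to score about 25% accuracy.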
Next, I will proceed with implementing and evaluating various classification models, starting with Logistic
Regression. Afterward, other models such as K-Nearest Neighbors, Decision Tree, Random Forest, Naive Bayes,
and Support Vector Machine will be explored and their performances compared. The assessment will include
metrics like accuracy, confusion matrices, and feature importance, providing a comprehensive understanding of
each model's effectiveness in predicting price categories.
5.2 DATA SPLITTING:
For the classification task, the dataset was split into training and testing sets using the train_test_split function
with a test size of 30%. This division ensures that 70% of the data is used for training the classification models,
while the remaining 30% is reserved for assessing their performance on unseen data.
The training set (X_train_clf and y_train_clf) will be utilized for training the classification models, while the
testing set (X_test_clf and y_test_clf) remains untouched until the evaluation phase. This separation allows for a
fair evaluation of each model's ability to generalize to new, unseen instances.
5.3 LOGISTIC REGRESSION METRICS:
5.4 ACCURACY SCORE: 0.2287
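The accuracy score represents the proportion of correctly classified instances out of the total instances in the
testing set. In this case, the Logistic Regression model achieved an accuracy of approximately 22.87%. Because
the four price categories are quartile-based and therefore roughly equal in size, random guessing would score
about 25%, so this result is no better than chance.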
For a more detailed evaluation, a classification report and confusion matrix were generated.
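As an illustration of how these three outputs are produced with scikit-learn, the sketch below uses a handful of
made-up true and predicted labels rather than the actual model output:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

labels = ["Low", "Medium", "High", "Very High"]

# Made-up labels for illustration only.
y_true = ["Low", "Medium", "High", "Very High", "Low", "High"]
y_pred = ["Low", "Medium", "Medium", "Very High", "Low", "High"]

accuracy = accuracy_score(y_true, y_pred)

# Per-class precision, recall, and F1 score.
report = classification_report(y_true, y_pred, labels=labels, zero_division=0)

# Rows are true classes, columns are predicted classes, in the order of `labels`.
matrix = confusion_matrix(y_true, y_pred, labels=labels)

print(report)
print(matrix)
```

Reading the matrix row by row shows which categories are confused with each other, which the single accuracy
figure cannot reveal.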
5.5 CLASSIFICATION REPORT:
5.6 CONFUSION MATRIX:
5.7 K-NEAREST NEIGHBORS METRICS:
5.8 ACCURACY SCORE: 0.6037
The accuracy score represents the proportion of correctly classified instances out of the total instances in the
testing set. The KNN model achieved an accuracy of approximately 60.37%.
5.9 CLASSIFICATION REPORT:
5.10 CONFUSION MATRIX:
5.11 DECISION TREE METRICS:
5.12 ACCURACY SCORE: 0.5918
The accuracy score represents the proportion of correctly classified instances out of the total instances in the
testing set. The Decision Tree model achieved an accuracy of approximately 59.18%.
5.13 CLASSIFICATION REPORT:
5.14 CONFUSION MATRIX:
5.15 RANDOM FOREST METRICS:
5.16 ACCURACY SCORE: 0.6516
The accuracy score represents the proportion of correctly classified instances out of the total instances in the
testing set. The Random Forest model achieved an accuracy of approximately 65.16%.
The accuracy score suggests a notable improvement compared to both Logistic Regression and Decision Tree
models, indicating that the ensemble nature of Random Forest, combining multiple decision trees, contributes
to enhanced predictive performance.
5.17 CLASSIFICATION REPORT:
5.18 CONFUSION MATRIX:
5.19 NAIVE BAYES METRICS:
5.20 ACCURACY SCORE: 0.5785
The accuracy score represents the proportion of correctly classified instances out of the total instances in the
testing set. The Naive Bayes model achieved an accuracy of approximately 57.85%.
The accuracy score provides insight into the overall correctness of the model in predicting the categorized price
labels. At 57.85%, it falls below the Random Forest (65.16%), KNN (60.37%), and Decision Tree (59.18%) models.
5.21 CLASSIFICATION REPORT:
5.22 CONFUSION MATRIX:
5.23 SVM METRICS:
5.24 ACCURACY SCORE: 0.2287
The accuracy score represents the proportion of correctly classified instances out of the total instances in the
testing set. The SVM model achieved an accuracy of approximately 22.87%.
The accuracy score provides insight into the overall correctness of the model in predicting the categorized price
labels. In this case, the SVM model ties with Logistic Regression for the lowest accuracy among the classification
models, indicating that, as configured, it is not effective for this particular task.
5.25 CLASSIFICATION REPORT:
5.26 CONFUSION MATRIX:
6 VISUALIZATION OF KEY FINDINGS:
Visualization of Feature Importance
Feature importance plots were generated for three classification models: Logistic Regression, Decision Tree (DT),
and Random Forest (RF). These plots provide insights into the significance of different features in predicting the
categorized price labels.
6.1 LOGISTIC REGRESSION FEATURE IMPORTANCE:
The Logistic Regression feature importance plot illustrates the contribution of each feature in determining the
predicted price categories. This aids in understanding which features have a more significant impact on the
logistic regression model's decision-making process.
6.2 DECISION TREE (DT) FEATURE IMPORTANCE:
The Decision Tree feature importance plot visualizes the importance of each feature in the decision-making
process of the DT model. Features with higher importance values play a more crucial role in splitting the dataset
into different categories.
6.3 RANDOM FOREST (RF) FEATURE IMPORTANCE:
The Random Forest feature importance plot highlights the relevance of each feature in the ensemble of decision
trees. Features with higher importance contribute more to the overall predictive power of the Random Forest
model.
These feature importance plots offer valuable insights into the models' interpretability and provide guidance on
which features contribute significantly to predicting price categories. Understanding feature importance aids in
model interpretation and can guide feature selection for improved model performance.
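A minimal sketch of how such importances can be extracted, using synthetic data in which only one feature
carries signal. The binary target, the feature subset, and the hyperparameters are simplifying assumptions; for
Logistic Regression, importance is taken as the mean absolute coefficient per feature.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data: the target depends only on the third feature ("size").
rng = np.random.default_rng(0)
feature_names = ["beds", "baths", "size"]
X = rng.normal(size=(300, 3))
y = (X[:, 2] > 0).astype(int)

# Tree-based models expose impurity-based importances that sum to 1.
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
rf_importance = pd.Series(rf.feature_importances_, index=feature_names)

# For logistic regression, use coefficient magnitudes, averaged over
# the rows of coef_ (one row per class in the multi-class case).
logreg = LogisticRegression(max_iter=1000).fit(X, y)
logreg_importance = pd.Series(
    np.abs(logreg.coef_).mean(axis=0), index=feature_names
)

print(rf_importance.sort_values(ascending=False))
print(logreg_importance.sort_values(ascending=False))
```

Coefficient magnitudes are only comparable across features when the features share a similar scale, so they
should be read alongside the rescaling applied during preprocessing.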
7 CONCLUSION:
In this analysis, I preprocessed the data by merging the datasets, handling missing values, encoding categorical
variables, and transforming the data for the regression and classification tasks. For regression, I implemented
and evaluated Linear Regression, K-Nearest Neighbors (KNN), Decision Tree, and Random Forest models. Random
Forest emerged as the top-performing model, with the lowest MSE (0.0680) and the highest R2 score (0.7391).
Transitioning to classification, I categorized prices into four classes and evaluated Logistic Regression, KNN,
Decision Tree, Random Forest, Naive Bayes, and Support Vector Machine (SVM) models. Random Forest exhibited the
highest accuracy (65.16%), followed by KNN (60.37%), Decision Tree (59.18%), and Naive Bayes (57.85%). Logistic
Regression and SVM both scored 22.87%, below the roughly 25% expected from random guessing across four
quartile-based classes.
Feature importance plots for Logistic Regression, Decision Tree, and Random Forest provided insights into the
relevance of features in predicting price categories. Logistic Regression's feature importance was visualized
through coefficient magnitudes.
In conclusion, Random Forest excelled in both regression and classification tasks, showcasing its versatility and
predictive power. Feature importance analyses shed light on the significant contributors to predictions. This
report not only provides a detailed account of the modeling process but also serves as a valuable resource for
understanding the intricacies of the dataset and the strengths and weaknesses of various machine learning
models applied to the given problem.