Machine Learning Real Estate Analysis

This document provides a summary of a machine learning project report on residential property price prediction. It describes the dataset containing features of properties like number of bedrooms, bathrooms, size, lot size, and location. It details the data preprocessing steps performed, including importing, merging, handling missing values, detecting outliers, log transformations, and encoding categorical variables. Regression models like linear regression, KNN, decision tree, and random forest were developed and their performance compared using metrics like MSE and R2 score. The random forest model performed best. The data was also categorized into price ranges to form a classification problem for predicting the price category. Classification models like logistic regression, KNN, decision tree, random forest, naive bayes and SVM will

Uploaded by

Faizan Bajwa

1/17/2024 Machine Learning

Project Report

Muhammad Wasim
SUBMITTED TO DR ISRAR ALI KHAN
1 DATA PREPROCESSING:

1.1 ABOUT DATA:

The dataset comprises eight columns, each offering valuable information about residential properties. The 'beds'
column denotes the number of bedrooms in a property, providing insights into its spatial layout. The 'baths'
column represents the count of bathrooms, a crucial factor influencing property value and functionality. The
'size' column signifies the total size of the property, offering an understanding of its overall spatial capacity.
The 'size_units' column gives the unit of measurement for the property size.

Additionally, the 'lot_size' column, with associated 'lot_size_units,' describes the size of the land associated with
the property. This information is vital for assessing outdoor space. The 'zip_code' column contains the zip code
of the property location, offering geographical context. Finally, the 'price' column represents the monetary value
associated with each property, serving as the target variable for predictive modeling. Together, these columns
encompass key features for comprehensive analysis and modeling within the real estate domain.

1.2 IMPORTING DATA

I began the data preprocessing by importing the training and testing datasets using the Pandas library.

1.3 MERGING DATASETS

I merged the datasets to create a unified dataset for comprehensive analysis and preprocessing. The concat
function from Pandas was used to concatenate the training and testing datasets along the rows (axis=0).
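
The import and merge steps can be sketched as follows. This is a minimal illustration, not the report's original code: the in-memory CSV text stands in for the real training and testing files (whose names are not given in the report), and the sample rows are invented.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the training and testing files. In practice
# these would be file paths passed to pd.read_csv (file names assumed).
train_csv = io.StringIO(
    "beds,baths,size,size_units,lot_size,lot_size_units,zip_code,price\n"
    "3,2.0,1500,sqft,5000,sqft,98101,650000\n"
    "2,1.0,900,sqft,,,98102,420000\n"
)
test_csv = io.StringIO(
    "beds,baths,size,size_units,lot_size,lot_size_units,zip_code,price\n"
    "4,3.0,2400,sqft,0.25,acre,98103,980000\n"
)

train_df = pd.read_csv(train_csv)
test_df = pd.read_csv(test_csv)

# Stack the two frames row-wise (axis=0); ignore_index rebuilds a clean
# 0..n-1 index instead of repeating the index of each source frame.
df = pd.concat([train_df, test_df], axis=0, ignore_index=True)
print(df.shape)
```

Passing ignore_index=True is a design choice for this sketch; without it the merged frame keeps duplicate index labels from the two inputs.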
2 EXPLORATORY DATA ANALYSIS (EDA)

2.1 CHECK DATA TYPES

I inspected the data types of each feature to ensure proper understanding of the dataset. This step is crucial for
identifying categorical variables and numerical features.

2.2 EXAMINE MISSING VALUES AND DUPLICATES

I conducted a thorough examination of missing values and duplicate entries within the dataset. Handling missing
values is crucial for maintaining data integrity and ensuring reliable model performance.
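
A minimal sketch of these checks with Pandas is shown below. The small frame is invented for illustration; only the column names come from the dataset description.

```python
import numpy as np
import pandas as pd

# Invented sample rows; rows 0 and 2 are deliberately identical.
df = pd.DataFrame({
    "beds": [3, 2, 3, 4],
    "baths": [2.0, 1.0, 2.0, 3.0],
    "lot_size": [5000.0, np.nan, 5000.0, 0.25],
    "lot_size_units": ["sqft", None, "sqft", "acre"],
})

# Data type of each column (numeric vs. text/categorical).
print(df.dtypes)

# Number of missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)
```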
2.3 HANDLE MISSING VALUES
I implemented appropriate techniques to handle missing values, ensuring a complete and clean dataset for
subsequent analysis and modeling.
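
The report does not state which techniques were used. As one possible approach, assumed here purely for illustration, numeric gaps can be filled with the column median and categorical gaps with the most frequent value:

```python
import numpy as np
import pandas as pd

# Invented sample data with one missing numeric and one missing text value.
df = pd.DataFrame({
    "lot_size": [5000.0, np.nan, 7000.0, 6000.0],
    "lot_size_units": ["sqft", None, "sqft", "acre"],
})

# Drop exact duplicate rows first (none in this sample).
df = df.drop_duplicates()

# Assumption: median imputation for the numeric column.
df["lot_size"] = df["lot_size"].fillna(df["lot_size"].median())

# Assumption: mode (most frequent value) imputation for the text column.
most_common_unit = df["lot_size_units"].mode()[0]
df["lot_size_units"] = df["lot_size_units"].fillna(most_common_unit)

print(df)
```

The median is less affected by extreme values than the mean, which is why it is used in this sketch; dropping incomplete rows would be an equally valid alternative.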

2.4 DETECT AND HANDLE OUTLIERS USING INTERQUARTILE RANGE (IQR)

Outliers were detected and addressed using the Interquartile Range (IQR) method. This ensures that extreme
values do not unduly influence the analysis and model performance.
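
The IQR rule can be sketched as below. The 1.5 multiplier is the conventional choice and is an assumption here, as is the decision to drop outliers rather than cap them; the report does not specify either.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Invented prices with one extreme value.
df = pd.DataFrame(
    {"price": [300_000, 350_000, 400_000, 450_000, 500_000, 5_000_000]}
)

lower, upper = iqr_bounds(df["price"])

# Keep only rows whose price lies within the fences (inclusive).
filtered = df[df["price"].between(lower, upper)]
print(lower, upper, len(filtered))
```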
2.5 APPLY LOG TRANSFORMATION FOR RESCALING
To address skewed data distributions and improve model performance, I applied log transformation to rescale
the dataset. This transformation is particularly useful for features with right-skewed distributions.
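
A small sketch of the idea, assuming the log1p variant (log(1 + x)), which is a common choice because it is defined at zero; the report does not say which form of the logarithm was used.

```python
import numpy as np
import pandas as pd

# Invented right-skewed prices: one value is far larger than the rest.
df = pd.DataFrame({"price": [100_000.0, 200_000.0, 400_000.0, 3_200_000.0]})

# log1p compresses large values much more than small ones.
df["log_price"] = np.log1p(df["price"])

# expm1 is the exact inverse, so predictions made on the log scale
# can be mapped back to the original price scale.
recovered = np.expm1(df["log_price"])

print(df["price"].skew(), df["log_price"].skew())
```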

2.6 ENCODING CATEGORICAL VARIABLES

Categorical variables were encoded using appropriate techniques to convert them into a numerical format,
making them suitable for machine learning algorithms.
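
The encoding technique is not named in the report. One common option, assumed here for illustration, is one-hot encoding with the Pandas get_dummies function:

```python
import pandas as pd

# Invented sample with one categorical column.
df = pd.DataFrame({
    "size": [1500, 900, 2400],
    "lot_size_units": ["sqft", "acre", "sqft"],
})

# Replace the text column with one 0/1 indicator column per category.
encoded = pd.get_dummies(df, columns=["lot_size_units"], dtype=int)
print(encoded)
```

Label encoding (mapping each category to an integer) would be an alternative; one-hot encoding avoids implying an order between categories.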
The completion of these EDA and preprocessing steps sets the foundation for building and evaluating regression
and classification models. Moving forward, the data will be split into training and testing sets for model
development.

3 DATA SPLITTING

I have successfully split the dataset into training and testing sets. This step is crucial for training machine
learning models on a subset of the data and evaluating their performance on unseen data.

The data splitting was performed using a 70:30 ratio, with 70% of the data allocated to the training set and 30%
to the testing set. This division ensures that the models are trained on a sufficiently large portion of the data
while allowing for robust evaluation on a separate set.
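
A 70:30 split can be sketched with scikit-learn's train_test_split. The synthetic arrays and the random_state value are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed features and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# test_size=0.30 reserves 30% of the rows for testing; fixing
# random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

print(X_train.shape, X_test.shape)
```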

The training set will be utilized for developing and training regression and classification models, while the testing
set remains untouched until the evaluation phase. This separation helps gauge the models' generalization
performance on new, unseen data.

Next, I will proceed with developing and evaluating regression models on the training set. Afterward,
classification models will be constructed and assessed using the same split data.
4 REGRESSION MODEL DEVELOPMENT AND EVALUATION:

4.1 LINEAR REGRESSION METRICS:

4.2 MEAN SQUARED ERROR (MSE): 0.0842

The MSE is a measure of the average squared difference between the actual and predicted values. A lower MSE
indicates better model performance.

4.3 R2 SCORE: 0.6768

The R2 score, also known as the coefficient of determination, measures the proportion of the variance in the
dependent variable that is predictable from the independent variables. An R2 score of 0.6768 suggests a
moderate level of explanatory power in the model.

These metrics provide insights into the accuracy and goodness of fit of the Linear Regression model.
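
To make the two metrics concrete, the sketch below computes them by hand on a tiny invented example. The function names are illustrative; scikit-learn's mean_squared_error and r2_score compute the same quantities.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def r2(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

# Squared errors are 1, 0, 1, 0, so MSE = 2 / 4 = 0.5.
# Total sum of squares around the mean (6) is 20, so R2 = 1 - 2/20 = 0.9.
print(mse(y_true, y_pred), r2(y_true, y_pred))
```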

4.4 DRAW DENSITY PLOT BETWEEN ORIGINAL AND PREDICTED PRICES.

4.5 KNN REGRESSION METRICS:

4.6 MEAN SQUARED ERROR (MSE): 0.0825

The MSE measures the average squared difference between the actual and predicted values. In the case of KNN,
this indicates the average squared distance between the true and predicted values for the target variable.
4.7 R2 SCORE: 0.6835
The R2 score represents the proportion of the variance in the dependent variable that is predictable from the
independent variables. A higher R2 score indicates a better fit of the model to the data.

These metrics provide insights into the accuracy and goodness of fit of the KNN regression model. The
comparison with the Linear Regression model's metrics allows for a comprehensive assessment of their relative
performance.

4.8 DRAW DENSITY PLOT.

4.9 DECISION TREE REGRESSION METRICS:

4.10 MEAN SQUARED ERROR (MSE): 0.1194

The MSE measures the average squared difference between the actual and predicted values. For the Decision
Tree model, this indicates the average squared distance between the true and predicted values for the target
variable.
4.11 R2 SCORE: 0.5421
The R2 score represents the proportion of the variance in the dependent variable that is predictable from the
independent variables. A higher R2 score indicates a better fit of the model to the data.

These metrics provide insights into the accuracy and goodness of fit of the Decision Tree regression model. The
comparison with the metrics from the Linear Regression and KNN models allows for a comprehensive
assessment of their relative performance.

4.12 DRAW DENSITY PLOT.

4.13 RANDOM FOREST REGRESSION METRICS:

4.14 MEAN SQUARED ERROR (MSE): 0.0680

The MSE measures the average squared difference between the actual and predicted values. For the Random
Forest model, this indicates the average squared distance between the true and predicted values for the target
variable.

4.15 R2 SCORE: 0.7391

The R2 score represents the proportion of the variance in the dependent variable that is predictable from the
independent variables. A higher R2 score indicates a better fit of the model to the data.
These metrics provide insights into the accuracy and goodness of fit of the Random Forest regression model. The
comparison with the metrics from the Linear Regression, KNN, and Decision Tree models allows for a
comprehensive assessment of their relative performance.

The Random Forest model achieves the lowest MSE and the highest R2 score of the four regression models.

4.16 DRAW DENSITY PLOT.

In the evaluation of regression models, Linear Regression served as the baseline, providing a moderate fit with a
mean squared error (MSE) of 0.0842 and an R2 score of 0.6768. The K-Nearest Neighbors (KNN) model improved on
this only marginally, with an MSE of 0.0825 and an R2 score of 0.6835. The Decision Tree model performed worst,
with an MSE of 0.1194 and an R2 score of 0.5421, which may reflect overfitting or sensitivity to outliers. The
Random Forest model outperformed the others with the lowest MSE (0.0680) and the highest R2 score (0.7391), and
is therefore selected as the preferred regression model.
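
A comparison of this kind can be sketched as a loop over scikit-learn estimators. The data below is synthetic and all hyperparameters are library defaults or assumptions, so the printed scores will not match the figures reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem with a mildly non-linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "KNN": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model on the training set and score it on the held-out test set.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "mse": mean_squared_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }

# Print the models from lowest to highest test MSE.
for name, scores in sorted(results.items(), key=lambda kv: kv[1]["mse"]):
    print(f"{name}: MSE={scores['mse']:.4f}, R2={scores['r2']:.4f}")
```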
5 CLASSIFICATION MODEL DEVELOPMENT AND EVALUATION:

5.1 DATA CATEGORIZATION

Before delving into the implementation and evaluation of classification models, a critical step involved the
creation of price categories based on a predefined logic. The dataset was enriched with a new column,
'price_category', determined by binning the 'price' feature into four categories:

1. Low: Values from the minimum to the first quartile (min to q1).

2. Medium: Values from the first quartile to the median (q1 to median).

3. High: Values from the median to the third quartile (median to q3).

4. Very High: Values from the third quartile to the maximum (q3 to max).

This categorization provides a structured representation of the target variable 'price' and serves as the basis for
training and evaluating the classification models. It allows for a multi-class classification problem, enabling the
models to predict the price category each instance falls into.
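
The quartile binning described above can be sketched with the Pandas qcut function. The prices are invented, and the use of qcut is an assumption; the same result could be obtained with explicit quartile thresholds.

```python
import pandas as pd

# Invented prices, already sorted for readability.
df = pd.DataFrame({
    "price": [100_000, 200_000, 300_000, 400_000,
              500_000, 600_000, 700_000, 800_000],
})

# q=4 cuts at the quartiles, giving four bins with (about) equal counts.
df["price_category"] = pd.qcut(
    df["price"], q=4, labels=["Low", "Medium", "High", "Very High"]
)

print(df["price_category"].value_counts(sort=False))
```

Because the bins are defined by quartiles, each category holds roughly a quarter of the rows, so a classifier that guesses at random would be right about 25% of the time.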

Next, I will proceed with implementing and evaluating various classification models, starting with Logistic
Regression. Afterward, other models such as K-Nearest Neighbors, Decision Tree, Random Forest, Naive Bayes,
and Support Vector Machine will be explored and their performances compared. The assessment will include
metrics like accuracy, confusion matrices, and feature importance, providing a comprehensive understanding of
each model's effectiveness in predicting price categories.
5.2 DATA SPLITTING:
For the classification task, the dataset was split into training and testing sets using the train_test_split function
with a test size of 30%. This division ensures that 70% of the data is used for training the classification models,
while the remaining 30% is reserved for assessing their performance on unseen data.

The training set (X_train_clf and y_train_clf) will be utilized for training the classification models, while the
testing set (X_test_clf and y_test_clf) remains untouched until the evaluation phase. This separation allows for a
fair evaluation of each model's ability to generalize to new, unseen instances.

5.3 LOGISTIC REGRESSION METRICS:

5.4 ACCURACY SCORE: 0.2287

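The accuracy score represents the proportion of correctly classified instances out of the total instances in the
testing set. In this case, the Logistic Regression model achieved an accuracy of approximately 22.87%, which is
close to the roughly 25% expected from random guessing when the four price categories are about equally sized.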

For a more detailed evaluation, a classification report and confusion matrix were generated.
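
The three evaluation outputs can be produced with scikit-learn as sketched below. The true and predicted labels are invented to keep the arithmetic easy to follow.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

labels = ["Low", "Medium", "High", "Very High"]

# Invented labels: 6 of the 8 predictions are correct.
y_true = ["Low", "Low", "Medium", "Medium", "High", "High", "Very High", "Very High"]
y_pred = ["Low", "Medium", "Medium", "Medium", "High", "Very High", "Very High", "Very High"]

# Fraction of predictions that match the true label.
acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(acc)
print(cm)

# Per-class precision, recall, and F1 score.
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```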

5.5 CLASSIFICATION REPORT:

5.6 CONFUSION MATRIX:

5.7 K-NEAREST NEIGHBORS METRICS:

5.8 ACCURACY SCORE: 0.6037

The accuracy score represents the proportion of correctly classified instances out of the total instances in the
testing set. The KNN model achieved an accuracy of approximately 60.37%.
5.9 CLASSIFICATION REPORT:

5.10 CONFUSION MATRIX:

5.11 DECISION TREE METRICS:

5.12 ACCURACY SCORE: 0.5918

The accuracy score represents the proportion of correctly classified instances out of the total instances in the
testing set. The Decision Tree model achieved an accuracy of approximately 59.18%.

5.13 CLASSIFICATION REPORT:

5.14 CONFUSION MATRIX:

5.15 RANDOM FOREST METRICS:

5.16 ACCURACY SCORE: 0.6516

The accuracy score represents the proportion of correctly classified instances out of the total instances in the
testing set. The Random Forest model achieved an accuracy of approximately 65.16%.

The accuracy score suggests a notable improvement compared to both Logistic Regression and Decision Tree
models, indicating that the ensemble nature of Random Forest, combining multiple decision trees, contributes
to enhanced predictive performance.

5.17 CLASSIFICATION REPORT:

5.18 CONFUSION MATRIX:

5.19 NAIVE BAYES METRICS:

5.20 ACCURACY SCORE: 0.5785

The accuracy score represents the proportion of correctly classified instances out of the total instances in the
testing set. The Naive Bayes model achieved an accuracy of approximately 57.85%.

The accuracy score provides insight into the overall correctness of the model in predicting the categorized price
labels. It falls below the accuracy of the Random Forest model (65.16%).

5.21 CLASSIFICATION REPORT:

5.22 CONFUSION MATRIX:

5.23 SVM METRICS:

5.24 ACCURACY SCORE: 0.2287

The accuracy score represents the proportion of correctly classified instances out of the total instances in the
testing set. The SVM model achieved an accuracy of approximately 22.87%.

The accuracy score provides insight into the overall correctness of the model in predicting the categorized price
labels. In this case, the SVM model matches Logistic Regression at 0.2287, the lowest accuracy among the
classification models, indicating that it is not effective for this task as configured.

5.25 CLASSIFICATION REPORT:

5.26 CONFUSION MATRIX:
6 VISUALIZATION OF KEY FINDINGS:
Visualization of Feature Importance

Feature importance plots were generated for three classification models: Logistic Regression, Decision Tree (DT),
and Random Forest (RF). These plots provide insights into the significance of different features in predicting the
categorized price labels.

6.1 LOGISTIC REGRESSION FEATURE IMPORTANCE:

The Logistic Regression feature importance plot illustrates the contribution of each feature in determining the
predicted price categories. This aids in understanding which features have a more significant impact on the
logistic regression model's decision-making process.
6.2 DECISION TREE (DT) FEATURE IMPORTANCE:

The Decision Tree feature importance plot visualizes the importance of each feature in the decision-making
process of the DT model. Features with higher importance values play a more crucial role in splitting the dataset
into different categories.

6.3 RANDOM FOREST (RF) FEATURE IMPORTANCE:

The Random Forest feature importance plot highlights the relevance of each feature in the ensemble of decision
trees. Features with higher importance contribute more to the overall predictive power of the Random Forest
model.
These feature importance plots offer valuable insights into the models' interpretability and provide guidance on
which features contribute significantly to predicting price categories. Understanding feature importance aids in
model interpretation and can guide feature selection for improved model performance.
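
The importance values behind such plots can be extracted as sketched below. The data is synthetic, with labels driven mainly by the 'size' column, and averaging absolute coefficients across classes is one possible way to summarize a multi-class Logistic Regression, assumed here for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

feature_names = ["beds", "baths", "size", "lot_size"]

# Synthetic features; the label depends mostly on "size" (column 2).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
score = 2.0 * X[:, 2] + 0.5 * X[:, 0] + rng.normal(scale=0.5, size=400)

# Bin the score at its quartiles to get four classes (0 to 3).
y = np.digitize(score, np.quantile(score, [0.25, 0.5, 0.75]))

# Logistic Regression: scale the features so coefficient magnitudes are
# comparable, then average |coefficient| over the four classes.
X_scaled = StandardScaler().fit_transform(X)
log_reg = LogisticRegression(max_iter=1000).fit(X_scaled, y)
log_reg_importance = np.abs(log_reg.coef_).mean(axis=0)

# Random Forest: impurity-based importances, which sum to 1.
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
forest_importance = forest.feature_importances_

for name, lr_imp, rf_imp in zip(feature_names, log_reg_importance, forest_importance):
    print(f"{name}: logistic={lr_imp:.3f}, forest={rf_imp:.3f}")
```

A Decision Tree classifier exposes the same feature_importances_ attribute as the Random Forest, so the Decision Tree plot can be built in the same way.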

7 CONCLUSION:

This analysis began with data preprocessing: merging datasets, handling missing values, encoding categorical
variables, and transforming data for regression and classification tasks. For regression, Linear Regression,
K-Nearest Neighbors (KNN), Decision Tree, and Random Forest models were implemented and evaluated. Random Forest
emerged as the top-performing model, with the lowest MSE (0.0680) and the highest R2 score (0.7391).

For classification, prices were categorized into four classes and Logistic Regression, KNN, Decision Tree, Random
Forest, Naive Bayes, and Support Vector Machine (SVM) models were evaluated. Random Forest achieved the highest
accuracy (65.16%), followed by KNN (60.37%), Decision Tree (59.18%), and Naive Bayes (57.85%). Logistic Regression
and SVM were far behind, both at 22.87%.

Feature importance plots for Logistic Regression, Decision Tree, and Random Forest provided insights into the
relevance of features in predicting price categories. Logistic Regression's feature importance was visualized
through coefficient magnitudes.

In conclusion, Random Forest was the best-performing model in both the regression and classification tasks.
Feature importance analyses shed light on the significant contributors to predictions. This report provides a
detailed account of the modeling process and of the strengths and weaknesses of the machine learning models
applied to the given problem.
