House Price Prediction with Python
USING PYTHON
A PROJECT REPORT
Submitted by
PRATHEBA.R
of
MASTER OF ENGINEERING
IN
(AUTONOMOUS), NAGAPATTINAM
Certified that this Report titled “HOUSE PRICE PREDICTION USING PYTHON” is the
bonafide work of PRATHEBA R (2303820840522010) who carried out the work under
my supervision. Certified further that to the best of my knowledge the work reported
herein does not form part of any other thesis or dissertation on the basis of which a
degree or award was conferred on an earlier occasion on this or any other candidate.
Signature of the HOD with date Signature of the HOD with date
Nagapattinam Nagapattinam
DECLARATION
I submit the project work entitled “HOUSE PRICE PREDICTION USING PYTHON” to "E.G.S
Pillay Engineering College" in partial fulfilment of the requirements for the award of the degree
and declare that it was carried out independently by me under the guidance of [Link] M.E., Ph.D.,
PRATHEBA.R (2303820840522010)
ACKNOWLEDGEMENT
PRATHEBA R
ABSTRACT
Buying a house is a major financial decision for everyone involved in the
housing market, including potential home buyers. Housing is a major part of the
real estate industry. An accurate house price prediction is a valuable tool for buyers
and sellers as well as real estate agents. This study aims to help
people understand and estimate the price of their houses. The
prediction is made using four machine learning algorithms: linear
regression, polynomial regression, random forest, and decision tree. Linear Regression has
Regression and classification problems can both be solved with random forests. Among the
given algorithms, random forest provides the best accuracy, about 89% for the given
dataset.
TABLE OF CONTENTS
CHAPTER NO. TITLE
ABSTRACT
LIST OF ABBREVIATIONS
1 INTRODUCTION
1.3.1 Efficient Prediction of Water
Learning Algorithms
1.3.2 Efficient Water Quality Prediction
learning techniques.
2 PROBLEM ANALYSIS
2.4 Modules
Python
2.4.2 Predicting house rate using AutoML
[Link] Tpot
2.7 Implementation
2.8 Result
2.8.1 Source code
2.8.2 Snapshots
3 CONCLUSION
3.1 Conclusion
REFERENCES
LIST OF ABBREVIATIONS
CHAPTER 1
INTRODUCTION
1.1 MACHINE LEARNING
Machine Learning can review large volumes of data and discover specific
trends and patterns that would not be apparent to humans. For instance, an
e-commerce website like Amazon uses it to understand the browsing behaviors and
purchase histories of its users so that it can offer them the right products, deals, and
reminders. It also uses the results to show them relevant advertisements.
2. No human intervention needed (automation)
With ML, you don’t need to babysit your project every step of the way. Since
it means giving machines the ability to learn, it lets them make predictions and also
improve the algorithms on their own. A common example of this is antivirus
software, which learns to filter new threats as they are recognized. ML is also good at
recognizing spam.
3. Continuous Improvement
4. Wide Applications
You could be an e-tailer or a healthcare provider and make ML work for you.
Where it does apply, it holds the capability to help deliver a much more personal
experience to customers while also targeting the right customers.
1. Data Acquisition
Machine Learning requires massive data sets to train on, and these should be
inclusive, unbiased, and of good quality. There can also be times when you must wait
for new data to be generated.
ML needs enough time to let the algorithms learn and develop enough to fulfill
their purpose with a considerable amount of accuracy and relevancy. It also needs
massive resources to function. This can mean additional requirements for computing
power.
3. Interpretation of Results
4. High error-susceptibility
Figure 1.2 AutoML Concept
You get quicker results. With AutoML you can skip a lot of the machine
learning and data science work and as a result save time. This is useful if you are
building a prototype to test product-market fit or to get early experience of how
users will react to and use the AI when it is applied.
You’re less likely to be outdated. A very common problem in AI development
is that your models are getting old almost as you’re making them. AI
technology is moving so fast that what is state of the art today, and requires a
lot of hard and intelligent effort to achieve, may tomorrow perform worse than
out-of-the-box solutions. With AutoML that pain is shifted to the big tech
AutoML vendors, which have the economies of scale to invest in staying
ahead.
With a hosted AutoML solution you will save a lot of time by not having to
build the surrounding infrastructure. A paper by Google concludes
that on average 95% of the code in machine learning solutions is “glue code” that
builds the infrastructure around the models. The actual machine learning code is
just 5%.
You will need fewer experts. A common problem in AI is that it requires a lot of
expertise and usually involves several different kinds of experts. That’s
extremely expensive, and experts are hard to find. You can skip a lot of that with
AutoML.
You get fewer insights into the data. One of the big advantages of doing machine learning
by hand is that you learn a lot about why the models do not work as
expected. This gives valuable insight into the existing business domain and
also a good idea of what data you might need to go out and collect in order to get
the necessary results.
AutoML is inflexible. It might get results quickly, but as with most IT it won’t
be long before the requirements change, and with an AutoML solution you
run the risk of these requirements suddenly being out of scope of what
AutoML can handle. If that happens, you have to start all over with a custom model, and
that can be a cold start.
Running cost might be too high when you scale. A pay-per-use model is great
for small to medium scale solutions. But if you’re expecting very high volumes
in use then you might end up spending too much money with your vendor. In
this case it might be a better solution to build and host the models yourself.
Quality probably won’t be the highest. If the most important feature in your
solution is the highest quality compared to competitors, AutoML probably will
fall short. AutoML provides generalized models and can in most cases not compete
1.3 LITERATURE SURVEY
In this paper, the resale price of houses is predicted using different
algorithms such as Linear Regression, Decision Tree, K-Means and Random Forest.
Many factors affect the house price, including physical attributes, location
and economic factors. RMSE is used as the performance metric for the
different datasets; the algorithms are applied and compared to find the most accurate model,
the one which predicts better results.
ADVANTAGES
The model adds additional features such as tax and air quality, which makes it
different from other prediction systems.
DISADVANTAGES
The paper predicts house prices based on various features of the
given data. The analysis reports an RMSE value of 2.9131889.
1.3.2 TITLE : House Price Prediction using Machine Learning Algorithms
XGBoost uses CART trees as its base learners, whose nodes hold values instead of
categories, which allows efficient optimization and increases performance. Overfitting
can be effectively avoided using feature sampling and regularisation. This paper has
some drawbacks, including the difficulty of obtaining reasonably complete housing
information. There are two reasons for this. A house transaction price is a trade secret of
the intermediary and developer, and there is a lack of transaction data in government
departments.
ADVANTAGES
The Decision Tree algorithm has a high mean squared error compared to Random
Forest. The Random Forest algorithm has a low mean absolute error compared to Decision
Tree.
DISADVANTAGES
The authors would like to use more datasets and perform a larger analysis using deep learning algorithms.
1.3.3 TITLE : House Price Prediction Analysis using Machine Learning
AUTHOR: Aniket Singh , Adarsh Kumar Singh , Aditya Raj, Harshit Jain,Mrs.
Asha M S
House price prediction can be done using multiple prediction models (machine
learning models) such as support vector regression, artificial neural networks, and more.
There are many benefits that home buyers, property investors, and house builders can reap
from the house-price model. This model will provide a lot of information and knowledge to
home buyers, property investors and house builders, such as the valuation of house prices in
the present market, which will help them determine house prices. Meanwhile, this model can
help potential buyers decide the characteristics of a house they want according to their
budget. Previous studies focused on analyzing the attributes that affect house price and
ADVANTAGES
This paper examined and analyzed the current research on the significant attributes of house
price and analyzed the data mining techniques used to predict house price.
DISADVANTAGES
The research was intended to help and assist other researchers in developing a
real model which can easily and accurately predict house prices. Further work on a real
model needs to be done, using these findings, to confirm them.
1.3.4 TITLE : House Price Prediction using Machine
Garg.
The study focuses on housing cost trends, which serve as indicators of the current
economic situation and have direct implications. The actual cost of a house depends on
Tree, and Random Forest, to construct a predictive model with their working accuracy.
ADVANTAGES
This research paper concludes that employing the Random Forest machine learning
algorithm will lead to more precise pricing predictions using the collected data, which
DISADVANTAGES
The trained data can be utilized to generate new predicted values. This approach offers cost
savings and reduces the need for extensive physical efforts for both buyers and sellers.
Notably, Random Forest predicted house prices with the highest accuracy value of 0.90.
1.3.5 TITLE : Employing Machine Learning for House Price
ADVANTAGES
There is a high risk of human error during this process, which can lead to inaccurate
predictions and reliability issues.
DISADVANTAGES
CHAPTER 2
PROBLEM ANALYSIS
2.1 Problem definition
People looking to buy a new home tend to be more conservative with their budgets and
market strategies. This project aims to analyse various parameters like average income,
average area etc. and predict the house price accordingly. This application will help
fast way of performing operations. The objectives are to provide a proper house price to the customers,
to eliminate the need for a real estate agent to obtain information regarding house prices, to
provide the best price to the user without being cheated, and to enable the user to search for a home
within budget. The aim is to predict efficient house pricing for real estate
customers with respect to their budgets and priorities. By analyzing previous market
trends and price ranges, and also upcoming developments, future prices will be predicted.
House prices increase every year, so there is a need for a system to predict house
prices in the future. House price prediction can help the developer determine the
selling price of a house and can help the customer to arrange the right time to purchase
a house. We use linear regression algorithm in machine learning for predicting the
learning algorithms like Linear Regression, Decision Tree, k-Means, and Random
Forest. In the proposed system, “House Price Prediction Using Machine Learning”,
we predict the house price using multiple features. In this proposed system, we
train the model on various features such as ZN, INDUS, CHAS, RAD etc. Of the
previous data, 80% is used for training and the
remaining 20% is used for testing. Here, the raw data is stored in a ‘.csv’
file. We mainly used two machine learning libraries to solve these problems. The
first was ‘pandas’ and the other was ‘numpy’. pandas was used to load the ‘.csv’
file into a Jupyter notebook and also to clean and manipulate the
data. Another library was sklearn, which was used for the actual analysis and contains
various inbuilt functions which help to solve the [Link] more library was used,
namely numpy, which was used for train-test splitting.
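The 80/20 split described above can be sketched with NumPy alone. This is a minimal illustration rather than the project's actual code: the helper name `split_80_20`, the random seeds, the hypothetical `PRICE` target column, and the synthetic 506-row table are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def split_80_20(frame, seed=0):
    """Shuffle the row positions and return (train, test) in an 80:20 ratio."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(frame))   # random ordering of row positions
    cut = int(0.8 * len(frame))           # the first 80% of rows go to training
    return frame.iloc[order[:cut]], frame.iloc[order[cut:]]


# Synthetic stand-in for the '.csv' data; the values are random, not real prices.
rng = np.random.default_rng(1)
data = pd.DataFrame({
    "ZN": rng.uniform(0, 100, 506),
    "INDUS": rng.uniform(0, 30, 506),
    "CHAS": rng.integers(0, 2, 506),
    "RAD": rng.integers(1, 25, 506),
    "PRICE": rng.uniform(5, 50, 506),     # hypothetical target column
})

train, test = split_80_20(data)
print(len(train), len(test))
```

Shuffling before cutting matters when the file is sorted (for example by location), because otherwise the test rows would all come from one end of the data.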
2.2 Proposed System
trend of housing prices in certain locations. The proposed method uses attributes such as
price, property type, location, square feet, city, bedrooms and other parameters for
house price prediction. The classification of attributes makes it easier to analyze the
Polynomial, XGBoost, Random Forest and Decision Tree algorithms for comparing
and analysing the accuracy of house prices and predicting better results. This work imported
various packages such as numpy, pandas, seaborn, sklearn metrics and other packages
into our Python environment for evaluating our ML model. Encoding is a technique for
converting categorical variables into numerical values so that they can be easily fitted
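One common way to encode categorical variables is one-hot encoding, sketched below with `pandas.get_dummies`. The report does not say which encoder was used, so this choice is an assumption, and the table, its column names, and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical listing data; column names and values are invented for this sketch.
houses = pd.DataFrame({
    "property_type": ["flat", "villa", "flat", "plot"],
    "city": ["Chennai", "Nagapattinam", "Chennai", "Trichy"],
    "square_feet": [850, 2400, 1100, 1800],
})

# One-hot encoding: every category becomes its own 0/1 column.
encoded = pd.get_dummies(houses, columns=["property_type", "city"], dtype=int)

print(encoded.columns.tolist())
```

Numeric columns such as `square_feet` pass through unchanged, so the encoded frame can be handed directly to a regression model.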
2.3 Modules
and classes. It supports multiple programming paradigms beyond object-oriented
programming, such as procedural and functional [Link] helps
developers increase their confidence and productivity in developing software,
from development to deployment and maintenance. The benefits that make Python
well suited to machine learning and AI-driven projects include simplicity and
consistency, flexibility, access to powerful AI and machine learning (ML) libraries and
frameworks, platform independence, and large communities. These things increase the
popularity of the language.
● Scikit-learn handles basic ML algorithms such as clustering, linear and
logistic regression, and classification.
● Pandas is used for high-level data structures and data analysis. It allows you to merge
and filter data and to load data from external sources (such as Excel).
● Matplotlib is a plotting library for Python. It is used along with NumPy to
provide an environment that is an effective open source alternative to MATLAB.
It can also be used with graphics toolkits like PyQt and wxPython.
● NumPy is a library for the Python programming language, adding support for
large, multi-dimensional arrays and matrices, along with a large collection of
high-level mathematical functions to operate on these arrays.
● Seaborn is a library that uses Matplotlib underneath to plot graphs. It will be
used to visualize random distributions.
[Link] Random Forest classification
[Link] K-nearest neighbor
Figure [Link] K-nearest neighbor
each step in the process must be handled [Link] machine learning can
target various stages of the machine learning [Link] to automate are:
● Data preparation and ingestion (from raw data and miscellaneous formats)
○ Column type detection; e.g., boolean, discrete numerical,
continuous numerical, or text.
○ Column intent detection; e.g., target/label, stratification field,
numerical feature, categorical text feature, or free text feature.
○ Task detection; e.g., binary classification, regression, clustering,
or ranking.
● Feature engineering
○ Feature selection
○ Feature extraction
○ Meta learning and transfer learning
○ Detection and handling of skewed data and/or missing values
● Model selection - choosing which machine learning algorithm to use,
often including multiple competing software implementations
● Ensembling - a form of consensus where using multiple models often
gives better results than any single model[3]
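As an illustration of the first automation target in the list above, column type detection, the following sketch maps pandas dtypes to coarse column types. The helper `detect_column_type` and the sample table are hypothetical; real AutoML tools use more elaborate heuristics than a dtype check.

```python
import pandas as pd
from pandas.api import types as ptypes


def detect_column_type(series):
    """Map a pandas Series to one of the coarse column types listed above."""
    if ptypes.is_bool_dtype(series):
        return "boolean"
    if ptypes.is_integer_dtype(series):
        return "discrete numerical"
    if ptypes.is_float_dtype(series):
        return "continuous numerical"
    return "text"


# Made-up sample table used only to exercise the helper.
sample = pd.DataFrame({
    "has_garage": [True, False, True],
    "bedrooms": [2, 3, 4],
    "area": [850.5, 1200.0, 1430.25],
    "city": ["A", "B", "C"],
})

column_types = {name: detect_column_type(sample[name]) for name in sample.columns}
print(column_types)
```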
Secondly we implement [Link] AutoML we use TPOT (Tree-based Pipeline
Optimization Tool). We predict the water quality level and measure the execution time of
the program.
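Measuring the execution time of a program can be sketched as below. Note that this gives wall-clock time for one run, which is a practical proxy and not a formal time-complexity analysis. The helper `timed` and the placeholder workload `sum_of_squares` are assumptions for illustration; in the project the timed call would be the model training.

```python
import time


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


def sum_of_squares(n):
    # Placeholder workload standing in for model training.
    return sum(i * i for i in range(n))


result, seconds = timed(sum_of_squares, 100_000)
print(f"execution time: {seconds:.4f} secs")
```

`time.perf_counter()` is used here because it is a monotonic, high-resolution clock intended for measuring short durations.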
[Link] TPOT
offer the “end-to-end” of a Machine Learning [Link] can see the phases
automated by TPOT and the ones specifically addressed by the Data Scientist or
Machine Learning Engineer.
2.4 Requirement specification
➢ 64-bit CPU
➢ 4GB RAM
2.5 System Design
A system architecture is the conceptual model that defines the structure, behavior,
and other views of a system. It consists of the system components and the sub-systems
developed, which work together to implement the overall system.
2.6 Implementation
Data from different sources is usually messy and scattered, contains missing values, and
is sometimes unstructured.
Before building a regression model, it's crucial to understand the data, and clean and
optimize it for the best result. For an in-depth explanation check out this article on data
cleaning and preprocessing.
Let's go over the steps you should take before building your model.
[Link]().sum()
This gives the number of null values in each column. There are
different ways to deal with this such as:
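Two common options, dropping incomplete rows or filling the gaps with the column mean, can be sketched as follows. These two options are an assumption on my part, since the report's own list is not shown; the small table is made up and is not the project's dataset.

```python
import numpy as np
import pandas as pd

# Small made-up table with gaps.
data = pd.DataFrame({
    "area": [1200.0, np.nan, 950.0, 1500.0],
    "bedrooms": [3, 2, np.nan, 4],
    "price": [55.0, 40.0, 38.0, 72.0],
})

null_counts = data.isnull().sum()    # number of missing values per column
dropped = data.dropna()              # option 1: drop rows that have any gap
filled = data.fillna(data.mean())    # option 2: fill gaps with the column mean

print(null_counts)
```

Dropping is simplest but discards data; filling keeps every row at the cost of introducing estimated values.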
Step 4: Explore the correlation between variables
This code will show the relationships between the independent variables (features)
and the dependent (target) variable.
It will also show which columns or features determine the outcome of the target variable
more than others.
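A minimal sketch of this step, assuming a pandas DataFrame with a target column named `price` (a hypothetical name): `DataFrame.corr()` gives the pairwise Pearson correlations, and sorting the target's column ranks the features by how strongly they relate to it. The data here is synthetic and built so that `rooms` matters and `noise` does not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
rooms = rng.uniform(3, 9, 200)
noise = rng.normal(0, 1, 200)

# Synthetic data: price is constructed to depend strongly on rooms, not on noise.
data = pd.DataFrame({
    "rooms": rooms,
    "noise": noise,
    "price": 5.0 * rooms + rng.normal(0, 0.5, 200),
})

corr = data.corr()                   # pairwise Pearson correlations
ranking = corr["price"].drop("price").abs().sort_values(ascending=False)
print(ranking)
```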
This scatter plot shows the relationship between the independent and dependent variables, with a straight
line drawn to show the relationship.
Here, the data set is first separated into X (independent variables/features) and Y
(dependent variable/target):
We handle the missing values by dropping columns with missing/null values and split the
dataset into training and testing sets in an 80:20 ratio.
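The separation and split described above might look like the following sketch, which uses scikit-learn's `train_test_split`. The table, the column names, and the `random_state` value are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up table; "price" plays the role of the target and "garden" has gaps.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "area": rng.uniform(500, 3000, 100),
    "bedrooms": rng.integers(1, 6, 100),
    "garden": [np.nan if i % 10 == 0 else 1.0 for i in range(100)],
    "price": rng.uniform(20, 200, 100),
})

data = data.dropna(axis=1)          # drop every column that contains a null
X = data.drop(columns=["price"])    # independent variables / features
Y = data["price"]                   # dependent variable / target

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```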
We create a model by assigning an instance of the model class to a variable, and
train the model by fitting it to the training dataset.
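As a sketch of this create-then-fit pattern, the example below uses scikit-learn's `LinearRegression` on a tiny synthetic training set. The choice of linear regression and the data are assumptions for illustration; any scikit-learn estimator follows the same two steps.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data: price = 3 * area + 10, with no noise.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
Y_train = 3.0 * X_train[:, 0] + 10.0

model = LinearRegression()      # create an instance of the model
model.fit(X_train, Y_train)     # train it on the training set

prediction = model.predict(np.array([[5.0]]))
print(prediction)
```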
● “R-squared”: This tells you how well the model explains the variation in house prices.
● Mean Squared Error (MSE): This measures the average squared difference between predicted and actual prices.
● Precision score.
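The first two metrics can be computed directly from their definitions, as in this sketch. The function names and the sample prices are made up for illustration; scikit-learn provides equivalent functions in `sklearn.metrics`.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up actual and predicted prices.
actual = [20.0, 30.0, 40.0, 50.0]
predicted = [22.0, 28.0, 41.0, 49.0]

mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)
print(mse, r2)
```

The root mean squared error (RMSE) used in the literature survey is simply the square root of the MSE.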
2.7 Result
# Start the timer before any work is done
import time
start_time = [Link]()
import pandas as pd
import numpy as np
# Load and inspect the data
data = pd.read_csv('C:/Users/ELCOT/Downloads/water_potability.csv')
[Link]()
[Link]()
[Link]().sum()
[Link].value_counts()
data = [Link]()
[Link].value_counts()
notpotable = data[data['Potability']==0]
potable = data[data['Potability']==1]
data = shuffle(data)
[Link]
[Link].value_counts()
# Correlation analysis and plots
corrmat = [Link]()
f, ax = [Link](figsize=(12, 9))
[Link](figsize = (15,9))
corr = [Link]()
corr["Potability"].sort_values(ascending=False)
g = [Link](data, col='Potability')
[Link](data[cols])
[Link]()
from [Link] import accuracy_score
# Separate features and target, then scale the features
x = [Link](['Potability'], axis = 1)
y = data['Potability']
st = StandardScaler()
col = [Link]
x[col] = st.fit_transform(x[col])
x[col]
# Classifiers to compare
knn = KNeighborsClassifier()
rf = RandomForestClassifier()
grid_knn.fit(X_train, Y_train)
grid_rf.fit(X_train, Y_train)
models = [('K Nearest Neighbours', knn),('Random Forest', rf)]
# Train each model and score it on the test set
for model_name, model in models:
    [Link](X_train, Y_train)
    y_pred = [Link](X_test)
    accuracy = accuracy_score(Y_test,y_pred)
end_time = [Link]()
print("execution time:", end_time - start_time, "secs")
AUTOML
# Start the timer before any work is done
import time
start_time = [Link]()
import pandas as pd
import numpy as np
# Load and inspect the data
data = pd.read_csv('C:/Users/ELCOT/Downloads/water_potability.csv')
[Link]()
[Link]()
[Link]().sum()
[Link].value_counts()
data = [Link]()
[Link].value_counts()
notpotable = data[data['Potability']==0]
potable = data[data['Potability']==1]
from [Link] import resample
data = shuffle(data)
[Link]
[Link].value_counts()
# Plots
[Link](figsize=(12,9))
df = [Link]()
fig, ax = [Link](figsize=(12,8))
sns.set_style('white')
[Link]()
# Let TPOT search for a pipeline
tpot = TPOTClassifier(generations=1, verbosity=2)
x = [Link](['Potability'], axis = 1)
y = data['Potability']
= StandardScaler()
[Link](x_train,y_train)
end_time = [Link]()
print("execution time:", end_time - start_time, "secs")
2.8.1 Snapshots
ADVANTAGE OF LSTM OVER OTHER MODELS
The LSTM model can be tuned through various parameters, such as changing the number of
LSTM layers, adding a dropout value, or increasing the number of epochs.
Long Short Term Memory (LSTM)
LSTMs are widely used for sequence prediction problems and have proven to be extremely
effective. The reason they work so well is because LSTM is able to store past information
that is important, and forget the information that is not. LSTM has three gates:
The input gate: It adds information to the cell state.
The forget gate: It removes the information that is no longer required by the model.
The output gate: It selects the information to be shown as
output.
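How the three gates interact in a single time step can be sketched with NumPy using the standard LSTM cell equations. This is an illustrative toy cell with random weights, not a trained model and not code from the project; the function name `lstm_step` and the weight layout are assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W has shape (4 * hidden, hidden + inputs)."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    i = sigmoid(z[0 * hidden:1 * hidden])     # input gate: what to add
    f = sigmoid(z[1 * hidden:2 * hidden])     # forget gate: what to discard
    o = sigmoid(z[2 * hidden:3 * hidden])     # output gate: what to expose
    g = np.tanh(z[3 * hidden:4 * hidden])     # candidate values for the cell state
    c = f * c_prev + i * g                    # updated cell state
    h = o * np.tanh(c)                        # output / hidden state
    return h, c


# Random weights for a toy cell with 3 inputs and 2 hidden units (illustrative only).
rng = np.random.default_rng(0)
inputs, hidden = 3, 2
W = rng.normal(0, 0.1, (4 * hidden, hidden + inputs))
b = np.zeros(4 * hidden)

h, c = lstm_step(rng.normal(size=inputs), np.zeros(hidden), np.zeros(hidden), W, b)
print(h, c)
```

The line `c = f * c_prev + i * g` is where "forgetting" and "adding" happen: the forget gate scales the old cell state and the input gate scales the new candidate values.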
Figure [Link] Data Parameters
Then we check whether our data has any null values, i.e. missing values.
If the data is incomplete, errors will occur during processing,
which may lead to a loss of accuracy in the prediction model.
Figure [Link] Given data
Since our data contains no missing values, the program skips the dropping phase of data
processing, in which incomplete data is dropped or missing values are filled so
that the data is suitable for modeling.
Next we describe the data in a way that both people and machines find
easy to understand. To do this we use the describe() function.
Figure [Link] Approx Estimation
Count refers to the number of instances of data in each column, i.e. 506, since there are
506 rows of data for each column. Mean refers to the mean value of the data in the given column.
Std is the standard deviation, i.e. a measure of how widely the values in a
particular column are spread around the mean.
25% means that 25 percent of the data in that column is equal to or below that value.
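The meaning of these entries can be checked on a small example. The sketch below calls `describe()` on a made-up column of eight prices (the real dataset has 506 rows) and recomputes the same quantities directly.

```python
import pandas as pd

# Made-up column of eight prices; the real dataset has 506 rows.
prices = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0], name="price")

summary = prices.describe()
print(summary[["count", "mean", "std", "25%"]])

# The same quantities computed directly, to show what each entry means.
count = prices.count()
mean = prices.mean()
std = prices.std()              # sample standard deviation (ddof=1)
q25 = prices.quantile(0.25)     # 25th percentile of the column
```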
Next we try to understand the correlation between the different variables. The best way to do
that is with a heat map. A heat map is a representation of data in the form of a
map or diagram in which data values are represented as colours.
Correlation is a statistical measure that expresses the extent to which two variables are
linearly related (meaning they change together at a constant rate). There are two types of
correlation, they are:
These two types of correlation are represented numerically as well as by shade of
color in the heat map.
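Assuming the two types meant here are positive and negative correlation, the numeric values behind the heat map colours can be sketched with `numpy.corrcoef`. The three arrays below are made-up numbers chosen so that one pair rises together and the other moves in opposite directions.

```python
import numpy as np

area = np.array([600.0, 800.0, 1000.0, 1200.0, 1400.0])
price = np.array([30.0, 41.0, 50.0, 62.0, 70.0])      # rises with area
discount = np.array([9.0, 7.5, 6.0, 4.0, 2.5])        # falls as area rises

# np.corrcoef returns the correlation matrix; entry [0, 1] is Pearson's r.
r_positive = np.corrcoef(area, price)[0, 1]
r_negative = np.corrcoef(area, discount)[0, 1]
print(r_positive, r_negative)
```

Values close to +1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none.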
3 Conclusion
Thus the machine learning model to predict the house price based on the given dataset was
executed successfully using the XGBoost regressor (a gradient-boosted ensemble of decision
trees, which gives a lower error). This model further helps people understand whether a
place is suited to them, based on the heatmap correlation. It also helps people looking to
sell a house choose the best time for greater profit. Any house price in any location can be predicted
with minimum error by giving an appropriate dataset.
3.1 Future Enhancement
Work is currently in progress on deployment using Flask and on automating the result file.
Housing data sets from other countries can be used for prediction. Applying this project to other
sectors as well as other countries is yet to be explored.
REFERENCES