
House Price Prediction with Python

The document is a project report titled 'House Price Prediction Using Python' submitted by Pratheba R for a Master's degree in Computer Science and Engineering. It outlines the use of various machine learning algorithms, including linear regression and random forest, for predicting house prices, emphasizing the importance of accurate predictions in the real estate market. The report includes sections on problem analysis, system design, and implementation, along with acknowledgments and a declaration of independent work.

Uploaded by

haripriyame1993
Copyright
© All Rights Reserved

HOUSE PRICE PREDICTION

USING PYTHON

A PROJECT REPORT

Submitted by

PRATHEBA.R

in partial fulfillment for the award of the degree

of

MASTER OF ENGINEERING

IN

COMPUTER SCIENCE AND ENGINEERING


E.G.S. PILLAY ENGINEERING COLLEGE

(AUTONOMOUS), NAGAPATTINAM

ANNA UNIVERSITY, CHENNAI

JULY 2024


ANNA UNIVERSITY CHENNAI
BONAFIDE CERTIFICATE

Certified that this report titled “HOUSE PRICE PREDICTION USING PYTHON” is the
bonafide work of PRATHEBA R (2303820840522010), who carried out the work under
my supervision. Certified further that, to the best of my knowledge, the work reported
herein does not form part of any other thesis or dissertation on the basis of which a
degree or award was conferred on an earlier occasion on this or any other candidate.

Signature of the HOD with date Signature of the HOD with date

[Link], M.E.,Ph.D., [Link], M.E.,Ph.D.,

Professor and Head Professor and Head

Department of CSE Department of CSE

Nagapattinam Nagapattinam

Submitted for the ANNA UNIVERSITY viva-voce examination held on

at E.G.S. Pillay Engineering College (Autonomous), Nagapattinam

INTERNAL EXAMINER EXTERNAL EXAMINER

DECLARATION

I submit the project work entitled “HOUSE PRICE PREDICTION USING PYTHON” to "E.G.S.
Pillay Engineering College" in partial fulfilment of the requirements for the award of the degree
of "MASTER OF ENGINEERING" in "COMPUTER SCIENCE AND ENGINEERING". I
declare that it was carried out independently by me under the guidance of [Link] M.E., Ph.D.,
Professor in the Computer Science and Engineering Department.

PRATHEBA.R( 2303820840522010)

I certify that the declaration made above by the candidate is true.

[Link] M.E., Ph.D


Professor
Department of Computer Science Engineering
E.G.S Pillay Engineering college
(Autonomous),
Nagapattinam.

ACKNOWLEDGEMENT

It is indeed my responsibility to express my heartfelt gratitude
to our founder Chairman, the late Chevalier [Link]. I would like to express
my respect to my Chairman [Link] [Link], a
good-hearted and vital presence in my institution. I also express my sincere thanks to my
honourable Advisor Chevalier [Link].,[Link].,FCCA
(London). I would like to thank my honourable Secretary [Link]
KUMAR [Link]., for having a great dream to serve the nation by educating
the rural community with the help of this institution. I would like to thank
my honourable Joint Secretary [Link] GANESH. I also express
my sincere thanks to my Principal [Link]., M.E., Ph.D., for
providing me with the necessary facilities to carry out this project work.

I am very much honoured and would like to express my deep
gratitude to my Head of the Department [Link] M.E., Ph.D., for her
expert, sincere and valuable guidance.

I would like to express my sincere thanks to my project
[Link] M.E., Ph.D., whose wide knowledge, serious research
attitude and enthusiasm for work deeply impressed me and enabled me to complete
this project to the utmost satisfaction of my superiors within the stipulated period.
I express my hearty thanks to all the faculty members and
technical assistants of the Department of Computer Science and
Engineering for their timely assistance. My humble gratitude and thanks to
my parents and friends, who have supported and helped me to complete the
project work.

PRATHEBA R
ABSTRACT

Buying a house is a major financial decision for everyone involved in the
housing market, including potential home buyers. Housing is a major part of the real
estate industry. An accurate house price prediction is a valuable tool for buyers
and sellers as well as real estate agents. This study aims to help people understand
and estimate the price of their houses. The prediction is made using four machine
learning algorithms: linear regression, polynomial regression, decision tree and
random forest. Linear regression has good interpretability. A decision tree is a
graphical representation of all possible solutions to a decision. Polynomial
regression can be fitted to a wide variety of curves. Random forests handle both
regression and classification problems. Among these algorithms, random forest
provides the best accuracy, about 89% for the given dataset.

TABLE OF CONTENTS

CHAPTER NO. TITLE

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS

1 INTRODUCTION
1.1 Machine learning
1.1.1 Advantages of Machine learning
1.1.2 Disadvantages of Machine learning
1.2 Automatic machine learning
1.2.1 Advantages of AutoML
1.2.2 Disadvantages of AutoML
1.3 Literature survey
1.3.1 House Price Prediction Using Machine Learning
1.3.2 House Price Prediction using Machine Learning Algorithms
1.3.3 House Price Prediction Analysis using Machine Learning
1.3.4 House Price Prediction using Machine Learning
1.3.5 Employing Machine Learning for House Price Prediction

2 PROBLEM ANALYSIS
2.1 Problem definition
2.2 Existing System
2.3 Proposed System
2.4 Modules
2.4.1 Predicting house rate using Python
[Link] Random forest classification
[Link] K-nearest neighbour
2.4.2 Predicting house rate using AutoML
[Link] TPOT
2.4.3 Performance of evaluation
2.5 Requirement specification
2.5.1 Hardware specification
2.5.2 Software specification
2.6 System Design
2.6.1 System architecture
2.7 Implementation
2.8 Result
2.8.1 Source code
2.8.2 Snapshots

3 CONCLUSION
3.1 Conclusion
3.2 Future Enhancement

REFERENCES

LIST OF FIGURES

FIGURE NUMBER TITLE

1.2 AutoML Concept
2.4.1 Predicting house rate using Python
[Link] Random forest classification
[Link] K-nearest neighbor
[Link] House rate prediction using TPOT
2.6.1 System architecture
[Link] Heatmap Graph Using Python
[Link] Heatmap Graph Using Python with Values
[Link] Facegrid Graph Using Python
[Link] Accuracy level and Time Complexity in Python
[Link] Facegrid Graph Using AutoML
[Link] Heatmap Graph Using AutoML with Values
[Link] Accuracy level and Time Complexity in AutoML

LIST OF ABBREVIATIONS

ABBREVIATIONS FULL FORM

RFC Random forest classification

KNN K-nearest neighbor

AUTOML Automatic machine learning

TPOT Tree based pipeline optimization tool

CHAPTER 1

INTRODUCTION
1.1 MACHINE LEARNING

Machine learning is an application of AI that enables systems to learn and
improve from experience without being explicitly programmed. Machine learning
focuses on developing computer programs that can access data and use it to learn for
themselves. Similar to how the human brain gains knowledge and understanding,
machine learning relies on input, such as training data or knowledge graphs, to
understand entities, domains and the connections between them. With entities defined,
deep learning can begin. The machine learning process begins with observations or
data, such as examples, direct experience or instruction. It looks for patterns in data so
it can later make inferences based on the examples provided. The primary aim of ML
is to allow computers to learn autonomously without human intervention or assistance
and adjust actions accordingly. Machine learning as a concept has been around for
quite some time. The term “machine learning” was coined by Arthur Samuel, a
computer scientist at IBM and a pioneer in AI and computer gaming. Samuel designed
a computer program for playing checkers. The more the program played, the more it
learned from experience, using algorithms to make predictions.

1.1.1 Advantages of Machine learning

1. Easily identifies trends and patterns

Machine Learning can review large volumes of data and discover specific
trends and patterns that would not be apparent to humans. For instance, for an e-
commerce website like Amazon, it serves to understand the browsing behaviors and
purchase histories of its users to help cater to the right products, deals, and reminders
relevant to them. It uses the results to reveal relevant advertisements to them.

2. No human intervention needed (automation)

With ML, you don’t need to babysit your project every step of the way. Since
it means giving machines the ability to learn, it lets them make predictions and also
improve the algorithms on their own. A common example of this is antivirus
software, which learns to filter new threats as they are recognized. ML is also good at
recognizing spam.

3. Continuous Improvement

As ML algorithms gain experience, they keep improving in accuracy and
efficiency. This lets them make better decisions. Say you need to make a weather
forecast model. As the amount of data you have keeps growing, your algorithms learn
to make more accurate predictions faster.

4. Wide Applications

You could be an e-tailer or a healthcare provider and make ML work for you.
Where it does apply, it holds the capability to help deliver a much more personal
experience to customers while also targeting the right customers.

1.1.2 Disadvantages of Machine learning

1. Data Acquisition

Machine Learning requires massive data sets to train on, and these should be
inclusive, unbiased and of good quality. There can also be times when you must wait
for new data to be generated.

2. Time and Resources

ML needs enough time to let the algorithms learn and develop enough to fulfill
their purpose with a considerable amount of accuracy and relevancy. It also needs
massive resources to function. This can mean additional requirements of computer
power for you.

3. Interpretation of Results

Another major challenge is the ability to accurately interpret results generated
by the algorithms. You must also carefully choose the algorithms for your purpose.

4. High error-susceptibility

Machine Learning is autonomous but highly susceptible to errors. Suppose you
train an algorithm with data sets too small to be inclusive. You end up with
biased predictions coming from a biased training set. In an advertising system, for
example, this leads to irrelevant advertisements being displayed to customers. In the
case of ML, such blunders can set off a chain of errors that can go undetected for long
periods of time. And when they do get noticed, it takes quite some time to recognize
the source of the issue, and even longer to correct it.

1.2 AUTOMATIC MACHINE LEARNING

Automated machine learning (AutoML) is the process of applying machine learning
models to real-world problems using automation. More specifically, it automates
the selection, composition and parameterization of machine learning models.
Automating the ML process makes it more user-friendly and often provides faster,
more accurate outputs than hand-coded algorithms. AutoML is typically a platform or
open source library that simplifies each step in the ML process, from handling a raw
dataset to deploying a practical ML model. In traditional ML, models are developed
by hand, and each step in the process must be handled separately.

Figure 1.2 AutoML Concept

1.2.1 Advantages of AutoML

● You get quicker results. With AutoML you can skip a lot of the machine
learning and data science work and, as a result, save time. This helps if you are
building a prototype to test product-market fit or to get early experience of how
users will react to and use the AI.
● You’re less likely to be outdated. A very common problem in AI development
is that your models get old almost as you’re making them. AI technology is
moving so fast that what is state of the art today, and requires a lot of hard and
intelligent effort to achieve, may tomorrow perform worse than out-of-the-box
solutions. With AutoML that pain is placed with the big tech AutoML vendors,
which have the economies of scale to invest in staying ahead.
● With a hosted AutoML solution you save a lot of time by not having to
build the surrounding infrastructure. A paper by Google concludes that on
average 95% of the code in machine learning solutions is “glue code” that
builds the infrastructure around the models. The actual machine learning code is
just 5%.
● You will need fewer experts. A common problem in AI is that it requires a lot of
expertise and usually involves several different kinds of experts. That is
extremely expensive, and experts are hard to find. You can skip a lot of that with
AutoML.

1.2.2 Disadvantages of AutoML

● You get fewer insights into the data. One of the big advantages of doing machine
learning by hand is that you learn a lot about why the models do not work as
expected. This gives valuable insight into the existing business domain and
also a good idea of what data you might need to go out and collect in order to get
the necessary results.
● AutoML is inflexible. It might get results quickly, but as with most IT it won’t
be long before the requirements change, and with an AutoML solution you
run the risk of these requirements suddenly being out of scope of what
AutoML can handle. If that happens, you have to start all over with a custom
model, and that can be a cold start.
● Running cost might be too high when you scale. A pay-per-use model is great
for small to medium scale solutions. But if you’re expecting very high volumes
in use, you might end up spending too much money with your vendor. In
this case it might be a better solution to make and host the models yourself.
● Quality probably won’t be the highest. If the most important feature in your
solution is the highest quality compared to competitors, AutoML will probably
fall short. AutoML uses generalized models and in most cases cannot compete

1.3 LITERATURE SURVEY

1.3.1 TITLE : House Price Prediction Using Machine Learning

AUTHOR: Anand G. Rawool, Dattatray V. Rogye, Sainath G. Rane, Dr. Vinayk A. Bharadi

In this paper, the resale price of houses is predicted using different algorithms
such as Linear Regression, Decision Tree, K-Means and Random Forest.
Many factors affect house prices, including physical attributes, location
and economic factors. The paper uses RMSE as the performance metric; the algorithms
are applied to different datasets to find the most accurate model, which
gives the best predictions.

ADVANTAGES

The model adds additional features such as tax and air quality, which distinguishes it
from other prediction systems.

DISADVANTAGES

The paper predicts house prices based on various features of the
given data. The analysis reports an RMSE value of 2.9131889.
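
As a reminder of what the RMSE metric measures, RMSE = sqrt((1/n) Σ (actual_i − predicted_i)²). The sketch below computes it in plain Python. The function name and the price values are hypothetical and chosen only for illustration; they are not taken from the surveyed paper.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared residual."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical prices (illustrative only)
actual_prices = [200.0, 250.0, 300.0]
predicted_prices = [210.0, 240.0, 330.0]

error = rmse(actual_prices, predicted_prices)
print(f"RMSE = {error:.3f}")
```

A lower RMSE means the predictions are closer to the actual prices. Because the errors are squared, large misses are penalised more heavily than small ones.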

1.3.2 TITLE : House Price Prediction using Machine Learning Algorithms

AUTHOR: Angulakshmi M, Deepa M, Mala Serene I , Thilagavathi M , Aarthi P

XGBoost builds on CART trees, whose nodes hold values instead of
categories, which allows efficient optimization and increases performance. Overfitting
can be effectively avoided using feature sampling and regularisation. The paper notes
some drawbacks, including the difficulty of obtaining reasonably complete housing
information. There are two reasons for this: a house transaction price is a trade secret of
the intermediary and developer, and there is a lack of transaction data in government
departments.

ADVANTAGES

The Decision Tree algorithm has a higher mean squared error than Random
Forest, and the Random Forest algorithm has a lower mean absolute error than Decision
Tree.

DISADVANTAGES

Random forest provides higher accuracy than the other algorithms. The authors would
therefore like to use more data and perform a larger analysis using deep learning algorithms.

1.3.3 TITLE : House Price Prediction Analysis using Machine Learning

AUTHOR: Aniket Singh, Adarsh Kumar Singh, Aditya Raj, Harshit Jain, Mrs. Asha M S

House price prediction can be done using multiple prediction models (machine
learning models) such as support vector regression, artificial neural networks, and more.
There are many benefits that home buyers, property investors, and house builders can reap
from a house-price model. The model provides information and knowledge
such as the valuation of house prices in the present market, which helps them
determine house prices. Meanwhile, the model can help potential buyers decide the
characteristics of a house they want according to their budget. Previous studies
analyzed the attributes that affect house price and predicted house price with
machine learning models separately.

ADVANTAGES

This paper examined and analyzed the current research on the significant attributes of house
price and analyzed the data mining techniques used to predict house price.

DISADVANTAGES

The research was intended to help other researchers develop a
real model that can easily and accurately predict house prices. Further work on a real
model is needed to confirm the findings.

1.3.4 TITLE : House Price Prediction using Machine Learning

AUTHOR: Ujjwal Kumar, Rishu Kunwar, Dr. Neha Garg

The study focuses on housing cost trends, which serve as indicators of the current
economic situation and have direct implications. The actual cost of a house depends on
numerous factors, including the number of bedrooms, bathrooms, and [Link]
employed different machine learning models, including Linear Regression, Decision
Tree, and Random Forest, to construct a predictive model and report their accuracy.
The approach involved a step-by-step process, from data collection and
pre-processing through data analysis to model building.

ADVANTAGES

This research paper concludes that employing the Random Forest machine learning
algorithm leads to more precise price predictions using the collected data, which
serves as the core component of machine learning.

DISADVANTAGES

The trained model can be used to generate new predicted values. This approach offers cost
savings and reduces the physical effort required of both buyers and sellers.
Notably, Random Forest predicted house prices with the highest accuracy value of 0.90.

1.3.5 TITLE : Employing Machine Learning for House Price Prediction

AUTHOR: Nikita Malik, Vidhu Gaba, Priyansh

Generation of unprocessed effluents, municipal refuse, factory wastes, and
junking of compostable and non-compostable effluents has hugely contaminated
nature-provided water bodies like rivers, lakes and ponds. Therefore, there is a
necessity to look into the water standards before usage. This is a problem
that can greatly benefit from Artificial Intelligence (AI). However, it is still a
field under research. This work aims to recognize the areas where an AutoML
system falls short of or outperforms a traditional expert system built by data
scientists. With this motive, the work uses Machine Learning
(ML) algorithms to compare AutoML and an expert architecture built by the
authors for Water Quality Assessment, evaluating the Water Quality Index,
which gives the general water quality, and the Water Quality Class, a term
classified on the basis of the Water Quality Index. The results show that the
accuracy of AutoML and TPOT was 1.4% higher than conventional ML
techniques for binary class water data. For multi-class water data, AutoML was
0.5% higher and TPOT was 0.6% higher than conventional ML techniques.

ADVANTAGES

There is a high risk of human error during this process, which can lead to inaccurate
predictions and reliability issues.

DISADVANTAGES

It takes more time.

CHAPTER 2

PROBLEM ANALYSIS
2.1 Problem definition
People looking to buy a new home tend to be more conservative with their budgets and
market strategies. This project aims to analyse various parameters such as average income
and average area, and predict the house price accordingly. The application will help
customers invest in an estate without approaching an agent. Its objectives are:

● To provide a better and faster way of performing operations.
● To provide a proper house price to the customers.
● To eliminate the need for a real estate agent to gain information regarding house prices.
● To provide the best price to the user without getting cheated.
● To enable the user to search for a home as per the budget.

The aim is to predict efficient house pricing for real estate
customers with respect to their budgets and priorities. By analyzing previous market
trends and price ranges, as well as upcoming developments, future prices will be predicted.
House prices increase every year, so there is a need for a system to predict house
prices in the future. House price prediction can help the developer determine the
selling price of a house and can help the customer choose the right time to purchase
a house. We use the linear regression algorithm in machine learning for predicting
house price trends.

2.2 Existing System

The existing system focuses on predicting house prices using machine
learning algorithms such as Linear Regression, Decision Tree, K-Means and Random
Forest. The system, “House Price Prediction Using Machine Learning”,
predicts the house price using multiple features. The model is trained on
features such as ZN, INDUS, CHAS and RAD. Of the
previous data collected, 80% is used for training and the
remaining 20% for testing. The raw data is stored in a ‘.csv’
file. Two libraries are mainly used: ‘pandas’ and ‘numpy’.
pandas is used to load the ‘.csv’ file into a Jupyter notebook and to clean and
manipulate the data. sklearn is used for the actual analysis; it contains
various built-in functions that help to solve the problem.
numpy is used for the train-test split.
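
The 80/20 split described above can be sketched with NumPy alone by shuffling the row indices and cutting them at the 20% mark. This is a minimal sketch under stated assumptions: the helper name `train_test_split_indices`, the seed and the toy arrays are hypothetical, and the existing system's actual code is not shown in this report.

```python
import numpy as np

def train_test_split_indices(n_rows, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) from a reproducible random shuffle."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)            # random order of row indices
    n_test = int(round(n_rows * test_fraction))   # size of the test portion
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical data: 10 rows, 2 feature columns, 1 target column
features = np.arange(20, dtype=float).reshape(10, 2)
prices = np.arange(10, dtype=float)

train_idx, test_idx = train_test_split_indices(len(features))
X_train, X_test = features[train_idx], features[test_idx]
y_train, y_test = prices[train_idx], prices[test_idx]
print(X_train.shape, X_test.shape)
```

Shuffling before cutting matters when the CSV rows are ordered (for example by location or date), since an unshuffled cut would give the model a test set unlike its training set.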

2.3 Proposed System

House price predictions help property investors understand the
trend of housing prices in certain locations. The proposed method uses attributes such as
price, property type, location, square feet, city, bedrooms and other parameters for
house price prediction. The classification of attributes makes it easier to analyze the
effects of different attributes on different [Link] paper applied Linear,
Polynomial, XGBoost, Random Forest and Decision Tree algorithms to compare
and analyse the accuracy of house price predictions and obtain a better result. Various
packages such as numpy, pandas, seaborn and sklearn metrics were imported into the
Python environment for evaluating the ML model. Encoding is a technique for
converting categorical variables into numerical values so that they can be fitted
to a machine learning model.
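
One common form of encoding is one-hot encoding, where each category becomes its own 0/1 column. The sketch below uses `pandas.get_dummies`. The column names and values are hypothetical, and the report does not state which encoding scheme was actually used, so this is only one possible choice.

```python
import pandas as pd

# Hypothetical listing data with one categorical column
houses = pd.DataFrame({
    "property_type": ["flat", "villa", "flat"],
    "bedrooms": [2, 4, 3],
})

# Replace the categorical column with one 0/1 indicator column per category
encoded = pd.get_dummies(houses, columns=["property_type"], dtype=int)
print(encoded)
```

After this step every column is numeric, so the frame can be passed directly to a regression model.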

2.4 Modules

● Predicting house rate using Python
○ Random forest Classification
○ K-nearest neighbor
● Predicting house rate using AutoML
○ TPOT
● Performance of evaluation

2.4.1 Predicting house rate using Python

Python is an interpreted, interactive, object-oriented programming language. It
incorporates modules, exceptions, dynamic typing, very high level dynamic data types,
and classes. It supports multiple programming paradigms beyond object-oriented
programming, such as procedural and functional programming. It helps
developers increase confidence and productivity throughout the software life cycle,
from development to deployment and maintenance. The benefits that make Python a
good fit for machine learning and AI-driven projects include simplicity and
consistency, flexibility, access to powerful AI and machine learning (ML) libraries and
frameworks, platform independence, and large communities. These things increase the
popularity of the language.

● Scikit-learn handles basic ML algorithms such as clustering, linear and
logistic regression, and classification.
● Pandas is used for high-level data structures and analysis. It allows you to merge
and filter data and to collect data from external sources (such as Excel).
● Matplotlib is a plotting library for Python. It is used along with NumPy to
provide an environment that is an effective open source alternative to MATLAB.
It can also be used with graphics toolkits like PyQt and wxPython.
● NumPy is a library for the Python programming language, adding support for
large, multi-dimensional arrays and matrices, along with a large collection of
high-level mathematical functions to operate on these arrays.
● Seaborn is a library that uses Matplotlib underneath to plot graphs. It will be
used to visualize random distributions.

In our project, we first implement two algorithms in Python: RFC
(random forest classification) and KNN (K-nearest neighbor). We measure the accuracy
of the house rate prediction and the time complexity of the program.

Figure 2.4.1 Predicting house rate using Python

[Link] Random Forest classification

● A random forest is a machine learning technique that is used to solve
regression and classification problems. It utilizes ensemble learning, a
technique that combines many classifiers to provide solutions to complex
problems.

● A random forest classification algorithm consists of many decision trees. The
‘forest’ generated by the algorithm is trained
through bagging, or bootstrap aggregating. Bagging is an ensemble
meta-algorithm that improves the accuracy of ML algorithms.
● The random forest algorithm establishes the outcome based on
the predictions of the decision trees. For regression it takes the average (mean) of the
outputs of the trees; for classification it takes a majority vote. Increasing the number
of trees generally increases the precision of the outcome.
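
The averaging step described above can be checked directly with scikit-learn. The sketch below uses `RandomForestRegressor` rather than a classifier, on the assumption that the house price is treated as a continuous target; the data is synthetic and the tree count is arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (illustrative only): price grows with area, plus noise
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(200, 1))
y = 50 * X[:, 0] + rng.normal(0, 5000, size=200)

# Each tree is trained on a bootstrap sample of the rows (bagging)
forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

sample = np.array([[1500.0]])
per_tree = np.array([tree.predict(sample)[0] for tree in forest.estimators_])
forest_prediction = forest.predict(sample)[0]

# The forest's prediction is the mean of the individual tree predictions
print(per_tree.mean(), forest_prediction)
```

Individual trees disagree because each saw a different bootstrap sample; averaging them reduces the variance of the final estimate.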

Figure [Link] Random Forest classification

[Link] K-nearest neighbor

The K-Nearest Neighbor (KNN) algorithm falls under the supervised learning
category and is used for classification and regression. It is a versatile algorithm that
is also used for imputing missing values and resampling datasets. As the name
suggests, it considers the K nearest neighbors to predict the class or continuous value
for a new data point.
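
For a continuous value such as a price, the KNN prediction is the mean of the target values of the K nearest training points. A minimal sketch with scikit-learn's `KNeighborsRegressor` follows; the areas and prices are hypothetical, and K = 3 is an arbitrary choice.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical training data: area (sq. ft.) and price
areas = np.array([[1000.0], [1100.0], [1200.0], [3000.0]])
prices = np.array([100.0, 110.0, 120.0, 300.0])

knn = KNeighborsRegressor(n_neighbors=3).fit(areas, prices)

# The three nearest areas to 1050 are 1000, 1100 and 1200,
# so the prediction is the mean of their prices: (100 + 110 + 120) / 3
predicted = knn.predict(np.array([[1050.0]]))[0]
print(predicted)
```

The far-away house at 3000 sq. ft. has no influence on this prediction, which is why the choice of K and the scaling of the features both matter.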

Figure [Link] K-nearest neighbor

2.4.2 Predicting house rate using AutoML

Automated machine learning (AutoML) is the process of applying machine
learning (ML) models to real-world problems using automation. More specifically, it
automates the selection, composition and parameterization of machine learning
models. Automating the machine learning process makes it more user-friendly and
often provides faster, more accurate outputs than hand-coded algorithms. AutoML is
typically a platform or open source library that simplifies each step in the machine
learning process, from handling a raw dataset to deploying a practical machine
learning model. In traditional machine learning, models are developed by hand, and
each step in the process must be handled separately. Automated machine learning can
target various stages of the machine learning process. Steps to automate are:

● Data preparation and ingestion (from raw data and miscellaneous formats)
○ Column type detection; e.g., boolean, discrete numerical,
continuous numerical, or text.
○ Column intent detection; e.g., target/label, stratification field,
numerical feature, categorical text feature, or free text feature.
○ Task detection; e.g., binary classification, regression, clustering,
or ranking.
● Feature engineering
○ Feature selection
○ Feature extraction
○ Meta learning and transfer learning
○ Detection and handling of skewed data and/or missing values
● Model selection - choosing which machine learning algorithm to use,
often including multiple competing software implementations
● Ensembling - a form of consensus where using multiple models often
gives better results than any single model[3]
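
The model selection step in the list above can be illustrated by scoring a few candidate models with cross-validation and keeping the best one. This is a deliberately simplified stand-in: it is an exhaustive comparison of two hand-picked candidates, not the genetic-programming search that TPOT performs. The data is synthetic and the candidate names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic, nearly linear data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(120, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=120)

candidates = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=3, random_state=0),
}

# Mean 5-fold cross-validated R^2 for each candidate
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)
print(scores, "->", best_name)
```

An AutoML tool automates the same loop over a much larger space that also includes preprocessing steps and hyperparameters.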
Secondly, we implement AutoML using TPOT (Tree-based Pipeline
Optimization Tool). We measure the accuracy of the house rate prediction and the time
complexity of the program.

[Link] TPOT

TPOT (Tree-based Pipeline Optimization Tool) is an AutoML tool
specifically designed for the efficient construction of optimal pipelines through
genetic programming. TPOT is an open source library and makes use of scikit-learn
components for data transformation, feature decomposition, feature selection and
model selection. Although TPOT is classified as an AutoML tool, it does not
offer the full “end-to-end” of a Machine Learning [Link] can see the phases
automated by TPOT and the ones specifically addressed by the Data Scientist or
Machine Learning Engineer.

Figure [Link] House Rate Prediction using TPOT

2.4.3 Performance of Evaluation

We implement house rate prediction using machine learning
techniques. In this project we implement the random forest classification and
K-nearest neighbor algorithms in supervised learning, and TPOT in AutoML. We then
compare the Python implementation with the AI framework and determine which has
the lower time complexity.
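
One simple way to carry out such a comparison is to time the training of each model and record its score on held-out rows. The sketch below assumes that "time complexity" is measured as wall-clock fit time, and it uses the regression variants of random forest and KNN on synthetic data; the helper name `timed_fit_and_score` is hypothetical.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

def timed_fit_and_score(model, X_train, y_train, X_test, y_test):
    """Fit the model, then return (R^2 on the test rows, fit time in seconds)."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start
    return model.score(X_test, y_test), fit_seconds

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(200, 2))
y = 40 * X[:, 0] + 10 * X[:, 1] + rng.normal(0, 2000, size=200)
X_train, X_test, y_train, y_test = X[:160], X[160:], y[:160], y[160:]

models = {
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "knn": KNeighborsRegressor(n_neighbors=5),
}
results = {
    name: timed_fit_and_score(model, X_train, y_train, X_test, y_test)
    for name, model in models.items()
}
for name, (r2, seconds) in results.items():
    print(f"{name}: R^2={r2:.3f}, fit time={seconds:.4f}s")
```

Wall-clock timings vary between runs and machines, so a fair comparison repeats each measurement several times and reports the average.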

2.5 Requirement specification

The requirements specification defines the hardware and software requirements. The
requirements specification for any system can be broadly stated as given below:

● The system should be able to interface with the existing system.


2.5.1 Hardware specification

➢ 64-bit CPU

➢ 4GB RAM

➢ 5GB free disk space

2.5.2 Software specifications

➢ Windows 10 operating system


➢ Python version 3.8
➢ AutoML

2.6 System Design

2.6.1 System architecture

A system architecture is the conceptual model that defines the structure, behavior,
and more views of a system. It consists of the system components and sub-systems
that work together to implement the overall system.

Figure 2.6.1 System Architecture

2.7 Implementation

Step 1: Import libraries and load data

If you are new to machine learning models: the libraries are imported under abbreviations
for the sole purpose of writing shorter code. The dataset is loaded using pandas’ read_csv
function, and the first five rows are then displayed using the DataFrame’s head() method.
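
A minimal sketch of this step is shown below. To keep it self-contained, the CSV content is supplied inline through `io.StringIO`; in the project the argument to `read_csv` would be the path of the dataset file. The column names, the values and the variable name `df` are hypothetical.

```python
import io

import pandas as pd  # "pd" is the conventional abbreviation

# Hypothetical CSV content standing in for the dataset file
csv_text = """avg_income,avg_area,price
60000,1500,250000
75000,1800,320000
52000,1200,190000
91000,2400,450000
68000,1600,275000
80000,2000,360000
"""

df = pd.read_csv(io.StringIO(csv_text))  # pass a file path here in practice
first_rows = df.head()                   # first five rows by default
print(first_rows)
```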

Step 2: Exploratory Data Analysis (EDA)

Data from different sources is usually messy and scattered, contains missing values, and
is sometimes unstructured.

Before building a regression model, it's crucial to understand the data, and to clean and
optimize it for the best result.

Let's go over the steps you should take before building your model.

Step 3: Check for missing values

Many machine learning models cannot function when there are missing values in the dataset:

data.isnull().sum()

This gives the number of null values in each column. There are different ways to deal with
missing values, such as:

● Deleting all rows with null values.

● Using the mean or median of the column to fill in the missing values for numerical
data.

● Filling the missing values with the most frequently occurring value for qualitative data.
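The three options listed above can be sketched as follows. The small DataFrame and its column names are hypothetical and exist only to make the example self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps; the column names are illustrative only.
data = pd.DataFrame({
    "area": [1200.0, np.nan, 900.0, 2000.0],
    "city": ["A", "B", None, "B"],
    "price": [150000, 200000, 95000, 310000],
})

# Count the missing values in each column.
missing_per_column = data.isnull().sum()

# Option 1: delete every row that has at least one null value.
dropped = data.dropna()

# Option 2: fill numerical gaps with the column median (the mean works the same way).
filled = data.copy()
filled["area"] = filled["area"].fillna(filled["area"].median())

# Option 3: fill qualitative gaps with the most frequent value (the mode).
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

print(missing_per_column)
print(filled)
```

Which option is appropriate depends on how much data is missing; dropping rows is the simplest choice but discards information.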

Step 4: Explore the correlation between variables

sns.heatmap(data.corr(), annot=True, cmap='coolwarm')

plt.show()

This code shows the relationships between the independent variables (features) and the
dependent (target) variable.

It also shows which features are most strongly correlated with the target variable.
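The heat map is a colored rendering of a correlation matrix. A minimal sketch of the matrix itself, and of ranking the features by their correlation with the target, is given below; the data and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical numeric data; the column names are illustrative only.
data = pd.DataFrame({
    "area": [900, 1100, 1200, 1500, 1700, 2000],
    "age": [30, 20, 15, 10, 8, 5],
    "price": [95000, 140000, 150000, 200000, 255000, 310000],
})

# Pairwise Pearson correlations between all numeric columns.
corr_matrix = data.corr()

# Correlation of each feature with the target, strongest positive first.
target_corr = corr_matrix["price"].drop("price").sort_values(ascending=False)
print(target_corr)
```

Passing corr_matrix to a heat map function then only changes how these numbers are displayed.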

Step 5: Visualize the relationship between independent and dependent variables

Scatter plots can show how well your predicted prices align with actual values. Residual
plots help visualize any patterns in the errors, revealing potential issues.

A scatter plot shows the relationship between an independent variable and the dependent
variable, with a straight line drawn through the points to show the trend.
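The numbers behind such plots, a fitted straight line and its residuals, can be computed without any plotting library. The sketch below uses hypothetical area and price values and NumPy's least-squares polynomial fit.

```python
import numpy as np

# Hypothetical feature (area) and target (price in thousands); illustrative values.
area = np.array([900, 1100, 1200, 1500, 1700, 2000], dtype=float)
price = np.array([95, 140, 150, 200, 255, 310], dtype=float)

# Least-squares straight line: price is approximated by slope * area + intercept.
slope, intercept = np.polyfit(area, price, deg=1)
fitted = slope * area + intercept

# Residuals are the errors that a residual plot would display.
residuals = price - fitted
print(slope, intercept)
print(residuals)
```

To draw the figure, area and price would go to a scatter plot and area and fitted to a line plot; residuals that show a clear pattern suggest that a straight line is not a good fit.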

Step 6: Data preprocessing

This is a crucial step, as the quality of the data used to train the model also determines
the accuracy and efficiency of the model.

Here, the dataset is first separated into X (independent variables/features) and Y
(dependent variable/target).

We handle the missing values by dropping columns with missing/null values, and we split
the dataset into training and testing sets in an 80:20 ratio.
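A minimal sketch of this preprocessing is shown below. The dataset, the column names and the fixed random_state are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset; "garage" has gaps and will be dropped as a column.
data = pd.DataFrame({
    "area": [900, 1100, 1200, 1500, 1700, 2000, 1300, 1600, 1800, 1000],
    "age": [30, 20, 15, 10, 8, 5, 18, 9, 7, 25],
    "garage": [1, np.nan, 1, 2, np.nan, 2, 1, 2, 2, 1],
    "price": [95, 140, 150, 200, 255, 310, 165, 230, 270, 120],
})

# Drop columns (axis=1) that contain any missing value.
data = data.dropna(axis=1)

# Separate the features (X) from the target (Y).
X = data.drop(columns=["price"])
Y = data["price"]

# 80:20 split; random_state is fixed only to make the sketch repeatable.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Dropping whole columns is the approach described in this step; when a column has only a few gaps, filling them as in Step 3 keeps more information.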

Step 7: Building the Regression Model

Finally, it is time to create and train our linear regression model.

We create the model by assigning an instance of the model class to a variable, and train it
by fitting it to the training dataset.
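As a sketch of this step, the snippet below creates and fits a scikit-learn LinearRegression model. The training data is hypothetical and generated from a known line (price = 0.15 * area + 10) so that the learned coefficients can be checked.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data generated from price = 0.15 * area + 10, no noise.
X_train = np.array([[900.0], [1200.0], [1500.0], [2000.0]])
Y_train = 0.15 * X_train[:, 0] + 10.0

# Create an instance of the model and store it in a variable.
model = LinearRegression()

# Train the model by fitting it to the training data.
model.fit(X_train, Y_train)

print(model.coef_, model.intercept_)
```

After fitting, coef_ and intercept_ hold the slope and intercept that the model has learned.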

Step 8: Make Predictions

The trained model is used to make predictions on the test set. Predictions can be made for
the entire set of test features at once, or for individual samples.
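Both ways of predicting can be sketched as follows, again with hypothetical, noise-free data so that the results are easy to verify by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: two features (area, age) and a noise-free linear price.
X_train = np.array([[900.0, 30.0], [1200.0, 15.0], [1500.0, 10.0], [2000.0, 5.0]])
Y_train = 0.15 * X_train[:, 0] - 0.5 * X_train[:, 1] + 10.0
X_test = np.array([[1000.0, 20.0], [1800.0, 8.0]])

model = LinearRegression().fit(X_train, Y_train)

# Predict for every row of the test set at once.
y_pred = model.predict(X_test)

# Predict for a single house; predict() expects a 2-D array, hence the slice.
single_pred = model.predict(X_test[:1])
print(y_pred, single_pred)
```

Because the data follows the formula exactly, the first prediction is 0.15 * 1000 - 0.5 * 20 + 10 = 150, up to floating-point rounding.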

Step 9: Evaluate the Model

Evaluating the model's performance is an important step in determining its accuracy and
reusability. We can check it using metrics such as:

● R-squared: This tells you how well the model explains the variation in house prices.
A higher value (closer to 1) indicates a better fit.

● Mean Squared Error (MSE): This measures the average squared difference between predicted
and actual prices. Lower is better.

● Precision score: A classification metric, relevant only when the prediction task is framed
as classification.
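The two regression metrics above can be computed with scikit-learn as in the sketch below; the actual and predicted values are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted prices (in thousands).
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

# MSE: mean of the squared differences between predicted and actual values.
mse = mean_squared_error(y_true, y_pred)

# R-squared: 1 - (residual sum of squares / total sum of squares).
r2 = r2_score(y_true, y_pred)
print(mse, r2)
```

Every prediction here is off by 10, so the MSE is 100; the residual sum of squares is 400 against a total of 12500, giving an R-squared of 0.968.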

2.7 Result

Thus, house price prediction using Python was implemented successfully.


2.8 Coding

import time

start_time = time.time()  # start the timer for the whole script

import pandas as pd
import numpy as np

# Load the dataset and take a first look at it
data = pd.read_csv('C:/Users/ELCOT/Downloads/water_potability.csv')
data.head()
data.describe()
data.isnull().sum()
data.Potability.value_counts()

# Drop rows with missing values
data = data.dropna()
data.Potability.value_counts()

# Upsample the potable class (with replacement) to 1200 rows
notpotable = data[data['Potability'] == 0]
potable = data[data['Potability'] == 1]

from sklearn.utils import resample
df_minority_upsampled = resample(potable, replace=True, n_samples=1200)

from sklearn.utils import shuffle
data = pd.concat([notpotable, df_minority_upsampled])
data = shuffle(data)
data.shape
data.Potability.value_counts()

# Correlation heat maps
import matplotlib.pyplot as plt
import seaborn as sns

corrmat = data.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)

plt.figure(figsize=(15, 9))
sns.heatmap(data.corr(), annot=True)

corr = data.corr()
corr["Potability"].sort_values(ascending=False)

# Histogram of Solids per class, and pairwise plots of the features
g = sns.FacetGrid(data, col='Potability')
g.map(plt.hist, 'Solids', bins=25)

cols = ['Solids', 'Turbidity', 'Chloramines', 'ph', 'Trihalomethanes',
        'Hardness', 'Sulfate', 'Conductivity', 'Organic_carbon']
sns.pairplot(data[cols])
plt.show()

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import accuracy_score

# Separate the features (x) from the target (y)
x = data.drop(['Potability'], axis=1)
y = data['Potability']

# Standardize the features
from sklearn.preprocessing import StandardScaler
st = StandardScaler()
col = x.columns
x[col] = st.fit_transform(x[col])
x[col]

X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.1)

knn = KNeighborsClassifier()
rf = RandomForestClassifier()

para_knn = {'n_neighbors': np.arange(1, 50)}  # parameters of knn
grid_knn = GridSearchCV(knn, param_grid=para_knn, cv=5)  # search knn for 5 fold cross validation
params_rf = {'n_estimators': [100, 200, 350, 500], 'min_samples_leaf': [2, 10, 30]}
grid_rf = GridSearchCV(rf, param_grid=params_rf, cv=5)

grid_knn.fit(X_train, Y_train)
grid_rf.fit(X_train, Y_train)

print("Best parameters for KNN:", grid_knn.best_params_)
print("Best parameters for Random Forest:", grid_rf.best_params_)

# Train both models (with their default hyperparameters) and score them on the test set
models = [('K Nearest Neighbours', knn), ('Random Forest', rf)]
for model_name, model in models:
    model.fit(X_train, Y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(Y_test, y_pred)
    print('{:s} : {:.2f}'.format(model_name, accuracy))

end_time = time.time()
print("execution time:", end_time - start_time, "secs")

AUTOML

import time

start_time = time.time()  # start the timer for the whole script

import pandas as pd
import numpy as np

# Load the dataset and take a first look at it
data = pd.read_csv('C:/Users/ELCOT/Downloads/water_potability.csv')
data.head()
data.describe()
data.isnull().sum()
data.Potability.value_counts()

# Drop rows with missing values
data = data.dropna()
data.Potability.value_counts()

# Upsample the potable class (with replacement) to 1200 rows
notpotable = data[data['Potability'] == 0]
potable = data[data['Potability'] == 1]

from sklearn.utils import resample
df_minority_upsampled = resample(potable, replace=True, n_samples=1200)

from sklearn.utils import shuffle
data = pd.concat([notpotable, df_minority_upsampled])
data = shuffle(data)
data.shape
data.Potability.value_counts()

# Correlation heat map with the upper triangle masked out
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 9));
df = data.corr()
fig, ax = plt.subplots(figsize=(12, 8))
mask = np.triu(np.ones_like(df, dtype=bool))
sns.heatmap(df, cmap='seismic', annot=True, mask=mask, ax=ax, vmin=-0.2, vmax=0.2)
ax.text(0.5, 1.05, 'Correlation Heat Map', horizontalalignment='center',
        verticalalignment='center', fontsize=14, fontweight='bold',
        transform=ax.transAxes)
sns.set_style('white')
plt.show()

from tpot import TPOTClassifier
from tpot import TPOTRegressor

# Let TPOT search for a classification pipeline for one generation
tpot = TPOTClassifier(generations=1, verbosity=2)

# Separate the features (x) from the target (y)
x = data.drop(['Potability'], axis=1)
y = data['Potability']

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
st = StandardScaler()

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
tpot.fit(x_train, y_train)

end_time = time.time()
print("execution time:", end_time - start_time, "secs")

2.8.1 Snapshots
ADVANTAGE OF LSTM OVER OTHER MODELS
The LSTM model can be tuned through various parameters, such as the number of
LSTM layers, the dropout value or the number of epochs.
Long Short Term Memory (LSTM)
LSTMs are widely used for sequence prediction problems and have proven to be extremely
effective. They work so well because an LSTM is able to store past information
that is important and forget the information that is not. An LSTM has three gates:
The input gate: It adds information to the cell state.
The forget gate: It removes the information that is no longer required by the model.
The output gate: It selects the information to be shown as output.
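How the three gates interact can be sketched with a single LSTM time step in NumPy. This is the standard textbook formulation, not code from this project; the function names, the weight layout and the sizes are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step; W maps [h_prev, x] to four stacked gate pre-activations."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:hidden])               # forget gate: what to remove from the cell state
    i = sigmoid(z[hidden:2 * hidden])      # input gate: how much new information to add
    g = np.tanh(z[2 * hidden:3 * hidden])  # candidate values to be added
    o = sigmoid(z[3 * hidden:4 * hidden])  # output gate: what to expose as output
    c = f * c_prev + i * g                 # updated cell state
    h = o * np.tanh(c)                     # new hidden state (the output)
    return h, c

# Hypothetical sizes and random weights, only to exercise the function.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)

h, c = lstm_step(rng.normal(size=input_size), np.zeros(hidden_size),
                 np.zeros(hidden_size), W, b)
print(h.shape, c.shape)
```

In practice the weights are learned during training, and a deep learning library handles this computation for whole sequences.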

Figure [Link] Output Gate at LSTM


First we import a sample dataset from the sklearn library; other
sample datasets are available from Kaggle. The data taken here is the
Boston housing dataset, which records various parameters and the house prices
for the Boston area and is derived from the 1970 U.S. census.

Here the data parameters are explained as follows:

Figure [Link] Data Parameters

For illustration, we print the first 5 rows of the data using the head()
function. In total, the dataset contains 506 rows.

There are 14 columns in total: 13 columns contain data about the
place, and the 14th column is the target column, which contains the house prices.

Then we check whether the data has any null (missing) values.
If the data is incomplete, errors can occur during processing,
which may reduce the accuracy of the prediction model.

In our data there are no missing values, as we can see.

Figure [Link] Given data
Since our data contains no missing values, the program skips the dropping phase of data
processing, in which incomplete data is dropped or missing values are filled in so that the
data is suitable for modeling.
Next we summarize the data so that it is easy to understand. To do this we use the
describe() function.

Figure [Link] Approx Estimation

Count refers to the number of non-null values in each column, i.e. 506, since there are
506 rows of data for each column.

Mean refers to the mean value of the data in the given column.

Std refers to the standard deviation, i.e. a measure of how widely the values in a
particular column are spread around the mean.

Min refers to the smallest value in each column.

Max refers to the largest value in each column.

25% refers to the 25th percentile: 25 percent of the data in that column is equal to or
below that value.
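The meaning of these rows can be checked on a tiny example. The single-column DataFrame below is hypothetical and was chosen so that each statistic is easy to verify by hand.

```python
import pandas as pd

# Hypothetical single-column dataset with easy-to-check values.
data = pd.DataFrame({"price": [10.0, 20.0, 30.0, 40.0, 50.0]})

summary = data.describe()
print(summary)

count = summary.loc["count", "price"]  # number of non-null values
mean = summary.loc["mean", "price"]    # arithmetic mean
std = summary.loc["std", "price"]      # sample standard deviation (spread)
q25 = summary.loc["25%", "price"]      # 25th percentile
```

Here the count is 5, the mean is 30, the 25th percentile is 20, and the standard deviation is the square root of 250 (about 15.81), which is a measure of spread and not the most common value.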

Next we try to understand the correlation between the different variables. The best way to
do that is with a heat map, a representation of data in the form of a
map or diagram in which data values are represented as colours.

Correlation is a statistical measure that expresses the extent to which two variables are
linearly related (meaning they change together at a constant rate). There are two types of
correlation:

1. Positive correlation: A positive correlation is a relationship between two variables that
move in tandem, that is, in the same direction. A positive correlation exists when one
variable decreases as the other variable decreases, or one variable increases while the
other increases.

2. Negative correlation: A negative correlation is a relationship between two variables in
which one variable increases as the other decreases, and vice versa. In statistics, a perfect
negative correlation is represented by the value -1.0, while 0 indicates no correlation and
+1.0 indicates a perfect positive correlation. A perfect negative correlation means that the
two variables always move in exactly opposite directions.

These two types of correlation are represented numerically, as well as by shade of
colour, in the heat map.
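The two extreme cases, +1.0 and -1.0, can be reproduced with a short sketch of the Pearson correlation coefficient. The helper name pearson_r and the sample values are hypothetical.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt(
        (a_centered ** 2).sum() * (b_centered ** 2).sum()
    )

x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])  # both variables move in the same direction
r_neg = pearson_r(x, [10, 8, 6, 4, 2])  # the variables move in opposite directions
print(r_pos, r_neg)
```

Real data rarely reaches these extremes; values closer to 0 indicate a weaker linear relationship.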

3 Conclusion
Thus the machine learning model to predict the house price based on the given dataset is
executed successfully using the XGBoost regressor (a gradient-boosted ensemble of decision
trees, which gives a lower error than regular linear regression). This model further helps
people understand whether a place is suited to them, based on the heat map correlations. It
also helps people who want to sell a house choose the best time for greater profit. With an
appropriate dataset, the house price in any location can be predicted with minimum error.
3.1 Future Enhancement
Work is currently under way on deploying the model using Flask and automating the result
file. A housing dataset from another country could also be used for prediction. The use of
this project in other sectors, as well as in other countries, is yet to be explored.

