SkillDzire Internship Report: ML Focus

Uploaded by

saiprajwal7244

SKILLDZIRE

MACHINE LEARNING
An Internship Report Submitted at the end
of seventh semester

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING

Submitted By
KAMISETTY SAI VIKASH
(21981A0570)

Under the esteemed guidance of

Mr. V. Tata Rao
Assistant Professor

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

RAGHU ENGINEERING COLLEGE

AUTONOMOUS
Approved by AICTE, New Delhi, Accredited by NBA (CIV, ECE, MECH, CSE),
NAAC with 'A+' grade, Permanently Affiliated to JNTU-GV, Vizianagaram

2024-2025

CERTIFICATE

This is to certify that the project entitled "Machine Learning", done by KAMISETTY SAI VIKASH
(21981A0570), a student of B.Tech in the Department of Computer Science and Engineering, RAGHU
ENGINEERING COLLEGE, during the period 2021-2025, in partial fulfillment for the award of the Degree of
Bachelor of Technology in Computer Science and Engineering to the Jawaharlal Nehru Technological University,
Gurajada Vizianagaram, is a record of bonafide work carried out under my guidance and supervision.

The results embodied in this internship report have not been submitted to any other University or
Institute for the award of any Degree.

Internal Guide: Mr. V. Tata Rao, Assistant Professor
Head of the Department: Dr. R. Sivaranjani, Professor

EXTERNAL EXAMINER
DISSERTATION APPROVAL SHEET
This is to certify that the dissertation titled

HOUSING PRICE PREDICTION MODEL

BY
KAMISETTY SAI VIKASH
(21981A0570)

Is approved for the degree of Bachelor of Technology

PROJECT GUIDE
(Assistant Professor)

Internal Examiner

External Examiner

HOD
(Professor)

Date:

DECLARATION

This is to certify that this internship titled "Machine Learning" is bonafide work done
by me, in partial fulfillment of the requirements for the award of the degree B.Tech, and
submitted to the Department of Computer Science and Engineering, RAGHU
ENGINEERING COLLEGE, Dakamarri.
I also declare that this internship is the result of my own effort, that it has not been
copied from anyone, and that I have taken citations only from the sources mentioned in the
references.
This work has not been submitted earlier to any other University or Institute for the award of
any degree.

Date:
Place:

KAMISETTY SAI VIKASH

(21981A0570)

CERTIFICATE

SkillDzire

CERTIFICATE OF INTERNSHIP

This is to certify that Mr./Ms. Sai Vikash Kamisetty, enrolled in Computer Science and
Engineering (21981A0570) from Raghu Engineering College of university JNTUGV, Vizianagaram,
has successfully completed the short-term internship programme titled Machine Learning
under SkillDzire for 2 months, organized by SkillDzire in collaboration
with Andhra Pradesh State Council of Higher Education.

Certificate ID: SDST-30995
Issued On: 01-Jul-24
Approved by AICTE
Authorized Signature

ACKNOWLEDGEMENT

I express my sincere gratitude to my esteemed institute, "RAGHU ENGINEERING COLLEGE", which
has provided me an opportunity to fulfill my most cherished desire and reach my goal.
I take this opportunity with great pleasure to put on record my ineffable personal
indebtedness to Mr. Raghu Kalidindi, Chairman of RAGHU ENGINEERING COLLEGE,
for providing the necessary departmental facilities.

I would like to thank the Principal, [Link], of "RAGHU ENGINEERING COLLEGE", for providing
the requisite facilities to carry out projects on campus. Your expertise
in the subject matter and dedication towards our project have been a source of inspiration for all
of us.

I sincerely express my deep sense of gratitude to Dr. R. Sivaranjani, Professor and Head of
the Department of Computer Science and Engineering, Raghu Engineering College,
for her perspicacity, wisdom and sagacity coupled with compassion and patience. It is my great
pleasure to submit this work under her wing. I thank her for guiding me towards the successful
completion of this project work.

I would like to thank Mr. V. Tata Rao, Assistant Professor, for providing the technical
guidance to carry out the module assigned. Your expertise in the subject matter and dedication
towards our project have been a source of inspiration for all of us.

I extend my heartfelt thanks to all faculty members of the Computer Science
department for their value-based teaching of theory and practical subjects, which were used in
the project.

Regards
KAMISETTY SAI VIKASH
(21981A0570)

TABLE OF CONTENTS

S.No CONTENT

1. INTRODUCTION TO PYTHON
2. OOPS IN PYTHON
3. NUMPY AND PANDAS
4. INTRODUCTION TO MACHINE LEARNING
5. ADVANCED ML ALGORITHMS
6. HOUSING PRICE PREDICTION
7. CONCLUSION

1. INTRODUCTION TO PYTHON
1.1 PYTHON
Python is a widely used, high-level, general-purpose programming language known for its
simplicity and readability. It enables developers to express concepts with fewer lines of
code compared to languages like C++ or Java. Python's clean syntax and design philosophy
make it accessible to beginners while providing powerful tools for experienced developers
to work on both small and large-scale applications.

Python supports multiple programming paradigms, including object-oriented, procedural,
imperative, and functional programming styles. Additionally, Python features a dynamic
type system, automatic memory management, and an extensive standard library, making it
suitable for a wide range of applications, from web development to automation.

1.2 APPLICATIONS OF PYTHON

Python's versatility allows it to be used in a wide variety of fields and applications,
including:
• Web Development: Frameworks like Django and Flask make it easy to build
web applications with Python.
• Machine Learning: Python is widely used for building and deploying machine
learning models, which simplify tasks such as regression, classification, and clustering
in various domains like finance, healthcare, and marketing.
• Data Science: Apart from machine learning, Python is heavily used for data
analysis, statistical modeling, and data visualization.
• Artificial Intelligence: Beyond ML, Python is also used in broader AI fields such
as natural language processing (NLP), computer vision, and reinforcement learning.

1.3 PYTHON IN MACHINE LEARNING

Python is particularly popular in the field of machine learning (ML) due to its simplicity,
community support, and the availability of powerful libraries and frameworks.
Python has several libraries and frameworks that make machine learning easier, such as:
• NumPy and Pandas for efficient data manipulation and analysis.
• Scikit-learn for classical machine learning algorithms (like regression,
classification, clustering).
• TensorFlow and PyTorch for deep learning and neural networks.

1.4 DATATYPES IN PYTHON

The primary data types in Python are:

• Integer int: Represents whole numbers, both positive and negative, without a
fractional component. e.g: 10, -5.
• Float float: Represents numbers that contain decimal points or fractions. e.g: 3.14,
-0.75.
• String str: Represents a sequence of characters enclosed in single, double, or triple
quotes. e.g: "Python", 'Hello World'.
• Boolean bool: Represents logical values, either True or False.
• List list: An ordered, mutable collection of items, which can be of different data types.
e.g: [1, "apple", 3.5].
• Tuple tuple: An ordered, immutable collection of items. e.g: (1, 2, "banana").
• Set set: An unordered collection of unique elements. e.g: {1, 2, 3, "apple"}.
• Dictionary dict: A collection of key-value pairs, where each key is associated with a
specific value. e.g: {"name": "Alice", "age": 25}.
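
As a quick illustration of the types listed above, the following sketch builds one sample value per type and checks it with the built-in type() function. The variable names (samples, type_names, tuple_error) are illustrative only and do not come from the report.

```python
# One sample value for each built-in type listed above (names are illustrative).
samples = {
    "int": 10,
    "float": 3.14,
    "str": "Python",
    "bool": True,
    "list": [1, "apple", 3.5],
    "tuple": (1, 2, "banana"),
    "set": {1, 2, 3, "apple"},
    "dict": {"name": "Alice", "age": 25},
}

# type(value).__name__ gives the name of each value's type.
type_names = {label: type(value).__name__ for label, value in samples.items()}

for label, name in type_names.items():
    print(f"{label:>5} -> {name}")

# Lists are mutable, so appending works in place.
samples["list"].append("new item")

# Tuples are immutable, so item assignment raises a TypeError.
try:
    samples["tuple"][0] = 99
except TypeError as exc:
    tuple_error = str(exc)
    print("Tuple assignment failed:", tuple_error)
```

The last two steps show the practical difference between the mutable list and the immutable tuple.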

1.5 CONDITIONAL STATEMENTS IN PYTHON

1. if Statement: Executes a block of code if a condition is true.
Ex:
x = 10
if x > 5:
    print("x is greater than 5")

2. else Statement: Executes a block of code if the condition in the if statement is false.
Ex:
x = 3
if x > 5:
    print("x is greater than 5")
else:
    print("x is not greater than 5")

3. elif Statement: Checks another condition if the previous if condition is false.
Ex:
x = 5
if x > 5:
    print("x is greater than 5")
elif x == 5:
    print("x is equal to 5")

4. Nested if Statement: An if statement within another if statement.
Ex:
x = 10
if x > 5:
    if x < 15:
        print("Between 5 and 15")

1.6 ITERATIVE STATEMENTS IN PYTHON

1. for Loop: Iterates over a sequence (such as a list, tuple, or string) and executes a block
of code for each item in the sequence.
Ex:
fruits = ["apple", "banana", "cherry"]
for fruit in fruits:
    print(fruit)

2. while Loop: Repeatedly executes a block of code as long as a specified condition is True.
Ex:
x = 0
while x < 5:
    print(x)
    x += 1

3. break Statement: Exits the loop prematurely when a certain condition is met.
Ex:
for i in range(5):
    if i == 3:
        break
    print(i)  # Prints 0, 1, 2

4. continue Statement: Skips the current iteration and moves to the next iteration in the
loop.
Ex:
for i in range(5):
    if i == 2:
        continue
    print(i)  # Prints 0, 1, 3, 4

1.7 FUNCTIONS IN PYTHON

Functions are reusable blocks of code that perform a specific task. They allow for code
modularization, making programs easier to read, maintain, and debug.

A function in Python is defined using the def keyword, followed by the function name and
parentheses containing optional parameters.

Syntax
def function_name(parameters):
    # Function body
    # Optional return statement

Ex:

def add(a, b):
    """Returns the sum of two numbers."""
    return a + b

result = add(5, 3)  # Result: 8

2. OOPS IN PYTHON
Python is an object-oriented programming (OOP) language that allows for the creation and
utilization of classes and objects. The major principles of object-oriented programming in
Python include the following:

1. Object:
An object is an entity that has a state (attributes) and behavior (methods). Objects can be
physical (e.g., a mouse, keyboard) or logical (e.g., a user account). In Python, everything
is an object, and all objects have associated attributes and methods. For instance,
functions have a built-in attribute __doc__, which returns the docstring defined in the
function source code.
2. Class:
A class is a blueprint for creating objects. It defines a set of attributes and methods
that the created objects (instances) will have. For example, an Employee class may
contain attributes such as email, name, age, and salary.
Syntax:

class ClassName:
    # Class attributes and methods
    attribute = value

    def method_name(self):
        # Method body

3. Method:
A method is a function that is associated with an object. In Python, methods are not
unique to class instances; any object type can have methods.

4. Inheritance:
Inheritance is a feature of OOP that allows one class (the derived class or child class) to
inherit the properties and behaviors of another class (the base class or parent class).
This promotes code reusability and reduces redundancy.
Example:

class Parent:
    def greet(self):
        print("Hello from Parent")

class Child(Parent):
    def greet_child(self):
        print("Hello from Child")

child = Child()
child.greet()        # Inherited from Parent
child.greet_child()  # Defined in Child

5. Polymorphism:
Polymorphism allows methods to perform different functions based on the object that is
calling them. It enables a single interface to be used for different underlying forms (data
types). For instance, a method named talk can have different implementations for
different animal classes.
Example:

class Animal:
    def talk(self):
        pass

class Dog(Animal):
    def talk(self):
        return "Bark"

class Cat(Animal):
    def talk(self):
        return "Meow"

6. Encapsulation:
Encapsulation is a mechanism that restricts access to certain components of an object
and bundles the data (attributes) and methods (functions) that operate on the data
within a single unit. This can be implemented using private variables and getter/setter
methods to control access.
Example:

class Account:
    def __init__(self, balance):
        self._balance = balance  # Non-public by convention (leading underscore)

    def get_balance(self):
        return self._balance

    def deposit(self, amount):
        self._balance += amount

7. Data Abstraction:
Data abstraction refers to the concept of hiding the complex reality while exposing only
the necessary parts. It allows for the implementation of complex systems while
providing a simpler interface to the user.
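
One common way to express data abstraction in Python is an abstract base class from the standard abc module. The sketch below is only an illustration: the Shape, Circle, and Rectangle classes are hypothetical and not part of the report. Callers rely on the area() interface alone and never see how each shape computes it.

```python
from abc import ABC, abstractmethod
import math


class Shape(ABC):
    """Abstract interface: callers only need to know that area() exists."""

    @abstractmethod
    def area(self):
        """Return the area of the shape."""


class Circle(Shape):
    def __init__(self, radius):
        self._radius = radius

    def area(self):
        return math.pi * self._radius ** 2


class Rectangle(Shape):
    def __init__(self, width, height):
        self._width = width
        self._height = height

    def area(self):
        return self._width * self._height


# The calling code works with the abstract interface only.
shapes = [Circle(1.0), Rectangle(2.0, 3.0)]
areas = [shape.area() for shape in shapes]
print(areas)

# The abstract class itself cannot be instantiated.
try:
    Shape()
    instantiated = True
except TypeError:
    instantiated = False
```

Because Shape declares area() as abstract, any subclass that forgets to implement it cannot be instantiated either, which keeps the interface consistent.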

3. NUMPY & PANDAS
3.1 NUMPY
NumPy is a powerful library for numerical computing in Python, providing support for
large, multi-dimensional arrays and matrices, along with a collection of mathematical functions
to operate on these arrays.

3.1.1 ARRAY
An array is a central data structure in NumPy, allowing for efficient storage and
manipulation of numerical data. It is a grid of values, all of the same type, indexed by a tuple
of non-negative integers. The number of dimensions is called the rank of the array.

3.1.2 PYTHON LIST vs NUMPY ARRAY

NumPy gives you an enormous range of fast and efficient numerically-related options. While a
Python list can contain different data types within a single list, all of the elements in a NumPy
array should be homogenous. The mathematical operations that are meant to be performed on
arrays wouldn't be possible if the arrays weren't homogenous.

Why NumPy?
NumPy arrays are faster and more compact than Python lists. Because every element has the
same type, an array stores its data in one contiguous block and consumes less memory than
an equivalent list. NumPy also provides a mechanism for specifying the data type, which
allows the code to be optimized even further.
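
The differences described above can be seen in a short sketch. It assumes NumPy is installed; the variable names are illustrative.

```python
import numpy as np

py_list = [1, 2, 3, 4, 5]
np_array = np.array(py_list)

# Multiplying a list repeats it; multiplying an array works element by element.
repeated_list = py_list * 2
doubled_array = np_array * 2
print(repeated_list)   # the list repeated twice
print(doubled_array)   # each element doubled

# Doubling list elements needs an explicit loop or comprehension.
doubled_list = [x * 2 for x in py_list]

# Mixed numeric input is converted to one common data type (homogeneous array).
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# Specifying a smaller data type reduces the memory used per element.
small = np.array(py_list, dtype=np.int8)
print(small.itemsize, "byte per element,", small.nbytes, "bytes in total")
```

Choosing a compact dtype such as int8 is only safe when every value fits in its range (-128 to 127 here).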

3.1.3 CREATING AND ACCESSING 1D ARRAY

To create an array in NumPy, first, you need to import the library.

import numpy as np

# Creating a 1D array
array_1d = np.array([1, 2, 3, 4, 5])

Adding Elements to an Array:

NumPy arrays have a fixed size, so you cannot add elements in place. However, you can use
np.append() to create a new array.

# Adding elements to an array
new_array = np.append(array_1d, [6, 7])

Removing Elements from an Array:

You can use np.delete() to remove elements from an array; it also returns a new array.

# Removing elements from an array
modified_array = np.delete(array_1d, 2)  # Removes the element at index 2

Sorting an Array:
NumPy provides a built-in function to sort arrays.

# Sorting an array
sorted_array = np.sort(array_1d)

Resizing an Array:
You can change the shape of an array using the reshape() function.

# Resizing an array
reshaped_array = np.reshape(array_1d, (5, 1))  # Reshaping to 5 rows and 1 column

Indexing:
You can access elements of a NumPy array using indexing.

# Indexing an array
element = array_1d[2]  # Accesses the element at index 2

Slicing:
Slicing allows you to access a range of elements in an array.

# Slicing an array
slice_array = array_1d[1:4] # Gets elements from index 1 to 3

Broadcasting:
Broadcasting refers to the ability of NumPy to perform operations on arrays of different shapes.
For instance, you can add a scalar to an array.

# Broadcasting example
array_broadcast = array_1d + 5 # Adds 5 to each element in the array

3.1.4 2D ARRAYS (MATRIX)

To create a 2D array (matrix), you can pass a list of lists to the np.array() function.

# Creating a 2D array
array_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

Accessing Elements in a 2D Array:

Elements in a 2D array can be accessed using row and column indices.
The syntax is array[row_index, column_index].

# Accessing an element
element = array_2d[1, 2] # Accesses the element in the second row and third column
print("Element at (1, 2):", element) # Output: 6

3.2 PANDAS
Pandas is a powerful Python library used for data manipulation and analysis, especially
in structured data environments. It provides versatile data structures to efficiently work
with large datasets.

3.2.1 IMPORTING PANDAS

To use Pandas, you need to import the library:

import pandas as pd

3.2.2 DATA STRUCTURES IN PANDAS

1. Series: A one-dimensional labeled array, similar to a list, where each element has
an associated index.
2. DataFrame: A two-dimensional, tabular data structure with labeled axes (rows
and columns), similar to a table in relational databases.
3. Panel: A three-dimensional container for data. It was rarely used and has since been
removed from Pandas (version 0.25 onwards); multi-dimensional arrays in NumPy or a
DataFrame with a MultiIndex are used instead.

3.2.3 DATAFRAME
A 'DataFrame' is the primary data structure in Pandas, used to store and manipulate tabular
data.

data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}

df = pd.DataFrame(data)
print(df)

3.2.4 IMPORTING DATA FROM CSV FILE

To load data from a CSV file into a Pandas DataFrame:

df = pd.read_csv('data.csv')  # 'data.csv' is a placeholder file name
print(df.head())  # Displays the first 5 rows

3.2.5 SORTING AND SLICING RECORDS

Pandas allows sorting and slicing of records for data manipulation.

# Sorting by a column
sorted_df = df.sort_values(by='Age')

# Slicing specific rows and columns

subset_df = df.loc[0:2, ['Name', 'Age']]  # Select rows labeled 0 to 2 (inclusive) and columns 'Name' and 'Age'

3.2.6 MERGING TWO DATAFRAMES

To merge two DataFrames based on a common column:

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [1, 2], 'Score': [85, 90]})
merged_df = pd.merge(df1, df2, on='ID')

3.2.7 GROUPING AND CONCATENATION
Pandas allows grouping data for aggregation and concatenating DataFrames.

# Grouping by a column and applying an aggregate function
grouped_df = df.groupby('Age').mean(numeric_only=True)  # Averages the numeric columns in each group

# Concatenating two DataFrames
concatenated_df = pd.concat([df1, df2], axis=1)  # Concatenate along columns

4. INTRODUCTION TO MACHINE LEARNING
Machine Learning (ML) is a branch of artificial intelligence (AI) that creates algorithms that learn
from data and make predictions or decisions. Rather than being manually programmed for specific
tasks, ML models find patterns in data and improve automatically with experience. This ability
makes machine learning useful for many applications, including data analysis and automating
decisions.

4.1 MACHINE LEARNING WORKFLOW

1. Data Collection
The first step involves gathering and preparing data from different sources. The quality and
quantity of data significantly influence model performance.

2. Data Preprocessing
The data is cleaned and transformed to prepare it for the model. This includes handling missing
values, normalizing features, and encoding categorical variables.

3. Model Building
A machine learning algorithm is selected and trained on the prepared data to learn the underlying
patterns.

4. Model Evaluation
The model's performance is assessed using metrics like accuracy, precision, recall, and mean
squared error, often with a validation dataset.

5. Model Deployment
Once the model is trained and evaluated, it is deployed for real-time predictions or integrated into
applications for decision-making.
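
The five steps above can be sketched end to end in plain Python. This is a minimal illustration on made-up data (points scattered around y = 3x + 2), not the workflow used in the internship project; fit_line and predict are hypothetical helper names.

```python
import random

# 1. Data collection: synthetic points scattered around y = 3x + 2 (made-up data).
random.seed(0)
data = [(float(x), 3.0 * x + 2.0 + random.uniform(-0.5, 0.5)) for x in range(20)]

# 2. Data preprocessing: shuffle, then hold out 20% of the rows for evaluation.
random.shuffle(data)
split_index = int(0.8 * len(data))
train_rows, test_rows = data[:split_index], data[split_index:]


# 3. Model building: fit a straight line to the training rows by least squares.
def fit_line(rows):
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
             / sum((x - mean_x) ** 2 for x, _ in rows))
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(train_rows)

# 4. Model evaluation: mean squared error on the held-out rows.
test_mse = sum((y - (intercept + slope * x)) ** 2 for x, y in test_rows) / len(test_rows)


# 5. Model deployment: expose the fitted model as a function other code can call.
def predict(x):
    return intercept + slope * x


print(f"slope={slope:.2f}, intercept={intercept:.2f}, test MSE={test_mse:.3f}")
```

Evaluating on rows that were not used for fitting is what makes the error estimate in step 4 meaningful.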

4.2 APPLICATIONS OF MACHINE LEARNING

Machine learning has a wide array of applications across different industries:

Healthcare: Predicting diseases, medical image analysis, personalized treatment recommendations.

Finance: Fraud detection, stock market prediction, risk assessment, credit scoring.
Marketing: Customer segmentation, recommendation engines, targeted advertising.
Autonomous Vehicles: Self-driving cars use machine learning to perceive the environment and
make driving decisions.

4.3 MACHINE LEARNING CATEGORIES

Machine learning techniques are broadly categorized into the following types:
1. Supervised Learning
In supervised learning, the algorithm learns from labeled training data to make predictions.
Each training example includes both the input data and the corresponding output label.
Example Algorithms: Linear Regression, Decision Trees, Support Vector Machines (SVM).
2. Unsupervised Learning
In unsupervised learning, the algorithm is given data without explicit labels and must find
structure within the data. This is often used in exploratory data analysis.

Example Algorithms: K-Means Clustering, Hierarchical Clustering, Principal
Component Analysis (PCA).
3. Semi-Supervised Learning
Semi-supervised learning is a hybrid approach where the model is trained on a small
amount of labeled data along with a large amount of unlabeled data. This technique helps
when labeling data is costly or time-consuming.
4. Reinforcement Learning
In reinforcement learning, the model learns through trial and error by interacting with an
environment and receiving feedback in the form of rewards or penalties.

4.4 INTRODUCTION TO LINEAR REGRESSION

Linear regression is a statistical technique used to model the relationship between one or more
independent variables and a dependent variable by fitting a linear equation to observed data. It is
widely used in predictive analytics, machine learning, and data science to understand and predict
continuous outcomes. The primary goal of linear regression is to find the best-fitting line that
minimizes the difference between predicted and actual values.

4.4.1 SIMPLE LINEAR REGRESSION

Simple linear regression is the most basic form of linear regression where we model the
relationship between a single independent variable (X) and a dependent variable (Y). The
relationship between the two is represented by the equation:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

Where:

• $Y$ is the dependent variable (outcome),
• $X$ is the independent variable (predictor),
• $\beta_0$ is the intercept (value of $Y$ when $X = 0$),
• $\beta_1$ is the slope (change in $Y$ for a unit change in $X$),
• $\varepsilon$ is the error term (the difference between observed and predicted values).

Simple linear regression is useful when there is only one predictor variable, and the relationship
between the variables is assumed to be linear.
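
For a single predictor, the least-squares intercept and slope have a closed form: the slope is the covariance of X and Y divided by the variance of X, and the intercept follows from the means. The sketch below applies this to made-up data; the function name fit_simple_linear_regression is illustrative.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: sum of co-deviations divided by sum of squared deviations of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data that lies exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

intercept, slope = fit_simple_linear_regression(xs, ys)
prediction = intercept + slope * 6.0
print(intercept, slope, prediction)
```

With noisy real data the recovered values would only approximate the true intercept and slope.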

4.4.2 MULTIPLE LINEAR REGRESSION

Multiple linear regression extends simple linear regression by incorporating more than one
independent variable to predict the dependent variable. The equation for multiple linear regression
is:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$$

Where:

• $X_1, X_2, \dots, X_n$ are the independent variables (predictors),
• $\beta_0$ is the intercept and $\beta_1, \dots, \beta_n$ are the coefficients (representing the
influence of each predictor on $Y$).

Multiple linear regression helps in modeling more complex relationships between the dependent
variable and multiple predictors, and usually gives more accurate predictions than simple linear
regression when several factors influence the outcome.
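
With several predictors, the coefficients are usually found with a linear algebra routine rather than by hand. The sketch below assumes NumPy is installed and uses np.linalg.lstsq on made-up data generated from y = 1 + 2·x1 + 3·x2; the variable names are illustrative.

```python
import numpy as np

# Made-up data: five observations of two predictors.
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 5.0],
])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X_design @ coefficients = y.
coefficients, residuals, rank, singular_values = np.linalg.lstsq(X_design, y, rcond=None)

intercept = coefficients[0]
slopes = coefficients[1:]
print("intercept:", intercept)
print("slopes:", slopes)

# Predict the outcome for a new observation (x1 = 2, x2 = 2).
new_prediction = intercept + slopes @ np.array([2.0, 2.0])
```

Because the data here is noise-free, the fit recovers the generating coefficients up to floating-point error.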

4.4.3 MODEL EVALUATION METRICS

To evaluate the performance of a linear regression model, several metrics are commonly used:

1. Mean Squared Error (MSE): The average squared difference between the observed
and predicted values. Lower values indicate better model performance.

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, and $n$ is the number of
observations.

2. Root Mean Squared Error (RMSE): The square root of MSE, providing an
interpretable error metric in the same units as the dependent variable.

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$

3. R-squared ($R^2$): Represents the proportion of variance in the dependent variable that
is explained by the independent variables. Values typically range from 0 to 1, with higher
values indicating a better fit.

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

where $\bar{y}$ is the mean of the observed values.

4. Adjusted R-squared: Similar to $R^2$, but adjusted for the number of predictors in the
model, so that adding predictors that do not improve the fit is penalized.

Each of these metrics provides insights into how well the linear regression model fits the data and
how accurately it predicts outcomes.
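
The four metrics can be computed directly from their definitions. The sketch below uses made-up observed and predicted values; the adjusted R-squared line uses the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where p is the number of predictors, which the text above describes only in words.

```python
import math


def regression_metrics(y_true, y_pred, n_predictors):
    """Return MSE, RMSE, R-squared and adjusted R-squared as a dictionary."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)               # total sum of squares
    mse = ss_res / n
    r2 = 1.0 - ss_res / ss_tot
    adjusted_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": r2, "adjusted_r2": adjusted_r2}


# Made-up observed and predicted values from a model with one predictor.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]

metrics = regression_metrics(y_true, y_pred, n_predictors=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Adjusted R-squared is always at most R-squared, and the gap grows as more predictors are added for the same number of observations.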

4.5 INTRODUCTION TO LOGISTIC REGRESSION

Logistic regression is a statistical method used to model the relationship between one or more
independent variables and a categorical dependent variable. Unlike linear regression, logistic
regression is used when the dependent variable is binary or categorical. It estimates the probability
that a given input belongs to a certain category by fitting the data to a logistic curve. The output of
logistic regression is a probability between 0 and 1, which can then be used to classify
observations.

4.5.1 BINARY LOGISTIC REGRESSION

Binary logistic regression is the simplest form of logistic regression where the dependent
variable has two possible outcomes (e.g., 0 or 1, yes or no, true or false). The logistic regression
model for binary outcomes can be written as:

$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$

Where:

• $p$ is the probability that the dependent variable is 1,
• $\beta_0, \beta_1, \beta_2, \dots, \beta_n$ are the coefficients,
• $X_1, X_2, \dots, X_n$ are the independent variables.

The logistic function transforms the linear equation into a value between 0 and 1, representing
the probability of a certain class.
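
The transformation from the linear equation to a probability can be sketched as follows. The coefficient values are hypothetical, chosen only so that the probability crosses 0.5 at a feature value of 4; they are not fitted to any data.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, coefficients, intercept):
    """Probability that the outcome is 1 for one observation."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(z)


# Hypothetical model with a single feature.
intercept = -4.0
coefficients = [1.0]

probabilities = {}
for value in [2.0, 4.0, 6.0]:
    p = predict_probability([value], coefficients, intercept)
    label = 1 if p >= 0.5 else 0   # classify with a 0.5 threshold
    probabilities[value] = (p, label)
    print(f"x={value}: p={p:.4f}, predicted class={label}")
```

The 0.5 threshold is a common default; a different threshold trades precision against recall.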

4.5.2 MULTINOMIAL LOGISTIC REGRESSION

Multinomial logistic regression is used when the dependent variable has more than two
categories. Unlike binary logistic regression, which predicts probabilities for two classes,
multinomial logistic regression predicts probabilities for multiple classes (e.g., A, B, C). Each
class has its own logistic function, and the probability of an outcome belonging to a particular
class is calculated based on a set of logistic functions. The model is represented as:

$$\log\left(\frac{p_k}{p_1}\right) = \beta_{k0} + \beta_{k1} X_1 + \beta_{k2} X_2 + \dots + \beta_{kn} X_n$$

Where $p_k$ is the probability of class $k$ and $p_1$ is the probability of a reference class. This model is
widely used in scenarios like predicting customer choices or categorizing text into multiple
classes.
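
One way to turn per-class linear scores into probabilities is to fix the score of the reference class at 0 and normalize the exponentials (the softmax function), so that the log-ratio of any class to the reference class equals that class's linear score. The sketch below uses hypothetical coefficients for a three-class problem; none of the numbers come from the report.

```python
import math


def multinomial_probabilities(features, class_coefficients):
    """Return class probabilities, with the reference class listed first.

    class_coefficients holds one (intercept, [betas]) pair per non-reference class.
    """
    scores = [0.0]  # the reference class has a linear score of 0
    for intercept, betas in class_coefficients:
        scores.append(intercept + sum(b * x for b, x in zip(betas, features)))
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical coefficients for classes 2 and 3 (class 1 is the reference).
class_coefficients = [
    (0.5, [1.0, -0.5]),
    (-1.0, [0.2, 0.3]),
]
probs = multinomial_probabilities([1.0, 2.0], class_coefficients)
predicted_class = probs.index(max(probs)) + 1   # classes are numbered from 1
print(probs, predicted_class)
```

The probabilities always sum to 1, and the predicted class is the one with the highest probability.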

4.5.3 MODEL EVALUATION METRICS

Several metrics are used to evaluate the performance of a logistic regression model, especially in
classification problems:

1. Accuracy: The proportion of correctly classified observations. It is a basic measure but
may not be useful in cases of imbalanced datasets.

$$\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Observations}}$$

2. Precision: The ratio of correctly predicted positive observations to the total
predicted positives. It is useful when the cost of false positives is high.

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

3. Recall (Sensitivity): The ratio of correctly predicted positive observations to all
actual positives. It is crucial when missing positive cases (false negatives) is more
critical.

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

4. F1 Score: The harmonic mean of precision and recall. It balances the trade-off between
precision and recall and is useful when both false positives and false negatives are
costly.

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

5. ROC-AUC (Receiver Operating Characteristic - Area Under Curve): This metric
evaluates the trade-off between the true positive rate and false positive rate across
different thresholds. A higher AUC indicates better model performance.

These metrics help assess how well the logistic regression model performs in classification tasks.
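
Accuracy, precision, recall, and the F1 score all follow from the four counts of a binary confusion matrix. The sketch below computes them for made-up labels and predictions, treating 1 as the positive class; the function name is illustrative.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 4 actual positives and 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

scores = classification_metrics(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

Returning 0.0 when a denominator is zero is one convention among several; some libraries raise a warning instead.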

4.6 INTRODUCTION TO DECISION TREES

Decision trees are a popular supervised machine learning method used for classification and
regression tasks. They work by recursively splitting the data into subsets based on the value of
input features, resulting in a tree-like structure of decisions. Each internal node in the tree
represents a decision based on a feature, each branch represents the outcome of a decision, and
each leaf node represents the final classification or regression result.

Decision trees are intuitive, easy to interpret, and can handle both numerical and categorical data,
making them useful for a wide range of problems.

4.6.1 DECISION TREE CONSTRUCTION

Constructing a decision tree involves selecting the feature that best separates the data at each step,
typically based on a criterion like information gain or Gini impurity. The tree is built recursively by
splitting the dataset until all data points in a subset belong to the same class (for classification) or
until a stopping condition is reached (for regression).

The basic steps in decision tree construction are:

1. Select the Best Feature: Choose the feature that provides the most significant
separation between data points using criteria like information gain or Gini index.
2. Create Decision Nodes: Split the dataset into smaller subsets based on the selected feature.
3. Repeat the Process: Recursively apply the same process to the resulting subsets.
4. Stopping Condition: Stop splitting when a certain condition is met (e.g., maximum
depth, all samples in a node belong to the same class).
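
The first two steps above can be sketched for a single numeric feature: compute the Gini impurity of each candidate split and keep the threshold with the lowest weighted impurity. This is a simplified illustration on made-up data, not a full tree builder; the function names are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best binary split on one feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        weighted = (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Made-up data: small values are labeled "low", large values "high".
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["low", "low", "low", "high", "high", "high"]

root_impurity = gini_impurity(labels)
threshold, split_impurity = best_threshold(values, labels)
print(root_impurity, threshold, split_impurity)
```

A full tree would apply the same search to every feature and then recurse into the two resulting subsets until a stopping condition is met.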

4.6.2 DECISION TREE ALGORITHM

Several algorithms are used to construct decision trees, the most common being:

1. ID3 (Iterative Dichotomiser 3): This algorithm uses information gain based on entropy
to select the best feature for splitting at each node. It continues building the tree until all
attributes are exhausted or the data is perfectly classified.
2. CART (Classification and Regression Trees): CART is used for both classification
and regression tasks. It splits the data using the Gini impurity for classification or the
least squared deviation for regression. Unlike ID3, CART builds binary trees where each
split produces exactly two branches.
3. C4.5: An extension of ID3, C4.5 handles both categorical and continuous features and
uses a concept of gain ratio (a normalized version of information gain) to split the nodes.
It can handle missing values and allows pruning after tree construction.
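
The splitting criteria named above can be computed in a few lines for categorical features. The sketch below shows entropy, information gain (as used by ID3), and gain ratio (as used by C4.5) on a made-up four-row dataset; the feature names and values are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / n * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


def gain_ratio(feature_values, labels):
    """Information gain normalized by the entropy of the feature itself (split information)."""
    split_info = entropy(feature_values)
    if split_info == 0:
        return 0.0
    return information_gain(feature_values, labels) / split_info


# Made-up dataset: "outlook" separates the classes perfectly, "windy" not at all.
outlook = ["sunny", "sunny", "rain", "rain"]
windy = ["yes", "no", "yes", "no"]
play = ["no", "no", "yes", "yes"]

gain_outlook = information_gain(outlook, play)
gain_windy = information_gain(windy, play)
ratio_outlook = gain_ratio(outlook, play)
print(gain_outlook, gain_windy, ratio_outlook)
```

Here an ID3-style learner would split on outlook first, because it has the higher information gain.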

4.6.3 MODEL EVALUATION

Evaluating a decision tree model involves several metrics depending on whether the task is
classification or regression:

• Accuracy: The percentage of correct predictions (for classification tasks).

• Confusion Matrix: A table summarizing the model's performance, showing true positives,
true negatives, false positives, and false negatives.
• Precision, Recall, F1 Score: Common classification metrics that evaluate the
performance of the model on positive class predictions.
• Mean Squared Error (MSE): Used in regression tasks to calculate the average
squared difference between predicted and actual values.
• ROC-AUC: Evaluates the trade-off between true positive and false positive rates
for classification problems.

4.6.4 PRUNING

Pruning is a crucial step in decision tree optimization that helps reduce overfitting by removing
sections of the tree that are not significant. There are two main types of pruning:

1. Pre-pruning (Early Stopping): Stops the tree construction early, based on predefined
conditions like maximum depth, minimum samples per leaf, or maximum number of
nodes.
2. Post-pruning: Involves growing the full tree first and then removing the least important
branches. Post-pruning techniques like cost-complexity pruning remove branches whose
contribution does not justify the added complexity, typically judged on a validation set
or by cross-validation.

Pruning enhances the generalization ability of the model by reducing the complexity of the tree,
preventing overfitting to the training data.

5. ADVANCED ML ALGORITHMS
Advanced Algorithms and Applications in Machine Learning focus on developing sophisticated
techniques to solve complex problems more efficiently. These algorithms include deep learning,
reinforcement learning, ensemble methods (e.g., Random Forest, Gradient Boosting), and
optimization techniques like Genetic Algorithms and Simulated Annealing. They are applied in
areas such as natural language processing, computer vision, recommendation systems, and
autonomous systems. Advanced algorithms improve accuracy, scalability, and generalization,
making them essential for real-world applications in AI-driven industries.

5.1 INTRODUCTION TO K-NEAREST NEIGHBOURS (KNN)

K-Nearest Neighbors (KNN) is a simple, yet powerful, supervised learning algorithm used for
both classification and regression tasks. It is based on the idea that data points with similar
features tend to be near each other. The KNN algorithm works by identifying the "K" nearest
neighbors to a new data point and classifying it based on the majority label (for classification)
or averaging the labels (for regression). KNN is a non-parametric and lazy learning algorithm,
meaning it makes no prior assumptions about the data distribution and does not learn a model
until predictions are required.

5.1.1 KNN ALGORITHM

The KNN algorithm follows these steps:

1. Choose the value of K: Select the number of nearest neighbors (K). A smaller K can lead
to overfitting, while a larger K may smooth out the decision boundary but risk underfitting.
2. Calculate the Distance: For a new data point, calculate the distance to all training
data points using a distance metric such as Euclidean distance (most common),
Manhattan distance, or Minkowski distance.

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

3. Identify Neighbors: Select the K nearest neighbors (those with the smallest distances).
4. Voting or Averaging:

• Classification: Assign the new data point to the class that is most
common among its K neighbors (majority vote).
• Regression: Calculate the average of the values of the K neighbors to predict
the output.

5. Predict the Label: Return the predicted class label or continuous value.
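The steps above can be sketched in plain Python as follows. This is a minimal illustration for classification with Euclidean distance; the function names and the toy points are assumptions made for the example. For regression, the majority vote would be replaced by an average of the neighbors' values.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 2: distance from the query to every training point
    distances = sorted(
        (euclidean(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Step 3: keep the labels of the k closest points
    nearest_labels = [label for _, label in distances[:k]]
    # Step 4: majority vote
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data: two well-separated groups (made up for illustration)
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (1.2, 1.4), k=3))
print(knn_predict(train_points, train_labels, (6.8, 7.0), k=3))
```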

5.1.2 MODEL TRAINING AND PREDICTION

Unlike many algorithms, KNN does not involve an explicit training phase. Instead, KNN
stores the entire dataset and performs calculations only when making predictions. This makes
KNN a "lazy learner," as it defers the generalization of the data until prediction time.
For prediction, given a test sample, KNN computes the distance from the test sample to all
training samples, selects the K closest samples, and determines the output based on the majority
or average of those neighbors. While this process is simple, it can become computationally
expensive when the dataset is large, as KNN must compare the test sample to every training
sample.

5.1.3 KNN APPLICATIONS

KNN has a wide range of applications in various fields, including:

1. Pattern Recognition: Used to classify images, handwriting, and speech, as KNN
can identify patterns based on similarity.
2. Recommendation Systems: By comparing user preferences and recommending
products, movies, or services based on the preferences of similar users.
3. Medical Diagnosis: Applied in medical fields to predict diseases or classify
patient conditions based on similarities in medical records or symptoms.
4. Anomaly Detection: Detecting unusual patterns in data for fraud detection, network
security, or fault detection in machinery by finding data points that do not have
enough neighbors close by.
5. Text Classification: Applied in natural language processing (NLP) for spam
detection, sentiment analysis, and categorizing documents.

5.2 CORRELATION
Correlation is a statistical measure that describes the strength and direction of a relationship
between two variables. It quantifies how changes in one variable are associated with changes in
another. A correlation coefficient ranges from -1 to 1, where:

• 1 indicates a perfect positive correlation (both variables move in the same direction),
• -1 indicates a perfect negative correlation (one variable increases while the other decreases),
• 0 indicates no correlation (no relationship between the variables).

5.2.1 PEARSON CORRELATION COEFFICIENT

The Pearson correlation coefficient (r) is the most commonly used measure of correlation and
assesses the linear relationship between two continuous variables. It assumes that the variables
are normally distributed and measures how much one variable changes as the other changes.
The formula for Pearson correlation is:

$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$$

Where:

• $X$ and $Y$ are the two variables,
• $\bar{X}$ and $\bar{Y}$ are the mean values of $X$ and $Y$,
• $r$ ranges from -1 to 1.

Pearson correlation is useful when you want to assess the strength and direction of a linear
relationship.
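A minimal sketch of the Pearson formula in plain Python is shown below. The function name and the sample values for hours studied and exam scores are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance term divided by the product of spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_sq_x = sum((x - mean_x) ** 2 for x in xs)
    sum_sq_y = sum((y - mean_y) ** 2 for y in ys)
    return numerator / math.sqrt(sum_sq_x * sum_sq_y)

# Hypothetical data: hours studied vs. exam score
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 72]

print(round(pearson_r(hours, scores), 3))
```

A value close to 1 here reflects that the made-up scores rise almost linearly with the hours.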

5.2.2 SPEARMAN CORRELATION COEFFICIENT

The Spearman correlation coefficient (ρ, or "rho") is a non-parametric measure of rank
correlation. It assesses how well the relationship between two variables can be described by a
monotonic function. Unlike Pearson correlation, Spearman correlation does not assume that the
data is normally distributed, making it more robust for monotonic non-linear relationships or
when the data contains outliers. The formula for Spearman's rank correlation (when there are no
tied ranks) is:

$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$

Where:

• $d_i$ is the difference between the ranks of corresponding values,
• $n$ is the number of data points.

Spearman is commonly used when the data is ordinal or when the relationship is not linear.
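The rank-difference formula can be sketched in plain Python as below. The helper names are assumptions, and the sketch assumes there are no tied values, since the simple formula is only exact in that case.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman's rho via the rank-difference formula (no tied ranks)."""
    n = len(xs)
    rank_x, rank_y = ranks(xs), ranks(ys)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

xs = [1, 2, 3, 4, 5]
ys = [1, 8, 27, 64, 125]  # monotonic but non-linear (cubes)

print(spearman_rho(xs, ys))        # 1.0: ranks agree perfectly
print(spearman_rho(xs, ys[::-1]))  # -1.0: ranks are exactly reversed
```

The cubic data gives a perfect rank correlation even though the relationship is not linear, which is the situation where Spearman is preferred over Pearson.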

5.2.3 CORRELATION ANALYSIS AND INTERPRETATION

Correlation analysis is used to determine the strength and direction of relationships between
variables. When interpreting correlation results, consider the following guidelines:

• Positive correlation: As one variable increases, the other also increases (closer to +1).
• Negative correlation: As one variable increases, the other decreases (closer to -1).
• No correlation: The variables do not move together in any discernible way (closer to 0).

While correlation helps identify relationships, it does not imply causation. A strong correlation
between two variables does not necessarily mean that changes in one cause changes in the
other; there may be other factors or variables influencing the relationship.

Example:

• A Pearson correlation of 0.85 between hours studied and exam scores indicates a strong
positive relationship, meaning that more study hours are associated with higher exam
scores.
• A Spearman correlation of -0.75 between finishing position in a race (scored so that a
larger value means a better placing) and finish time suggests a strong negative
relationship; the shorter the finish time, the better the placing.

5.3 CONFUSION MATRIX

A confusion matrix is a performance measurement tool for machine learning classification
models. It provides insight into the model's predictions by showing how well the predicted
classifications align with the actual classifications. The matrix is a table with four outcomes,
offering a comprehensive understanding of both correct and incorrect predictions, and is
structured as follows:

                   Predicted Positive     Predicted Negative
Actual Positive    True Positive (TP)     False Negative (FN)
Actual Negative    False Positive (FP)    True Negative (TN)

True Positives (TP)

True Positives represent instances where the model correctly predicted the positive class. For
example, in a disease classification task, TP would be cases where the model correctly identified
patients as having the disease.

True Negatives (TN)

True Negatives are instances where the model correctly predicted the negative class. In the same
disease classification example, TN would be cases where the model correctly identified patients as
not having the disease.

False Positives (FP)

False Positives, also known as Type I errors, occur when the model incorrectly predicts the positive
class for an actual negative instance. This would be when a healthy patient is wrongly classified as
having the disease.

False Negatives (FN)

False Negatives, or Type II errors, happen when the model incorrectly predicts the negative
class for an actual positive instance. In this case, the model fails to detect the disease in a patient
who actually has it.

5.3.1 ACCURACY

Accuracy is a measure of how often the model's predictions are correct, calculated as the ratio of
correct predictions (TP + TN) to the total number of predictions. It is useful when the classes are
balanced.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

5.3.2 PRECISION

Precision measures the proportion of true positive predictions out of all positive predictions. It
is crucial when the cost of false positives is high.

$$\text{Precision} = \frac{TP}{TP + FP}$$

5.3.3 RECALL

Recall, or Sensitivity, measures the proportion of true positive predictions out of all actual positive
instances. It is important in scenarios where missing positive instances (false negatives) is costly.

$$\text{Recall} = \frac{TP}{TP + FN}$$

5.3.4 F1 SCORE

The F1 Score is the harmonic mean of precision and recall, providing a single metric that balances
the trade-off between them. It is especially useful when the dataset is imbalanced, and both false
positives and false negatives are critical.

$$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
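To tie the four metrics of this section together, here is a minimal sketch that counts TP, TN, FP, and FN from two label lists and then computes accuracy, precision, recall, and F1. The function names and the ten sample labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count the four confusion-matrix cells for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1; zero denominators give 0.0."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 1 = has the disease, 0 = healthy
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # 3 5 1 1
print(classification_metrics(tp, tn, fp, fn))
```

With 3 true positives, 5 true negatives, 1 false positive, and 1 false negative, accuracy is 0.8 while precision, recall, and F1 are all 0.75.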

5.4 COMPUTER VISION (CV)

Computer Vision is a field of artificial intelligence (AI) that enables computers to interpret,
analyze, and understand visual information from the world, such as images and videos. It aims
to replicate human vision and automate tasks that require visual understanding. By extracting
and processing visual data, computer vision systems can perform tasks like image
classification, object detection, face recognition, and even self-driving car navigation.

Computer vision has rapidly evolved due to advancements in machine learning, especially deep
learning, which allows models to learn from vast amounts of visual data. It plays a critical role
in various sectors, including healthcare, retail, automotive, and robotics.

Fig 1: Computer Vision

5.4.1 IMAGE PROCESSING TECHNIQUES

Image processing involves manipulating and analyzing images to improve their quality or
extract useful information. Some commonly used techniques in image processing include:

1. Image Smoothing: This technique reduces noise and variations in an image, making
it clearer. Methods such as Gaussian Blurring or Median Filtering are used to
smoothen images by reducing unnecessary details.
2. Edge Detection: Detecting the boundaries of objects in images is crucial for identifying
shapes. Algorithms like Canny Edge Detection and Sobel Filters are widely used to detect
sharp transitions in pixel intensity, highlighting the edges of objects.
3. Thresholding: Thresholding converts grayscale images into binary images (black and
white) by setting a pixel intensity threshold. Pixels above the threshold are set to white,
and those below are set to black. This technique is used in applications like document
scanning and segmentation.
4. Morphological Operations: These operations, such as Erosion and Dilation, are used to
remove small objects or enhance structures within an image. They are commonly applied
in preprocessing tasks, such as noise removal or shape enhancement.
5. Image Segmentation: Image segmentation divides an image into meaningful regions
or segments. Watershed Segmentation and K-means Clustering are popular techniques
to separate objects from the background or other regions.
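Two of the techniques listed above, median filtering and thresholding, can be sketched without any image library by treating a grayscale image as a list of rows. The function names, the 5x5 toy image, and the choice to leave border pixels unchanged are assumptions made to keep the example short; real projects would typically use a library such as OpenCV.

```python
def median_filter(image):
    """3x3 median filter; border pixels are left unchanged for simplicity."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            window = sorted(
                image[r + dr][c + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            )
            out[r][c] = window[4]  # middle of the 9 sorted values
    return out

def threshold(image, t):
    """Binary thresholding: pixels above t become white (255), others black (0)."""
    return [[255 if pixel > t else 0 for pixel in row] for row in image]

# Dark 5x5 image with one bright noise pixel in the centre (made-up values)
image = [
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10],
    [10, 10, 255, 10, 10],
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10],
]

smoothed = median_filter(image)
print(threshold(image, 128)[2])     # the noise pixel survives thresholding
print(threshold(smoothed, 128)[2])  # after smoothing, the noise pixel is gone
```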

5.4.2 OBJECT DETECTION AND RECOGNITION

Object Detection involves locating and identifying objects within an image or video. It not only
classifies objects but also predicts their positions using bounding boxes. Object Recognition, on
the other hand, goes a step further by identifying specific instances of objects, such as
recognizing a particular face in a crowd or detecting a specific brand of a product.

Popular techniques for object detection and recognition include:

1. Convolutional Neural Networks (CNNs): CNNs are the backbone of modern object
detection models. They can automatically learn features from images and are
particularly effective in tasks like image classification and face detection.
2. YOLO (You Only Look Once): YOLO is a real-time object detection algorithm that
divides an image into grids and predicts bounding boxes and class probabilities for objects
in a single pass through the network. It is widely used for tasks that require real-time
object detection, such as video surveillance or autonomous driving.
3. R-CNN (Region-based CNN): R-CNN family algorithms (e.g., Fast R-CNN, Faster
R-CNN) are another set of powerful object detection models. They first generate
potential object regions (proposals) and then classify each region using a CNN.
4. Haar Cascades: Haar Cascade classifiers are an older but efficient method for object
detection, especially in detecting faces. They work by applying multiple stages of
classifiers to an image to detect objects in real-time.

5.4.3 COMPUTER VISION APPLICATIONS

Computer vision is transforming numerous industries with a wide array of applications,
including:

1. Healthcare: In medical imaging, computer vision is used to detect tumors, diagnose
diseases, and assist in surgeries. Techniques like image segmentation help in analyzing
X-rays, MRIs, and CT scans to identify abnormalities.
2. Autonomous Vehicles: Self-driving cars rely on computer vision to interpret the
surroundings by detecting pedestrians, other vehicles, traffic signs, and road conditions
in real-time. Object detection algorithms ensure safe navigation.
3. Face Recognition: Facial recognition technology is widely used in security systems,
unlocking devices, and even social media tagging. Models trained on facial features
can identify individuals in images or videos with high accuracy.
4. Retail: In retail, computer vision enables automated checkout systems, inventory
management, and customer behavior analysis. AI-powered cameras can track products on
shelves and identify when restocking is needed.

5. Agriculture: Computer vision is used in precision agriculture to monitor crops, detect
diseases, and analyze soil health. Drones equipped with computer vision technologies
can survey large farmlands and provide insights to farmers.
6. Augmented Reality (AR): In AR applications, computer vision helps in overlaying
digital content onto the physical world. This is used in mobile apps, gaming, and even
navigation systems to enhance the user's real-world experience.

5.5 CONVOLUTIONAL NEURAL NETWORKS (CNN)

Convolutional Neural Networks (CNNs) are a class of deep learning models specifically
designed for analyzing visual data, such as images and videos. Inspired by the human visual
system, CNNs are highly effective in identifying patterns, textures, and objects in images,
making them the backbone of modern computer vision tasks. Unlike traditional neural
networks, CNNs take advantage of the spatial structure of images, allowing them to process
data with a grid-like topology efficiently.

CNNs are widely used in applications such as image classification, object detection, face
recognition, and even natural language processing. By using convolutional layers, CNNs can
automatically learn hierarchical feature representations, from simple features like edges and
corners to more complex features like shapes and objects.

5.5.1 CNN ARCHITECTURE

A typical CNN consists of several key layers, each contributing to the extraction and processing
of image features:

1. Convolutional Layers: The convolutional layer is the core building block of CNNs. It
applies filters (or kernels) to the input image to detect features like edges, textures, and
patterns. This layer generates feature maps, which highlight the presence of specific
features across the image. Each filter produces a different feature map, helping the network
capture various aspects of the image.
2. Pooling Layers: Pooling layers are used to down-sample the feature maps, reducing their
spatial dimensions and computational complexity while preserving important features.
Common pooling techniques include Max Pooling (which selects the maximum value from
a window of the feature map) and Average Pooling. Pooling helps make the model
invariant to small translations in the input image, ensuring that the model can detect
features regardless of their location.
3. Fully Connected Layers (FC Layers): After the convolutional and pooling layers, the
network typically includes one or more fully connected layers. These layers take the
high-level features extracted by the convolutional layers and map them to the output classes.
In an image classification task, the final fully connected layer typically outputs a
probability distribution over the possible classes.
4. Activation Functions: Activation functions such as ReLU (Rectified Linear Unit)
introduce non-linearity into the network, allowing it to learn complex patterns. Without
these non-linear functions, the network would behave like a simple linear model.
5. Softmax Layer: In classification tasks, the final layer of a CNN is usually a Softmax
layer, which converts the output into probabilities for each class.
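The first layers described above can be sketched in plain Python to show what each one computes. This is an illustrative toy, not how a deep learning framework implements them: the function names, the 5x5 image, and the hand-written 2x2 kernel are assumptions, and the "convolution" slides the kernel without flipping it, as deep learning libraries commonly do.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) to build a feature map."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k_h) for j in range(k_w))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def relu(feature_map):
    """ReLU activation: negative responses are set to zero."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Max pooling with a size x size window and stride equal to size."""
    return [
        [
            max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]

# 5x5 image with a vertical edge between the dark (0) and bright (1) columns
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-written kernel that responds to dark-to-bright transitions from left to right
kernel = [[-1, 1], [-1, 1]]

feature_map = relu(conv2d(image, kernel))
print(feature_map)            # strong response only where the edge is
print(max_pool(feature_map))  # down-sampled map still shows the edge
```

In a trained CNN the kernel values are learned from data rather than written by hand.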

5.5.2 TRANSFER LEARNING WITH PRE-TRAINED MODELS

Transfer learning is a technique where a model that has been pre-trained on a large dataset is
fine-tuned on a smaller, domain-specific dataset. In the context of CNNs, popular pre-
trained models such as VGG, ResNet, Inception, and MobileNet are trained on massive
datasets like ImageNet, which contains millions of labeled images across thousands of
categories.

By leveraging transfer learning, the pre-trained model's convolutional layers can be used to
extract relevant features from a new dataset, while the fully connected layers can be re-trained
to suit the specific task. This approach saves time and computational resources and often leads
to improved performance, particularly when the target dataset is small or similar to the
dataset used for pre-training.

Transfer learning is especially effective in scenarios where labeled data is scarce, as the
pre-trained model already has a good understanding of general features like edges, textures, and
shapes. Fine-tuning these models can significantly reduce the need for extensive training and
improve the performance of image classification tasks.

5.5.3 IMAGE CLASSIFICATION WITH CNN

CNNs are highly effective for image classification tasks, where the goal is to categorize an
image into one of several predefined classes. The process of image classification using CNNs
involves the following steps:

1. Input Preprocessing: Before feeding images into the CNN, they are typically resized to
a fixed size, normalized (to adjust pixel values), and augmented (to generate variations
and reduce overfitting).
2. Feature Extraction: The CNN's convolutional layers extract high-level features from the
input image. For instance, in a task to classify images of animals, the early layers may
detect basic features like edges and textures, while deeper layers capture more complex
structures like fur patterns or specific shapes.
3. Classification: The fully connected layers at the end of the network take the
extracted features and map them to the output classes, where the final prediction is
made. The Softmax layer then converts these predictions into probabilities for each
class.
4. Training and Optimization: During training, the model's weights are adjusted using
techniques like Backpropagation and Stochastic Gradient Descent (SGD) to minimize
the loss function, which measures the difference between the predicted class and the
actual class.
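The classification and training steps above rely on two small computations that can be sketched directly: Softmax turns raw class scores into probabilities, and a loss function scores how far those probabilities are from the true class. The report does not name the loss, so cross-entropy is used here as an assumption (it is a common choice for classification); the scores are made up.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    highest = max(scores)  # subtracting the max improves numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, true_index):
    """Loss is small when the true class receives a high probability."""
    return -math.log(probabilities[true_index])

# Hypothetical raw scores from the final fully connected layer for 3 classes
scores = [2.0, 1.0, 0.1]
probabilities = softmax(scores)

print([round(p, 3) for p in probabilities])
print("loss if class 0 is correct:", round(cross_entropy(probabilities, 0), 3))
print("loss if class 2 is correct:", round(cross_entropy(probabilities, 2), 3))
```

During training, backpropagation and SGD adjust the weights so that this loss decreases across the training images.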

CNNs have achieved state-of-the-art performance on several benchmark image classification
tasks and are widely used in applications like face recognition, medical image analysis, and
object detection. Their ability to automatically learn features makes them ideal for complex
visual recognition problems.

5.6 NATURAL LANGUAGE PROCESSING (NLP)

Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that focuses
on the interaction between computers and humans through natural language. The primary
goal of NLP is to enable machines to understand, interpret, and generate human language in
a meaningful way. It encompasses various tasks, including language translation, sentiment
analysis, text summarization, and conversational agents like chatbots.

NLP combines computational linguistics, machine learning, and linguistics to analyze and
understand human language. With the increasing volume of text data generated daily from
social media, emails, articles, and customer reviews, NLP has become essential for businesses
and researchers seeking to extract insights from unstructured text. Advances in deep learning
and the availability of large datasets have significantly improved NLP models, enabling them to
achieve impressive performance in tasks such as text classification and language generation.

5.6.1 TEXT PREPROCESSING TECHNIQUES

Text preprocessing is a crucial step in NLP that involves transforming raw text into a clean and
structured format suitable for analysis. This step helps to improve the performance of NLP
models by reducing noise and standardizing the input data. Common text preprocessing
techniques include:

1. Tokenization: Tokenization is the process of breaking down text into smaller units, called
tokens. Tokens can be words, phrases, or sentences, depending on the granularity required
for analysis. For example, the sentence "Natural Language Processing is fascinating!" can
be tokenized into ["Natural", "Language", "Processing", "is", "fascinating", "!"].
2. Lowercasing: Converting all text to lowercase helps to ensure consistency by treating
words with different casing (e.g., "NLP" and "nlp") as the same token.
3. Removing Punctuation and Special Characters: Punctuation marks and special
characters often do not contribute to the meaning of the text and can be removed to simplify
analysis. For instance, "Hello, world!" would become "Hello world".
4. Stopword Removal: Stopwords are common words like "and," "the," and "is" that often
do not add significant meaning to the text. Removing stopwords can reduce the
dimensionality of the data and improve model performance.
5. Stemming and Lemmatization: Stemming involves reducing words to their base or
root form (e.g., "running" to "run"), while lemmatization considers the context and
converts words to their dictionary form (e.g., "better" to "good"). Both techniques help
standardize words and reduce variations.
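The first four techniques can be chained into a small pipeline using only the standard library, as sketched below. The function name and the tiny stopword set are assumptions for illustration; stemming and lemmatization are left out because they normally rely on a dedicated library such as NLTK or spaCy.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer)
STOPWORDS = {"and", "the", "is", "a", "an", "of", "to", "in"}

def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace
    # (note: this simple pattern also drops non-ASCII letters)
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    return [token for token in tokens if token not in STOPWORDS]

print(preprocess("Natural Language Processing is fascinating!"))
print(preprocess("Hello, world!"))
```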

5.6.2 TEXT REPRESENTATION

Once text has been preprocessed, the next step is to represent it in a format that machine
learning models can understand. Several text representation techniques include:

1. Bag of Words (BoW): The Bag of Words model represents text as an unordered collection
of words, disregarding grammar and word order. Each document is converted into a vector
of word frequencies, where each dimension corresponds to a unique word in the
vocabulary. While simple, BoW does not capture word context or relationships.
2. Term Frequency-Inverse Document Frequency (TF-IDF): TF-IDF improves upon BoW
by weighing the frequency of words relative to their importance in a collection of
documents. It assigns higher weights to words that are frequent in a particular document
but rare across the entire corpus, helping to identify key terms.
3. Word Embeddings: Word embeddings, such as Word2Vec, GloVe, and FastText,
represent words as dense vectors in a continuous vector space. These embeddings capture
semantic relationships and contextual meanings, allowing similar words to have similar
vector representations. For example, the words "king" and "queen" would have embeddings
that reflect their relationship in context.
4. Transformers: Transformers, particularly models like BERT and GPT, have revolutionized
text representation by using self-attention mechanisms to capture contextual information
and relationships between words in a sentence. These models have achieved state-of-the-art
performance in various NLP tasks.
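The first two representations can be sketched in plain Python. The three toy documents and function names are made up, and the TF-IDF variant used here (term frequency divided by document length, multiplied by log(N / document frequency)) is one common formulation chosen as an assumption; libraries such as scikit-learn use slightly different smoothing.

```python
import math
from collections import Counter

def bag_of_words(doc, vocab):
    """Vector of raw word counts, one dimension per vocabulary word."""
    counts = Counter(doc)
    return [counts[word] for word in vocab]

def tf_idf(docs):
    """Return the vocabulary and one TF-IDF vector per document."""
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    doc_freq = {word: sum(1 for doc in docs if word in doc) for word in vocab}
    idf = {word: math.log(n_docs / doc_freq[word]) for word in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[word] / len(doc) * idf[word] for word in vocab])
    return vocab, vectors

# Already-tokenized toy documents
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

vocab, vectors = tf_idf(docs)
print(vocab)
print(bag_of_words(docs[0], vocab))
print([round(weight, 3) for weight in vectors[0]])
```

The word "the" appears in every document, so its TF-IDF weight is zero, while "sat" (unique to the first document) gets a higher weight than "cat" (shared by two documents).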

5.6.3 SENTIMENT ANALYSIS AND TEXT CLASSIFICATION

Sentiment analysis is an NLP task that involves determining the sentiment or emotional tone of
a piece of text. It is widely used in applications like social media monitoring, customer
feedback analysis, and brand reputation management. Sentiment analysis can be categorized
into:

1. Binary Sentiment Analysis: This involves classifying text as either positive or
negative. For example, a review stating, "I love this product!" would be classified as
positive.
2. Multi-class Sentiment Analysis: This expands on binary sentiment analysis by
including neutral or multiple sentiment categories (e.g., positive, negative, neutral). This
approach provides a more nuanced understanding of public opinion.
3. Aspect-Based Sentiment Analysis: This advanced technique analyzes sentiments
towards specific aspects of a product or service. For instance, in the sentence "The battery
life is great, but the camera quality is poor," the sentiment is positive towards battery life
and negative towards camera quality.

Text classification is a broader NLP task that involves categorizing text into predefined classes
or labels. This can include tasks like spam detection in emails, topic classification of news
articles, and intent recognition in chatbots. The process typically involves feature extraction,
training a classifier (e.g., logistic regression, support vector machines, or neural networks), and
evaluating model performance using metrics such as accuracy, precision, recall, and F1 score.

6. HOUSING PRICE PREDICTION MODEL
6.1 INTRODUCTION
The Housing Price Prediction Model aims to predict the price of houses based on various
features such as location, size, number of bedrooms, and other factors. By utilizing machine
learning algorithms, particularly linear regression, the model analyzes historical data to
establish a relationship between housing features and their respective prices. This project
showcases how machine learning can be applied in the real estate market to make informed
decisions based on data patterns.

6.2 PURPOSE
The primary purpose of this project is to develop a model that accurately predicts house prices
using regression techniques. By feeding historical housing data into the model, it will learn the
relationship between various features and the house prices, enabling it to predict the price of
new houses based on their features. This helps buyers, sellers, and real estate agents make
more data-driven decisions.

6.3 ALGORITHM
The algorithm used for this project is Linear Regression, which is a supervised learning
technique. Linear regression helps in modeling the relationship between a dependent variable
(house price) and one or more independent variables (features such as size, location, etc.). The
goal is to fit a linear equation to observed data, which will be used to predict the house price
based on the given features.
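For a single feature, fitting the linear equation has a closed-form least-squares solution, sketched below in plain Python. The function name and the size/price pairs are made up for illustration (the prices follow price = 50000 + 150 × size exactly); the project itself uses scikit-learn's LinearRegression with several features.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: house size in square feet vs. price
sizes = [1000, 1500, 2000, 2500]
prices = [200000, 275000, 350000, 425000]

intercept, slope = fit_simple_linear_regression(sizes, prices)
print(f"price = {intercept:.0f} + {slope:.0f} * size")
print("Predicted price for 1800 sq ft:", intercept + slope * 1800)
```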

6.4 METHODOLOGY
1. Data Collection
• We collect housing data from reliable sources, including features like house size, number
of rooms, location, and price.
2. Data Preprocessing
• Handling Missing Values: Any missing values in the dataset are handled by either filling
them with mean/median or removing them.
• Feature Scaling: Standardization or normalization of features is performed to ensure
they are on the same scale, which improves model accuracy.
• Encoding Categorical Variables: Categorical variables (e.g., location) are converted into
numerical representations using techniques like one-hot encoding.
3. Splitting Data
• The dataset is split into two parts: training and testing sets. Typically, 80% of the data
is used for training, and 20% is used for testing.
4. Model Building
• The Linear Regression algorithm is applied to the training data. The model learns the
relationship between the features and the house prices during this phase.
5. Model Evaluation
• The trained model is evaluated using the testing dataset. Evaluation metrics such
as Mean Squared Error (MSE) and R-squared (R²) are used to measure how well
the model performs.
6. Prediction
• The model is used to predict the prices of houses based on the input features.
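The two evaluation metrics named in step 5 can be written out from their definitions, as sketched below. The function names and the four sample values are assumptions for illustration; the implementation in the next section uses scikit-learn's mean_squared_error and r2_score instead.

```python
def mse_from_scratch(y_true, y_pred):
    """Mean Squared Error: average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_from_scratch(y_true, y_pred):
    """R-squared: 1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up actual and predicted values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 8.0]

print("MSE:", mse_from_scratch(y_true, y_pred))  # 0.375
print("R2:", r2_from_scratch(y_true, y_pred))    # 0.925
```

A lower MSE and an R² closer to 1 both indicate predictions that track the actual values more closely.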

6.5 IMPLEMENTATION:
1. INSTALL LIBRARIES
Ensure that you have the required libraries installed in your Python environment.

pip install pandas numpy scikit-learn jupyter

2. CREATE JUPYTER NOTEBOOK

Run the following command in cmd to start Jupyter Notebook:

jupyter notebook

This will open the Jupyter interface in your browser. You can create a new notebook by clicking
on "New" and selecting "Python 3."


Fig 2: Jupyter Interface

3. CREATE CODE SNIPPET CELLS:

Import Libraries (Code Cell)

Import the necessary libraries for data handling, model building, and evaluation.

# Importing the required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

Load and Explore the Dataset (Code Cell)

Load your housing dataset (assumed to be in CSV format) and perform basic exploration.

# Load the housing dataset (replace '[Link]' with your dataset path)
data = pd.read_csv('[Link]')
# Display the first few rows of the dataset
data.head()

Data Preprocessing (Code Cell)

Handle missing values and split the dataset into features and target variables.

# Handle missing values by filling numeric columns with their column mean
data.fillna(data.mean(numeric_only=True), inplace=True)
# Split the dataset into features (X) and target variable (y)
X = data.drop('Price', axis=1)  # Features
y = data['Price']  # Target variable

Train-Test Split (Code Cell)

Split the dataset into training and testing sets.

# Splitting the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Model Building and Training (Code Cell)

Initialize and train the Linear Regression model.

# Initialize the Linear Regression model
model = LinearRegression()
# Train the model on the training data
model.fit(X_train, y_train)

Model Evaluation (Code Cell)

Evaluate the model using the test data.

# Predict house prices using the test data
y_pred = model.predict(X_test)
# Evaluate the model performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared Score: {r2}")

Prediction for New Data (Code Cell)

Use the trained model to make predictions for new house data.

# Predict the price of a new house with features [2000, 3, 2, 1]
new_house = np.array([[2000, 3, 2, 1]])  # Example: [size, bedrooms, bathrooms, floors]
predicted_price = model.predict(new_house)
print(f"Predicted Price for the new house: {predicted_price}")

Execution

1. Save the Notebook: After writing all the cells, save the notebook (File-> Save As) and run
the cells (Cell -> Run All).
2. Running the Notebook: You can run each cell one by one by clicking the "Run" button, or
you can run all cells at once (Cell-> Run All).
3. Exporting the Notebook: You can export the notebook as a PDF or HTML by going to File
-> Download As and selecting your preferred format.

6.6 TEST CASES
1. Test Case 1: Basic Prediction
• Input: Features [2000, 3, 2, 1] representing a house of 2000 square feet with 3
bedrooms, 2 bathrooms, and 1 floor.
• Expected Output: Predicted price (e.g., 450,000).
2. Test Case 2: Model Evaluation
• Input: Test dataset (automatically selected from the dataset).
• Expected Output: Mean Squared Error and R-squared score indicating the
performance of the model.
3. Test Case 3: Missing Values Handling
• Input: Dataset with missing values in some columns.
• Expected Output: The missing values are handled, and the model should
still function without errors.
4. Test Case 4: Prediction for Different House Features
• Input: Features [1500, 2, 1, 1] representing a house of 1500 square feet with 2
bedrooms, 1 bathroom, and 1 floor.
• Expected Output: A predicted price (e.g., 320,000).

6.7 RESULTS

Notebook output:

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  PRICE
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23  4.526
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22  3.585
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24  3.521
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25  3.413
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25  3.422

Missing values per column:
MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
PRICE         0
dtype: int64

Training set size: 16512, Test set size: 4128
Mean Squared Error: 0.555891598695244
R-squared score: 0.5757877060324511

The dataset used for the housing price prediction model contains key features such as median
income ('MedInc'), house age ('HouseAge'), average rooms ('AveRooms'), average bedrooms
('AveBedrms'), population ('Population'), average occupancy ('AveOccup'), geographic
coordinates ('Latitude', 'Longitude'), and the target variable, housing price ('PRICE'). The
data is complete with no missing values, ensuring consistency for model training. The dataset
was split into a training set of 16,512 records and a test set of 4,128 records.
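The two reported sizes add up to 20,640 records, and 4,128 is exactly 20% of that total, which suggests an 80/20 split. The sketch below reproduces those sizes using only the standard library. The 0.2 test fraction, the seed, and the function name `train_test_indices` are assumptions for illustration and are not taken from the notebook.

```python
import random


def train_test_indices(n_samples, test_fraction=0.2, seed=42):
    """Shuffle row indices and split them into train and test index lists.

    Assumptions: a 20% test fraction and a fixed seed for repeatability.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


# 20,640 is the sum of the reported training and test set sizes.
train_idx, test_idx = train_test_indices(20640)
print(f"Training set size: {len(train_idx)}, Test set size: {len(test_idx)}")
```

Because the rows are shuffled before the split, the training and test sets are disjoint samples of the whole dataset, so the test metrics are computed on records the model has not seen.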

The model's performance was evaluated on the test set using Mean Squared Error (MSE), which
was 0.556, and the R-squared score, which was 0.576, indicating that the model explains about
57.6% of the variance in housing prices.
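The two metrics above can be computed directly from their definitions: MSE is the mean of the squared prediction errors, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below is a minimal illustration. The function names `mse` and `r_squared` and the toy values are hypothetical and are not the notebook's data.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, the fraction of variance explained by the model."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only (not taken from the housing dataset).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

mse_value = mse(y_true, y_pred)
r2_value = r_squared(y_true, y_pred)
print("Mean Squared Error:", mse_value)
print("R-squared score:", r2_value)
```

Read this way, the reported R-squared of 0.576 means the residual sum of squares is about 42.4% of the total sum of squares, so a substantial share of the price variation is not captured by the linear model.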

7. CONCLUSION

The internship provided a comprehensive understanding of Python programming and
its applications in machine learning, specifically focusing on linear regression and its
role in predicting housing prices. Python's flexibility and wide range of libraries such
as Pandas, NumPy, and Scikit-learn enabled efficient data manipulation, analysis, and
model building.

Through this project, I gained practical experience in data collection, preprocessing,
model development, and evaluation, essential steps in the machine learning
workflow. The house price prediction task demonstrated how machine learning
models can uncover valuable insights from data and provide accurate predictions
based on historical patterns. By applying supervised learning techniques like linear
regression, I was able to develop a simple yet effective model to predict housing
prices based on various features.

This not only deepened my understanding of fundamental machine learning concepts
but also enhanced my skills in handling real-world datasets, building predictive
models, and evaluating their performance.
