Python Programs for Data Analysis
1. Write a program to compute the measures of central tendency (mean, median, mode) with and
without using built-in functions on the data.
Date:07/07/2025
----------------------------------------------------------------------------------------------------------------
WITH BUILT-IN:
import numpy as np
import statistics as stat
data=12,12,23,24,56,32,23,23
print(".....................USING BUILT IN FUNCTION.................")
mean=np.mean(data)
median=np.median(data)
mod=stat.mode(data)
print("The mean value is:",mean)
print("The median value is:",median)
print("The mod value is:",mod)
Output:
.....................USING BUILT IN FUNCTION.................
The mean value is: 25.625
The median value is: 23.0
The mod value is: 23
WITHOUT BUILT-IN:
data=[]
n=int(input("Enter number of elements:"))
for i in range(n):
    ele=int(input(f"Enter element {i+1}:"))
    data.append(ele)
print("The created list is:",data)
mean=sum(data)/n
print("Mean without built in function:",mean)
sorted_data=sorted(data)
print("The sorted data is:",sorted_data)
# Median: average the two middle values when n is even
if n%2==0:
    mid_index1=n//2
    mid_index2=mid_index1-1
    median=(sorted_data[mid_index1] + sorted_data[mid_index2]) / 2
else:
    mid_index=n//2
    median=sorted_data[mid_index]
print("Median is:",median)
# Mode: count each value, then keep every value that has the highest count
count={}
for element in data:
    if element in count:
        count[element] += 1
    else:
        count[element]=1
max_count=max(count.values())
mode=[element for element, c in count.items() if c==max_count]
print("Mode is",mode)
Output:
Enter number of elements: 5
Enter element 1: 6
Enter element 2: 3
Enter element 3: 2
Enter element 4: 8
Enter element 5: 6
The created list is: [6, 3, 2, 8, 6]
Mean without built in function: 5.0
The sorted data is: [2, 3, 6, 6, 8]
Median is: 6
Mode is [6]
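The hand-written steps can also be wrapped into small reusable functions. The following is a minimal sketch, not part of the original record: the helper names `mean_of`, `median_of` and `modes_of` are assumptions chosen for illustration, and the sample list is the one used in the built-in version so the two approaches can be compared.

```python
def mean_of(values):
    """Arithmetic mean: sum divided by count."""
    return sum(values) / len(values)


def median_of(values):
    """Middle value of the sorted data (average of the two middle values if even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 0:
        return (ordered[mid - 1] + ordered[mid]) / 2
    return ordered[mid]


def modes_of(values):
    """All values that occur most often (a list, since ties are possible)."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    highest = max(counts.values())
    return [v for v, c in counts.items() if c == highest]


data = [12, 12, 23, 24, 56, 32, 23, 23]
print(mean_of(data), median_of(data), modes_of(data))
```

Returning the mode as a list keeps the behaviour of the listing above, which reports every value tied for the highest count.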
2. Write a Python program to:
i) read multiple files from a single folder
ii) read multiple files from multiple folders.
Date:
----------------------------------------------------------------------------------------------------------------
i) read multiple files from a single folder
import os
path=os.getcwd()
main_folder_name=os.path.basename(path)
print(f"main folder name:{main_folder_name}")
for file in os.listdir(path):
    if file.endswith(".txt"):
        file_path=os.path.join(path,file)
        print(f"file path:{file_path}")
        with open(file_path,'r') as f:
            print(f.read())
Output:
main folder name:mca
file path:C:\Users\DELL\mca\[Link]
Hello.............!Good Morning
ii) read multiple files from multiple folders.
import os
def read_text_files_from_folders(root_folder):
    texts = []
    # os.walk visits root_folder and every subfolder beneath it
    for folder_name, subfolders, filenames in os.walk(root_folder):
        for filename in filenames:
            if filename.endswith(".txt"):
                file_path = os.path.join(folder_name, filename)
                try:
                    with open(file_path, 'r', encoding='utf-8') as file:
                        content = file.read()
                        print(folder_name)
                        print(filename)
                        print(content)
                        texts.append(content)
                except Exception as e:
                    print(f"Error reading file {file_path}: {e}")
    return texts
root_folder = "C:/"
texts = read_text_files_from_folders(root_folder)
Output :
C:/
[Link]
Deployment Image Servicing and Management tool
Version: 10.0.10240.16384
Image Version: 10.0.10240.16384
Packages listing:
Package Identity : Microsoft-Windows-Client-LanguagePack-
Package~31bf3856ad364e35~amd64~en-US~10.0.10240.16384
State : Installed
Release Type : Language Pack
Install Time : 7/10/2015 1:13 PM
Package Identity : Microsoft-Windows-DiagTrack-Internal-
Package~31bf3856ad364e35~amd64~~10.0.10240.16384
State : Installed
Release Type : Feature Pack
Install Time : 7/10/2015 12:20 PM
Package Identity : Microsoft-Windows-Foundation-
Package~31bf3856ad364e35~amd64~~10.0.10240.16384
State : Installed
Release Type : Foundation
Install Time : 7/10/2015 12:20 PM
Package Identity : Microsoft-Windows-LanguageFeatures-Basic-en-us-
Package~31bf3856ad364e35~amd64~~10.0.10240.16384
State : Installed
Release Type : OnDemand Pack
Install Time : 7/10/2015 1:13 PM
Package Identity : Microsoft-Windows-LanguageFeatures-Handwriting-en-us-
Package~31bf3856ad364e35~amd64~~10.0.10240.16384
State : Installed
Release Type : OnDemand Pack
Install Time : 7/10/2015 1:13 PM
Package Identity : Microsoft-Windows-LanguageFeatures-OCR-en-us-
Package~31bf3856ad364e35~amd64~~10.0.10240.16384
State : Installed
Release Type : OnDemand Pack
Install Time : 7/10/2015 1:14 PM
Package Identity : Microsoft-Windows-LanguageFeatures-Speech-en-us-
Package~31bf3856ad364e35~amd64~~10.0.10240.16384
State : Installed
Release Type : OnDemand Pack
Install Time : 7/10/2015 1:14 PM
Package Identity : Microsoft-Windows-LanguageFeatures-TextToSpeech-en-us-
Package~31bf3856ad364e35~amd64~~10.0.10240.16384
State : Installed
Release Type : OnDemand Pack
Install Time : 7/10/2015 1:14 PM
Package Identity : Microsoft-Windows-Prerelease-Client-
Package~31bf3856ad364e35~amd64~en-US~10.0.10240.16384
State : Installed
Release Type : Language Pack
Install Time : 7/10/2015 1:13 PM
Package Identity : Microsoft-Windows-Prerelease-Client-
Package~31bf3856ad364e35~amd64~~10.0.10240.16384
State : Installed
Release Type : Feature Pack
Install Time : 7/10/2015 12:20 PM
Package Identity : Microsoft-Windows-RetailDemo-OfflineContent-Content-en-us-
Package~31bf3856ad364e35~amd64~~10.0.10240.16384
State : Installed
Release Type : OnDemand Pack
Install Time : 7/10/2015 1:16 PM
Package Identity : Microsoft-Windows-RetailDemo-OfflineContent-Content-
Package~31bf3856ad364e35~amd64~~10.0.10240.16384
State : Installed
Release Type : OnDemand Pack
Install Time : 7/10/2015 1:16 PM
Package Identity : Package_for_KB3074667~31bf3856ad364e35~amd64~~[Link]
State : Installed
Release Type : Security Update
Install Time : 7/24/2015 3:15 AM
Package Identity : Package_for_KB3081444~31bf3856ad364e35~amd64~~[Link]
State : Install Pending
Release Type : Security Update
Install Time : 8/25/2015 8:38 AM
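Walking an entire drive such as "C:/" is slow and, as the output shows, picks up unrelated system logs. A minimal alternative sketch, assuming the standard-library `pathlib` module and a hypothetical helper name `read_text_files`, restricts the search to one project folder and returns the contents keyed by path. The temporary folder and file names below are made up for the demonstration.

```python
import tempfile
from pathlib import Path


def read_text_files(root_folder, pattern="*.txt"):
    """Return {path: content} for every matching file under root_folder."""
    contents = {}
    # rglob searches root_folder and all of its subfolders
    for path in sorted(Path(root_folder).rglob(pattern)):
        if not path.is_file():
            continue
        try:
            contents[str(path)] = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError) as err:
            print(f"Skipping {path}: {err}")
    return contents


# Demonstration on a throwaway folder tree (hypothetical file names)
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "sub").mkdir()
    (Path(root) / "a.txt").write_text("top level", encoding="utf-8")
    (Path(root) / "sub" / "b.txt").write_text("nested", encoding="utf-8")
    (Path(root) / "sub" / "c.csv").write_text("ignored", encoding="utf-8")
    found = read_text_files(root)
    names = sorted(Path(p).name for p in found)
    print(names)
```

Returning a dictionary rather than printing inside the loop keeps the reading step separate from the display step, which makes the function easier to reuse.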
3. Write a Python program to read and display various kinds of data (image, text,
numeric, audio, video) saved in different formats using various Python libraries.
Date:
----------------------------------------------------------------------------------------------------------------
#jpg
from PIL import Image
image='[Link]'
image=Image.open(image)
image.show()
Output:
#png
from PIL import Image
image='[Link]'
image=Image.open(image)
image.show()
Output:
#gif
import cv2
video = '[Link]'
cap = cv2.VideoCapture(video)
if not cap.isOpened():
    print("Error: could not open video")
    exit()
while True:
    ret, frame = cap.read()
    if not ret:
        break
    cv2.imshow('Frame', frame)
    # wait 25 ms per frame; press 'q' to stop early
    if cv2.waitKey(25) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
Output:
# audio
from IPython.display import Audio
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
audio='computer-keyboard-typing-290582.mp3'
y,sr=librosa.load(audio,sr=None)
print(f'sampling rate:{sr}Hz')
print(f'Number of sample:{len(y)}')
plt.figure(figsize=(14,5))
plt.subplot(3,3,3)
plt.plot(np.arange(len(y))/sr,y)
plt.title('waveform')
plt.xlabel('Time(s)')
plt.ylabel('Amplitude')
plt.show()
Audio(data=y,rate=sr)
Output:
sampling rate:48000Hz
Number of sample:1793664
# TEXT DATA
# .txt, .json, Excel
#.txt
file_path ='[Link]'
try:
    with open(file_path, 'r') as file:
        print(file.read())
except FileNotFoundError:
    print(f"Error: The file '{file_path}' was not found.")
except IOError:
    print(f"Error: Could not read file '{file_path}'.")
Output:
Hello.............!Good Morning
# json
import json
data={
    "name":"john doe",
    "age":30,
    "city":"newyork",
    "intrests":["python","data science","Reading"]
}
file_path='[Link]'
try:
    with open(file_path,'w') as file:
        json.dump(data,file,indent=4)
    print("Json data has been written to successfully")
except IOError:
    print("Error:Could not write to file")
file_path='[Link]'
with open(file_path,'r') as file:
    print(file.read())
Output:
Json data has been written to successfully
{
    "name": "john doe",
    "age": 30,
    "city": "newyork",
    "intrests": [
        "python",
        "data science",
        "Reading"
    ]
}
#xls
import pandas as pd
file_path='[Link]'
df=pd.read_excel(file_path)
print(df.head())
Output:
4 Nereida Magwood Female US 58 2016-08-16 2468
#CSV
import pandas as pd
file_path='[Link]'
df=pd.read_csv(file_path)
print(df.head())
Output:
area bedroom age price
0 2600 3.0 20 550000
1 3000 4.0 15 565000
2 3200 NaN 18 580000
3 3500 3.0 30 595000
4 4000 5.0 8 610000
#tsv
def read_tsv_file(file_path):
    with open(file_path,'r') as file:
        lines=file.readlines()
    for line in lines:
        # split each line into fields on the tab character
        fields=line.strip().split('\t')
        print(fields)
file_path='[Link]'
read_tsv_file(file_path)
Output:
["[' ']"]
["['Annual budget tracker']"]
["['Plan and track your monthly spending for the entire year']"]
["[' ']"]
["[' ']"]
["['How to use this temple']"]
['']
4. Write a program to compute the measures of dispersion (range, variance, standard
deviation, IQR) with and without using built-in functions on the data set.
Date:09-07-2025
----------------------------------------------------------------------------------------------------------------
WITH BUILT IN:
import numpy as np
data=[]
n=int(input("Enter the number do you want to enter:"))
for i in range(n):
    ele=int(input("Enter the elements:"))
    data.append(ele)
print("The numbers are:",data)
range_builtin = np.ptp(data)          # maximum minus minimum
variance_builtin = np.var(data)       # population variance (divides by n)
std_deviation_builtin = np.std(data)  # population standard deviation
iqr_builtin = np.percentile(data, 75) - np.percentile(data,25)
print(".....................Using Built-in Functions............................")
print(f"Range: {range_builtin}")
print(f"Variance: {variance_builtin}")
print(f"Standard Deviation: {std_deviation_builtin}")
print(f"IQR: {iqr_builtin}")
Output:
Enter the number do you want to enter: 14
Enter the elements: 8
Enter the elements: 9
Enter the elements: 4
Enter the elements: 2
Enter the elements: 3
Enter the elements: 5
Enter the elements: 4
Enter the elements: 12
Enter the elements: 78
Enter the elements: 71
Enter the elements: 61
Enter the elements: 36
Enter the elements: 78
Enter the elements: 90
The numbers are: [8, 9, 4, 2, 3, 5, 4, 12, 78, 71, 61, 36, 78, 90]
.....................Using Built-in Functions............................
Range: 88
Variance: 1107.4948979591838
Standard Deviation: 33.27904592922074
IQR: 64.25
WITHOUT BUILT IN:
data=[]
n=int(input("Enter the number do you want to enter:"))
for i in range(n):
    ele=int(input("Enter the elements:"))
    data.append(ele)
print("The numbers are:",data)
range_custom = max(data) - min(data)
mean = sum(data) / len(data)
# Sample variance (divides by n-1), unlike np.var above, which divides by n by default
variance_custom = sum((x - mean) ** 2 for x in data) / (len(data) - 1)
std_dev_custom = variance_custom ** 0.5
sorted_data = sorted(data)
# Quartiles taken directly at the 75% and 25% positions, without interpolation
q75_custom = sorted_data[int(len(data) * 0.75)]
q25_custom = sorted_data[int(len(data) * 0.25)]
iqr_custom = q75_custom - q25_custom
print(".......................Without Using Built-in Functions................................")
print(f"Range: {range_custom}")
print(f"Variance: {variance_custom}")
print(f"Standard Deviation: {std_dev_custom}")
print(f"IQR: {iqr_custom}")
Output:
Enter the number do you want to enter: 10
Enter the elements: 10
Enter the elements: 20
Enter the elements: 30
Enter the elements: 40
Enter the elements: 50
Enter the elements: 60
Enter the elements: 70
Enter the elements: 80
Enter the elements: 90
Enter the elements: 100
The numbers are: [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
.......................Without Using Built-in Functions................................
Range: 90
Variance: 916.6666666666666
Standard Deviation: 30.276503540974915
IQR: 50
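The two versions above are not directly comparable: the NumPy version reports the population variance and interpolates between ranks for the percentiles, while the hand-written version divides by n-1 and picks quartiles by index. The following is a minimal sketch, with assumed helper names `percentile` and `variance`, of a hand-written version that follows the same conventions as the NumPy calls. It is checked here against the first data set entered above.

```python
def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def variance(values, ddof=0):
    """Variance; ddof=0 divides by n (population), ddof=1 by n-1 (sample)."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - ddof)


data = [8, 9, 4, 2, 3, 5, 4, 12, 78, 71, 61, 36, 78, 90]
data_range = max(data) - min(data)
iqr = percentile(data, 75) - percentile(data, 25)
pop_var = variance(data)
pop_std = pop_var ** 0.5
print(data_range, iqr, pop_var, pop_std)
```

Passing `ddof=1` to `variance` gives the sample variance used in the hand-written listing, so both conventions are available from one function.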
5. Write a program to perform linear regression using
i) Single variable
ii) Multiple variables.
Date:11-07-2025
----------------------------------------------------------------------------------------------------------------
i) Single variable
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plot
df=pd.read_csv('[Link]')
df.head()
Output:
df.keys()
Output:
Index(['area', 'bedrooms', 'age ', ' price'], dtype='object')
df=df.rename(columns={'age ':'age'})
print(df.columns)
Output:
Index(['area', 'bedrooms', 'age', ' price'], dtype='object')
plot.xlabel('area')
plot.ylabel('price')
plot.scatter(df.area,df.price,color='green',marker='+')
plot.show()
Output:
x=df[['area']]
y=df[['price']]
x=df.iloc[:,0].values
y=df.iloc[:,3].values
print(x)
print(y)
Output:
[2600 3000 3200 3500 4000]
[550000 556500 580000 595000 610000]
reg=LinearRegression()
x = x.reshape(-1, 1)
reg.fit(x,y)
Output:
reg.coef_
Output:
array([46.50179856])
reg.predict([[3000]])
Output:
array([566209.5323741])
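For a single variable, the fitted line can be checked by hand with the least-squares formulas: slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x). The following is a minimal sketch, with an assumed helper name `fit_line`, that applies those formulas to the x and y arrays printed above. It is an independent check, not the method scikit-learn uses internally.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


area = [2600, 3000, 3200, 3500, 4000]
price = [550000, 556500, 580000, 595000, 610000]
slope, intercept = fit_line(area, price)
predicted = intercept + slope * 3000
print(slope, intercept, predicted)
```

The slope and the prediction for an area of 3000 agree with the `reg.coef_` and `reg.predict` values shown in the outputs above.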
ii) Multiple variables
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
df=pd.read_csv('homeprices_multiple.csv')
df.head()
Output:
plt.xlabel('age')
plt.ylabel('price')
plt.scatter(df.age,df.price,color='green',marker='+')
plt.show()
Output:
reg.fit(df[['area', 'bedrooms', 'age']], df['price'])
Output:
print("coefficient:",reg.coef_)
print("Intercept:",reg.intercept_)
Output:
coefficient: [ 148.64130435 35135.86956522 -1603.26086957]
Intercept: 2581.5217391273472
Output:
Predicted price: 529864.130434782
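The statement that produced this prediction is not shown in the record. A prediction from a fitted linear model is the intercept plus the sum of each coefficient times its feature value, so it can be reproduced from the printed coefficients. In the minimal sketch below, the inputs (area 3000, 3 bedrooms, age 15) are an assumption: they were chosen because they reproduce the recorded value, not because the record states them.

```python
# Coefficients and intercept copied from the printed output above
coefficients = [148.64130435, 35135.86956522, -1603.26086957]
intercept = 2581.5217391273472


def predict(features):
    """Linear model prediction: intercept + sum(coefficient * feature)."""
    return intercept + sum(c * v for c, v in zip(coefficients, features))


# Assumed inputs: area (sq. ft.), bedrooms, age (years)
new_home = [3000, 3, 15]
predicted_price = predict(new_home)
print("Predicted price:", predicted_price)
```

Because the coefficients are rounded to eight decimals in the printout, the result matches the recorded price only to within a small rounding difference.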
6. Program to fit a multiple linear regression model on the House_prices dataset. Consider
the table below containing house prices in Monroe, New Jersey (USA).
area   bedrooms   age   price
2600   3          20    550000
3000   4          15    565000
3200   (missing)  18    610000
3600   3          30    595000
4000   5          8     760000
4100   5          8     810000
Here price depends on the area (square feet), bedrooms and age of the house (in years).
Predict the prices of new homes based on the following area, bedrooms and age.
Date:
----------------------------------------------------------------------------------------------------------------
import pandas as pd
import numpy as np
from sklearn import linear_model
import warnings
warnings.filterwarnings('ignore')
df=pd.read_excel('[Link]')
print("The home price dataset is\n",df)
Output:
The home price dataset is
area bedrooms age price
0 2600 3.0 20 550000
1 3000 4.0 15 565000
2 3200 NaN 18 610000
3 3600 3.0 30 595000
4 4000 5.0 8 760000
5 4100 5.0 8 810000
Output:
The description of the dataset
area bedrooms age price
count 6.000000 5.0 6.000000 6.000000
mean 3416.666667 4.0 16.500000 648333.333333
std 587.934237 1.0 8.288546 109117.673484
min 2600.000000 3.0 8.000000 550000.000000
25% 3050.000000 3.0 9.750000 572500.000000
50% 3400.000000 4.0 16.500000 602500.000000
75% 3900.000000 5.0 19.500000 722500.000000
max 4100.000000 5.0 30.000000 810000.000000
Output:
To check if there is any missing value
area False
bedrooms True
age False
price False
dtype: bool
Output:
The median of bedrooms= 4.0
df.bedrooms=df.bedrooms.fillna(df.bedrooms.median())
print("Data set After replacing the missing values with median")
print(df)
Output:
import matplotlib.pyplot as plt
plt.scatter(df.age,df.price)
plt.xlabel('Age of home(in years)')
plt.ylabel('price')
plt.show()
Output:
x=df.drop('price',axis='columns')
print("The dataset after dropping price is\n")
print(x)
Output:
The dataset after dropping price is
area bedrooms age
0 2600 3.0 20
1 3000 4.0 15
2 3200 4.0 18
3 3600 3.0 30
4 4000 5.0 8
5 4100 5.0 8
y=df['price']
print("The dataset having price is\n")
print(y)
Output:
The dataset having price is
0 550000
1 565000
2 610000
3 595000
4 760000
5 810000
Name: price, dtype: int64
reg=linear_model.LinearRegression()
reg.fit(x,y)
Output:
coef=reg.coef_
print(coef)
Output:
[ 189.57096766 -94877.34896436 -13068.36933232]
b1=coef[0]
b2=coef[1]
b3=coef[2]
print(b1)
print(b2)
print(b3)
Output:
189.57096766248824
-94877.34896436246
-13068.369332315371
a=reg.intercept_
print(a)
Output:
595770.0169938187
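reg.predict([[3000,3,40]])
new_area=3000
new_bedrooms=3
new_age=40
# Each coefficient multiplies its feature. The output recorded below came from the
# original expression, which used (b3+new_age); with (b3*new_age) the value is about 357116.10.
predicted_price=a+(b1*new_area)+(b2*new_bedrooms)+(b3*new_age)
print("The predicted_price\n",predicted_price)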
Output:
The predicted_price
866822.5037558805
7. Write a program to perform various data visualization techniques on a sample
dataset.
Date:
----------------------------------------------------------------------------------------------------------------
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data=pd.read_csv('fruit_data_with_colours.csv')
print("First few rows of the dataset:")
print(data.head())
Output:
First few rows of the dataset:
fruit_label fruit_name fruit_subtype mass width height color_score
0 1 apple granny_smith 192 8.4 7.3 0.55
1 1 apple granny_smith 180 8.0 6.8 0.59
2 1 apple granny_smith 176 7.4 7.2 0.60
3 2 mandarin mandarin 86 6.2 4.7 0.80
4 2 mandarin mandarin 84 6.0 4.6 0.79
plt.subplot(2,2,1)
sns.histplot(data['width'],bins=10,kde=True)
plt.title('Histogram of fruit width')
Output:
Text(0.5, 1.0, 'Histogram of fruit width')
plt.subplot(2,2,2)
sns.scatterplot(x='width',y='height',data=data)
plt.title('scatter plot of width vs Height')
Output:
Text(0.5, 1.0, 'scatter plot of width vs Height')
plt.subplot(2,2,3)
sns.boxplot(x='mass',y='color_score',data=data)
plt.title('box plot of mass level color_score')
Output:
Text(0.5, 1.0, 'box plot of mass level color_score')
plt.subplot(2,2,4)
sns.countplot(x='fruit_name',data=data)
plt.title('count of fruit_name')
Output:
Text(0.5, 1.0, 'count of fruit_name')
8. Write a program to perform numeric data preprocessing with and without using built-in
functions.
Date:17/07/2025
----------------------------------------------------------------------------------------------------------------
WITH BUILT-IN FUNCTION:
import numpy as np
import pandas as pd
df=pd.read_csv("[Link]")
df.head()
Output:
df.shape
Output:
(768, 9)
[Link]()
Output:
features = df.iloc[:, :-1]
class_label = df.iloc[:, -1]
duplicate_values = df[df.duplicated()]
print("\nDuplicate values:")
print(duplicate_values)
Output:
Duplicate values:
Empty DataFrame
Columns: [Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI,
DiabetesPedigreeFunction, Age, Outcome]
Index: []
df.drop_duplicates(inplace=True)
print("DataFrame after removing duplicates:")
df.head(5)
Output:
DataFrame after removing duplicates:
missing_values=df.isnull().sum()
print("Missing values:\n",missing_values)
Output:
Missing values:
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
WITHOUT BUILT-IN FUNCTION:
import csv
with open('[Link]','r') as file:
    reader=csv.reader(file)
    data=list(reader)
header=data[0]
rows=data[1:]
# convert every field from string to float
for i in range(len(rows)):
    rows[i]=[float(x) for x in rows[i]]
cols_to_fix=[1,2,3,4,5]
Output:
Preprocessed Data(without built_in functions):
['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
'DiabetesPedigreeFunction', 'Age', 'Outcome']
[6.0, 148.0, 72.0, 35.0, 125.0, 33.6, 0.627, 50.0, 1.0]
[1.0, 85.0, 66.0, 29.0, 125.0, 26.6, 0.351, 31.0, 0.0]
[8.0, 183.0, 64.0, 29.0, 125.0, 23.3, 0.672, 32.0, 1.0]
[1.0, 89.0, 66.0, 23.0, 94.0, 28.1, 0.167, 21.0, 0.0]
[0.0, 137.0, 40.0, 35.0, 168.0, 43.1, 2.288, 33.0, 1.0]
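The listing stops at `cols_to_fix`, so the actual cleaning step is not shown. The output is consistent with replacing zero entries in those columns by the median of the column's nonzero values, but that is an assumption. A minimal sketch of that idea, using assumed helper names and a tiny made-up sample rather than the real dataset, could look like this:

```python
def median(values):
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 0:
        return (ordered[mid - 1] + ordered[mid]) / 2
    return ordered[mid]


def replace_zeros_with_median(rows, cols_to_fix):
    """Return a copy of rows in which zeros in the given columns
    are replaced by the median of that column's nonzero values."""
    fixed = [list(row) for row in rows]
    for col in cols_to_fix:
        nonzero = [row[col] for row in fixed if row[col] != 0]
        if not nonzero:
            continue  # nothing to impute from
        fill = median(nonzero)
        for row in fixed:
            if row[col] == 0:
                row[col] = fill
    return fixed


# Made-up sample: columns are id, glucose, insulin
rows = [
    [1.0, 148.0, 0.0],
    [2.0, 0.0, 94.0],
    [3.0, 89.0, 168.0],
    [4.0, 137.0, 100.0],
]
cleaned = replace_zeros_with_median(rows, cols_to_fix=[1, 2])
for row in cleaned:
    print(row)
```

Working on a copy leaves the original rows untouched, which makes it easy to compare the data before and after cleaning.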
9. Write a Python program to perform the following using employees_data.csv file:
1. Load the dataset and display the first 5 rows.
2. Clean the data:
* Check for and handle any missing values.
* Convert Joining_Date to datetime format.
3. Feature Engineering:
* Create a new column Years_of_Service calculated from today's date.
4. Data Analysis:
* Calculate the average salary per department.
* Find the number of employees in each gender category.
* Identify the department with the highest average Years_of_Service.
5. Data Visualization:
* Create a bar plot of average salary by department.
* Plot a pie chart showing the gender distribution.
6. Export the cleaned and enriched dataset to a new CSV file called
employees_cleaned.csv.
Date:08/08/2025
----------------------------------------------------------------------------------------------------------------
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
df = pd.read_csv('employees .csv')
print("First 5 rows of the dataset:")
print(df.head())
Output:
COMMISSION_PCT MANAGER_ID DEPARTMENT
0 NaN 3 Finance
1 0.24 20 Sales
2 0.07 9 Finance
3 NaN 17 Marketing
4 NaN 10 Finance
Output:
dtype: int64
df.dropna(inplace=True)
# Convert Joining_Date to datetime format
df['JOINING_DATE'] = pd.to_datetime(df['JOINING_DATE'],dayfirst=True,errors='coerce')
# Drop rows where Joining_Date couldn't be parsed
df.dropna(subset=['JOINING_DATE'], inplace=True)
print(df.head())
Output:
EMPLOYEE_ID FIRST_NAME LAST_NAME EMAIL \
1 2 Vanessa Patel [Link]@[Link]
2 3 Tammy Woods [Link]@[Link]
6 7 Frances Massey [Link]@[Link]
7 8 Brenda Rogers [Link]@[Link]
13 14 Joanne Stephens [Link]@[Link]
# 3. Feature Engineering
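The code for this step is not shown in the record. Years of service is the number of completed years between the joining date and today. The following is a minimal sketch using only the standard library; the helper name `years_of_service`, the sample dates and the fixed reference day are all assumptions made for illustration.

```python
from datetime import date


def years_of_service(joining_date, today=None):
    """Completed years between joining_date and today."""
    today = today or date.today()
    years = today.year - joining_date.year
    # subtract one if this year's anniversary has not been reached yet
    if (today.month, today.day) < (joining_date.month, joining_date.day):
        years -= 1
    return years


# Hypothetical joining dates, with a fixed "today" so the result is reproducible
reference_day = date(2025, 8, 8)
joined = [date(2015, 8, 8), date(2015, 8, 9), date(2020, 1, 31)]
service = [years_of_service(d, reference_day) for d in joined]
print(service)
```

Since the helper only reads the `year`, `month` and `day` attributes, it could plausibly be applied to the parsed column with `df['JOINING_DATE'].apply(years_of_service)` to build `Years_of_Service`.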
Output:
# 4. Data Analysis
print("\nAverage salary per department:")
print(df.groupby('DEPARTMENT')['SALARY'].mean())
print("\nEmployee count by gender:")
print(df['GENDER'].value_counts())
Output:
Output:
Output:
df.to_csv('employees_cleaned.csv', index=False)
print("\nCleaned and enriched dataset exported as 'employees_cleaned.csv'.")
Output:
Cleaned and enriched dataset exported as 'employees_cleaned.csv'.
10. Write a program to perform text data preprocessing for the dataset [Link]
with and without using built-in functions.
Date:
----------------------------------------------------------------------------------------------------------------
import numpy as np
import pandas as pd
df=pd.read_csv("[Link]")
df.head(5)
# expand the display width of the text columns (None means no limit)
pd.set_option('display.max_colwidth',None)
df=df[['Tweets','Retweets']]
df.head()
Output:
df['Retweets'].value_counts()
Output:
Retweets
88 9
102 8
137 8
89 8
133 7
..
2575 1
5276 1
868 1
6886 1
11302 1
Name: count, Length: 1993, dtype: int64
import pandas as pd
import string
import re
from nltk.tokenize import WhitespaceTokenizer
def remove_urls(text):
    if isinstance(text,str):
        return re.sub(r'http\S+|www\S+','',text,flags=re.MULTILINE)
    else:
        return text
def remove_punctuation(text):
    if isinstance(text,str):
        return "".join([char for char in text if char not in string.punctuation])
    else:
        return text
def tokenization(text):
    if isinstance(text,str):
        tk=WhitespaceTokenizer()
        return tk.tokenize(text)
    else:
        return text
#expand contractions
import contractions
def conc(text):
    expanded_text = contractions.fix(text)
    return expanded_text
df['lower_case'] = df['Tweets'].apply(lambda x: x.lower()) #lower case
df['rem_url'] = df['lower_case'].apply(remove_urls) # Remove URLs first
df['rem_punct'] = df['rem_url'].apply(lambda x: remove_punctuation(x)) #apply punctuation
df['rem_conct'] = df['rem_punct'].apply(conc) #apply contraction
df['tokenised_msg'] = df['rem_punct'].apply(lambda x: tokenization(x)) #apply tokenization
df.head(10)
Output:
Output:
"a,about,above,after,again,against,ain,all,am,an,and,any,are,aren,aren't,as,at,be,because,been,
before,being,below,between,both,but,by,can,couldn,couldn't,d,did,didn,didn't,do,does,doesn,d
oesn't,doing,don,don't,down,during,each,few,for,from,further,had,hadn,hadn't,has,hasn,hasn't,
have,haven,haven't,having,he,he'd,he'll,her,here,hers,herself,he's,him,himself,his,how,i,i'd,if,i'
ll,i'm,in,into,is,isn,isn't,it,it'd,it'll,it's,its,itself,i've,just,ll,m,ma,me,mightn,mightn't,more,most,
mustn,mustn't,my,myself,needn,needn't,no,nor,not,now,o,of,off,on,once,only,or,other,our,our
s,ourselves,out,over,own,re,s,same,shan,shan't,she,she'd,she'll,she's,should,shouldn,shouldn't,
should've,so,some,such,t,than,that,that'll,the,their,theirs,them,themselves,then,there,these,they
,they'd,they'll,they're,they've,this,those,through,to,too,under,until,up,ve,very,was,wasn,wasn't,
we,we'd,we'll,we're,were,weren,weren't,we've,what,when,where,which,while,who,whom,why
,will,with,won,won't,wouldn,wouldn't,y,you,you'd,you'll,your,you're,yours,yourself,yourselve
s,you've"
import nltk
stopwords=nltk.corpus.stopwords.words('english')
def remove_stopwords(text):
    output=[i for i in text if i not in stopwords]
    return output
df['rem_stopwords']=df['tokenised_msg'].apply(lambda x:remove_stopwords(x))
df
Output:
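from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
porter_stemmer=PorterStemmer()
wordnet_lemmatizer=WordNetLemmatizer()
def stemming(text):
    stem_text=[porter_stemmer.stem(word) for word in text]
    return stem_text
def lemmatizer(text):
    if isinstance(text,str):
        words=word_tokenize(text)
        lemmatized_words=[wordnet_lemmatizer.lemmatize(word) for word in words]
        # join with spaces so the words stay separated
        return ' '.join(lemmatized_words)
    else:
        return text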
#applying function for stemming
df['Stemmed_msg']=df['rem_stopwords'].apply(lambda x:stemming(x))
Output:
#replace emojis with their text names
import emoji
#import demoji
#demoji.download_codes()
def emo(text):
    temp=emoji.demojize(text,delimiters=(" "," "))
    temp=temp.replace("_"," ")
    return temp
df['rem_emo']=df["rem_punct"].apply(lambda x:emo(x))
df.head(5)
Output:
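The task also asks for a version without library helpers, which the record does not include. A minimal sketch using only the standard library is given below. The function name `preprocess`, the sample sentence and the very small stop-word set are assumptions for illustration; NLTK's English stop-word list is much longer.

```python
import re
import string

# Small illustrative stop-word set (an assumption, not NLTK's list)
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "at", "for"}


def preprocess(text):
    """Lower-case, strip URLs and punctuation, tokenize on whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)
    text = "".join(ch for ch in text if ch not in string.punctuation)
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOPWORDS]


sample = "Check the NEW release at https://example.com, it is great!"
tokens = preprocess(sample)
print(tokens)
```

URLs are removed before punctuation, the same order as in the listing above, because stripping punctuation first would break the URL pattern.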
11. Write a Python program to implement the chi-square test for feature selection to train
an SVM classifier using a suitable dataset.
Date:14/08/2025
----------------------------------------------------------------------------------------------------------------
# 1. Import Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
# 2. Load Dataset
data = pd.read_csv('fruit_data_with_colours.csv')
print("First 5 rows of the dataset:")
print(data.head(5))
Output:
First 5 rows of the dataset:
fruit_label fruit_name fruit_subtype mass width height color_score
0 1 apple granny_smith 192 8.4 7.3 0.55
1 1 apple granny_smith 180 8.0 6.8 0.59
2 1 apple granny_smith 176 7.4 7.2 0.60
3 2 mandarin mandarin 86 6.2 4.7 0.80
4 2 mandarin mandarin 84 6.0 4.6 0.79
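Steps 3 to 5 (preparing X and y and applying the chi-square selection) are missing from the record. To show what a chi-square feature score measures, the sketch below computes it by hand for non-negative features: for each feature, the observed per-class sums are compared with the sums expected if the feature were independent of the class. The function name `chi2_scores` and the toy data are assumptions, and this is an illustration of the idea rather than the author's missing code.

```python
def chi2_scores(X, y):
    """Chi-square score of each (non-negative) feature against the class labels."""
    classes = sorted(set(y))
    n_samples = len(X)
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        feature_total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            # observed: total of this feature within class c
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            # expected: share of the total proportional to the class frequency
            class_prob = sum(1 for label in y if label == c) / n_samples
            expected = class_prob * feature_total
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


# Toy data: feature 0 differs between classes, feature 1 is constant
X = [[1, 5], [2, 5], [9, 5], [8, 5]]
y = [0, 0, 1, 1]
scores = chi2_scores(X, y)
# indices of features ranked from most to least informative
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(scores, ranking)
```

A selector such as `SelectKBest(chi2, k=...)` keeps the k highest-scoring features. The sketch assumes every feature has a nonzero total; a feature that is zero everywhere would need separate handling.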
Output:
Selected top features (Chi-Square):
Index(['mass', 'width', 'height', 'color_score'], dtype='object')
# 6. Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
X_selected, y, test_size=0.2, random_state=42
)
# 7. Train the SVM Classifier
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)
Output:
# 8. Predictions
y_pred = svm_classifier.predict(X_test)
# 9. Evaluation
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print("\nAccuracy:", accuracy)
print("\nClassification Report:")
print(report)
Output:
Accuracy: 0.75
Classification Report:
precision recall f1-score support
12. Write a program to implement the ANOVA test for feature selection to train an SVM
classifier using a suitable dataset.
Date:26/08/2025
----------------------------------------------------------------------------------------------------------------
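import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile,f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
X,y=load_iris(return_X_y=True)
# append 36 non-informative random features
mg=np.random.RandomState(0)
X=np.hstack((X,2*mg.random((X.shape[0],36))))
clf=Pipeline([("anova",SelectPercentile(f_classif)),
              ("scaler",StandardScaler()),
              ("svc",SVC(gamma="auto"))
             ])
score_means=[]
score_stds=[]
percentiles=[1,3,6,10,15,20,30,40,60,80,100]
# The scoring loop is absent from the record; this reconstruction assumes
# scikit-learn's default cross-validation for each percentile
for percentile in percentiles:
    clf.set_params(anova__percentile=percentile)
    this_scores=cross_val_score(clf,X,y)
    score_means.append(this_scores.mean())
    score_stds.append(this_scores.std())
plt.errorbar(percentiles,score_means,np.array(score_stds))
plt.title("Performance of the SVM-Anova varying the percentile of features selected")
plt.xticks(np.linspace(0,100,11,endpoint=True))
plt.xlabel("Percentile")
plt.axis("tight")
plt.show()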
Output:
13. Write a program to classify the IRIS dataset using the support vector classifier (SVC)
algorithm with data visualization, pre-processing and performance evaluation.
Date:26/08/2025
----------------------------------------------------------------------------------------------------------------
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
from sklearn import datasets
iris=datasets.load_iris()
X=iris.data
Y=iris.target
df=pd.DataFrame(X,columns=iris.feature_names)
df['target']=Y
df['target']=df['target'].map(dict(enumerate(iris.target_names)))
sns.pairplot(df,hue="target",palette="Set2")
plt.suptitle("pairplot of iris Dataset",y=1.02)
plt.show()
Output:
plt.figure(figsize=(8,6))
sns.heatmap(df.iloc[:,:-1].corr(),annot=True,cmap='coolwarm')
plt.title("Correlation Heatmap of iris Features")
plt.show()
Output:
Output:
Missing values in dataset:
sepal length (cm) 0
sepal width (cm) 0
petal length (cm) 0
petal width (cm) 0
target 0
dtype: int64
X=iris.data
Y=iris.target
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3,random_state=1)
print("X_train",X_train)
print("X_test",X_test)
print("Y_train",Y_train)
print("Y_test",Y_test)
Output:
X_train [[7.7 2.6 6.9 2.3]
[5.7 3.8 1.7 0.3]
[5. 3.6 1.4 0.2]
[4.8 3. 1.4 0.3]
[5.2 2.7 3.9 1.4]
[5.1 3.4 1.5 0.2]
[5.5 3.5 1.3 0.2]
[7.7 3.8 6.7 2.2]
[6.9 3.1 5.4 2.1]
[7.3 2.9 6.3 1.8]
[6.4 2.8 5.6 2.2]
[6.2 2.8 4.8 1.8]
[6. 3.4 4.5 1.6]
[7.7 2.8 6.7 2. ]
[5.7 3. 4.2 1.2]
[4.8 3.4 1.6 0.2]
[5.7 2.5 5. 2. ]
[6.3 2.7 4.9 1.8]
[4.8 3. 1.4 0.1]
[4.7 3.2 1.3 0.2]
[6.5 3. 5.8 2.2]
[4.6 3.4 1.4 0.3]
[6.1 3. 4.9 1.8]
[6.5 3.2 5.1 2. ]
[6.7 3.1 4.4 1.4]
[5.7 2.8 4.5 1.3]
[6.7 3.3 5.7 2.5]
[6. 3. 4.8 1.8]
[5.1 3.8 1.6 0.2]
[6. 2.2 4. 1. ]
[6.4 2.9 4.3 1.3]
[6.5 3. 5.5 1.8]
[5. 2.3 3.3 1. ]
[6.3 3.3 6. 2.5]
[5.5 2.5 4. 1.3]
[5.4 3.7 1.5 0.2]
[4.9 3.1 1.5 0.2]
[5.2 4.1 1.5 0.1]
[6.7 3.3 5.7 2.1]
[4.4 3. 1.3 0.2]
[6. 2.7 5.1 1.6]
[6.4 2.7 5.3 1.9]
[5.9 3. 5.1 1.8]
[5.2 3.5 1.5 0.2]
[5.1 3.3 1.7 0.5]
[5.8 2.7 4.1 1. ]
[4.9 3.1 1.5 0.1]
[7.4 2.8 6.1 1.9]
[6.2 2.9 4.3 1.3]
[7.6 3. 6.6 2.1]
[6.7 3. 5.2 2.3]
[6.3 2.3 4.4 1.3]
[6.2 3.4 5.4 2.3]
[7.2 3.6 6.1 2.5]
[5.6 2.9 3.6 1.3]
[5.7 4.4 1.5 0.4]
[5.8 2.7 3.9 1.2]
[4.5 2.3 1.3 0.3]
[5.5 2.4 3.8 1.1]
[6.9 3.1 4.9 1.5]
[5. 3.4 1.6 0.4]
[6.8 2.8 4.8 1.4]
[5. 3.5 1.6 0.6]
[4.8 3.4 1.9 0.2]
[6.3 3.4 5.6 2.4]
[5.6 2.8 4.9 2. ]
[6.8 3.2 5.9 2.3]
[5. 3.3 1.4 0.2]
[5.1 3.7 1.5 0.4]
[5.9 3.2 4.8 1.8]
[4.6 3.1 1.5 0.2]
[5.8 2.7 5.1 1.9]
[4.8 3.1 1.6 0.2]
[6.5 3. 5.2 2. ]
[4.9 2.5 4.5 1.7]
[4.6 3.2 1.4 0.2]
[6.4 3.2 5.3 2.3]
[4.3 3. 1.1 0.1]
[5.6 3. 4.1 1.3]
[4.4 2.9 1.4 0.2]
[5.5 2.4 3.7 1. ]
[5. 2. 3.5 1. ]
[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.9 2.4 3.3 1. ]
[4.6 3.6 1. 0.2]
[5.9 3. 4.2 1.5]
[6.1 2.9 4.7 1.4]
[5. 3.4 1.5 0.2]
[6.7 3.1 4.7 1.5]
[5.7 2.9 4.2 1.3]
[6.2 2.2 4.5 1.5]
[7. 3.2 4.7 1.4]
[5.8 2.7 5.1 1.9]
[5.4 3.4 1.7 0.2]
[5. 3. 1.6 0.2]
[6.1 2.6 5.6 1.4]
[6.1 2.8 4. 1.3]
[7.2 3. 5.8 1.6]
[5.7 2.6 3.5 1. ]
[6.3 2.8 5.1 1.5]
[6.4 3.1 5.5 1.8]
[6.3 2.5 4.9 1.5]
[6.7 3.1 5.6 2.4]
[4.9 3.6 1.4 0.1]]
X_test [[5.8 4. 1.2 0.2]
[5.1 2.5 3. 1.1]
[6.6 3. 4.4 1.4]
[5.4 3.9 1.3 0.4]
[7.9 3.8 6.4 2. ]
[6.3 3.3 4.7 1.6]
[6.9 3.1 5.1 2.3]
[5.1 3.8 1.9 0.4]
[4.7 3.2 1.6 0.2]
[6.9 3.2 5.7 2.3]
[5.6 2.7 4.2 1.3]
[5.4 3.9 1.7 0.4]
[7.1 3. 5.9 2.1]
[6.4 3.2 4.5 1.5]
[6. 2.9 4.5 1.5]
[4.4 3.2 1.3 0.2]
[5.8 2.6 4. 1.2]
[5.6 3. 4.5 1.5]
[5.4 3.4 1.5 0.4]
[5. 3.2 1.2 0.2]
[5.5 2.6 4.4 1.2]
[5.4 3. 4.5 1.5]
[6.7 3. 5. 1.7]
[5. 3.5 1.3 0.3]
[7.2 3.2 6. 1.8]
[5.7 2.8 4.1 1.3]
[5.5 4.2 1.4 0.2]
[5.1 3.8 1.5 0.3]
[6.1 2.8 4.7 1.2]
[6.3 2.5 5. 1.9]
[6.1 3. 4.6 1.4]
[7.7 3. 6.1 2.3]
[5.6 2.5 3.9 1.1]
[6.4 2.8 5.6 2.1]
[5.8 2.8 5.1 2.4]
[5.3 3.7 1.5 0.2]
[5.5 2.3 4. 1.3]
[5.2 3.4 1.4 0.2]
[6.5 2.8 4.6 1.5]
[6.7 2.5 5.8 1.8]
[6.8 3. 5.5 2.1]
[5.1 3.5 1.4 0.3]
[6. 2.2 5. 1.5]
[6.3 2.9 5.6 1.8]
[6.6 2.9 4.6 1.3]]
Y_train [2 0 0 0 1 0 0 2 2 2 2 2 1 2 1 0 2 2 0 0 2 0 2 2 1 1 2 2 0 1 1 2 1 2 1 0 0
 0 2 0 1 2 2 0 0 1 0 2 1 2 2 1 2 2 1 0 1 0 1 1 0 1 0 0 2 2 2 0 0 1 0 2 0 2
2 0 2 0 1 0 1 1 0 0 1 0 1 1 0 1 1 1 1 2 0 0 2 1 2 1 2 2 1 2 0]
Y_test [0 1 1 0 2 1 2 0 0 2 1 0 2 1 1 0 1 1 0 0 1 1 1 0 2 1 0 0 1 2 1 2 1 2 2 0 1
0 1 2 2 0 2 2 1]
sc=StandardScaler()
sc.fit(X_train)
X_train_std=sc.transform(X_train)
X_test_std=sc.transform(X_test)
print("X_train_std=\n",X_train_std)
print("X_test_std=\n",X_test_std)
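Standardization rescales each feature to z = (x - mean) / std, where the mean and standard deviation are learned from the training data only and then reused for the test data. The following is a minimal hand-written sketch of that idea; the helper names `fit_scaler` and `transform` and the small sample rows are assumptions for illustration, and the sketch uses the population standard deviation (dividing by n).

```python
def fit_scaler(rows):
    """Per-column mean and population standard deviation of the training rows."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [
        (sum((v - m) ** 2 for v in col) / n) ** 0.5
        for col, m in zip(columns, means)
    ]
    return means, stds


def transform(rows, means, stds):
    """Apply z = (x - mean) / std column by column."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Made-up sample rows with two features
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test = [[4.0, 20.0]]
means, stds = fit_scaler(train)          # statistics come from training data only
train_std = transform(train, means, stds)
test_std = transform(test, means, stds)  # test data reuses the training statistics
print(train_std)
print(test_std)
```

Fitting on the training rows alone avoids leaking information from the test set into the preprocessing step. A column with zero spread would need special handling, which this sketch leaves out.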
Output:
X_train_std=
[[ 2.26050169e+00 -1.05089682e+00 1.77622921e+00 1.42370971e+00]
[-1.18973773e-01 1.82764665e+00 -1.14491883e+00 -1.14263397e+00]
[-9.51790185e-01 1.34788940e+00 -1.31344660e+00 -1.27095115e+00]
[-1.18973773e+00 -9.13823325e-02 -1.31344660e+00 -1.14263397e+00]
[-7.13842639e-01 -8.11018201e-01 9.09514958e-02 2.68855052e-01]
[-8.32816412e-01 8.68132159e-01 -1.25727068e+00 -1.27095115e+00]
[-3.56921319e-01 1.10801078e+00 -1.36962252e+00 -1.27095115e+00]
[ 2.26050169e+00 1.82764665e+00 1.66387736e+00 1.29539252e+00]
[ 1.30871150e+00 1.48496290e-01 9.33590353e-01 1.16707534e+00]
[ 1.78460660e+00 -3.31260955e-01 1.43917367e+00 7.82123787e-01]
[ 7.13842639e-01 -5.71139578e-01 1.04594220e+00 1.29539252e+00]
[ 4.75895093e-01 -5.71139578e-01 5.96534810e-01 7.82123787e-01]
[ 2.37947546e-01 8.68132159e-01 4.28007039e-01 5.25489419e-01]
[ 2.26050169e+00 -5.71139578e-01 1.66387736e+00 1.03875815e+00]
[-1.18973773e-01 -9.13823325e-02 2.59479267e-01 1.22206842e-02]
[-1.18973773e+00 8.68132159e-01 -1.20109475e+00 -1.27095115e+00]
[-1.18973773e-01 -1.29077545e+00 7.08886658e-01 1.03875815e+00]
[ 5.94868866e-01 -8.11018201e-01 6.52710734e-01 7.82123787e-01]
[-1.18973773e+00 -9.13823325e-02 -1.31344660e+00 -1.39926834e+00]
[-1.30871150e+00 3.88374913e-01 -1.36962252e+00 -1.27095115e+00]
[ 8.32816412e-01 -9.13823325e-02 1.15829405e+00 1.29539252e+00]
[-1.42768528e+00 8.68132159e-01 -1.31344660e+00 -1.14263397e+00]
[ 3.56921319e-01 -9.13823325e-02 6.52710734e-01 7.82123787e-01]
[ 8.32816412e-01 3.88374913e-01 7.65062582e-01 1.03875815e+00]
[ 1.07076396e+00 1.48496290e-01 3.71831115e-01 2.68855052e-01]
[-1.18973773e-01 -5.71139578e-01 4.28007039e-01 1.40537868e-01]
[ 1.07076396e+00 6.28253536e-01 1.10211813e+00 1.68034407e+00]
[ 2.37947546e-01 -9.13823325e-02 5.96534810e-01 7.82123787e-01]
[-8.32816412e-01 1.82764665e+00 -1.20109475e+00 -1.27095115e+00]
[ 2.37947546e-01 -2.01041131e+00 1.47127420e-01 -2.44413683e-01]
[ 7.13842639e-01 -3.31260955e-01 3.15655191e-01 1.40537868e-01]
[ 8.32816412e-01 -9.13823325e-02 9.89766277e-01 7.82123787e-01]
[-9.51790185e-01 -1.77053269e+00 -2.46104047e-01 -2.44413683e-01]
[ 5.94868866e-01 6.28253536e-01 1.27064590e+00 1.68034407e+00]
[-3.56921319e-01 -1.29077545e+00 1.47127420e-01 1.40537868e-01]
[-4.75895093e-01 1.58776803e+00 -1.25727068e+00 -1.27095115e+00]
[-1.07076396e+00 1.48496290e-01 -1.25727068e+00 -1.27095115e+00]
[-7.13842639e-01 2.54728252e+00 -1.25727068e+00 -1.39926834e+00]
[ 1.07076396e+00 6.28253536e-01 1.10211813e+00 1.16707534e+00]
[-1.66563282e+00 -9.13823325e-02 -1.36962252e+00 -1.27095115e+00]
[ 2.37947546e-01 -8.11018201e-01 7.65062582e-01 5.25489419e-01]
[ 7.13842639e-01 -8.11018201e-01 8.77414430e-01 9.10440971e-01]
[ 1.18973773e-01 -9.13823325e-02 7.65062582e-01 7.82123787e-01]
[-7.13842639e-01 1.10801078e+00 -1.25727068e+00 -1.27095115e+00]
[-8.32816412e-01 6.28253536e-01 -1.14491883e+00 -8.85999602e-01]
[-1.05669938e-15 -8.11018201e-01 2.03303343e-01 -2.44413683e-01]
[-1.07076396e+00 1.48496290e-01 -1.25727068e+00 -1.39926834e+00]
[ 1.90358037e+00 -5.71139578e-01 1.32682182e+00 9.10440971e-01]
[ 4.75895093e-01 -3.31260955e-01 3.15655191e-01 1.40537868e-01]
[ 2.14152792e+00 -9.13823325e-02 1.60770144e+00 1.16707534e+00]
[ 1.07076396e+00 -9.13823325e-02 8.21238506e-01 1.42370971e+00]
[ 5.94868866e-01 -1.77053269e+00 3.71831115e-01 1.40537868e-01]
[ 4.75895093e-01 8.68132159e-01 9.33590353e-01 1.42370971e+00]
[ 1.66563282e+00 1.34788940e+00 1.32682182e+00 1.68034407e+00]
[-2.37947546e-01 -3.31260955e-01 -7.75762758e-02 1.40537868e-01]
[-1.18973773e-01 3.26691839e+00 -1.25727068e+00 -1.01431679e+00]
[-1.05669938e-15 -8.11018201e-01 9.09514958e-02 1.22206842e-02]
[-1.54665905e+00 -1.77053269e+00 -1.36962252e+00 -1.14263397e+00]
[-3.56921319e-01 -1.53065407e+00 3.47755719e-02 -1.16096500e-01]
[ 1.30871150e+00 1.48496290e-01 6.52710734e-01 3.97172236e-01]
[-9.51790185e-01 8.68132159e-01 -1.20109475e+00 -1.01431679e+00]
[ 1.18973773e+00 -5.71139578e-01 5.96534810e-01 2.68855052e-01]
[-9.51790185e-01 1.10801078e+00 -1.20109475e+00 -7.57682419e-01]
[-1.18973773e+00 8.68132159e-01 -1.03256698e+00 -1.27095115e+00]
[ 5.94868866e-01 8.68132159e-01 1.04594220e+00 1.55202689e+00]
[-2.37947546e-01 -5.71139578e-01 6.52710734e-01 1.03875815e+00]
[ 1.18973773e+00 3.88374913e-01 1.21446997e+00 1.42370971e+00]
[-9.51790185e-01 6.28253536e-01 -1.31344660e+00 -1.27095115e+00]
[-8.32816412e-01 1.58776803e+00 -1.25727068e+00 -1.01431679e+00]
[ 1.18973773e-01 3.88374913e-01 5.96534810e-01 7.82123787e-01]
[-1.42768528e+00 1.48496290e-01 -1.25727068e+00 -1.27095115e+00]
[-1.05669938e-15 -8.11018201e-01 7.65062582e-01 9.10440971e-01]
[-1.18973773e+00 1.48496290e-01 -1.20109475e+00 -1.27095115e+00]
[ 8.32816412e-01 -9.13823325e-02 8.21238506e-01 1.03875815e+00]
[-1.07076396e+00 -1.29077545e+00 4.28007039e-01 6.53806603e-01]
[-1.42768528e+00 3.88374913e-01 -1.31344660e+00 -1.27095115e+00]
[ 7.13842639e-01 3.88374913e-01 8.77414430e-01 1.42370971e+00]
[-1.78460660e+00 -9.13823325e-02 -1.48197437e+00 -1.39926834e+00]
[-2.37947546e-01 -9.13823325e-02 2.03303343e-01 1.40537868e-01]
[-1.66563282e+00 -3.31260955e-01 -1.31344660e+00 -1.27095115e+00]
[-3.56921319e-01 -1.53065407e+00 -2.14003519e-02 -2.44413683e-01]
[-9.51790185e-01 -2.49016856e+00 -1.33752200e-01 -2.44413683e-01]
[-8.32816412e-01 1.10801078e+00 -1.31344660e+00 -1.27095115e+00]
[-1.07076396e+00 -9.13823325e-02 -1.31344660e+00 -1.27095115e+00]
[-1.07076396e+00 -1.53065407e+00 -2.46104047e-01 -2.44413683e-01]
[-1.42768528e+00 1.34788940e+00 -1.53815030e+00 -1.27095115e+00]
[ 1.18973773e-01 -9.13823325e-02 2.59479267e-01 3.97172236e-01]
[ 3.56921319e-01 -3.31260955e-01 5.40358887e-01 2.68855052e-01]
[-9.51790185e-01 8.68132159e-01 -1.25727068e+00 -1.27095115e+00]
[ 1.07076396e+00 1.48496290e-01 5.40358887e-01 3.97172236e-01]
[-1.18973773e-01 -3.31260955e-01 2.59479267e-01 1.40537868e-01]
[ 4.75895093e-01 -2.01041131e+00 4.28007039e-01 3.97172236e-01]
[ 1.42768528e+00 3.88374913e-01 5.40358887e-01 2.68855052e-01]
[-1.05669938e-15 -8.11018201e-01 7.65062582e-01 9.10440971e-01]
[-4.75895093e-01 8.68132159e-01 -1.14491883e+00 -1.27095115e+00]
[-9.51790185e-01 -9.13823325e-02 -1.20109475e+00 -1.27095115e+00]
[ 3.56921319e-01 -1.05089682e+00 1.04594220e+00 2.68855052e-01]
[ 3.56921319e-01 -5.71139578e-01 1.47127420e-01 1.40537868e-01]
[ 1.66563282e+00 -9.13823325e-02 1.15829405e+00 5.25489419e-01]
[-1.18973773e-01 -1.05089682e+00 -1.33752200e-01 -2.44413683e-01]
[ 5.94868866e-01 -5.71139578e-01 7.65062582e-01 3.97172236e-01]
[ 7.13842639e-01 1.48496290e-01 9.89766277e-01 7.82123787e-01]
[ 5.94868866e-01 -1.29077545e+00 6.52710734e-01 3.97172236e-01]
[ 1.07076396e+00 1.48496290e-01 1.04594220e+00 1.55202689e+00]
[-1.07076396e+00 1.34788940e+00 -1.31344660e+00 -1.39926834e+00]]
X_test_std=
[[-1.05669938e-15 2.30740390e+00 -1.42579845e+00 -1.27095115e+00]
[-8.32816412e-01 -1.29077545e+00 -4.14631819e-01 -1.16096500e-01]
[ 9.51790185e-01 -9.13823325e-02 3.71831115e-01 2.68855052e-01]
[-4.75895093e-01 2.06752527e+00 -1.36962252e+00 -1.01431679e+00]
[ 2.49844924e+00 1.82764665e+00 1.49534959e+00 1.03875815e+00]
[ 5.94868866e-01 6.28253536e-01 5.40358887e-01 5.25489419e-01]
[ 1.30871150e+00 1.48496290e-01 7.65062582e-01 1.42370971e+00]
[-8.32816412e-01 1.82764665e+00 -1.03256698e+00 -1.01431679e+00]
[-1.30871150e+00 3.88374913e-01 -1.20109475e+00 -1.27095115e+00]
[ 1.30871150e+00 3.88374913e-01 1.10211813e+00 1.42370971e+00]
[-2.37947546e-01 -8.11018201e-01 2.59479267e-01 1.40537868e-01]
[-4.75895093e-01 2.06752527e+00 -1.14491883e+00 -1.01431679e+00]
[ 1.54665905e+00 -9.13823325e-02 1.21446997e+00 1.16707534e+00]
[ 7.13842639e-01 3.88374913e-01 4.28007039e-01 3.97172236e-01]
[ 2.37947546e-01 -3.31260955e-01 4.28007039e-01 3.97172236e-01]
[-1.66563282e+00 3.88374913e-01 -1.36962252e+00 -1.27095115e+00]
[-1.05669938e-15 -1.05089682e+00 1.47127420e-01 1.22206842e-02]
[-2.37947546e-01 -9.13823325e-02 4.28007039e-01 3.97172236e-01]
[-4.75895093e-01 8.68132159e-01 -1.25727068e+00 -1.01431679e+00]
[-9.51790185e-01 3.88374913e-01 -1.42579845e+00 -1.27095115e+00]
[-3.56921319e-01 -1.05089682e+00 3.71831115e-01 1.22206842e-02]
[-4.75895093e-01 -9.13823325e-02 4.28007039e-01 3.97172236e-01]
[ 1.07076396e+00 -9.13823325e-02 7.08886658e-01 6.53806603e-01]
[-9.51790185e-01 1.10801078e+00 -1.36962252e+00 -1.14263397e+00]
[ 1.66563282e+00 3.88374913e-01 1.27064590e+00 7.82123787e-01]
[-1.18973773e-01 -5.71139578e-01 2.03303343e-01 1.40537868e-01]
[-3.56921319e-01 2.78716114e+00 -1.31344660e+00 -1.27095115e+00]
[-8.32816412e-01 1.82764665e+00 -1.25727068e+00 -1.14263397e+00]
[ 3.56921319e-01 -5.71139578e-01 5.40358887e-01 1.22206842e-02]
[ 5.94868866e-01 -1.29077545e+00 7.08886658e-01 9.10440971e-01]
[ 3.56921319e-01 -9.13823325e-02 4.84182963e-01 2.68855052e-01]
[ 2.26050169e+00 -9.13823325e-02 1.32682182e+00 1.42370971e+00]
[-2.37947546e-01 -1.29077545e+00 9.09514958e-02 -1.16096500e-01]
[ 7.13842639e-01 -5.71139578e-01 1.04594220e+00 1.16707534e+00]
[-1.05669938e-15 -5.71139578e-01 7.65062582e-01 1.55202689e+00]
[-5.94868866e-01 1.58776803e+00 -1.25727068e+00 -1.27095115e+00]
[-3.56921319e-01 -1.77053269e+00 1.47127420e-01 1.40537868e-01]
[-7.13842639e-01 8.68132159e-01 -1.31344660e+00 -1.27095115e+00]
[ 8.32816412e-01 -5.71139578e-01 4.84182963e-01 3.97172236e-01]
[ 1.07076396e+00 -1.29077545e+00 1.15829405e+00 7.82123787e-01]
[ 1.18973773e+00 -9.13823325e-02 9.89766277e-01 1.16707534e+00]
[-8.32816412e-01 1.10801078e+00 -1.31344660e+00 -1.14263397e+00]
[ 2.37947546e-01 -2.01041131e+00 7.08886658e-01 3.97172236e-01]
[ 5.94868866e-01 -3.31260955e-01 1.04594220e+00 7.82123787e-01]
[ 9.51790185e-01 -3.31260955e-01 4.84182963e-01 1.40537868e-01]]
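As a cross-check on the scaled values above, here is a minimal sketch of what standardization does to a single feature column. The mean and standard deviation are estimated from the training values only and then reused for the test values. The helper names and the tiny sample values are illustrative assumptions and are not part of the original program. The population standard deviation is used because that is what scikit-learn's StandardScaler computes.

```python
import statistics

def fit_column(values):
    """Return the mean and population standard deviation of one column."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean, std

def transform_column(values, mean, std):
    """Scale values to z-scores using previously fitted statistics."""
    return [(v - mean) / std for v in values]

# Hypothetical feature values, chosen only for illustration.
train = [5.0, 6.0, 7.0]
test = [6.0, 8.0]

# Fit on the training column, then transform both columns with the same statistics.
mean, std = fit_column(train)
train_std = transform_column(train, mean, std)
test_std = transform_column(test, mean, std)

print("mean:", mean, "std:", round(std, 4))
print("train_std:", [round(z, 4) for z in train_std])
print("test_std:", [round(z, 4) for z in test_std])
```

After scaling, the training column has mean 0 and standard deviation 1. The test column generally does not, because it is scaled with the training statistics.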
# Train a linear SVM on the standardized features and score it on the test split
svm=SVC(kernel='linear',random_state=1,C=0.1)
svm.fit(X_train_std,Y_train)
Y_pred=svm.predict(X_test_std)
acc=accuracy_score(Y_test,Y_pred)
print('Accuracy:%3f'% acc)
Output:
Accuracy:0.955556
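For reference, accuracy is the fraction of test labels that are predicted correctly. With the 45 test samples listed in Y_test, a value of 0.955556 corresponds to 43 correct predictions (43/45). The sketch below computes accuracy and a confusion matrix by hand on short, made-up label lists. The function names and the data are hypothetical and only illustrate what accuracy_score and confusion_matrix report.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def confusion(y_true, y_pred, n_classes):
    """Count matrix: rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Made-up labels for three classes (0, 1, 2); these are not the iris results.
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 0, 1, 2, 2, 2, 1, 1]

acc = accuracy(y_true, y_pred)
cm = confusion(y_true, y_pred, 3)
print("Accuracy:%3f" % acc)
for row in cm:
    print(row)
```

The diagonal of the matrix holds the correct predictions, so accuracy equals the sum of the diagonal divided by the total number of samples.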
Output:
Classification Report:
precision recall f1-score support
# Plot the confusion matrix as an annotated heatmap
cm=confusion_matrix(Y_test,Y_pred)
plt.figure(figsize=(6,5))
sns.heatmap(cm,annot=True,fmt='d',cmap='Blues',
            xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
Output:
Write a program to implement K-means clustering. Use a data visualization
technique to illustrate the clustering.
Date:28/08/2025
----------------------------------------------------------------------------------------------------------------
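As background for the program below, here is a minimal sketch of the K-means loop (Lloyd's algorithm) and of the within-cluster sum of squares (WCSS) that the elbow method plots. It is written in plain Python on a handful of made-up 2-D points. The function names, the fixed starting centroids, and the data are illustrative assumptions, and scikit-learn's KMeans adds k-means++ initialization and other refinements that are not reproduced here.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points]

def update(points, labels, centroids):
    """Move each centroid to the mean of its assigned points."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:            # keep an empty cluster's centroid in place
            new_centroids.append(old)
            continue
        new_centroids.append((sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members)))
    return new_centroids

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until labels stop changing."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

def wcss(points, labels, centroids):
    """Within-cluster sum of squares (called inertia_ in scikit-learn)."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

# Made-up (income, score) pairs forming two obvious groups.
points = [(15, 39), (16, 41), (17, 40), (80, 90), (82, 88), (81, 92)]
labels, centroids = kmeans(points, centroids=[(15, 39), (80, 90)])
print("labels:", labels)
print("centroids:", centroids)
print("WCSS:", wcss(points, labels, centroids))
```

The smallest achievable WCSS never increases as the number of clusters grows, which is why the elbow method looks for the value of k after which the decrease levels off instead of simply picking the minimum.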
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
dataset=pd.read_csv("Mall_Customers.csv")
dataset.head(5)
Output:
   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
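# Use annual income and spending score as the two clustering features
x=dataset.iloc[:,[3,4]].values
from sklearn.cluster import KMeans
# Elbow method: record the WCSS (inertia) for k = 1 to 10
wcss_list=[]
for i in range(1,11):
    kmeans=KMeans(n_clusters=i,init='k-means++',random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1,11),wcss_list)
mtp.title('The Elbow Method cluster(k)')
mtp.xlabel('Number of clusters(k)')
mtp.ylabel('wcss_list')
mtp.show()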
Output:
# Fit the final model with 5 clusters (the loop variable i would be 10 here)
kmeans=KMeans(n_clusters=5,init='k-means++',random_state=42)
y_predict=kmeans.fit_predict(x)
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s=100, c='blue', label='Cluster 1')
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s=100, c='green', label='Cluster 2')
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s=100, c='red', label='Cluster 3')
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s=100, c='cyan', label='Cluster 4')
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s=100, c='magenta', label='Cluster 5')
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow',
            label='Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()
Output:
Write a program to implement a hierarchical clustering algorithm. Use a data
visualization technique to illustrate the clustering.
Date:28/08/2025
----------------------------------------------------------------------------------------------------------------
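As background for the program below, the following is a minimal sketch of agglomerative (bottom-up) clustering with a Ward-style merge rule. Every point starts as its own cluster, and at each step the two clusters whose merge increases the within-cluster sum of squares the least are joined. The function names and sample points are made up for illustration, and the naive search over all pairs is far slower than the scipy and scikit-learn routines used in the program.

```python
def centroid(cluster):
    """Mean point of a cluster of 2-D points."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    d2 = (ca[0] - cb[0]) ** 2 + (ca[1] - cb[1]) ** 2
    return len(a) * len(b) / (len(a) + len(b)) * d2

def agglomerate(points, n_clusters):
    """Merge the cheapest pair repeatedly until n_clusters remain."""
    clusters = [[p] for p in points]           # start with singletons
    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs,
                   key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]                         # j > i, so index i stays valid
    return clusters

# Made-up (income, score) pairs forming two obvious groups.
points = [(15, 39), (16, 41), (17, 40), (80, 90), (82, 88), (81, 92)]
result = agglomerate(points, n_clusters=2)
for cluster in result:
    print(sorted(cluster))
```

A dendrogram records the same sequence of merges together with the cost of each one, so cutting it at a chosen height is equivalent to stopping this loop early.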
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
df=pd.read_csv("Mall_Customers.csv")
df.head()
Output:
   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
df.isnull().sum()
Output:
CustomerID                0
Gender                    0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64
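# Use annual income and spending score as the two clustering features
x=df.iloc[:,[3,4]].values
import scipy.cluster.hierarchy as sch
# Build the Ward linkage and draw the dendrogram
dendrogram=sch.dendrogram(sch.linkage(x,method='ward'))
plt.title("Dendrogram")
plt.xlabel("Customer")
plt.ylabel("Euclidean distance")
plt.show()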
Output:
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters=5, linkage='ward')
y_hc = hc.fit_predict(x)
plt.scatter(x[y_hc == 0, 0], x[y_hc == 0, 1], s=100, c="red", label="cluster 1")
plt.scatter(x[y_hc == 1, 0], x[y_hc == 1, 1], s=100, c="blue", label="cluster 2")
plt.scatter(x[y_hc == 2, 0], x[y_hc == 2, 1], s=100, c="green", label="cluster 3")
plt.scatter(x[y_hc == 3, 0], x[y_hc == 3, 1], s=100, c="cyan", label="cluster 4")
plt.scatter(x[y_hc == 4, 0], x[y_hc == 4, 1], s=100, c="orange", label="cluster 5")
plt.title("Clusters of customers")
plt.xlabel("Annual Income")
plt.ylabel("Spending Score (1-100)")
plt.legend()
plt.show()
Output:
Write a program to implement grid-based clustering using a suitable dataset.
Visualize the data using a scatter plot.
Date:
---------------------------------------------------------------------------------------------------------------
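As background for the program below, this is a minimal sketch of one simple way to turn a grid into clusters. Points are binned into square cells, cells holding at least a minimum number of points are treated as dense, and dense cells that touch (including diagonally) are joined into one cluster. The cell size, density threshold, neighbor rule, function names, and sample points are all illustrative assumptions and are not taken from the original program.

```python
from collections import defaultdict, deque

def grid_clusters(points, cell_size, min_points):
    """Label points by connected groups of dense grid cells (-1 = noise)."""
    # 1. Bin every point into an integer (column, row) cell.
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(int(x // cell_size), int(y // cell_size))].append(idx)

    # 2. Keep only cells that hold enough points.
    dense = {c for c, members in cells.items() if len(members) >= min_points}

    # 3. Flood-fill over neighboring dense cells to form clusters.
    labels = [-1] * len(points)
    seen = set()
    cluster_id = 0
    for start in sorted(dense):
        if start in seen:
            continue
        queue = deque([start])
        seen.add(start)
        while queue:
            cx, cy = queue.popleft()
            for idx in cells[(cx, cy)]:
                labels[idx] = cluster_id
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        cluster_id += 1
    return labels

# Made-up (income, score) pairs: two groups and one isolated point.
points = [(12, 31), (14, 35), (18, 38), (21, 36), (24, 33),
          (81, 92), (85, 95), (88, 91), (50, 10)]
labels = grid_clusters(points, cell_size=10, min_points=2)
print(labels)
```

The binning in step 1 plays the same role as the np.histogram2d call in the program. Steps 2 and 3 are one possible way to go from cell counts to cluster labels, and the result is sensitive to both the cell size and the threshold.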
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("Mall_Customers.csv")
df.head(5)
Output:
import numpy as np
def grid_based_clustering(X, grid_size):
    # Evenly spaced cell boundaries along each feature axis
    x_edges = np.linspace(np.min(X[:, 0]), np.max(X[:, 0]), grid_size[0] + 1)
    y_edges = np.linspace(np.min(X[:, 1]), np.max(X[:, 1]), grid_size[1] + 1)
    # Count how many points fall into each grid cell
    grid_cells, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=[x_edges, y_edges])
    return grid_cells, x_edges, y_edges
Output: