Python Programs for Data Analysis

The document contains Python programs for performing central tendency measures (mean, median, mode) and measures of dispersion (range, variance, standard deviation, IQR) both with and without using built-in functions. It also includes programs to read multiple files from single and multiple folders, and to read and display various types of data (image, text, numeric, audio, video) using different libraries. Additionally, it provides examples of linear regression using single and multiple variables.

Uploaded by

pavansabaloor
Copyright
© All Rights Reserved

1. Write a program to perform central tendency (mean, median, mode) with and
without using built-in functions on the data.
Date:07/07/2025
----------------------------------------------------------------------------------------------------------------

WITH BUILT-IN:

import numpy as np
import statistics as stat
data=12,12,23,24,56,32,23,23
print(".....................USING BUILT IN FUNCTION.................")
mean=np.mean(data)
median=np.median(data)
mod=stat.mode(data)
print("The mean value is:",mean)
print("The median value is:",median)
print("The mod value is:",mod)

Output:
.....................USING BUILT IN FUNCTION.................
The mean value is: 25.625
The median value is: 23.0
The mod value is: 23

WITHOUT BUILT-IN:

data=[]
n=int(input("Enter number of elements:"))
for i in range(n):
    ele=int(input(f"Enter the element {i+1}:"))
    data.append(ele)
print("The created list is:",data)
mean=sum(data)/n
print("Mean without built in function:",mean)
sorted_data=sorted(data)
print("The sorted data is:",sorted_data)
# median: middle value, or the average of the two middle values when n is even
if n%2==0:
    mid_index1=n//2
    mid_index2=mid_index1-1
    median=(sorted_data[mid_index1] + sorted_data[mid_index2]) / 2
else:
    mid_index=n//2
    median=sorted_data[mid_index]
print("Median is:",median)
# mode: count each element, then keep every element with the highest count
count={}
for element in data:
    if element in count:
        count[element] += 1
    else:
        count[element]=1
max_count=max(count.values())
mode=[element for element, c in count.items() if c==max_count]
print("Mode is",mode)

Output:
Enter number of elements: 5
Enter the element 1: 6
Enter the element 2: 3
Enter the element 3: 2
Enter the element 4: 8
Enter the element 5: 6
The created list is: [6, 3, 2, 8, 6]
Mean without built in function: 5.0
The sorted data is: [2, 3, 6, 6, 8]
Median is: 6
Mode is [6]
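
The same three measures can also be wrapped in small functions so they are testable without keyboard input. This is only a sketch: the function names are illustrative and not part of the original program, and the sample values are the ones used in the built-in version above.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def median(values):
    """Middle value, or the average of the two middle values."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 0:
        return (ordered[mid - 1] + ordered[mid]) / 2
    return ordered[mid]


def modes(values):
    """Every value that occurs most often (there can be more than one)."""
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    top = max(counts.values())
    return [value for value, c in counts.items() if c == top]


sample = [12, 12, 23, 24, 56, 32, 23, 23]
print(mean(sample), median(sample), modes(sample))
```

Returning a list from `modes` keeps ties visible, which `statistics.mode` does not do.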

2. Write a Python program to:
i) read multiple files from single folder
ii) read multiple files from multiple folders.
Date:
----------------------------------------------------------------------------------------------------------------
i) read multiple files from single folder

import os
path=os.getcwd()
main_folder_name=os.path.basename(path)
print(f"main folder name:{main_folder_name}")
# print the contents of every .txt file in the current folder
for file in os.listdir(path):
    if file.endswith(".txt"):
        file_path=os.path.join(path,file)
        print(f"file path:{file_path}")
        with open(file_path,'r') as f:
            print(f.read())

Output:
main folder name:mca
file path:C:\Users\DELL\mca\[Link]
Hello.............!Good Morning
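
A variant with `pathlib` is sketched below. It assumes UTF-8 text files and uses a temporary folder with made-up file names so that it runs anywhere; `read_txt_files` is a hypothetical helper, not part of the original program.

```python
import tempfile
from pathlib import Path


def read_txt_files(folder):
    """Return {file name: text} for every .txt file directly inside folder."""
    result = {}
    for path in sorted(Path(folder).glob("*.txt")):
        result[path.name] = path.read_text(encoding="utf-8")
    return result


# Demo on a throwaway folder: one .txt file is read, the .md file is ignored.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "greeting.txt").write_text("Hello", encoding="utf-8")
    Path(tmp, "notes.md").write_text("skipped", encoding="utf-8")
    found = read_txt_files(tmp)

print(found)
```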

ii) read multiple files from multiple folders.

import os
def read_text_files_from_folders(root_folder):
    texts = []
    # os.walk visits root_folder and every folder below it
    for folder_name, subfolders, filenames in os.walk(root_folder):
        for filename in filenames:
            if filename.endswith(".txt"):
                file_path = os.path.join(folder_name, filename)
                try:
                    with open(file_path, 'r', encoding='utf-8') as file:
                        content = file.read()
                        print(folder_name)
                        print(filename)
                        print(content)
                        texts.append(content)
                except Exception as e:
                    print(f"Error reading file {file_path}: {e}")
    return texts
root_folder = "C:/"
texts = read_text_files_from_folders(root_folder)

Output :
C:/
[Link]
Deployment Image Servicing and Management tool
Version: 10.0.10240.16384
Image Version: 10.0.10240.16384
Packages listing:
Package Identity : Microsoft-Windows-Client-LanguagePack-
Package~31bf3856ad364e35~amd64~en-US~10.0.10240.16384
State : Installed
Release Type : Language Pack
Install Time : 7/10/2015 1:13 PM
Package Identity : Microsoft-Windows-DiagTrack-Internal-
Package~31bf3856ad364e35~amd64~~10.0.10240.16384

State : Installed
Release Type : Feature Pack
Install Time : 7/10/2015 12:20 PM
Package Identity : Microsoft-Windows-Foundation-
Package~31bf3856ad364e35~amd64~~10.0.10240.16384
State : Installed
Release Type : Foundation
Install Time : 7/10/2015 12:20 PM
Package Identity : Microsoft-Windows-LanguageFeatures-Basic-en-us-
Package~31bf3856ad364e35~amd64~~10.0.10240.16384
State : Installed
Release Type : OnDemand Pack
Install Time : 7/10/2015 1:13 PM
Package Identity : Microsoft-Windows-LanguageFeatures-Handwriting-en-us-
Package~31bf3856ad364e35~amd64~~10.0.10240.16384
State : Installed
Release Type : OnDemand Pack
Install Time : 7/10/2015 1:13 PM
Package Identity : Microsoft-Windows-LanguageFeatures-OCR-en-us-
Package~31bf3856ad364e35~amd64~~10.0.10240.16384
State : Installed
Release Type : OnDemand Pack
Install Time : 7/10/2015 1:14 PM
Package Identity : Microsoft-Windows-LanguageFeatures-Speech-en-us-
Package~31bf3856ad364e35~amd64~~10.0.10240.16384
State : Installed
Release Type : OnDemand Pack
Install Time : 7/10/2015 1:14 PM
Package Identity : Microsoft-Windows-LanguageFeatures-TextToSpeech-en-us-
Package~31bf3856ad364e35~amd64~~10.0.10240.16384
State : Installed
Release Type : OnDemand Pack
Install Time : 7/10/2015 1:14 PM
Package Identity : Microsoft-Windows-Prerelease-Client-
Package~31bf3856ad364e35~amd64~en-US~10.0.10240.16384

State : Installed
Release Type : Language Pack
Install Time : 7/10/2015 1:13 PM
Package Identity : Microsoft-Windows-Prerelease-Client-
Package~31bf3856ad364e35~amd64~~10.0.10240.16384
State : Installed
Release Type : Feature Pack
Install Time : 7/10/2015 12:20 PM
Package Identity : Microsoft-Windows-RetailDemo-OfflineContent-Content-en-us-
Package~31bf3856ad364e35~amd64~~10.0.10240.16384
State : Installed
Release Type : OnDemand Pack
Install Time : 7/10/2015 1:16 PM
Package Identity : Microsoft-Windows-RetailDemo-OfflineContent-Content-
Package~31bf3856ad364e35~amd64~~10.0.10240.16384
State : Installed
Release Type : OnDemand Pack
Install Time : 7/10/2015 1:16 PM
Package Identity : Package_for_KB3074667~31bf3856ad364e35~amd64~~[Link]
State : Installed
Release Type : Security Update
Install Time : 7/24/2015 3:15 AM
Package Identity : Package_for_KB3081444~31bf3856ad364e35~amd64~~[Link]
State : Install Pending
Release Type : Security Update
Install Time : 8/25/2015 8:38 AM

The operation completed successfully.
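
Walking an entire drive such as `C:/` is slow and prints a lot, as the output above shows. The sketch below does the same recursive search on a small temporary folder tree with `pathlib`. The folder and file names are invented for the demo, and `read_txt_tree` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def read_txt_tree(root):
    """Return {relative path: text} for every .txt file under root."""
    texts = {}
    for path in sorted(Path(root).rglob("*.txt")):
        try:
            key = path.relative_to(root).as_posix()
            texts[key] = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError) as exc:
            # Unreadable files are reported and skipped, not fatal.
            print(f"Skipping {path}: {exc}")
    return texts


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a" / "b").mkdir(parents=True)
    (root / "top.txt").write_text("top", encoding="utf-8")
    (root / "a" / "b" / "deep.txt").write_text("deep", encoding="utf-8")
    tree = read_txt_tree(root)

print(tree)
```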

3. Write a Python program to read and display various kinds of data (image, text,
numeric, audio, and video) saved in different formats using various Python libraries.
Date:
----------------------------------------------------------------------------------------------------------------
#jpg
from PIL import Image
image='[Link]'
image=Image.open(image)
image.show()

Output:

#png
from PIL import Image
image='[Link]'
image=Image.open(image)
image.show()

Output:

#gif
import cv2
video = '[Link]'
cap = cv2.VideoCapture(video)
if not cap.isOpened():
    print("Error: could not open video")
    exit()
# show the frames one by one until the file ends or 'q' is pressed
while True:
    ret, frame = cap.read()
    if not ret:
        break
    cv2.imshow('Frame', frame)
    if cv2.waitKey(25) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()

Output:

# audio
from IPython.display import Audio
import librosa
import numpy as np
import matplotlib.pyplot as plt
audio='computer-keyboard-typing-290582.mp3'
# sr=None keeps the file's own sampling rate
y,sr=librosa.load(audio,sr=None)
print(f'sampling rate:{sr}Hz')
print(f'Number of sample:{len(y)}')
plt.figure(figsize=(14,5))
plt.subplot(3,3,3)
plt.plot(np.arange(len(y))/sr,y)
plt.title('waveform')
plt.xlabel('Time(s)')
plt.ylabel('Amplitude')
plt.show()
Audio(data=y,rate=sr)

Output:
sampling rate:48000Hz
Number of sample:1793664

# TEXT DATA
#.txt .json .excel
#.txt
file_path ='[Link]'
try:
    with open(file_path, 'r') as file:
        print(file.read())
except FileNotFoundError:
    print(f"Error: The file '{file_path}' was not found.")
except IOError:
    print(f"Error: Could not read file '{file_path}'.")

Output:
Hello.............!Good Morning

# json
import json
data={
    "name":"john doe",
    "age":30,
    "city":"newyork",
    "interests":["python","data science","Reading"]
}
file_path='[Link]'
try:
    with open(file_path,'w') as file:
        json.dump(data,file,indent=4)
    print("Json data has been written to successfully")
except IOError:
    print("Error:Could not write to file")
file_path='[Link]'
with open(file_path,'r') as file:
    print(file.read())

Output:
Json data has been written to successfully
{
    "name": "john doe",
    "age": 30,
    "city": "newyork",
    "interests": [
        "python",
        "data science",
        "Reading"
    ]
}

#xls
import pandas as pd
file_path='[Link]'
df=pd.read_excel(file_path)
print(df.head())

Output:
  First Name   Last Name  Gender  Country  Age        Date    Id
0      Dulce       Abril  Female       US   32  2017-10-15  1562
1       Mara  Hashmimoto  Female  Britain   25  2016-08-16  1582
2     Philip        Gent    male      Goa   15  2015-05-21  2587
3   Kathleen      Hanner  Female       US   18  2017-10-15  3549
4    Nereida     Magwood  Female       US   58  2016-08-16  2468
#CSV
import pandas as pd
file_path='[Link]'
df=pd.read_csv(file_path)
print(df.head())

Output:
   area  bedroom  age   price
0  2600      3.0   20  550000
1  3000      4.0   15  565000
2  3200      NaN   18  580000
3  3500      3.0   30  595000
4  4000      5.0    8  610000

#tsv
def read_tsv_file(file_path):
    with open(file_path,'r') as file:
        lines=file.readlines()
    # split every line on tab characters
    for line in lines:
        fields=line.strip().split('\t')
        print(fields)
file_path='[Link]'
read_tsv_file(file_path)

Output:
["[' ']"]
["['Annual budget tracker']"]
["['Plan and track your monthly spending for the entire year']"]
["[' ']"]
["[' ']"]
["['How to use this temple']"]
['']
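
Splitting on `'\t'` by hand does not handle quoted fields. A possible alternative is the standard `csv` module with a tab delimiter, sketched here on an in-memory string so that no file is needed. The sample text and the `parse_tsv` name are made up for the demo.

```python
import csv
import io


def parse_tsv(text):
    """Parse tab-separated text into a list of rows (lists of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


sample_tsv = "area\tbedrooms\tprice\n2600\t3\t550000\n"
rows = parse_tsv(sample_tsv)
for row in rows:
    print(row)
```

To read from disk, the same reader can be given a file opened with `newline=''`.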

4. Write a program to perform measures of dispersion (range, variance, standard
deviation, IQR) with and without using built-in functions on the data set.
Date:09-07-2025
----------------------------------------------------------------------------------------------------------------
WITH BUILT IN :

import numpy as np
data=[]
n=int(input("Enter the number do you want to enter:"))
for i in range(n):
    ele=int(input("Enter the elements:"))
    data.append(ele)
print("The numbers are:",data)
range_builtin = np.ptp(data)         # max - min
variance_builtin = np.var(data)      # population variance (divides by n)
std_deviation_builtin = np.std(data)
iqr_builtin = np.percentile(data, 75) - np.percentile(data,25)
print(".....................Using Built-in Functions............................")
print(f"Range: {range_builtin}")
print(f"Variance: {variance_builtin}")
print(f"Standard Deviation: {std_deviation_builtin}")
print(f"IQR: {iqr_builtin}")

Output:
Enter the number do you want to enter: 14
Enter the elements: 8
Enter the elements: 9
Enter the elements: 4
Enter the elements: 2
Enter the elements: 3
Enter the elements: 5
Enter the elements: 4
Enter the elements: 12
Enter the elements: 78
Enter the elements: 71
Enter the elements: 61
Enter the elements: 36
Enter the elements: 78
Enter the elements: 90
The numbers are: [8, 9, 4, 2, 3, 5, 4, 12, 78, 71, 61, 36, 78, 90]

.....................Using Built-in Functions............................
Range: 88
Variance: 1107.4948979591838
Standard Deviation: 33.27904592922074
IQR: 64.25

WITHOUT BUILT IN FUNCTION:

data=[]
n=int(input("Enter the number do you want to enter:"))
for i in range(n):
    ele=int(input("Enter the elements:"))
    data.append(ele)
print("The numbers are:",data)
range_custom = max(data) - min(data)
mean = sum(data) / len(data)
# sample variance (divides by n-1), unlike np.var above, which divides by n
variance_custom = sum((x - mean) ** 2 for x in data) / (len(data) - 1)
std_dev_custom = variance_custom ** 0.5
sorted_data = sorted(data)
# quartiles are picked by index, without the interpolation np.percentile uses
q75_custom = sorted_data[int(len(data) * 0.75)]
q25_custom = sorted_data[int(len(data) * 0.25)]
iqr_custom = q75_custom - q25_custom
print(".......................Without Using Built-in Functions................................")
print(f"Range: {range_custom}")
print(f"Variance: {variance_custom}")
print(f"Standard Deviation: {std_dev_custom}")
print(f"IQR: {iqr_custom}")

Output:
Enter the number do you want to enter: 10
Enter the elements: 10
Enter the elements: 20
Enter the elements: 30
Enter the elements: 40
Enter the elements: 50
Enter the elements: 60
Enter the elements: 70
Enter the elements: 80

Enter the elements: 90
Enter the elements: 100
The numbers are: [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
.......................Without Using Built-in Functions................................
Range: 90
Variance: 916.6666666666666
Standard Deviation: 30.276503540974915
IQR: 50
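
The two versions of this program do not use the same definitions: `np.var` divides by n, while the hand-written variance divides by n-1, and the index-based quartiles differ from the interpolated ones that `np.percentile` returns by default. The sketch below implements both conventions in plain Python so the difference is visible. The function names are illustrative, and the two lists are the inputs shown in the outputs above.

```python
def variance(values, ddof=0):
    """Variance; ddof=0 divides by n (population), ddof=1 by n-1 (sample)."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - ddof)


def percentile(values, q):
    """Percentile with linear interpolation between neighbouring values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


first = [8, 9, 4, 2, 3, 5, 4, 12, 78, 71, 61, 36, 78, 90]
second = list(range(10, 101, 10))

print("population variance:", variance(first))
print("interpolated IQR:", percentile(first, 75) - percentile(first, 25))
print("sample variance:", variance(second, ddof=1))
print("interpolated IQR:", percentile(second, 75) - percentile(second, 25))
```

For the second list the interpolated IQR is 45.0, not the 50 produced by the index-based rule, so results from the two programs should not be compared directly.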

5. Write a program to perform linear regression using
i) Single variable
ii) Multiple variables.
Date:11-07-2025
----------------------------------------------------------------------------------------------------------------
i) Single variable

import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plot
df=pd.read_csv('[Link]')
df.head()

Output:

df.keys()

Output:
Index(['area', 'bedrooms', 'age ', ' price'], dtype='object')

# both 'age ' and ' price' carry a stray space, so rename both
df=df.rename(columns={'age ':'age',' price':'price'})
print(df.columns)

Output:
Index(['area', 'bedrooms', 'age', 'price'], dtype='object')

plot.xlabel('area')
plot.ylabel('price')
plot.scatter(df.area,df.price,color='green',marker='+')
plot.show()

Output:

x=df.iloc[:,0].values  # area column
y=df.iloc[:,3].values  # price column
print(x)
print(y)

Output:
[2600 3000 3200 3500 4000]
[550000 556500 580000 595000 610000]

reg=LinearRegression()
x = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature array
reg.fit(x,y)

Output:

reg.coef_
Output:
array([46.50179856])

reg.predict([[3000]])

Output:
array([566209.5323741])
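
For a single variable the fitted line can be checked by hand with the least-squares formulas slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x). The sketch below applies them to the `x` and `y` arrays printed above; `fit_line` is an illustrative helper, not part of scikit-learn.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


area = [2600, 3000, 3200, 3500, 4000]
price = [550000, 556500, 580000, 595000, 610000]

slope, intercept = fit_line(area, price)
prediction = intercept + slope * 3000
print(slope, intercept, prediction)
```

The slope and the prediction for an area of 3000 agree with `reg.coef_` and `reg.predict([[3000]])` above to the printed precision.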

ii) Multiple variables.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
df=pd.read_csv('homeprices_multiple.csv')
df.head()
Output:

   area  bedrooms  age   price
0  3000       4.0   15  565000
1  3200       NaN   18  610000
2  3600       3.0   30  595000
3  4000       5.0    8  760000
4  4100       6.0    8  810000

plt.xlabel('age')
plt.ylabel('price')
plt.scatter(df.age,df.price,color='green',marker='+')
plt.show()

Output:

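# LinearRegression rejects NaN, so the missing bedrooms value must be filled first.
# The fill step is not shown in the original; the median is assumed here.
df.bedrooms = df.bedrooms.fillna(df.bedrooms.median())
reg = LinearRegression()
reg.fit(df[['area', 'bedrooms', 'age']], df['price'])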

Output:

print("coefficient:",reg.coef_)
print("Intercept:",reg.intercept_)

Output:
coefficient: [ 148.64130435 35135.86956522 -1603.26086957]
Intercept: 2581.5217391273472

input_data = pd.DataFrame([[3000, 3, 15]], columns=['area', 'bedrooms', 'age'])
predicted_price = reg.predict(input_data)
print(f"Predicted price: {predicted_price[0]}")

Output:
Predicted price: 529864.130434782

6. Program to fit a Multiple Linear Regression model on the House_prices dataset. Consider
the table below containing house prices in Monroe, New Jersey (USA).
area   bedrooms   age   price
2600   3          20    550000
3000   4          15    565000
3200              18    610000
3600   3          30    595000
4000   5          8     760000
4100   5          8     810000
Here price depends on the area (square feet), bedrooms and age of the house (in years).
Predict the prices of new homes based on the following area, bedrooms and age.
Date:
----------------------------------------------------------------------------------------------------------------

import pandas as pd
import numpy as np
from sklearn import linear_model
import warnings
warnings.filterwarnings('ignore')
df=pd.read_excel('[Link]')
print("The home price dataset is\n",df)

Output:
The home price dataset is
area bedrooms age price
0 2600 3.0 20 550000
1 3000 4.0 15 565000
2 3200 NaN 18 610000
3 3600 3.0 30 595000
4 4000 5.0 8 760000
5 4100 5.0 8 810000

print("The description of the dataset\n",df.describe())

Output:
The description of the dataset
               area  bedrooms        age          price
count     6.000000       5.0   6.000000       6.000000
mean   3416.666667       4.0  16.500000  648333.333333
std     587.934237       1.0   8.288546  109117.673484
min    2600.000000       3.0   8.000000  550000.000000
25%    3050.000000       3.0   9.750000  572500.000000
50%    3400.000000       4.0  16.500000  602500.000000
75%    3900.000000       5.0  19.500000  722500.000000
max    4100.000000       5.0  30.000000  810000.000000

print("To check if there is any missing value\n",df.isnull().any())

Output:
To check if there is any missing value
area False
bedrooms True
age False
price False
dtype: bool

print("The median of bedrooms=",df.bedrooms.median())

Output:
The median of bedrooms= 4.0

df.bedrooms=df.bedrooms.fillna(df.bedrooms.median())
print("Data set After replacing the missing values with median")
print(df)

Output:

Data set After replacing the missing values with median
   area  bedrooms  age   price
0  2600       3.0   20  550000
1  3000       4.0   15  565000
2  3200       4.0   18  610000
3  3600       3.0   30  595000
4  4000       5.0    8  760000
5  4100       5.0    8  810000
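
What median imputation does to the bedrooms column can be sketched without pandas. In the snippet below `None` stands for the missing value, and `fill_with_median` is a hypothetical helper written only for illustration.

```python
def fill_with_median(values):
    """Replace None entries with the median of the values that are present."""
    present = sorted(v for v in values if v is not None)
    mid = len(present) // 2
    if len(present) % 2:
        median = present[mid]
    else:
        median = (present[mid - 1] + present[mid]) / 2
    return [median if v is None else v for v in values]


bedrooms = [3.0, 4.0, None, 3.0, 5.0, 5.0]
filled = fill_with_median(bedrooms)
print(filled)
```

The median is computed from the five known values only, which is why the filled entry is 4.0.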

import matplotlib.pyplot as plt
plt.scatter(df.age,df.price)
plt.xlabel('Age of home(in years)')
plt.ylabel('price')
plt.show()

Output:

x=df.drop('price',axis='columns')
print("The dataset after dropping price is\n")
print(x)

Output:
The dataset after dropping price is
area bedrooms age
0 2600 3.0 20
1 3000 4.0 15
2 3200 4.0 18
3 3600 3.0 30
4 4000 5.0 8
5 4100 5.0 8

y=df['price']
print("The dataset having price is\n")
print(y)

Output:
The dataset having price is

0 550000
1 565000
2 610000
3 595000
4 760000
5 810000
Name: price, dtype: int64

reg=linear_model.LinearRegression()
reg.fit(x,y)

Output:

coef=reg.coef_
print(coef)

Output:
[ 189.57096766 -94877.34896436 -13068.36933232]

b1=coef[0]
b2=coef[1]
b3=coef[2]
print(b1)
print(b2)
print(b3)

Output:
189.57096766248824
-94877.34896436246
-13068.369332315371

a=reg.intercept_
print(a)

Output:
595770.0169938187

reg.predict([[3000,3,40]])
new_area=3000
new_bedrooms=3
new_age=40
# every coefficient multiplies its feature: price = a + b1*area + b2*bedrooms + b3*age
predicted_price=a+(b1*new_area)+(b2*new_bedrooms)+(b3*new_age)
print("The predicted_price\n",round(predicted_price,2))

Output:
The predicted_price
357116.1

7. Write a program to perform various data visualization techniques on a sample
dataset.
Date:
----------------------------------------------------------------------------------------------------------------
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data=pd.read_csv('fruit_data_with_colours.csv')
print("First few rows of the dataset:")
print(data.head())

Output:
First few rows of the dataset:
   fruit_label fruit_name fruit_subtype  mass  width  height  color_score
0            1      apple  granny_smith   192    8.4     7.3         0.55
1            1      apple  granny_smith   180    8.0     6.8         0.59
2            1      apple  granny_smith   176    7.4     7.2         0.60
3            2   mandarin      mandarin    86    6.2     4.7         0.80
4            2   mandarin      mandarin    84    6.0     4.6         0.79

plt.subplot(2,2,1)
sns.histplot(data['width'],bins=10,kde=True)
plt.title('Histogram of fruit width')

Output:
Text(0.5, 1.0, 'Histogram of fruit width')

plt.subplot(2,2,2)
sns.scatterplot(x='width',y='height',data=data)
plt.title('scatter plot of width vs Height')

Output:
Text(0.5, 1.0, 'scatter plot of width vs Height')

plt.subplot(2,2,3)
sns.boxplot(x='mass',y='color_score',data=data)
plt.title('box plot of mass level color_score')

Output:
Text(0.5, 1.0, 'box plot of mass level color_score')

plt.subplot(2,2,4)
sns.countplot(x='fruit_name',data=data)
plt.title('count of fruit_name')

Output:
Text(0.5, 1.0, 'count of fruit_name')

8. Write a program to perform numeric data preprocessing with and without using built-in
functions.
Date:17/07/2025
----------------------------------------------------------------------------------------------------------------
WITH BUILT-IN FUNCTION:

import numpy as np
import pandas as pd
df=pd.read_csv("[Link]")
df.head()

Output:

df.shape

Output:
(768, 9)

[Link]()

Output:

features = df.iloc[:, :-1]     # every column except the last
class_label = df.iloc[:, -1]   # last column (Outcome)
duplicate_values = df[df.duplicated()]
print("\nDuplicate values:")
print(duplicate_values)

Output:
Duplicate values:
Empty DataFrame
Columns: [Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI,
DiabetesPedigreeFunction, Age, Outcome]
Index: []

df.drop_duplicates(inplace=True)
print("DataFrame after removing duplicates:")
df.head(5)

Output:
DataFrame after removing duplicates:

missing_values=df.isnull().sum()
print("Missing values:\n",missing_values)

Output:
Missing values:
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64

WITHOUT BUILT-IN FUNCTION:
import csv
with open('[Link]','r') as file:
    reader=csv.reader(file)
    data=list(reader)

header=data[0]
rows=data[1:]

# convert every field from text to float
for i in range(len(rows)):
    rows[i]=[float(x) for x in rows[i]]

# columns in which 0 is treated as a missing measurement
cols_to_fix=[1,2,3,4,5]

for col in cols_to_fix:
    non_zero_vals=[row[col] for row in rows if row[col]!=0]
    non_zero_vals.sort()
    n=len(non_zero_vals)
    median=non_zero_vals[n//2] if n%2!=0 else (non_zero_vals[n//2-1]+non_zero_vals[n//2])/2
    # replace each zero with the median of the non-zero values
    for row in rows:
        if row[col]==0:
            row[col]=median

print("\nPreprocessed Data(without built_in functions):")
print(header)
for row in rows[:5]:
    print(row)

Output:
Preprocessed Data(without built_in functions):
['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
'DiabetesPedigreeFunction', 'Age', 'Outcome']
[6.0, 148.0, 72.0, 35.0, 125.0, 33.6, 0.627, 50.0, 1.0]
[1.0, 85.0, 66.0, 29.0, 125.0, 26.6, 0.351, 31.0, 0.0]
[8.0, 183.0, 64.0, 29.0, 125.0, 23.3, 0.672, 32.0, 1.0]
[1.0, 89.0, 66.0, 23.0, 94.0, 28.1, 0.167, 21.0, 0.0]
[0.0, 137.0, 40.0, 35.0, 168.0, 43.1, 2.288, 33.0, 1.0]
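
The pandas version also checks for duplicate rows and missing values, which the hand-written version skips. A plain-Python sketch of those two checks follows. The helper names and the three sample rows are invented for the demo, and an empty string or `None` is assumed to mean "missing".

```python
def find_duplicates(rows):
    """Return rows that repeat an earlier row (first occurrences are kept)."""
    seen = set()
    duplicates = []
    for row in rows:
        key = tuple(row)
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
    return duplicates


def count_missing(rows, n_cols):
    """Count empty or None entries per column."""
    counts = [0] * n_cols
    for row in rows:
        for i, value in enumerate(row):
            if value is None or value == "":
                counts[i] += 1
    return counts


sample_rows = [["6", "148", "72"], ["1", "85", ""], ["6", "148", "72"]]
print(find_duplicates(sample_rows))
print(count_missing(sample_rows, 3))
```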

9. Write a Python program to perform the following using employees_data.csv file:
1. Load the dataset and display the first 5 rows.
2. Clean the data:
* Check for and handle any missing values.
* Convert Joining_Date to datetime format.
3. Feature Engineering:
* Create a new column Years_of_Service calculated from today's date.
4. Data Analysis:
* Calculate the average salary per department.
* Find the number of employees in each gender category.
* Identify the department with the highest average Years_of_Service.
5. Data Visualization:
* Create a bar plot of average salary by department.
* Plot a pie chart showing the gender distribution.
6. Export the cleaned and enriched dataset to a new CSV file called
employees_cleaned.csv.
Date:08/08/2025
----------------------------------------------------------------------------------------------------------------

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
df = pd.read_csv('employees .csv')
print("First 5 rows of the dataset:")
print(df.head())

Output:

First 5 rows of the dataset:
   EMPLOYEE_ID FIRST_NAME LAST_NAME            EMAIL  \
0            1      Megan     Chang  [Link]@[Link]
1            2    Vanessa     Patel  [Link]@[Link]
2            3      Tammy     Woods  [Link]@[Link]
3            4       John     Ponce  [Link]@[Link]
4            5        Amy     Olsen  [Link]@[Link]

           PHONE_NUMBER JOINING_DATE  GENDER  JOB_ID     SALARY  \
0   (048)764-7593x82421   11-11-2019  Female  PU_MAN  110121.95
1  001-924-115-7815x659   29-01-2019   Other  SA_REP   66444.07
2    408.016.0975x35139   16-05-2019  Female  MK_MAN  110249.46
3       +1-711-587-1484   21-09-2017    Male  PU_MAN   38534.77
4       +1-398-947-1965   23-05-2025   Other  PU_MAN   84171.19

   COMMISSION_PCT  MANAGER_ID DEPARTMENT
0             NaN           3    Finance
1            0.24          20      Sales
2            0.07           9    Finance
3             NaN          17  Marketing
4             NaN          10    Finance

# 2. Clean the data


print("\nMissing values before cleaning:")
print(df.isnull().sum())

Output:

Missing values before cleaning:


EMPLOYEE_ID 0
FIRST_NAME 0
LAST_NAME 0
EMAIL 0
PHONE_NUMBER 0
JOINING_DATE 0
GENDER 0
JOB_ID 0
SALARY 0
COMMISSION_PCT 718
MANAGER_ID 0
DEPARTMENT 0

dtype: int64

# Drop every row that has a missing value (here: rows without COMMISSION_PCT)
df.dropna(inplace=True)
# Convert Joining_Date to datetime format
df['JOINING_DATE'] = pd.to_datetime(df['JOINING_DATE'],dayfirst=True,errors='coerce')
df.dropna(subset=['JOINING_DATE'], inplace=True) # Drop rows where Joining_Date couldn't be parsed
print(df.head())

Output:
    EMPLOYEE_ID FIRST_NAME LAST_NAME            EMAIL  \
1             2    Vanessa     Patel  [Link]@[Link]
2             3      Tammy     Woods  [Link]@[Link]
6             7    Frances    Massey  [Link]@[Link]
7             8     Brenda    Rogers  [Link]@[Link]
13           14     Joanne  Stephens  [Link]@[Link]

            PHONE_NUMBER JOINING_DATE  GENDER  JOB_ID     SALARY  \
1   001-924-115-7815x659   2019-01-29   Other  SA_REP   66444.07
2     408.016.0975x35139   2019-05-16  Female  MK_MAN  110249.46
6     483.396.9477x51591   2023-11-16  Female  MK_MAN   39063.11
7           304.135.2560   2017-04-24    Male  MK_MAN   72930.88
13         (375)945-9924   2022-03-15  Female  MK_MAN  119817.45

    COMMISSION_PCT  MANAGER_ID  DEPARTMENT
1             0.24          20       Sales
2             0.07           9     Finance
6             0.13          40  Accounting
7             0.26          17          IT
13            0.15          20   Marketing

# 3. Feature Engineering

# Create a new column 'Years_of_Service'


today = pd.to_datetime('today')
# whole years of service, approximating a year as 365 days
df['Years_of_Service'] = (today - df['JOINING_DATE']).dt.days // 365
print(df[['EMPLOYEE_ID', 'FIRST_NAME', 'JOINING_DATE', 'Years_of_Service']].head())

Output:

    EMPLOYEE_ID FIRST_NAME JOINING_DATE  Years_of_Service
1             2    Vanessa   2019-01-29                 6
2             3      Tammy   2019-05-16                 6
6             7    Frances   2023-11-16                 1
7             8     Brenda   2017-04-24                 8
13           14     Joanne   2022-03-15                 3
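
Dividing the day count by 365 ignores leap days, so it can be off by one year close to an anniversary. A calendar-based count of completed years is sketched below with the standard `datetime` module. `completed_years` is a hypothetical helper, and the reference date 2025-08-08 is taken from the date written on this program, on the assumption that it was run that day.

```python
from datetime import date


def completed_years(start, today):
    """Number of full calendar years between start and today."""
    years = today.year - start.year
    # Subtract one if this year's anniversary has not been reached yet.
    if (today.month, today.day) < (start.month, start.day):
        years -= 1
    return years


run_date = date(2025, 8, 8)
print(completed_years(date(2019, 1, 29), run_date))

# One day before a fifth anniversary: the 365-day rule already reports 5.
near = date(2020, 8, 9)
print((run_date - near).days // 365, completed_years(near, run_date))
```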

# 4. Data Analysis
print("\nAverage salary per department:")
print(df.groupby('DEPARTMENT')['SALARY'].mean())

print("\nEmployee count by gender:")
print(df['GENDER'].value_counts())

print("\nDepartment with highest average Years_of_Service:")
print(df.groupby('DEPARTMENT')['Years_of_Service'].mean().idxmax())

Output:

Average salary per department:
DEPARTMENT
Accounting    75289.210000
Finance       72732.191081
HR            66857.482766
IT            79238.097692
Marketing     71059.762258
Purchasing    82529.937895
Sales         75067.825000
Name: SALARY, dtype: float64

Employee count by gender:
GENDER
Male      98
Female    96
Other     88
Name: count, dtype: int64

Department with highest average Years_of_Service:
Marketing
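
What `groupby(...).mean()` and `idxmax()` compute can be sketched in plain Python. The three records below are made up for the demo and do not come from the employees file; `mean_by_group` is a hypothetical helper.

```python
from collections import defaultdict


def mean_by_group(records, key, value):
    """Average of records[value] for each distinct records[key]."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        totals[record[key]] += record[value]
        counts[record[key]] += 1
    return {group: totals[group] / counts[group] for group in totals}


staff = [  # hypothetical rows
    {"DEPARTMENT": "IT", "SALARY": 50000.0},
    {"DEPARTMENT": "IT", "SALARY": 70000.0},
    {"DEPARTMENT": "HR", "SALARY": 40000.0},
]

averages = mean_by_group(staff, "DEPARTMENT", "SALARY")
top_department = max(averages, key=averages.get)  # plays the role of idxmax()
print(averages, top_department)
```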

# Bar plot of average salary by department


avg_salary_dept_df = df.groupby('DEPARTMENT')['SALARY'].mean().reset_index()
avg_salary_dept_df.rename(columns={'SALARY': 'Average_Salary'}, inplace=True)
plt.figure(figsize=(4,4))
sns.barplot(data=avg_salary_dept_df, x='DEPARTMENT', y='Average_Salary') # no palette here
plt.title('Average Salary by Department')
plt.xlabel('')
plt.ylabel('Average Salary')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Output:

# Pie chart of gender distribution


gender_count = df['GENDER'].value_counts()
plt.figure(figsize=(4,4))
plt.pie(gender_count, labels=gender_count.index, autopct='%1.1f%%',
        colors=sns.color_palette('pastel'))
plt.title('Gender Distribution')
plt.tight_layout()
plt.show()

Output:

df.to_csv('employees_cleaned.csv', index=False)
print("\nCleaned and enriched dataset exported as 'employees_cleaned.csv'.")

Output:
Cleaned and enriched dataset exported as 'employees_cleaned.csv'.

10. Write a program to perform text data preprocessing for the dataset [Link], with
and without using built-in functions.
Date:
----------------------------------------------------------------------------------------------------------------

import numpy as np
import pandas as pd
df=pd.read_csv("[Link]")
df.head(5)
#expanding the display of text sms columns (None = do not truncate)
pd.set_option('display.max_colwidth',None)
df=df[['Tweets','Retweets']]
df.head()

Output:

df['Retweets'].value_counts()

Output:
Retweets
88 9
102 8
137 8
89 8
133 7
..
2575 1
5276 1
868 1
6886 1
11302 1
Name: count, Length: 1993, dtype: int64

import pandas as pd
import string
import re
from nltk.tokenize import WhitespaceTokenizer

def remove_urls(text):
    if isinstance(text,str):
        return re.sub(r'http\S+|www\S+','',text,flags=re.MULTILINE)
    else:
        return text
def remove_punctuation(text):
    if isinstance(text,str):
        return "".join([char for char in text if char not in string.punctuation])
    else:
        return text

def tokenization(text):
    if isinstance(text,str):
        tk=WhitespaceTokenizer()
        return tk.tokenize(text)
    else:
        return text

#expand contractions
import contractions

# Define the function to expand contractions
def conc(text):
    if isinstance(text,str):
        return contractions.fix(text)
    else:
        return text
df['lower_case'] = df['Tweets'].apply(lambda x: x.lower() if isinstance(x,str) else x) #lower case
df['rem_url'] = df['lower_case'].apply(remove_urls) # Remove URLs first
# expand contractions before removing punctuation, while the apostrophes are still there
df['rem_conct'] = df['rem_url'].apply(conc) #apply contraction
df['rem_punct'] = df['rem_conct'].apply(lambda x: remove_punctuation(x)) #apply punctuation
df['tokenised_msg'] = df['rem_punct'].apply(lambda x: tokenization(x)) #apply tokenization
df.head(10)

Output:

from nltk.corpus import stopwords
",".join(stopwords.words('english'))

Output:
"a,about,above,after,again,against,ain,all,am,an,and,any,are,aren,aren't,as,at,be,because,been,
before,being,below,between,both,but,by,can,couldn,couldn't,d,did,didn,didn't,do,does,doesn,d
oesn't,doing,don,don't,down,during,each,few,for,from,further,had,hadn,hadn't,has,hasn,hasn't,
have,haven,haven't,having,he,he'd,he'll,her,here,hers,herself,he's,him,himself,his,how,i,i'd,if,i'
ll,i'm,in,into,is,isn,isn't,it,it'd,it'll,it's,its,itself,i've,just,ll,m,ma,me,mightn,mightn't,more,most,
mustn,mustn't,my,myself,needn,needn't,no,nor,not,now,o,of,off,on,once,only,or,other,our,our
s,ourselves,out,over,own,re,s,same,shan,shan't,she,she'd,she'll,she's,should,shouldn,shouldn't,
should've,so,some,such,t,than,that,that'll,the,their,theirs,them,themselves,then,there,these,they
,they'd,they'll,they're,they've,this,those,through,to,too,under,until,up,ve,very,was,wasn,wasn't,
we,we'd,we'll,we're,were,weren,weren't,we've,what,when,where,which,while,who,whom,why
,will,with,won,won't,wouldn,wouldn't,y,you,you'd,you'll,your,you're,yours,yourself,yourselve
s,you've"

import nltk
# named stop_words so that it does not shadow the imported stopwords module
stop_words=nltk.corpus.stopwords.words('english')
def remove_stopwords(text):
    output=[i for i in text if i not in stop_words]
    return output
df['rem_stopwords']=df['tokenised_msg'].apply(lambda x:remove_stopwords(x))
df

Output:

#stemming and lemmatizing
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import nltk
porter_stemmer=PorterStemmer()
wordnet_lemmatizer=WordNetLemmatizer()
#defining a function for stemming

def stemming(text):
    stem_text=[porter_stemmer.stem(word) for word in text]
    return stem_text
def lemmatizer(text):
    # accept a raw string or a list of tokens (rem_stopwords holds token lists)
    if isinstance(text,str):
        text=word_tokenize(text)
    if isinstance(text,list):
        return [wordnet_lemmatizer.lemmatize(word) for word in text]
    else:
        return text

#applying function for stemming

df['Stemmed_msg']=df['rem_stopwords'].apply(lambda x:stemming(x))

#apply lemmatizer function to the dataframe column


df['msg_lemmatized']=df['rem_stopwords'].apply(lemmatizer)
df.head(10)

Output:

#replace emojis with their text names

import emoji

#import demoji
#demoji.download_codes()

def emo(text):
    if not isinstance(text,str):
        return text
    temp=emoji.demojize(text,delimiters=(" "," "))
    temp=temp.replace("_"," ")
    return temp
df['rem_emo']=df["rem_punct"].apply(lambda x:emo(x))
df.head(5)
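
The task also asks for a version without library helpers, which the listing above does not include. A minimal sketch using only the standard library is given below. The stop-word set is a tiny illustrative list, not the NLTK list, and the sample sentence and the `preprocess` name are invented for the demo.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption, far shorter than NLTK's).
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "in"}


def preprocess(text):
    """Lowercase, drop URLs and punctuation, tokenize, remove stop words."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()  # whitespace tokenization
    return [token for token in tokens if token not in STOP_WORDS]


tokens = preprocess("The launch is LIVE: see https://example.com and retweet!")
print(tokens)
```

URLs are removed before punctuation for the same reason as in the pandas pipeline: once the punctuation is stripped, the URL pattern no longer matches.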

Output:

11. Write a Python program to implement the chi-square test for feature selection to train
an SVM classifier using a suitable dataset.
Date:14/08/2025
----------------------------------------------------------------------------------------------------------------
# 1. Import Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# 2. Load Dataset
data = pd.read_csv('fruit_data_with_colours.csv')
print("First 5 rows of the dataset:")
print(data.head(5))

Output:
First 5 rows of the dataset:
   fruit_label fruit_name fruit_subtype  mass  width  height  color_score
0            1      apple  granny_smith   192    8.4     7.3         0.55
1            1      apple  granny_smith   180    8.0     6.8         0.59
2            1      apple  granny_smith   176    7.4     7.2         0.60
3            2   mandarin      mandarin    86    6.2     4.7         0.80
4            2   mandarin      mandarin    84    6.0     4.6         0.79

# 3. Define Column Names for Target and Non-numeric Features


fruit_label = 'fruit_label'
fruit_subtype = 'fruit_subtype'
fruit_name = 'fruit_name'

# 4. Define Features (X) and Target (y)


X = data.drop([fruit_label, fruit_subtype, fruit_name], axis=1)
y = data[fruit_label]

# 5. Feature Selection using Chi-Square Test
# X has only four numeric columns here, so k=4 keeps all of them;
# choose a smaller k to actually discard features.
k_selected_features = 4 # Select top 4 features
chi2_selector = SelectKBest(chi2, k=k_selected_features)
X_selected = chi2_selector.fit_transform(X, y)

print("\nSelected top features (Chi-Square):")
selected_feature_indices = chi2_selector.get_support(indices=True)
print(X.columns[selected_feature_indices])

Output:
Selected top features (Chi-Square):
Index(['mass', 'width', 'height', 'color_score'], dtype='object')
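
The selector above ranks features by a chi-square score. As a rough illustration of what such a score measures for non-negative features, the sketch below computes it by hand with the standard library only. The function name chi2_scores and the four-row dataset are made up for this example, and the code assumes every feature column has a positive total.

```python
# Chi-square score of each non-negative feature against the class labels.
# The tiny dataset below is hypothetical and only illustrates the arithmetic.

def chi2_scores(rows, labels):
    """Return one chi-square statistic per feature column."""
    classes = sorted(set(labels))
    n_samples = len(rows)
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column_total = sum(row[j] for row in rows)
        score = 0.0
        for c in classes:
            # Observed: total of feature j over the samples of class c.
            observed = sum(row[j] for row, lab in zip(rows, labels) if lab == c)
            # Expected: the share class c would receive if the feature
            # were independent of the class.
            class_prob = labels.count(c) / n_samples
            expected = class_prob * column_total
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


rows = [
    [10.0, 5.0],
    [12.0, 5.0],
    [2.0, 5.0],
    [4.0, 5.0],
]
labels = [1, 1, 2, 2]

scores = chi2_scores(rows, labels)
print(scores)
```

The first column differs strongly between the two classes and gets a high score, while the second column is identical for every sample and scores zero, so a selector with k=1 would keep only the first column.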

# 6. Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
X_selected, y, test_size=0.2, random_state=42
)
# 7. Train the SVM Classifier
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)

Output:

# 8. Predictions
y_pred = svm_classifier.predict(X_test)

# 9. Evaluation
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("\nAccuracy:", accuracy)
print("\nClassification Report:")
print(report)

Output:
Accuracy: 0.75

Classification Report:
              precision    recall  f1-score   support

           1       0.67      0.67      0.67         3
           2       1.00      1.00      1.00         2
           3       0.33      0.50      0.40         2
           4       1.00      0.80      0.89         5

    accuracy                           0.75        12
   macro avg       0.75      0.74      0.74        12
weighted avg       0.81      0.75      0.77        12

Write a program to implement the ANOVA test for feature selection to train an SVM
classifier using a suitable dataset.
Date:26/08/2025
----------------------------------------------------------------------------------------------------------------

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile,f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

X,y=load_iris(return_X_y=True)
# Append 36 random (non-informative) columns to the 4 iris features
rng=np.random.RandomState(0)
X=np.hstack((X,2*rng.random((X.shape[0],36))))
clf=Pipeline([("anova",SelectPercentile(f_classif)),
              ("scaler",StandardScaler()),
              ("svc",SVC(gamma="auto"))
              ])

score_means=[]
score_stds=[]
percentiles=[1,3,6,10,15,20,30,40,60,80,100]

for percentile in percentiles:
    clf.set_params(anova__percentile=percentile)
    this_scores=cross_val_score(clf,X,y)
    score_means.append(this_scores.mean())
    score_stds.append(this_scores.std())

plt.errorbar(percentiles,score_means,np.array(score_stds))
plt.title("Performance of the SVM-Anova varying the percentile of features selected")
plt.xticks(np.linspace(0,100,11,endpoint=True))
plt.xlabel("Percentile")
plt.axis("tight")
plt.show()

Output:
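
The pipeline above scores features with a one-way ANOVA F-test. A minimal sketch of that statistic for a single feature, written with the standard library only, may help show why random columns rank low. The function name f_statistic and the sample values are assumptions made for this illustration.

```python
# One-way ANOVA F-statistic for a single feature, computed by hand.
# F = (between-class mean square) / (within-class mean square).

def f_statistic(values, labels):
    # Group the feature values by class label.
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    n = len(values)
    k = len(groups)
    grand_mean = sum(values) / n

    # Variation of the class means around the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Variation of the samples around their own class mean.
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))


labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
informative = [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
uninformative = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0]

f_informative = f_statistic(informative, labels)
f_uninformative = f_statistic(uninformative, labels)
print(f_informative, f_uninformative)
```

A feature whose class means differ gets a large F value, while a feature with identical class means gets zero, which is the ordering a percentile-based selector relies on.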

Write a program to classify the Iris dataset using the support vector classifier (SVC)
algorithm with data visualization, pre-processing and performance evaluation.
Date:26/08/2025
----------------------------------------------------------------------------------------------------------------
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
from sklearn import datasets

iris=datasets.load_iris()
X=iris.data
Y=iris.target

df=pd.DataFrame(X,columns=iris.feature_names)
df['target']=Y
df['target']=df['target'].map(dict(enumerate(iris.target_names)))
sns.pairplot(df,hue="target",palette="Set2")
plt.suptitle("Pairplot of Iris Dataset",y=1.02)
plt.show()

Output:

plt.figure(figsize=(8,6))
sns.heatmap(df.iloc[:,:-1].corr(),annot=True,cmap='coolwarm')
plt.title("Correlation Heatmap of Iris Features")
plt.show()

Output:

print("Missing values in dataset:\n",df.isnull().sum())

Output:
Missing values in dataset:
sepal length (cm) 0
sepal width (cm) 0
petal length (cm) 0
petal width (cm) 0
target 0
dtype: int64

X=iris.data
Y=iris.target
# pass Y (the target defined above); lowercase y is not defined in this program
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3,random_state=1)
print("X_train",X_train)
print("X_test",X_test)
print("Y_train",Y_train)
print("Y_test",Y_test)
Output:
X_train [[7.7 2.6 6.9 2.3]
[5.7 3.8 1.7 0.3]
[5. 3.6 1.4 0.2]
[4.8 3. 1.4 0.3]
[5.2 2.7 3.9 1.4]
[5.1 3.4 1.5 0.2]
[5.5 3.5 1.3 0.2]
[7.7 3.8 6.7 2.2]
[6.9 3.1 5.4 2.1]
[7.3 2.9 6.3 1.8]
[6.4 2.8 5.6 2.2]
[6.2 2.8 4.8 1.8]
[6. 3.4 4.5 1.6]
[7.7 2.8 6.7 2. ]
[5.7 3. 4.2 1.2]
[4.8 3.4 1.6 0.2]
[5.7 2.5 5. 2. ]
[6.3 2.7 4.9 1.8]
[4.8 3. 1.4 0.1]
[4.7 3.2 1.3 0.2]
[6.5 3. 5.8 2.2]
[4.6 3.4 1.4 0.3]
[6.1 3. 4.9 1.8]
[6.5 3.2 5.1 2. ]
[6.7 3.1 4.4 1.4]
[5.7 2.8 4.5 1.3]
[6.7 3.3 5.7 2.5]
[6. 3. 4.8 1.8]
[5.1 3.8 1.6 0.2]
[6. 2.2 4. 1. ]
[6.4 2.9 4.3 1.3]
[6.5 3. 5.5 1.8]
[5. 2.3 3.3 1. ]
[6.3 3.3 6. 2.5]
[5.5 2.5 4. 1.3]
[5.4 3.7 1.5 0.2]
[4.9 3.1 1.5 0.2]
[5.2 4.1 1.5 0.1]
[6.7 3.3 5.7 2.1]
[4.4 3. 1.3 0.2]
[6. 2.7 5.1 1.6]
[6.4 2.7 5.3 1.9]
[5.9 3. 5.1 1.8]

[5.2 3.5 1.5 0.2]
[5.1 3.3 1.7 0.5]
[5.8 2.7 4.1 1. ]
[4.9 3.1 1.5 0.1]
[7.4 2.8 6.1 1.9]
[6.2 2.9 4.3 1.3]
[7.6 3. 6.6 2.1]
[6.7 3. 5.2 2.3]
[6.3 2.3 4.4 1.3]
[6.2 3.4 5.4 2.3]
[7.2 3.6 6.1 2.5]
[5.6 2.9 3.6 1.3]
[5.7 4.4 1.5 0.4]
[5.8 2.7 3.9 1.2]
[4.5 2.3 1.3 0.3]
[5.5 2.4 3.8 1.1]
[6.9 3.1 4.9 1.5]
[5. 3.4 1.6 0.4]
[6.8 2.8 4.8 1.4]
[5. 3.5 1.6 0.6]
[4.8 3.4 1.9 0.2]
[6.3 3.4 5.6 2.4]
[5.6 2.8 4.9 2. ]
[6.8 3.2 5.9 2.3]
[5. 3.3 1.4 0.2]
[5.1 3.7 1.5 0.4]
[5.9 3.2 4.8 1.8]
[4.6 3.1 1.5 0.2]
[5.8 2.7 5.1 1.9]
[4.8 3.1 1.6 0.2]
[6.5 3. 5.2 2. ]
[4.9 2.5 4.5 1.7]
[4.6 3.2 1.4 0.2]
[6.4 3.2 5.3 2.3]
[4.3 3. 1.1 0.1]
[5.6 3. 4.1 1.3]
[4.4 2.9 1.4 0.2]
[5.5 2.4 3.7 1. ]
[5. 2. 3.5 1. ]
[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.9 2.4 3.3 1. ]
[4.6 3.6 1. 0.2]
[5.9 3. 4.2 1.5]

[6.1 2.9 4.7 1.4]
[5. 3.4 1.5 0.2]
[6.7 3.1 4.7 1.5]
[5.7 2.9 4.2 1.3]
[6.2 2.2 4.5 1.5]
[7. 3.2 4.7 1.4]
[5.8 2.7 5.1 1.9]
[5.4 3.4 1.7 0.2]
[5. 3. 1.6 0.2]
[6.1 2.6 5.6 1.4]
[6.1 2.8 4. 1.3]
[7.2 3. 5.8 1.6]
[5.7 2.6 3.5 1. ]
[6.3 2.8 5.1 1.5]
[6.4 3.1 5.5 1.8]
[6.3 2.5 4.9 1.5]
[6.7 3.1 5.6 2.4]
[4.9 3.6 1.4 0.1]]
X_test [[5.8 4. 1.2 0.2]
[5.1 2.5 3. 1.1]
[6.6 3. 4.4 1.4]
[5.4 3.9 1.3 0.4]
[7.9 3.8 6.4 2. ]
[6.3 3.3 4.7 1.6]
[6.9 3.1 5.1 2.3]
[5.1 3.8 1.9 0.4]
[4.7 3.2 1.6 0.2]
[6.9 3.2 5.7 2.3]
[5.6 2.7 4.2 1.3]
[5.4 3.9 1.7 0.4]
[7.1 3. 5.9 2.1]
[6.4 3.2 4.5 1.5]
[6. 2.9 4.5 1.5]
[4.4 3.2 1.3 0.2]
[5.8 2.6 4. 1.2]
[5.6 3. 4.5 1.5]
[5.4 3.4 1.5 0.4]
[5. 3.2 1.2 0.2]
[5.5 2.6 4.4 1.2]
[5.4 3. 4.5 1.5]
[6.7 3. 5. 1.7]
[5. 3.5 1.3 0.3]
[7.2 3.2 6. 1.8]
[5.7 2.8 4.1 1.3]

[5.5 4.2 1.4 0.2]
[5.1 3.8 1.5 0.3]
[6.1 2.8 4.7 1.2]
[6.3 2.5 5. 1.9]
[6.1 3. 4.6 1.4]
[7.7 3. 6.1 2.3]
[5.6 2.5 3.9 1.1]
[6.4 2.8 5.6 2.1]
[5.8 2.8 5.1 2.4]
[5.3 3.7 1.5 0.2]
[5.5 2.3 4. 1.3]
[5.2 3.4 1.4 0.2]
[6.5 2.8 4.6 1.5]
[6.7 2.5 5.8 1.8]
[6.8 3. 5.5 2.1]
[5.1 3.5 1.4 0.3]
[6. 2.2 5. 1.5]
[6.3 2.9 5.6 1.8]
[6.6 2.9 4.6 1.3]]

Y_train [2 0 0 0 1 0 0 2 2 2 2 2 1 2 1 0 2 2 0 0 2 0 2 2 1 1 2 2 0 1 1 2 1 2 1 0 0
0 2 0 1 2 2 0 0 1 0 2 1 2 2 1 2 2 1 0 1 0 1 1 0 1 0 0 2 2 2 0 0 1 0 2 0 2
2 0 2 0 1 0 1 1 0 0 1 0 1 1 0 1 1 1 1 2 0 0 2 1 2 1 2 2 1 2 0]
Y_test [0 1 1 0 2 1 2 0 0 2 1 0 2 1 1 0 1 1 0 0 1 1 1 0 2 1 0 0 1 2 1 2 1 2 2 0 1
0 1 2 2 0 2 2 1]

sc=StandardScaler()
sc.fit(X_train)
X_train_std=sc.transform(X_train)
X_test_std=sc.transform(X_test)
print("X_train_std=\n",X_train_std)
print("X_test_std=\n",X_test_std)

Output:
X_train_std=
[[ 2.26050169e+00 -1.05089682e+00 1.77622921e+00 1.42370971e+00]
[-1.18973773e-01 1.82764665e+00 -1.14491883e+00 -1.14263397e+00]
[-9.51790185e-01 1.34788940e+00 -1.31344660e+00 -1.27095115e+00]
[-1.18973773e+00 -9.13823325e-02 -1.31344660e+00 -1.14263397e+00]
[-7.13842639e-01 -8.11018201e-01 9.09514958e-02 2.68855052e-01]
[-8.32816412e-01 8.68132159e-01 -1.25727068e+00 -1.27095115e+00]
[-3.56921319e-01 1.10801078e+00 -1.36962252e+00 -1.27095115e+00]

[ 2.26050169e+00 1.82764665e+00 1.66387736e+00 1.29539252e+00]
[ 1.30871150e+00 1.48496290e-01 9.33590353e-01 1.16707534e+00]
[ 1.78460660e+00 -3.31260955e-01 1.43917367e+00 7.82123787e-01]
[ 7.13842639e-01 -5.71139578e-01 1.04594220e+00 1.29539252e+00]
[ 4.75895093e-01 -5.71139578e-01 5.96534810e-01 7.82123787e-01]
[ 2.37947546e-01 8.68132159e-01 4.28007039e-01 5.25489419e-01]
[ 2.26050169e+00 -5.71139578e-01 1.66387736e+00 1.03875815e+00]
[-1.18973773e-01 -9.13823325e-02 2.59479267e-01 1.22206842e-02]
[-1.18973773e+00 8.68132159e-01 -1.20109475e+00 -1.27095115e+00]
[-1.18973773e-01 -1.29077545e+00 7.08886658e-01 1.03875815e+00]
[ 5.94868866e-01 -8.11018201e-01 6.52710734e-01 7.82123787e-01]
[-1.18973773e+00 -9.13823325e-02 -1.31344660e+00 -1.39926834e+00]
[-1.30871150e+00 3.88374913e-01 -1.36962252e+00 -1.27095115e+00]
[ 8.32816412e-01 -9.13823325e-02 1.15829405e+00 1.29539252e+00]
[-1.42768528e+00 8.68132159e-01 -1.31344660e+00 -1.14263397e+00]
[ 3.56921319e-01 -9.13823325e-02 6.52710734e-01 7.82123787e-01]
[ 8.32816412e-01 3.88374913e-01 7.65062582e-01 1.03875815e+00]
[ 1.07076396e+00 1.48496290e-01 3.71831115e-01 2.68855052e-01]
[-1.18973773e-01 -5.71139578e-01 4.28007039e-01 1.40537868e-01]
[ 1.07076396e+00 6.28253536e-01 1.10211813e+00 1.68034407e+00]
[ 2.37947546e-01 -9.13823325e-02 5.96534810e-01 7.82123787e-01]
[-8.32816412e-01 1.82764665e+00 -1.20109475e+00 -1.27095115e+00]
[ 2.37947546e-01 -2.01041131e+00 1.47127420e-01 -2.44413683e-01]
[ 7.13842639e-01 -3.31260955e-01 3.15655191e-01 1.40537868e-01]
[ 8.32816412e-01 -9.13823325e-02 9.89766277e-01 7.82123787e-01]
[-9.51790185e-01 -1.77053269e+00 -2.46104047e-01 -2.44413683e-01]
[ 5.94868866e-01 6.28253536e-01 1.27064590e+00 1.68034407e+00]
[-3.56921319e-01 -1.29077545e+00 1.47127420e-01 1.40537868e-01]
[-4.75895093e-01 1.58776803e+00 -1.25727068e+00 -1.27095115e+00]
[-1.07076396e+00 1.48496290e-01 -1.25727068e+00 -1.27095115e+00]
[-7.13842639e-01 2.54728252e+00 -1.25727068e+00 -1.39926834e+00]
[ 1.07076396e+00 6.28253536e-01 1.10211813e+00 1.16707534e+00]
[-1.66563282e+00 -9.13823325e-02 -1.36962252e+00 -1.27095115e+00]
[ 2.37947546e-01 -8.11018201e-01 7.65062582e-01 5.25489419e-01]
[ 7.13842639e-01 -8.11018201e-01 8.77414430e-01 9.10440971e-01]
[ 1.18973773e-01 -9.13823325e-02 7.65062582e-01 7.82123787e-01]
[-7.13842639e-01 1.10801078e+00 -1.25727068e+00 -1.27095115e+00]
[-8.32816412e-01 6.28253536e-01 -1.14491883e+00 -8.85999602e-01]
[-1.05669938e-15 -8.11018201e-01 2.03303343e-01 -2.44413683e-01]
[-1.07076396e+00 1.48496290e-01 -1.25727068e+00 -1.39926834e+00]
[ 1.90358037e+00 -5.71139578e-01 1.32682182e+00 9.10440971e-01]
[ 4.75895093e-01 -3.31260955e-01 3.15655191e-01 1.40537868e-01]
[ 2.14152792e+00 -9.13823325e-02 1.60770144e+00 1.16707534e+00]
[ 1.07076396e+00 -9.13823325e-02 8.21238506e-01 1.42370971e+00]

[ 5.94868866e-01 -1.77053269e+00 3.71831115e-01 1.40537868e-01]
[ 4.75895093e-01 8.68132159e-01 9.33590353e-01 1.42370971e+00]
[ 1.66563282e+00 1.34788940e+00 1.32682182e+00 1.68034407e+00]
[-2.37947546e-01 -3.31260955e-01 -7.75762758e-02 1.40537868e-01]
[-1.18973773e-01 3.26691839e+00 -1.25727068e+00 -1.01431679e+00]
[-1.05669938e-15 -8.11018201e-01 9.09514958e-02 1.22206842e-02]
[-1.54665905e+00 -1.77053269e+00 -1.36962252e+00 -1.14263397e+00]
[-3.56921319e-01 -1.53065407e+00 3.47755719e-02 -1.16096500e-01]
[ 1.30871150e+00 1.48496290e-01 6.52710734e-01 3.97172236e-01]
[-9.51790185e-01 8.68132159e-01 -1.20109475e+00 -1.01431679e+00]
[ 1.18973773e+00 -5.71139578e-01 5.96534810e-01 2.68855052e-01]
[-9.51790185e-01 1.10801078e+00 -1.20109475e+00 -7.57682419e-01]
[-1.18973773e+00 8.68132159e-01 -1.03256698e+00 -1.27095115e+00]
[ 5.94868866e-01 8.68132159e-01 1.04594220e+00 1.55202689e+00]
[-2.37947546e-01 -5.71139578e-01 6.52710734e-01 1.03875815e+00]
[ 1.18973773e+00 3.88374913e-01 1.21446997e+00 1.42370971e+00]
[-9.51790185e-01 6.28253536e-01 -1.31344660e+00 -1.27095115e+00]
[-8.32816412e-01 1.58776803e+00 -1.25727068e+00 -1.01431679e+00]
[ 1.18973773e-01 3.88374913e-01 5.96534810e-01 7.82123787e-01]
[-1.42768528e+00 1.48496290e-01 -1.25727068e+00 -1.27095115e+00]
[-1.05669938e-15 -8.11018201e-01 7.65062582e-01 9.10440971e-01]
[-1.18973773e+00 1.48496290e-01 -1.20109475e+00 -1.27095115e+00]
[ 8.32816412e-01 -9.13823325e-02 8.21238506e-01 1.03875815e+00]
[-1.07076396e+00 -1.29077545e+00 4.28007039e-01 6.53806603e-01]
[-1.42768528e+00 3.88374913e-01 -1.31344660e+00 -1.27095115e+00]
[ 7.13842639e-01 3.88374913e-01 8.77414430e-01 1.42370971e+00]
[-1.78460660e+00 -9.13823325e-02 -1.48197437e+00 -1.39926834e+00]
[-2.37947546e-01 -9.13823325e-02 2.03303343e-01 1.40537868e-01]
[-1.66563282e+00 -3.31260955e-01 -1.31344660e+00 -1.27095115e+00]
[-3.56921319e-01 -1.53065407e+00 -2.14003519e-02 -2.44413683e-01]
[-9.51790185e-01 -2.49016856e+00 -1.33752200e-01 -2.44413683e-01]
[-8.32816412e-01 1.10801078e+00 -1.31344660e+00 -1.27095115e+00]
[-1.07076396e+00 -9.13823325e-02 -1.31344660e+00 -1.27095115e+00]
[-1.07076396e+00 -1.53065407e+00 -2.46104047e-01 -2.44413683e-01]
[-1.42768528e+00 1.34788940e+00 -1.53815030e+00 -1.27095115e+00]
[ 1.18973773e-01 -9.13823325e-02 2.59479267e-01 3.97172236e-01]
[ 3.56921319e-01 -3.31260955e-01 5.40358887e-01 2.68855052e-01]
[-9.51790185e-01 8.68132159e-01 -1.25727068e+00 -1.27095115e+00]
[ 1.07076396e+00 1.48496290e-01 5.40358887e-01 3.97172236e-01]
[-1.18973773e-01 -3.31260955e-01 2.59479267e-01 1.40537868e-01]
[ 4.75895093e-01 -2.01041131e+00 4.28007039e-01 3.97172236e-01]
[ 1.42768528e+00 3.88374913e-01 5.40358887e-01 2.68855052e-01]
[-1.05669938e-15 -8.11018201e-01 7.65062582e-01 9.10440971e-01]
[-4.75895093e-01 8.68132159e-01 -1.14491883e+00 -1.27095115e+00]

[-9.51790185e-01 -9.13823325e-02 -1.20109475e+00 -1.27095115e+00]
[ 3.56921319e-01 -1.05089682e+00 1.04594220e+00 2.68855052e-01]
[ 3.56921319e-01 -5.71139578e-01 1.47127420e-01 1.40537868e-01]
[ 1.66563282e+00 -9.13823325e-02 1.15829405e+00 5.25489419e-01]
[-1.18973773e-01 -1.05089682e+00 -1.33752200e-01 -2.44413683e-01]
[ 5.94868866e-01 -5.71139578e-01 7.65062582e-01 3.97172236e-01]
[ 7.13842639e-01 1.48496290e-01 9.89766277e-01 7.82123787e-01]
[ 5.94868866e-01 -1.29077545e+00 6.52710734e-01 3.97172236e-01]
[ 1.07076396e+00 1.48496290e-01 1.04594220e+00 1.55202689e+00]
[-1.07076396e+00 1.34788940e+00 -1.31344660e+00 -1.39926834e+00]]
X_test_std=
[[-1.05669938e-15 2.30740390e+00 -1.42579845e+00 -1.27095115e+00]
[-8.32816412e-01 -1.29077545e+00 -4.14631819e-01 -1.16096500e-01]
[ 9.51790185e-01 -9.13823325e-02 3.71831115e-01 2.68855052e-01]
[-4.75895093e-01 2.06752527e+00 -1.36962252e+00 -1.01431679e+00]
[ 2.49844924e+00 1.82764665e+00 1.49534959e+00 1.03875815e+00]
[ 5.94868866e-01 6.28253536e-01 5.40358887e-01 5.25489419e-01]
[ 1.30871150e+00 1.48496290e-01 7.65062582e-01 1.42370971e+00]
[-8.32816412e-01 1.82764665e+00 -1.03256698e+00 -1.01431679e+00]
[-1.30871150e+00 3.88374913e-01 -1.20109475e+00 -1.27095115e+00]
[ 1.30871150e+00 3.88374913e-01 1.10211813e+00 1.42370971e+00]
[-2.37947546e-01 -8.11018201e-01 2.59479267e-01 1.40537868e-01]
[-4.75895093e-01 2.06752527e+00 -1.14491883e+00 -1.01431679e+00]
[ 1.54665905e+00 -9.13823325e-02 1.21446997e+00 1.16707534e+00]
[ 7.13842639e-01 3.88374913e-01 4.28007039e-01 3.97172236e-01]
[ 2.37947546e-01 -3.31260955e-01 4.28007039e-01 3.97172236e-01]
[-1.66563282e+00 3.88374913e-01 -1.36962252e+00 -1.27095115e+00]
[-1.05669938e-15 -1.05089682e+00 1.47127420e-01 1.22206842e-02]
[-2.37947546e-01 -9.13823325e-02 4.28007039e-01 3.97172236e-01]
[-4.75895093e-01 8.68132159e-01 -1.25727068e+00 -1.01431679e+00]
[-9.51790185e-01 3.88374913e-01 -1.42579845e+00 -1.27095115e+00]
[-3.56921319e-01 -1.05089682e+00 3.71831115e-01 1.22206842e-02]
[-4.75895093e-01 -9.13823325e-02 4.28007039e-01 3.97172236e-01]
[ 1.07076396e+00 -9.13823325e-02 7.08886658e-01 6.53806603e-01]
[-9.51790185e-01 1.10801078e+00 -1.36962252e+00 -1.14263397e+00]
[ 1.66563282e+00 3.88374913e-01 1.27064590e+00 7.82123787e-01]
[-1.18973773e-01 -5.71139578e-01 2.03303343e-01 1.40537868e-01]
[-3.56921319e-01 2.78716114e+00 -1.31344660e+00 -1.27095115e+00]
[-8.32816412e-01 1.82764665e+00 -1.25727068e+00 -1.14263397e+00]
[ 3.56921319e-01 -5.71139578e-01 5.40358887e-01 1.22206842e-02]
[ 5.94868866e-01 -1.29077545e+00 7.08886658e-01 9.10440971e-01]
[ 3.56921319e-01 -9.13823325e-02 4.84182963e-01 2.68855052e-01]
[ 2.26050169e+00 -9.13823325e-02 1.32682182e+00 1.42370971e+00]
[-2.37947546e-01 -1.29077545e+00 9.09514958e-02 -1.16096500e-01]

[ 7.13842639e-01 -5.71139578e-01 1.04594220e+00 1.16707534e+00]
[-1.05669938e-15 -5.71139578e-01 7.65062582e-01 1.55202689e+00]
[-5.94868866e-01 1.58776803e+00 -1.25727068e+00 -1.27095115e+00]
[-3.56921319e-01 -1.77053269e+00 1.47127420e-01 1.40537868e-01]
[-7.13842639e-01 8.68132159e-01 -1.31344660e+00 -1.27095115e+00]
[ 8.32816412e-01 -5.71139578e-01 4.84182963e-01 3.97172236e-01]
[ 1.07076396e+00 -1.29077545e+00 1.15829405e+00 7.82123787e-01]
[ 1.18973773e+00 -9.13823325e-02 9.89766277e-01 1.16707534e+00]
[-8.32816412e-01 1.10801078e+00 -1.31344660e+00 -1.14263397e+00]
[ 2.37947546e-01 -2.01041131e+00 7.08886658e-01 3.97172236e-01]
[ 5.94868866e-01 -3.31260955e-01 1.04594220e+00 7.82123787e-01]
[ 9.51790185e-01 -3.31260955e-01 4.84182963e-01 1.40537868e-01]]
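
The standardized values printed above come from fitting the scaler on the training rows only and then reusing those statistics for the test rows. The sketch below reproduces that idea by hand with the standard library. The helper names fit_scaler and transform and the sample rows are hypothetical, and the code assumes no column is constant (a zero standard deviation would divide by zero).

```python
from math import sqrt

# Z-score standardization: learn mean and standard deviation from the
# training rows, then apply the same statistics to any other rows.

def fit_scaler(rows):
    n = len(rows)
    n_features = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_features)]
    # Population standard deviation (divide by n, not n - 1).
    stds = [
        sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
        for j in range(n_features)
    ]
    return means, stds


def transform(rows, means, stds):
    return [
        [(r[j] - means[j]) / stds[j] for j in range(len(r))]
        for r in rows
    ]


train = [[1.0, 10.0], [3.0, 30.0]]
test = [[5.0, 20.0]]

means, stds = fit_scaler(train)
train_std = transform(train, means, stds)
test_std = transform(test, means, stds)
print(train_std)
print(test_std)
```

Because the test rows are scaled with the training statistics, their standardized values need not have zero mean or unit variance, which is why some test entries above fall outside the training range.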

svm=SVC(kernel='linear',random_state=1,C=0.1)
svm.fit(X_train_std,Y_train)
Y_pred=svm.predict(X_test_std)
acc=accuracy_score(Y_test,Y_pred)
print('Accuracy:%f'% acc)

Output:
Accuracy:0.955556

print("\n Classification Report:")
print(classification_report(Y_test,Y_pred,target_names=iris.target_names))

Output:
 Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        14
  versicolor       0.94      0.94      0.94        18
   virginica       0.92      0.92      0.92        13

    accuracy                           0.96        45
   macro avg       0.96      0.96      0.96        45
weighted avg       0.96      0.96      0.96        45
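
The columns of the report above can be derived from simple counts of true positives, false positives and false negatives. The following sketch computes them by hand for a small set of hypothetical labels. The function name per_class_metrics and the label lists are assumptions for illustration, not values taken from the Iris run.

```python
# Precision, recall, F1 and support for each class, from raw label lists.

def per_class_metrics(y_true, y_pred):
    result = {}
    for c in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        # Support is the number of true samples of the class.
        result[c] = (precision, recall, f1, tp + fn)
    return result


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

metrics = per_class_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for label, (precision, recall, f1, support) in metrics.items():
    print(label, round(precision, 2), round(recall, 2), round(f1, 2), support)
print("accuracy", round(accuracy, 2))
```

In this toy case one class-1 sample is predicted as class 2, which lowers the recall of class 1 and the precision of class 2 while leaving class 0 untouched.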

cm=confusion_matrix(Y_test,Y_pred)
plt.figure(figsize=(6,5))
sns.heatmap(cm,annot=True,fmt='d',cmap='Blues',
            xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True label")
plt.show()

Output:

Write a program to implement K-means clustering, using a data visualization
technique to illustrate the clustering.
Date:28/08/2025
----------------------------------------------------------------------------------------------------------------
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
dataset=pd.read_csv("Mall_Customers.csv")
dataset.head(5)

Output:
   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40

x=dataset.iloc[:,[3,4]].values
from sklearn.cluster import KMeans
wcss_list=[]
for i in range(1,11):
    kmeans=KMeans(n_clusters=i,init='k-means++',random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1,11),wcss_list)
mtp.title('The Elbow Method cluster(k)')
mtp.xlabel('Number of clusters(k)')
mtp.ylabel('wcss_list')
mtp.show()
Output:

# use 5 clusters to match the five groups plotted below
# (the loop variable i would still be 10 after the elbow loop)
kmeans=KMeans(n_clusters=5,init='k-means++',random_state=42)
y_predict=kmeans.fit_predict(x)
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s=100, c='blue', label='Cluster 1')
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s=100, c='green', label='Cluster 2')
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s=100, c='red', label='Cluster 3')
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s=100, c='cyan', label='Cluster 4')
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s=100, c='magenta', label='Cluster 5')
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow',
            label='Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()
Output:
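
The elbow plot relies on the within-cluster sum of squares (WCSS), which the program reads from kmeans.inertia_. A minimal standard-library sketch of the assign-and-update loop and of the WCSS computation is given below. It is an illustration under simplifying assumptions: the initial centroids are passed in by hand rather than chosen with k-means++, and the function names and sample points are made up.

```python
# A bare-bones k-means loop and the WCSS value used for the elbow method.

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_index(point, centroids):
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, centroids, n_iter=10):
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest_index(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        centroids = [
            [sum(c) / len(members) for c in zip(*members)] if members else old
            for members, old in zip(clusters, centroids)
        ]
    labels = [nearest_index(p, centroids) for p in points]
    wcss = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
    return centroids, labels, wcss


points = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
          [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]]

_, labels_k2, wcss_k2 = kmeans(points, [[1.0, 1.0], [8.0, 8.0]])
_, labels_k1, wcss_k1 = kmeans(points, [[1.0, 1.0]])
print(labels_k2, wcss_k2, wcss_k1)
```

Going from one cluster to two drops the WCSS sharply for this data, which is the kind of bend the elbow plot is meant to reveal.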

Write a program to implement a hierarchical clustering algorithm, using a data
visualization technique to illustrate the clustering.
Date:28/08/2025
----------------------------------------------------------------------------------------------------------------
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
df=pd.read_csv("Mall_Customers.csv")
df.head()

Output:
   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40

df.isnull().sum()
Output:
CustomerID 0
Gender 0
Age 0
Annual Income (k$) 0
Spending Score (1-100) 0
dtype: int64

x=df.iloc[:,[3,4]].values
import scipy.cluster.hierarchy as sch
dendrogram=sch.dendrogram(sch.linkage(x,method='ward'))
plt.title("Dendrogram")
plt.xlabel("Customer")
plt.ylabel("Euclidean distance")
plt.show()

Output:

from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters=5, linkage='ward')
y_hc = hc.fit_predict(x)
plt.scatter(x[y_hc == 0, 0], x[y_hc == 0, 1], s=100, c="red", label="cluster 1")
plt.scatter(x[y_hc == 1, 0], x[y_hc == 1, 1], s=100, c="blue", label="cluster 2")
plt.scatter(x[y_hc == 2, 0], x[y_hc == 2, 1], s=100, c="green", label="cluster 3")
plt.scatter(x[y_hc == 3, 0], x[y_hc == 3, 1], s=100, c="cyan", label="cluster 4")
plt.scatter(x[y_hc == 4, 0], x[y_hc == 4, 1], s=100, c="orange", label="cluster 5")
plt.title("Clusters of customers")
plt.xlabel("Annual Income")
plt.ylabel("Spending Score (1-100)")
plt.legend()
plt.show()

Output:
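
Both the dendrogram and the agglomerative model above use Ward linkage, which at each step merges the pair of clusters whose union increases the within-cluster sum of squares the least. The naive standard-library sketch below illustrates that rule on a handful of made-up points. The function names are hypothetical, and the quadratic search over pairs is meant for readability, not for data of realistic size.

```python
# Naive agglomerative clustering with Ward's merge criterion.

def centroid(cluster):
    return [sum(c) / len(cluster) for c in zip(*cluster)]


def ward_cost(a, b):
    # Increase in within-cluster sum of squares caused by merging a and b.
    ca, cb = centroid(a), centroid(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * d2


def agglomerate(points, n_clusters):
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost(clusters[i], clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        # Merge the cheapest pair and drop the absorbed cluster.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


points = [[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0], [5.0, 20.0]]
clusters = agglomerate(points, n_clusters=3)
print(clusters)
```

The two tight pairs are merged first because their merge cost is smallest, and the isolated point stays in a cluster of its own until fewer clusters are requested.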

Write a program to implement grid-based clustering using a suitable dataset.
Visualize the data using a scatter plot.
Date:
---------------------------------------------------------------------------------------------------------------
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("Mall_Customers.csv")
df.head(5)
Output:

data = df[['Annual Income (k$)','Spending Score (1-100)']]
X = data.values

def grid_based_clustering(X, grid_size):
    # Bin edges that span the data range along each axis
    x_edges = np.linspace(np.min(X[:, 0]), np.max(X[:, 0]), grid_size[0] + 1)
    y_edges = np.linspace(np.min(X[:, 1]), np.max(X[:, 1]), grid_size[1] + 1)
    # Count the points that fall into each grid cell
    grid_cells, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=[x_edges, y_edges])
    return grid_cells, x_edges, y_edges

# Define grid size
grid_size = (9, 9)
grid_cells, x_edges, y_edges = grid_based_clustering(X, grid_size)

import matplotlib.pyplot as plt

plt.figure(figsize=(15,6))
plt.imshow(grid_cells.T,origin='lower',cmap='coolwarm',
           extent=[x_edges[0],x_edges[-1],y_edges[0],y_edges[-1]])
plt.colorbar(label='Number of points')
plt.scatter(X[:,0],X[:,1],c='y',s=50,label='Customers')
plt.xlabel('Annual Income(k$)')
plt.ylabel("Spending Score(1-100)")
plt.title('Grid-Based Clustering Heatmap of Mall Customers')
plt.legend()
plt.show()

Output:
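
The program above counts the points in each grid cell and shows the counts as a heatmap, but it does not assign cluster labels. One possible way to go from counts to clusters is to mark cells with enough points as dense and to join neighbouring dense cells. The sketch below shows that idea with the standard library only. It is an assumption about how the step could be done, not part of the original program: the function name grid_cluster, the fixed cell_size, the min_points threshold and the sample points are all hypothetical.

```python
from collections import deque

# Grid-based clustering sketch: dense cells that touch each other
# (including diagonally) form one cluster; other points are noise (-1).

def grid_cluster(points, cell_size, min_points):
    # Map every point index to the integer grid cell it falls into.
    cells = {}
    for idx, (x, y) in enumerate(points):
        key = (int(x // cell_size), int(y // cell_size))
        cells.setdefault(key, []).append(idx)

    dense = {k for k, members in cells.items() if len(members) >= min_points}

    labels = [-1] * len(points)
    seen = set()
    cluster_id = 0
    for start in sorted(dense):
        if start in seen:
            continue
        # Breadth-first search over neighbouring dense cells.
        queue = deque([start])
        seen.add(start)
        while queue:
            cx, cy = queue.popleft()
            for idx in cells[(cx, cy)]:
                labels[idx] = cluster_id
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    neighbour = (cx + dx, cy + dy)
                    if neighbour in dense and neighbour not in seen:
                        seen.add(neighbour)
                        queue.append(neighbour)
        cluster_id += 1
    return labels


points = [(1, 1), (2, 2), (3, 1), (6, 1), (7, 2),
          (41, 41), (42, 43), (25, 25)]
labels = grid_cluster(points, cell_size=5, min_points=2)
print(labels)
```

The first five points fall into two adjacent dense cells and share a label, the next two form a separate cluster, and the single point in a sparse cell is treated as noise.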
