Data Science
Assignment- 2nd
B.Sc. (Hons.) Mathematics
Semester – V
Name- Raman Mishra
Roll no- 23/1654
Submitted To: Prof. Dhiraj Singh
1. Choose a dataset containing information on student scores and perform the
following tasks using Python:
I. Load and clean the data.
Before cleaning the data:
StudentID  FirstName  LastName   Gender  Math_Score  English_Score  Science_Score  Study_Hours_Week
1001       James      Smith      Male    68          78             72             7
1002       Mary       Johnson    Female  88          92             85             10
1003       Robert     Williams   M       75          85             79             8
1004       Patricia   Brown      f       92          88             90             12
1005       John       Jones      Male                72             65             6
1006       Jennifer   Garcia     Female  81          95             88             11
1007       Michael    Miller     Male    55          65             59             5
1008       Linda      Davis      Female  79          82             77             9
1009       William    Rodriguez  M       105         90             92             14
1010       Elizabeth  Martinez   f       62          70             66             7
import pandas as pd

# Load the raw data
df = pd.read_csv("C:/Users/ranje/OneDrive/Documents/StudentForA[Link]")

# Normalize column names: trim, lower-case, replace spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

print("Before Cleaning:")
print(df.head())
print(df.isnull().sum())

# Remove exact duplicate rows
df.drop_duplicates(inplace=True)

# Fill missing numeric values with the column mean
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Fill missing categorical values with the column mode
cat_cols = df.select_dtypes(include='object').columns
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

# Reduce the gender labels to 'M' / 'F'
if 'gender' in df.columns:
    df['gender'] = df['gender'].str.strip().str.upper()
    df['gender'] = df['gender'].replace({'MALE': 'M', 'FEMALE': 'F'})

print("\nAfter Cleaning:")
print(df.head())
df.info()
After cleaning the data:
studentid  firstname  lastname   gender  math_score  english_score  science_score  study_hours_week
1001       James      Smith      M       68          78             72             7
1002       Mary       Johnson    F       88          92             85             10
1003       Robert     Williams   M       75          85             79             8
1004       Patricia   Brown      F       92          88             90             12
1005       John       Jones      M       77.0212766  72             65             6
1006       Jennifer   Garcia     F       81          95             88             11
1007       Michael    Miller     M       55          65             59             5
1008       Linda      Davis      F       79          82             77             9
1009       William    Rodriguez  M       105         90             92             14
1010       Elizabeth  Martinez   F       62          70             66             7
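The gender column in the raw table mixes four spellings (Male, Female, M, f), which the cleaning step reduces to M and F. The following is a minimal sketch of that one step on a small in-memory frame whose values are copied from the table; the helper name `normalize_gender` is illustrative and not part of the assignment code.

```python
import pandas as pd

def normalize_gender(series: pd.Series) -> pd.Series:
    """Map mixed gender spellings to 'M' / 'F' (hypothetical helper)."""
    cleaned = series.str.strip().str.upper()
    return cleaned.replace({"MALE": "M", "FEMALE": "F"})

# First four rows of the raw table, gender column only
sample = pd.DataFrame({
    "studentid": [1001, 1002, 1003, 1004],
    "gender": ["Male", "Female", "M", "f"],
})
sample["gender"] = normalize_gender(sample["gender"])
print(sample["gender"].tolist())  # ['M', 'F', 'M', 'F']
```

Upper-casing before the replacement means that 'f', 'F', 'female' and 'Female' all end up as 'F' without listing every variant in the mapping.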
II. Generate summary statistics.
import pandas as pd

df = pd.read_csv(r"C:\Users\ranje\OneDrive\Documents\StudentFor[Link]")

# Same cleaning steps as in part I
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df.drop_duplicates(inplace=True)

numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

cat_cols = df.select_dtypes(include='object').columns
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

if 'gender' in df.columns:
    df['gender'] = df['gender'].str.strip().str.upper()
    df['gender'] = df['gender'].replace({'MALE': 'M', 'FEMALE': 'F'})

# count, mean, std, min, quartiles and max for every numeric column
print("\nSummary Statistics for Numeric Columns:")
print(df.describe())

# count, unique, top and freq for every text column
if len(cat_cols) > 0:
    print("\nSummary for Categorical Columns:")
    print(df.describe(include='object'))
Statistic  math_score   english_score  study_hours_week
count      50           50             50
mean       77.0212766   80.96          9.02
std        13.16843685  10.51327635    2.661057025
min        48           55             4
25%        68.25        73.5           7
50%        77.5106383   82.5           9
75%        87.75        90             11
max        105          97             14
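The quartiles in the table can be used to check whether the maximum math score of 105 would be flagged as an outlier. The sketch below applies the common 1.5 x IQR rule (the default whisker rule of matplotlib box plots) to the reported math_score quartiles. The rule is an assumption here, since the assignment does not state an outlier criterion.

```python
# Quartiles and maximum of math_score, copied from the summary table
q1, q3 = 68.25, 87.75
max_score = 105

iqr = q3 - q1                      # interquartile range
lower_fence = q1 - 1.5 * iqr       # values below this count as outliers
upper_fence = q3 + 1.5 * iqr       # values above this count as outliers

is_outlier = max_score < lower_fence or max_score > upper_fence
print(iqr, lower_fence, upper_fence, is_outlier)  # 19.5 39.0 117.0 False
```

Under this rule 105 lies inside the fences, so a box plot would not mark it as an outlier. If the maximum possible mark is 100, the value may still be a data-entry error that the cleaning step does not catch.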
III. Visualize the distribution of scores using histograms and box plots.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv(r"C:\Users\ranje\OneDrive\Documents\Cleaned_St[Link]")
numeric_cols = df.select_dtypes(include='number').columns

# One histogram per numeric column
for col in numeric_cols:
    plt.figure(figsize=(6, 4))
    plt.hist(df[col])
    plt.title(f'Histogram of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.show()

# One box plot per numeric column
for col in numeric_cols:
    plt.figure(figsize=(6, 4))
    plt.boxplot(df[col])
    plt.title(f'Boxplot of {col}')
    plt.ylabel(col)
    plt.show()
IV. Identify and handle any missing values.
import pandas as pd

df = pd.read_csv(r"C:\Users\ranje\OneDrive\Documents\StudentFor[Link]")

# Count the missing values in each column before any change
print("Before Handling Missing Values:")
print(df.isnull().sum())

df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df.drop_duplicates(inplace=True)

# Numeric columns: fill gaps with the column mean
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Categorical columns: fill gaps with the column mode
cat_cols = df.select_dtypes(include='object').columns
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

if 'gender' in df.columns:
    df['gender'] = df['gender'].str.strip().str.upper()
    df['gender'] = df['gender'].replace({'MALE': 'M', 'FEMALE': 'F'})

print("\nAfter Handling Missing Values:")
print(df.isnull().sum())

# Save the cleaned data for the later questions
df.to_csv(r"C:\Users\ranje\OneDrive\Documents\Cleaned_Stud[Link]", index=False)
print("\nMissing values handled and cleaned file saved.")
The code above counts the missing values in each column and then
fills them: numeric columns with the column mean (mean
imputation) and categorical columns with the column mode. The
cleaned dataset is then saved to a new file.
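As a small illustration of the two imputation rules, the sketch below fills one missing numeric value with the column mean and one missing categorical value with the column mode. The frame `toy` and its values are made up for the example, and the text columns are selected with `exclude="number"` rather than `include="object"` so that the selection does not depend on how a given pandas version stores strings.

```python
import numpy as np
import pandas as pd

# Made-up data with one gap in each column
toy = pd.DataFrame({
    "math_score": [60.0, 80.0, np.nan, 70.0],
    "gender": ["M", "F", "F", None],
})

# Mean imputation for numeric columns: (60 + 80 + 70) / 3 = 70
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

# Mode imputation for the remaining columns: 'F' is the most frequent label
cat_cols = toy.select_dtypes(exclude="number").columns
toy[cat_cols] = toy[cat_cols].fillna(toy[cat_cols].mode().iloc[0])

print(toy)
```

Mean imputation keeps the column mean unchanged but reduces its variance, which is one reason the median is sometimes preferred when a column contains extreme values.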
2. Using the same dataset from the previous question, apply PCA
to reduce the dimensionality.
I. Standardize the data.
import pandas as pd

df = pd.read_csv("C:/Users/ranje/OneDrive/Documents/Cleaned_Stu[Link]")
numerical_cols = ['math_score', 'english_score',
                  'science_score', 'study_hours_week']
df_numerical = df[numerical_cols]

# z-score: subtract the column mean, divide by the sample standard deviation
df_scaled = (df_numerical - df_numerical.mean()) / df_numerical.std()
print(df_scaled.head())
df_scaled.to_csv('standardized_student_data.csv', index=False)
Result
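A quick way to confirm that the standardization worked is to check that every scaled column has mean 0 and standard deviation 1. The sketch below does this on the math and English scores of the first four students from the table in question 1; the frame name `scores` is illustrative.

```python
import numpy as np
import pandas as pd

# Scores of students 1001-1004 from the raw table
scores = pd.DataFrame({
    "math_score": [68, 88, 75, 92],
    "english_score": [78, 92, 85, 88],
})

# pandas .std() uses the sample standard deviation (ddof=1) by default
z = (scores - scores.mean()) / scores.std()

print(z)
print(np.allclose(z.mean(), 0), np.allclose(z.std(), 1))
```

Note that NumPy's `np.std` defaults to `ddof=0`, so mixing pandas and NumPy standardization gives slightly different scaled values on small samples.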
II. Compute the covariance matrix.
import pandas as pd

df_scaled = pd.read_csv(r"C:\Users\ranje\standardized_student_data.csv")

# Covariance between every pair of standardized columns
covariance_matrix = df_scaled.cov()
print("Covariance Matrix:")
print(covariance_matrix)
Result
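Because the columns were standardized first, the covariance matrix computed here is the same as the correlation matrix of the original columns, so its diagonal entries are all 1. The sketch below checks this on synthetic data; the column names `a`, `b`, `c` and the random values are assumptions made only for the demonstration.

```python
import numpy as np
import pandas as pd

# Synthetic data with some correlation between columns a and b
rng = np.random.default_rng(0)
raw = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
raw["b"] += 0.8 * raw["a"]

# Standardize, then take the covariance
z = (raw - raw.mean()) / raw.std()
cov_of_z = z.cov()

# Correlation of the unscaled data
corr_of_raw = raw.corr()

print(np.allclose(cov_of_z.values, corr_of_raw.values))  # True
```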
III. Calculate eigenvalues and eigenvectors.
import pandas as pd
import numpy as np

df_scaled = pd.read_csv(r"C:\Users\ranje\standardized_student_data.csv")
covariance_matrix = df_scaled.cov()

# Each column of `eigenvectors` is the eigenvector for the matching eigenvalue
eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)
print("Eigenvalues:")
print(eigenvalues)
print("\nEigenvectors:")
print(eigenvectors)
Result
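Two properties are useful for checking an eigen-decomposition by hand: each pair satisfies C v = λ v, and the eigenvalues add up to the trace of C. The sketch below verifies both on a hypothetical 2 x 2 covariance matrix. It uses `np.linalg.eigh`, a routine intended for symmetric matrices that returns real eigenvalues in ascending order; this is a substitution for illustration, not the call used in the assignment.

```python
import numpy as np

# Hypothetical covariance matrix of two standardized variables
C = np.array([[1.0, 0.6],
              [0.6, 1.0]])

# eigh is designed for symmetric matrices; eigenvalues come back ascending
eigenvalues, eigenvectors = np.linalg.eigh(C)
print(eigenvalues)  # approximately [0.4, 1.6]

# Check C v = lambda v for every column v at once
print(np.allclose(C @ eigenvectors, eigenvectors * eigenvalues))

# Check that the eigenvalues sum to the trace (2 for two standardized variables)
print(np.isclose(eigenvalues.sum(), np.trace(C)))
```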
IV. Transform the data into principal components.
import pandas as pd
import numpy as np

file_name = 'standardized_student_data.csv'
df_scaled = pd.read_csv(file_name)
covariance_matrix = df_scaled.cov()
eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)

# Order the components by decreasing eigenvalue so that PC1 explains the most variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Project the standardized data onto the eigenvectors
standardized_array = df_scaled.values
principal_components_array = np.dot(standardized_array, eigenvectors)

pc_columns = [f'PC{i+1}' for i in range(principal_components_array.shape[1])]
df_principal_components = pd.DataFrame(data=principal_components_array,
                                       columns=pc_columns)
print("Transformed Data (Principal Components) - First 5 rows:")
print(df_principal_components.head())
df_principal_components.to_csv('principal_components.csv', index=False)
Result
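After the projection, the principal components should be uncorrelated with each other, and the variance of each component should equal its eigenvalue. The sketch below checks this on synthetic data (the array `X` and its values are assumptions for the demonstration) and sorts the eigenvectors so that the first component carries the largest variance.

```python
import numpy as np

# Synthetic data: 200 samples, 3 features, features 0 and 1 correlated
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X[:, 1] += 0.7 * X[:, 0]

# Standardize with the sample standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigen-decomposition of the covariance matrix, sorted by decreasing eigenvalue
cov = np.cov(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Project onto the principal axes
scores = Z @ eigenvectors

# The covariance of the scores is diagonal, with the eigenvalues on the diagonal
score_cov = np.cov(scores, rowvar=False)
print(np.round(score_cov, 6))
```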
V. Visualize the explained variance ratio.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.read_csv("C:/Users/ranje/OneDrive/Documents/Cleaned_Stu[Link]")
numerical_cols = ['math_score', 'english_score',
                  'science_score', 'study_hours_week']
df_scaled = (df[numerical_cols] - df[numerical_cols].mean()) / df[numerical_cols].std()

cov_matrix = np.cov(df_scaled.T)
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

# Sort eigenvalues from largest to smallest
sorted_index = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[sorted_index]

# Share of the total variance carried by each component
explained_variance_ratio = eigenvalues / np.sum(eigenvalues)

plt.figure()
plt.bar(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio)
plt.xlabel("Principal Components")
plt.ylabel("Explained Variance Ratio")
plt.title("Explained Variance Ratio by Principal Components")
plt.show()
RESULT
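The explained variance ratio is usually read cumulatively to decide how many components to keep. The sketch below uses hypothetical eigenvalues (not the values from this dataset) and an assumed target of 80% retained variance to show the calculation.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted from largest to smallest
eigenvalues = np.array([2.0, 1.0, 0.6, 0.4])

ratio = eigenvalues / eigenvalues.sum()      # share per component
cumulative = np.cumsum(ratio)                # running total of the shares

# Smallest number of components whose cumulative share reaches the target
threshold = 0.80                             # assumed target, not from the assignment
n_components = int(np.searchsorted(cumulative, threshold) + 1)

print(np.round(cumulative, 2))
print(n_components)  # 3
```

With these numbers two components retain 75% of the variance and three retain 90%, so three components are needed to reach the 80% target.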
3. Apply the K-means clustering algorithm to segment the
students into different groups based on their scores.
I. Determine the optimal number of clusters using the
elbow method.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def calculate_distances(X, centroids):
    """Euclidean distance from every sample to every centroid."""
    n_samples = X.shape[0]
    k = centroids.shape[0]
    distances = np.zeros((n_samples, k))
    for i in range(k):
        distances[:, i] = np.linalg.norm(X - centroids[i], axis=1)
    return distances
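def kmeans_numpy(X, k, max_iters=100, tol=1e-4, random_seed=42):
    """K-means with random initial centroids drawn from the data."""
    np.random.seed(random_seed)
    n_samples, n_features = X.shape
    random_indices = np.random.choice(n_samples, k, replace=False)
    centroids = X[random_indices]
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid
        distances = calculate_distances(X, centroids)
        labels = np.argmin(distances, axis=1)
        # Update step: move each centroid to the mean of its points
        new_centroids = np.zeros((k, n_features))
        for i in range(k):
            cluster_points = X[labels == i]
            if len(cluster_points) > 0:
                new_centroids[i] = np.mean(cluster_points, axis=0)
            else:
                # Empty cluster: restart its centroid at a random sample
                new_centroids[i] = X[np.random.choice(n_samples)]
        # Stop once the centroids no longer move
        if np.allclose(centroids, new_centroids, atol=tol):
            break
        centroids = new_centroids
    return centroids, labels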
def calculate_wcss(X, centroids, labels):
    """Within-cluster sum of squared distances to the centroids."""
    wcss = 0
    k = centroids.shape[0]
    for i in range(k):
        cluster_points = X[labels == i]
        if len(cluster_points) > 0:
            wcss += np.sum((cluster_points - centroids[i])**2)
    return wcss
try:
    df = pd.read_csv("C:/Users/ranje/OneDrive/Documents/StAssignment_Cleaned.csv")
    features = ['math_score', 'english_score', 'science_score']
    for col in features:
        df[col] = pd.to_numeric(df[col], errors='coerce')
    df_clean = df.dropna(subset=features)
    X = df_clean[features].to_numpy()

    if X.shape[0] > 1:
        # Standardize so that each subject contributes equally to the distances
        mean = np.mean(X, axis=0)
        std = np.std(X, axis=0)
        std[std == 0] = 1.0
        X_scaled = (X - mean) / std

        # Run K-means for K = 1..10 and record the WCSS of each run
        wcss_list = []
        k_range = range(1, 11)
        print("Calculating WCSS for K=1 to 10...")
        for k in k_range:
            centroids, labels = kmeans_numpy(X_scaled, k, random_seed=42)
            wcss = calculate_wcss(X_scaled, centroids, labels)
            wcss_list.append(wcss)
            print(f"K = {k}, WCSS = {wcss}")
        print("\nWCSS Calculation Complete.")

        # Elbow plot: WCSS against the number of clusters
        plt.figure(figsize=(10, 6))
        plt.plot(k_range, wcss_list, marker='o', linestyle='--')
        plt.title('Elbow Method for Optimal K (NumPy Implementation)')
        plt.xlabel('Number of Clusters (K)')
        plt.ylabel('WCSS (Within-Cluster Sum of Squares)')
        plt.xticks(k_range)
        plt.grid(True)
        plt.savefig('kmeans_elbow_plot_numpy.png')
        plt.show()
        print("Elbow plot saved as 'kmeans_elbow_plot_numpy.png'")
except FileNotFoundError:
    print("Error: 'StAssignment_Cleaned.csv' not found.")
except Exception as e:
    print(f"An error occurred: {e}")
RESULT
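The elbow is normally chosen by eye from the plot, but it can also be estimated numerically as the K at which the WCSS curve bends most sharply. The sketch below uses the largest second difference of the WCSS values as that estimate. Both the heuristic and the WCSS numbers are assumptions for illustration; they are not the values produced by the code above.

```python
import numpy as np

# Hypothetical WCSS values for K = 1..7
k_values = np.arange(1, 8)
wcss = np.array([150.0, 80.0, 40.0, 32.0, 27.0, 24.0, 22.0])

# Second difference: wcss[k-1] - 2*wcss[k] + wcss[k+1], defined for K = 2..6
second_diff = np.diff(wcss, n=2)

# The K with the largest second difference is where the curve flattens most abruptly
elbow_k = int(k_values[1:-1][np.argmax(second_diff)])
print(second_diff)  # [30. 32.  3.  2.  1.]
print(elbow_k)      # 3
```

This heuristic is sensitive to noise in the WCSS values, so it is best used as a cross-check on the visual reading of the plot rather than a replacement for it.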
II. Interpret the characteristics of each cluster.
Cluster Interpretation for k = 3
The typical pattern we get with student score data:
Cluster 0 → High Achievers
I. Score Range: Consistently high in most subjects
II. Strong in:
1. Mathematics
2. English
3. Science
III. Characteristics:
o Focused students
o Likely to perform well in competitions and higher studies
o Good conceptual understanding and language skills
Insight:
These students may be good candidates for advanced academic programs,
scholarship recommendations, and leadership roles.
Cluster 1 → Moderate / Average Performers
Score Range: Balanced scores, neither too high nor too low
Characteristics:
o Understands concepts but needs more practice
o English may fluctuate more than Science
o Performance could vary subject-to-subject
Insight:
These students can reach high performance with proper guidance and
consistent study habits.
Cluster 2 → Students Needing Support
Score Range: Lower scores in 2 or more subjects
Characteristics:
o Academic difficulties
o Requires remedial classes
o Confidence may be lower
Insight:
Personalized support, mentor guidance, and structured improvement plans
would help boost their overall performance.
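K-means numbers its clusters arbitrarily, so which label corresponds to the high achievers depends on the initial centroids. A table of mean scores per cluster ties each label to the data before it is interpreted. The sketch below builds such a profile with `groupby`; the six students and their cluster labels are made up for the example.

```python
import pandas as pd

# Made-up scores with cluster labels as K-means might assign them
students = pd.DataFrame({
    "math_score":    [90, 88, 70, 72, 50, 55],
    "english_score": [92, 85, 75, 70, 58, 52],
    "science_score": [89, 91, 68, 74, 49, 57],
    "cluster":       [0, 0, 1, 1, 2, 2],
})

# Mean of each subject within each cluster
profile = students.groupby("cluster").mean().round(1)
print(profile)
```

In this example cluster 0 averages about 89 in every subject, cluster 1 about 71 and cluster 2 about 53, which would support the three descriptions above. With real labels the ordering may differ and should be read from the profile table.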
4. Develop a classification model to predict whether a student will
pass or fail based on their scores.
I. Split the data into training and testing sets.
import pandas as pd
import numpy as np

df = pd.read_csv("StAssignment_Cleaned.csv")
df.columns = df.columns.str.strip().str.lower()

cols = ['math_score', 'english_score', 'science_score']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
df[cols] = df[cols].fillna(df[cols].mean())

# Label: pass (1) if the average of the three scores is at least 40, else fail (0)
df['pass'] = np.where((df['math_score'] +
    df['english_score'] + df['science_score'])/3 >= 40, 1, 0)

X = df[['math_score','english_score','science_score']].values
y = df['pass'].values

# Shuffle the row indices, then use 80% for training and 20% for testing
indices = np.random.permutation(len(X))
train_size = int(0.8 * len(X))
train_idx = indices[:train_size]
test_idx = indices[train_size:]

X_train = X[train_idx]
y_train = y[train_idx]
X_test = X[test_idx]
y_test = y[test_idx]
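The split above draws a new permutation on every run, so the reported metrics change from run to run. A minimal sketch of a reproducible variant follows, assuming a fixed seed is acceptable; the function name `split_indices` and the seed value are illustrative.

```python
import numpy as np

def split_indices(n_samples, train_fraction=0.8, seed=42):
    """Return shuffled train/test index arrays (hypothetical helper)."""
    rng = np.random.default_rng(seed)          # seeded generator for repeatability
    shuffled = rng.permutation(n_samples)
    train_size = int(train_fraction * n_samples)
    return shuffled[:train_size], shuffled[train_size:]

# 50 students, as in the summary statistics
train_idx, test_idx = split_indices(50)
print(len(train_idx), len(test_idx))  # 40 10
```

Calling `split_indices(50)` again with the same seed returns the same two index arrays, so the model is always evaluated on the same ten students.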
II. Train a logistic regression model.
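def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_log_reg(X, y, lr=0.001, epochs=5000):
    """Logistic regression fitted by batch gradient descent."""
    m, n = X.shape
    W = np.zeros(n)
    b = 0
    for _ in range(epochs):
        z = np.dot(X, W) + b
        A = sigmoid(z)                    # predicted pass probabilities
        dW = np.dot(X.T, (A - y)) / m     # gradient of the log-loss w.r.t. W
        db = np.sum(A - y) / m            # gradient of the log-loss w.r.t. b
        W -= lr * dW
        b -= lr * db
    return W, b

W, b = train_log_reg(X_train, y_train)

def predict(X, W, b):
    # Classify as pass (1) when the predicted probability is at least 0.5
    return (sigmoid(np.dot(X, W) + b) >= 0.5).astype(int)

y_pred = predict(X_test, W, b)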
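Gradient descent tends to behave better when the inputs are on a common scale, so standardizing the scores before training may help. The sketch below shows the same update rule on standardized inputs. The data are synthetic, and the pass rule used here (average of at least 70) is chosen only so that both classes appear; it is not the rule from the assignment.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic scores and labels (assumptions for the demonstration)
rng = np.random.default_rng(0)
X = rng.normal(loc=70, scale=12, size=(200, 3))
y = (X.mean(axis=1) >= 70).astype(int)

# Standardize each feature using its own mean and standard deviation
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / sigma

# Batch gradient descent on the log-loss
W, b = np.zeros(3), 0.0
learning_rate, epochs = 0.1, 2000
for _ in range(epochs):
    A = sigmoid(X_std @ W + b)
    W -= learning_rate * (X_std.T @ (A - y)) / len(y)
    b -= learning_rate * np.sum(A - y) / len(y)

train_accuracy = np.mean((sigmoid(X_std @ W + b) >= 0.5) == y)
print(train_accuracy)
```

When this is applied to a train/test split, the mean and standard deviation should be computed on the training rows only and then reused to scale the test rows.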
III. Evaluate the model using accuracy, precision, recall, and
F1-score.
# Confusion-matrix counts, with pass (1) as the positive class
tp = np.sum((y_test == 1) & (y_pred == 1))
tn = np.sum((y_test == 0) & (y_pred == 0))
fp = np.sum((y_test == 0) & (y_pred == 1))
fn = np.sum((y_test == 1) & (y_pred == 0))

accuracy = (tp + tn) / len(y_test)
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1_score = (2 * precision * recall / (precision + recall)
            if (precision + recall) > 0 else 0)

print(accuracy, precision, recall, f1_score)
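A small worked example makes the four metrics concrete. The labels and predictions below are made up: eight students, of whom five actually passed.

```python
import numpy as np

# Made-up true labels and predictions (1 = pass, 0 = fail)
y_test = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])

tp = np.sum((y_test == 1) & (y_pred == 1))   # 4 passes predicted as pass
tn = np.sum((y_test == 0) & (y_pred == 0))   # 2 fails predicted as fail
fp = np.sum((y_test == 0) & (y_pred == 1))   # 1 fail predicted as pass
fn = np.sum((y_test == 1) & (y_pred == 0))   # 1 pass predicted as fail

accuracy = (tp + tn) / len(y_test)                    # 6 / 8
precision = tp / (tp + fp)                            # 4 / 5
recall = tp / (tp + fn)                               # 4 / 5
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of the two

print(accuracy, precision, recall, round(f1, 2))
```

Here accuracy is 0.75 while precision, recall and F1 are all 0.8, because the single false positive and the single false negative balance each other.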
IV. Discuss the implications of the model's performance.
Accuracy is the share of test students whose pass/fail outcome
is predicted correctly. Precision is the share of students
predicted to pass who actually passed, so low precision means
failing students are being labelled as passing. Recall is the
share of students who actually passed that the model
identified, so low recall means passing students are being
labelled as failing. The F1-score is the harmonic mean of
precision and recall and summarises both in one number. If
precision or recall is low, the model misclassifies students,
which can make student evaluation unfair. Improving the model
may require including more input features, tuning the learning
rate and number of epochs, or collecting more training data.
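Accuracy is easier to judge against a simple baseline. When most students pass, a model that always predicts "pass" already scores well, so the trained model should be compared with that figure. The sketch below computes the majority-class baseline for a hypothetical test set of ten students.

```python
import numpy as np

# Hypothetical test labels: nine students pass, one fails
y_test = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 1])

# Baseline: always predict the most frequent class
majority_class = int(np.bincount(y_test).argmax())
baseline_accuracy = np.mean(y_test == majority_class)

print(majority_class, baseline_accuracy)  # 1 0.9
```

In this example a model would need more than 90% accuracy to add anything over the baseline. If the test set contains no failing students at all, the baseline reaches 100% and the metrics say little about the model.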