Student Score Data Analysis in Python

The document outlines a data science assignment focused on analyzing a dataset of student scores using Python. It includes tasks such as data cleaning, generating summary statistics, visualizing score distributions, applying PCA for dimensionality reduction, and implementing K-means clustering to segment students. Additionally, it discusses developing a classification model to predict student pass/fail outcomes based on their scores.

Uploaded by

Raman Mishra

Data Science Assignment - 2nd

B.Sc. (Hons.) Mathematics

Semester – V
Name- Raman Mishra
Roll no- 23/1654
Submitted To: Prof. Dhiraj Singh
1. Choose a dataset containing information on student scores and perform the
following tasks using Python:
I. Load and clean the data.

Before cleaning the data.

StudentID  FirstName  LastName   Gender  Math_Score  English_Score  Science_Score  Study_Hours_Week
1001       James      Smith      Male    68          78             72             7
1002       Mary       Johnson    Female  88          92             85             10
1003       Robert     Williams   M       75          85             79             8
1004       Patricia   Brown      f       92          88             90             12
1005       John       Jones      Male    (missing)   72             65             6
1006       Jennifer   Garcia     Female  81          95             88             11
1007       Michael    Miller     Male    55          65             59             5
1008       Linda      Davis      Female  79          82             77             9
1009       William    Rodriguez  M       105         90             92             14
1010       Elizabeth  Martinez   f       62          70             66             7

import pandas as pd

# Load the raw data
df = pd.read_csv("C:/Users/ranje/OneDrive/Documents/StudentForA[Link]")

# Normalize column names: trim spaces, lowercase, spaces to underscores
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

print("Before Cleaning:")
print(df.head())
print(df.isnull().sum())

# Remove exact duplicate rows
df.drop_duplicates(inplace=True)

# Fill missing numeric values with the column mean
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Fill missing text values with the most frequent value (mode)
cat_cols = df.select_dtypes(include='object').columns
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

# Standardize gender labels to 'M' / 'F'
if 'gender' in df.columns:
    df['gender'] = df['gender'].str.strip().str.upper()
    df['gender'] = df['gender'].replace({'MALE': 'M', 'FEMALE': 'F'})

print("\nAfter Cleaning:")
print(df.head())
print([Link]())

After cleaning the data.

studentid  firstname  lastname   gender  math_score  english_score  science_score  study_hours_week
1001       James      Smith      M       68          78             72             7
1002       Mary       Johnson    F       88          92             85             10
1003       Robert     Williams   M       75          85             79             8
1004       Patricia   Brown      F       92          88             90             12
1005       John       Jones      M       77.0212766  72             65             6
1006       Jennifer   Garcia     F       81          95             88             11
1007       Michael    Miller     M       55          65             59             5
1008       Linda      Davis      F       79          82             77             9
1009       William    Rodriguez  M       105         90             92             14
1010       Elizabeth  Martinez   F       62          70             66             7

II. Generate summary statistics.

import pandas as pd

df = pd.read_csv(r"C:\Users\ranje\OneDrive\Documents\StudentFor[Link]")

# Same cleaning steps as in part I
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df.drop_duplicates(inplace=True)

numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

cat_cols = df.select_dtypes(include='object').columns
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

if 'gender' in df.columns:
    df['gender'] = df['gender'].str.strip().str.upper()
    df['gender'] = df['gender'].replace({'MALE': 'M', 'FEMALE': 'F'})

# count, mean, std, min, quartiles and max for every numeric column
print("\nSummary Statistics for Numeric Columns:")
print(df.describe())

# count, unique, top and freq for every text column
if len(cat_cols) > 0:
    print("\nSummary for Categorical Columns:")
    print(df.describe(include='object'))

Statistic  math_score    english_score  study_hours_week
count      50            50             50
mean       77.0212766    80.96          9.02
std        13.16843685   10.51327635    2.661057025
min        48            55             4
25%        68.25         73.5           7
50%        77.5106383    82.5           9
75%        87.75         90             11
max        105           97             14

III. Visualize the distribution of scores using histograms and box plots.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv(r"C:\Users\ranje\OneDrive\Documents\Cleaned_St[Link]")

numeric_cols = df.select_dtypes(include='number').columns

# One histogram per numeric column
for col in numeric_cols:
    plt.figure(figsize=(6,4))
    plt.hist(df[col])
    plt.title(f'Histogram of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.show()

# One box plot per numeric column
for col in numeric_cols:
    plt.figure(figsize=(6,4))
    plt.boxplot(df[col])
    plt.title(f'Boxplot of {col}')
    plt.ylabel(col)
    plt.show()
IV. Identify and handle any missing values.
import pandas as pd

df = pd.read_csv(r"C:\Users\ranje\OneDrive\Documents\StudentFor[Link]")

# Count missing values per column before any cleaning
print("Before Handling Missing Values:")
print(df.isnull().sum())

df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df.drop_duplicates(inplace=True)

# Numeric columns: fill with the column mean
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Text columns: fill with the most frequent value (mode)
cat_cols = df.select_dtypes(include='object').columns
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

if 'gender' in df.columns:
    df['gender'] = df['gender'].str.strip().str.upper()
    df['gender'] = df['gender'].replace({'MALE': 'M', 'FEMALE': 'F'})

# Every count should now be zero
print("\nAfter Handling Missing Values:")
print(df.isnull().sum())

df.to_csv(r"C:\Users\ranje\OneDrive\Documents\Cleaned_Stud[Link]", index=False)

print("\n Missing values handled and cleaned file saved.")

The Python code above counts the missing values in each column and
handles them by imputation: missing numeric values are replaced with
the mean of their column (mean imputation), and missing text values
with the most frequent value (mode). After cleaning, the dataset is
saved as a new file.
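
As a minimal sketch of the imputation rule described above, the following applies it to a tiny made-up table. The helper name impute and the toy values are assumptions for illustration only; they are not taken from the assignment's dataset.

```python
import numpy as np
import pandas as pd

def impute(df):
    """Fill numeric gaps with the column mean and other gaps with the mode."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            # mode() ignores missing entries; take the first most frequent value
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "math_score": [60.0, np.nan, 80.0],
    "gender": ["M", None, "M"],
})

filled = impute(toy)
print(filled)
```

The missing score becomes the mean of the two observed scores, and the missing gender becomes the most frequent label. Mean imputation keeps the column mean unchanged but shrinks its variance, which is worth keeping in mind when many values are missing.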
2. Using the same dataset from the previous question, apply PCA
to reduce the dimensionality.
I. Standardize the data.
import pandas as pd

df = pd.read_csv("C:/Users/ranje/OneDrive/Documents/Cleaned_Stu[Link]")

numerical_cols = ['math_score', 'english_score', 'science_score', 'study_hours_week']
df_numerical = df[numerical_cols]

# z-score: subtract the column mean and divide by the column standard deviation
df_scaled = (df_numerical - df_numerical.mean()) / df_numerical.std()

print(df_scaled.head())

df_scaled.to_csv('standardized_student_data.csv', index=False)

Result
II. Compute the covariance matrix.
import pandas as pd

df_scaled = pd.read_csv(r"C:\Users\ranje\standardized_student_data.csv")

# Pairwise covariance of the standardized columns
covariance_matrix = df_scaled.cov()

print("Covariance Matrix:")
print(covariance_matrix)

Result
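
One property worth checking at this step: when the columns are standardized with the sample standard deviation (the pandas default), the covariance matrix of the standardized data equals the correlation matrix of the raw data, so its diagonal is all ones. The sketch below verifies this on made-up scores; the toy values are assumptions and not the assignment's data.

```python
import numpy as np
import pandas as pd

# Toy scores, invented for illustration only
toy = pd.DataFrame({
    "math_score": [68.0, 88.0, 75.0, 92.0, 55.0],
    "english_score": [78.0, 92.0, 85.0, 88.0, 65.0],
})

# Standardize with the sample standard deviation (pandas default, ddof=1)
toy_scaled = (toy - toy.mean()) / toy.std()

cov_of_scaled = toy_scaled.cov()   # covariance of standardized columns
corr_of_raw = toy.corr()           # Pearson correlation of the raw columns

print(cov_of_scaled)
print(np.allclose(cov_of_scaled.to_numpy(), corr_of_raw.to_numpy()))
```

This is why PCA on standardized data is often described as PCA on the correlation matrix.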

III. Calculate eigenvalues and eigenvectors.

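import pandas as pd
import numpy as np

df_scaled = pd.read_csv(r"C:\Users\ranje\standardized_student_data.csv")

covariance_matrix = df_scaled.cov()

# eigh is designed for symmetric matrices such as a covariance matrix;
# it returns real eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)

print("Eigenvalues:")
print(eigenvalues)

# Each column of this array is one eigenvector
print("\nEigenvectors:")
print(eigenvectors)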
Result

IV. Transform the data into principal components.

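import pandas as pd
import numpy as np

file_name = 'standardized_student_data.csv'
df_scaled = pd.read_csv(file_name)

covariance_matrix = df_scaled.cov()
# eigh is designed for symmetric matrices such as a covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)

# Sort from largest to smallest eigenvalue so that PC1 carries the most variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

standardized_array = df_scaled.values

# Project the standardized data onto the eigenvectors
principal_components_array = np.dot(standardized_array, eigenvectors)

pc_columns = [f'PC{i+1}' for i in range(principal_components_array.shape[1])]
df_principal_components = pd.DataFrame(data=principal_components_array, columns=pc_columns)

print("Transformed Data (Principal Components) - First 5 rows:")
print(df_principal_components.head())

df_principal_components.to_csv('principal_components.csv', index=False)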
Result

V. Visualize the explained variance ratio.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.read_csv("C:/Users/ranje/OneDrive/Documents/Cleaned_Stu[Link]")
numerical_cols = ['math_score', 'english_score', 'science_score', 'study_hours_week']
df_scaled = (df[numerical_cols] - df[numerical_cols].mean()) / df[numerical_cols].std()

# np.cov expects one variable per row, hence the transpose
cov_matrix = np.cov(df_scaled.T)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Sort eigenvalues from largest to smallest
sorted_index = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[sorted_index]

# Share of the total variance carried by each component
explained_variance_ratio = eigenvalues / np.sum(eigenvalues)

plt.figure()
plt.bar(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio)
plt.xlabel("Principal Components")
plt.ylabel("Explained Variance Ratio")
plt.title("Explained Variance Ratio by Principal Components")
plt.show()
RESULT
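
A common follow-up to the explained variance plot is the cumulative ratio, which indicates how many components are needed to retain a chosen share of the variance. The sketch below uses hypothetical eigenvalues and an assumed 90% cut-off; neither comes from the assignment's results.

```python
import numpy as np

# Hypothetical eigenvalues, invented for illustration (not the assignment's results)
eigenvalues = np.array([2.4, 1.0, 0.4, 0.2])

# Sort from largest to smallest so that PC1 explains the most variance
eigenvalues = np.sort(eigenvalues)[::-1]

explained_variance_ratio = eigenvalues / eigenvalues.sum()
cumulative_ratio = np.cumsum(explained_variance_ratio)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.90  # assumed cut-off, chosen only for this example
n_components = int(np.searchsorted(cumulative_ratio, threshold) + 1)

print(explained_variance_ratio)
print(cumulative_ratio)
print(n_components)
```

With these made-up eigenvalues, the first two components stay below the cut-off and the third one crosses it, so three components would be kept.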
3. Apply the K-means clustering algorithm to segment the
students into different groups based on their scores.
I. Determine the optimal number of clusters using the
elbow method.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def calculate_distances(X, centroids):
    # Euclidean distance from every sample to every centroid
    n_samples = X.shape[0]
    k = centroids.shape[0]
    distances = np.zeros((n_samples, k))

    for i in range(k):
        distances[:, i] = np.linalg.norm(X - centroids[i], axis=1)

    return distances

def kmeans_numpy(X, k, max_iters=100, tol=1e-4, random_seed=42):
    np.random.seed(random_seed)
    n_samples, n_features = X.shape

    # Initialize centroids from k distinct random samples
    random_indices = np.random.choice(n_samples, k, replace=False)
    centroids = X[random_indices]

    for _ in range(max_iters):
        # Assignment step: each sample goes to its nearest centroid
        distances = calculate_distances(X, centroids)
        labels = np.argmin(distances, axis=1)

        # Update step: each centroid moves to the mean of its cluster
        new_centroids = np.zeros((k, n_features))
        for i in range(k):
            cluster_points = X[labels == i]
            if len(cluster_points) > 0:
                new_centroids[i] = np.mean(cluster_points, axis=0)
            else:
                # Re-seed an empty cluster with a random sample
                new_centroids[i] = X[np.random.choice(n_samples)]

        # Stop once the centroids no longer move
        if np.allclose(centroids, new_centroids, atol=tol):
            break

        centroids = new_centroids

    return centroids, labels

def calculate_wcss(X, centroids, labels):
    # Within-cluster sum of squared distances to the assigned centroid
    wcss = 0
    k = centroids.shape[0]

    for i in range(k):
        cluster_points = X[labels == i]
        if len(cluster_points) > 0:
            wcss += np.sum((cluster_points - centroids[i])**2)

    return wcss

try:
    df = pd.read_csv("C:/Users/ranje/OneDrive/Documents/StAssignment_Cleaned.csv")

    features = ['math_score', 'english_score', 'science_score']

    for col in features:
        df[col] = pd.to_numeric(df[col], errors='coerce')

    df_clean = df.dropna(subset=features)

    X = df_clean[features].to_numpy()

    if X.shape[0] > 1:
        # Standardize each feature; guard against zero variance
        mean = np.mean(X, axis=0)
        std = np.std(X, axis=0)
        std[std == 0] = 1.0

        X_scaled = (X - mean) / std

        wcss_list = []
        k_range = range(1, 11)

        print("Calculating WCSS for K=1 to 10...")

        for k in k_range:
            centroids, labels = kmeans_numpy(X_scaled, k, random_seed=42)
            wcss = calculate_wcss(X_scaled, centroids, labels)
            wcss_list.append(wcss)
            print(f"K = {k}, WCSS = {wcss}")

        print("\nWCSS Calculation Complete.")

        # The "elbow" is the K after which WCSS stops dropping sharply
        plt.figure(figsize=(10, 6))
        plt.plot(k_range, wcss_list, marker='o', linestyle='--')
        plt.title('Elbow Method for Optimal K (NumPy Implementation)')
        plt.xlabel('Number of Clusters (K)')
        plt.ylabel('WCSS (Within-Cluster Sum of Squares)')
        plt.xticks(k_range)
        plt.grid(True)
        plt.savefig('kmeans_elbow_plot_numpy.png')
        plt.show()

        print("Elbow plot saved as 'kmeans_elbow_plot_numpy.png'")

except FileNotFoundError:
    print("Error: 'StAssignment_Cleaned.csv' not found.")
except Exception as e:
    print(f"An error occurred: {e}")
RESULT
II. Interpret the characteristics of each cluster.
Cluster Interpretation for k = 3
The typical pattern we get with student score data:

Cluster 0 → High Achievers

I. Score Range: Consistently high in most subjects
II. Strong in:
   1. Mathematics
   2. English
   3. Science
III. Characteristics:
   o Focused students
   o Likely to perform well in competitions and higher studies
   o Good conceptual understanding and language skills
Insight:
These students may be good candidates for advanced academic programs,
scholarship recommendations, and leadership roles.

Cluster 1 → Moderate / Average Performers

• Score Range: Balanced scores, neither too high nor too low
• Characteristics:
   o Understand concepts but need more practice
   o English scores may fluctuate more than Science scores
   o Performance can vary from subject to subject
Insight:
These students can reach high performance with proper guidance and
consistent study habits.

Cluster 2 → Students Needing Support

• Score Range: Lower scores in 2 or more subjects
• Characteristics:
   o Academic difficulties
   o Require remedial classes
   o Confidence may be lower
Insight:
Personalized support, mentor guidance, and structured improvement plans
would help boost their overall performance.
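
K-means cluster numbers are arbitrary, so descriptions like the ones above are best checked against the per-cluster averages before a label is attached to a cluster number. The sketch below shows one way to build such a profile; the toy scores and the cluster column are assumptions for illustration, not output of the assignment's clustering.

```python
import pandas as pd

# Toy scores and cluster labels, invented for illustration only
toy = pd.DataFrame({
    "math_score":    [92, 88, 75, 72, 55, 50],
    "english_score": [88, 92, 85, 70, 65, 60],
    "science_score": [90, 85, 79, 68, 59, 55],
    "cluster":       [0, 0, 1, 1, 2, 2],
})

# Mean score per subject within each cluster
profile = toy.groupby("cluster").mean()

# Number of students in each cluster
sizes = toy["cluster"].value_counts().sort_index()

print(profile)
print(sizes)
```

Reading the profile row by row shows which cluster has the highest means and which has the lowest, and the sizes show whether any group is too small to interpret.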

4. Develop a classification model to predict whether a student will
pass or fail based on their scores.
I. Split the data into training and testing sets.
import pandas as pd
import numpy as np

df = pd.read_csv("StAssignment_Cleaned.csv")
df.columns = df.columns.str.strip().str.lower()

cols = ['math_score', 'english_score', 'science_score']

df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
df[cols] = df[cols].fillna(df[cols].mean())

# Label: pass (1) if the average of the three scores is at least 40, else fail (0)
df['pass'] = np.where((df['math_score'] + df['english_score'] + df['science_score']) / 3 >= 40, 1, 0)

X = df[['math_score', 'english_score', 'science_score']].values
y = df['pass'].values

# Shuffle the row indices, then use 80% for training and 20% for testing
indices = np.random.permutation(len(X))
train_size = int(0.8 * len(X))
train_idx = indices[:train_size]
test_idx = indices[train_size:]

X_train = X[train_idx]
y_train = y[train_idx]
X_test = X[test_idx]
y_test = y[test_idx]

II. Train a logistic regression model.

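def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_log_reg(X, y, lr=0.001, epochs=5000):
    # Batch gradient descent on the logistic (cross-entropy) loss
    m, n = X.shape
    W = np.zeros(n)
    b = 0
    for _ in range(epochs):
        z = np.dot(X, W) + b
        A = sigmoid(z)                   # predicted pass probabilities
        dW = np.dot(X.T, (A - y)) / m    # gradient with respect to the weights
        db = np.sum(A - y) / m           # gradient with respect to the bias
        W -= lr * dW
        b -= lr * db
    return W, b

W, b = train_log_reg(X_train, y_train)

def predict(X, W, b):
    # Classify as pass when the predicted probability is at least 0.5
    return (sigmoid(np.dot(X, W) + b) >= 0.5).astype(int)

y_pred = predict(X_test, W, b)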

III. Evaluate the model using accuracy, precision, recall, and F1-score.
# Confusion-matrix counts, treating pass (1) as the positive class
tp = np.sum((y_test == 1) & (y_pred == 1))
tn = np.sum((y_test == 0) & (y_pred == 0))
fp = np.sum((y_test == 0) & (y_pred == 1))
fn = np.sum((y_test == 1) & (y_pred == 0))

accuracy = (tp + tn) / len(y_test)

# Guard each ratio against division by zero
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1_score = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

print(accuracy, precision, recall, f1_score)

IV. Discuss the implications of the model's performance.

Accuracy shows how often the model's predictions are correct. Precision
is the share of students predicted to pass who actually passed, so it
falls when failing students are misclassified as passing. Recall is the
share of actual passing students that the model identified correctly.
The F1-score is the harmonic mean of precision and recall and gives a
single balanced measure. If precision or recall is low, the model
misclassifies students, which can affect fairness in student evaluation.
Because the pass label is computed directly from the same three scores
(an average of at least 40), the model is relearning a known threshold
rule, so its scores mainly show how well that rule was recovered.
Improving the model may require including more input features, tuning
parameters, or collecting more training data.
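
To make the difference between these metrics concrete, the sketch below compares a hypothetical model with a baseline that predicts "pass" for everyone. The function name classification_metrics and all label arrays are assumptions invented for this example; they are not results from the assignment.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = pass)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return accuracy, precision, recall, f1

# Toy labels, invented for illustration only: 4 passes and 2 fails
y_true = np.array([1, 1, 1, 1, 0, 0])
y_model = np.array([1, 1, 1, 0, 1, 0])   # hypothetical model predictions
y_all_pass = np.ones_like(y_true)        # baseline: predict pass for everyone

model_metrics = classification_metrics(y_true, y_model)
baseline_metrics = classification_metrics(y_true, y_all_pass)

print("model   :", model_metrics)
print("baseline:", baseline_metrics)
```

In this toy case both sets of predictions reach the same accuracy, yet the baseline never identifies a failing student. Precision and recall expose that difference, which is why accuracy alone can mislead when most students pass.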
