1. Import Data and Required Packages
Importing the Pandas, NumPy, Matplotlib, Seaborn and Warnings libraries.
In [1]: import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
Import the CSV Data as Pandas DataFrame
In [2]: df = pd.read_csv("[Link]")
Show Top 5 Records
In [3]: df.head()
Out[3]:
   gender  race/ethnicity  parental level of education  lunch         test preparation course  math score  reading score  writing score
0  female  group B         bachelor's degree            standard      none                     72          72             74
1  female  group C         some college                 standard      completed                69          90             88
2  female  group B         master's degree              standard      none                     90          95             93
3  male    group A         associate's degree           free/reduced  none                     47          57             44
4  male    group C         some college                 standard      none                     76          78             75
Shape of the dataset
In [4]: df.shape
Out[4]: (1000, 8)
2. Dataset information
gender : sex of the student -> (male/female)
race/ethnicity : ethnicity group of the student -> (group A, B, C, D, E)
parental level of education : parents' final education -> (bachelor's degree, some college, master's degree, associate's degree, high school, some high school)
lunch : having lunch before the test -> (standard or free/reduced)
test preparation course : completed or not completed before the test -> (completed or none)
math score
reading score
writing score
3. Data Checks to perform
Check missing values
Check duplicates
Check data types
Check the number of unique values in each column
Check statistics of the data set
Check the categories present in each categorical column
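The checklist above can be bundled into one helper. This is only a sketch: `run_basic_checks` and the `toy` frame are hypothetical names that do not appear in the notebook, and the toy rows stand in for the real `df`.

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's df (assumption)
toy = pd.DataFrame({
    "gender": ["female", "male", "male", "female"],
    "lunch": ["standard", "free/reduced", "free/reduced", "standard"],
    "math score": [72, 47, 47, 90],
})

def run_basic_checks(frame):
    """Collect the six checks from the list above into one dictionary."""
    categorical = frame.select_dtypes(exclude="number").columns
    return {
        "missing": frame.isna().sum().to_dict(),        # missing values per column
        "duplicates": int(frame.duplicated().sum()),    # fully duplicated rows
        "dtypes": frame.dtypes.astype(str).to_dict(),   # data type per column
        "n_unique": frame.nunique().to_dict(),          # distinct values per column
        "statistics": frame.describe(),                 # summary of numeric columns
        "categories": {col: sorted(frame[col].unique()) for col in categorical},
    }

report = run_basic_checks(toy)
print(report["missing"])
print(report["duplicates"])
print(report["categories"])
```

Running the same helper on the real `df` should reproduce the individual results of sections 3.1 to 3.6 in one place.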
3.1 Check Missing values
In [5]: df.isna().sum()
Out[5]:
gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64
There are no missing values in the data set
3.2 Check Duplicates
In [6]: df.duplicated().sum()
Out[6]: 0
There are no duplicate values in the data set
3.3 Check data types
In [7]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 gender 1000 non-null object
1 race/ethnicity 1000 non-null object
2 parental level of education 1000 non-null object
3 lunch 1000 non-null object
4 test preparation course 1000 non-null object
5 math score 1000 non-null int64
6 reading score 1000 non-null int64
7 writing score 1000 non-null int64
dtypes: int64(3), object(5)
memory usage: 62.6+ KB
3.4 Checking the number of unique values of each column
In [8]: df.nunique()
Out[8]:
gender                          2
race/ethnicity                  5
parental level of education     6
lunch                           2
test preparation course         2
math score                     81
reading score                  72
writing score                  77
dtype: int64
3.5 Check statistics of data set
In [9]: df.describe()
Out[9]: math score reading score writing score
count 1000.00000 1000.000000 1000.000000
mean 66.08900 69.169000 68.054000
std 15.16308 14.600192 15.195657
min 0.00000 17.000000 10.000000
25% 57.00000 59.000000 57.750000
50% 66.00000 70.000000 69.000000
75% 77.00000 79.000000 79.000000
max 100.00000 100.000000 100.000000
Insight
From the above description of the numerical data, all means are close to each other, between 66.09 and 69.17.
All standard deviations are also close, between 14.60 and 15.20.
The minimum math score is 0, while the minimum is higher for writing (10) and higher still for reading (17).
3.6 Exploring Data
In [10]: df.head()
Out[10]:
   gender  race/ethnicity  parental level of education  lunch         test preparation course  math score  reading score  writing score
0  female  group B         bachelor's degree            standard      none                     72          72             74
1  female  group C         some college                 standard      completed                69          90             88
2  female  group B         master's degree              standard      none                     90          95             93
3  male    group A         associate's degree           free/reduced  none                     47          57             44
4  male    group C         some college                 standard      none                     76          78             75
In [11]: print("Categories in 'gender' variable: ",end=" " )
print(df['gender'].unique())
print("Categories in 'race/ethnicity' variable: ",end=" ")
print(df['race/ethnicity'].unique())
print("Categories in 'parental level of education' variable:",end=" " )
print(df['parental level of education'].unique())
print("Categories in 'lunch' variable: ",end=" " )
print(df['lunch'].unique())
print("Categories in 'test preparation course' variable: ",end=" " )
print(df['test preparation course'].unique())
Categories in 'gender' variable: ['female' 'male']
Categories in 'race/ethnicity' variable: ['group B' 'group C' 'group A' 'group D' 'group E']
Categories in 'parental level of education' variable: ["bachelor's degree" 'some college' "master's degree" "associate's degree" 'high school' 'some high school']
Categories in 'lunch' variable: ['standard' 'free/reduced']
Categories in 'test preparation course' variable: ['none' 'completed']
In [12]: # define numerical & categorical columns
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']
# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))
We have 3 numerical features : ['math score', 'reading score', 'writing score']
We have 5 categorical features : ['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course']
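The split into numerical and categorical columns can also be done with `DataFrame.select_dtypes`. This is an alternative sketch, not the notebook's method: the `toy` frame and the names `numeric_cols` and `categorical_cols` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy frame with two text columns and two numeric columns (assumption)
toy = pd.DataFrame({
    "gender": ["female", "male"],
    "lunch": ["standard", "free/reduced"],
    "math score": [72, 47],
    "reading score": [72, 57],
})

# Select by dtype family instead of comparing each dtype with 'O'
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

Selecting on the "number" family avoids depending on how text columns are stored, which may differ between pandas versions.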
3.8 Adding columns for "Total Score" and "Average"
In [13]: df['total score'] = df['math score'] + df['reading score'] + df['writing score']
df['average'] = df['total score']/3
df.head()
Out[13]:
   gender  race/ethnicity  parental level of education  lunch         test preparation course  math score  reading score  writing score  total score  average
0  female  group B         bachelor's degree            standard      none                     72          72             74             218          72.666667
1  female  group C         some college                 standard      completed                69          90             88             247          82.333333
2  female  group B         master's degree              standard      none                     90          95             93             278          92.666667
3  male    group A         associate's degree           free/reduced  none                     47          57             44             148          49.333333
4  male    group C         some college                 standard      none                     76          78             75             229          76.333333
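The two derived columns can also be computed row-wise over a list of score columns, which scales better if more subjects are added. This is a sketch under assumptions: `score_cols` and `toy` are illustrative names, and the three rows reuse the first three records shown above.

```python
import pandas as pd

score_cols = ["math score", "reading score", "writing score"]

# First three records from the head() output above
toy = pd.DataFrame({
    "math score": [72, 69, 90],
    "reading score": [72, 90, 95],
    "writing score": [74, 88, 93],
})

# axis=1 aggregates across the columns of each row
toy["total score"] = toy[score_cols].sum(axis=1)
toy["average"] = toy[score_cols].mean(axis=1)

print(toy)
```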
In [14]: reading_full = df[df['reading score'] == 100]['average'].count()
writing_full = df[df['writing score'] == 100]['average'].count()
math_full = df[df['math score'] == 100]['average'].count()
print(f'Number of students with full marks in Maths: {math_full}')
print(f'Number of students with full marks in Writing: {writing_full}')
print(f'Number of students with full marks in Reading: {reading_full}')
Number of students with full marks in Maths: 7
Number of students with full marks in Writing: 14
Number of students with full marks in Reading: 17
In [15]: reading_less_20 = df[df['reading score'] <= 20]['average'].count()
writing_less_20 = df[df['writing score'] <= 20]['average'].count()
math_less_20 = df[df['math score'] <= 20]['average'].count()
print(f'Number of students with 20 marks or fewer in Maths: {math_less_20}')
print(f'Number of students with 20 marks or fewer in Writing: {writing_less_20}')
print(f'Number of students with 20 marks or fewer in Reading: {reading_less_20}')
Number of students with 20 marks or fewer in Maths: 4
Number of students with 20 marks or fewer in Writing: 3
Number of students with 20 marks or fewer in Reading: 1
Insights
From the above values, students performed worst in Maths.
The best performance is in the reading section.
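The same counts can be obtained for all three subjects at once by summing boolean masks. This is a sketch on made-up scores: `toy`, `full_marks` and `low_marks` are hypothetical names, and the numbers are not taken from the dataset.

```python
import pandas as pd

score_cols = ["math score", "reading score", "writing score"]

# Made-up scores for illustration only (assumption)
toy = pd.DataFrame({
    "math score": [100, 18, 66, 0],
    "reading score": [100, 32, 70, 17],
    "writing score": [93, 28, 100, 10],
})

# A comparison yields True/False per cell; summing counts the True values per column
full_marks = (toy[score_cols] == 100).sum()
low_marks = (toy[score_cols] <= 20).sum()

print(full_marks)
print(low_marks)
```

Summing a mask avoids filtering the frame and counting an unrelated column such as `average`.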
4. Exploring Data (Visualization)
4.1 Visualize the average score distribution to draw some conclusions.
Histogram
Kernel Density Estimate (KDE)
4.1.1 Histogram & KDE
In [16]: fig, axs = plt.subplots(1, 2, figsize=(15, 7))
plt.subplot(121)
sns.histplot(data=df,x='average',bins=30,kde=True,color='g')
plt.subplot(122)
sns.histplot(data=df,x='average',kde=True,hue='gender')
plt.show()
In [17]: fig, axs = plt.subplots(1, 2, figsize=(15, 7))
plt.subplot(121)
sns.histplot(data=df,x='total score',bins=30,kde=True,color='g')
plt.subplot(122)
sns.histplot(data=df,x='total score',kde=True,hue='gender')
plt.show()
Female students tend to perform better than male students.
In [18]: plt.subplots(1,3,figsize=(25,6))
plt.subplot(131)
sns.histplot(data=df,x='average',kde=True,hue='lunch')
plt.subplot(132)
sns.histplot(data=df[df.gender=='female'],x='average',kde=True,hue='lunch')
plt.subplot(133)
sns.histplot(data=df[df.gender=='male'],x='average',kde=True,hue='lunch')
plt.show()
Insights
Students with a standard lunch tend to perform better in exams.
This holds for both male and female students.
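The visual impression from the histograms can be backed by group means. This is a sketch on invented numbers: the `toy` frame and its values are assumptions, and only the column names follow the notebook.

```python
import pandas as pd

# Invented averages for illustration only (assumption)
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male", "female", "male"],
    "lunch": ["standard", "free/reduced", "standard", "free/reduced", "standard", "free/reduced"],
    "average": [80.0, 60.0, 70.0, 50.0, 90.0, 40.0],
})

# Mean average score per lunch type
by_lunch = toy.groupby("lunch")["average"].mean()

# Same comparison split by gender: rows are genders, columns are lunch types
by_gender_lunch = toy.groupby(["gender", "lunch"])["average"].mean().unstack()

print(by_lunch)
print(by_gender_lunch)
```

Applied to the real `df`, a gap between the two lunch columns in every row would support the insight numerically. It would still describe an association, not a causal effect.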
In [19]: plt.subplots(1,3,figsize=(25,6))
plt.subplot(131)
ax =sns.histplot(data=df,x='average',kde=True,hue='parental level of education')
plt.subplot(132)
ax =sns.histplot(data=df[df.gender=='male'],x='average',kde=True,hue='parental level of education')
plt.subplot(133)
ax =sns.histplot(data=df[df.gender=='female'],x='average',kde=True,hue='parental level of education')
plt.show()
Insights
In general, parents' education does not appear to help students perform well in exams.
The 2nd plot shows that male students whose parents hold an associate's or master's degree tend to perform well in exams.
In the 3rd plot, there is no visible effect of parents' education on female students.
In [20]: plt.subplots(1,3,figsize=(25,6))
plt.subplot(131)
ax =sns.histplot(data=df,x='average',kde=True,hue='race/ethnicity')
plt.subplot(132)
ax =sns.histplot(data=df[df.gender=='female'],x='average',kde=True,hue='race/ethnicity')
plt.subplot(133)
ax =sns.histplot(data=df[df.gender=='male'],x='average',kde=True,hue='race/ethnicity')
plt.show()
Insights
Students of group A and group B tend to perform poorly in exams.
This holds irrespective of whether they are male or female.
In [21]: plt.figure(figsize=(18,8))
plt.subplot(1, 4, 1)
plt.title('MATH SCORES')
sns.violinplot(y='math score',data=df,color='red',linewidth=3)
plt.subplot(1, 4, 2)
plt.title('READING SCORES')
sns.violinplot(y='reading score',data=df,color='green',linewidth=3)
plt.subplot(1, 4, 3)
plt.title('WRITING SCORES')
sns.violinplot(y='writing score',data=df,color='blue',linewidth=3)
plt.show()
Insights
From the above three plots it is clearly visible that most students score between 60 and 80 in Maths, whereas in reading and writing most of them score between 50 and 80.
4.3 Multivariate analysis using pieplot
In [22]: plt.rcParams['figure.figsize'] = (30, 12)
plt.subplot(1, 5, 1)
size = df['gender'].value_counts()
labels = 'Female', 'Male'
color = ['red','green']
plt.pie(size, colors = color, labels = labels,autopct = '%.2f%%')
plt.title('Gender', fontsize = 20)
plt.axis('off')
plt.subplot(1, 5, 2)
size = df['race/ethnicity'].value_counts()
labels = 'Group C', 'Group D','Group B','Group E','Group A'
color = ['red', 'green', 'blue', 'cyan','orange']
plt.pie(size, colors = color,labels = labels,autopct = '%.2f%%')
plt.title('Race/Ethnicity', fontsize = 20)
plt.axis('off')
plt.subplot(1, 5, 3)
size = df['lunch'].value_counts()
labels = 'Standard', 'Free'
color = ['red','green']
plt.pie(size, colors = color,labels = labels,autopct = '%.2f%%')
plt.title('Lunch', fontsize = 20)
plt.axis('off')
plt.subplot(1, 5, 4)
size = df['test preparation course'].value_counts()
labels = 'None', 'Completed'
color = ['red','green']
plt.pie(size, colors = color,labels = labels,autopct = '%.2f%%')
plt.title('Test Course', fontsize = 20)
plt.axis('off')
plt.subplot(1, 5, 5)
size = df['parental level of education'].value_counts()
labels = 'Some College', "Associate's Degree",'High School','Some High School',"Bachelor's Degree","Master's Degree"
color = ['red', 'green', 'blue', 'cyan','orange','grey']
plt.pie(size, colors = color,labels = labels,autopct = '%.2f%%')
plt.title('Parental Education', fontsize = 20)
plt.axis('off')
plt.tight_layout()
plt.grid()
plt.show()
Insights
The number of male and female students is almost equal.
The largest number of students is in Group C.
More students have a standard lunch than a free/reduced lunch.
More students have not enrolled in any test preparation course than have completed one.
The most common parental education is "Some College", followed closely by "Associate's Degree".
4.4 Feature Wise Visualization
4.4.1 GENDER COLUMN
How is gender distributed?
Does gender have any impact on student performance?
UNIVARIATE ANALYSIS ( How is gender distributed? )
In [23]: f,ax=plt.subplots(1,2,figsize=(20,10))
sns.countplot(x=df['gender'],data=df,palette ='bright',ax=ax[0],saturation=0.95)
for container in ax[0].containers:
    ax[0].bar_label(container,color='black',size=20)
plt.pie(x=df['gender'].value_counts(),labels=['Female','Male'],explode=[0,0.1],auto
plt.show()
Insights
Gender is balanced: there are 518 female students (52%) and 482 male students (48%).
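The shares follow directly from the two counts quoted in the insight. The sketch below rebuilds a gender column from those counts only. The `gender` series is a stand-in for `df['gender']`, not the real data.

```python
import pandas as pd

# Stand-in column built from the two counts reported above (518 and 482)
gender = pd.Series(["female"] * 518 + ["male"] * 482, name="gender")

counts = gender.value_counts()                                    # absolute counts
shares = gender.value_counts(normalize=True).mul(100).round(1)    # percentage shares

print(counts)
print(shares)
```

Because `value_counts()` sorts by frequency, the larger group comes first. Taking pie labels from `counts.index` instead of a hand-written list keeps each label attached to the right slice.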
4.4.2 RACE/ETHNICITY COLUMN
How are students distributed across groups?
Does race/ethnicity have any impact on student performance?
UNIVARIATE ANALYSIS ( How are students distributed across groups? )
In [24]: f,ax=plt.subplots(1,2,figsize=(20,10))
sns.countplot(x=df['race/ethnicity'],data=df,palette = 'bright',ax=ax[0],saturation
for container in ax[0].containers:
    ax[0].bar_label(container,color='black',size=20)
plt.pie(x = df['race/ethnicity'].value_counts(),labels=df['race/ethnicity'].value_c
plt.show()
Insights
Most students belong to group C or group D.
The lowest number of students belong to group A.
BIVARIATE ANALYSIS ( Does race/ethnicity have any impact on student performance? )
In [25]: Group_data2=df.groupby('race/ethnicity')
f,ax=plt.subplots(1,3,figsize=(20,8))
sns.barplot(x=Group_data2['math score'].mean().index,y=Group_data2['math score'].me
ax[0].set_title('Math score',color='#005ce6',size=20)
for container in ax[0].containers:
    ax[0].bar_label(container,color='black',size=15)
sns.barplot(x=Group_data2['reading score'].mean().index,y=Group_data2['reading scor
ax[1].set_title('Reading score',color='#005ce6',size=20)
for container in ax[1].containers:
    ax[1].bar_label(container,color='black',size=15)
sns.barplot(x=Group_data2['writing score'].mean().index,y=Group_data2['writing scor
ax[2].set_title('Writing score',color='#005ce6',size=20)
for container in ax[2].containers:
    ax[2].bar_label(container,color='black',size=15)
Insights
Group E students have scored the highest marks.
Group A students have scored the lowest marks.
Students from a lower socioeconomic status have a lower average in all subjects.
4.4.3 PARENTAL LEVEL OF EDUCATION COLUMN
What is the educational background of the students' parents?
Does parental education have any impact on student performance?
UNIVARIATE ANALYSIS ( What is the educational background of the students' parents? )
In [26]: plt.rcParams['figure.figsize'] = (15, 9)
plt.style.use('fivethirtyeight')
sns.countplot(x=df["parental level of education"], palette = 'Blues')
plt.title('Comparison of Parental Education', fontweight = 30, fontsize = 20)
plt.xlabel('Degree')
plt.ylabel('count')
plt.show()
Insights
The largest group of parents has a "some college" education.
4.4.4 LUNCH COLUMN
Which type of lunch is most common among students?
What is the effect of lunch type on test results?
BIVARIATE ANALYSIS ( Does lunch type have any impact on student performance? )
In [27]: f,ax=plt.subplots(1,2,figsize=(20,8))
sns.countplot(x=df['parental level of education'],data=df,palette = 'bright',hue='t
ax[0].set_title('Students vs test preparation course ',color='black',size=25)
for container in ax[0].containers:
    ax[0].bar_label(container,color='black',size=20)
sns.countplot(x=df['parental level of education'],data=df,palette = 'bright',hue='l
for container in ax[1].containers:
    ax[1].bar_label(container,color='black',size=20)
Insights
Students who get Standard Lunch tend to perform better than students who got
free/reduced lunch
4.4.5 TEST PREPARATION COURSE COLUMN
Which type of lunch is most common among students?
Does the test preparation course have any impact on student performance?
BIVARIATE ANALYSIS ( Does the test preparation course have any impact on student performance? )
In [28]: plt.figure(figsize=(12,6))
plt.subplot(2,2,1)
sns.barplot (x=df['lunch'], y=df['math score'], hue=df['test preparation course'])
plt.subplot(2,2,2)
sns.barplot (x=df['lunch'], y=df['reading score'], hue=df['test preparation course'])
plt.subplot(2,2,3)
sns.barplot (x=df['lunch'], y=df['writing score'], hue=df['test preparation course'])
Out[28]: <Axes: xlabel='lunch', ylabel='writing score'>
Insights
Students who have completed the test preparation course score higher in all three categories than those who have not taken the course.
4.4.6 CHECKING OUTLIERS
In [29]: plt.subplots(1,4,figsize=(16,5))
plt.subplot(141)
sns.boxplot(df['math score'],color='skyblue')
plt.subplot(142)
sns.boxplot(df['reading score'],color='hotpink')
plt.subplot(143)
sns.boxplot(df['writing score'],color='yellow')
plt.subplot(144)
sns.boxplot(df['average'],color='lightgreen')
plt.show()
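Box plots conventionally flag points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule numerically. `iqr_outliers` is a hypothetical helper and the scores are made up for illustration.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Made-up scores for illustration only (assumption)
scores = pd.Series([0, 57, 60, 66, 70, 77, 80, 100], name="math score")

print(iqr_outliers(scores))
```

On the real data the helper could be called once per score column, for example `iqr_outliers(df['math score'])`, to list the points that the box plots draw as individual markers.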
4.4.7 MULTIVARIATE ANALYSIS USING PAIRPLOT
In [30]: sns.pairplot(df,hue = 'gender')
plt.show()
Insights
From the above plot it is clear that all the scores increase linearly with each other.
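The linear relationship seen in the pair plot can be quantified with a Pearson correlation matrix. This is a sketch that uses only the five records shown by `df.head()`, so the coefficients are illustrative and not the values for the full dataset. `toy` and `corr` are assumed names.

```python
import pandas as pd

# The five records shown by df.head() earlier in the notebook
toy = pd.DataFrame({
    "math score": [72, 69, 90, 47, 76],
    "reading score": [72, 90, 95, 57, 78],
    "writing score": [74, 88, 93, 44, 75],
})

# Pairwise Pearson correlation between the score columns
corr = toy.corr()

print(corr.round(2))
```

Coefficients close to 1 on the full `df` would confirm that the three scores rise together.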
5. Conclusions
Student performance is related to lunch, race/ethnicity and parental level of education.
Females lead in pass percentage and are also the top scorers.
Student performance is not strongly related to the test preparation course.
Finishing the preparation course is beneficial.