Descriptive Statistics
Unit - I
Descriptive & Inferential Statistics
Descriptive statistics: organizes, analyses and presents the data in a
meaningful way.
Inferential statistics: compares, tests and predicts the data.
Descriptive statistics summarize your current dataset, while inferential
statistics use sample data to make generalizations about a larger
population.
Mean
The mean provides a measure of central location for the data. If the
data are for a sample, the mean is denoted by $\bar{x}$; if the data are
for a population, the mean is denoted by $\mu$.
For a sample with n observations $\{x_1, x_2, \ldots, x_n\}$, the sample
mean is given by
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
For a population with N observations, the population mean is
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
Example: For the data set 12, 14, 11, 12, 12, 12, 15, 17, 22, 15, 12,
mean = 154/11 = 14.
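The example above can be checked with a few lines of Python. This is
only a minimal sketch; the function name sample_mean is chosen here for
illustration and is not part of the original notes.

```python
# Sample mean: sum of the observations divided by their count.
data = [12, 14, 11, 12, 12, 12, 15, 17, 22, 15, 12]


def sample_mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


total = sum(data)          # sum of all observations
mean = sample_mean(data)   # total divided by n
print(total, len(data), mean)  # 154 11 14.0
```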
Weighted Mean
Weighted Mean is an average computed by giving
different weights to some of the individual values. If all the
weights are equal, then the weighted mean is the same
as the arithmetic mean.
$$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}, \quad w_i = \text{weight for observation } i.$$
Question 1: Suppose a marketing firm surveys 1,000 households to
determine the average number of TVs each household owns. The data
show many households with two or three TVs and a smaller number
with one or four. Every household in the sample has at least one TV and
no household has more than four. Find the mean number of TVs per
household.
Number of TVs per Household    Number of Households
1                              73
2                              378
3                              459
4                              90
Question 2: Consider the following purchases of a raw material over the
past three months. Calculate the mean cost per pound of the raw
material.
Purchase    Cost per Pound ($)    Number of Pounds
1           3.00                  1200
2           3.40                  500
3           2.80                  2750
4           2.90                  1000
5           3.25                  800
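Both questions are weighted-mean problems: in Question 1 the weights are
the household counts, and in Question 2 they are the pounds purchased.
The sketch below is one possible way to check a hand calculation; the
helper name weighted_mean is an assumption made for this illustration.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Question 1: number of TVs weighted by number of households.
tvs = [1, 2, 3, 4]
households = [73, 378, 459, 90]
mean_tvs = weighted_mean(tvs, households)

# Question 2: cost per pound weighted by pounds purchased.
cost_per_pound = [3.00, 3.40, 2.80, 2.90, 3.25]
pounds = [1200, 500, 2750, 1000, 800]
mean_cost = weighted_mean(cost_per_pound, pounds)

print(round(mean_tvs, 3), round(mean_cost, 2))  # 2.566 2.96
```

Note that a plain average of the five prices in Question 2 would ignore
how much was bought at each price, which is why the weights matter.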
Although the mean is one of the most frequently used measures of central
tendency, we should be careful about making decisions based on the mean
value of the data alone.
Mean salary = $95,000
At first glance, this seems like a reasonably high salary, suggesting the
employees are well-compensated. However, this mean is heavily
influenced by the exceptionally high salary of Employee J ($500,000),
which skews the average.
Sort the salaries in ascending order and take the middle value to get
the median.
The median salary is $52,500, which is much lower than the mean. This
value gives a better indication of what a typical employee in this
company earns, as it is not affected by the extreme salary of Employee
J.
This example shows that the mean can be misleading in cases where
there are outliers or a skewed distribution.
Median
The median is the value in the middle when the data is arranged in
ascending order (smallest to largest).
• For an odd number of observations, the median is the middle value.
• For an even number of observations, the median is the average of the
two middle values.
Example: The number of deposits in a branch of a bank in a week is
given below.
Day                   1    2    3    4    5    6    7
Number of deposits    245  326  180  226  445  319  260
Median = 260
Consider the following data:
220,180,235,240,270,260,250,425,300,500
Sorted: 180, 220, 235, 240, 250, 260, 270, 300, 425, 500
Median = (250 + 260)/2 = 255
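The two median examples above can be reproduced with a short function
that sorts the data and handles the odd and even cases separately. This
is a minimal sketch; the function name median is our own choice.

```python
def median(values):
    """Return the middle value (odd n) or the mean of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even: average the middle pair


deposits = [245, 326, 180, 226, 445, 319, 260]
amounts = [220, 180, 235, 240, 270, 260, 250, 425, 300, 500]

print(median(deposits))  # 260
print(median(amounts))   # 255.0
```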
The median is more stable than the mean: adding a new observation, even
an extreme one, may not change the median significantly. However, the
drawback of the median is that, unlike the mean, it is not calculated
using the entire dataset. We are just looking for the midpoint instead
of using the actual values of the data.
Mode
The mode is the most frequently occurring value in the dataset.
• If the data have two modes, we say that the data are bimodal.
• If the data have more than two modes, we say that the data are
multimodal.
Question: A small local bakery wants to analyze the sales of its popular
cupcakes over the past month. They have recorded the number of
cupcakes sold each day. Here are the daily sales figures:
25,30,22,28,30,27,25,26,23,25,28,30,24,22,26,27,29,26,27,25.
Find mean, median, and mode.
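One way to check an answer to this question is with Python's standard
statistics module. The sketch below uses multimode so that ties for the
most frequent value would also be reported.

```python
import statistics

sales = [25, 30, 22, 28, 30, 27, 25, 26, 23, 25,
         28, 30, 24, 22, 26, 27, 29, 26, 27, 25]

mean_sales = statistics.mean(sales)      # sum / count
median_sales = statistics.median(sales)  # average of the 10th and 11th sorted values
modes = statistics.multimode(sales)      # list of all most frequent values

print(mean_sales, median_sales, modes)  # 26.25 26.0 [25]
```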
Percentiles
A percentile is a term that describes how a score compares to
other scores from the same set.
Difference between percentage and percentile:
The percentage score reflects how well the student did on the
exam itself; the percentile score reflects how well he did in
comparison to other students.
• We say that a student scored 100 "percent" if and only if he scored
100/100.
• We say that a student is at the 100th "percentile" if all the other
students scored less than him.
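Steps to calculate percentile
• Arrange the data in ascending order.
• The percentile of a value x is the ratio of the number of values
below x to the total number of values, expressed as a percentage:
$$\text{Percentile of } x = \frac{\text{number of values below } x}{\text{total number of values}} \times 100$$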
Q1): The scores of 10 students are 49, 47, 38, 58, 60, 65, 70, 80, 79, 92.
Using the percentile formula, calculate the percentile for the score 70.
Q2): The weights of 10 people were recorded in kg as 35, 41, 42, 56,
58, 62, 70, 71, 90, 77. Find the percentile for the weight 58 kg.
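The percentile of a given value can be sketched in code as follows,
assuming the rule "number of values below x, divided by the total number
of values, times 100". The function name percentile_of is hypothetical.

```python
def percentile_of(values, x):
    """Percentage of the values that lie strictly below x."""
    below = sum(1 for v in values if v < x)
    # Multiply before dividing to keep the arithmetic exact for small counts.
    return 100 * below / len(values)


scores = [49, 47, 38, 58, 60, 65, 70, 80, 79, 92]
weights = [35, 41, 42, 56, 58, 62, 70, 71, 90, 77]

print(percentile_of(scores, 70))   # 60.0
print(percentile_of(weights, 58))  # 40.0
```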
When percentile is given:
• Arrange all data values in the data set in ascending order.
• Calculate the index n = (P/100) × N, where P is the required
percentile and N is the number of data values.
• If n is not an integer, round up: the next integer greater than n
denotes the position of the percentile.
• If n is an integer, then the percentile is the average of the values
in positions n and n+1.
Q3): In a college, a list of scores of 10 students is announced. The
scores are 56, 45, 69, 78, 72, 94, 82, 80, 63, 59. Using the percentile
formula, find the 70th percentile.
Q4): Find the 50th and 85th percentiles for the salary data:
3850,3950,4050,3880,3755,3710,3890,4130,3940,4325,3920,3880.
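The positional rule for finding a percentile value can be sketched as
below. The sketch assumes the index is (P/100) × N with P a whole
number, which is consistent with the quartile values quoted later for
this salary data; the function name percentile_value is our own.

```python
def percentile_value(values, p):
    """Return the p-th percentile (integer p, 0 < p < 100) by the positional rule."""
    if not values or not 0 < p < 100:
        raise ValueError("need non-empty data and 0 < p < 100")
    ordered = sorted(values)
    # index = (p / 100) * N, kept as quotient and remainder to avoid float rounding
    position, remainder = divmod(p * len(ordered), 100)
    if remainder:
        # Not an integer: round up to 1-based position + 1, i.e. 0-based index `position`.
        return ordered[position]
    # Integer: average the values in 1-based positions n and n + 1.
    return (ordered[position - 1] + ordered[position]) / 2


scores = [56, 45, 69, 78, 72, 94, 82, 80, 63, 59]
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]

print(percentile_value(scores, 70))    # 79.0
print(percentile_value(salaries, 50))  # 3905.0
print(percentile_value(salaries, 85))  # 4130
```

Library routines often interpolate between positions instead, so their
results can differ slightly from this hand method.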
Deciles
Deciles correspond to special values of percentile that divide the data
into 10 equal parts. 10% of the data lie at or below the first decile,
20% of the data lie at or below the second decile, and so on.
Quartiles
Quartiles divide the data into 4 equal parts. 25% of the data lie at or
below the first quartile $Q_1$, 50% of the data lie at or below $Q_2$
(the median), and 75% of the data lie at or below $Q_3$.
Q5): Find $Q_1$, $Q_2$, $Q_3$ for the data given in question 4.
Q6): Consider a sample with data values of 27,25,20,15,30,34,28,25.
Find 25th, 50th and 75th percentiles.
Measures of dispersion (Measures of Variability)
Measures of variability are useful in identifying how close the records
are to the mean value and in detecting outliers in the data.
Variability in the data is measured using the following measurements:
• Range
• Interquartile range
• Variance
• Standard Deviation
• Co-efficient of variation
Inter-Quartile Range (IQR)
IQR is the distance between Quartile 1 and Quartile 3 in the dataset.
It measures the spread of the middle 50% of the data.
For the data points in Q4) (salary data), Q3 = 4000 and Q1 = 3865, so
IQR = Q3 - Q1 = 4000 - 3865 = 135.
A smaller IQR indicates that the middle 50% of the data points are
close to each other, that is, there is less variability in the middle
50% of the data. Here, an IQR of 135 indicates a moderate spread of
salaries in the middle 50% of the dataset.
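The quartiles and IQR for the salary data can be reproduced with a
short script. It is a sketch that assumes the positional percentile
rule with index (P/100) × N; the helper name percentile_value is
hypothetical.

```python
def percentile_value(values, p):
    """p-th percentile (integer p, 0 < p < 100) by the positional rule."""
    ordered = sorted(values)
    position, remainder = divmod(p * len(ordered), 100)
    if remainder:
        return ordered[position]  # round up to the next position
    return (ordered[position - 1] + ordered[position]) / 2  # average two positions


salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]

q1 = percentile_value(salaries, 25)  # 25th percentile
q3 = percentile_value(salaries, 75)  # 75th percentile
iqr = q3 - q1                        # spread of the middle 50%

print(q1, q3, iqr)  # 3865.0 4000.0 135.0
```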
Standard deviation:
A standard deviation is a measure of how dispersed the data is
in relation to the mean.
There are six steps for finding the standard deviation by hand:
• List each score and find their mean.
• Subtract the mean from each score to get the deviation from the
mean.
• Square each of these deviations.
• Add up all of the squared deviations.
• Divide the sum of the squared deviations by n – 1 (for a sample)
or N (for a population).
• Find the square root of the number you found.
Find the S.D. of the data given in Q4.
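The six steps above translate almost line by line into code. The sketch
below applies them to the Q4 salary data; the function name
standard_deviation and its sample flag are our own naming.

```python
import math

salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]


def standard_deviation(values, sample=True):
    """Standard deviation by the six hand-calculation steps."""
    n = len(values)
    mean = sum(values) / n                       # step 1: find the mean
    deviations = [x - mean for x in values]      # step 2: deviations from the mean
    squared = [d ** 2 for d in deviations]       # step 3: square each deviation
    total = sum(squared)                         # step 4: add them up
    variance = total / (n - 1 if sample else n)  # step 5: divide by n - 1 or N
    return math.sqrt(variance)                   # step 6: take the square root


sample_sd = standard_deviation(salaries)                    # divides by n - 1
population_sd = standard_deviation(salaries, sample=False)  # divides by N

print(round(sample_sd, 2), round(population_sd, 2))  # 165.65 158.6
```

The sample version is slightly larger because it divides by n - 1
rather than N.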
Co-efficient of Variation:
The coefficient of variation (CV) is a relative measure of variability that
indicates the size of a standard deviation in relation to its mean. It is
also known as the relative standard deviation (RSD).
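The notes do not state a formula for the CV, so the sketch below assumes
the usual definition, CV = (standard deviation / mean) × 100%, and
applies it to the Q4 salary data using the sample standard deviation.

```python
import statistics

salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]

mean_salary = statistics.mean(salaries)  # 3940
sd_salary = statistics.stdev(salaries)   # sample standard deviation (n - 1)

# Assumed definition: standard deviation as a percentage of the mean.
cv_percent = sd_salary / mean_salary * 100

print(round(cv_percent, 1))  # 4.2
```

A CV of about 4.2% says the standard deviation is small relative to the
mean salary.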
Normal Distribution:
A normal distribution, also known as a Gaussian distribution or a bell
curve, is a common way to describe how values in a dataset are
distributed.
Skewness
• Positive skewness is when the distribution has a long tail towards
the right side of the graph. This is called a right-skewed graph.
• In this distribution, the mean is typically greater than the median,
which is greater than the mode. That is, mean > median > mode.
• Negative skewness is when the distribution has a long tail towards
the left side of the graph. This is called a left-skewed graph. Here,
mode > median > mean.
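The ordering of mean, median and mode can be checked numerically. The
sketch below uses the Q4 salary data and, as an assumption not taken
from the notes, the moment coefficient of skewness (third central
moment divided by the 1.5 power of the second central moment).

```python
import statistics

salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]

n = len(salaries)
mean = sum(salaries) / n
median = statistics.median(salaries)
mode = statistics.mode(salaries)

# Central moments (population form, dividing by n).
m2 = sum((x - mean) ** 2 for x in salaries) / n
m3 = sum((x - mean) ** 3 for x in salaries) / n
skewness = m3 / m2 ** 1.5  # positive => right-skewed, negative => left-skewed

print(mean, median, mode)  # 3940.0 3905.0 3880
print("right-skewed" if skewness > 0 else "left-skewed or symmetric")
```

For this data the skewness is positive and mean > median > mode, which
matches the right-skewed pattern described above.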
Right Skewness:
Examples:
Income distribution: In many countries, the distribution of income is
right-skewed. While most people earn an average or below-average
income, a small number of individuals earn exceptionally high incomes,
creating a long tail on the right.
Real estate pricing: Housing prices in a city or region often exhibit right
skewness. Most houses may be priced within an affordable range, but
there are a few luxury properties priced much higher, extending the tail
to the right.
Left Skewness:
Time spent on a task by experts: The time taken by experts to
complete a specific task might be left-skewed. Most experts take close
to the usual time, but a few may finish much faster, extending the tail
to the left.
Employee retirement age: In many companies, the ages at which
employees retire may be left-skewed. Most employees retire around a
standard retirement age, but some might retire earlier due to health
issues or personal choices, leading to a long tail on the left.
Kurtosis
Kurtosis is a measure of the peakedness of the distribution and
indicates how high the distribution is around the mean. It indicates
whether the distribution has a flat, normal or peaked shape.
Kurtosis is also a measure of the shape of the tails, that is, whether
the tails of the distribution are heavy or light.
Formula
A kurtosis value < 3 represents a platykurtic distribution.
A kurtosis value > 3 represents a leptokurtic distribution.
A kurtosis value = 3 represents a mesokurtic distribution, such as the
normal distribution.
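The thresholds at 3 suggest the non-excess (Pearson) form of kurtosis,
the fourth central moment divided by the square of the second. The
sketch below assumes that form, since no formula is given here; both
small data sets are made up purely for illustration.

```python
def kurtosis(values):
    """Pearson (non-excess) kurtosis: m4 / m2**2, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2


def classify(k, tol=1e-9):
    """Label a kurtosis value relative to the normal-distribution value of 3."""
    if abs(k - 3) <= tol:
        return "mesokurtic"
    return "platykurtic" if k < 3 else "leptokurtic"


flat = [1, 2, 3, 4, 5]             # evenly spread, light tails (illustrative)
heavy = [0] * 8 + [10, -10]        # mostly central with two extremes (illustrative)

for name, data in (("flat", flat), ("heavy", heavy)):
    k = kurtosis(data)
    print(name, round(k, 2), classify(k))
# flat 1.7 platykurtic
# heavy 5.0 leptokurtic
```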
Cross-sectional data consists of several variables recorded at the same
time.
• Examining the GDP of different countries in a single year
• Comparing the financial statements of companies at a fixed date
Time series data is recorded over consistent intervals of time.
• Monthly subscribers
• Weather records
• Inflation Rates: Monthly or yearly inflation rates.
• Stock Prices: Daily closing prices of a company’s stock over several years.
• GDP: Quarterly or annual Gross Domestic Product figures of a country.
• Sales Data: Weekly or monthly sales revenue of a retail store.
Data Visualization
Bar Chart: A bar chart is a frequency chart for qualitative (categorical)
data summarized in a frequency, relative frequency, or percent
frequency distribution.
Pie Chart: Used to show the relative frequency and percent frequency
for categorical data.
Dot Plot: Used to show the distribution of the quantitative data over
the entire range of data (horizontal axis).
Histogram: Used to show the frequency distribution of the
quantitative data over a set of class intervals.
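The frequency distribution behind a histogram can be built by counting
how many values fall into each class interval. The sketch below reuses
the daily cupcake sales data from the earlier question and prints a
text histogram; the class width of 3 is an arbitrary choice made for
this illustration.

```python
from collections import Counter

sales = [25, 30, 22, 28, 30, 27, 25, 26, 23, 25,
         28, 30, 24, 22, 26, 27, 29, 26, 27, 25]

width = 3           # class width (illustrative choice)
start = min(sales)  # lower limit of the first class

# Map each value to the index of its class interval and count per class.
counts = Counter((x - start) // width for x in sales)

for k in sorted(counts):
    low = start + k * width
    high = low + width - 1
    print(f"{low}-{high}: {'#' * counts[k]} ({counts[k]})")
# 22-24: #### (4)
# 25-27: ########## (10)
# 28-30: ###### (6)
```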
Scatter Plot: A scatter diagram is a graphical display of the relationship
between two quantitative variables. Scatter plots are used to observe
relationships between variables.
Read about types of correlation.
Self study: Side by side bar chart, stacked bar chart