Module I - Descriptive and Inferential Statistics
DESCRIPTIVE AND
INFERENTIAL STATISTICS
DESCRIPTIVE STATISTICS
Statistics: It is a branch of mathematics dealing with the collection,
analysis, interpretation, and presentation of data.
Types of Statistics:
• Descriptive Statistics: It is used to summarize and describe the characteristics of a dataset. It involves calculating summary measures such as the mean, median, mode, range, standard deviation, and variance.
• Inferential Statistics: It is used to make inferences or predictions about a population based on a sample of data. It involves using statistical tests, such as hypothesis tests and regression analysis.
Population and Sample:
• A population is the entire group that you want
to draw conclusions about.
Mean: It is the sum of all the data entries divided by the number of entries.
Population mean: $\mu = \frac{\sum x_i}{N}$        Sample mean: $\bar{x} = \frac{\sum x_i}{n}$
where $N$ is the number of entries in the population and $n$ is the number of entries in the sample.
For a grouped frequency distribution: $\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$
Median: It is the value that lies in the middle of the data when the data set
is ordered. The median measures the center of an ordered data set by
dividing it into two equal parts. If the data set has an odd number of
entries, the median is the middle data entry. If the data set has an even
number of entries, the median is the mean of the two middle data entries.
Mode: It is the most frequent value in the data set, i.e., the mode of a
data set is the data entry that occurs with the greatest frequency. A data
set can have one mode, more than one mode, or no mode. If no entry is
repeated, the data set has no mode. If two entries occur with the same
greatest frequency, each entry is a mode and the data set is called
bimodal.
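The three definitions above can be checked on a small data set. The following is a minimal sketch using Python's standard library; the data values are made up for illustration, and the helper `modes` is a hypothetical name that follows the convention stated above (no repeated entry means no mode).

```python
from collections import Counter
from statistics import mean, median


def modes(data):
    """Return all modes in sorted order; an empty list if no entry repeats."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []  # no entry is repeated, so the data set has no mode
    return sorted(value for value, c in counts.items() if c == top)


data = [3, 7, 7, 2, 9, 3, 5]  # hypothetical data, 7 entries

print(mean(data))    # sum of entries divided by number of entries
print(median(data))  # middle entry of the ordered data
print(modes(data))   # entries 3 and 7 both occur twice, so the set is bimodal
```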
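Formulae for Measures of Central Tendency
A. For ungrouped data
i) Mean: $\bar{x} = \frac{\sum x_i}{n}$
B. For a discrete frequency distribution
i) Mean: $\bar{x} = \frac{\sum f_i x_i}{N}$ ; $N = \sum f_i$
ii) Median:
$x_i$    $f_i$    c.f. (cumulative frequency)
$x_1$    $f_1$    $f_1$
$x_2$    $f_2$    $f_1 + f_2$
⋮        ⋮        ⋮
$x_n$    $f_n$    $f_1 + f_2 + \cdots + f_n = N$
Median = value of $x$ for which the c.f. is just greater than $\frac{N}{2}$.
C. For a continuous (grouped) frequency distribution
i) Mean: $\bar{x} = \frac{\sum f_i x_i}{N}$ ; $N = \sum f_i$
ii) Median: $\text{Median} = L + \frac{h}{f}\left(\frac{N}{2} - c.f.\right)$
where the median class is the class for which the c.f. is just greater than $\frac{N}{2}$, $L$ is the lower limit of the median class, $h$ is its width, $f$ is its frequency, and $c.f.$ is the cumulative frequency of the class preceding it.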
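The median-class formula can be sketched in a few lines. This is an illustrative implementation, not part of the original notes: the function name `grouped_median` is hypothetical, equal class widths are assumed, and the class limits and frequencies are borrowed from the grouped table used in the moments example later in these notes.

```python
def grouped_median(lower_limits, width, freqs):
    """Median = L + (h / f) * (N/2 - c.f.) for equal-width classes."""
    n_total = sum(freqs)
    cum = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, freqs):
        if cum + f > n_total / 2:  # first class whose c.f. exceeds N/2
            return lower + (width / f) * (n_total / 2 - cum)
        cum += f
    raise ValueError("frequency table is empty")


# Classes 0-10, 10-20, 20-30, 30-40, 40-50 with N = 100
lower_limits = [0, 10, 20, 30, 40]
freqs = [6, 26, 47, 15, 6]
med = grouped_median(lower_limits, 10, freqs)
print(round(med, 4))  # median class is 20-30: 20 + (10/47) * (50 - 32)
```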
Range: It is the difference between largest and smallest values in the data
set.
$\text{Range} = x_{\max} - x_{\min}$
Standard deviation: It is a statistical measure that quantifies the amount of
variation or dispersion in a set of data values. It tells us how much the
values in a dataset deviate, on average, from the mean (average) of the
dataset. If the standard deviation is low, the data points are close to the
mean, indicating less variability. If the standard deviation is high, the data
points are spread out over a larger range of values, indicating greater
variability.
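For ungrouped data: $\sigma = +\sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$ or $\sigma = +\sqrt{\frac{\sum x_i^2}{n} - \bar{x}^2}$ ; $\bar{x} = \frac{\sum x_i}{n}$
For a discrete or continuous frequency distribution: $\sigma = +\sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{N}}$ or $\sigma = +\sqrt{\frac{\sum f_i x_i^2}{N} - \bar{x}^2}$ ; $N = \sum f_i$ and $\bar{x} = \frac{\sum f_i x_i}{N}$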
Variance: It is the square of standard deviation.
Variance $= \sigma^2$
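Remark:
1. Empirical relation (for moderately skewed distributions): Mean − Mode = 3 (Mean − Median)
2. Short-cut method:
A) Arithmetic Mean:
i) for ungrouped data
put $d = x_i - A$ ; $A$ = Assumed Mean
⇒ $\bar{d} = \bar{x} - A$
⇒ $\bar{x} = A + \bar{d}$
ii) for grouped data
put $u = \frac{x_i - A}{h}$
⇒ $\bar{x} = A + h\bar{u}$
B) Standard Deviation:
i) for ungrouped data
put $d = x_i \pm A$ ; $A$ = Assumed Mean
⇒ $\text{S.D.}(x) = \text{S.D.}(d)$
& $\text{S.D.}(kx) = |k| \cdot \text{S.D.}(x)$
ii) for grouped data
put $u = \frac{x_i - A}{h} = \frac{d}{h}$
⇒ $d = uh$
⇒ $\text{S.D.}(d) = \text{S.D.}(uh) = \sqrt{\frac{\sum f h^2 u^2}{N} - (h\bar{u})^2} = h \cdot \text{S.D.}(u)$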
Coefficient of Variation (C.V.)
It is used to compare the consistency (or variability) of data sets.
$\text{C.V.} = \frac{\text{S.D.}}{\text{A.M.}} \times 100 = \frac{\sigma}{\bar{x}} \times 100$
The ratio of the standard deviation to the mean, i.e. $\sigma / \bar{x}$, expressed as a percentage, is known as the coefficient of variation. As this is a dimensionless ratio, it is used to compare the variation of two groups with different means.
Ex. If A and B are two sets of data with $\text{C.V.}(A) < \text{C.V.}(B)$, then set A is more consistent than set B, or equivalently set B shows more variability than set A.
Examples
1) The scores of two cricketers Virat and Rohit for 6 matches each are
given below, find who is more consistent and whose performance is better.
Virat 58 59 60 54 65 66
Rohit 84 56 92 65 86 44
Solution:
Virat $x_i$    Rohit $y_i$    $x_i^2$    $y_i^2$
58 84 3364 7056
59 56 3481 3136
60 92 3600 8464
54 65 2916 4225
65 86 4225 7396
66 44 4356 1936
Total Σ 362 427 21942 32213
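$\text{C.V.} = \frac{\text{S.D.}}{\text{A.M.}} \times 100 = \frac{\sigma}{\bar{x}} \times 100$ ; $\bar{x} = \frac{\sum x_i}{n}$ , $\sigma_x = +\sqrt{\frac{\sum x_i^2}{n} - \bar{x}^2}$
For Virat, $\bar{x} = 60.33$, $\sigma = 4.11$, $\text{C.V.} = \frac{4.11}{60.33} \times 100 = 6.812$
For Rohit, $\bar{y} = 71.17$, $\sigma = 17.44$, $\text{C.V.} = \frac{17.44}{71.17} \times 100 \approx 24.50$
Since Virat has the smaller C.V., Virat is more consistent. Since Rohit has the higher mean score, Rohit's performance is better on average.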
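The arithmetic in this example can be reproduced with a short script. This is a minimal sketch using the population formula $\sigma = \sqrt{\sum x_i^2 / n - \bar{x}^2}$ from the notes; the function name `coefficient_of_variation` is an illustrative choice.

```python
from math import sqrt


def coefficient_of_variation(data):
    """C.V. = 100 * sigma / mean, with sigma the population standard deviation."""
    n = len(data)
    mean = sum(data) / n
    sigma = sqrt(sum(x * x for x in data) / n - mean ** 2)
    return 100 * sigma / mean


virat = [58, 59, 60, 54, 65, 66]
rohit = [84, 56, 92, 65, 86, 44]

cv_virat = coefficient_of_variation(virat)
cv_rohit = coefficient_of_variation(rohit)

# The smaller C.V. identifies the more consistent player.
more_consistent = "Virat" if cv_virat < cv_rohit else "Rohit"
print(round(cv_virat, 2), round(cv_rohit, 2), more_consistent)
```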
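Remark: for fast calculations, put $u = \frac{x_i - A}{h}$, so that the $r$-th moment about $A$ is
$\mu_r' = \frac{1}{N} \sum f_i (x_i - A)^r = \frac{1}{N} \sum f_i (uh)^r = \frac{h^r}{N} \sum f_i u^r = \frac{h^r \sum f_i u^r}{\sum f_i}$ ; $r = 1, 2, 3, \cdots$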
Skewness:
Skewness means 'lack of symmetry'. Skewness is studied to get an idea of the shape of the curve. A distribution is said to be skewed if
a) Mean, median and mode fall at different points,
i.e. Mean ≠ Median ≠ Mode
b) The curve drawn with the help of the given data is not symmetrical but
stretched more to one side than to the other.
Skewness can be positive or negative. The skewness is positive if the longer tail of the distribution lies towards the higher values of the variate (the right), i.e., if the curve drawn with the help of the given data is stretched more to the right than to the left. It is negative in the contrary case.
Notation: $M$ = Mean, $M_d$ = Median, $M_o$ = Mode
Measures of Skewness:
1. Skewness $= \frac{3(\text{Mean} - \text{Median})}{\text{S.D.}}$
2. Coefficient of skewness $\beta_1 = \frac{\mu_3^2}{\mu_2^3}$
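$\mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'(\mu_1')^2 - 3(\mu_1')^4 = 50 - 4(40)(2) + 6(20)(2)^2 - 3(2)^4 = 162$
Thus, the first four central moments are
$\mu_1 = 0$, $\mu_2 = 16$, $\mu_3 = -64$, $\mu_4 = 162$
Now, to find the arithmetic mean $\bar{x}$ we have:
$\bar{x} = \mu_1' + A = 2 + 5 = 7$
To find the S.D.: $\sigma = +\sqrt{\mu_2} = \sqrt{16} = 4$
Coefficient of skewness: $\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{(-64)^2}{16^3} = 1$
Coefficient of kurtosis: $\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{162}{16^2} = 0.63$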
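The conversion from moments about an assumed point $A$ to central moments can be sketched as below. The raw moments 2, 20, 40, 50 are the values substituted in the computation above; the function name `central_from_raw` is a hypothetical choice.

```python
def central_from_raw(m1, m2, m3, m4):
    """Convert the first four moments about A into central moments."""
    mu2 = m2 - m1 ** 2
    mu3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
    mu4 = m4 - 4 * m3 * m1 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4
    return 0, mu2, mu3, mu4  # the first central moment is always zero


mu1, mu2, mu3, mu4 = central_from_raw(2, 20, 40, 50)

beta1 = mu3 ** 2 / mu2 ** 3  # coefficient of skewness
beta2 = mu4 / mu2 ** 2       # coefficient of kurtosis
print(mu2, mu3, mu4, beta1, round(beta2, 2))
```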
∴ $\mu_1' = \frac{h}{N} \sum f_i u$ , $\mu_2' = \frac{h^2}{N} \sum f_i u^2$ , $\mu_3' = \frac{h^3}{N} \sum f_i u^3$ , $\mu_4' = \frac{h^4}{N} \sum f_i u^4$
Table (with $A = 25$, $h = 10$ and $u_i = \frac{x_i - 25}{10}$):
class    Freq. $f_i$    $x_i$      $u_i$    $f_i u_i$    $f_i u_i^2$    $f_i u_i^3$    $f_i u_i^4$
0-10     6              5          −2       −12          24             −48            96
10-20    26             15         −1       −26          26             −26            26
20-30    47             25 = A     0        0            0              0              0
30-40    15             35         1        15           15             15             15
40-50    6              45         2        12           24             48             96
Total    100                                −11          89             −11            233
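∴ $\mu_1' = \frac{10}{100}(-11) = -1.1$ , $\mu_2' = \frac{10^2}{100}(89) = 89$ ,
$\mu_3' = \frac{10^3}{100}(-11) = -110$ , $\mu_4' = \frac{10^4}{100}(233) = 23300$
Thus, central moments are given by
$\mu_1 = 0$
$\mu_2 = \mu_2' - (\mu_1')^2 = 89 - (-1.1)^2 = 87.79$
$\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2(\mu_1')^3 = -110 - 3(89)(-1.1) + 2(-1.1)^3 = 181.038$
$\mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'(\mu_1')^2 - 3(\mu_1')^4 = 23300 - 4(-110)(-1.1) + 6(89)(-1.1)^2 - 3(-1.1)^4 = 23457.7477$
Coefficient of kurtosis: $\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{23457.7477}{(87.79)^2} \approx 3.0437$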
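The step-deviation computation for this frequency table can be reproduced as follows. This is a sketch under the table's own choices ($A = 25$, $h = 10$); the helper name `raw_moments_step` is illustrative.

```python
def raw_moments_step(us, freqs, h):
    """Moments about A for r = 1..4 via mu_r' = h^r * sum(f * u^r) / N."""
    n_total = sum(freqs)
    return [h ** r * sum(f * u ** r for u, f in zip(us, freqs)) / n_total
            for r in range(1, 5)]


us = [-2, -1, 0, 1, 2]      # u_i = (x_i - 25) / 10 for the five class midpoints
freqs = [6, 26, 47, 15, 6]  # class frequencies, N = 100

m1, m2, m3, m4 = raw_moments_step(us, freqs, 10)

# Central moments from the moments about A
mu2 = m2 - m1 ** 2
mu3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
mu4 = m4 - 4 * m3 * m1 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4

beta1 = mu3 ** 2 / mu2 ** 3  # coefficient of skewness
beta2 = mu4 / mu2 ** 2       # coefficient of kurtosis
print(round(mu2, 2), round(mu3, 3), round(mu4, 4), round(beta2, 4))
```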
2) Calculate the first four moments about the mean for the following
distribution. Also find 𝛽1 and 𝛽2 .
x 1 2 3 4 5 6 7 8 9 10
f 6 15 23 42 62 60 40 24 13 5
Correlation:
Bivariate distribution: A distribution involving two variables is known as a bivariate distribution. For example, if we measure the heights and weights of a certain group of persons, we get a bivariate distribution, with one variable relating to height and the other relating to weight.
Correlation: In a bivariate distribution, if a change in one variable is accompanied by a change in the other variable, the variables are said to be correlated. If the two variables deviate in the same direction, i.e., if an increase (or decrease) in one results in a corresponding increase (or decrease) in the other, the correlation is said to be direct or positive. But if they constantly deviate in opposite directions, i.e., if an increase (or decrease) in one results in a corresponding decrease (or increase) in the other, the correlation is said to be diverse or negative.
For example, the correlation between income and expenditure is positive, and the correlation between the price of a commodity and the demand for it is negative.
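Karl Pearson’s Coefficient of Correlation: The correlation coefficient between two random variables $x$ and $y$, usually denoted by $r(x, y)$ or simply $r_{xy}$, is a numerical measure of the linear relationship between them and is defined as
$r(x, y) = \frac{\text{Cov}(x, y)}{\sigma_x \sigma_y}$
where the covariance is $\text{Cov}(x, y) = \frac{1}{n} \sum (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n} \sum x_i y_i - \bar{x}\,\bar{y}$
Thus $r(x, y) = \frac{\frac{1}{n} \sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n} \sum (x_i - \bar{x})^2}\,\sqrt{\frac{1}{n} \sum (y_i - \bar{y})^2}} = \frac{\frac{1}{n} \sum x_i y_i - \bar{x}\,\bar{y}}{\sqrt{\frac{1}{n} \sum x_i^2 - \bar{x}^2}\,\sqrt{\frac{1}{n} \sum y_i^2 - \bar{y}^2}}$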
Remark: Correlation Coefficient cannot exceed unity numerically. It always
lies between −1 and + 1, i. e. −1 ≤ 𝑟(𝑥, 𝑦) ≤ 1. If 𝑟 = 1, the correlation is
perfect positive and if 𝑟 = −1, correlation is perfect negative.
Note: The correlation coefficient is independent of change of origin and scale:
If $u = \frac{x - A}{h}$ and $v = \frac{y - B}{k}$ (with $h, k > 0$), so that $x = A + hu$ and $y = B + kv$, then
$r(x, y) = r(u, v)$
Thus, for ungrouped data:
$r(x, y) = r(u, v) = \frac{n \sum uv - \sum u \sum v}{\sqrt{n \sum u^2 - (\sum u)^2}\,\sqrt{n \sum v^2 - (\sum v)^2}}$
Solution:
Correlation coefficient is given by
$r(x, y) = r(u, v) = \frac{n \sum uv - \sum u \sum v}{\sqrt{n \sum u^2 - (\sum u)^2}\,\sqrt{n \sum v^2 - (\sum v)^2}}$
Here $n = 7$.
Supply ($x$)    Price ($y$)    $u = x - 166$    $v = y - 170$    $u^2$    $v^2$    $uv$
$r(x, y) = -0.9322$
2) Calculate the coefficient of correlation if $n = 6$, $\sum (x - 18.5) = -3$, $\sum (y - 50) = 20$, $\sum (x - 18.5)(y - 50) = -120$, $\sum (x - 18.5)^2 = 19$ and $\sum (y - 50)^2 = 850$.
Solution: Suppose $u$ and $v$ are the deviations of the $x$ and $y$ values from 18.5 and 50, respectively, i.e. $u = x - 18.5$, $v = y - 50$.
Given: $n = 6$, $\sum u = -3$, $\sum v = 20$, $\sum uv = -120$, $\sum u^2 = 19$, $\sum v^2 = 850$
Correlation coefficient is given by
$r(x, y) = r(u, v) = \frac{n \sum uv - \sum u \sum v}{\sqrt{n \sum u^2 - (\sum u)^2}\,\sqrt{n \sum v^2 - (\sum v)^2}}$
$r(x, y) = r(u, v) = \frac{6(-120) - (-3)(20)}{\sqrt{6(19) - (-3)^2}\,\sqrt{6(850) - (20)^2}}$
$r(x, y) = -0.9395$
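The substitution in this example is easy to verify numerically. A minimal sketch, with `pearson_from_sums` as a hypothetical helper name that takes the summary totals directly:

```python
from math import sqrt


def pearson_from_sums(n, su, sv, suv, suu, svv):
    """r = (n*Σuv − Σu*Σv) / (sqrt(n*Σu² − (Σu)²) * sqrt(n*Σv² − (Σv)²))."""
    numerator = n * suv - su * sv
    denominator = sqrt(n * suu - su ** 2) * sqrt(n * svv - sv ** 2)
    return numerator / denominator


# Totals given in the example: n, Σu, Σv, Σuv, Σu², Σv²
r = pearson_from_sums(6, -3, 20, -120, 19, 850)
print(round(r, 4))  # numerator −660, denominator sqrt(105) * sqrt(4700)
```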
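3) Given: $r = 0.9$, $\sum xy = 60$, $\sigma_x = 3$, $\sum y^2 = 100$. Find the number of items, if $x$ and $y$ are the deviations of $u$ and $v$ from their arithmetic means.
Solution: Given: $r = 0.9$, $\sum xy = 60$, $\sigma_x = 3$, $\sum y^2 = 100$.
Also, $x = u - \bar{u}$ and $y = v - \bar{v}$.
Correlation coefficient is given by
$r(u, v) = \frac{\text{Cov}(u, v)}{\sigma_u \sigma_v} = \frac{\sum (u - \bar{u})(v - \bar{v})}{n\,\sigma_u \sigma_v} = \frac{\sum xy}{n\,\sigma_x \sigma_y}$
Now, $\sigma_y = \sigma_v = \sqrt{\frac{\sum (v - \bar{v})^2}{n}} = \sqrt{\frac{\sum y^2}{n}}$ ⇒ $\sigma_y^2 = \frac{\sum y^2}{n} = \frac{100}{n}$
∴ $r^2 = \frac{(\sum xy)^2}{n^2 \sigma_x^2 \sigma_y^2}$ ⇒ $(0.9)^2 = \frac{(60)^2}{n^2 (3)^2 \frac{100}{n}} = \frac{4}{n}$ ⇒ $n = 4.938 \approx 5$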
Practice Problems:
1) Calculate correlation coefficient for following distribution.
x 5 9 18 19 24 28 32
y 7 9 14 21 23 29 30
f 6 9 13 20 16 11 7
Lines of Regression: The line of regression is the line that gives the best estimate of the value of one variable for any specific value of the other variable.
Regression line of y on x:
Consider the set of values $(x_i, y_i)$; $i = 1, 2, \ldots, n$ and let the line of regression of y on x be $y = mx + c$. Then it is given by
$y - \bar{y} = r \frac{\sigma_y}{\sigma_x} (x - \bar{x})$
or
$y - \bar{y} = b_{yx} (x - \bar{x})$
or
$y - \bar{y} = \frac{\text{Cov}(x, y)}{\sigma_x^2} (x - \bar{x})$
Note: 1) $b_{yx} = r \frac{\sigma_y}{\sigma_x}$ is called the regression coefficient of y on x.
2) $m = b_{yx} = r \frac{\sigma_y}{\sigma_x} = \frac{\text{Cov}(x, y)}{\sigma_x^2}$
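Regression line of x on y:
Consider the set of values $(x_i, y_i)$; $i = 1, 2, \ldots, n$ and let the line of regression of x on y be $x = ny + c$ (here the coefficient $n$ denotes the slope, not the number of observations). Then it is given by
$x - \bar{x} = r \frac{\sigma_x}{\sigma_y} (y - \bar{y})$
or
$x - \bar{x} = b_{xy} (y - \bar{y})$
or
$x - \bar{x} = \frac{\text{Cov}(x, y)}{\sigma_y^2} (y - \bar{y})$
Note: 1) $b_{xy} = r \frac{\sigma_x}{\sigma_y}$ is called the regression coefficient of x on y.
2) $n = b_{xy} = r \frac{\sigma_x}{\sigma_y} = \frac{\text{Cov}(x, y)}{\sigma_y^2}$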
Examples
1) Obtain the regression lines for the following data. Hence find $y(5)$ and $x(9)$, i.e. the estimate of $y$ at $x = 5$ and the estimate of $x$ at $y = 9$.
𝑥 6 2 10 4 8
y 9 11 5 8 7
Solution: Here n = 5
$x_i$    $y_i$    $x_i^2$    $y_i^2$    $x_i y_i$
6 9 36 81 54
2 11 4 121 22
10 5 100 25 50
4 8 16 64 32
8 7 64 49 56
Total Σ 30 40 220 340 214
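$\bar{x} = \frac{\sum x_i}{n} = \frac{30}{5} = 6$ ; $\bar{y} = \frac{\sum y_i}{n} = \frac{40}{5} = 8$
$\sigma_x = \sqrt{\frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2} = \sqrt{\frac{220}{5} - 36} = 2.83$
$\sigma_y = \sqrt{\frac{\sum y_i^2}{n} - \left(\frac{\sum y_i}{n}\right)^2} = \sqrt{\frac{340}{5} - 64} = 2$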
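In general, $r = \pm\sqrt{b_{xy}\, b_{yx}}$, where $r$ takes the common sign of the two regression coefficients. For instance, if $b_{xy}\, b_{yx} = \frac{4}{9}$ and both coefficients are negative, then $r = -\sqrt{\frac{4}{9}} = -\frac{2}{3} = -0.6667$.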
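For the five data pairs of Example 1, the two regression coefficients and the requested estimates can be computed directly from $b_{yx} = \text{Cov}(x, y) / \sigma_x^2$ and $b_{xy} = \text{Cov}(x, y) / \sigma_y^2$. The sketch below is illustrative; the function name `regression_coefficients` is an assumption, not something defined in the notes.

```python
def regression_coefficients(xs, ys):
    """Return (x_bar, y_bar, b_yx, b_xy) using population moments."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar
    var_x = sum(x * x for x in xs) / n - x_bar ** 2
    var_y = sum(y * y for y in ys) / n - y_bar ** 2
    return x_bar, y_bar, cov / var_x, cov / var_y


xs = [6, 2, 10, 4, 8]
ys = [9, 11, 5, 8, 7]
x_bar, y_bar, b_yx, b_xy = regression_coefficients(xs, ys)

# Line of y on x: y - y_bar = b_yx * (x - x_bar), evaluated at x = 5
y_at_5 = y_bar + b_yx * (5 - x_bar)
# Line of x on y: x - x_bar = b_xy * (y - y_bar), evaluated at y = 9
x_at_9 = x_bar + b_xy * (9 - y_bar)

print(round(b_yx, 2), round(b_xy, 2), round(y_at_5, 2), round(x_at_9, 2))
```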
Practice Problems:
1) The regression equations are 8𝑥 − 10𝑦 + 66 = 0 & 40𝑥 − 18𝑦 − 214 = 0.
The value of variance of 𝑥 is 9. Find:
A) The mean values of 𝑥 and 𝑦.
B) The correlation between 𝑥 and 𝑦.
C) The standard deviation of 𝑦.
2) Obtain regression lines to the following data.
Year of Experience 1.1 1.5 2 2.9 3 3.7 4 4.5
Salary 39343 37731 43525 56642 60150 57189 56957 61111
Curve Fitting by Least Squares Criteria :
Polynomial Regression: Let $y = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n$ represent a polynomial of degree $n$.
For a given set of $m$ pairs of observations $(x_i, y_i)$, the unknowns $a_0, a_1, \cdots, a_n$ are estimated by the least squares method, i.e. by minimizing
$S = \sum_{i=1}^{m} d_i^2 = \sum_{i=1}^{m} \left[ y_i - \left( a_0 + a_1 x_i + a_2 x_i^2 + \cdots + a_n x_i^n \right) \right]^2$
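Setting the partial derivatives of $S$ with respect to each coefficient to zero gives the normal equations $\sum_k a_k \sum_i x_i^{j+k} = \sum_i y_i x_i^j$ for $j = 0, \ldots, n$. The sketch below builds and solves them with plain Gaussian elimination. The function name `polyfit_least_squares` is hypothetical, and fitting a second-degree polynomial to the first data table listed after this section is an assumption, since the notes give that table without a problem statement.

```python
def polyfit_least_squares(xs, ys, degree):
    """Least-squares polynomial coefficients [a0, a1, ..., a_degree]."""
    m = degree + 1
    # Normal equations: A[j][k] = sum(x^(j+k)), b[j] = sum(y * x^j)
    A = [[sum(x ** (j + k) for x in xs) for k in range(m)] for j in range(m)]
    b = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(m)]

    # Forward elimination with partial pivoting
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            factor = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]

    # Back substitution
    coeffs = [0.0] * m
    for r in range(m - 1, -1, -1):
        known = sum(A[r][c] * coeffs[c] for c in range(r + 1, m))
        coeffs[r] = (b[r] - known) / A[r][r]
    return coeffs


xs = [-2, -1, 0, 1, 2]
ys = [4, 1, 2, 7, 15]
a0, a1, a2 = polyfit_least_squares(xs, ys, 2)
print(round(a0, 4), round(a1, 4), round(a2, 4))
```

For a straight-line fit, the same function can be called with `degree=1`.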
𝑥 −2 −1 0 1 2
y 4 1 2 7 15
Price 𝑥 20 16 10 11 14
Demand 𝑦 22 41 120 89 56
INFERENTIAL STATISTICS
HYPOTHESIS TESTING
It is a method used to make decisions or inferences about a population
based on sample data. It helps us to determine whether a certain
assumption (or hypothesis) about the data is reasonable.
KEY TERMS
NULL HYPOTHESIS 𝐇𝟎 :
A statement that suggests there is no effect or no difference.
Decision \ Reality    $H_0$ True       $H_0$ False
Reject $H_0$          TYPE I ERROR     CORRECT DECISION
Z-Test
It is a statistical test used to determine whether there is a significant difference between a sample mean and a population mean, or between two sample means, when the population standard deviation is known and the sample size is large ($n \ge 30$).
Under $H_0$, the test statistic follows the standard normal distribution $N(0, 1)$.
Formula
1. One Sample Z-test
Compare a sample mean to a known population mean.
$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
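A one-sample Z-test can be sketched with the standard library alone. The numbers below (sample mean 52, hypothesized mean 50, known $\sigma = 6$, $n = 36$) are hypothetical and chosen only so the arithmetic is easy to follow; `one_sample_z` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(sample_mean, mu0, sigma, n):
    """Return the Z statistic and the two-sided p-value under N(0, 1)."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


# Hypothetical data: Z = (52 - 50) / (6 / sqrt(36)) = 2.0
z, p = one_sample_z(52, 50, 6, 36)
reject_h0 = p < 0.05  # decision at the 5% significance level
print(z, round(p, 4), reject_h0)
```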
t-Test
Formula
1. For a single sample: $t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$
2. For two independent samples: $t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
Degrees of Freedom (df)
It refers to the number of independent values or
observations in a statistical calculation that can vary
while estimating a parameter.
It is crucial in hypothesis testing because it helps to
determine the distribution of the test statistic.
Ex. If we select 5 numbers that must sum to 100, we can freely choose only 4 of them, as the 5th number is then fixed by the required sum.
Hence $df = n - 1 = 5 - 1 = 4$
Example
A sample of 10 students has an average score of 85 with a standard
deviation of 5. Test if the population mean is 80 at a 5% significance level.
Solution
$H_0: \mu = 80$
$H_1: \mu \ne 80$
$t = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{85 - 80}{5 / \sqrt{10}} = 3.16$
Degrees of freedom (df) $= n - 1 = 9$
t-critical (df = 9, two-tailed, $\alpha = 0.05$) $= \pm 2.262$
Since $3.16 > 2.262$, we reject $H_0$.
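The test statistic in this example, and the two-sample formula given earlier, can be sketched as below. Python's standard library has no t-distribution, so the critical value 2.262 is simply copied from the solution above rather than computed. The function names and the inputs of the two-sample call are illustrative assumptions.

```python
from math import sqrt


def one_sample_t(sample_mean, mu0, s, n):
    """t = (X̄ − μ) / (s / sqrt(n))."""
    return (sample_mean - mu0) / (s / sqrt(n))


def two_sample_t(mean1, mean2, s1, s2, n1, n2):
    """t = (X̄1 − X̄2) / sqrt(s1²/n1 + s2²/n2)."""
    return (mean1 - mean2) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


# Worked example: n = 10, mean 85, s = 5, H0: mu = 80
t = one_sample_t(85, 80, 5, 10)
T_CRITICAL = 2.262  # two-tailed, df = 9, alpha = 0.05 (taken from the solution)
reject_h0 = abs(t) > T_CRITICAL
print(round(t, 2), reject_h0)

# Hypothetical two-sample inputs: the denominator is sqrt(4/8 + 4/8) = 1
t2 = two_sample_t(10, 8, 2, 2, 8, 8)
```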
Practice Problems
1. A company manufactures printed circuit boards (PCBs) and aims to
keep the thickness of a specific layer at 1.5 mm. A quality control
engineer measures the thickness of 9 randomly selected boards,
obtaining an average thickness of 1.48 mm with a standard
deviation of 0.03 mm. Can we conclude at α = 0.05 that the average
thickness is significantly different from 1.5 mm?
2. A networking company is testing two models of wireless routers for
their speed performance. A sample of 15 tests with Router A shows
an average speed of 150 Mbps with a standard deviation of 12 Mbps,
while a sample of 18 tests with Router B shows an average speed of
140 Mbps with a standard deviation of 10 Mbps. At α = 0.05, is there
enough evidence to suggest a significant difference in the mean
speed of the two routers?
F-Test
The F-test is used to compare the variances of two datasets or
to assess the significance of multiple factors in a system. It
determines whether differences in variability are statistically
significant by analyzing the ratio of two variances.
The test is commonly applied in quality control, performance
comparison, and model evaluation.
Formula
$F = \frac{s_1^2}{s_2^2}$
where $s_1^2$ and $s_2^2$ are the sample variances.
Example
Two samples have variances of 25 and 16 with sample sizes of 10 each.
Test if the variances are equal at a 5% significance level.
Solution
$H_0: \sigma_1^2 = \sigma_2^2$
$H_1: \sigma_1^2 \ne \sigma_2^2$
$F = \frac{s_1^2}{s_2^2} = \frac{25}{16} = 1.5625$
Degrees of freedom: $df_1 = df_2 = n - 1 = 9$
Critical F-value ($df_1 = 9$, $df_2 = 9$) $= 3.18$
Since $1.5625 < 3.18$, we fail to reject $H_0$.
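The decision in this example reduces to one ratio and one comparison. A minimal sketch follows, with the critical value 3.18 copied from the solution above (it is not computed here) and `f_statistic` as an illustrative name.

```python
def f_statistic(var1, var2):
    """F = s1² / s2², the ratio of the two sample variances."""
    return var1 / var2


f = f_statistic(25, 16)
F_CRITICAL = 3.18  # for df1 = df2 = 9, as quoted in the solution
reject_h0 = f > F_CRITICAL
print(f, reject_h0)
```

When the samples are given as raw observations, `statistics.variance` (which divides by $n - 1$) can supply the two sample variances.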
Practice Problems
1. Two universities conducted a study on the variation in students'
weekly study hours. A researcher wants to test if the variability in
study hours differs between the two universities at a 5% significance
level.
University A: Sample variance = 20, Sample size = 15
University B: Sample variance = 12, Sample size = 15
2. A tech company is testing two different brands of smartphone batteries
to determine if their battery life variability differs. The test is
conducted at a 1% significance level. Below are the recorded battery
life durations (in hours) for both brands:
Brand A 10.5 11.2 9.8 10.0 11.5 10.8
Brand B 9.5 10.0 10.8 9.9 10.3 10.6
Chi-Square Test
The Chi-Square test is a method used to determine whether there is a
significant difference between observed and expected frequencies in
categorical data. It works by comparing how much the actual data
deviates from what we expect under a given assumption. The test is
widely used in research, quality control, and survey analysis.
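Formula
$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$
where $O_i$ are the observed frequencies and $E_i$ are the expected frequencies.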
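The statistic is a single sum over categories. The sketch below uses hypothetical counts (100 observations over five categories, each with expected frequency 20); `chi_square_statistic` is an illustrative name.

```python
def chi_square_statistic(observed, expected):
    """χ² = Σ (O_i − E_i)² / E_i over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [18, 22, 20, 25, 15]  # hypothetical observed frequencies
expected = [20, 20, 20, 20, 20]  # expected under equal proportions

chi2 = chi_square_statistic(observed, expected)
print(round(chi2, 2))  # (4 + 4 + 0 + 25 + 25) / 20
```

The computed value is then compared with the critical value of the chi-square distribution for the appropriate degrees of freedom.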
1. Point Estimate
A Point Estimate is a single value that serves as an
estimate of a population parameter.
Step 4. $MSB = \frac{SSB}{df_{between}} = \frac{24}{2} = 12$, where $df_{between} = k - 1$ ($k$ = number of groups)
Step 5. $MSW = \frac{SSW}{df_{within}} = \frac{12}{6} = 2$, where $df_{within} = N - k$
Step 6. F-statistic: $F = \frac{MSB}{MSW} = \frac{12}{2} = 6$
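Steps 4 to 6 can be sketched as a small function. The values $k = 3$ and $N = 9$ are not stated in the surviving steps; they are inferred here from $df_{between} = 2$ and $df_{within} = 6$, so treat them as an assumption. The name `anova_f` is illustrative.

```python
def anova_f(ssb, ssw, k, n_total):
    """Return (MSB, MSW, F) for a one-way ANOVA from the sums of squares."""
    msb = ssb / (k - 1)        # mean square between groups
    msw = ssw / (n_total - k)  # mean square within groups
    return msb, msw, msb / msw


# SSB = 24 and SSW = 12 from the steps above; k = 3, N = 9 inferred from the df
msb, msw, f_stat = anova_f(24, 12, 3, 9)
print(msb, msw, f_stat)
```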