
Module I - Descriptive and Inferential Statistics

The document provides an overview of descriptive and inferential statistics, detailing the types of statistics, measures of central tendency (mean, median, mode), and measures of dispersion (range, variance, standard deviation). It explains the concepts of population and sample, along with formulas for calculating various statistical measures. Additionally, it includes examples and practice problems to illustrate the application of these statistical concepts.

Uploaded by

umeshshinde3386

MODULE-I

DESCRIPTIVE AND
INFERENTIAL STATISTICS
DESCRIPTIVE STATISTICS
Statistics: It is a branch of mathematics dealing with the collection,
analysis, interpretation, and presentation of data.

Types of Statistics:
• Descriptive Statistics: It is used to summarize and describe the
characteristics of a dataset. It involves calculating summary measures such as
the mean, median, mode, range, standard deviation, and variance.
• Inferential Statistics: It is used to make inferences or predictions about a
population based on a sample of data. It involves using statistical tests, such
as hypothesis tests and regression analysis.
Population and Sample:
• A population is the entire group that you want
to draw conclusions about.

• A sample is the specific group that you will
collect data from.

Population is the entire pool from which a statistical sample is drawn.


The information obtained from the sample allows statisticians to develop
hypotheses about the larger population. Researchers gather information
from a sample because of the difficulty of studying the entire population.
Measures of Central Tendency
A Measure of Central Tendency is a value that represents a typical, or
central, entry of a data set. You can think of this value as where the middle
of distribution lies.
Three measures of central tendency are:
● Mean
● Median
● Mode

Mean: It is the sum of all the data entries divided by the number of entries.
Population Mean: $\mu = \dfrac{\sum x_i}{N}$ ; Sample Mean: $\bar{x} = \dfrac{\sum x_i}{n}$
where N represents the number of entries in the population
and n represents the number of entries in the sample.
For a grouped frequency distribution: Mean $\bar{x} = \dfrac{\sum f_i x_i}{\sum f_i}$
Median: It is the value that lies in the middle of the data when the data set
is ordered. The median measures the center of an ordered data set by
dividing it into two equal parts. If the data set has an odd number of
entries, the median is the middle data entry. If the data set has an even
number of entries, the median is the mean of the two middle data entries.

Mode: It is the most frequent number in the dataset. i.e. The mode of a
data set is the data entry that occurs with the greatest frequency. A data
set can have one mode, more than one mode, or no mode. If no entry is
repeated, the data set has no mode. If two entries occur with the same
greatest frequency, each entry is a mode and the data set is called
bimodal.
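The three measures can be checked with a short Python sketch. The data list is a made-up illustration (not taken from this module), and the helper name `modes` is hypothetical; it follows the convention above that a data set with no repeated entry has no mode.

```python
from collections import Counter
from statistics import mean, median


def modes(data):
    """Return every entry with the greatest frequency, or [] if nothing repeats."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:  # no entry is repeated, so the data set has no mode
        return []
    return sorted(value for value, count in counts.items() if count == top)


data = [3, 7, 7, 2, 9, 7, 3]      # illustrative data, n = 7 (odd)
print("mean   =", mean(data))      # sum of the entries / number of entries
print("median =", median(data))    # middle entry of the ordered data
print("mode   =", modes(data))

print(median([1, 2, 3, 4]))        # even n: mean of the two middle entries
print(modes([1, 1, 2, 2, 3]))      # two modes, so the data set is bimodal
```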
Formulae for Measures of Central Tendency
A. For ungrouped data
i) Mean: $\bar{x} = \dfrac{\sum x_i}{n}$

ii) Median: Arrange the data in ascending or descending order.
if n is odd,
Median = $\left(\dfrac{n+1}{2}\right)^{th}$ observation
if n is even,
Median = $\dfrac{\left(\dfrac{n}{2}\right)^{th} \text{observation} + \left(\dfrac{n}{2}+1\right)^{th} \text{observation}}{2}$
iii) Mode: The most repeated observation in the data
B. For discrete/ungrouped frequency distribution
$x_i$     $x_1$   $x_2$   ⋯   $x_n$
$f_i$     $f_1$   $f_2$   ⋯   $f_n$

i) Mean: $\bar{x} = \dfrac{\sum f_i x_i}{N}$ ; $N = \sum f_i$

ii) Median:
$x_i$     $f_i$     c.f. (Cumulative frequency)
$x_1$     $f_1$     $f_1$
$x_2$     $f_2$     $f_1 + f_2$
⋮         ⋮         ⋮
$x_n$     $f_n$     $f_1 + f_2 + \cdots + f_n = N$

Median = Value of $x$ for which the c.f. is just greater than $\dfrac{N}{2}$.

iii) Mode: Value of $x$ with the greatest frequency.


C. For continuous/grouped frequency distribution
Class              $a_1 - a_2$   $a_2 - a_3$   ⋯   $a_n - a_{n+1}$
$f_i$              $f_1$         $f_2$         ⋯   $f_n$
Mid Value $x_i$    $x_1 = \dfrac{a_1 + a_2}{2}$   $x_2 = \dfrac{a_2 + a_3}{2}$   ⋯   $x_n = \dfrac{a_n + a_{n+1}}{2}$

i) Mean: $\bar{x} = \dfrac{\sum f_i x_i}{N}$ ; $N = \sum f_i$

ii) Median: Median $= L + \dfrac{h}{f}\left(\dfrac{N}{2} - c.f.\right)$
where Median Class = class for which the c.f. is just greater than $\dfrac{N}{2}$
$L$ = lower limit of median class
$h$ = class width of median class
$f$ = frequency of median class
$c.f.$ = cumulative frequency of pre-median class

iii) Mode: Mode $= l + h\,\dfrac{f_m - f_1}{2f_m - f_1 - f_2}$
where Modal Class = class with the highest frequency in the distribution
$l$ = lower limit of modal class
$h$ = class width of modal class
$f_m$ = frequency of modal class
$f_1$ = frequency of pre-modal class
$f_2$ = frequency of post-modal class
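The grouped-data formulae can be sketched in Python as follows. The function name `grouped_stats` is hypothetical, the classes are assumed to have equal width, and a missing neighbour of the modal class is treated as frequency 0 (an assumption, not something stated in the notes). The frequencies are the jobs-completed table used in the moments example later in this module.

```python
def grouped_stats(lower_limits, width, freqs):
    """Mean, median and mode of a grouped distribution with equal class width."""
    n_total = sum(freqs)
    mids = [low + width / 2 for low in lower_limits]
    mean = sum(f * x for f, x in zip(freqs, mids)) / n_total

    # Median class: first class whose cumulative frequency exceeds N/2.
    cum = 0  # cumulative frequency of the classes before the current one
    for low, f in zip(lower_limits, freqs):
        if cum + f > n_total / 2:
            median = low + (width / f) * (n_total / 2 - cum)
            break
        cum += f

    # Modal class: class with the highest frequency.
    m = max(range(len(freqs)), key=freqs.__getitem__)
    f_m = freqs[m]
    f_1 = freqs[m - 1] if m > 0 else 0               # pre-modal frequency
    f_2 = freqs[m + 1] if m + 1 < len(freqs) else 0  # post-modal frequency
    mode = lower_limits[m] + width * (f_m - f_1) / (2 * f_m - f_1 - f_2)
    return mean, median, mode


# Classes 0-10, 10-20, 20-30, 30-40, 40-50
mean, median, mode = grouped_stats([0, 10, 20, 30, 40], 10, [6, 26, 47, 15, 6])
print(round(mean, 4), round(median, 4), round(mode, 4))
```

Here the median class and the modal class are both 20-30, with $L = l = 20$, $c.f. = 32$, $f_1 = 26$ and $f_2 = 15$.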
Measures of Dispersion
Measures of dispersion are used to determine the spread of data. It helps
to interpret the variability in data. "Dispersion is a statistical term used to
describe the extent to which data is scattered and how much it differs from
the mean value."
Measures of dispersion are:
● Range
● Variance
● Standard deviation

Range: It is the difference between largest and smallest values in the data
set.
𝑅𝑎𝑛𝑔𝑒 = 𝑥𝑚𝑎𝑥 − 𝑥𝑚𝑖𝑛
Standard deviation: It is a statistical measure that quantifies the amount of
variation or dispersion in a set of data values. It tells us how much the
values in a dataset deviate, on average, from the mean (average) of the
dataset. If the standard deviation is low, the data points are close to the
mean, indicating less variability. If the standard deviation is high, the data
points are spread out over a larger range of values, indicating greater
variability.
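for ungrouped data: $\sigma = +\sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n}}$ or $\sigma = +\sqrt{\dfrac{\sum x_i^2}{n} - \bar{x}^2}$ ; $\bar{x} = \dfrac{\sum x_i}{n}$
for discrete or continuous frequency distribution:
$\sigma = +\sqrt{\dfrac{\sum f_i (x_i - \bar{x})^2}{N}}$ or $\sigma = +\sqrt{\dfrac{\sum f_i x_i^2}{N} - \bar{x}^2}$ ; $N = \sum f_i$ and $\bar{x} = \dfrac{\sum f_i x_i}{N}$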
Variance: It is the square of standard deviation.
Variance = 𝜎 2
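As a quick check that the two ungrouped-data formulae for $\sigma$ agree, here is a minimal Python sketch. The data list is a made-up illustration, not taken from this module; `statistics.pstdev` is the standard-library population standard deviation (divisor $n$), which matches the definition used here.

```python
from math import sqrt
from statistics import pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative data
n = len(data)
x_bar = sum(data) / n

# Definition: root of the mean squared deviation from the mean.
sd_deviation = sqrt(sum((x - x_bar) ** 2 for x in data) / n)

# Computational form: root of (mean of squares - square of mean).
sd_shortcut = sqrt(sum(x * x for x in data) / n - x_bar ** 2)

variance = sd_deviation ** 2      # variance is the square of the S.D.
print(sd_deviation, sd_shortcut, pstdev(data), variance)
```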
Remark:
1. Empirical relation: Mean − Mode = 3 (Mean − Median)
2. Short cut method:
A) Arithmetic Mean:
i) for ungrouped data
put $d = x_i - A$ ; $A$ = Assumed Mean
⇒ $\bar{d} = \bar{x} - A$
⇒ $\bar{x} = \bar{d} + A$

ii) for grouped data
put $u = \dfrac{x_i - A}{h}$
⇒ $\bar{x} = A + h\bar{u}$
B) Standard Deviation:
i) for ungrouped data
put $d = x_i \pm A$ ; $A$ = Assumed Mean
⇒ S.D.$(x_i)$ = S.D.$(d)$
& S.D.$(kx)$ = $|k| \cdot$ S.D.$(x)$
ii) for grouped data
put $u = \dfrac{x_i - A}{h} = \dfrac{d}{h}$
⇒ $d = uh$
⇒ S.D.$(d)$ = S.D.$(uh)$
$= \sqrt{\dfrac{\sum f h^2 u^2}{N} - (h\bar{u})^2}$
$= h \cdot$ S.D.$(u)$
Coefficient of Variation (C.V.)
It is used to measure the consistency (or variability) of the data.

\[ \text{C.V.} = \dfrac{\text{S.D.}}{\text{A.M.}} \times 100 = \dfrac{\sigma}{\bar{x}} \times 100 \]
The ratio of the standard deviation to the mean, i.e. $\sigma / \bar{x}$, is known as the
coefficient of variation. As this is a ratio with no dimension, it is used to
compare the variation between two groups with different means.
Ex. If A & B are two sets of data with C.V.(A) < C.V.(B), then set A is
more consistent than set B, or equivalently set B shows more variability than A.
Examples
1) The scores of two cricketers Virat and Rohit for 6 matches each are
given below, find who is more consistent and whose performance is better.
Virat 58 59 60 54 65 66
Rohit 84 56 92 65 86 44
Solution:
Virat 𝑥𝑖 Rohit 𝑦𝑖 𝑥𝑖2 𝑦𝑖2
58 84 3364 7056
59 56 3481 3136
60 92 3600 8464
54 65 2916 4225
65 86 4225 7396
66 44 4356 1936
Total Σ 362 427 21942 32213
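\[ \text{C.V.} = \dfrac{\text{S.D.}}{\text{A.M.}} \times 100 = \dfrac{\sigma}{\bar{x}} \times 100 ; \quad \bar{x} = \dfrac{\sum x_i}{n}, \quad \sigma_x = +\sqrt{\dfrac{\sum x_i^2}{n} - \bar{x}^2} \]

For Virat, $\bar{x} = 60.33$, $\sigma = 4.11$, C.V. $= \dfrac{4.11}{60.33} \times 100 = 6.812$

For Rohit, $\bar{y} = 71.17$, $\sigma = 17.44$, C.V. $= \dfrac{17.44}{71.17} \times 100 = 24.505$

∵ C.V.(Virat) < C.V.(Rohit)
∴ Player Virat is more consistent than player Rohit.
And ∵ $\bar{x}$ (Virat) < $\bar{y}$ (Rohit)
∴ Performance of Rohit is better than Virat.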
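The comparison can be reproduced with a short Python sketch. The function name `coefficient_of_variation` is hypothetical; it uses the population form of the standard deviation (divisor $n$), as in the solution above. Because it works with unrounded intermediate values, the last decimals may differ slightly from the hand calculation.

```python
from math import sqrt


def coefficient_of_variation(scores):
    """Return (mean, S.D., C.V. in percent) using the divisor-n standard deviation."""
    n = len(scores)
    mean = sum(scores) / n
    sd = sqrt(sum(s * s for s in scores) / n - mean ** 2)
    return mean, sd, 100 * sd / mean


virat = [58, 59, 60, 54, 65, 66]
rohit = [84, 56, 92, 65, 86, 44]

mean_v, sd_v, cv_v = coefficient_of_variation(virat)
mean_r, sd_r, cv_r = coefficient_of_variation(rohit)

# Smaller C.V. means more consistent; larger mean means better performance.
more_consistent = "Virat" if cv_v < cv_r else "Rohit"
better_average = "Virat" if mean_v > mean_r else "Rohit"
print(round(cv_v, 2), round(cv_r, 2), more_consistent, better_average)
```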


Practice Problems:
1) The scores of two golfers A and B for 10 rounds each given below. Find
mean and standard deviation for both player A & B and hence find who
is more consistent.
A 66 59 58 54 65 60 52 75 69 52
B 78 56 84 65 86 92 44 54 78 68
2) Following table gives the marks obtained in a paper of Mathematics out
of 50, by the students of two divisions. Find out which division shows
greater variability.
C.I. 0-5 5-10 10-15 15-20 20-25 25-30 30-35 35-40 40-45 45-50
Div A 2 6 8 8 15 18 12 11 9 4
Div B 3 5 7 9 12 16 11 5 6 2
Moments :
Central Moments: Moments about the arithmetic mean of a distribution
are called central moments. The $r^{th}$ moment about the mean of a distribution is
given by $\mu_r = \dfrac{1}{N} \sum f_i (x_i - \bar{x})^r$ ; $N = \sum f_i$, $\bar{x}$ = A.M.
In particular, $\mu_0 = \dfrac{1}{N} \sum f_i (x_i - \bar{x})^0 = \dfrac{1}{N} \sum f_i = 1$.
$\mu_1 = \dfrac{1}{N} \sum f_i (x_i - \bar{x}) = 0$ (First moment about the mean of a distribution)
$\mu_2 = \dfrac{1}{N} \sum f_i (x_i - \bar{x})^2 = \sigma^2$ (Variance; second central moment)
$\mu_3 = \dfrac{1}{N} \sum f_i (x_i - \bar{x})^3$ (Third central moment), and so on.
Raw Moments: Moments about an assumed mean of the distribution are
called raw moments. The $r^{th}$ moment about any number A (assumed
mean) is denoted by $\mu_r'$ and is given by
\[ \mu_r' = \dfrac{1}{N} \sum f_i (x_i - A)^r = \dfrac{1}{N} \sum f_i d_i^r ; \quad N = \sum f_i , \quad d_i = x_i - A \]
In particular, $\mu_0' = 1$,
$\mu_1' = \dfrac{1}{N} \sum f_i (x_i - A) = \bar{x} - A$ (First raw moment)
$\mu_2' = \dfrac{1}{N} \sum f_i (x_i - A)^2$ (Second raw moment)
$\mu_3' = \dfrac{1}{N} \sum f_i (x_i - A)^3$ (Third raw moment), and so on.
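Relation between Central Moments $\mu_r$ and Raw Moments $\mu_r'$:
$\mu_0 = 1$
$\mu_1 = 0$
$\mu_2 = \mu_2' - (\mu_1')^2$
$\mu_3 = \mu_3' - 3 \mu_2' \mu_1' + 2 (\mu_1')^3$
$\mu_4 = \mu_4' - 4 \mu_3' \mu_1' + 6 \mu_2' (\mu_1')^2 - 3 (\mu_1')^4$

Remark: for fast calculations, put $u = \dfrac{x_i - A}{h}$, so that
\[ \mu_r' = \dfrac{1}{N} \sum f_i (x_i - A)^r = \dfrac{1}{N} \sum f_i (uh)^r \]
\[ \mu_r' = \dfrac{h^r}{N} \sum f_i u^r = \dfrac{h^r \sum f_i u^r}{\sum f_i} ; \quad r = 1, 2, 3, \cdots \]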
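The conversion from raw to central moments is mechanical, so it is easy to sketch in Python. The function name `raw_to_central` is hypothetical; the sample call uses the raw moments 2, 20, 40 and 50 from the first worked example of the Skewness and Kurtosis part.

```python
def raw_to_central(m1, m2, m3, m4):
    """Convert the first four raw moments (about A) to central moments."""
    mu1 = 0  # the first central moment is always zero
    mu2 = m2 - m1 ** 2
    mu3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
    mu4 = m4 - 4 * m3 * m1 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4
    return mu1, mu2, mu3, mu4


central = raw_to_central(2, 20, 40, 50)
print(central)
```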
Skewness :
Skewness means 'lack of symmetry’. Skewness is studied to have an idea
about the shape of the curve. A distribution is said to be skewed if
a) Mean, median and mode fall at different points.
i.e. Mean ≠ Median ≠ Mode
b) The curve drawn with the help of the given data is not symmetrical but
stretched more to one side than to the other.
Skewness can be positive as well as negative. The skewness is positive if
the larger tail of the distribution lies towards the higher values of the
variate (the right), i.e., if the curve drawn with the help of the given data
is stretched more to the right than to the left, and is negative in the
contrary case.
𝑀 = 𝑀𝑒𝑎𝑛,
𝑀𝑑 = 𝑀𝑒𝑑𝑖𝑎𝑛,
𝑀𝑜 = 𝑀𝑜𝑑𝑒

Measure of Skewness:
1. Skewness $= \dfrac{3\,(\text{Mean} - \text{Median})}{\text{S.D.}}$
2. Coefficient of skewness $\beta_1 = \dfrac{\mu_3^2}{\mu_2^3}$

Note: For a symmetrical distribution $\beta_1 = 0$.


Kurtosis :
'Convexity of curve' or ‘Kurtosis’ enables us to have an idea about the
‘flatness’ or ‘peakedness’ of the curve. It is measured by the coefficient
$\beta_2$ and is given by $\beta_2 = \dfrac{\mu_4}{\mu_2^2}$
Examples
1) The first four moments of a distribution about the value 5 are 2, 20, 40
& 50. Obtain first four central moments, mean, S.D. and coefficient of
skewness & kurtosis.
Solution: Given assumed mean A = 5, and the raw moments (moments about
the value A = 5) are $\mu_1' = 2$, $\mu_2' = 20$, $\mu_3' = 40$, and $\mu_4' = 50$.
The central moments are given by
$\mu_1 = 0$
$\mu_2 = \mu_2' - (\mu_1')^2 = 20 - (2)^2 = 16$
$\mu_3 = \mu_3' - 3 \mu_2' \mu_1' + 2 (\mu_1')^3 = 40 - 3(20)(2) + 2(2)^3 = -64$
$\mu_4 = \mu_4' - 4 \mu_3' \mu_1' + 6 \mu_2' (\mu_1')^2 - 3 (\mu_1')^4 = 50 - 4(40)(2) + 6(20)(2)^2 - 3(2)^4 = 162$
Thus, the first four central moments are
$\mu_1 = 0$, $\mu_2 = 16$, $\mu_3 = -64$, $\mu_4 = 162$
Now, to find the arithmetic mean $\bar{x}$ we have:
$\bar{x} = \mu_1' + A = 2 + 5 = 7$

To find S.D.: $\sigma = +\sqrt{\mu_2} = \sqrt{16} = 4$

Coefficient of skewness: $\beta_1 = \dfrac{\mu_3^2}{\mu_2^3} = \dfrac{(-64)^2}{(16)^3} = 1$

Coefficient of kurtosis: $\beta_2 = \dfrac{\mu_4}{\mu_2^2} = \dfrac{162}{(16)^2} = 0.63$

As $\beta_2 = 0.63 < 3$, the nature of the curve is platykurtic.


2) Calculate first four central moments for the following frequency
distribution. Also find coefficient of skewness & kurtosis.
No. of jobs completed (Class)    0-10   10-20   20-30   30-40   40-50
No. of workers (f)               6      26      47      15      6

Solution: The first four raw moments are given by
\[ \because \mu_r' = \dfrac{h^r}{N} \sum f_i u^r = \dfrac{h^r \sum f_i u^r}{\sum f_i} \]
\[ \therefore \mu_1' = \dfrac{h}{N} \sum f_i u , \quad \mu_2' = \dfrac{h^2}{N} \sum f_i u^2 , \quad \mu_3' = \dfrac{h^3}{N} \sum f_i u^3 , \quad \mu_4' = \dfrac{h^4}{N} \sum f_i u^4 \]
Table (with $u_i = \dfrac{x_i - 25}{h}$, $h = 10$):
Class    Freq. $f_i$   $x_i$      $u_i$   $f_i u_i$   $f_i u_i^2$   $f_i u_i^3$   $f_i u_i^4$
0-10     6             5          −2      −12         24            −48           96
10-20    26            15         −1      −26         26            −26           26
20-30    47            25 = A     0       0           0             0             0
30-40    15            35         1       15          15            15            15
40-50    6             45         2       12          24            48            96
Total    100                              −11         89            −11           233

\[ \therefore \mu_1' = \dfrac{10}{100}(-11) = -1.1 , \quad \mu_2' = \dfrac{10^2}{100}(89) = 89 , \]
\[ \mu_3' = \dfrac{10^3}{100}(-11) = -110 , \quad \mu_4' = \dfrac{10^4}{100}(233) = 23300 \]
Thus, the central moments are given by
$\mu_1 = 0$
$\mu_2 = \mu_2' - (\mu_1')^2 = 89 - (-1.1)^2 = 87.79$
$\mu_3 = \mu_3' - 3 \mu_2' \mu_1' + 2 (\mu_1')^3 = -110 - 3(89)(-1.1) + 2(-1.1)^3 = 181.038$
$\mu_4 = \mu_4' - 4 \mu_3' \mu_1' + 6 \mu_2' (\mu_1')^2 - 3 (\mu_1')^4$
$= 23300 - 4(-110)(-1.1) + 6(89)(-1.1)^2 - 3(-1.1)^4 = 23457.7477$

Coefficient of skewness: $\beta_1 = \dfrac{\mu_3^2}{\mu_2^3} = \dfrac{(181.038)^2}{(87.79)^3} = 0.0484$

Coefficient of kurtosis: $\beta_2 = \dfrac{\mu_4}{\mu_2^2} = \dfrac{23457.7477}{(87.79)^2} = 3.0436$

As $\beta_2 = 3.0436 > 3$, the nature of the curve is leptokurtic.
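As a cross-check on the short-cut method, the central moments can also be computed directly from the definition $\mu_r = \frac{1}{N}\sum f_i (x_i - \bar{x})^r$. The sketch below is only an illustration; the function name `central_moments` is hypothetical, and class mid-values are used for $x_i$.

```python
def central_moments(mids, freqs, max_order=4):
    """Return (mean, [mu_1, ..., mu_max_order]) for a frequency distribution."""
    n_total = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, mids)) / n_total
    moments = [
        sum(f * (x - mean) ** r for f, x in zip(freqs, mids)) / n_total
        for r in range(1, max_order + 1)
    ]
    return mean, moments


mids = [5, 15, 25, 35, 45]      # mid-values of 0-10, ..., 40-50
freqs = [6, 26, 47, 15, 6]      # number of workers in each class

mean, (mu1, mu2, mu3, mu4) = central_moments(mids, freqs)
beta1 = mu3 ** 2 / mu2 ** 3     # coefficient of skewness
beta2 = mu4 / mu2 ** 2          # coefficient of kurtosis
print(round(mu2, 4), round(mu3, 4), round(mu4, 4), round(beta1, 4), round(beta2, 4))
```

The direct computation agrees with the values obtained through the raw moments, which is a useful sanity check on the table arithmetic.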


Practice Problems:
1) For the following distribution, find first four moments about the mean,
arithmetic mean, S.D. and coefficient of skewness & kurtosis.

x 2 2.5 3 3.5 4 4.5 5


f 5 38 65 92 70 40 10

2) Calculate the first four moments about the mean for the following
distribution. Also find 𝛽1 and 𝛽2 .
x 1 2 3 4 5 6 7 8 9 10
f 6 15 23 42 62 60 40 24 13 5
Correlation :
Bivariate distribution: A distribution involving two variables is known as a
bivariate distribution. For example, if we measure the heights and weights
of a certain group of persons, we shall get what is known as a Bivariate
Distribution - one variable relating to height and the other relating to
weight.
Correlation: In a bivariate distribution, if a change in one variable affects a
change in the other variable, the variables are said to be correlated. If the
two variables deviate in the same direction, i.e., if an increase (or
decrease) in one results in a corresponding increase (or decrease) in the
other, the correlation is said to be direct or positive. But if they constantly
deviate in opposite directions, i.e., if an increase (or decrease) in one
results in a corresponding decrease (or increase) in the other, the correlation is
said to be diverse or negative.
For example, the correlation between the income and expenditure is
positive and the correlation between price and demand of a commodity, is
negative.
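Karl Pearson’s Coefficient of Correlation: The correlation coefficient between
two random variables $x$ and $y$, usually denoted by $r(x, y)$ or simply $r_{xy}$, is a
numerical measure of the linear relationship between them and is defined as
\[ r(x, y) = \dfrac{\mathrm{COV}(x, y)}{\sigma_x \sigma_y} \]
∵ COV$(x, y)$ = Covariance$(x, y)$ $= \dfrac{1}{n} \sum (x_i - \bar{x})(y_i - \bar{y}) = \dfrac{1}{n} \sum x_i y_i - \bar{x}\,\bar{y}$
Thus
\[ r(x, y) = \dfrac{\dfrac{1}{n} \sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\dfrac{1}{n} \sum (x_i - \bar{x})^2}\,\sqrt{\dfrac{1}{n} \sum (y_i - \bar{y})^2}} = \dfrac{\dfrac{1}{n} \sum x_i y_i - \bar{x}\,\bar{y}}{\sqrt{\dfrac{1}{n} \sum x_i^2 - \bar{x}^2}\,\sqrt{\dfrac{1}{n} \sum y_i^2 - \bar{y}^2}} \]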
Remark: Correlation Coefficient cannot exceed unity numerically. It always
lies between −1 and + 1, i. e. −1 ≤ 𝑟(𝑥, 𝑦) ≤ 1. If 𝑟 = 1, the correlation is
perfect positive and if 𝑟 = −1, correlation is perfect negative.
Note: The correlation coefficient is independent of change of origin and scale:
If $u = \dfrac{x - A}{h}$ and $v = \dfrac{y - B}{k}$, so that $x = A + hu$ and $y = B + kv$, then
$r(x, y) = r(u, v)$
Thus, for ungrouped data:
\[ r(x, y) = r(u, v) = \dfrac{n \sum uv - \sum u \sum v}{\sqrt{n \sum u^2 - (\sum u)^2}\,\sqrt{n \sum v^2 - (\sum v)^2}} \]
for ungrouped/grouped frequency distribution:
\[ r(x, y) = r(u, v) = \dfrac{N \sum fuv - \sum fu \sum fv}{\sqrt{N \sum fu^2 - (\sum fu)^2}\,\sqrt{N \sum fv^2 - (\sum fv)^2}} \]
Examples
1) Compute correlation coefficient between supply and price of
commodity using following data.
Supply 152 158 169 182 160 166 182
Price 198 178 167 152 180 170 162

Solution:
Correlation coefficient is given by
\[ r(x, y) = r(u, v) = \dfrac{n \sum uv - \sum u \sum v}{\sqrt{n \sum u^2 - (\sum u)^2}\,\sqrt{n \sum v^2 - (\sum v)^2}} \]
Here n = 7.
Supply ($x$)   Price ($y$)   $u = x - 166$   $v = y - 170$   $u^2$   $v^2$   $uv$
152            198           −14             28              196     784     −392
158            178           −8              8               64      64      −64
169            167           3               −3              9       9       −9
182            152           16              −18             256     324     −288
160            180           −6              10              36      100     −60
166            170           0               0               0       0       0
182            162           16              −8              256     64      −128
Total Σ                      7               17              817     1345    −941

Substituting the totals,
\[ r(x, y) = r(u, v) = \dfrac{7(-941) - (7)(17)}{\sqrt{7(817) - (7)^2}\,\sqrt{7(1345) - (17)^2}} \]
$r(x, y) = -0.9322$
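The same calculation can be sketched in Python. The function name `pearson_r` and its optional shift parameters `a` and `b` are hypothetical; they are included only to illustrate that changing the origin does not change $r$.

```python
from math import sqrt


def pearson_r(xs, ys, a=0, b=0):
    """Karl Pearson's r computed from deviations u = x - a and v = y - b."""
    n = len(xs)
    u = [x - a for x in xs]
    v = [y - b for y in ys]
    su, sv = sum(u), sum(v)
    suv = sum(p * q for p, q in zip(u, v))
    suu = sum(p * p for p in u)
    svv = sum(q * q for q in v)
    return (n * suv - su * sv) / sqrt((n * suu - su ** 2) * (n * svv - sv ** 2))


supply = [152, 158, 169, 182, 160, 166, 182]
price = [198, 178, 167, 152, 180, 170, 162]

r_shifted = pearson_r(supply, price, a=166, b=170)   # as in the worked table
r_raw = pearson_r(supply, price)                     # no change of origin
print(round(r_shifted, 4), round(r_raw, 4))
```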
2) Calculate the coefficient of correlation if $n = 6$, $\sum (x - 18.5) = -3$,
$\sum (y - 50) = 20$, $\sum (x - 18.5)(y - 50) = -120$, $\sum (x - 18.5)^2 = 19$
& $\sum (y - 50)^2 = 850$.
Solution: Suppose $u$ and $v$ are deviations of the $x$ and $y$ values from 18.5 and
50, respectively, i.e. $u = x - 18.5$, $v = y - 50$.
Given: $n = 6$, $\sum u = -3$, $\sum v = 20$, $\sum uv = -120$, $\sum u^2 = 19$, $\sum v^2 = 850$
Correlation coefficient is given by
\[ r(x, y) = r(u, v) = \dfrac{n \sum uv - \sum u \sum v}{\sqrt{n \sum u^2 - (\sum u)^2}\,\sqrt{n \sum v^2 - (\sum v)^2}} \]
\[ r(x, y) = r(u, v) = \dfrac{6(-120) - (-3)(20)}{\sqrt{6(19) - (-3)^2}\,\sqrt{6(850) - (20)^2}} \]
$r(x, y) = -0.9395$
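3) Given: $r = 0.9$, $\sum xy = 60$, $\sigma_x = 3$, $\sum y^2 = 100$. Find the number of items
if $x$ and $y$ are deviations of $u$ and $v$ from their arithmetic means.
Solution: Given: $r = 0.9$, $\sum xy = 60$, $\sigma_x = 3$, $\sum y^2 = 100$.
Also, $x = u - \bar{u}$ and $y = v - \bar{v}$.
Correlation coefficient is given by
\[ r(u, v) = \dfrac{\mathrm{COV}(u, v)}{\sigma_u \sigma_v} = \dfrac{\sum (u - \bar{u})(v - \bar{v})}{n\,\sigma_u \sigma_v} = \dfrac{\sum xy}{n\,\sigma_x \sigma_y} \]
Now, $\sigma_y = \sigma_v = \sqrt{\dfrac{\sum (v - \bar{v})^2}{n}} = \sqrt{\dfrac{\sum y^2}{n}}$ ⇒ $\sigma_y^2 = \dfrac{\sum y^2}{n} = \dfrac{100}{n}$
\[ \therefore r^2 = \dfrac{(\sum xy)^2}{n^2 \sigma_x^2 \sigma_y^2} \Rightarrow (0.9)^2 = \dfrac{(60)^2}{n^2 (3)^2 \dfrac{100}{n}} \Rightarrow n = 4.938 \approx 5 \]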
Practice Problems:
1) Calculate correlation coefficient for following distribution.
x 5 9 18 19 24 28 32
y 7 9 14 21 23 29 30
f 6 9 13 20 16 11 7

2) Two examiners A & B award marks to seven students. Obtain the
correlation coefficient and comment on its value.
Roll No. 1 2 3 4 5 6 7
Marks by A 40 44 28 30 44 38 31
Marks by B 32 39 26 30 38 34 28
Regression :
Regression analysis is a mathematical measure of the average
relationship between two or more variables in terms of the original units
of the data.
In regression analysis there are two types of variables. The variable
whose value is influenced or is to be predicted is called the dependent variable,
and the variable which influences the values or is used for prediction is
called the independent variable. In regression analysis the independent variable is
also known as the regressor, predictor or explanatory variable, while the
dependent variable is also known as the regressed or explained variable.

Lines of Regression: The line of regression is the line which gives the best
estimate to the value of one variable for any specific value of the other
variable.
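Regression line of y on x :
Consider the set of values $(x_i, y_i)$ ; $i = 1, 2, \ldots, n$ and let the line of
regression of y on x be $y = mx + c$. Then it is given by
\[ (y_i - \bar{y}) = r \dfrac{\sigma_y}{\sigma_x} (x_i - \bar{x}) \]
or
\[ (y_i - \bar{y}) = b_{yx} (x_i - \bar{x}) \]
or
\[ (y_i - \bar{y}) = \dfrac{\mathrm{COV}(x, y)}{\sigma_x^2} (x_i - \bar{x}) \]
Note: 1) $r \dfrac{\sigma_y}{\sigma_x} = b_{yx}$ is called the regression coefficient of y on x.
2) $m = b_{yx} = r \dfrac{\sigma_y}{\sigma_x} = \dfrac{\mathrm{COV}(x, y)}{\sigma_x^2}$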
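Regression line of x on y :
Consider the set of values $(x_i, y_i)$ ; $i = 1, 2, \ldots, n$ and let the line of
regression of x on y be $x = ny + c$ (here the coefficient $n$ denotes the slope,
not the number of observations). Then it is given by
\[ (x_i - \bar{x}) = r \dfrac{\sigma_x}{\sigma_y} (y_i - \bar{y}) \]
or
\[ (x_i - \bar{x}) = b_{xy} (y_i - \bar{y}) \]
or
\[ (x_i - \bar{x}) = \dfrac{\mathrm{COV}(x, y)}{\sigma_y^2} (y_i - \bar{y}) \]
Note: 1) $r \dfrac{\sigma_x}{\sigma_y} = b_{xy}$ is called the regression coefficient of x on y.
2) slope $= b_{xy} = r \dfrac{\sigma_x}{\sigma_y} = \dfrac{\mathrm{COV}(x, y)}{\sigma_y^2}$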
Examples
1) Obtain regression lines for the following data. Hence find 𝑦 5 & 𝑥 9 .
𝑥 6 2 10 4 8
y 9 11 5 8 7
Solution: Here n = 5
𝑥𝑖 𝑦𝑖 𝑥𝑖2 𝑦𝑖2 𝑥𝑖 𝑦𝑖
6 9 36 81 54
2 11 4 121 22
10 5 100 25 50
4 8 16 64 32
8 7 64 49 56
Total Σ 30 40 220 340 214
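\[ \bar{x} = \dfrac{\sum x_i}{n} = \dfrac{30}{5} = 6 ; \quad \bar{y} = \dfrac{\sum y_i}{n} = \dfrac{40}{5} = 8 \]
\[ \sigma_x = \sqrt{\dfrac{\sum x_i^2}{n} - \left(\dfrac{\sum x_i}{n}\right)^2} = \sqrt{\dfrac{220}{5} - 36} = 2.83 \]
\[ \sigma_y = \sqrt{\dfrac{\sum y_i^2}{n} - \left(\dfrac{\sum y_i}{n}\right)^2} = \sqrt{\dfrac{340}{5} - 64} = 2 \]
\[ r = \dfrac{\sum x_i y_i - n \bar{x} \bar{y}}{n \sigma_x \sigma_y} = \dfrac{214 - 5(6)(8)}{5(2.83)(2)} = -0.9187 \]
1) Regression line of y on x is
\[ y_i - \bar{y} = r \dfrac{\sigma_y}{\sigma_x} (x_i - \bar{x}) \Longrightarrow y - 8 = -0.9187 \left(\dfrac{2}{2.83}\right)(x - 6) \]
∴ $y = -0.6493\,x + 11.8958$
when $x = 5$ ⟹ $y = 8.6493$
2) Regression line of x on y is
\[ x_i - \bar{x} = r \dfrac{\sigma_x}{\sigma_y} (y_i - \bar{y}) \Longrightarrow x - 6 = -0.9187 \left(\dfrac{2.83}{2}\right)(y - 8) \]
∴ $x = -1.30\,y + 16.40$
when $y = 9$ ⟹ $x = 4.70$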
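Both regression lines can be obtained directly from the covariance form $b_{yx} = \mathrm{COV}(x, y)/\sigma_x^2$ and $b_{xy} = \mathrm{COV}(x, y)/\sigma_y^2$. The following Python sketch uses a hypothetical function name `regression_coefficients`; since it does not round $\sigma_x$ or $r$ along the way, its results can differ from the hand calculation in the later decimal places.

```python
def regression_coefficients(xs, ys):
    """Return (x_bar, y_bar, b_yx, b_xy) using divisor-n covariance and variances."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar
    var_x = sum(x * x for x in xs) / n - x_bar ** 2
    var_y = sum(y * y for y in ys) / n - y_bar ** 2
    return x_bar, y_bar, cov / var_x, cov / var_y


xs = [6, 2, 10, 4, 8]
ys = [9, 11, 5, 8, 7]
x_bar, y_bar, b_yx, b_xy = regression_coefficients(xs, ys)

y_at_5 = y_bar + b_yx * (5 - x_bar)   # line of y on x, evaluated at x = 5
x_at_9 = x_bar + b_xy * (9 - y_bar)   # line of x on y, evaluated at y = 9
print(round(b_yx, 4), round(b_xy, 4), round(y_at_5, 4), round(x_at_9, 4))
```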
2) If the two lines of regression are 9𝑥 + y − λ = 0 & 4𝑥 + 𝑦 = 𝜇. The
mean of 𝑥 and y are 2 and −3 respectively. Find the values of λ, 𝜇 and
coefficient of correlation between 𝑥 & 𝑦.
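Solution: The point $(\bar{x}, \bar{y}) = (2, -3)$ is the point of intersection of the two regression
lines. Thus $\bar{x} = 2$, $\bar{y} = -3$ satisfies both equations.
$9x + y = \lambda$ ⟹ $9(2) + (-3) = 15$ ⟹ $\lambda = 15$
$4x + y = \mu$ ⟹ $4(2) + (-3) = 5$ ⟹ $\mu = 5$
Now, take the regression line of x on y to be $9x + y = 15$ and the regression line of
y on x to be $4x + y = 5$. Thus $b_{xy} = -\dfrac{1}{9}$ and $b_{yx} = -4$.
(The opposite assignment would give $b_{xy}\,b_{yx} = \dfrac{9}{4} > 1$, which is impossible since $r^2 \le 1$.)
\[ \therefore r = -\sqrt{b_{xy}\,b_{yx}} = -\sqrt{\dfrac{4}{9}} = -\dfrac{2}{3} = -0.6667 \]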
Practice Problems:
1) The regression equations are 8𝑥 − 10𝑦 + 66 = 0 & 40𝑥 − 18𝑦 − 214 = 0.
The value of variance of 𝑥 is 9. Find:
A) The mean values of 𝑥 and 𝑦.
B) The correlation between 𝑥 and 𝑦.
C) The standard deviation of 𝑦.
2) Obtain regression lines to the following data.
Year of Experience 1.1 1.5 2 2.9 3 3.7 4 4.5
Salary 39343 37731 43525 56642 60150 57189 56957 61111
Curve Fitting by Least Squares Criteria :
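Polynomial Regression: Let $y = a_0 + a_1 x + a_2 x^2 + \cdots + a_m x^m$ represent a
polynomial of degree $m$.
For a given set of $n$ pairs of observations $(x_i, y_i)$, the unknowns $a_0, a_1, \cdots, a_m$
are estimated by the least squares method, i.e. by minimizing
\[ S = \sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} \left[ y_i - \left( a_0 + a_1 x_i + a_2 x_i^2 + \cdots + a_m x_i^m \right) \right]^2 \]
This results in the following $m + 1$ normal equations:
\[ \dfrac{\partial S}{\partial a_0} = 0 \Rightarrow -2 \sum_{i=1}^{n} \left[ y_i - \left( a_0 + a_1 x_i + a_2 x_i^2 + \cdots + a_m x_i^m \right) \right] = 0 \]
\[ \Rightarrow \sum y_i = n a_0 + a_1 \sum x_i + a_2 \sum x_i^2 + \cdots + a_m \sum x_i^m \qquad (1) \]
\[ \dfrac{\partial S}{\partial a_1} = 0 \Rightarrow -2 \sum_{i=1}^{n} x_i \left[ y_i - \left( a_0 + a_1 x_i + a_2 x_i^2 + \cdots + a_m x_i^m \right) \right] = 0 \]
\[ \Rightarrow \sum x_i y_i = a_0 \sum x_i + a_1 \sum x_i^2 + a_2 \sum x_i^3 + \cdots + a_m \sum x_i^{m+1} \qquad (2) \]
⋮
\[ \dfrac{\partial S}{\partial a_m} = 0 \Rightarrow -2 \sum_{i=1}^{n} x_i^m \left[ y_i - \left( a_0 + a_1 x_i + a_2 x_i^2 + \cdots + a_m x_i^m \right) \right] = 0 \]
\[ \Rightarrow \sum x_i^m y_i = a_0 \sum x_i^m + a_1 \sum x_i^{m+1} + a_2 \sum x_i^{m+2} + \cdots + a_m \sum x_i^{2m} \qquad (m + 1) \]
Solving these $m + 1$ normal equations simultaneously, we get the values of
$a_0, a_1, \cdots, a_m$ and hence the best fitting curve.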
Fitting of Straight Line: To fit a straight line $y = a + bx$, where $a, b$ are
constants to be determined, to the points $(x_i, y_i)$ ; $i = 1, 2, \cdots, n$ by the least
squares criterion, the normal equations are
\[ \sum y_i = na + b \sum x_i \]
\[ \sum x_i y_i = a \sum x_i + b \sum x_i^2 \]
Solving these two equations simultaneously we can find $a$ & $b$, and hence the
best fitting line.
Fitting of Parabola (Quadratic Curve): To fit a parabola $y = a + bx + cx^2$,
where $a, b, c$ are constants to be determined, to the points $(x_i, y_i)$ ; $i = 1, 2, \cdots, n$
by the least squares criterion, the normal equations are
\[ \sum y_i = na + b \sum x_i + c \sum x_i^2 \]
\[ \sum x_i y_i = a \sum x_i + b \sum x_i^2 + c \sum x_i^3 \]
\[ \sum x_i^2 y_i = a \sum x_i^2 + b \sum x_i^3 + c \sum x_i^4 \]
Solving these equations simultaneously we can find $a$, $b$ & $c$.
Fitting of Exponential Curve: To fit an exponential curve $y = a b^x$, where
$a, b$ are constants to be determined, to the points $(x_i, y_i)$ ; $i = 1, 2, \cdots, n$ by the least
squares criterion:
Consider the exponential curve $y = a b^x$
⇒ $\ln y = \ln a + x \ln b$
Let $Y = \ln y$, $X = x$, $A = \ln a$, $B = \ln b$
⇒ $Y = A + BX$
which is the equation of a straight line with normal equations
\[ \sum Y_i = nA + B \sum X_i \]
\[ \sum X_i Y_i = A \sum X_i + B \sum X_i^2 \]
Solving these two equations simultaneously we can find $A$ & $B$, and hence
$a$ & $b$, which gives the required equation of the exponential curve.
Examples
1) Fit a straight line to the following data by least squares method. Hence
find 𝑦 22 .
𝑥 0 2 4 6 8 12 20
y 10 12 18 22 20 30 30

Solution: To fit the straight line $y = a + bx$ to the n = 7 points, the normal
equations are
\[ \sum y_i = na + b \sum x_i \]
\[ \sum x_i y_i = a \sum x_i + b \sum x_i^2 \]


𝑥𝑖 𝑦𝑖 𝑥𝑖2 𝑥𝑖 𝑦𝑖
0 10 0 0
2 12 4 24
4 18 16 72
6 22 36 132
8 20 64 160
12 30 144 360
20 30 400 600
Total Σ 52 142 664 1348

Thus, the normal equations become
$142 = 7a + 52b$
$1348 = 52a + 664b$
Solving the above equations simultaneously we get $a = 12.4444$, $b = 1.0556$
Thus, the required equation of the line is $y = 12.4444 + 1.0556\,x$
And when $x = 22$ ⟹ $y = 35.6676$, i.e. $y(22) = 35.6676$
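The two normal equations of the straight line have a closed-form solution, which the following Python sketch uses. The function name `fit_line` is hypothetical. Because the coefficients are not rounded before predicting, the estimate at $x = 22$ comes out as about 35.67 rather than exactly the rounded hand value.

```python
def fit_line(xs, ys):
    """Least squares line y = a + b x from the two normal equations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Eliminating a from the normal equations gives b, then a follows.
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = (sy - b * sx) / n
    return a, b


xs = [0, 2, 4, 6, 8, 12, 20]
ys = [10, 12, 18, 22, 20, 30, 30]
a, b = fit_line(xs, ys)
y_22 = a + b * 22
print(round(a, 4), round(b, 4), round(y_22, 4))
```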
2) Fit a parabola (second degree polynomial) to the following data by
least squares method.

𝑥 −2 −1 0 1 2
y 4 1 2 7 15

Solution: To fit a parabola $y = a + bx + cx^2$ to the n = 5 points, the
normal equations are
\[ \sum y_i = na + b \sum x_i + c \sum x_i^2 \]
\[ \sum x_i y_i = a \sum x_i + b \sum x_i^2 + c \sum x_i^3 \]
\[ \sum x_i^2 y_i = a \sum x_i^2 + b \sum x_i^3 + c \sum x_i^4 \]
𝑥𝑖 𝑦𝑖 𝑥𝑖2 𝑥𝑖3 𝑥𝑖4 𝑥𝑖 𝑦𝑖 𝑥𝑖2 𝑦𝑖
−2 4 4 −8 16 −8 16
−1 1 1 −1 1 −1 1
0 2 0 0 0 0 0
1 7 1 1 1 7 7
2 15 4 8 16 30 60
Total Σ 0 29 10 0 34 28 84
Thus, the normal equations become
$29 = 5a + 0 + 10c$
$28 = 0 + 10b + 0$
$84 = 10a + 0 + 34c$
Solving the above equations simultaneously we get
$a = 2.086$, $b = 2.8$, and $c = 1.857$.
Thus, the required equation of the parabola is $y = 2.086 + 2.8\,x + 1.857\,x^2$.
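For a polynomial of any degree, the normal equations form a small linear system that can be built from power sums and solved by elimination. The sketch below is one possible way to do it in plain Python; the function names `solve_linear` and `fit_polynomial` are hypothetical, and the solver is a basic Gauss-Jordan elimination with partial pivoting that assumes the system is non-singular.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    size = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(size):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][size] / aug[i][i] for i in range(size)]


def fit_polynomial(xs, ys, degree):
    """Return [a0, a1, ..., a_degree] from the least squares normal equations."""
    size = degree + 1
    # power_sums[k] = sum of x^k; power_sums[0] equals the number of points n.
    power_sums = [sum(x ** k for x in xs) for k in range(2 * degree + 1)]
    matrix = [[power_sums[i + j] for j in range(size)] for i in range(size)]
    rhs = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(size)]
    return solve_linear(matrix, rhs)


a, b, c = fit_polynomial([-2, -1, 0, 1, 2], [4, 1, 2, 7, 15], degree=2)
print(round(a, 4), round(b, 4), round(c, 4))
```

With `degree=1` the same function reproduces the straight-line normal equations.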
3) Fit an exponential curve 𝑦 = 𝑎𝑏 𝑥 to the following data by least squares
method.
𝑥 2 4 6 8 10 12
y 1.8 1.5 1.4 1.1 1.1 0.9

Solution: To fit an exponential curve $y = a b^x$ to the n = 6 points.
Consider the exponential curve $y = a b^x$
⇒ $\ln y = \ln a + x \ln b$
Let $Y = \ln y$, $X = x$, $A = \ln a$, $B = \ln b$
⇒ $Y = A + BX$
which is the equation of a straight line having normal equations
\[ \sum Y_i = nA + B \sum X_i \]
\[ \sum X_i Y_i = A \sum X_i + B \sum X_i^2 \]
$X_i = x_i$   $y_i$   $Y_i = \ln y_i$   $X_i^2$   $X_i Y_i$
2             1.8     0.5878            4         1.1756
4             1.5     0.4055            16        1.6219
6             1.4     0.3365            36        2.0188
8             1.1     0.0953            64        0.7625
10            1.1     0.0953            100       0.9531
12            0.9     −0.1054           144       −1.2643
Total Σ 42            1.4150            364       5.2675
Thus, the normal equations become
$1.4150 = 6A + 42B$
$5.2675 = 42A + 364B$
Solving the above equations simultaneously we get $A = 0.6995$, $B = -0.0662$
∵ $A = \ln a$, $B = \ln b$ ⇒ $a = e^A$, $b = e^B$
∴ $a = e^{0.6995} = 2.012$ and $b = e^{-0.0662} = 0.936$
Thus, the required equation of the exponential curve is $y = 2.012\,(0.936)^x$
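The log-transform approach can be sketched in Python as below. The function name `fit_exponential` is hypothetical. Note that this fits a straight line to $\ln y$, so it minimizes squared errors in $\ln y$ rather than in $y$ itself; the values may differ from the hand calculation in the third decimal place because the logarithms are not rounded.

```python
from math import exp, log


def fit_exponential(xs, ys):
    """Fit y = a * b**x by least squares on Y = ln y = A + B x."""
    n = len(xs)
    log_ys = [log(y) for y in ys]            # requires every y > 0
    sx, sy = sum(xs), sum(log_ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * big_y for x, big_y in zip(xs, log_ys))
    big_b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    big_a = (sy - big_b * sx) / n
    return exp(big_a), exp(big_b)            # a = e^A, b = e^B


a, b = fit_exponential([2, 4, 6, 8, 10, 12], [1.8, 1.5, 1.4, 1.1, 1.1, 0.9])
print(round(a, 3), round(b, 3))
```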
Practice Problems:
1) Fit an exponential curve 𝑦 = 𝑎𝑒 𝑏𝑥 to the following data by least squares
method. 𝑥 1 2 3 4
y 7 11 17 27
2) Fit a power function (geometric curve) of the form 𝑦 = 𝑎𝑥 𝑏 to the
following data by least squares method and estimate y at 𝑥 = 12.

Price 𝑥 20 16 10 11 14
Demand 𝑦 22 41 120 89 56
INFERENTIAL STATISTICS
HYPOTHESIS TESTING
It is a method used to make decisions or inferences about a population
based on sample data. It helps us determine whether a certain
assumption (or hypothesis) about the data is reasonable.

KEY TERMS
NULL HYPOTHESIS $H_0$:
A statement that suggests there is no effect or no difference.

ALTERNATIVE HYPOTHESIS $H_1$ / $H_a$:
A statement that contradicts the null hypothesis, indicating the
presence of an effect or significant difference.
Basic Statistical Concepts
1. Parameter:
A numerical value that describes the population (e.g. population
mean $\mu$).
2. Statistic:
A numerical value that describes the sample (e.g. sample mean $\bar{x}$).
3. Sampling Distribution:
i) It is the distribution of a sample statistic.
ii) Central Limit Theorem (CLT):
If the sample size is large ($n \geq 30$), the sampling distribution of
the mean approximately follows a normal distribution, regardless of the
population distribution.
4. Test Statistic:
A standardized value used to decide whether to reject $H_0$.
5. Significance Level ($\alpha$):
$\alpha = P(\text{Reject } H_0 \mid H_0 \text{ is true})$
$\alpha$ is the risk of wrongly rejecting $H_0$ when it is true.
Generally, we use $\alpha = 0.05$ (5%) or $\alpha = 0.01$ (1%).
6. p-value:
The probability of observing test results at least as extreme as those
obtained, assuming $H_0$ is true.
Decision Rule – i) If p-value < α then reject $H_0$
ii) If p-value ≥ α then fail to reject $H_0$
Types of Errors in Testing of Hypothesis
When a statistical hypothesis is tested there are four possibilities:

Decision \ Reality                 $H_0$ True           $H_0$ False
Reject $H_0$                       TYPE I ERROR         CORRECT DECISION
Fail to Reject $H_0$ (Accept $H_0$)  CORRECT DECISION     TYPE II ERROR
Examples
1. New Drug Effectiveness
H0 : The new drug does not work. (Null Hypothesis)
H1 : The new drug works. (Alternate Hypothesis)
Type I Error: Claim the drug works, but it doesn’t.
Type II Error: Claim the drug doesn’t work, but it actually does.
2. Courtroom Trial
H0 : The person is innocent.
H1 : The person is guilty.
Type I Error: Convicting an innocent person (False Positive).
Type II Error: Letting a guilty person go free (False Negative).
Two-tailed Tests of Hypothesis
1. Used when we check for deviations in both directions (higher or
lower) from the hypothesized value.
2. The rejection region is split between both tails of the distribution.
3. At 5% significance ($\alpha = 0.05$), 2.5% is allocated to each tail.
One-tailed Tests of Hypothesis
Used when we check if the value is greater or less than a specific point.
1. Right-tailed: Tests if the value is significantly greater.
2. Left-tailed : Tests if the value is significantly smaller.
Different Tests of Hypothesis

Z-Test
It is a statistical test used to determine if there is a significant
difference between sample and population means or between
two sample means when the population standard deviation is
known and the sample size is large ($n \geq 30$).
The test statistic follows the standard normal distribution $N(0, 1)$.
Formula
1. One Sample Z-test
Compare a sample mean to a known population mean.
\[ Z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}} \]
where $\bar{x}$ – Sample mean, $\mu$ – Population mean,
$\sigma$ – Population standard deviation, $n$ – Sample size

2. Two Sample Z-test
Compare the means of two independent groups/samples.
\[ Z = \dfrac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \]
3. Z-test for Proportions
Used to test whether a sample proportion differs significantly
from a known population proportion or to compare proportions
from two samples.
\[ Z = \dfrac{P_1 - P_2}{\sqrt{P(1 - P)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \]
where $P = \dfrac{n_1 P_1 + n_2 P_2}{n_1 + n_2}$ is known as the pooled proportion and
$P_1$, $P_2$ = Sample proportions from the two groups
Example
1. A company claims that the average height of employees is 170 cm.
A sample of 50 employees has a mean height of 172 cm with a
standard deviation of 8 cm. Test at a 5% significance level.
Solution
$H_0 : \mu = 170$
$H_1 : \mu \neq 170$

\[ Z = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \dfrac{172 - 170}{8 / \sqrt{50}} = 1.77 \]
At $\alpha = 0.05$, critical Z-value = ±1.96
Since $|Z| = 1.77 < 1.96$, we fail to reject $H_0$.
2. A sample of 50 steel rods manufactured by Process A has an
average tensile strength of 600 MPa ($\sigma$ = 25 MPa), while a sample
of 50 steel rods manufactured by Process B has an average tensile
strength of 605 MPa ($\sigma$ = 24 MPa).
At a 0.05 significance level, test whether there is a significant
difference in tensile strength between the two processes.
Solution
Given: Process A - $n_1 = 50$, $\bar{x}_1 = 600$, $\sigma_1 = 25$
Process B - $n_2 = 50$, $\bar{x}_2 = 605$, $\sigma_2 = 24$
Step 1: Hypothesis
$H_0 : \mu_1 = \mu_2$
$H_1 : \mu_1 \neq \mu_2$ (Two-tailed test)
Step 2: Compute Z-Statistic
\[ Z = \dfrac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \dfrac{600 - 605}{\sqrt{\dfrac{25^2}{50} + \dfrac{24^2}{50}}} = -1.02 \]
Step 3: Find Critical Value
For a two-tailed test at $\alpha = 0.05$, the critical values are $Z_{\alpha/2} = \pm 1.96$
Step 4: Decision Rule
If $|Z| \geq 1.96$, then reject $H_0$; otherwise fail to reject it.
Since $|Z| = 1.02 < 1.96$, we fail to reject $H_0$.
There is no significant difference in tensile strength between
the two manufacturing processes.
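Both Z-test examples can be reproduced with the Python standard library alone. In the sketch below the function names are hypothetical, and the two-tailed p-value is obtained from the standard normal distribution via `math.erfc`, using the identity $p = 2\,(1 - \Phi(|z|)) = \operatorname{erfc}(|z| / \sqrt{2})$.

```python
from math import erfc, sqrt


def one_sample_z(x_bar, mu, sigma, n):
    """Z statistic for a sample mean against a hypothesized population mean."""
    return (x_bar - mu) / (sigma / sqrt(n))


def two_sample_z(x_bar_1, x_bar_2, sigma_1, sigma_2, n_1, n_2):
    """Z statistic for the difference of two independent sample means."""
    return (x_bar_1 - x_bar_2) / sqrt(sigma_1 ** 2 / n_1 + sigma_2 ** 2 / n_2)


def two_tailed_p(z):
    """Two-tailed p-value under the standard normal distribution."""
    return erfc(abs(z) / sqrt(2))


alpha = 0.05
z_height = one_sample_z(172, 170, 8, 50)               # employee heights
z_steel = two_sample_z(600, 605, 25, 24, 50, 50)       # Process A vs Process B

for name, z in [("heights", z_height), ("steel rods", z_steel)]:
    p = two_tailed_p(z)
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(name, round(z, 2), round(p, 4), decision)
```

The p-value rule and the critical-value rule ($|Z|$ against 1.96) lead to the same decision, since a two-tailed p-value of 0.05 corresponds to $|Z| = 1.96$.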
Practice Problems
1. A company claims that the average battery life of its smartphones is
10 hours. A researcher tests 36 phones and finds the average
battery life is 9.8 hours, with a population standard deviation of 1.2
hours. At a 0.05 significance level, test whether the average battery
life differs from 10 hours.
2. A researcher wants to test if electric cars have a higher average
mileage per charge than hybrid cars. A sample of 35 electric cars
shows an average mileage of 280 miles, while a sample of 35 hybrid
cars shows an average mileage of 270 miles.
At a 0.05 significance level, test whether electric cars have a
significantly higher mileage per charge than hybrid cars.
Student’s t-Test
It is used to compare the means of two groups when the population
variance is unknown and the sample size is small (n < 30).
The test measures whether the difference between group averages
is due to random variation or a real effect.

Formula
1. For a single sample: $t = \dfrac{\bar{X} - \mu}{s / \sqrt{n}}$

2. For two independent samples: $t = \dfrac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$
Degrees of Freedom (df)
It refers to the number of independent values or
observations in a statistical calculation that can vary
while estimating a parameter.
It is crucial in hypothesis testing because it helps to
determine the distribution of the test statistic.
Ex. If we randomly select 5 numbers that must sum to
100, then we can freely choose only 4 numbers as 5th
number is fixed to maintain the sum
Hence 𝑑𝑓 = 𝑛 − 1 = 5 − 1 = 4
Example
A sample of 10 students has an average score of 85 with a standard
deviation of 5. Test if the population mean is 80 at a 5% significance level.
Solution
$H_0 : \mu = 80$
$H_1 : \mu \neq 80$

\[ t = \dfrac{\bar{X} - \mu}{s / \sqrt{n}} = \dfrac{85 - 80}{5 / \sqrt{10}} = 3.16 \]
Degrees of freedom (df) $= n - 1 = 9$
t-critical (df = 9) = ±2.262
Since $|t| = 3.16 > 2.262$, we reject $H_0$.
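A minimal Python sketch of the one-sample t-test is given below. The function name `one_sample_t` is hypothetical. The standard library has no t-distribution, so the critical value 2.262 (df = 9, two-tailed, $\alpha = 0.05$) is simply copied from the solution above rather than computed.

```python
from math import sqrt


def one_sample_t(x_bar, mu, s, n):
    """Return (t statistic, degrees of freedom) for a single sample."""
    t = (x_bar - mu) / (s / sqrt(n))
    return t, n - 1


t_stat, df = one_sample_t(x_bar=85, mu=80, s=5, n=10)

t_critical = 2.262   # table value for df = 9 at alpha = 0.05 (two-tailed)
reject_h0 = abs(t_stat) > t_critical
print(round(t_stat, 2), df, "reject H0" if reject_h0 else "fail to reject H0")
```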
Practice Problems
1. A company manufactures printed circuit boards (PCBs) and aims to
keep the thickness of a specific layer at 1.5 mm. A quality control
engineer measures the thickness of 9 randomly selected boards,
obtaining an average thickness of 1.48 mm with a standard
deviation of 0.03 mm. Can we conclude at α = 0.05 that the average
thickness is significantly different from 1.5 mm?
2. A networking company is testing two models of wireless routers for
their speed performance. A sample of 15 tests with Router A shows
an average speed of 150 Mbps with a standard deviation of 12 Mbps,
while a sample of 18 tests with Router B shows an average speed of
140 Mbps with a standard deviation of 10 Mbps. At α = 0.05, is there
enough evidence to suggest a significant difference in the mean
speed of the two routers?
F-Test
The F-test is used to compare the variances of two datasets or
to assess the significance of multiple factors in a system. It
determines whether differences in variability are statistically
significant by analyzing the ratio of two variances.
The test is commonly applied in quality control, performance
comparison, and model evaluation.

Formula
$F = \dfrac{s_1^2}{s_2^2}$
Where $s_1^2$ and $s_2^2$ are the sample variances.
Example
Two samples have variances of 25 and 16 with sample sizes of 10 each.
Test if the variances are equal at a 5% significance level.
Solution
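$H_0: \sigma_1^2 = \sigma_2^2$
$H_1: \sigma_1^2 \neq \sigma_2^2$

$F = \dfrac{s_1^2}{s_2^2} = \dfrac{25}{16} = 1.5625$
Degrees of freedom: df1 = df2 = $n - 1$ = 9,
Critical F-value (df1 = 9, df2 = 9) = 3.18 (upper 5% point of the F table;
for the two-sided $H_1$ the upper 2.5% point, 4.03, applies),
Since 1.5625 is smaller than either value, we fail to reject $H_0$.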
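A minimal Python sketch of the same variance-ratio calculation follows. The helper name `variance_ratio` is an assumption made for illustration, and the critical value is the table value quoted in the example, not something the code derives.

```python
def variance_ratio(var1, var2):
    """Return F = var1 / var2 for two sample variances."""
    return var1 / var2

# Sample variances from the worked example (both samples have n = 10)
f_stat = variance_ratio(25, 16)
df1 = df2 = 10 - 1

# Table value quoted in the example for df1 = 9, df2 = 9
f_critical = 3.18
reject_h0 = f_stat > f_critical

print(f"F = {f_stat}, df = ({df1}, {df2}), reject H0: {reject_h0}")
```

When only raw observations are available, `statistics.variance` from the standard library gives the sample variance (divisor n − 1) that feeds into this ratio.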
Practice Problems
1. Two universities conducted a study on the variation in students'
weekly study hours. A researcher wants to test if the variability in
study hours differs between the two universities at a 5% significance
level.
University A: Sample variance = 20, Sample size = 15
University B: Sample variance = 12, Sample size = 15
2. A tech company is testing two different brands of smartphone batteries
to determine if their battery life variability differs. The test is
conducted at a 1% significance level. Below are the recorded battery
life durations (in hours) for both brands:
Brand A 10.5 11.2 9.8 10.0 11.5 10.8
Brand B 9.5 10.0 10.8 9.9 10.3 10.6
Chi-Square Test
The Chi-Square test is a method used to determine whether there is a
significant difference between observed and expected frequencies in
categorical data. It works by comparing how much the actual data
deviates from what we expect under a given assumption. The test is
widely used in research, quality control, and survey analysis.
Formula
$\chi^2 = \sum_{i=1}^{n} \dfrac{(O_i - E_i)^2}{E_i}$

Where $O_i$ - Observed value
$E_i$ - Expected value
Example
A coin is tossed 100 times: 55 heads, 45 tails. Test if the coin is fair.
Solution
H0 : The coin is fair.
H1 : The coin is not fair.
Expected frequencies – H=50, T=50
$\chi^2 = \sum_{i=1}^{n} \dfrac{(O_i - E_i)^2}{E_i} = \dfrac{(55 - 50)^2}{50} + \dfrac{(45 - 50)^2}{50} = 1$
Degrees of freedom: df = $n - 1$ = 2 − 1 = 1 (n = number of categories),
$\chi^2$-critical (df = 1, $\alpha = 0.05$) = 3.841.
Since 1 < 3.841, we fail to reject $H_0$.
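The coin example can be checked with a few lines of Python. This is a sketch only: `chi_square_stat` is a name introduced here, and the critical value is taken from the example's table lookup.

```python
def chi_square_stat(observed, expected):
    """Return the sum of (O_i - E_i)^2 / E_i over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Coin tossed 100 times: 55 heads, 45 tails; a fair coin expects 50 and 50
observed = [55, 45]
expected = [50, 50]

chi2 = chi_square_stat(observed, expected)
df = len(observed) - 1  # number of categories minus one

# Table value quoted in the example for df = 1, alpha = 0.05
chi2_critical = 3.841
reject_h0 = chi2 > chi2_critical

print(f"chi-square = {chi2}, df = {df}, reject H0: {reject_h0}")
```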
Practice Problems
1. A die is rolled 150 times, and the results are recorded as follows:
Outcome                   1   2   3   4   5   6
Observed Frequency (O)   28  22  20  25  30  25

Test at 5% significance whether the die is fair.

2. A survey records the preferences for three brands among 300
people. Observed frequencies are
Brand A: 120, Brand B: 100, Brand C: 80.
Test at 5% if preferences are equally distributed.
Point Estimate and Confidence Intervals

1. Point Estimate
A Point Estimate is a single value that serves as an
estimate of a population parameter.

2. Confidence Interval (CI)


It is a range of values, derived from sample statistics,
that likely contains the population mean 𝜇 with a
specified level of confidence.
Formula for C.I. for Sample Mean
1. Population Standard Deviation σ is Known (Z-Distribution)
$C.I. = \bar{X} \pm Z_{\alpha/2} \times \dfrac{\sigma}{\sqrt{n}}$
Where $Z_{\alpha/2}$ = Z-score corresponding to the confidence level
σ = Population Standard Deviation

2. Population Standard Deviation σ is Unknown (t-Distribution)
$C.I. = \bar{X} \pm t_{\alpha/2,\,n-1} \times \dfrac{s}{\sqrt{n}}$
Where $t_{\alpha/2,\,n-1}$ = t-score from the table based on df
s = Sample Standard Deviation


Example on C.I. with Known σ
A researcher collects a sample of $n = 25$ data points and finds a sample
mean of $\bar{X} = 50$. The population standard deviation is $\sigma = 10$.
Construct a 95% C.I.
Solution
Given: $\bar{X} = 50$, $\sigma = 10$, $n = 25$, Z-score for 95% C.I. = 1.96
Margin of Error: $ME = Z_{\alpha/2} \times \dfrac{\sigma}{\sqrt{n}} = 1.96 \times \dfrac{10}{\sqrt{25}} = 3.92$
Confidence Interval: $C.I. = \bar{X} \pm Z_{\alpha/2} \times \dfrac{\sigma}{\sqrt{n}} = 50 \pm 3.92 = (46.08, 53.92)$
We are 95% confident that the true population mean lies within the
interval (46.08, 53.92).
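As a rough check of the known-σ interval, the sketch below uses Python's standard library. The function name `z_confidence_interval` is an assumption for illustration; the Z-score is obtained from `statistics.NormalDist` instead of a printed table, so it is about 1.95996 rather than the rounded 1.96.

```python
import math
from statistics import NormalDist

def z_confidence_interval(mean, sigma, n, confidence=0.95):
    """Return (lower, upper) for mean ± Z_{alpha/2} * sigma / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # Z-score for the confidence level
    margin = z * sigma / math.sqrt(n)        # margin of error
    return mean - margin, mean + margin

# Values from the worked example: mean = 50, sigma = 10, n = 25
low, high = z_confidence_interval(mean=50, sigma=10, n=25)

print(f"95% C.I. = ({low:.2f}, {high:.2f})")
```

For the unknown-σ case the same structure applies with the sample standard deviation and a t-table value for df = n − 1, which would have to be supplied separately because the standard library has no t-distribution.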
Practice Problems
1. A sample of 30 students was selected to estimate the average time
they spend studying per day. The sample mean is 5.5 hours with a
standard deviation of 1.2 hours.
Find the 95% confidence interval for the population mean.
2. A factory wants to estimate the average time it takes to
manufacture a product. From a small sample of 10 products, the
time taken is as follows (in hours):
7.2, 8.0, 6.8, 7.5, 8.1, 7.0, 7.8, 8.2, 7.4, 7.9.
Assuming the time follows a normal distribution, calculate a 90%
confidence interval for the mean manufacturing time.
ANOVA (Analysis of Variance)
It is used to compare the means of three or more groups
to determine if there is a significant difference among
them. It helps test whether at least one group mean is
different from the others, without performing multiple
pairwise t-tests.
Formula
$F = \dfrac{MSB}{MSW}$

Where MSB – Between-group mean square
MSW – Within-group mean square
Example
Three groups contain the following data:
Group 1: 8, 10, 12   Group 2: 11, 12, 13   Group 3: 13, 14, 15
Test whether there is a significant difference between their means at $\alpha = 0.05$.
Solution
$H_0$: The means of the three groups are equal, $\mu_1 = \mu_2 = \mu_3$
$H_1$: At least one group mean is significantly different from the others.
From the given data we get $\bar{X}_1 = 10$, $\bar{X}_2 = 12$, $\bar{X}_3 = 14$

Step 1. Overall mean: $\bar{X}_{overall} = \dfrac{\sum X_{ij}}{N} = \dfrac{108}{9} = 12$, where N is the total number of observations

Step 2. Sum of squares between (SSB), the variation between groups:
$SSB = n \sum_{i=1}^{k} (\bar{X}_i - \bar{X}_{overall})^2 = 3\left[(10 - 12)^2 + (12 - 12)^2 + (14 - 12)^2\right] = 24$

Step 3. Sum of squares within (SSW), the variation within groups:
$SSW = \sum_{i=1}^{k} \sum_{j=1}^{n} (X_{ij} - \bar{X}_i)^2 = 8 + 2 + 2 = 12$

Step 4. $MSB = \dfrac{SSB}{df_{between}} = \dfrac{24}{2} = 12$, where $df_{between} = k - 1$ (k = number of groups)

Step 5. $MSW = \dfrac{SSW}{df_{within}} = \dfrac{12}{6} = 2$, where $df_{within} = N - k$

Step 6. F-statistic: $F = \dfrac{MSB}{MSW} = \dfrac{12}{2} = 6$

Step 7. Critical F-value ($\alpha = 0.05$, $df_{between} = 2$, $df_{within} = 6$) = 5.143

Since 6 > 5.143, we reject the null hypothesis $H_0$.
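Steps 1 to 6 can be sketched in Python as follows. The function name `one_way_anova` and its return layout are illustrative assumptions; the critical value is the table value from Step 7.

```python
from statistics import fmean

def one_way_anova(groups):
    """Return (SSB, SSW, F) for a list of groups of observations."""
    k = len(groups)                                   # number of groups
    all_values = [x for g in groups for x in g]
    n_total = len(all_values)                         # total observations N
    overall = fmean(all_values)                       # Step 1: overall mean

    # Step 2: variation between groups (each group weighted by its size)
    ssb = sum(len(g) * (fmean(g) - overall) ** 2 for g in groups)
    # Step 3: variation within groups
    ssw = sum((x - fmean(g)) ** 2 for g in groups for x in g)

    msb = ssb / (k - 1)            # Step 4: df_between = k - 1
    msw = ssw / (n_total - k)      # Step 5: df_within = N - k
    return ssb, ssw, msb / msw     # Step 6: F = MSB / MSW

groups = [[8, 10, 12], [11, 12, 13], [13, 14, 15]]
ssb, ssw, f_stat = one_way_anova(groups)

f_critical = 5.143  # table value for alpha = 0.05, df = (2, 6), from Step 7
print(f"SSB = {ssb}, SSW = {ssw}, F = {f_stat}, reject H0: {f_stat > f_critical}")
```

Weighting each group by its own size in SSB reduces to the slide's formula when all groups have the same size n.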


Practice Problems
1. A restaurant manager wants to check if three different menu
designs (A, B and C) affect customer satisfaction. He collects
satisfaction ratings (out of 10) from 6 customers per menu design.
Menu Design   Customer Satisfaction Ratings
A             7, 8, 7, 9, 8, 7
B             6, 7, 7, 8, 6, 7
C             8, 9, 8, 7, 8, 9

Using the ANOVA test at $\alpha = 0.05$, determine if the menu design
significantly affects customer satisfaction.
2. A researcher wants to test whether three different fertilizers
(A, B and C) have different effects on crop yield. She collects
data on the crop yield (in kg per plot) for each fertilizer from 5
plots per group, as shown below:
Fertilizer Crop Yield (kg per plot)
A 20, 22, 19, 24, 21
B 25, 27, 23, 26, 28
C 22, 23, 21, 20, 24

At a significance level of 0.05, perform a One-Way ANOVA to
determine if there is a significant difference in crop yield
among the three fertilizers.
Removal of Outliers using Median
It involves identifying and potentially eliminating data points that
significantly differ from the central tendency of the dataset.
Method:
Step 1. Calculate the Median (M)
Step 2. Compute the Median Absolute Deviation (MAD)
$MAD = \text{Median}(|X_i - M|)$
Step 3. Set a Threshold for Outliers
Lower Bound = M − k × MAD
Upper Bound = M + k × MAD
commonly k = 2.5 or 3
Step 4. Identify and Remove Outliers
Example
Given the data set: 5, 7, 8, 9, 10, 50
Find the Outliers using the Median method.
Solution
Step 1. M = Median(5, 7, 8, 9, 10, 50) = 8.5
Step 2. Calculate the absolute deviations and find MAD
$|X_i - 8.5|$ = 3.5, 1.5, 0.5, 0.5, 1.5, 41.5
MAD = Median(3.5, 1.5, 0.5, 0.5, 1.5, 41.5) = 1.5
Step 3. Set a threshold, let k = 3
Lower Bound = 8.5 − 3 × 1.5 = 4
Upper Bound = 8.5 + 3 × 1.5 = 13
Step 4. Since 50 is outside the range [4, 13], it is identified as an
outlier.
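The four steps of the median method might be coded as below. This is a sketch under the slide's definitions; the function name `mad_outliers` and its return values are assumptions made for illustration.

```python
from statistics import median

def mad_outliers(data, k=3):
    """Return (outliers, cleaned, (lower, upper)) using M ± k × MAD bounds."""
    m = median(data)                              # Step 1: median M
    mad = median([abs(x - m) for x in data])      # Step 2: median absolute deviation
    lower, upper = m - k * mad, m + k * mad       # Step 3: thresholds
    # Step 4: split the data into outliers and retained points
    outliers = [x for x in data if x < lower or x > upper]
    cleaned = [x for x in data if lower <= x <= upper]
    return outliers, cleaned, (lower, upper)

data = [5, 7, 8, 9, 10, 50]
outliers, cleaned, bounds = mad_outliers(data, k=3)

print(f"bounds = {bounds}, outliers = {outliers}, cleaned = {cleaned}")
```

If more than half of the values are identical, MAD becomes 0 and every differing point falls outside the bounds, so the result deserves a manual look in that case.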
Practice Problems
1. Given the data set: 8, 9, 10, 12, 100, 14, 16, 13, 15
Find the outliers using the median method and remove them.

2. A researcher collects the following dataset representing the ages
of participants in a study:
22, 25, 27, 24, 30, 28, 29, 50, 26, 24, 23, 500
i) Use the median and the IQR to identify outliers in the dataset.
ii) Remove the outliers and calculate the revised median of the
dataset.
