CHAPTER FIVE: ANALYSIS OF VARIANCE
There are situations where more than two populations are involved and we need to test the
significance of differences between three or more sample means. We also need to test the null
hypothesis that three or more populations from which independent samples are drawn have equal
[or homogeneous] means against the alternative hypothesis that population means are not equal.
Let $\mu_1, \mu_2, \mu_3, \dots, \mu_k$ be the mean values of populations 1, 2, 3, ..., k respectively. Then from
sample data we intend to test the following hypotheses:
$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$
$H_1:$ Not all $\mu_j$ are equal $(j = 1, 2, 3, \dots, k)$
That is, the null hypothesis should be rejected if any of the k sample means differs significantly from the others.
The following are a few examples involving more than two populations where it is necessary to
conduct a comparison to arrive at a statistical inference:
Effectiveness of different promotional devices in terms of sales.
Quality of a product produced by different manufacturers in terms of an attribute.
Production volume in different shifts in a factory.
Yield from plots of land due to varieties of seeds, fertilizers and cultivation methods.
Terminologies:
Response variable: The variable of interest.
Factor (criterion): A variable, qualitative or quantitative, that is related to the
response variable.
Level: The value assumed by a factor in an experiment.
Treatments: The combinations of levels of the factors for which the response will be
observed.
Assumptions for analysis of variance
1. Each population has a normal distribution.
2. The populations from which the samples are drawn have equal variances, i.e.
$\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \dots = \sigma_k^2$.
3. Each sample is drawn randomly and is independent of other samples.
Analysis of variance approach
The first step in the analysis of variance is to partition the total variation in the sample data into
the following two components, so that the contribution of each source of variation can be
estimated:
1. The amount of variation among the sample means, i.e. the variation attributed to the
difference among sample means. This variation is due to assignable causes.
2. The amount of variation within the sample observations. This variation is considered to be
due to chance causes or experimental [random] errors.
One-way classification to test equality of population means
The classification is based on one factor or attribute of interest such as
I. Range of marks scored by students in a course.
II. Flavour preference of ice-cream by customers.
III. Yield of crop due to variety of seeds and so on.
Suppose our main aim is to make inferences about k population means based on sample data,
where $\mu_j$ is the mean of the population of measurements associated with treatment
$j$ $(j = 1, 2, 3, \dots, k)$. The null and alternative hypotheses to be tested are stated as:
$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$
$H_1:$ Not all $\mu_j$ are equal $(j = 1, 2, 3, \dots, k)$
Let:
$n_j$ = size of the $j$th sample $(j = 1, 2, 3, \dots, k)$
$n$ = total number of observations in all samples combined, i.e. $n = n_1 + n_2 + n_3 + \dots + n_k$, or $n = rk$ if
every sample has the same size $r$
$x_{ij}$ = the $i$th observation in the $j$th sample
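Observation         Populations [number of samples]
[measurements]      1              2              ...    k
1                   $x_{11}$       $x_{12}$       ...    $x_{1k}$
2                   $x_{21}$       $x_{22}$       ...    $x_{2k}$
3                   $x_{31}$       $x_{32}$       ...    $x_{3k}$
.                   .              .                     .
.                   .              .                     .
.                   .              .                     .
r                   $x_{r1}$       $x_{r2}$       ...    $x_{rk}$
Sum                 $T_1$          $T_2$          ...    $T_k$          Total = $T$
Arithmetic mean     $\bar{x}_1$    $\bar{x}_2$    ...    $\bar{x}_k$    Grand mean = $\bar{\bar{x}}$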
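Where:
$$T_j = \sum_{i=1}^{r} x_{ij}, \qquad \bar{x}_j = \frac{1}{r} \sum_{i=1}^{r} x_{ij}$$
$$T = \sum_{j=1}^{k} T_j, \qquad \bar{\bar{x}} = \frac{1}{k} \sum_{j=1}^{k} \bar{x}_j = \frac{1}{n} \sum_{i=1}^{r} \sum_{j=1}^{k} x_{ij}$$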
The values $\bar{x}_j$ are called sample means and $\bar{\bar{x}}$ is the grand mean of all observations (or
measurements) in all the samples. Since there are r rows and k columns in the table above, the
total number of observations is $n = rk$, provided each sample has an equal number of observations. But if
the number of observations varies from sample to sample, then the total number of observations is
$n = n_1 + n_2 + n_3 + \dots + n_k$
Example 1. Three brands of tires, A, B and C, were tested for durability. A sample of four tires of
each brand was subjected to the same test, and the number of kilometers until
wear-out was noted for each tire.
The data, in thousands of kilometers, are given below:
Observations     Samples [brands]
                 A        B        C
1                26       18       23
2                25       16       19
3                28       17       26
4                12       18       30
Sum              91       69       98
Mean             22.75    17.25    24.50
Since the same number of observations is obtained from each brand of tires [population],
the number of observations in the table is $n = rk = 4 \times 3 = 12$.
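The sample means of the three samples are given by
$$\bar{x}_j = \frac{1}{r} \sum_{i=1}^{r} x_{ij}$$
$$\bar{x}_A = \frac{1}{4} \sum_{i=1}^{4} x_{i1} = \frac{1}{4}(91) = 22.75$$
$$\bar{x}_B = \frac{1}{4} \sum_{i=1}^{4} x_{i2} = \frac{1}{4}(69) = 17.25$$
$$\bar{x}_C = \frac{1}{4} \sum_{i=1}^{4} x_{i3} = \frac{1}{4}(98) = 24.50$$
The grand mean for all samples is
$$\bar{\bar{x}} = \frac{1}{k} \sum_{j=1}^{k} \bar{x}_j = \frac{1}{n} \sum_{i=1}^{r} \sum_{j=1}^{k} x_{ij}$$
$$\bar{\bar{x}} = \frac{1}{12} \sum_{i=1}^{4} \sum_{j=1}^{3} x_{ij} = \frac{1}{12}(26 + 25 + 28 + \dots + 30) = \frac{258}{12} = 21.50$$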
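The arithmetic above can be checked with a short script. This is only an illustrative sketch in Python; the variable names (`samples`, `sample_means`, `grand_mean`) are assumptions introduced here and are not part of the original notes.

```python
# Tire durability data from Example 1 (thousand km), one list per brand.
samples = {
    "A": [26, 25, 28, 12],
    "B": [18, 16, 17, 18],
    "C": [23, 19, 26, 30],
}

# Sample mean of each brand: column total divided by the sample size.
sample_means = {brand: sum(x) / len(x) for brand, x in samples.items()}

# Grand mean: total of all observations divided by n.
n = sum(len(x) for x in samples.values())
grand_total = sum(sum(x) for x in samples.values())
grand_mean = grand_total / n

print(sample_means)                # {'A': 22.75, 'B': 17.25, 'C': 24.5}
print(n, grand_total, grand_mean)  # 12 258 21.5
```

Dividing the grand total by n, rather than averaging the sample means, keeps the calculation valid when the samples have unequal sizes.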
Steps for testing the null hypothesis
STEP 1. State the hypotheses to test equality of population means as
$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$
$H_1:$ Not all $\mu_j$ are equal $(j = 1, 2, 3, \dots, k)$
$\alpha$ = level of significance
STEP 2. Calculate the total variation: If a single sample of size n is taken from the
population, then the estimate of the population variance is given by:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{SS}{df}$$
Where: $s^2$ = sample variance (a mean square)
$SS$ = sum of squared deviations
$df$ = degrees of freedom
Total variation (sum of squares total, SST) = sum of squared differences between each sample
value and the grand mean $\bar{\bar{x}}$:
$$SST = \sum_{i=1}^{r} \sum_{j=1}^{k} (x_{ij} - \bar{\bar{x}})^2$$
Between sample variability:
If the null hypothesis is true, the population means are all equal, and we would expect
the sample means to be close to one another. If the alternative hypothesis is true,
however, there would be large differences between some of the sample means. The sum of squares
between samples [SSB] measures the proximity of the sample means to each other.
$$SSB = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{\bar{x}})^2$$
$$= n_A (\bar{x}_A - \bar{\bar{x}})^2 + n_B (\bar{x}_B - \bar{\bar{x}})^2 + n_C (\bar{x}_C - \bar{\bar{x}})^2$$
$$= 4(22.75 - 21.50)^2 + 4(17.25 - 21.50)^2 + 4(24.50 - 21.50)^2$$
$$= 6.25 + 72.25 + 36$$
$$= 114.50$$
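The same between-sample calculation can be sketched in Python. The names used (`samples`, `ssb_terms`, `ssb`) are illustrative assumptions, not notation from the notes.

```python
# Tire durability data from Example 1 (thousand km).
samples = {
    "A": [26, 25, 28, 12],
    "B": [18, 16, 17, 18],
    "C": [23, 19, 26, 30],
}

n = sum(len(x) for x in samples.values())
grand_mean = sum(sum(x) for x in samples.values()) / n

# Each brand contributes n_j * (sample mean - grand mean)^2 to SSB.
ssb_terms = {
    brand: len(x) * (sum(x) / len(x) - grand_mean) ** 2
    for brand, x in samples.items()
}
ssb = sum(ssb_terms.values())

print(ssb_terms)  # {'A': 6.25, 'B': 72.25, 'C': 36.0}
print(ssb)        # 114.5
```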
Within sample variability
If large differences exist between sample means, at least some sample means differ considerably
from the grand mean, producing a large value of SSB. It is then rational to reject the null
hypothesis in favor of the alternative hypothesis. The sum of squares within samples [SSW] provides a
measure of the amount of variation in the response variable that is not explained by differences
between the samples.
$$SSW = \sum_{i=1}^{r} \sum_{j=1}^{k} (x_{ij} - \bar{x}_j)^2$$
Where:
$$SSW = \sum_{i=1}^{r} (x_{i1} - \bar{x}_1)^2 + \sum_{i=1}^{r} (x_{i2} - \bar{x}_2)^2 + \sum_{i=1}^{r} (x_{i3} - \bar{x}_3)^2 + \dots + \sum_{i=1}^{r} (x_{ik} - \bar{x}_k)^2$$
For the tire data (k = 3, with samples A, B and C):
$$= [(26 - 22.75)^2 + (25 - 22.75)^2 + (28 - 22.75)^2 + (12 - 22.75)^2]$$
$$+ [(18 - 17.25)^2 + (16 - 17.25)^2 + (17 - 17.25)^2 + (18 - 17.25)^2]$$
$$+ [(23 - 24.50)^2 + (19 - 24.50)^2 + (26 - 24.50)^2 + (30 - 24.50)^2]$$
$$= 10.5625 + 5.0625 + 27.5625 + 115.5625 + 0.5625 + 1.5625 + 0.0625 + 0.5625 + 2.25 + 30.25 + 2.25 + 30.25$$
$$= 226.50$$
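A brief Python sketch of the within-sample sum of squares follows. The helper name `within_ss` and the variable names are assumptions made for illustration only.

```python
# Tire durability data from Example 1 (thousand km).
samples = {
    "A": [26, 25, 28, 12],
    "B": [18, 16, 17, 18],
    "C": [23, 19, 26, 30],
}

def within_ss(x):
    """Sum of squared deviations of one sample from its own mean."""
    mean = sum(x) / len(x)
    return sum((value - mean) ** 2 for value in x)

# SSW is the total of the per-sample sums of squares.
ssw_by_brand = {brand: within_ss(x) for brand, x in samples.items()}
ssw = sum(ssw_by_brand.values())

print(ssw_by_brand)  # {'A': 158.75, 'B': 2.75, 'C': 65.0}
print(ssw)           # 226.5
```

The per-brand totals show that most of the within-sample variation comes from brand A, whose fourth tire (12) lies far from the other three.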
Total variability
$$SST = \sum_{i=1}^{r} \sum_{j=1}^{k} (x_{ij} - \bar{\bar{x}})^2 = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{\bar{x}})^2 + \sum_{i=1}^{r} \sum_{j=1}^{k} (x_{ij} - \bar{x}_j)^2$$
Where: SST = sum of squares total
Alternatively:
Total variability = Between sample variability + Within sample variability
$$SST = SSB + SSW$$
$$= 114.50 + 226.50$$
$$= 341$$
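The partition SST = SSB + SSW can also be verified numerically by computing SST directly from the grand mean. The sketch below is illustrative; all variable names are assumptions.

```python
import math

# Tire durability data from Example 1 (thousand km).
samples = {
    "A": [26, 25, 28, 12],
    "B": [18, 16, 17, 18],
    "C": [23, 19, 26, 30],
}

all_obs = [value for x in samples.values() for value in x]
grand_mean = sum(all_obs) / len(all_obs)

# SST: squared deviation of every observation from the grand mean.
sst = sum((value - grand_mean) ** 2 for value in all_obs)

# SSB and SSW, computed independently of SST.
ssb = sum(len(x) * (sum(x) / len(x) - grand_mean) ** 2 for x in samples.values())
ssw = sum((value - sum(x) / len(x)) ** 2 for x in samples.values() for value in x)

# The two components should add up to the total.
assert math.isclose(sst, ssb + ssw)
print(sst, ssb, ssw)  # 341.0 114.5 226.5
```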
STEP 3. Calculate the variation between sample means, also called the “between column variance”.
A. Calculate the mean values $\bar{x}_1, \bar{x}_2, \bar{x}_3, \dots, \bar{x}_k$ of all k samples.
B. Calculate the grand mean $\bar{\bar{x}} = \frac{T}{n}$
Where
T = grand total of all observations
n = number of all observations in all k samples
C. Calculate the difference between the mean of each sample and the grand mean, i.e. $\bar{x}_1 - \bar{\bar{x}}$,
$\bar{x}_2 - \bar{\bar{x}}$, $\bar{x}_3 - \bar{\bar{x}}$, ..., $\bar{x}_k - \bar{\bar{x}}$. Square each difference, multiply it by the number of observations in the
corresponding sample, and sum. The total gives the sum of the squared differences
between the sample means and the grand mean and is denoted by SSB.
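Between variability:
$$SSB = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{\bar{x}})^2$$
$$= n_A (\bar{x}_A - \bar{\bar{x}})^2 + n_B (\bar{x}_B - \bar{\bar{x}})^2 + n_C (\bar{x}_C - \bar{\bar{x}})^2$$
$$= 4(22.75 - 21.50)^2 + 4(17.25 - 21.50)^2 + 4(24.50 - 21.50)^2$$
$$= 6.25 + 72.25 + 36$$
$$= 114.50$$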
STEP 4. Calculate the variation within samples. This is also called the within-sample variance.
A. Calculate the mean values $\bar{x}_1, \bar{x}_2, \bar{x}_3, \dots, \bar{x}_k$ for all k samples.
B. Calculate the difference of each observation in the k samples from the mean value of its
respective sample.
C. Square all the differences obtained in step (B) and find the total of these
squared differences. The total gives the sum of the squares of differences within the
samples and is denoted by SSW.
$$SSW = \sum_{i=1}^{r} \sum_{j=1}^{k} (x_{ij} - \bar{x}_j)^2$$
$$= \sum_{i=1}^{r} (x_{i1} - \bar{x}_1)^2 + \sum_{i=1}^{r} (x_{i2} - \bar{x}_2)^2 + \sum_{i=1}^{r} (x_{i3} - \bar{x}_3)^2 + \dots + \sum_{i=1}^{r} (x_{ik} - \bar{x}_k)^2$$
$$= [(26 - 22.75)^2 + (25 - 22.75)^2 + (28 - 22.75)^2 + (12 - 22.75)^2]$$
$$+ [(18 - 17.25)^2 + (16 - 17.25)^2 + (17 - 17.25)^2 + (18 - 17.25)^2]$$
$$+ [(23 - 24.50)^2 + (19 - 24.50)^2 + (26 - 24.50)^2 + (30 - 24.50)^2]$$
$$= 10.5625 + 5.0625 + 27.5625 + 115.5625 + 0.5625 + 1.5625 + 0.0625 + 0.5625 + 2.25 + 30.25 + 2.25 + 30.25$$
$$= 226.50$$
This sum is also called the error sum of squares. Alternatively,
$$SSW = SST - SSB$$
$$= 341 - 114.50$$
$$= 226.50$$
STEP 5. Calculate the average variation between and within samples (the mean squares).
Total df = Between samples df + Within samples df
$$n - 1 = (k - 1) + (n - k)$$
Where: k = number of samples = 3
n = number of observations = 12
$$12 - 1 = (3 - 1) + (12 - 3)$$
$$11 = 2 + 9$$
$$MSB = \frac{SSB}{k - 1} = \frac{114.50}{3 - 1} = 57.25$$
$$MSW = \frac{SSW}{n - k} = \frac{226.50}{12 - 3} = 25.167$$
$$MST = \frac{SST}{n - 1} = \frac{341}{12 - 1} = 31$$
where:
MSB = mean square between samples
MSW = mean square within samples
k − 1 = df associated with the sum of squares among samples
n − k = df associated with the sum of squares within samples
When these sums of squares are divided by their associated degrees of freedom, we get the
following variances or mean square terms:
$$MSB = \frac{SSB}{k - 1}; \quad MSW = \frac{SSW}{n - k}; \quad \text{and} \quad MST = \frac{SST}{n - 1}$$
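The degrees of freedom and mean squares for the tire example can be reproduced with a few lines of Python. This is a sketch only; the sums of squares are taken from the worked example and the variable names are assumptions.

```python
# Sums of squares from the tire example.
ssb, ssw, sst = 114.5, 226.5, 341.0
k, n = 3, 12  # number of samples, total number of observations

# Degrees of freedom: the between and within parts add up to the total.
df_between = k - 1
df_within = n - k
df_total = n - 1
assert df_total == df_between + df_within  # 11 == 2 + 9

# Mean squares: each sum of squares divided by its degrees of freedom.
msb = ssb / df_between
msw = ssw / df_within
mst = sst / df_total

print(msb, round(msw, 3), mst)  # 57.25 25.167 31.0
```

Note that the mean squares are not additive: MSB + MSW does not equal MST, even though SSB + SSW equals SST.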
STEP 6. Apply the F test with k − 1 degrees of freedom for the numerator and n − k degrees
of freedom for the denominator.
$$F = \frac{\sigma^2_{Between}}{\sigma^2_{Within}} = \frac{SSB / (k - 1)}{SSW / (n - k)} = \frac{MSB}{MSW}$$
$$= \frac{57.25}{25.167}$$
$$= 2.27$$
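Steps 2 to 6 can be gathered into one small function. The sketch below uses plain Python; the function name `one_way_anova` and its return format are assumptions made for illustration, not an established API.

```python
def one_way_anova(samples):
    """Return (F, df_between, df_within) for a list of samples.

    Works for unequal sample sizes because each sample mean is
    weighted by its own size n_j.
    """
    k = len(samples)
    n = sum(len(x) for x in samples)
    grand_mean = sum(sum(x) for x in samples) / n
    means = [sum(x) / len(x) for x in samples]

    # Between-sample and within-sample sums of squares.
    ssb = sum(len(x) * (m - grand_mean) ** 2 for x, m in zip(samples, means))
    ssw = sum((v - m) ** 2 for x, m in zip(samples, means) for v in x)

    # Mean squares and their ratio.
    msb = ssb / (k - 1)
    msw = ssw / (n - k)
    return msb / msw, k - 1, n - k


tires = [[26, 25, 28, 12], [18, 16, 17, 18], [23, 19, 26, 30]]
f_stat, df1, df2 = one_way_anova(tires)
print(round(f_stat, 2), df1, df2)  # 2.27 2 9
```

Where SciPy is installed, `scipy.stats.f_oneway(*tires)` should report the same F statistic and can serve as a cross-check.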
STEP 7. Make a decision regarding the null hypothesis.
Reject $H_0$ if the calculated value of F is greater than its critical value $F_{\alpha,(k-1,\,n-k)}$.
Otherwise, do not reject $H_0$.
The test statistic is F-distributed with k − 1 and n − k degrees of freedom provided that the
response variable is normally distributed. The F distribution is used to test and estimate the ratio of
two population variances. The degrees of freedom for this application are the denominators of
the mean squares, that is, $v_1 = k - 1$ and $v_2 = n - k$:
$v_1 = k - 1 = 3 - 1 = 2$
$v_2 = n - k = 12 - 3 = 9$
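Figure 1. Rejection region for the null hypothesis using ANOVA (the region to the right of the critical value $F_{\alpha,(k-1,\,n-k)}$)
In order to test $H_0$ at the 0.05 level of significance, we use $F_{0.05,(3-1,\,12-3)} = F_{0.05,(2,9)} = 4.26$.
Since the F test statistic is less than the F table value, i.e. 2.27 < 4.26, we do not reject the null hypothesis.
At the 5% level, the data do not show a significant difference in mean durability among the three brands of tires.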
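The decision rule of Step 7 reduces to a single comparison. In the sketch below the critical value 4.26 is hard-coded from the F table value quoted in the text rather than computed, and the variable names are illustrative assumptions.

```python
# Test statistic from Step 6: MSB / MSW.
f_stat = 57.25 / (226.5 / 9)

# Critical value F_{0.05,(2,9)} as read from the F table in the text.
f_critical = 4.26

# Reject H0 only if the statistic falls in the right-tail rejection region.
reject_h0 = f_stat > f_critical

print(round(f_stat, 2), reject_h0)  # 2.27 False
```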