Why Study the Chi-Square Test and ANOVA?
We have already learned how to test hypotheses using data from either one or two samples. Suppose
we have data from five populations instead of two. Chi-square tests enable us to test whether more than
two population proportions can be considered equal. A proportion is the fraction of a sample that falls
into a particular category (e.g. if a pharma company is testing two drugs on three populations, the
proportion, or percentage, of treatment successes and failures for each drug in each of these
populations will be important knowledge). If we classify a population into several categories with
respect to two attributes (such as age and job performance), we can then use a chi-square test to
determine whether the two attributes are independent of each other.
The analysis of variance, or ANOVA, will enable us to test whether more than two population means
can be considered equal.
Chi-Square Statistics
We will study this topic using an example for ease of understanding.
Chi-square Distribution
If the null hypothesis is true, then the sampling distribution of the chi-square statistic, χ², can be
closely approximated by a continuous curve known as a chi-square distribution. The important
assumptions required for this approximation are:
1. The sample observations should be independent of each other.
2. The sample size should be large (as a rule of thumb, more than 50).
3. The sum of the observed frequencies (fo) must equal the sum of the expected frequencies (fe).
The chi-square distribution is a probability distribution. Therefore, the total area under the curve in
each chi-square distribution is 1.0.
To use a chi-square hypothesis test, we must have a sample size large enough to guarantee the
similarity between the theoretically correct distribution and our sampling distribution of χ², the
chi-square statistic. When the expected frequencies are too small, the value of χ² will be overestimated
and will result in too many rejections of the null hypothesis. To avoid making incorrect inferences
from χ² hypothesis tests, follow the general rule that an expected frequency of less than 5 in any cell
of a contingency table is too small to use. When the table contains more than one cell with an
expected frequency of less than 5, we can combine these cells in order to get an expected frequency of 5
or more.
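The rule above can be checked mechanically. The following is a minimal sketch, assuming a test of independence in which each expected frequency is (row total × column total) / grand total; the 2 × 3 table and the function names are hypothetical and chosen only for illustration.

```python
def expected_frequencies(observed):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_statistic(observed, expected):
    """Sum of (fo - fe)^2 / fe over every cell of the table."""
    return sum(
        (fo - fe) ** 2 / fe
        for obs_row, exp_row in zip(observed, expected)
        for fo, fe in zip(obs_row, exp_row)
    )


def cells_below_minimum(expected, minimum=5):
    """Return (row, column) positions whose expected frequency is below the minimum."""
    return [
        (i, j)
        for i, row in enumerate(expected)
        for j, fe in enumerate(row)
        if fe < minimum
    ]


# Hypothetical 2 x 3 contingency table (rows: two outcomes, columns: three groups).
observed = [
    [20, 30, 50],
    [30, 20, 50],
]

expected = expected_frequencies(observed)
small_cells = cells_below_minimum(expected)
chi2 = chi_square_statistic(observed, expected)

print("expected frequencies:", expected)
print("cells with fe < 5:", small_cells)
print("chi-square statistic:", chi2)
```

If `small_cells` is not empty, the corresponding categories would be combined before the statistic is computed.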
So far, we have rejected the null hypothesis if the difference between the observed and expected
frequencies, that is the chi-square statistic, is too large. But if the chi-square value was zero, we
should be careful to question whether absolutely no difference exists between observed and expected
frequencies. If we have strong feelings that some difference ought to exist, we should examine either
the way the data were collected or the manner in which measurements were taken, or both, to be
certain that existing differences were not obscured or missed in collecting sample data.
Chi-Square Test as a Goodness of Fit
The chi-square test can also be used to decide whether a particular probability distribution, such as the
binomial, Poisson, or normal, is the appropriate distribution.
Question.
Solution.
Under a Poisson distribution with an expectation of λ events in a given interval, the probability
of k events in the same interval is:
$$P(k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$
The expected frequency of each class is (total number of observations) x (probability of that class
under the fitted distribution).
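A goodness-of-fit calculation for the Poisson case might look like the sketch below. It assumes that the expected frequency of each class is the total number of observations times the Poisson probability of that class, and that λ is estimated by the sample mean. The observed counts are hypothetical, and the last class is treated as "4 or more".

```python
import math


def poisson_probability(k, lam):
    """P(k events) = lam^k * e^(-lam) / k!"""
    return lam ** k * math.exp(-lam) / math.factorial(k)


# Hypothetical data: observed[k] = number of intervals in which k events occurred.
# The final entry is an open class covering 4 or more events.
observed = [60, 72, 44, 18, 6]
n = sum(observed)

# Assumption: estimate lambda by the sample mean, counting the open class as exactly 4.
lam = sum(k * fo for k, fo in enumerate(observed)) / n

# Probabilities for k = 0..3; the open class receives the remaining probability.
probabilities = [poisson_probability(k, lam) for k in range(len(observed) - 1)]
probabilities.append(1.0 - sum(probabilities))

# Expected frequency of each class = n * probability of that class.
expected = [n * p for p in probabilities]

# Chi-square statistic: sum of (fo - fe)^2 / fe over the classes.
chi2 = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

# Degrees of freedom: classes - 1 - number of parameters estimated from the data.
degrees_of_freedom = len(observed) - 1 - 1

print("lambda estimate:", lam)
print("expected frequencies:", [round(fe, 2) for fe in expected])
print("chi-square statistic:", round(chi2, 4), "with", degrees_of_freedom, "degrees of freedom")
```

Because the probabilities sum to 1, the expected frequencies sum to the number of observations, in line with the requirement that the observed and expected totals agree.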
Analysis of Variance: ANOVA
Using analysis of variance, we will be able to make inferences about whether our samples are drawn
from populations having the same mean. In order to use analysis of variance, we must assume that
each of the samples is drawn from a normal population and that each of these populations has the
same variance, σ². However, if the sample sizes are large enough, we do not need the assumption of
normality. The three steps in analysis of variance are:
1. Determine one estimate of the population variance from the variance among the sample means.
2. Determine a second estimate of the population variance from the variance within the samples.
3. Compare these two estimates. If they are approximately equal in value, accept the null hypothesis.
If the two estimates differ considerably, the null hypothesis is unlikely to be true.
We will try to understand the use of ANOVA through a case study:
The director wonders whether there are differences in effectiveness among the methods.
When populations are not the same, the between-column variance (which was derived from the
variance among the sample means) tends to be larger than the within-column variance (which was
derived from the variances within the samples), and the value of F tends to be large. This leads us to
reject the null hypothesis.
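The three steps can be sketched in a few lines. This is an illustration only: the scores for the three methods are assumed values, the function name is hypothetical, and F is taken as the ratio of the between-column variance to the within-column variance.

```python
def one_way_anova_f(samples):
    """Return (between-column variance, within-column variance, F) for a list of samples."""
    k = len(samples)                          # number of samples (columns)
    n_total = sum(len(s) for s in samples)    # total number of observations
    means = [sum(s) / len(s) for s in samples]
    grand_mean = sum(sum(s) for s in samples) / n_total

    # Step 1: estimate of the population variance from the variance among the sample means.
    between = sum(
        len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means)
    ) / (k - 1)

    # Step 2: estimate of the population variance from the variance within the samples.
    within = sum(
        (x - m) ** 2 for s, m in zip(samples, means) for x in s
    ) / (n_total - k)

    # Step 3: compare the two estimates through their ratio.
    return between, within, between / within


# Assumed scores for three methods (illustrative values only).
methods = [
    [15, 18, 19, 22, 11],
    [22, 27, 18, 21, 17],
    [18, 24, 19, 16, 22, 15],
]

between, within, f_ratio = one_way_anova_f(methods)
print("between-column variance:", round(between, 3))
print("within-column variance:", round(within, 3))
print("F ratio:", round(f_ratio, 3))
```

The F ratio would then be compared with the tabulated value for k - 1 degrees of freedom in the numerator and n - k in the denominator, where n is the total number of observations.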
The specific shape of an F distribution depends on the number of degrees of freedom in both the
numerator and the denominator of the F ratio. But, in general, the F distribution is skewed to the right
and tends to become more symmetrical as the numbers of degrees of freedom in the numerator and
denominator increase.
To do F hypothesis tests, we shall use an F table in which the columns represent the number of
degrees of freedom for the numerator and the rows represent the degrees of freedom for the
denominator.
Simple Regression and Correlation Analysis
Regression and correlation analyses show us how to determine both the nature and the strength of a
relationship between two variables. In regression analysis, we shall develop an estimating equation,
that is, a mathematical formula that relates the known variables to the unknown variable. Then, after
we have learned the pattern of this relationship, we can apply
correlation analysis to determine the degree to which the variables are related. Correlation analysis,
then, tells us how well the estimating equation actually describes the relationship.
Regression and correlation analyses are based on the relationship, or association, between two (or
more) variables. The known variable (or variables) is called the independent variable(s). The variable
we are trying to predict is the dependent variable. The relationship between such variables may be
direct or inverse. We may also find a causal relationship, in which the independent variable
causes the dependent variable to change.
Method of Least Squares
We have used Y to represent the individual values of the observed points measured along the Y-axis.
Now we should begin to use Ŷ (Y hat) to symbolize the individual values of the estimated points,
that is, the points that lie on the estimating line. Accordingly, we shall write the equation for the
estimating line as:
$$\hat{Y} = a + bX$$
where a is the Y-intercept and b is the slope of the line.
The standard error of estimate measures the variability, or scatter, of the observed values around the
regression line.
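As a rough illustration, the sketch below fits a line by least squares and computes the standard error of estimate. It assumes the estimating line has the usual form Ŷ = a + bX and that the standard error uses n - 2 in the denominator. The data points and function names are hypothetical.

```python
import math


def least_squares_line(xs, ys):
    """Return (a, b) for the line Y_hat = a + b*X that minimizes the squared errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: sum of cross-deviations divided by sum of squared X deviations.
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    # Intercept: the fitted line passes through the point (x_bar, y_bar).
    a = y_bar - b * x_bar
    return a, b


def standard_error_of_estimate(xs, ys, a, b):
    """Scatter of the observed Y values around the fitted line, using n - 2."""
    n = len(xs)
    squared_errors = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(squared_errors / (n - 2))


# Hypothetical observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)
se = standard_error_of_estimate(xs, ys, a, b)
print("intercept a:", round(a, 4))
print("slope b:", round(b, 4))
print("standard error of estimate:", round(se, 4))
```

A smaller standard error means the observed points lie closer to the estimating line.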
Correlation Analysis
Correlation analysis is the statistical tool we can use to describe the degree to which one variable is
linearly related to another.
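One common measure of this degree of linear relationship is the sample correlation coefficient r. The sketch below is an assumption about which measure is intended, since the notes do not give a formula here; the data and function name are hypothetical.

```python
import math


def correlation_coefficient(xs, ys):
    """Sample correlation coefficient r between two equally long lists of numbers."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation_coefficient(xs, ys)
print("r:", round(r, 4))
print("r squared:", round(r ** 2, 4))
```

Values of r near +1 or -1 indicate a strong direct or inverse linear relationship, and values near 0 indicate little linear relationship.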
Regression and correlation analyses can in no way determine cause and effect.
Completely Randomized Single Factor Experiments: Fixed and Random Effects Model
We will study this topic through examples and the application of ANOVA.
Random Effects Model
Nested Effects Model