
Statistical Inference and Confidence Intervals

The document outlines the principles of statistical inference, including concepts such as population vs. sample, sampling distributions, confidence intervals, and hypothesis testing. It discusses the importance of estimating population parameters using sample data and the implications of type I and type II errors in decision-making. Additionally, it explains the central limit theorem, standard error, and the calculation of confidence intervals for various statistical measures.

Uploaded by

DAYO ADIGUN
Copyright
© All Rights Reserved

STATISTICAL INFERENCE

By
Dr. Onoja M. AKPA

1
Lecture outline
• Population: sample and sampling distributions
• Parameters and statistics, standard error of the
mean
• 95% Confidence intervals
• Hypothesis testing
• Errors in decision making (type 1 and 2 errors)
• Level of significance
• Choice of appropriate test statistics in inference
• Parametric and non parametric tests
2
What is Statistical Inference?

• This involves the procedures for drawing
conclusions about a population parameter
on the basis of the results obtained
from a sample drawn from the
population.

3
SAMPLE vs Population
• Usually the total population of interest is larger than
available resources can cover
• Or it may be so general that it does not exist in
a conventional sense, such as the population of
individuals who will require open heart
surgery.
• The inevitable alternative is to use a SAMPLE

4
SAMPLING VARIATIONS
• Results from individuals vary
• Results from different samples vary
• Sample means less variable than individual results.
• The two parameters describing the Normal Distribution
(μ and σ) are the parameter values describing
the properties of our population of interest
• True population parameters are difficult to measure,
hence we rely on sample data

5
Reliability of SAMPLE MEAN
• SAMPLE MEANS show less variation than individual
values
• Variation in estimates of the mean from repeated samples is
reduced because of the smoothing effect that
characterizes the averaging process
• The more variable the individual values, the more
variable the sample estimates
• Variability of sample means depends on sample size.
• There is a need to test the hypothesis that the variation observed in
the sample is not due to chance

6
Sampling Distributions
• A sampling distribution is the distribution of
sample estimates obtained from repeated
samples taken from the population
• The sampling distribution for the mean,
median, standard deviation, proportion and
any other statistic can be computed
• The central limit theorem describes the
properties of sampling distributions
• Parameter and Estimate
7
Central Limit Theorem
• Sampling distributions of the mean are approximately normally
distributed when the sample size is sufficiently large, regardless of the nature of the variable in
the parent population
• The mean of the sampling distribution is equal to the
true population mean
• The mean of sample means is an unbiased estimate of the
true population parameter (mean)
• The standard deviation (SD) of the sampling distribution is
directly proportional to the population SD and
inversely proportional to the square root of the
sample size
8
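The properties listed on this slide can be checked with a small simulation. The sketch below is only an illustration: the exponential population (mean 1, SD 1), the sample size of 25, the 2000 replicates and the function name `simulate_sample_means` are all assumptions chosen for the example, not part of the lecture.

```python
import random
import statistics

def simulate_sample_means(n, replicates, seed=0):
    """Draw `replicates` samples of size n from a skewed (exponential)
    population with mean 1 and SD 1, and return the sample means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(replicates)
    ]

means = simulate_sample_means(n=25, replicates=2000)
mean_of_means = statistics.fmean(means)  # expected to be close to 1
sd_of_means = statistics.stdev(means)    # expected to be close to 1 / sqrt(25) = 0.2
print(round(mean_of_means, 3), round(sd_of_means, 3))
```

Even though the parent population is skewed, the mean of the sample means should sit near the population mean, and their SD near the population SD divided by the square root of the sample size.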
Standard Error
 The standard deviation of the sampling distribution is
called the standard error.
 e.g. SE of the mean = σ/√n
 Where σ is the population standard deviation and n is the
sample size
 It is the measure of precision of the estimate i.e the price
we pay for taking a sample
 Precision is a measure of reliability of the sample statistic
as an estimate of population parameter.
 The larger the sample size, the less the error in its
estimate
9
Standard Error of the Mean
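For a single sample:
$SE(\bar{x}) = s / \sqrt{n}$
where s is the sample standard deviation and n is the sample size
For two independent samples:
$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{s_1^2 / n_1 + s_2^2 / n_2}$
where s1 and s2 are the standard deviations of the groups
and n1 and n2 are the sample sizes in each group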
10
Standard Error of the Proportion
• For a single sample:
$SE(p) = \sqrt{p(1 - p) / n}$
where p is the proportion with the characteristic in the
population and n is the size of the sample
• For two independent samples:
$SE(p_1 - p_2) = \sqrt{p_1(1 - p_1) / n_1 + p_2(1 - p_2) / n_2}$
where p1 and p2 are the proportions and n1 and n2 are the
sample sizes for groups 1 and 2 respectively
11
Standard Deviation or Standard Error
• Standard deviation measures the
variability of
individual observations, while the
sampling variability of
sample means is measured by the
standard error of the mean.
Standard Deviation or Standard Error
• Quote standard deviation if interest is in
the variability of individuals as regards the
level of the factor being investigated –
FBS, SBP, Age and cholesterol level.
• Quote standard Error if emphasis is on the
estimate of a population parameter.
It is a measure of uncertainty in the
sample statistic as an estimate of
population parameter.
ESTIMATION
• A major purpose or objective of Health Research is to
estimate certain population characteristics or
phenomena
• Characteristic or phenomenon can be quantitative
such as average SYSTOLIC BLOOD PRESSURE of adult
men or qualitative such as proportion with
MALNUTRITION
• Can be POINT or INTERVAL ESTIMATE

14
INTERVAL ESTIMATE

• Not sufficient to rely on a single estimate


• Other samples could yield plausible estimates
• Comfortable to find a range of values within which to
find all possible mean values
• Highly probable for true values to lie within this range
• Properties of normal distribution allows us to define
the confidence interval for the estimate.

15
WHAT IS A CONFIDENCE INTERVAL

– A confidence interval is a range of likely values for an unknown
population parameter at a given confidence level.
– It expresses the uncertainty attached to a sample estimate on account
of sampling error.
– It is only by convention that the 95% confidence level is commonly
chosen.
CI: Statistic ± $z_{1-\alpha/2} \times \sigma / \sqrt{n}$
95% CI: Statistic ± 1.96 × S.E.
• 95% of the confidence intervals computed from repeated sample
estimates $\bar{x}$ contain the true population mean µ
16
CONFIDENCE INTERVAL
 Confidence interval is a range of likely
values for an unknown population
parameter at a given confidence level.
 It contains the true value of a
population parameter with a stated degree
of confidence
 Its width depends on n, SE and degree of
confidence.
 Uses the concept of sampling distributions
to compute intervals at appropriate
probability levels
17
WHAT IS A CONFIDENCE INTERVAL

– For a proportion
– The 95% confidence interval is given as:
CI: Proportion ± $z_{1-\alpha/2}$ × [SD of proportion / $\sqrt{n}$]
$p \pm 1.96 \times \sqrt{pq / n}$

• p is the sample proportion and q = 1 − p

18
Advantages of using confidence intervals.

• Tell us by how much an exposure affects the subjects.
• Focus is on the range of values to be considered plausible for the
population.
• Convey information on the magnitude of differences in, say,
blood glucose levels between diabetic and non-diabetic patients.
• Give estimates of differences in, say, glucose levels that would
have been obtained if the total population were studied.

19
Confidence Interval (C.I.)
[Figure: a confidence interval drawn as a line from the lower confidence limit to the upper confidence limit, with the point estimate (e.g. mean, proportion, RR, OR) marked x between them]

20
What affects the CI?
CI is affected by
 Level of confidence/significance (i.e alpha
value)
 Sample size
 Variation in the data
 For RR and OR, the strength of association

21
Advantage of using confidence
intervals.
• Tell us by how much an exposure affects the subjects.
• Focus is on the range of values to be considered plausible for
the population.
• Convey information on the magnitude of differences in,
say, blood glucose levels between diabetic and
non-diabetic patients.
• Give estimates of differences in, say, glucose levels that
would have been obtained if the total population were studied,
and so the degree of control by the treatment.
• The width of a confidence interval reveals lack of
precision in an estimate.
• Ability to distinguish statistically significant from non-significant
effects.

22
A Typical Result

• “The difference between the sample
means of systolic pressure in diabetics
and non-diabetics was 6.0 mmHg with a
95% CI: 1.1 to 10.9 mmHg; the test
statistic t was 2.4, with 198 d.f. and
an associated P-value of 0.02.”

• It is left for the physician to conclude whether a
treatment that can reduce systolic
pressure by this much is medically
important or beneficial!
Confidence Intervals & Statistical
Significance
• For comparison of mean values or
proportions, a confidence interval
containing zero implies the data are
consistent with Ho.
• For Relative Risks or Odds Ratios,
a confidence interval containing 1
implies the risk factor is not statistically
significant.

24
Confidence Intervals (CI)-
Proportions
Single Proportion (Prevalence of disease)
CI: $P \pm z_{1-\alpha/2} \times S.E.(P)$

CI = ($P - z_{1-\alpha/2} \times S.E.(P)$, $P + z_{1-\alpha/2} \times S.E.(P)$)
i.e. (lower limit, upper limit).

Two-sample unpaired data (independent)

CI: $(P_1 - P_2) \pm z_{1-\alpha/2} \times S.E.(P_1 - P_2)$

Therefore,
CI: $(P_1 - P_2) - z_{1-\alpha/2} \times S.E.(P_1 - P_2)$ to
$(P_1 - P_2) + z_{1-\alpha/2} \times S.E.(P_1 - P_2)$
Calculating confidence limits for
Proportion

• A total of 129 women participated in an
anticoagulant study. If 89 women survived a
heart attack, what is the confidence interval for
the proportion of survival in this study?
– Note: use 95% confidence level

26
ANSWER

• P = 89/129 = 0.69
• CI: Statistic ± $z_{1-\alpha/2} \times \sqrt{p(1-p)/n}$
Statistic ± 1.96 × S.E.

• Standard error = $\sqrt{0.69 \times 0.31 / 129} = 0.04072$
• 95% confidence limits = 0.69 ± 1.96 × 0.0407
= (0.61, 0.77)
or (61%, 77%)
• 99% confidence limits = 0.69 ± 2.58 × 0.0407
= (0.58, 0.80)
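The calculation above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `proportion_ci` is hypothetical, and the z multipliers come from `statistics.NormalDist` rather than from a printed table, so the limits may differ from the slide in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a single proportion."""
    p = successes / n
    se = sqrt(p * (1 - p) / n)                          # standard error of p
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # e.g. about 1.96 for 95%
    return p - z * se, p + z * se

low95, high95 = proportion_ci(89, 129, 0.95)
low99, high99 = proportion_ci(89, 129, 0.99)
print(round(low95, 2), round(high95, 2))
print(round(low99, 3), round(high99, 3))
```

The same function can be applied to the myopia exercise that follows by passing 20 successes out of 120.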

27
Exercise

• A total of 120 Yoruba children were screened for
myopia in a rural village and 20 of these children
were determined to be myopic. Calculate the
95% and 99% confidence intervals for the true
prevalence of myopia in this population.

28
Exercise
The table below shows the risk of anthrax
among persons exposed and unexposed to
slaughtering cows. Find the proportions of ill
and non-ill persons in the two groups. Also
build a 95% CI around the difference between the two
proportions.
               Ill    Non ill    Total
Exposed         20       4        24
Non exposed     25     247       272
Total           45     251       296
29
Paired Samples – (Dependent).
Qualitative data
• Each subject serving as its control
• Observation on same unit on two occasions
• Same group exposed to different treatments
twice
• Interest is on presence or absence of an attribute
in the two occasions.
• Present result as number of pairs with possible
attributes.

30
Quantitative Data Confidence interval
for the single estimate of mean value.
• Let the single estimate be $\bar{x}$.
Then the CI is given by $\bar{x} \pm t_{1-\alpha/2} \times S.E.(\bar{x})$
C.I.: ($\bar{x} - t_{1-\alpha/2} \times S.E.(\bar{x})$, $\bar{x} + t_{1-\alpha/2} \times S.E.(\bar{x})$)

• Example: The mean diastolic blood pressure from 16
subjects is 90.0 mm Hg, and the standard deviation is 14
mm Hg. Calculate its standard error and 95% confidence
limits. Compare with its 99% confidence interval.
• Answer: Standard error – 3.5
95% confidence limits – 82.55 to 97.46
99% confidence limits – 79.68 to…….
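The 95% limits in this example can be sketched in code as follows. The function name `mean_ci` is hypothetical, and the critical value 2.131 is an assumption taken from a standard t table for 15 degrees of freedom (n − 1); it is passed in by hand because the Python standard library has no t distribution.

```python
from math import sqrt

def mean_ci(mean, sd, n, t_crit):
    """Confidence interval for a single mean, given a t critical value
    looked up from tables for n - 1 degrees of freedom."""
    se = sd / sqrt(n)  # standard error of the mean
    return se, mean - t_crit * se, mean + t_crit * se

se, low, high = mean_ci(mean=90.0, sd=14.0, n=16, t_crit=2.131)
print(se, round(low, 2), round(high, 2))
```

This gives a standard error of 3.5 and limits of roughly 82.5 to 97.5 mm Hg; small differences from the slide come from rounding of the t value.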

31
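• CI for comparison of two mean values
– C.I.: $(\bar{x}_1 - \bar{x}_2) \pm t_{1-\alpha/2} \times S.E.(\bar{x}_1 - \bar{x}_2)$
– $S.E.(\bar{x}_1 - \bar{x}_2) = \sqrt{S.E.(\bar{x}_1)^2 + S.E.(\bar{x}_2)^2}$
$S.D. = \sqrt{\sum (x_i - \bar{x})^2 / (n - 1)}$
• Pooled estimate: $S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$

• $S.E.(\bar{x}_1 - \bar{x}_2) = S_p \sqrt{1/n_1 + 1/n_2}$

• C.I.: ($(\bar{x}_1 - \bar{x}_2) - t_{1-\alpha/2} \times S.E.(\bar{x}_1 - \bar{x}_2)$, $(\bar{x}_1 - \bar{x}_2) + t_{1-\alpha/2} \times S.E.(\bar{x}_1 - \bar{x}_2)$)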

32
Example
• Among a group of 81 primary
health care physicians with less than 10
years' experience, the mean score on knowledge of the
adverse effects of the diagnosis of
depression was 35.94, SD = 4.60. Among 64 primary
care physicians with more than 10 years'
experience, the mean score was 39.8, SD = 4.05.
• Obtain the difference between the means
• Obtain the 95% CI for the mean difference.
33
What is a Statistical Hypothesis?
 It is a statement of fact yet to be tested.
 It is a conjecture about the characteristics or
phenomena of a population that has not been
verified.
 For instance, an endocrinologist may say
NIDDM patients are generally younger than
IDDM patients.
 Is this true? The answer is either Yes or No
 The truth will only come out after investigation
34
Hypothesis testing
• Hypothesis is a statement or concept about a
phenomenon that has not been verified
• Procedure for investigating/verifying whether
these statements are true is known as
hypothesis testing
• Can be stated as a negative or positive
statement
– Null hypothesis
– Alternative hypothesis
35
Steps in Hypothesis Tests
 Write down the two hypotheses
 Determine the appropriate test statistic
to be used
 Determine the significance level of the
test
 Decide on the distribution of the test
statistic and the sidedness of the test i.e
whether single or double
 Decide the boundary or boundaries of
the rejection region
 Give the decision rule
36
Types of Hypothesis
1. Null Hypothesis
It is a statement that forms the basis
of investigation in a significant test. It
is a statement that does not prejudge
the phenomenon of interest.
It is a statement of no difference
It is a statement of no association
It is a statement of no effect
It is a statement of equality

37
2. Alternative Hypothesis
 This is the hypothesis formulated as
an alternative to the Null hypothesis
which the investigators will accept if
the null hypothesis is rejected.
 It specifies what forms of departure
from the null hypothesis are of
potential concern.
 It is a statement of inequality

38
 A full specification of Hypothesis
must be one of the following:

39
Errors in Hypothesis Tests
 Type I error
This is the error committed when
the null hypothesis is rejected when
in actual fact it is true (α)
 Type II error
This is the error committed when
we fail to reject the null hypothesis
when in actual fact it is false (β)
40
Result Probabilities
H0: Innocent

Jury Trial
                 The Truth
Verdict          Innocent     Guilty
Innocent         Correct      Error
Guilty           Error        Correct

Hypothesis Test
                     The Truth
Decision             H0 True              H0 False
Do Not Reject H0     1 − α                Type II Error (β)
Reject H0            Type I Error (α)     Power (1 − β)

41
Level of statistical error
• Also called level of significance, alpha (α) or
type 1 error rate; the p value is compared against it
• It is the maximum probability of committing a type
one error.
• Chance of wrongfully rejecting the null
hypothesis
• A small p value (less than a pre-specified value,
usually 0.05) leads to the rejection of the null
• Power (1-β) of a test is its ability to reject a null
hypothesis when it is false.
42
P-Value
• A p-value is a measure of how much evidence we have
against the null hypothesis
• It measures consistency with Ho by finding the probability of
observing results as extreme as, or more extreme than,
those in the sample, assuming Ho is true.
• Smaller p-value implies greater inconsistency with Ho
• The smaller the p-value, the more evidence we have
against Ho and vice versa
• Conventionally, we reject Ho if p-value < 0.05, implying
significance. Otherwise we fail to reject Ho; we do not accept it

43
Choice of test statistic
• Test statistic is a particular form of data summary to
be used in a test.
• It is computed from the data to enable comparison
with percentage points of the appropriate
distribution
• Depends on
– Study objective (estimation, comparison)
– Study design(independent or dependent samples)
– Variable (data) type – quantitative or qualitative
– Sample size and sampling method(random or not)
– Sampling distribution
44
Selecting the appropriate Test Statistics
for test involving 2 or more samples
• Relationship between two quantitative variables
– Correlation analysis (strength of relationship)
– Linear regression analysis(prediction of one variable from
the other
• Relationship between two qualitative variables
– Chi square test
• Relationship between one qualitative and one
quantitative variable
– Can be reduced to mean difference between two or more
groups
– if two groups (t test)
– If more than two groups Analysis of variance (ANOVA)

45
The test statistic for t-test is,

The test statistic for z-test is,

46
Parametric tests of significance
• Significance tests which follow certain
assumptions about the type of distribution of
the variable of interest in the population
• The most common distribution assumed is the
normal distribution
• Examples are z test, t test and F test (Analysis
of variance)

47
Non parametric tests
• Do not make any distributional assumptions
about the underlying variable in the
population
• Appropriate for skewed data, graded
responses and analyses requiring little or no
assumptions about the underlying variable
• Examples: Mann-Whitney U, Kruskal-Wallis,
Chi square test, Wilcoxon signed rank test etc

48
Nonparametric vs Parametric

• Sign Test                      • One-sample t-test
• Mann-Whitney Test              • Two-sample t-test (independent)
• Wilcoxon signed-rank Test      • Two-sample t-test (dependent)
• Spearman Rank Test             • Correlation/Regression
• Kruskal-Wallis Test            • One-way ANOVA
• Friedman Test                  • One-way blocked ANOVA or two-way ANOVA

49
Proportion
• Sample proportion in the success category is
denoted by $\hat{p}$

$\hat{p} = \frac{x}{n} = \frac{\text{number of successes}}{\text{sample size}}$
• When both np and n(1-p) are at least 5, $\hat{p}$ can be
approximated by a normal distribution with mean
$\mu_{\hat{p}} = p$ and standard deviation

$\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}$
50
Example: Z Test for Proportion

Q. A researcher claims that
he receives 4% of
responses to his survey.
To test this claim, a
random sample of 500
were surveyed with 25
responses. Test at the α =
.05 significance level.

Check:
np = 500(.04) = 20 ≥ 5
n(1 − p) = 500(1 − .04) = 480 ≥ 5
51
Z Test for Proportion: Solution
H0: p = .04
Ha: p ≠ .04
α = .05
n = 500
Critical values: ±1.96 (rejection region in each tail, area .025 each)

Test statistic:
$z = \frac{\hat{p} - p}{\sqrt{pq / n}} = \frac{.05 - .04}{\sqrt{.04(.96) / 500}} = 1.14$

Decision:
Do not reject at α = .05

Conclusion:
We do not have
sufficient evidence to
reject the researcher’s
claim of 4% response
rate.
52
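p-Value Solution
(p-value = 0.2542) ≥ (α = 0.05).
Do Not Reject.
p-value = 2 × .1271

[Figure: standard normal curve with rejection regions beyond ±1.96 (total area α = 0.05); the test statistic z = 1.14 lies between 0 and 1.96]
Test statistic 1.14 is in the Do Not Reject
region
53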
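The test statistic and two-sided p-value for the survey example can be recomputed as a check. This is a minimal sketch with a hypothetical function name, `one_proportion_z_test`; it uses the unrounded z, so the p-value comes out near 0.254 rather than exactly the 0.2542 obtained from z rounded to 1.14.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0):
    """Two-sided z test of H0: p = p0 using the normal approximation."""
    p_hat = successes / n
    se0 = sqrt(p0 * (1 - p0) / n)  # standard error under H0
    z = (p_hat - p0) / se0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p_value = one_proportion_z_test(successes=25, n=500, p0=0.04)
print(round(z, 2), round(p_value, 3))
```

Because the p-value exceeds 0.05, the sketch agrees with the decision not to reject H0.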
Comparisons of two proportions
z-test statistic for comparing the difference between 2 proportions.
Suitable when n1 + n2 < 50

$z = \frac{P_1 - P_2}{\sqrt{\frac{P_1(1 - P_1)}{n_1} + \frac{P_2(1 - P_2)}{n_2}}}$

on n1 + n2 – 2 degrees of freedom

P1 = proportion with attribute in the first group.
P2 = proportion with attribute in the second group.
n1, n2 = sample sizes for the first and second groups

Example: In a Nigerian teaching hospital, 18 of the 25
male patients who visited the eye clinic on Feb 2nd 2011
were diagnosed with short-sightedness. On the same day, only
10 of the 25 females were diagnosed with the same problem. Is
there any difference between the proportions in the male
and female groups at the 5% significance level (95% confidence)?
54
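If n1 + n2 >= 50, use the Z-statistic

$Z = \frac{P_1 - P_2}{\sqrt{\frac{P_1(1 - P_1)}{n_1} + \frac{P_2(1 - P_2)}{n_2}}}$

$Z = \frac{18/25 - 10/25}{\sqrt{\frac{18/25 \times 7/25}{25} + \frac{10/25 \times 15/25}{25}}}$

$Z = \frac{0.72 - 0.40}{\sqrt{\frac{0.72 \times 0.28}{25} + \frac{0.40 \times 0.60}{25}}}$

Z = 2.4077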
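The eye-clinic calculation can be checked with the short sketch below. It follows the unpooled standard error used on the slide; the function name `two_proportion_z` is hypothetical.

```python
from math import sqrt

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for the difference between two independent proportions,
    using the unpooled standard error shown on the slide."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / se

z = two_proportion_z(18, 25, 10, 25)
print(round(z, 4))
```

The result exceeds 1.96, the two-sided 5% critical value quoted earlier in the lecture.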

55
The student t-test

• Used to compare mean values between
two groups.
• Assumes the data come from an underlying normal
distribution
• Has more area in the tails than the normal
distribution
• Distribution depends on degrees of freedom,
to take care of small sample sizes (less than
50).
• Widely used in medical studies.
56
Comparison of two mean values
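Independent Samples:
Use $t = \frac{\bar{x}_1 - \bar{x}_2}{S.E.(\bar{x}_1 - \bar{x}_2)}$
- $\bar{x}_1$ is the mean of the first group
- $\bar{x}_2$ is the mean of the second group
- $S.E.(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$ if unequal variances are assumed in the two groups
(pool the S if equal variance is assumed)

$S^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$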

57
Example on t Test
• Among a group of 81 primary
health care physicians with less than 10
years' experience, the mean score on knowledge of the
adverse effects of the diagnosis of
depression was 35.94, SD = 4.60. Among 64 primary
care physicians with more than 10 years'
experience, the mean score was 39.8, SD = 4.05. Test whether
the difference in knowledge of the
diagnosis of depression is statistically
significant. What interpretation can you
Solution:
                                   Mean     SD      Sample size
Group 1 (< 10 years experience)    35.94    4.60    81
Group 2 (> 10 years experience)    39.8     4.05    64

$t = \frac{35.94 - 39.8}{\sqrt{\frac{(4.60)^2}{81} + \frac{(4.05)^2}{64}}} = \frac{-3.86}{0.7194} = -5.3656$

Degrees of freedom = 81 + 64 - 2
= 143
P < 0.01
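The t statistic in this solution can be verified from the summary statistics alone. The sketch below uses the unequal-variance standard error applied in the solution; the function name `two_sample_t` is hypothetical.

```python
from math import sqrt

def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic for two independent means from summary statistics,
    with SE = sqrt(sd1^2 / n1 + sd2^2 / n2)."""
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / se, se

t, se = two_sample_t(35.94, 4.60, 81, 39.8, 4.05, 64)
print(round(se, 4), round(t, 4))
```

The statistic is negative because the less experienced group (group 1) has the lower mean score; its size, not its sign, determines the two-sided P-value.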

59
EXAMPLE ON T-TEST
• A total of 36 hypertensive individuals were
split into two groups of 18. Group 1 received a
diuretic therapy while Group 2 received a
diuretic therapy in combination with another
antihypertensive. After one month,
their diastolic blood pressures were measured
and the results summarized as follows: Group 1:
mean = 117.0, SD = 22; Group 2: mean = 93.0, SD = 20.
Was there any significant effect of therapy?
60
EXAMPLE ON PAIRED T-TEST
• A random sample of 6 patients with ischaemic heart disease
were treated with clofibrate and the concentration of their
plasma fibrinogen determined as follows
• Patient no.:   1    2    3    4    5    6
• Pre-value:     379  351  420  303  346  370
• Post-value:    325  333  391  275  311  323
• Does the treatment have any statistically significant effect?

61
Solution:
- Null Hypothesis:
There is no difference in the 10 measurements
- Alternative Hypothesis:
There is a difference in the measurements.
- Level of significance: 0.05
- Test statistic: Paired – t-test

62
Comparison of mean values
Dependent Groups
Use $t = \frac{\bar{d}}{S.E.(\bar{d})}$

where $\bar{d}$ = mean difference of pairs

$S.E.(\bar{d}) = \frac{S}{\sqrt{n}}$

where $S = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n - 1}}$

n = number of pairs

63
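Evaluation of test statistic:
Paired differences (di): 0, -2, 1, 1, -2, -5, 0, -1, -1

$\bar{d} = \frac{\sum d_i}{n} = \frac{-10}{10} = -1$

$SD(d) = \sqrt{\frac{\sum d_i^2 - (\sum d_i)^2 / n}{n - 1}} = \sqrt{\frac{38 - 100/10}{10 - 1}} = 1.764$

$S.E.(\bar{d}) = \frac{SD(d)}{\sqrt{n}} = \frac{1.764}{\sqrt{10}} = 0.5578$

t-value $= \frac{\bar{d}}{S.E.(\bar{d})} = \frac{-1}{0.5578} = -1.79$
t = -1.79 on 9 d.f.
Verdict: P > 0.1, Do not reject Ho.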
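The paired calculation can be sketched from the sums quoted in the working (n = 10, Σd = −10, Σd² = 38). The function name `paired_t_from_sums` is hypothetical. The sketch divides the mean difference by the standard error S/√n, as in the formula for dependent groups given earlier, which yields a t value of about −1.79.

```python
from math import sqrt

def paired_t_from_sums(n, sum_d, sum_d2):
    """Paired t statistic computed from the number of pairs, the sum of
    the differences and the sum of the squared differences."""
    mean_d = sum_d / n
    sd_d = sqrt((sum_d2 - sum_d ** 2 / n) / (n - 1))  # SD of the differences
    se_d = sd_d / sqrt(n)                             # standard error of the mean difference
    return mean_d, sd_d, se_d, mean_d / se_d

mean_d, sd_d, se_d, t = paired_t_from_sums(n=10, sum_d=-10, sum_d2=38)
print(mean_d, round(sd_d, 3), round(se_d, 4), round(t, 2))
```

With 9 degrees of freedom the magnitude of t is below the tabulated two-sided 10% value of 1.833, which matches the verdict P > 0.1.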
64
EXERCISE ON PAIRED T-TEST
• Seven pairs of twins were allocated at random
to two alternative diets. Their weight gain
after a fixed duration were as follows: Diet(A,
B) : (10,16), (17,20), (8,14), (15,15), (17,16),
(12,16), (14,17).
• DO THE DIETS SHOW ANY SIGNIFICANT
EFFECT ON WEIGHT GAIN?

65
Goodness of fit test
(One-Way Chi Square Test)
Compares observed frequencies within groups to their
expected frequencies.
Ho: “observed” frequencies are not different from the
“expected” frequencies.
Research hypothesis: They are different.
fo = observed frequency
fe = expected frequency
$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$
66
One-way Chi Square Interpretation
• Calculate the Chi Square statistic across all the categories.

• Degrees of freedom = k - 1, where k is the number of
categories.

• Compare the value to the table of Χ2.

• If our calculated value of chi square is less than the table
value, retain (do not reject) Ho

• If our calculated chi square is greater than the table value,
reject Ho

• …as with t-tests and ANOVA – all work on the same principle
for rejecting or retaining the null hypothesis
67
Example on Goodness of fit test
Cystic fibrosis is an inherited condition controlled by
the action of a single recessive gene, c. Sufferers have
the homozygous genotype cc. For an offspring to be at
risk, both parents must be carriers, with the heterozygous
genotype Cc. Of 86 children, 25 were normal, 45
carriers and 16 affected. Is there any deviation from
the Mendelian theory that says 25%, 50% and 25%
are normal, carriers and affected in any population?
Solution:
Genotype      Normal    Carrier    Affected
Observed      25        45         16
Expected      21.5      43.0       21.5
(O-E)         +3.5      +2.0       -5.5
(O-E)²        12.25     4.0        30.25
(O-E)²/E      0.570     0.093      1.407
$\chi^2$ on 2 d.f. = 2.070
$\chi^2_2(.05) = 5.99$
P > 0.05
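The goodness-of-fit table can be reproduced with the sketch below; the function name `chi_square_gof` is hypothetical and the critical value 5.99 is the one quoted in the solution.

```python
def chi_square_gof(observed, proportions):
    """Chi-square goodness-of-fit statistic for observed counts against
    hypothesised category proportions."""
    total = sum(observed)
    expected = [total * p for p in proportions]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return expected, statistic

expected, chi2 = chi_square_gof([25, 45, 16], [0.25, 0.50, 0.25])
print(expected, round(chi2, 3))
print("reject Ho" if chi2 > 5.99 else "do not reject Ho")
```

The same function can be reused for the hair-colour exercise later in this section by passing its four observed counts and the proportions 6/13, 4/13, 1/13 and 2/13.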
Example 2:
Number of bacteria colonies observed on a set of microscope
slides might reasonably be expected to follow a Poisson
process.
$P(r) = \frac{\lambda^r e^{-\lambda}}{r!}$
$P(0) = \frac{\lambda^0 e^{-\lambda}}{0!} = e^{-\lambda}$
$P(1) = \frac{\lambda^1 e^{-\lambda}}{1!} = \lambda e^{-\lambda} = \lambda P(0)$, etc.

If 40 slides yield a total of 50 colonies,
∴ λ = 50/40 = 1.25.
The observed distribution showed that 14 slides had no colonies,
12 slides had 1 colony each, 6 slides had 2 colonies each and 8 slides
had 3 or more colonies.
Does the Poisson distribution with an average of 1.25 colonies
per slide fit the observed distribution?
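One way to approach this question is to compute the expected number of slides in each category from the Poisson formula and compare them with the observed counts using the goodness-of-fit statistic. The sketch below is an illustration under stated assumptions: the function name `poisson_expected` is hypothetical, and the "3 or more" category is obtained by subtraction so that the probabilities sum to 1.

```python
from math import exp, factorial

def poisson_expected(lam, n_units, max_count):
    """Expected frequencies for counts 0, 1, ..., max_count - 1 and a
    pooled final category of `max_count or more`."""
    probs = [lam ** r * exp(-lam) / factorial(r) for r in range(max_count)]
    probs.append(1 - sum(probs))  # pooled upper tail
    return [n_units * p for p in probs]

observed = [14, 12, 6, 8]  # slides with 0, 1, 2 and 3+ colonies
expected = poisson_expected(lam=1.25, n_units=40, max_count=3)
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print([round(e, 2) for e in expected], round(chi2, 2))
```

Because λ was estimated from the same data, a common convention is to take the degrees of freedom as the number of categories minus 2 (here 2) before comparing the statistic with the tabulated chi-square value.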
Exercise on Chi Square Goodness of fit Test
• According to a genetic theory, when a male with brown
hair has children with a black-haired female, the next
generation of children will have black, brown, dark brown
and chocolate brown hair in proportions 6/13, 4/13, 1/13 and
2/13 respectively.
• The outcomes in an actual experiment were as follows: 123
had black hair, 70 had brown hair, 48 had dark brown hair
and only 19 had chocolate brown hair.
• Using a 5% significance level, determine whether the result
of the experiment supports the theory.
71
Chi-square Test

What it is?
Choice of test statistic to
investigate the significance of
association (Dependence)
between two qualitative
variables.
Examples:

• Presence or absence of a risk factor


and having or not having a condition.
• Exposure to dust and Bronchial
Asthma.
• Breast cancer and Type of Diet.
• Occupation and Colorectal cancer.

73
Contingency Table

• Usual table for data presentation.


• Involves cross classification of two
qualitative variables.
• Table serve initial assessment of
association.

74
75
Two-Way Chi Square
• Comparisons between frequencies (rather than scores
as in t or F tests).
• So, null hypothesis is that the two or more
populations do not differ with respect to frequency of
occurrence.
• Review cross-tabulations
• Are the differences in responses of two groups
statistically significantly different?
• Two-way = one set of observed frequencies vs
another set.

76
Two-way Chi Square Example
• Null hypothesis: The relative frequency [or
percentage] of HIV Infected patients who died is the
same as the relative frequency of HIV negative patients
who died.
• Or
• Ho: Death does not depend on HIV status
• Categories (independent variable) are HIV status
(positive or negative).
Dependent variable being measured is Death (Died or
Alive).
• Note that both rows and columns are nominal data --
which could not be handled by t test or ANOVA. Here
the numbers are frequencies, not an interval variable.
77
Two-Way Chi Square Example…..
• We get the expected frequencies for each cell by
multiplying the row marginal total by the column
marginal total and dividing the result by N.
• We’ll put the expected values in parentheses.
• Degree of freedom (df): Used to determine the
rejection region.
• The df is (r-1)(c-1) where r and c are numbers of
rows and columns in the contingency table
respectively

78
Procedure for Statistical
Test.
• Step 1: Ho: There is no association between
occupational status and the presence of stress.
• Step 2: HA: There is an association.
• Step 3: α = 0.05
• Step 4: Choose the X2 test.
• Step 5: $X^2 = \sum \frac{(O_i - E_i)^2}{E_i}$
• Step 6: Compare the calculated X2 with the tabulated
X2 on the appropriate degrees of freedom at the 5% level.
Conclusion: P < 0.05, Reject Null Hypothesis.
• Where O i = observed frequency in cell i of the table
and E i = expected frequency in cell i of the table
if the Null Hypothesis were true.
79

80
1. Example of a contingency table. Data on occupational
status of subjects and the presence of stress.

                         Occupation
Stress                   Professional   Skilled   Unskilled   Total
Present                  5              13        70          88
Absent                   20             32        60          112
Total                    25             45        130         200
Percentage with stress   20.0           28.9      53.8        44.0

81
Expected Frequencies

E1 = 88 × 25 / 200 = 11.0      E4 = 112 × 25 / 200 = 14.0
E2 = 88 × 45 / 200 = 19.8      E5 = 112 × 45 / 200 = 25.2
E3 = 88 × 130 / 200 = 57.2     E6 = 112 × 130 / 200 = 72.8
82
Chi-square value
$X^2 = \frac{(5 - 11.0)^2}{11.0} + \frac{(13 - 19.8)^2}{19.8} + \frac{(70 - 57.2)^2}{57.2} + \frac{(20 - 14.0)^2}{14.0} + \frac{(32 - 25.2)^2}{25.2} + \frac{(60 - 72.8)^2}{72.8}$
= 3.27 + 2.34 + 2.86 + 2.57 + 1.83 + 2.25

X2 = 15.12 on 2 degrees of freedom.
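The expected frequencies and the chi-square value for the occupation and stress table can be recomputed with the sketch below. The function name `chi_square_independence` is hypothetical; rows are stress present and absent, columns are professional, skilled and unskilled.

```python
def chi_square_independence(table):
    """Chi-square test of association for an r x c table of counts.
    Expected count = row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return expected, statistic, df

expected, chi2, df = chi_square_independence([[5, 13, 70], [20, 32, 60]])
print([[round(e, 1) for e in row] for row in expected], round(chi2, 2), df)
```

Working with unrounded terms gives a statistic of about 15.13, against 15.12 when each term is first rounded to two decimals as on the slide.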

83
Decisions.
• Compare calculated chi-square with
tabulated chi-square at 5% level and
corresponding degree of freedom.
• If calculated chi-square is smaller than
tabulated chi-square, then P > 0.05.
• Do not reject Null hypothesis if P > 0.05.
• If calculated chi-square is larger than
tabulated chi-square, then P < 0.05.
• Reject Null hypothesis if P < 0.05.
• Here tabulated chi-square on 2 d.f. at 5%
is 5.991.
• Decision – Reject Null Hypothesis.

84
The exact test for 2x2 table

• Useful for tables with very small frequencies in the cells.
• For a 2x2 table with cell frequencies a, b, c, d and total N, the probability of the table,
conditional on the observed marginal totals, is
$P = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{N!\,a!\,b!\,c!\,d!}$
• We can calculate the probability of all tables with the same
marginal totals under the null hypothesis - use if N<20 and
any cell < 5
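The probability of a single 2x2 table with fixed margins can be sketched as below. The function name `table_probability` and the example table (a = 1, b = 3, c = 3, d = 1) are hypothetical and serve only to illustrate the formula; the numerator uses the two row totals and the two column totals. A full exact test would sum these probabilities over all tables at least as extreme as the one observed.

```python
from math import factorial

def table_probability(a, b, c, d):
    """Probability of one 2x2 table, conditional on its marginal totals:
    (a+b)! (c+d)! (a+c)! (b+d)! / (N! a! b! c! d!)."""
    n = a + b + c + d
    numerator = (factorial(a + b) * factorial(c + d)
                 * factorial(a + c) * factorial(b + d))
    denominator = (factorial(n) * factorial(a) * factorial(b)
                   * factorial(c) * factorial(d))
    return numerator / denominator

p = table_probability(1, 3, 3, 1)  # hypothetical table, N = 8
print(round(p, 4))
```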

85
Short-cut for 2x2 table
$X^2_c = \frac{(|ad - bc| - \frac{1}{2}N)^2 N}{(a+b)(c+d)(b+d)(a+c)}$
Paired – McNemar’s
• Analysis of paired qualitative data
• Pairing can be achieved by matching or using
the same experimental units twice.
• X2 - McNemar's Test
• $X^2_{MN} = \frac{(|r - s| - 1)^2}{r + s}$
where r and s are the numbers of discordant pairs

87
Exercise:
Ability of two media to detect tubercle bacilli in a laboratory
experiment. Fifty specimens of sputum are each cultured on
two different media. The results showed 20 were positive on
both media and 16 were negative on both media. If the total
positives on the two media were 32 and 22 respectively, present
the result in a 2x2 table. Is there a statistically significant
difference in the ability of the two culture media to detect
tubercle bacilli?
                 Medium A
                 +      -      Total
Medium B   +     20     12     32
           -     2      16     18
Total            22     28     50
Example:
A study was carried out to investigate the effect of
gender on smoking habits. One hundred pairs of
males and females, with partners in each pair being
of the same age and socio-economic status, were
studied. In 26 couples, only the male was a smoker,
compared with only 10 couples in which only the female
was a smoker. If in 24 couples both male and
female were smokers and none of the subjects in
the remaining couples smoked, is there any statistical
effect of gender on smoking?
Solution: Consider four possible outcomes

• Male smokes/female smokes (+ +) 24
• Male smokes/female does not smoke (+ -) 26
• Male does not smoke/female smokes (- +) 10
• Male does not smoke/female does not smoke (- -) 40

90
Contingency Table
                Male
                +      -      Total
Female    +     24     10     34
          -     26     40     66
Total           50     50     100
Test Procedures
Ho: No gender effect on smoking
HA: There is a gender effect
α = 0.05
X2 - McNemar's Test
$X^2_{MN} = \frac{(|r - s| - 1)^2}{r + s}$
Calculation

$X^2 = \frac{(|10 - 26| - 1)^2}{36} = \frac{225}{36}$
$X^2_1 = 6.250$, P < 0.05
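The McNemar calculation depends only on the two discordant counts, which the sketch below makes explicit. The function name `mcnemar_chi2` is hypothetical; 3.84 is the tabulated chi-square value on 1 degree of freedom at the 5% level.

```python
def mcnemar_chi2(r, s):
    """McNemar's chi-square with continuity correction, where r and s
    are the counts of the two kinds of discordant pairs."""
    return (abs(r - s) - 1) ** 2 / (r + s)

# 10 couples where only the female smokes, 26 where only the male smokes
chi2 = mcnemar_chi2(10, 26)
print(chi2, "reject Ho" if chi2 > 3.84 else "do not reject Ho")
```

The 24 couples in which both smoke and the 40 in which neither smokes do not enter the statistic, since they carry no information about a difference between the sexes.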

It appears there is a difference in the smoking habits
of males and females. Indeed males are more likely
to smoke than females
Listen again…….
You have to put in many, many,
many tiny efforts that nobody
sees or appreciates before you
achieve anything worthwhile…..
So continue computing statistics……

94
For Further Enquiries, Contact

08188231380, 08061348165,
franstel74@[Link]

95
