
GET 305 (Lecture Note) Module 1

The course on Engineering Statistics and Data Analytics (GET 305) aims to equip students with the ability to apply statistical methods to biomedical and engineering datasets, including descriptive statistics, statistical inference, and regression analysis. Students will also learn to use SPSS software for data analysis and evaluate data analytics concepts relevant to healthcare. The course emphasizes the importance of statistics in experimental design, quality improvement, and decision-making in engineering and biomedical applications.

Uploaded by

ekumajohndeadly

COURSE: ENGINEERING STATISTICS AND DATA ANALYTICS

COURSE CODE: GET 305


3 Unit E: LH 45
At the end of the course, students should be able to:

1. Apply descriptive statistical methods to biomedical and engineering datasets, using frequency
distributions, measures of central tendency, dispersion, and percentiles, to summarize and interpret
experimental and observational data with acceptable numerical accuracy.

2. Apply statistical inference techniques such as confidence intervals and hypothesis testing, using
appropriate test statistics and significance levels, to draw valid conclusions from sampled
biomedical engineering data.

3. Analyse relationships between variables through regression and correlation methods, using real-
world biomedical datasets, to support prediction, modelling, and data-driven decision-making in
healthcare engineering applications.

4. Implement statistical modelling and data analysis using the SPSS statistical software environment
and its relevance to data analytics in engineering and biomedical applications. Perform data entry,
data cleaning, and manipulation of variables using SPSS. Apply SPSS statistical procedures for descriptive
statistics, inferential analysis, and regression modelling

5. Evaluate and apply data analytics concepts including big data analytics and cloud computing tools,
through case-based examples in biomedical engineering and health systems, to address
contemporary challenges in large-scale healthcare data analysis.

6. Analyse probabilistic models relevant to engineering problems, by applying probability rules and
discrete and continuous distributions (Binomial, Poisson, Hypergeometric, and Normal), to
estimate uncertainty and variability in biomedical systems.

1.0 Introduction to Engineering Statistics and Data Analytics
Statistics is the branch of mathematics used in engineering to collect, analyse, interpret, and present data
to make informed decisions. It deals with uncertainty, variation, and probability in measurements,
materials, and processes. In engineering, this involves analysing experimental data, testing hypotheses,
and estimating relationships between variables to guide design, quality control, and decision-making.
Everything dealing with the collection, processing, analysis, and interpretation of numerical data belongs
to the domain of statistics. In engineering, this includes tasks such as calculating the average downtime of
machines in a factory, analysing test data from electronic circuits, evaluating the performance of medical
or mechanical devices, predicting the reliability of engines or power systems, and studying vibrations in
bridges, aircraft wings, or rotating machines.

1.1 Why Engineers Study Statistics

Statistics helps biomedical engineers make sound design and manufacturing decisions using data—
especially when it is impossible or too costly to test every single component or device.
Examples:
• While designing an infant incubator, an engineer tests temperature sensors from a small batch and
uses the results to decide whether the entire production meets safety standards.
• During fabrication of a patient monitor, only a few circuit boards are stress-tested. Statistical
analysis is then used to predict the reliability of all units produced.
• In developing a phototherapy system, light intensity is measured on selected prototypes, and the
data guide design adjustments for optimal therapeutic performance.
• When selecting materials for a prosthetic or catheter, an engineer tests samples for strength and
biocompatibility and applies statistics to choose the best material for mass production.
The process of using statistics usually involves four steps:
1. Set goals: Decide clearly what you want to find out.
2. Plan data collection: Decide what data is needed and how to collect it.
3. Analyze data: Use statistical methods to get useful information from the data.
4. Interpret results: Understand the information and make conclusions.
By following these steps, statistics helps you gather information efficiently and make informed decisions.

1.2 Role of Statistics in Experimental Design, Healthcare, and Biomedical Research
1. Role in Quality Improvement
Statistics is essential for improving processes and products. Engineers and scientists use it to collect data,
analyze trends, and present information visually, which helps identify areas needing improvement. For
example: Hospitals use statistics to track infection rates in different wards. By analyzing the data, they
can detect patterns and implement measures to reduce hospital-acquired infections.
2. Applications in Experimental Design
Experimental design is about planning how data is collected so that the results are reliable and conclusions
are valid. Statistics plays a key role by:
• Controlling variation: Ensuring that differences in outcomes are due to the factors being tested,
not random errors.
• Testing hypotheses: Determining whether observed effects are significant or due to chance.
• Monitoring processes: Detecting when something goes wrong and requires correction.
• Example (Biomedical Research): When testing a new drug, a scientist uses statistics to divide
patients into treatment and control groups randomly. This ensures that differences in outcomes are
due to the drug, not other factors like age or health condition.

2.0 DATA TYPES AND DATA COLLECTION METHODS

Variables
Variables are qualities or quantities that vary from one member of the sample to another. They describe
characteristics we can measure or count. Examples include age, sex, height, income, marital status, and
eye colour.
They are called variables because their values are not the same for everyone, and they can also change
over time. For example, age is a variable because people in a group do not all have the same age, and each
person’s age increases with time. Likewise, income is a variable because it can be different for different
people and may also increase or decrease over time.

Types of Variables
Variables can be classified into two, namely Quantitative (Numeric) and Qualitative (Categorical).
Each type can be classified further.

Figure 1: types of Variables

Quantitative (Numeric) Variable


A quantitative variable is one that can be expressed in numerical terms. Such variables arise from
counts or measurements and are of two types: discrete and continuous.
1. Discrete Variable
A discrete variable is something you can count. It takes whole numbers only (no fractions or decimals). In
other words, if you can count it one by one, it’s discrete. Examples include the number of cars in a parking lot,
students in a class, or patients in a hospital. The values usually look like this: 0, 1, 2, 3, 4, and so on.
2. Continuous Variable
A continuous variable is something you measure. It can take any value within a range, including decimals.
Examples include weight, height, length, volume, or time. For instance, the time spent attending to a
patient, the length of medical equipment produced, or a person’s weight are continuous because they can
be 2.5 minutes, 45.8 cm, or 63.2 kg.

Types of Continuous Variables


There are two main types of continuous variables: interval variables and ratio variables.
1. Interval Variable
An interval variable has equal differences between values, but zero does not mean “nothing.”
Examples include IQ, temperature, and pH.
For instance, in temperature: the difference between 100°C and 90°C is the same as between 90°C and
80°C. However, 0°C does not mean there is no heat, and 20°C is not “twice as hot” as 10°C. The zero
point is just a reference, not a true zero.
2. Ratio Variable
A ratio variable also has equal differences between values, but it has a true zero point. This means that
when the value is zero, the quantity truly does not exist.
Examples include weight, height, age, monthly income, and years of education.
For example, 0 kg means no weight and 0 years means no age. Because of this true zero, we can make
comparisons like: 6 m is twice as tall as 3 m, and 60 kg is twice as heavy as 30 kg.
QUALITATIVE VARIABLE

Qualitative variables describe qualities or characteristics, not numbers you measure or count. Examples
include: Gender (male, female), Marital status (single, married, divorced, widowed), and Blood group (A, B,
AB, O). Sometimes we give them numbers for easy recording (for example: male = 1, female = 2), but
these numbers do not have mathematical meaning; you cannot add or multiply them. There are two types:
ordinal and nominal variables.
1. Ordinal Variable
An ordinal variable is a type of categorical variable that can be arranged in order or ranked, but the gap between
the categories is not equal or measurable. You can arrange them in order, but you cannot measure how far apart
they are or by what quantity they differ.
Examples:
• Disease severity: absent → mild → moderate → severe
• Students’ grades: A, B, C, D
• Attitude: strongly agree → agree → disagree → strongly disagree
• Ratings: very low → low → medium → high → very high
Although these are ordered, we cannot say exactly how much better or worse one category is compared to
another.
2. Nominal Variable
A nominal variable is also categorical, but it has no natural order or ranking. They are just names or labels.
Examples:
• Sex: male, female
• Marital status: single, married, divorced, widowed
• Study type: full-time, part-time, evening
• Hair colour: black, brown, red
• Religion
These groups are different, but none is higher or better than the other. Sometimes numbers are used for coding
(e.g., male = 1, female = 2), but the numbers have no mathematical meaning.

Functional classification of variables


Dependent Variable
A dependent variable is the variable that changes because of another variable. Example:
If you change the light intensity in a phototherapy unit, the bilirubin level measured in the baby is the dependent
variable.
E.g. if salt intake causes hypertension (salt intake → hypertension), then hypertension is the dependent variable.
Independent Variables
Independent variables are the factors you change or control in an experiment to see how they affect the outcome
(dependent variable). They are also called: Predictor variable, Controlled variable or Manipulated variable
Independent = what you change
Dependent = what you measure
Example:
If you adjust the light intensity in a phototherapy unit, the light intensity is the independent variable because
you are controlling it to see how it affects the baby’s bilirubin level.
E.g. if hypertension causes coronary heart disease (hypertension → coronary heart disease), then hypertension
is the independent variable.

Independent variables may be continuous, binary/dichotomous, nominal categorical, or ordinal categorical,
among other kinds.
Intermediary (Intervening) Variable
An intermediary variable is a variable that explains how or why one variable affects another. It is also
called a mediating variable.
Example:
• We see that people with higher income tend to live longer (income → longevity).
• But having more money doesn’t directly make you live longer.
• Higher income usually means better medical care, which helps people live longer.
• Here, medical care is the intervening variable, because it links income to longevity.
Intervening variable = the middle factor that explains how one variable affects another.
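E.g. if salt intake causes hypertension, which in turn causes coronary heart disease
(salt intake → hypertension → coronary heart disease), then hypertension is the intervening variable.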

Note that if obesity causes (or is associated with) both hypertension and coronary heart disease, while
hypertension also causes coronary heart disease, then obesity is a confounder variable for the relationship
between hypertension and coronary heart disease.

Obesity → Hypertension → Coronary heart disease
Obesity → Coronary heart disease

Binary variable (Dichotomous variable): These are nominal variables that occur in two
categories, E.g., “improved/not improved”, “disease present/ disease not present”; yes/No;
male/female; etc. They are often labeled zero and one.

3.0 Descriptive Statistical Methods


Descriptive statistical methods are techniques used to organize, summarize, and present data in a
meaningful way so that its main features can be easily understood.
They do not make predictions or generalizations beyond the data collected; instead, they simply describe
what the data shows.
Common Descriptive Statistical Methods Include:
• Tables and frequency distributions
• Graphs and charts (bar charts, pie charts, histograms)
• Measures of central tendency (mean, median, mode)
• Measures of dispersion (range, variance, standard deviation)

FREQUENCY DISTRIBUTION

Frequency distribution is a representation, either in a graphical or tabular format, that displays the
number of observations or times a given quantity (or group of quantities) occurs in a set of data. For
example, the frequency distribution of income in a population would show how many individuals (or
households) have the income of a certain level.

Example 1: A biomedical engineering team collected body weight measurements (in pounds) of 57
pediatric patients using a digital weighing system during routine clinical screening in a tertiary hospital.
The data were used to assess sensor performance, calibration accuracy, and patient weight distribution for
pediatric medical device design:

68 63 42 27 30 36 28 32 79 27 22 23 24 25 44 65 43 25 74 51 36 42 28 31 28 25 45 12 57 51 12 32 49 38
42 27 31 50 38 21 16 24 69 47 23 22 43 27 49 28 23 19 46 30 43 49 12
Solution
From the data set above we have:

Weight interval (lb)   Tally (||||/ = 5)          Frequency   Relative frequency (%)
10–19                  ||||/                      5           8.8
20–29                  ||||/ ||||/ ||||/ ||||     19          33.3
30–39                  ||||/ ||||/                10          17.5
40–49                  ||||/ ||||/ |||            13          22.8
50–59                  ||||                       4           7.0
60–69                  ||||                       4           7.0
70–79                  ||                         2           3.5
Total                                             57          100
Note that the frequency distribution table above can be presented in the form of a chart.

Figure 1: Simple Bar Chart of weights of 57 children at a day-care center

Figure 2: Pie Chart of weights of 57 children at a day-care centre
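
The frequency distribution in Example 1 can also be obtained by counting directly from the raw data. The short Python sketch below is only an illustration (the variable names are our own, and the course itself uses SPSS); it groups the 57 weights into the class intervals 10–19, 20–29, ..., 70–79 and computes the relative frequencies. Counting this way gives two values (74 and 79) in the 70–79 class.

```python
from collections import Counter

# Body weights (lb) of the 57 patients listed in Example 1.
weights = [
    68, 63, 42, 27, 30, 36, 28, 32, 79, 27, 22, 23, 24, 25, 44, 65, 43, 25, 74,
    51, 36, 42, 28, 31, 28, 25, 45, 12, 57, 51, 12, 32, 49, 38, 42, 27, 31, 50,
    38, 21, 16, 24, 69, 47, 23, 22, 43, 27, 49, 28, 23, 19, 46, 30, 43, 49, 12,
]

# Class intervals 10-19, 20-29, ..., 70-79, identified by their lower limit.
lower_limits = list(range(10, 80, 10))

# Integer division by 10 maps each weight to the lower limit of its class.
counts = Counter((w // 10) * 10 for w in weights)

frequencies = [counts[lo] for lo in lower_limits]
relative = [round(100 * f / len(weights), 1) for f in frequencies]

for lo, f, r in zip(lower_limits, frequencies, relative):
    print(f"{lo}-{lo + 9}: frequency = {f}, relative frequency = {r}%")
```

Because each percentage is rounded to one decimal place, the relative frequencies may add up to slightly less or more than exactly 100.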

Example 2:

A biomedical engineering team working with public health officials is developing a digital immunization
monitoring system for primary healthcare centers. To validate the system and understand vaccination coverage,
data were collected from electronic health records on the immunization status of under-five children in both
rural and urban communities.
The table below summarizes the collected data:

Immunization status Rural Urban


Complete 181 202
Partial 114 107
None 19 4
Using this data:
a) Construct a multiple bar chart to compare immunization status between rural and urban areas.
b) Construct a component (stacked) bar chart to show the proportional distribution of immunization status in
each location.

Figure 3: Multiple Bar Chart of immunization status of under-five children in a rural and urban
town

Figure 4: Component Bar Chart of immunization status of under-five children in a rural and
urban town
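
A component (stacked) bar chart of proportions needs each count expressed as a percentage of its own column total. The sketch below, with illustrative variable names of our own, computes those percentages from the Example 2 table; the resulting numbers are what each bar segment would represent.

```python
# Immunization counts from the Example 2 table.
status = ["Complete", "Partial", "None"]
counts = {
    "Rural": [181, 114, 19],
    "Urban": [202, 107, 4],
}

# Percentage of each status within its own location (column percentages).
percentages = {
    location: [round(100 * c / sum(values), 1) for c in values]
    for location, values in counts.items()
}

for location, values in percentages.items():
    for name, pct in zip(status, values):
        print(f"{location:5s} {name:8s} {pct}%")
```

Comparing percentages rather than raw counts matters here because the two locations have slightly different totals (314 rural and 313 urban children).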

Measures of Central tendency (MEAN, MEDIAN AND MODE)

The average value is usually represented by the arithmetic mean, customarily just called the mean.
This is simply the sum of the values divided by the number of values. Let $\bar{x}$ represent the mean; then
$\bar{x} = \frac{\sum x}{n}$, where x denotes the values of the variable, ∑ (the Greek capital letter sigma) means
‘the sum of’, and n is the number of observations.

Other measures of the average value are the median and the mode.
The median is the value that divides the distribution in half. If the observations are arranged in increasing order,
the median is the middle observation, that is, the ((n + 1)/2)th value of the ordered observations.
Example:
A biomedical engineering team is calibrating a non-invasive blood volume monitoring device intended
for use in dialysis and critical care units. To validate the device, plasma volume measurements (in litres)
are taken from eight healthy adult males during baseline testing:

2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12

The engineer must determine the mean, median, and mode of these values to establish a reference plasma
volume range for healthy adults, which will later be used to compare patient readings and detect abnormal
fluid balance.

(a) Mean: n = 8, ∑x = 2.75 + 2.86 + 3.37 + 2.76 + 2.62 + 3.49 + 3.05 + 3.12 = 24.02
Therefore, $\bar{x} = \frac{24.02}{8} = 3.00$ litres

(b) Median: First rearranging the measurements in increasing order gives:
2.62, 2.75, 2.76, 2.86, 3.05, 3.12, 3.37, 3.49
If there is an even number of observations, there is no middle one and the average of the two
‘middle’ ones is taken.
Position of the median = (n + 1)/2 = 9/2 = 4.5th value, so
Median = average of 4th and 5th values = (2.86 + 3.05)/2 = 2.96 litres

The mode is the value which occurs most often.


(c) Mode: There is no estimate of the mode, since all the values are different.
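
As a quick check of the plasma volume example, the three measures can be computed with Python's standard `statistics` module. This is only a sketch for verification; the variable names are our own, and treating "every value appears once" as "no mode" is our reading of part (c).

```python
import statistics

# Plasma volumes (litres) of the eight healthy adult males.
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

mean = statistics.mean(plasma)      # sum of values divided by n
median = statistics.median(plasma)  # average of the 4th and 5th ordered values
modes = statistics.multimode(plasma)  # all values tied for the highest count

# If every value occurs exactly once, no single value is "most frequent".
has_mode = len(modes) < len(set(plasma))

print(f"mean = {mean:.4f} litres")
print(f"median = {median:.3f} litres")
print("mode:", modes if has_mode else "none (all values are different)")
```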
The mean is usually the preferred measure since it takes into account each individual observation and is
most amenable to statistical analysis. The median is a useful descriptive measure if there are one or two
extremely high or low values, which would make the mean unrepresentative of the majority of the data.
The mode is seldom used. If the sample is small, either it may not be possible to estimate the mode or the
estimate obtained may be misleading.
The mean, median and mode are, on average, equal when the distribution is symmetrical and unimodal.
When the distribution is positively skewed, a geometric mean may be more appropriate than the arithmetic
mean.

Measures of Dispersion
The range
The range is the simplest measure, and is the difference between the largest and smallest values. Its
disadvantage is that it is based on only two of the observations and gives no idea of how the other
observations are arranged between these two. Also, it tends to be larger, the larger the size of the sample.

Variance
Variance (and its square root, standard deviation) is the most commonly used measure of variation because
it considers all observations. It is based on how far each value deviates from the mean. When values are close
to the mean, variation is small; when they are widely spread, variation is large. Simply averaging deviations
does not work because positive and negative values cancel out, giving zero. Therefore, variation is measured
by considering the size of deviations rather than their direction. One option is to average the absolute
deviations, but that measure is not mathematically very tractable, and so instead we average the squares of
the deviations, since the square of a number is never negative.
Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Degrees of freedom
Note that the sum of squared deviations is divided by (n - 1) rather than n, because it can be shown
mathematically that this gives a better estimate of the variance of the underlying population. The
denominator (n- 1) is called the number of degrees of freedom of the variance. This number is (n - 1) rather
than n, since only (n - 1) of the deviations $(x_i - \bar{x})$ are independent from each other. The last one can always
be calculated from the others because all of them must add up to zero.

Standard Deviation
For many purposes it is more convenient to express the variation in the original units by taking the square
root of the variance. This is called the standard deviation (S.D.).
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$ or $s = \sqrt{\frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1}}$
When using a calculator, the second formula is more convenient for calculation, since the
mean does not have to be calculated first and then subtracted from each of the
observations.
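
The two ways of computing the standard deviation (from the deviations about the mean, and from the sums ∑x and ∑x²) give the same result. The sketch below checks this on the plasma volume data from the earlier example; the data choice and variable names are ours, for illustration only. It also confirms that the deviations add up to zero, which is why only (n - 1) of them are independent.

```python
import math

# Plasma volumes (litres) reused from the mean/median example.
x = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]
n = len(x)
mean = sum(x) / n

# The deviations about the mean always sum to zero (up to rounding error).
deviation_sum = sum(xi - mean for xi in x)

# Definition formula: square root of the averaged squared deviations.
s_definition = math.sqrt(sum((xi - mean) ** 2 for xi in x) / (n - 1))

# Calculator formula: uses only the sum of x and the sum of x squared.
sum_x = sum(x)
sum_x2 = sum(xi ** 2 for xi in x)
s_calculator = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

print(f"sum of deviations = {deviation_sum:.2e}")
print(f"s (definition) = {s_definition:.6f}")
print(f"s (calculator) = {s_calculator:.6f}")
```

On a computer the calculator formula can lose accuracy when the mean is very large compared with the spread, because two nearly equal numbers are subtracted; for hand calculation with data like these it is perfectly adequate.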

Coefficient of variation
The coefficient of variation expresses the standard deviation as a percentage of the sample mean. This is
useful when interest is in the size of the variation relative to the size of the observation, and it has the
advantage that the coefficient of variation is independent of the units of observation.
$\text{c.v.} = \frac{s}{\bar{x}} \times 100$
Note: Engineering benchmark:
• CV < 10% → excellent control
• 10–20% → acceptable
• > 30% → unstable process

Standard error
The sample mean is unlikely to be exactly equal to the population mean. A different sample would give a
different estimate, the difference being due to sampling variation. The size of this sampling variation is
measured by the standard error of the sample mean, which indicates how precisely the population mean is
estimated by the sample mean.
$SE = \frac{s}{\sqrt{n}}$

The size of the standard error depends both on how much variation there is in the population and on the size
of the sample. The larger the sample size n, the smaller is the standard error.
Example: In a medical equipment production company, samples from nine consecutive production batches
were tested for contamination. The contamination levels (MPN/g) obtained were: 0.593, 0.142, 0.329, 0.691,
0.231, 0.793, 0.519, 0.392, 0.418.
Using statistical methods, determine:
a) The range of the data.
b) The mean of the sample.
c) The variance of the sample.
d) The standard deviation.
e) The coefficient of variation (CV).
f) The standard error of the mean.

S/n         Sample x_i   x_i − x̄      (x_i − x̄)²
1           0.593        0.136556     0.018647
2           0.142        -0.31444     0.098875
3           0.329        -0.12744     0.016242
4           0.691        0.234556     0.055016
5           0.231        -0.22544     0.050825
6           0.793        0.336556     0.11327
7           0.519        0.062556     0.003913
8           0.392        -0.06444     0.004153
9           0.418        -0.03844     0.001478
Total (∑)   4.108                     0.36242

Soln.
a) Range = Highest value – Lowest value = 0.793 – 0.142 = 0.651 MPN/g

b) Mean: $\bar{x} = \frac{\sum x}{n} = \frac{4.108}{9} = 0.4564$ MPN/g

c) Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{0.3624}{8} = 0.0453$ (MPN/g)²

d) Standard deviation: $s = \sqrt{s^2} = \sqrt{0.0453} = 0.2128$ MPN/g
This tells us how far, on average, each batch deviates from the mean contamination level. A
standard deviation of 0.213, compared with a mean of 0.456, is high. This means that the process
is not tightly controlled.
N.B: The larger the standard deviation, the wider your control limits, and the higher your defect
risk.

e) Coefficient of Variation: $CV = \frac{s}{\bar{x}} \times 100 = \frac{0.2128}{0.4564} \times 100 = 46.63\%$
Therefore, the CV of about 47% indicates that the production process is statistically unstable.

f) Standard Error of the Mean: $SEM = \frac{s}{\sqrt{n}} = \frac{0.2128}{\sqrt{9}} = \frac{0.2128}{3} = 0.0709$ MPN/g
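
The worked example can be cross-checked in a few lines of Python. This is a sketch using the standard library only, with variable names of our own; it keeps full precision throughout, so the last decimal place may differ slightly from a hand calculation that rounds intermediate values.

```python
import math
import statistics

# The nine readings from the worked example.
levels = [0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418]
n = len(levels)

data_range = max(levels) - min(levels)   # highest value minus lowest value
mean = statistics.mean(levels)           # sum of values divided by n
variance = statistics.variance(levels)   # sample variance, divisor (n - 1)
sd = math.sqrt(variance)                 # standard deviation
cv = 100 * sd / mean                     # coefficient of variation, in percent
sem = sd / math.sqrt(n)                  # standard error of the mean

print(f"range    = {data_range:.3f}")
print(f"mean     = {mean:.4f}")
print(f"variance = {variance:.4f}")
print(f"sd       = {sd:.4f}")
print(f"CV       = {cv:.2f}%")
print(f"SEM      = {sem:.4f}")
```

Note that `statistics.variance` divides by (n - 1), matching the sample variance formula used in these notes; `statistics.pvariance` would divide by n instead.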

HYPOTHESIS TESTING
Hypothesis testing is the use of organized statistical steps to assess whether sample data provide enough
evidence against a stated hypothesis. Hypothesis testing is designed to detect significant differences or
significant associations. Significant differences/associations here refer to differences/associations that are
unlikely to have occurred by random chance alone.

Steps in Hypothesis Testing


Step 1. Hypotheses Statement: Null hypothesis (Ho) and the alternative hypothesis (Ha) where
necessary
The null hypothesis is usually stated in the negative form. It assumes that there is no real effect, no difference,
or no association between variables. In other words, it says that any observed difference is due to random
chance, not because of the factor being studied. Example: There is no significant difference between two
devices.

The null hypothesis (Ho) is stated as the hypothesis of “no association” or “no difference”. This is
similar to a law court, where the accused is presumed not guilty until proven otherwise.

➢ Alternative hypothesis (Ha): contradicts the stated null hypothesis.

E.g. Ho: There is no significant association between oral contraceptive use and blood pressure.
Ha: A significant association exists between oral contraceptive use and blood pressure.

Ho: µ = 0

Step 2. Statement of significance level (α) (usually 5%, 1%, etc.)
Step 3. Appropriate test statistic (Chi-square, t-test, F-test (ANOVA), etc.)
Step 4. Computation, based on the test statistic
Step 5. Decision: to reject or not to reject
➢ “Reject Ho” if calculated value > tabulated value (or p-value < α, e.g. 0.05),
➢ “Do not reject Ho” if calculated value ≤ tabulated value (or p-value ≥ α, e.g. 0.05).

Special Note
• Tabulated values are found from statistical tables (Z, t, Chi-sq, F-test, etc).
• Calculated values are computed manually in the analysis.
• The p-value is usually supplied by application software (SPSS, Epi-Info, STATA, SAS, etc).
• The p-value is the probability of getting a test statistic (distance) as extreme as or more extreme than
what was observed, by chance alone, if the null hypothesis were true.
• Both the tabulated value and the p-value lead to the same decision.
• The p-value approach is more advanced than the traditional tabulated-value approach.

Conclusion
• Evidence of a significant difference (or association) is established if Ho is rejected at the chosen
significance level, such as the 5% level of significance.
• No evidence of a significant difference (or association) is established if the null hypothesis Ho is
not rejected at the chosen significance level, such as the 5% level of significance.

Useful guide in p value Interpretations


• 0.01≤p < 0.05 → significant
• 0.001≤ p < 0.01→ highly significant
• p < 0.001 → very highly significant
• p > 0.05 → not significant
• 0.05 ≤ p < 0.10, → a trend towards statistically significant
Note: A significant (practically important) difference is quite different from a statistically significant one;
hence always check the results properly.
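
The interpretation guide above can be written as a small lookup function. This is a sketch only: the function name `interpret_p` is our own, and because the guide lists both "p > 0.05 → not significant" and a separate "trend" band for 0.05 ≤ p < 0.10, the sketch reports the trend band as a special case of "not significant".

```python
def interpret_p(p):
    """Return the wording from the p-value guide for a given p-value."""
    if not 0 <= p <= 1:
        raise ValueError("a p-value must lie between 0 and 1")
    if p < 0.001:
        return "very highly significant"
    if p < 0.01:
        return "highly significant"
    if p < 0.05:
        return "significant"
    if p < 0.10:
        return "not significant (trend towards statistical significance)"
    return "not significant"


for p in (0.0004, 0.004, 0.03, 0.07, 0.40):
    print(f"p = {p}: {interpret_p(p)}")
```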
Confidence Intervals (CI)
• P-values alone are often not sufficient for proper interpretation and decision-making in health research.
• Therefore, it is important to also report confidence intervals (CI), such as the 95% CI.
• A confidence interval shows the range of values within which we are reasonably confident the true
population mean or difference lies.
• As stated by Verbeek (2010), “A 95% confidence interval represents the set of all null hypotheses that
would not be rejected in a statistical test.”
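
As an illustration, a 95% confidence interval for a population mean can be sketched from the nine contamination readings used in the standard error example. The formula assumed here, mean ± t × SE, is the usual small-sample interval and is not stated in these notes; the critical value t = 2.306 (two-sided 95%, 8 degrees of freedom) is read from a standard t table, and the variable names are our own.

```python
import math
import statistics

# The nine readings from the standard error example.
levels = [0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418]
n = len(levels)

mean = statistics.mean(levels)
sem = statistics.stdev(levels) / math.sqrt(n)  # standard error of the mean

# Assumption: two-sided 95% critical value of t with n - 1 = 8 degrees of
# freedom, taken from a standard t table.
t_critical = 2.306

lower = mean - t_critical * sem
upper = mean + t_critical * sem

print(f"95% CI for the mean: {lower:.3f} to {upper:.3f}")
```

A wide interval relative to the mean, as here, reflects both the small sample size and the large variation between readings.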

Interpretations Based on 95% C I

Bio-statistical Quote: “ If the difference is not different enough to make a difference, What is the difference?”

TOOLS IN HYPOTHESIS TESTING.


Common statistical tools in hypothesis testing include Chi square test, Z-test, T test, F test (or Anova test),
correlation test, and regression test.

Chi Square Test (χ²)

A chi-square distribution with k degrees of freedom is the distribution of the sum of squares of k
independent standard normal random variables. A chi-square test is a statistical hypothesis test in which
the test statistic follows a chi-square distribution when the null hypothesis is true. While the chi-square
distribution was first introduced by the German statistician Friedrich Robert Helmert, the chi-square test was
first used by Karl Pearson in 1900. Hence Pearson’s chi-squared test (also called the ‘chi-squared’ test and
denoted by χ²) is the most popular type of chi-square test today. A classic example of a chi-square test is
the test for fairness of a die, where we test the hypothesis that all six possible outcomes are equally likely.

Chi Square Distribution with 1 degree of freedom

In biomedical engineering applications, chi-square analysis is commonly used in:


• Choice of material to be used
• Medical device quality control (e.g., defect classification)
• Clinical research (e.g., treatment effectiveness across groups)
• Biological experiments (e.g., gene or biomarker presence/absence)
• Hospital equipment audits
• Patient demographic studies and questionnaire responses

Chi-Square Goodness-of-Fit Test


The Chi-square goodness-of-fit test is used to check how well observed data match expected values or a
theoretical model. It is applied when you have one categorical variable with two or more groups or categories.
Examples of one categorical variable:
• Smoking status: current, former, non-smoker
• Food preference: fish, meat, vegetables
• Device condition: good, fair, poor
The test compares:
• Observed frequencies (what you actually counted) with
• Expected frequencies (what your model or assumption predicts)
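
A minimal sketch of the goodness-of-fit calculation, using the die-fairness idea mentioned earlier. The observed counts below are hypothetical (invented purely for illustration), and the critical value 11.0705 is the tabulated chi-square value for 5 degrees of freedom at the 5% level.

```python
# Hypothetical counts of each face in 60 rolls of a die (illustrative only).
observed = [8, 12, 9, 11, 10, 10]

n_rolls = sum(observed)
# Under Ho (a fair die) every face is expected equally often.
expected = [n_rolls / len(observed)] * len(observed)

# Chi-square statistic: sum of (O - E)^2 / E over all categories.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom for goodness-of-fit: number of categories minus one.
df = len(observed) - 1

# Tabulated chi-square value for df = 5 at the 5% significance level.
critical = 11.0705

reject_h0 = chi_sq > critical
print(f"chi-square = {chi_sq:.3f}, df = {df}, reject Ho: {reject_h0}")
```

With these made-up counts the calculated value is far below the tabulated value, so the hypothesis of a fair die would not be rejected.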
Chi-square test for independence
The chi-square test for independence may be required if two variables are both categorical or nominal (e.g.
gender and depression levels, smoking status and cough classes, sex and marital status, etc.), with
interest in checking whether the category of one variable is related to the category of the other variable.

Contingency table
In many cases, the categorical variables of interest have at least two levels each.

          Column 1            Column 2            …   Column n
Row 1     rc11 = O11, E11     rc12 = O12, E12     …   n1
Row 2     rc21 = O21, E21     rc22 = O22, E22     …   n2
⁝         ⁝                   ⁝                   ⁝   ⁝
Row m     rcm1                rcm2                …   N

Consider a contingency table having two rows and two columns (i.e. nr = nc = 2). The general
form of a 2x2 table is:

          Column 1                Column 2                Total
Row 1     a = rc11 = O11, E11     b = rc12 = O12, E12     a + b = O11 + O12
Row 2     c = rc21 = O21, E21     d = rc22 = O22, E22     c + d = O21 + O22
Total     a + c = O11 + O21       b + d = O12 + O22       N = (a + b) + (c + d)
▪ In this case, the chi-square statistic has the following simplified form:
$\chi^2 = \frac{N(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}$
▪ Under the null hypothesis, the χ²-statistic has a chi-square distribution with (nr-1)x(nc-1) degrees of
freedom, where nr and nc represent the number of rows and number of columns respectively.
Testing equality of two population proportions using data from two samples
• Ho: p1 = p2 Ho: p1 - p2 = 0
• Ha: p1 ≠ p2 HA: p1 - p2 ≠ 0
In the context of the 2x2 table, this is testing whether there is a relationship between the rows and
columns
Chi Square Test (χ²) for Independence
A common and useful test for an association between two group variables.
• Not a measure of effect size
• Has no outcome variable
The test statistic compares observed frequencies (O_i) with expected frequencies (E_i):
$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_{(nc-1)(nr-1)}$
Assumptions of Chi Square
1. No expected frequency should be less than 1 (it does not matter what the observed values are).
2. No more than one-fifth of the expected frequencies should be less than 5.
If these chi-square rules are violated, Fisher’s Exact Test should be used, i.e.
$P = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,N!}$
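
The factorial formula gives the probability of one particular 2x2 table when the row and column totals are held fixed. The sketch below evaluates it for a small hypothetical table (a = 3, b = 1, c = 1, d = 3, chosen only for illustration); the function name `table_probability` is our own. A full Fisher's exact test p-value adds up this probability over the observed table and every table that is at least as extreme, which statistical software does automatically.

```python
from math import comb, factorial


def table_probability(a, b, c, d):
    """Probability of a single 2x2 table with cells a, b, c, d (fixed margins)."""
    n = a + b + c + d
    numerator = (
        factorial(a + b) * factorial(c + d) * factorial(a + c) * factorial(b + d)
    )
    denominator = (
        factorial(a) * factorial(b) * factorial(c) * factorial(d) * factorial(n)
    )
    return numerator / denominator


# Hypothetical small table, used only to illustrate the arithmetic.
a, b, c, d = 3, 1, 1, 3
p_table = table_probability(a, b, c, d)

# The same probability written with binomial coefficients, as a cross-check.
p_check = comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)

print(f"P(table) = {p_table:.4f}")
```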
Chi-square is used to analyze qualitative data, when your data are things like:
• Male / Female
• Pass / Fail
• Good / Fair / Poor
• Device working / Device faulty
• Blood group A / B / AB / O

Chi Square Example 1


A biomedical engineering team is developing a smart food safety monitoring system for hospital cafeterias.
To validate their risk-prediction algorithm, they analyze data from a cohort study investigating whether
consumption of raw tomatoes is associated with Salmonella infection among patients and staff.
Two groups were followed:
• Group A: Individuals who ate tomatoes
• Group B: Individuals who did not eat tomatoes
The occurrence of Salmonella illness was recorded.

Tomato consumption   Salmonella (+)   Salmonella (–)   Total
Ate tomato           41               89               130
Did not eat          19               151              170
Total                60               240              300

Solution:
Step 1: State hypothesis
H0: There is no significant difference in salmonella illness between tomato eaters and non-tomato
eaters.
H1: There is a significant difference in salmonella illness between tomato eaters and non-tomato eaters.
Level of significance = 5%
Test statistic: $\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_{(nc-1)(nr-1)}$
Step 2: Calculate the Expected Frequencies (Eij)
                     Salmonella illness
                     Yes    No     Total
Ate tomato           41     89     130
Did not eat tomato   19     151    170
Total                60     240    300
The observed frequencies are given in the table. The corresponding expected frequency Eij for each cell is
obtained by multiplying the row total (RT) by the column total (CT) and dividing the result by the overall
total (OT):
$E_{ij} = \frac{RT \times CT}{OT}$

E11 = (130 × 60)/300 = 26, E12 = (130 × 240)/300 = 104, E21 = (170 × 60)/300 = 34, E22 = (170 × 240)/300 = 136

O E (O-E) (O-E)² (O-E)²/E


41 26 15 225 8.6538
89 104 -15 225 2.16346
19 34 -15 225 6.6176
151 136 15 225 1.6544
Total 19.0894

Degrees of freedom (df) = (nc-1)(nr-1) = (2-1)(2-1) = 1
nc = number of column categories and nr = number of row categories
At the 5% level of significance with 1 degree of freedom, the tabulated chi-square value is 3.8415.

Note: How 3.8415 is obtained from the chi-square table.

Step 1: Determine the degrees of freedom (df)
Step 2: Choose the significance level (5%)
Step 3: Enter the chi-square table (Row = df = 1, Column = 0.05)
Decision: Since the calculated value 19.089 is > 3.8415 (table), the null hypothesis H0 is rejected.
Conclusion: A significant difference was found in Salmonella illness between the tomato eaters and the non-tomato
eaters. It also indicates that an association was found between tomato consumption and Salmonella illness.
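The hand calculation for the tomato example can be cross-checked with a short script. This is a minimal sketch in plain Python rather than the SPSS procedure used in these notes; the function name `chi_square_statistic` is an assumption for illustration, and the p-value line relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so it is valid for df = 1 only.

```python
import math

# Observed counts: rows = ate tomato / did not eat, columns = Salmonella (+) / (-)
observed = [[41, 89], [19, 151]]

def chi_square_statistic(table):
    """Return (chi-square statistic, expected counts) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    expected = []
    for i, row in enumerate(table):
        exp_row = []
        for j, obs in enumerate(row):
            e = row_totals[i] * col_totals[j] / total  # E = RT x CT / OT
            exp_row.append(e)
            chi2 += (obs - e) ** 2 / e
        expected.append(exp_row)
    return chi2, expected

chi2, expected = chi_square_statistic(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail probability for df = 1 only: P(chi2_1 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

critical_value = 3.8415  # tabulated value for df = 1 at the 5% level
reject_h0 = chi2 > critical_value

print(f"expected = {expected}")
print(f"chi-square = {chi2:.4f}, df = {df}, p = {p_value:.6f}, reject H0: {reject_h0}")
```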
From SPSS software: The output table provides a cross-tabulation as well as the chi-square test.
Procedure in SPSS: Click on Analyze > Descriptive Statistics > Crosstabs
tomato * salmonella Crosstabulation
                                 Salmonella
                                 Illness   No illness   Total
Tomato   Yes   Count             41        89           130
               Expected Count    26.0      104.0        130.0
         No    Count             19        151          170
               Expected Count    34.0      136.0        170.0
Total          Count             60        240          300
               Expected Count    60.0      240.0        300.0

The probability value (p-value) is displayed as 0.000 in the table (i.e., p < 0.0001), which is far less than 0.05, and the
chi-square value is 19.089. A highly significant difference in Salmonella illness was found between the tomato
eaters and the non-tomato eaters.

                               Value    df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                             (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             19.089   1    .000
Continuity Correction(b)       17.838   1    .000
Likelihood Ratio               19.108   1    .000
Fisher's Exact Test                                        .000         .000
Linear-by-Linear Association   19.026   1    .000
N of Valid Cases               300

Example 2: A hospital biomedical engineering department is evaluating the impact of operator training
level on the fault rate of infusion pumps used in clinical wards. The goal is to determine whether additional
technical training is necessary to reduce equipment downtime and improve patient safety.

Training level Low fault Moderate High fault


Basic 21 64 17
Intermediate 16 49 14
Advanced 29 93 28
Is there sufficient evidence at α = 0.05 to conclude that operator training level is associated with equipment
fault rate?

Hypothesis

H0: Equipment fault rate is independent of operator training level.


H1: Equipment fault rate is associated with operator training level.
Level of significance = 5%
Test statistic: $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_{(n_c-1)(n_r-1)}$
Calculation of Expected Frequencies (Eij)

The observed frequencies are given in the table. The expected frequency Eij for each cell is obtained by
multiplying the row total (RT) by the column total (CT) and dividing the result by the overall total (OT):
$E_{ij} = \frac{RT \times CT}{OT}$

Training level   Low fault    Moderate     High fault   Total
Basic            21 (20.34)   64 (63.48)   17 (18.18)   102
Intermediate     16 (15.75)   49 (49.17)   14 (14.08)   79
Advanced         29 (29.91)   93 (93.35)   28 (26.74)   150
Total            66           206          59           331
(Expected frequencies are shown in parentheses.)

O E (O-E) (O-E)² (O-E)²/E


21 20.34 0.66 0.4356 0.02142
64 63.48 0.52 0.2704 0.00426
17 18.18 -1.18 1.3924 0.07659
16 15.75 0.25 0.0625 0.00397
49 49.17 -0.17 0.0289 0.00059
14 14.08 -0.08 0.0064 0.00045
29 29.91 -0.91 0.8281 0.02769
93 93.35 -0.35 0.1225 0.00131
28 26.74 1.26 1.5876 0.05937
Total (∑) 0.195646
Degrees of freedom (df) = (nc-1)(nr-1) = (3-1)(3-1) = 2 x 2 = 4
nc = number of column categories and nr = number of row categories
At the 5% level of significance with 4 degrees of freedom, the tabulated value is χ² = 9.48773
Decision: Since the calculated χ² = 0.195646 is < 9.488, we fail to reject the null hypothesis H0.
Conclusion: At the 5% significance level, there is no statistically significant association between operator
training level and equipment fault rate. This suggests that equipment fault rate does not depend on the operator’s
training level, and other factors (e.g., maintenance protocols, device quality, workflow) may be more important
in determining equipment reliability.
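For a table larger than 2x2 the same steps apply, and it is also worth checking the expected-count rules before trusting the result. The sketch below is an illustrative Python version for the training-level data, not the SPSS workflow of these notes. The helper `chi2_upper_tail_even_df` is a hypothetical name; it uses the closed form for the chi-square upper tail that holds only when the degrees of freedom are even.

```python
import math

observed = [
    [21, 64, 17],   # Basic
    [16, 49, 14],   # Intermediate
    [29, 93, 28],   # Advanced
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell: row total x column total / overall total
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Assumption check: no expected count below 1, at most one-fifth below 5
flat_expected = [e for row in expected for e in row]
rules_ok = (min(flat_expected) >= 1
            and sum(e < 5 for e in flat_expected) <= len(flat_expected) / 5)

chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))
df = (len(observed) - 1) * (len(observed[0]) - 1)

def chi2_upper_tail_even_df(x, df):
    """P(chi-square > x) for even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!."""
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

p_value = chi2_upper_tail_even_df(chi2, df)
print(f"chi-square = {chi2:.3f}, df = {df}, p = {p_value:.3f}, rules ok: {rules_ok}")
```

Because the script keeps the expected counts unrounded, its statistic can differ from the hand calculation in the later decimal places.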
Software Solution: SPSS SOFTWARE OUTPUT
Handicap * Performance Crosstabulation
                                          Performance
                                          above average   average   below average   Total
Handicap   blind         Count            21              64        17              102
                         Expected Count   20.3            63.5      18.2            102.0
           deaf          Count            16              49        14              79
                         Expected Count   15.8            49.2      14.1            79.0
           no handicap   Count            29              93        28              150
                         Expected Count   29.9            93.4      26.7            150.0
Total                    Count            66              206       59              331
                         Expected Count   66.0            206.0     59.0            331.0

Chi-Square Tests
                               Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square             .196(a)   4    .995
Likelihood Ratio               .197      4    .995
Linear-by-Linear Association   .174      1    .677
N of Valid Cases               331

The probability value (p-value) is 0.995, which is greater than 0.05, and the chi-square value is 0.196. In this
SPSS output the row variable is labelled Handicap and the column variable Performance; the counts are those of
training level and fault category in Example 2. No significant association was found between the two variables.
Conclusion: There is no evidence that operator training level influences equipment fault rate.

Assignment:

A team installs smart hand hygiene sensors in an ICU. They record staff compliance (Yes/No) and whether
nosocomial infections occur in patients.

Staff Compliance Infection (+) Infection (–) Total

Compliant 5 95 100

Non-compliant 20 80 100

Total 25 175 200

Question:

• Test if there is an association between hand hygiene compliance and patient infection rates.

Question 2:
A biomedical engineer compares two catheter materials (Silicone vs. PTFE) and the occurrence of catheter-
associated urinary tract infections (CAUTI).
Catheter Type CAUTI (+) CAUTI (–) Total

Silicone 8 42 50
PTFE 3 47 50
Total 11 89 100
Question:
• Is there a significant association between catheter type and CAUTI incidence?

Chi Square Distribution Table


df 0.995 0.99 0.975 0.95 0.90 0.10 0.05 0.025 0.01 0.005
1 --- --- 0.001 0.004 0.016 2.706 3.841 5.024 6.635 7.879
2 0.010 0.020 0.051 0.103 0.211 4.605 5.991 7.378 9.210 10.597
3 0.072 0.115 0.216 0.352 0.584 6.251 7.815 9.348 11.345 12.838
4 0.207 0.297 0.484 0.711 1.064 7.779 9.488 11.143 13.277 14.860
5 0.412 0.554 0.831 1.145 1.610 9.236 11.070 12.833 15.086 16.750
6 0.676 0.872 1.237 1.635 2.204 10.645 12.592 14.449 16.812 18.548
7 0.989 1.239 1.690 2.167 2.833 12.017 14.067 16.013 18.475 20.278
8 1.344 1.646 2.180 2.733 3.490 13.362 15.507 17.535 20.090 21.955
9 1.735 2.088 2.700 3.325 4.168 14.684 16.919 19.023 21.666 23.589
10 2.156 2.558 3.247 3.940 4.865 15.987 18.307 20.483 23.209 25.188
11 2.603 3.053 3.816 4.575 5.578 17.275 19.675 21.920 24.725 26.757
12 3.074 3.571 4.404 5.226 6.304 18.549 21.026 23.337 26.217 28.300
13 3.565 4.107 5.009 5.892 7.042 19.812 22.362 24.736 27.688 29.819
14 4.075 4.660 5.629 6.571 7.790 21.064 23.685 26.119 29.141 31.319
15 4.601 5.229 6.262 7.261 8.547 22.307 24.996 27.488 30.578 32.801
16 5.142 5.812 6.908 7.962 9.312 23.542 26.296 28.845 32.000 34.267
17 5.697 6.408 7.564 8.672 10.085 24.769 27.587 30.191 33.409 35.718
18 6.265 7.015 8.231 9.390 10.865 25.989 28.869 31.526 34.805 37.156
19 6.844 7.633 8.907 10.117 11.651 27.204 30.144 32.852 36.191 38.582
20 7.434 8.260 9.591 10.851 12.443 28.412 31.410 34.170 37.566 39.997
21 8.034 8.897 10.283 11.591 13.240 29.615 32.671 35.479 38.932 41.401
22 8.643 9.542 10.982 12.338 14.041 30.813 33.924 36.781 40.289 42.796
23 9.260 10.196 11.689 13.091 14.848 32.007 35.172 38.076 41.638 44.181
24 9.886 10.856 12.401 13.848 15.659 33.196 36.415 39.364 42.980 45.559

MEAN COMPARISON TEST

Common mean comparison tests include the Z test, the t-test (Student's t-test) and the one-way analysis of variance (ANOVA)
test. The Z-test compares the mean of a set of measurements to a given constant when the population variance is known.
It is expected to satisfy the following conditions: the observed data X1, ..., Xn
i) are independent,
ii) have a common mean µ, and
iii) have a constant variance σ².
Then the sample average X̄ has mean µ and variance σ²/n.

The null hypothesis is that the mean value of X is a given number µ0.
We can use X̄ as a test statistic, rejecting the null hypothesis if |X̄ − µ0| is large.
Test statistic: Z = (X̄ − µ0) / s, where s is the standard deviation of X̄ (its standard error), with s² = σ²/n.
• Use z scores for large sample sizes (n ≥ 30)
• Use the Student's t-test for small sample sizes (n < 30)

Test based on the t-DISTRIBUTION (Student's t-test)

The t-test was developed in 1908 by W.S. Gosset, an employee of the Guinness brewery in Dublin.
The t-test is based on assumptions of normality and homogeneity of variance (the t distribution is a symmetric distribution).

Types of t-test:
The t-test comes in 3 types, namely the one-sample t-test, the independent-samples t-test and the paired-samples t-test

One Sample T-Test- (Single Sample t Test with only 1 group);


This is a t test of one group against a hypothetical mean (standard mean value). The basic idea of the test is a
comparison of the average of the sample (observed average) and the population (expected average), with an
adjustment for the number of cases in the sample and the standard deviation of the average. This variation compares
one sample mean against a mean derived from an independent source, for example a published source.
The One Sample t Test is commonly used to test the following:

• Statistical difference between a sample mean and a known or hypothesized value of the mean in the
population.
• Statistical difference between the sample mean and the sample midpoint of the test variable.
• Statistical difference between the sample mean of the test variable and chance.
o This approach involves first calculating the chance level on the test variable. The chance level is
then used as the test value against which the sample mean of the test variable is compared.
• Statistical difference between a change score and zero.
o This approach involves creating a change score from two variables, and then comparing the mean
change score to zero, which will indicate whether any change occurred between the two time points
for the original measures. If the mean change score is not significantly different from zero, no
significant change occurred.
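Formula: t = (X̄ − μ0) / (s/√n) with n − 1 degrees of freedom, where X̄ is the sample mean, s the sample standard
deviation, n the sample size and μ0 the hypothesized mean.

H0: μ = μ0 (there is no difference between the sample mean and the population parameter)
Ha: μ ≠ μ0 (there is a difference between the sample mean and the population parameter)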
Assumptions in One sample t test
• The dependent variable must be continuous (interval/ratio).
• The observations are independent of one another.
• The dependent variable should be approximately normally distributed.
• The dependent variable should not contain any outliers.
Examples
After several patients on hemodialysis developed infections, the hospital’s biomedical engineering team
wants to determine whether the water used in a particular model of dialysis machine has bacterial
contamination above a dangerous level. The safe limit of bacterial contamination in dialysis water is 0.3
MPN/mL. The team sampled 9 different dialysis machines from the same production batch and measured
bacterial levels (in MPN/mL): 0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418
Task: Use this data to test if the average bacterial contamination exceeds the safe limit of 0.3 MPN/mL.
Null hypothesis H0: μ = 0.3 (average bacterial level is safe)
Alternative hypothesis Ha: μ > 0.3 (average bacterial level is above safe limit)
Solution:
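Let X̄ represent the sample mean and s the sample standard deviation; n = number of observations = 9 and Xi = sample value.
S/N         Xi      Xi − X̄      (Xi − X̄)²
1           0.593   0.136556    0.018647
2           0.142   -0.31444    0.098875
3           0.329   -0.12744    0.016242
4           0.691   0.234556    0.055016
5           0.231   -0.22544    0.050825
6           0.793   0.336556    0.11327
7           0.519   0.062556    0.003913
8           0.392   -0.06444    0.004153
9           0.418   -0.03844    0.001478
Total (∑)   4.108               0.36242

X̄ = ∑Xi / n = 4.108 / 9 = 0.4564
s = √(∑(Xi − X̄)² / (n − 1)) = √(0.36242 / 8) = 0.2128
t = (X̄ − μ0) / (s/√n) = (0.4564 − 0.3) / (0.2128/√9) = 0.1564 / 0.0709 = 2.205, with n − 1 = 8 degrees of freedom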
SPSS SOFTWARE OUTPUT
One-Sample Statistics
         N   Mean      Std. Deviation   Std. Error Mean
sample   9   .456444   .2128439         .0709480
One-Sample Test (Test Value = 0.3)
         t       df   Sig. (2-tailed)   Mean Difference   95% Confidence Interval of the Difference
                                                          Lower      Upper
sample   2.205   8    .059              .1564444          -.007162   .320051
Note: Sig. = 0.059 is the two-tailed p-value reported by SPSS; it is not less than 0.05, so the two-sided test is not
significant at the 5% level. Since Ha here is one-sided (μ > 0.3) and the sample mean lies above 0.3, the corresponding
one-tailed p-value is 0.059/2 ≈ 0.030, which is below 0.05.
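The one-sample statistic for the dialysis-water data can be reproduced with Python's standard library. This is a minimal sketch for checking the arithmetic, not a replacement for the SPSS output; the two critical values are the usual t-table entries for 8 degrees of freedom and are typed in by hand as assumptions.

```python
import math
import statistics

samples = [0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418]
mu0 = 0.3  # safe limit in MPN/mL

n = len(samples)
mean = statistics.mean(samples)
s = statistics.stdev(samples)      # sample standard deviation (divides by n - 1)
se = s / math.sqrt(n)              # standard error of the mean
t_stat = (mean - mu0) / se
df = n - 1

# Standard t-table values for df = 8 at alpha = 0.05 (entered manually)
t_crit_two_tailed = 2.306
t_crit_one_tailed = 1.860

print(f"mean = {mean:.6f}, s = {s:.6f}, SE = {se:.6f}, t = {t_stat:.3f}, df = {df}")
print("two-tailed test rejects H0:", abs(t_stat) > t_crit_two_tailed)
print("one-tailed test (mu > mu0) rejects H0:", t_stat > t_crit_one_tailed)
```

The two comparisons are printed side by side because the direction of the alternative hypothesis decides which critical value applies.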

INDEPENDENT SAMPLES T-TEST (i.e., 2 groups, 2 means; no relation between groups)

The independent-samples t-test compares the means of two samples which do not directly influence each other
(the samples are two different groups of people or things), e.g., the average income of a group of males and a group of
females, or the mean weight of patients on an active drug versus those on placebo.

Practice Example: Two health quality-control technologists measured the surface finish of a
metal part, obtaining the data shown below.

Technologist 1   Technologist 2
1.45             1.54
1.37             1.41
1.21             1.56
1.54             1.37
1.48             1.20
1.29             1.31
1.34             1.27
                 1.35
Assume that the measurements are normally distributed and that the variances are equal. Is there a
difference in the mean surface-finish measurements obtained by the two technologists, using α = 0.05?

Null Hypothesis H0:
μ1 = μ2 (The mean surface finish measured by Technologist 1 is equal to the mean measured by
Technologist 2).

Alternative Hypothesis Ha:
μ1 ≠ μ2 (The mean surface finish measured by Technologist 1 is not equal to the mean measured by
Technologist 2).

(i.e., test H0: μ1 = μ2 vs. H1: μ1 ≠ μ2).
Solution

(Let Technologist 1 = X1 and Technologist 2 = X2)

S/N     Technologist 1 (X1)   Technologist 2 (X2)   (X1)²     (X2)²
1       1.45                  1.54                  2.1025    2.3716
2       1.37                  1.41                  1.8769    1.9881
3       1.21                  1.56                  1.4641    2.4336
4       1.54                  1.37                  2.3716    1.8769
5       1.48                  1.20                  2.1904    1.44
6       1.29                  1.31                  1.6641    1.7161
7       1.34                  1.27                  1.7956    1.6129
8                             1.35                            1.8225
Total   9.68                  11.01                 13.4652   15.2617
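The pooled-variance (equal variances) t statistic for the surface-finish data can be sketched as follows. This is an illustrative Python calculation under the equal-variance assumption stated in the question; the variable names are hypothetical, and the critical value 2.160 is the standard two-tailed t-table entry for 13 degrees of freedom at α = 0.05, entered by hand.

```python
import math
import statistics

technologist_1 = [1.45, 1.37, 1.21, 1.54, 1.48, 1.29, 1.34]
technologist_2 = [1.54, 1.41, 1.56, 1.37, 1.20, 1.31, 1.27, 1.35]

n1, n2 = len(technologist_1), len(technologist_2)
mean1 = statistics.mean(technologist_1)
mean2 = statistics.mean(technologist_2)
var1 = statistics.variance(technologist_1)   # sample variance, divisor n - 1
var2 = statistics.variance(technologist_2)

# Pooled variance weights each sample variance by its degrees of freedom
df = n1 + n2 - 2
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat = (mean1 - mean2) / se_diff

t_crit = 2.160  # t table: two-tailed, alpha = 0.05, df = 13
reject_h0 = abs(t_stat) > t_crit

print(f"mean1 = {mean1:.4f}, mean2 = {mean2:.4f}, df = {df}")
print(f"t = {t_stat:.3f}, reject H0: {reject_h0}")
```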
Example 2: In an environmental health management setting, two catalysts are being analyzed to determine how they
affect the mean yield of a chemical process. Specifically, catalyst 1 is currently in use, but catalyst 2 is acceptable.
Since catalyst 2 is cheaper, it should be adopted, provided it does not change the process yield. An experiment is run
in the pilot plant and results in the data table below. Is there any difference between the mean yields? Use α = 0.05
and assume equal variances.
Number       1       2       3       4       5       6       7       8
Catalyst 1   91.50   94.18   92.18   95.39   91.79   89.07   94.72   89.12
Catalyst 2   89.19   90.95   90.46   93.21   97.19   97.04   91.07   92.75

Solution:
H0: μ1 − μ2 = 0 vs. H1: μ1 − μ2 ≠ 0
Number   Catalyst 1   Catalyst 2   (Catalyst 1)²   (Catalyst 2)²
1        91.50        89.19        8372.25         7954.856
2        94.18        90.95        8869.872        8271.903
3        92.18        90.46        8497.152        8183.012
4        95.39        93.21        9099.252        8688.104
5        91.79        97.19        8425.404        9445.896
6        89.07        97.04        7933.465        9416.762
7        94.72        91.07        8971.878        8293.745
8        89.12        92.75        7942.374        8602.563
Total    737.95       741.86       68111.65        68856.84

Decision: The calculated t is -0.36 and the critical value is t0.025 at 14 d.f. = 2.145. Since -2.145 < -0.36 < 2.145, the null hypothesis cannot be rejected

Conclusion: we do not have strong evidence to conclude that catalyst 2 results in a mean yield that differs from the
mean yield when catalyst 1 is used.
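The catalyst comparison can also be checked from the column totals, using the shortcut ∑x² − (∑x)²/n for each corrected sum of squares, and extended to a 95% confidence interval for the difference in means. The following Python sketch is illustrative only; the helper name `corrected_sum_of_squares` is an assumption, and the critical value 2.145 is the one quoted in the decision above.

```python
import math

catalyst_1 = [91.50, 94.18, 92.18, 95.39, 91.79, 89.07, 94.72, 89.12]
catalyst_2 = [89.19, 90.95, 90.46, 93.21, 97.19, 97.04, 91.07, 92.75]

def corrected_sum_of_squares(values):
    """Sum of squared deviations via the shortcut sum(x^2) - (sum x)^2 / n."""
    return sum(v * v for v in values) - sum(values) ** 2 / len(values)

n1, n2 = len(catalyst_1), len(catalyst_2)
mean1 = sum(catalyst_1) / n1
mean2 = sum(catalyst_2) / n2

df = n1 + n2 - 2
pooled_var = (corrected_sum_of_squares(catalyst_1)
              + corrected_sum_of_squares(catalyst_2)) / df
se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat = (mean1 - mean2) / se_diff

t_crit = 2.145  # t0.025 at 14 d.f.
margin = t_crit * se_diff
ci_lower = (mean1 - mean2) - margin
ci_upper = (mean1 - mean2) + margin

print(f"t = {t_stat:.3f}, df = {df}, SE = {se_diff:.5f}")
print(f"95% CI for mean difference: ({ci_lower:.3f}, {ci_upper:.3f})")
```

An interval that contains zero leads to the same decision as the t statistic falling between the two critical values.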

SPSS SOFTWARE OUTPUT

Group        N   Mean      Std. Deviation   Std. Error Mean
Catalyst 1   8   92.2438   2.40159          .84909
Catalyst 2   8   92.7325   2.98345          1.05481

Levene's Test for Equality of Variances and t-test for Equality of Means
                              Levene's Test    t-test for Equality of Means
                              F      Sig.      t       df       Sig. (2-tailed)   Mean Difference   Std. Error Difference   95% Conf. Int. of the Difference
                                                                                                                            Lower      Upper
Equal variances assumed       .324   .578      -.361   14       .724              -.48875           1.35410                 -3.3930    2.4155
Equal variances not assumed                    -.361   13.389   .724              -.48875           1.35410                 -3.4055    2.4280

Note:
i. We do not want Levene's test for equality of variances to show significance, so that the t-test assumption of equal
variances is not violated. Here Sig. = 0.578, which is greater than 0.05.
ii. The p-value is not significant at 5% (p-value = 0.724). It implies that the means, 92.24 for catalyst 1 and 92.73 for
catalyst 2, did not differ significantly between the two groups.

PAIRED SAMPLE T-TEST (Dependent or Related). Here we have two means.

In the paired or related-samples t-test, samples are collected from the same group of people. It compares the means
of two samples which you expect to be connected (often, data are from the same sample at two different times), e.g.,
left hand versus right hand, BP taken twice on a group of patients, or the mean before and after an intervention (for
the same group).

Formal Hypothesis Statement:

H0: μdiff = 0 versus H1: μdiff ≠ 0; or H0: μdiff = 0 versus H1: μdiff < 0; or H0: μdiff = 0 versus H1: μdiff > 0.
Test statistic for dependent (paired) samples:
$t = \frac{\bar{D}}{SE}$, where $SE = \frac{s}{\sqrt{n}}$, i.e. $t = \frac{\bar{D}}{s/\sqrt{n}}$, with $\bar{D} = \frac{\sum d_i}{n}$

D̄ is the mean paired difference (i.e., the first sample mean minus the second sample mean), SE is the standard error
of the mean difference, and s is the standard deviation of the paired differences.
Practical Example: A biomedical engineering research team developed a prototype non-invasive optical SpO₂
sensor for continuous patient monitoring. Its performance was evaluated by comparing measurements from the
prototype with those from a standard hospital pulse oximeter. Ten volunteers participated, with oxygen saturation
recorded for each subject using both devices. Because measurements were taken from the same individuals, a
paired parametric test was applied to determine whether the prototype sensor differed significantly from the
clinical standard.
Prototype Sensor   Reference Device
196                192
190                187
155                149
199                200
190                183
203                203
237                242
202                194
228                223
212                207
Null hypothesis (H₀): There is no significant difference between measurements from the prototype
sensor and the reference device.
Alternative hypothesis (H₁): There is a significant difference between measurements.
Solution
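Prototype Sensor   Reference Device   di   di²
196                192                4    16
190                187                3    9
155                149                6    36
199                200                -1   1
190                183                7    49
203                203                0    0
237                242                -5   25
202                194                8    64
228                223                5    25
212                207                5    25
Total                                 32   250

D̄ = ∑di / n = 32 / 10 = 3.2
s = √((∑di² − (∑di)²/n) / (n − 1)) = √((250 − 102.4) / 9) = √16.4 = 4.050
t calculated = D̄ / (s/√n) = 3.2 / (4.050/√10) = 3.2 / 1.281 = 2.499
t critical = tα/2 = t0.05/2 = t0.025 at 9 d.f. = 2.262. Since t calculated > t critical, H0 is rejected.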


Conclusion: There is a statistically significant difference between measurements obtained using the prototype
sensor and the standard pulse oximeter.
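The paired calculation works on the per-subject differences only, which the following minimal Python sketch makes explicit. It is an illustrative cross-check rather than the SPSS procedure; the critical value 2.262 is the tabulated t0.025 at 9 d.f. quoted in the solution.

```python
import math
import statistics

prototype = [196, 190, 155, 199, 190, 203, 237, 202, 228, 212]
reference = [192, 187, 149, 200, 183, 203, 242, 194, 223, 207]

# Paired design: analyse the difference within each subject
differences = [p - r for p, r in zip(prototype, reference)]
n = len(differences)

mean_diff = statistics.mean(differences)
sd_diff = statistics.stdev(differences)   # standard deviation of the differences
se = sd_diff / math.sqrt(n)
t_stat = mean_diff / se
df = n - 1

t_crit = 2.262  # t0.025 at 9 d.f.
reject_h0 = abs(t_stat) > t_crit

print(f"sum d = {sum(differences)}, sum d^2 = {sum(d * d for d in differences)}")
print(f"mean d = {mean_diff:.2f}, s = {sd_diff:.3f}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```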

Practice Ex 1
The following shows the heights in centimetres of 24 two-year-old Nigerian boys with homozygous sickle cell
disease.
84.4, 80.6, 85.0, 89.9, 80.0, 82.5, 89.0, 81.3, 80.7, 81.9, 86.8, 84.3,
87.0, 83.4, 85.4, 78.5, 89.8, 85.0, 84.1, 85.4, 85.5, 86.3, 80.6, 81.9
(i) Calculate the mean.
(ii) Calculate the standard deviation and standard error of the heights.
(iii) If the average height of a two-year-old in the UK is 86.5 cm, what can you say about the effect of sickle cell
disease on height?

Practice Ex 2
Patients who underwent an operation were randomized to two groups: one (Group 1) receiving post-operative care in
hospital and the other (Group 2) receiving post-operative care at home. The patients were asked to rate their
satisfaction with the care received on a scale of 0-100. The results are given below. Use the t-test to compare the
satisfaction levels in the two treatment groups.
Group 1 77.1 100 75.4 79.5 78.6 99.0 84.9 92.7 78.2 100 72.2
Group 2 75.8 68.1 70.2 72.1 91.3 74.2 76.8 76.2 60.2 87.0 90.1
Practice Ex 3
The following data were collected in a small clinical trial intended to reduce blood pressure.
Before = blood pressure before treatment; After = blood pressure after treatment
Note: a placebo is a chemically inert substance which is known to have no physical effect but which is similar in
appearance to a conventional medicine.
Compare the before and after blood pressures in the treatment (calcium) group. (This can be reframed as: does
blood pressure reduce in the treatment (calcium) group?)
Treatment   Before   After
calcium     107      100
calcium     110      114
calcium     123      105
calcium     129      112
calcium     112      115
calcium     111      116
calcium     107      106
calcium     112      102
calcium     136      125
calcium     102      104

Practice Ex 4: The following data were collected in a small trial intended to test the effect of calcium
on the reduction of blood pressure. Two groups of people with high blood pressure were used: one group
received the calcium treatment and the other received a placebo. Test whether there are differences in
BP reduction between the two groups. Note: a placebo is an inactive substance which is similar in
appearance to a conventional medicine.

Treatment (calcium) Placebo


107 109
110 112
123 102
129 98
112 114
111 119
107 112
112 110
136 117
102 130

CORRELATION
Correlation quantifies (puts a number to) the strength of the linear relationship between two variables and also
indicates the direction of the relationship. A correlation simply indicates that there is a relationship between the two
variables. The correlation coefficient, r, measures the strength of the linear relationship. Three types of linear
correlation can be considered, they are positive correlation, negative correlation and an absence of linear correlation
(no correlation)
Value of the Correlation Co-Efficient (r) Strength of the Correlation
1 or -1 Perfect
0.8 to 0.99 (or -0.8 to -0.99) Very Strong
0.6 to 0.79 (or -0.6 to -0.79) Strong
0.4 to 0.59 (or -0.4 to -0.59) Moderate
0.1 to 0.39 (or -0.1 to -0.39) Weak
0 Zero (no correlation)

Scatterplots and Correlation Coefficients


The scatterplots below show how different patterns of data produce different degrees of correlation.

Note
▪ The value of r is between +1 and -1. Values of r close to +1 represent a strong positive linear relationship, and values
of r close to -1 represent a strong negative linear relationship. A value of exactly +1 or -1 occurs when the data points
fall exactly on a straight line.
▪ The correlation becomes weaker as the data points become more scattered. A value of r close to 0 means that the
linear association is very weak.
▪ If the data points fall in a random pattern, the correlation is close to zero. It could be that there is NO association
at all, or that the relationship is non-linear.
▪ Correlation is affected by outliers. Compare the first scatterplot with the last scatterplot. The single outlier in the
last plot greatly reduces the correlation (from 1.00 to 0.71).
Common Ways to Calculate a Correlation Coefficient
1. Pearson product-moment correlation (parametric method):
$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - (\sum x)^2\right)\left(n\sum y^2 - (\sum y)^2\right)}}$

2. Spearman's rank correlation coefficient (non-parametric method):
$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$

Example: A tobacco company wishes to know whether heavy smoking is related to longevity. From a sample of
recently deceased smokers, the number of cigarettes smoked per day over their last five years (estimated after visits
with their surviving relatives) is paired with the number of years that they lived.
Cigarette (x)     25 35 10 40 85 75 60 45 50
Years lived (y)   63 68 72 62 65 46 51 60 55

i. Obtain the Pearson correlation coefficient (or Spearman's rank correlation coefficient)
ii. Interpret the correlation result
iii. Test if the coefficient is significant at α = 5%

Solution: The Pearson correlation coefficient is given by
$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - (\sum x)^2\right)\left(n\sum y^2 - (\sum y)^2\right)}}$

Subject Cigarette (x) Years lived (y) xy x² y²


1 25 63 1575 625 3969
2 35 68 2380 1225 4624
3 10 72 720 100 5184
4 40 62 2480 1600 3844
5 85 65 5525 7225 4225
6 75 46 3450 5625 2116
7 60 51 3060 3600 2601
8 45 60 2700 2025 3600
9 50 55 2750 2500 3025
Total 425 542 24640 24525 33188
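The column totals in the table feed straight into the Pearson formula. The sketch below is a minimal Python illustration that rebuilds those totals from the raw data and evaluates r; the variable names are hypothetical and only the standard library is used.

```python
import math

cigarettes = [25, 35, 10, 40, 85, 75, 60, 45, 50]     # x
years_lived = [63, 68, 72, 62, 65, 46, 51, 60, 55]    # y

n = len(cigarettes)
sum_x = sum(cigarettes)
sum_y = sum(years_lived)
sum_xy = sum(x * y for x, y in zip(cigarettes, years_lived))
sum_x2 = sum(x * x for x in cigarettes)
sum_y2 = sum(y * y for y in years_lived)

# r = (n*Sxy - Sx*Sy) / sqrt((n*Sx2 - Sx^2) * (n*Sy2 - Sy^2))
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(f"sums: x={sum_x}, y={sum_y}, xy={sum_xy}, x2={sum_x2}, y2={sum_y2}")
print(f"r = {r:.4f}")
```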

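Solution
r = (9 × 24640 − 425 × 542) / √((9 × 24525 − 425²)(9 × 33188 − 542²)) = −8590 / √(40100 × 4928) = −8590 / 14057.5 = −0.611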
Method 2
Spearman's rank correlation: $r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$

Cigarette (X)   Years lived (Y)   Rank for X (RX)   Rank for Y (RY)   Difference (di = RX − RY)   Squared difference (di²)
25              63                2                 6                 -4                          16
35              68                3                 8                 -5                          25
10              72                1                 9                 -8                          64
40              62                4                 5                 -1                          1
85              65                9                 7                 2                           4
75              46                8                 1                 7                           49
60              51                7                 2                 5                           25
45              60                5                 4                 1                           1
50              55                6                 3                 3                           9
                                                                                                  ∑di² = 194
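ii. r = -0.61 (Pearson) or rs = -0.62 (Spearman) indicates a strong negative correlation (i.e., more smoking is associated with reduced longevity).
iii. The test procedure is as follows: H0: ρ = 0 vs. H1: ρ ≠ 0
Test statistic: $t_{calculated} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{-0.611\sqrt{9-2}}{\sqrt{1-(-0.6111)^2}} = -2.0425$
Significance level α = 5%. The statistic has n − 2 = 7 degrees of freedom; from the t table, t0.025,7 = 2.365.
Decision criterion: Reject H0 if |t calculated| > tα/2.
Decision: |t calculated| = 2.04 < 2.365, so H0 is not rejected.
Conclusion: Although the sample correlation is strongly negative, with only 9 subjects the association between smoking and years lived is not statistically significant at the 5% level.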
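The significance test for a correlation coefficient needs only r and n. The following Python sketch is a hedged illustration: the function name `correlation_t_statistic` is hypothetical, and the critical value 2.365 is the standard two-tailed t-table entry for n − 2 = 7 degrees of freedom at α = 0.05, typed in by hand.

```python
import math

def correlation_t_statistic(r: float, n: int) -> float:
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = -0.6111   # Pearson coefficient from the smoking example
n = 9

t_stat = correlation_t_statistic(r, n)
df = n - 2
t_crit = 2.365  # t table: two-tailed, alpha = 0.05, df = 7
reject_h0 = abs(t_stat) > t_crit

print(f"t = {t_stat:.4f}, df = {df}, reject H0: {reject_h0}")
```

Comparing the absolute value of t with the critical value handles negative correlations correctly in a two-sided test.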

REGRESSION ANALYSIS
Regression analysis is a statistical methodology that utilizes the relation between two or more quantitative variables so that
a response or outcome variable can be predicted from the other, or others. This methodology is widely used in health and
biological sciences, and many other disciplines. An example of its application is predicting the length of hospital
stay of a surgical patient by utilizing the relationship between the time in the hospital and the severity of the operation.
Regression analysis serves three major purposes: (1) description, (2) control, and (3) prediction.
Note:
• The primary goal of quantitative analysis is to use current information about a phenomenon to predict its future behavior.
• Current information is usually in the form of a set of data.
• In a simple case, when the data form a set of pairs of numbers, we may interpret them as representing the observed values
of an independent (or predictor) variable X and a dependent (or response) variable Y.
• The goal of the analyst who studies the data is to find a functional relation between the response variable y and the predictor
variable x.
Historical Origin of Regression
Historically, regression analysis was first developed by Sir Francis Galton in the latter part of the 19th century. He studied
the relation between heights of sons and fathers. Heights of sons of both tall and short fathers appeared to “revert” or
“regress” to the mean of the group. Galton considered this tendency to be a regression to “mediocrity”. He developed a
mathematical description of this regression tendency. The term regression persists to this day to describe statistical relations
between variables.

RELATIONS BETWEEN VARIABLES


The concept of a relation between two variables, such as between family income and family expenditures for
housing, is a familiar one. We distinguish between a functional relation and a statistical relation, and consider
each of these in turn.

Functional Relation between Two Variables


A functional relation between two variables is expressed by a mathematical formula. If X denotes the
independent variable and Y the dependent variable, a functional relation is Y = f(X)
Given a particular value of X, the function f indicates the corresponding value of Y. For the observations plotted on
Figure below, note that all values fall directly on the line of functional relationship.

Statistical Relation between Two Variables
A statistical relation, unlike a functional relation, is not a perfect one. In general, the observations for a statistical
relation do not fall directly on the curve of relationship.

NOTE: In Regression Function

For each X, take f(x) to be the expected value (i.e., mean value) of Y. E(Y) denotes the expected value of Y.

Linear Regression / Linear Regression model:

Given data points (Xi, Yi) for i = 1,...,n, interest is in the probability distribution of Y as a function of X (i.e., Y =
f(X)).
The mean of Y is a straight-line function of X, and each observation equals this mean plus an error term or residual:
y = f(x) + ε

The goal is to find the best-fit line, which minimizes the sum of the squared error terms.

Simple linear regression model:


Let's begin by looking at the simplest case, where there are two variables, one explanatory (X) and one response
variable (Y). A change in X causes a change in Y, as shown in Fig below.

It is always worth viewing your data (if possible) before performing regressions to get an idea as to the type of
relationship (eg. whether it is best described by a straight line or curve).

A simple regression model can be written as
Yi = β0 + β1xi + εi
Fitted value for an observation (the estimated mean): ŷi = µ̂{Y | X = xi} = b0 + b1xi

Residual for an observation: resi = yi − fiti = ei = yi − ŷi

yi is the observed value, and the estimated regression line is
µ̂{Y | X} = b0 + b1x

µ{Y | X} = “mean of Y given X” or “regression of Y on X”

b0 = estimated intercept, b1 = estimated slope, and β0 and β1 are the unknown parameters.
The two regression coefficients β0 and β1 are called the intercept and the slope. Their actual values are
unknown and need to be estimated using the empirical data at hand. To find such estimators, we
may use the Least Squares method.

THE LEAST SQUARE METHOD


To avoid individual judgment in curve fitting, it is necessary to agree on a definition of a “best-fitting
line” or curve. Consider the following set of points:

For a given value of x, say x1, there will be a difference between the value y1 and the corresponding value
as determined by the “best-fitting” curve. This difference, D1, is referred to as a residual.

A residual is the difference between the actual y-value and the value obtained by plugging the x-value (that
goes with the y-value) into the regression equation. We will write an estimated regression line based
on sample data as ŷ = b0 + b1x. The error term εi belongs to the model Yi = β0 + β1xi + εi, not to the fitted line.

Assumptions of Linear Regression


• Linearity: The mean of Y is a straight-line function of X
• Constant Variance: var(Y | X) = σ² for every X
• Normality: Distribution of Y’s at any X is normal
• Independence: Given Xi’s, the Yi’s are independent
Example 1: The data in the following table gives the age (x) and the vital capacity (VC) (in litres) (y),
for 12 adults. VC is a measure of lung capacity.

Age (x) 39 40 41 41 45 49 52 47 61 65 58 59
VC (y) 4.26 5.29 5.52 3.71 4.02 5.09 2.70 4.31 2.70 3.03 2.73 3.67

i. Calculate the linear regression coefficients by Least square method


ii. Write the regression equation using the found coefficients
iii. Give an interpretation of the values of the regression coefficients found.

Model: Yi = β0 + β1xi + εi; fitted line: ŷ = b0 + b1x
Solution

SN          Age (x)   VC (y)     xy        x²      y²
1           39        4.26       166.14    1521    18.1476
2           40        5.29       211.6     1600    27.9841
3           41        5.52       226.32    1681    30.4704
4           41        3.71       152.11    1681    13.7641
5           45        4.02       180.9     2025    16.1604
6           49        5.09       249.41    2401    25.9081
7           52        2.7        140.4     2704    7.29
8           47        4.31       202.57    2209    18.5761
9           61        2.7        164.7     3721    7.29
10          65        3.03       196.95    4225    9.1809
11          58        2.73       158.34    3364    7.4529
12          59        3.67       216.53    3481    13.4689
Total (∑)   597       47.03      2265.97   30613   195.6935
Mean        49.75     3.919167

i. Consider the least squares method for the model Yi = β0 + β1xi + εi, εi ~ N(0, σ²). The estimates are
b1 = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²) = (12 × 2265.97 − 597 × 47.03) / (12 × 30613 − 597²) = −885.27 / 10947 = −0.081
b0 = ȳ − b1x̄ = 3.919 − (−0.0809 × 49.75) = 7.942
ii. The regression equation is ŷ = 7.942 − 0.081x
iii. Interpretations
b0, the intercept, can be interpreted as the predicted vital capacity (Y) if age (X) = 0. We would
expect vital capacity to be 7.94 litres. Since X is a continuous variable, b1 represents the difference in the
predicted value of Y for each one-unit difference in X. This means that each 1-year increase in age is associated with
a change of -0.081 litres in vital capacity (i.e., an increase in age is likely to lead to a reduction in vital capacity of 0.081 litres).
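The least squares coefficients for the age and vital capacity data can be reproduced from the same column totals used in the table. The following is a minimal Python sketch of the usual closed-form estimates; the helper name `predict` and the age 50 used in the usage line are illustrative assumptions, not part of the exercise.

```python
age = [39, 40, 41, 41, 45, 49, 52, 47, 61, 65, 58, 59]                          # x
vc = [4.26, 5.29, 5.52, 3.71, 4.02, 5.09, 2.70, 4.31, 2.70, 3.03, 2.73, 3.67]   # y

n = len(age)
sum_x = sum(age)
sum_y = sum(vc)
sum_xy = sum(x * y for x, y in zip(age, vc))
sum_x2 = sum(x * x for x in age)

# Slope and intercept that minimise the sum of squared residuals
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * sum_x / n

def predict(x: float) -> float:
    """Fitted mean vital capacity (litres) at age x."""
    return b0 + b1 * x

# Residuals: observed value minus fitted value; they sum to roughly zero
residuals = [y - predict(x) for x, y in zip(age, vc)]

print(f"b0 = {b0:.3f}, b1 = {b1:.4f}")
print(f"predicted VC at age 50 = {predict(50):.3f} litres")
print(f"sum of residuals = {sum(residuals):.6f}")
```

Predictions are only meaningful within the observed age range (39 to 65 years); the intercept at age 0 is an extrapolation.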

SPSS OUTPUT
Coefficients(a)
Model            Unstandardized Coefficients   Standardized Coefficients   t        Sig.
                 B        Std. Error           Beta
1   (Constant)   7.942    1.230                                            6.458    .000
    age          -.081    .024                 -.724                       -3.321   .008
a. Dependent Variable: vc
Note: Age is significant at the 5% significance level, with a p-value (Sig. in the table) of 0.008, which is less than
0.05.


The coefficient of variation (CV) is defined as the standard deviation of a dataset expressed as a percentage of the mean. It's useful for comparing variability across different datasets, even those with different units or scales, because it's dimensionless. By expressing variability relative to the mean, CV allows for benchmarking performance across various datasets, highlighting the relative dispersion irrespective of the units of measurement. For example, a CV less than 10% indicates excellent control, whereas over 30% suggests an unstable process .

The T-distribution is applied in hypothesis testing for small sample sizes (n < 30) to assess if the sample mean significantly differs from a known population mean. Assuming normal distribution, the T-distribution accounts for increased variability in smaller samples, providing a more accurate model than the normal distribution. Underlying assumptions include independence of observations, normality of data, and homogeneity of variance. The T-distribution adapts with degrees of freedom, broadening its tails in smaller samples, reflecting the uncertainty inherent in smaller data amounts .

The chi-square test assesses associations between categorical variables by comparing observed frequencies in contingency tables to expected frequencies under the null hypothesis of no association. It calculates a test statistic that follows a chi-square distribution, with statistical significance indicating a potential relationship between variables. Key assumptions include: categories must be independent, expected frequencies in cells should not be less than one, and no more than 20% of cell expected frequencies should be less than five. Violating these assumptions can affect the validity of the test results, necessitating alternatives like Fisher's Exact Test when conditions aren't met .

The geometric mean is preferable to the arithmetic mean in datasets that are positively skewed or involve rates of change, such as growth rates, because it minimizes the impact of extreme values. It provides a measure that is more representative of the central tendency in logarithmic-sized distributions, offering a robust average when data spans several orders of magnitude. In contrast, the arithmetic mean can be heavily influenced by large outliers, distorting the average in such distributions .

You might also like