GET 305 (Lecture Note) Module 1
1. Apply descriptive statistical methods to biomedical and engineering datasets, using frequency
distributions, measures of central tendency, dispersion, and percentiles, to summarize and interpret
experimental and observational data with acceptable numerical accuracy.
2. Apply statistical inference techniques such as confidence intervals and hypothesis testing, using
appropriate test statistics and significance levels, to draw valid conclusions from sampled
biomedical engineering data.
3. Analyse relationships between variables through regression and correlation methods, using real-
world biomedical datasets, to support prediction, modelling, and data-driven decision-making in
healthcare engineering applications.
4. Implement statistical modelling and data analysis using the SPSS statistical software environment
and its relevance to data analytics in engineering and biomedical applications. Perform data entry,
data cleaning, and manipulation of variables using SPSS. Apply SPSS statistical procedures for descriptive
statistics, inferential analysis, and regression modelling
5. Evaluate and apply data analytics concepts including big data analytics and cloud computing tools,
through case-based examples in biomedical engineering and health systems, to address
contemporary challenges in large-scale healthcare data analysis.
6. Analyse probabilistic models relevant to engineering problems, by applying probability rules and
discrete and continuous distributions (Binomial, Poisson, Hypergeometric, and Normal), to
estimate uncertainty and variability in biomedical systems.
1.0 Introduction to Engineering Statistics and Data Analytics
Statistics is the branch of mathematics used in engineering to collect, analyse, interpret, and present data
to make informed decisions. It deals with uncertainty, variation, and probability in measurements,
materials, and processes. In engineering, this involves analysing experimental data, testing hypotheses,
and estimating relationships between variables to guide design, quality control, and decision-making.
Everything dealing with the collection, processing, analysis, and interpretation of numerical data belongs
to the domain of statistics. In engineering, this includes tasks such as calculating the average downtime of
machines in a factory, analysing test data from electronic circuits, evaluating the performance of medical
or mechanical devices, predicting the reliability of engines or power systems, and studying vibrations in
bridges, aircraft wings, or rotating machines.
Statistics helps biomedical engineers make sound design and manufacturing decisions using data—
especially when it is impossible or too costly to test every single component or device.
Examples:
• While designing an infant incubator, an engineer tests temperature sensors from a small batch and
uses the results to decide whether the entire production meets safety standards.
• During fabrication of a patient monitor, only a few circuit boards are stress-tested. Statistical
analysis is then used to predict the reliability of all units produced.
• In developing a phototherapy system, light intensity is measured on selected prototypes, and the
data guide design adjustments for optimal therapeutic performance.
• When selecting materials for a prosthetic or catheter, an engineer tests samples for strength and
biocompatibility and applies statistics to choose the best material for mass production.
The process of using statistics usually involves four steps:
1. Set goals: Decide clearly what you want to find out.
2. Plan data collection: Decide what data is needed and how to collect it.
3. Analyze data: Use statistical methods to get useful information from the data.
4. Interpret results: Understand the information and make conclusions.
By following these steps, statistics helps you gather information efficiently and make informed decisions.
1.2 Role of Statistics in Experimental Design, Healthcare, and Biomedical Research
1. Role in Quality Improvement
Statistics is essential for improving processes and products. Engineers and scientists use it to collect data,
analyze trends, and present information visually, which helps identify areas needing improvement. For
example: Hospitals use statistics to track infection rates in different wards. By analyzing the data, they
can detect patterns and implement measures to reduce hospital-acquired infections.
2. Applications in Experimental Design
Experimental design is about planning how data is collected so that the results are reliable and conclusions
are valid. Statistics plays a key role by:
• Controlling variation: Ensuring that differences in outcomes are due to the factors being tested,
not random errors.
• Testing hypotheses: Determining whether observed effects are significant or due to chance.
• Monitoring processes: Detecting when something goes wrong and requires correction.
• Example (Biomedical Research): When testing a new drug, a scientist uses statistics to divide
patients into treatment and control groups randomly. This ensures that differences in outcomes are
due to the drug, not other factors like age or health condition.
Variables
Variables are qualities or quantities that vary from one member of the sample to another. They describe
characteristics we can measure or count. Examples include age, sex, height, income, marital status, and
eye colour.
They are called variables because their values are not the same for everyone, and they can also change
over time. For example, age is a variable because people in a group do not all have the same age, and each
person’s age increases with time. Likewise, income is a variable because it can be different for different
people and may also increase or decrease over time.
Types of Variables
Variables can be classified into two, namely Quantitative (Numeric) and Qualitative (Categorical).
Each type can be classified further.
Figure 1: Types of Variables
Qualitative variables describe qualities or characteristics, not quantities that are measured or counted. Examples
include: Gender (male, female), Marital status (single, married, divorced, widowed), Blood group (A, B,
AB, O). Sometimes we give them numbers for easy recording (for example: male = 1, female = 2), but
these numbers have no mathematical meaning; you cannot add or multiply them. There are two types:
ordinal and nominal variables.
1. Ordinal Variable
An ordinal variable is a type of categorical variable whose categories can be arranged in order or ranked, but the gap between
the categories is not equal or measurable. You can arrange them in order, but you cannot measure how far apart
they are or by what quantity they differ.
Examples:
• Disease severity: absent → mild → moderate → severe
• Students’ grades: A, B, C, D
• Attitude: strongly agree → agree → disagree → strongly disagree
• Ratings: very low → low → medium → high → very high
Although these are ordered, we cannot say exactly how much better or worse one category is compared to
another.
2. Nominal Variable
A nominal variable is also categorical, but it has no natural order or ranking. They are just names or labels.
Examples:
• Sex: male, female
• Marital status: single, married, divorced, widowed
• Study type: full-time, part-time, evening
• Hair colour: black, brown, red
• Religion
These groups are different, but none is higher or better than the other. Sometimes numbers are used for coding
(e.g., male = 1, female = 2), but the numbers have no mathematical meaning.
Note that if obesity causes (or is associated with) both hypertension and coronary heart disease, while
hypertension also causes coronary heart disease, then obesity is a confounder variable in the
relationship between hypertension and coronary heart disease.
Binary variable (Dichotomous variable): These are nominal variables that occur in two
categories, E.g., “improved/not improved”, “disease present/ disease not present”; yes/No;
male/female; etc. They are often labeled zero and one.
FREQUENCY DISTRIBUTION
Frequency distribution is a representation, either in a graphical or tabular format, that displays the
number of observations or times a given quantity (or group of quantities) occurs in a set of data. For
example, the frequency distribution of income in a population would show how many individuals (or
households) have the income of a certain level.
Example 1: A biomedical engineering team collected body weight measurements (in pounds) of 57
pediatric patients using a digital weighing system during routine clinical screening in a tertiary hospital.
The data were used to assess sensor performance, calibration accuracy, and patient weight distribution for
pediatric medical device design:
68 63 42 27 30 36 28 32 79 27 22 23 24 25 44 65 43 25 74 51 36 42 28 31 28 25 45 12 57 51 12 32 49 38
42 27 31 50 38 21 16 24 69 47 23 22 43 27 49 28 23 19 46 30 43 49 12
From the data set above we have:
Solution
Figure 2: Simple Bar Chart of weights of 57 children at a day-care center
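The tabulation for Example 1 can be sketched in Python as follows. This is only an illustration: the class width of 10 pounds and the names `weights`, `CLASS_WIDTH` and `class_counts` are assumptions made here, not choices stated in the notes.

```python
from collections import Counter

# Body weights (pounds) of the 57 pediatric patients in Example 1
weights = [
    68, 63, 42, 27, 30, 36, 28, 32, 79, 27, 22, 23, 24, 25, 44, 65, 43, 25, 74,
    51, 36, 42, 28, 31, 28, 25, 45, 12, 57, 51, 12, 32, 49, 38, 42, 27, 31, 50,
    38, 21, 16, 24, 69, 47, 23, 22, 43, 27, 49, 28, 23, 19, 46, 30, 43, 49, 12,
]

CLASS_WIDTH = 10  # assumed class width; other widths are equally valid

# Map each weight to the lower limit of its class (e.g. 27 -> 20) and count
class_counts = Counter((w // CLASS_WIDTH) * CLASS_WIDTH for w in weights)

total = len(weights)
cumulative = 0
print("Class   Freq   Rel.%   Cum.")
for lower in sorted(class_counts):
    count = class_counts[lower]
    cumulative += count
    upper = lower + CLASS_WIDTH - 1
    print(f"{lower}-{upper}   {count:>4}   {100 * count / total:5.1f}   {cumulative:>4}")
```

The frequencies printed per class are the heights of the bars in a simple bar chart of the same data.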
Example 2:
A biomedical engineering team working with public health officials is developing a digital immunization
monitoring system for primary healthcare centers. To validate the system and understand vaccination coverage,
data were collected from electronic health records on the immunization status of under-five children in both
rural and urban communities.
The table below summarizes the collected data:
Figure 3: Multiple Bar Chart of immunization status of under-five children in a rural and urban
town
Figure 4: Component Bar Chart of immunization status of under-five children in a rural and
urban town
The average value is usually represented by the arithmetic mean, customarily just called the mean.
This is simply the sum of the values divided by the number of values. Let $\bar{x}$ represent the mean; then $\bar{x} = (\sum x)/n$,
where x denotes the values of the variable, ∑ (the Greek capital letter sigma) means ‘the sum
of’, and n is the number of observations.
Other measures of the average value are the median and the mode.
The median is the value that divides the distribution in half. If the observations are arranged in increasing order,
the median is the middle observation, that is, the ((n + 1)/2)th value of the ordered observations.
The mode is the value that occurs most frequently.
Example:
A biomedical engineering team is calibrating a non-invasive blood volume monitoring device intended
for use in dialysis and critical care units. To validate the device, plasma volume measurements (in litres)
are taken from eight healthy adult males during baseline testing:
The engineer must determine the mean, median, and mode of these values to establish a reference plasma
volume range for healthy adults, which will later be used to compare patient readings and detect abnormal
fluid balance.
(a) Mean: n = 8, ∑x = 2.75 + 2.86 + 3.37 + 2.76 + 2.62 + 3.49 + 3.05 + 3.12 = 24.02
Therefore, $\bar{x} = \frac{24.02}{8} = 3.00$ litres
(b) Median: First rearranging the measurements in increasing order gives:
2.62, 2.75, 2.76, 2.86, 3.05, 3.12, 3.37, 3.49
If there is an even number of observations, there is no middle one and the average of the two
‘middle’ ones is taken.
Median position = (n + 1)/2 = 9/2 = 4.5th value, so median = average of 4th and 5th values = (2.86 + 3.05)/2 = 2.96 litres
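As a quick check of the plasma volume example, the same quantities can be computed with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative. Because every reading occurs exactly once, `multimode` returns all eight values, which means the data have no single mode.

```python
import statistics

# Plasma volume measurements (litres) from the eight healthy adult males
plasma_volumes = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

mean_volume = statistics.mean(plasma_volumes)      # sum of values / n
median_volume = statistics.median(plasma_volumes)  # average of 4th and 5th ordered values
modes = statistics.multimode(plasma_volumes)       # every most-frequent value

print(f"mean   = {mean_volume:.2f} litres")
print(f"median = {median_volume:.3f} litres")
if len(modes) == 1:
    print(f"mode   = {modes[0]} litres")
else:
    print("no single mode: all values occur equally often")
```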
Measures of Dispersion
The range
The range is the simplest measure, and is the difference between the largest and smallest values. Its
disadvantage is that it is based on only two of the observations and gives no idea of how the other
observations are arranged between these two. Also, it tends to be larger, the larger the size of the sample.
Variance
Variance (and its square root, standard deviation) is the most commonly used measure of variation because
it considers all observations. It is based on how far each value deviates from the mean. When values are close
to the mean, variation is small; when they are widely spread, variation is large. Simply averaging deviations
does not work because positive and negative values cancel out, giving zero. Therefore, variation is measured
by considering the size of deviations rather than their direction. However, this measure is not mathematically
very tractable, and so instead we average the squares of the deviations, since the square of a number is
always positive.
Variance: $$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
Degrees of freedom
Note that the sum of squared deviations is divided by (n - 1) rather than n, because it can be shown
mathematically that this gives a better estimate of the variance of the underlying population. The
denominator (n- 1) is called the number of degrees of freedom of the variance. This number is (n - 1) rather
than n, since only (n - 1) of the deviations $(x_i - \bar{x})$ are independent from each other. The last one can always
be calculated from the others because all of them must add up to zero.
Standard Deviation
For many purposes it is more convenient to express the variation in the original units by taking the square
root of the variance. This is called the standard deviation (S.D.).
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \quad \text{or} \quad s = \sqrt{\frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1}}$$
When using a calculator, the second formula is more convenient for calculation, since the
mean does not have to be calculated first and then subtracted from each of the
observations.
Coefficient of variation
The coefficient of variation expresses the standard deviation as a percentage of the sample mean. This is
useful when interest is in the size of the variation relative to the size of the observation, and it has the
advantage that the coefficient of variation is independent of the units of observation.
$$\text{c.v.} = \frac{s}{\bar{x}} \times 100$$
Note: Engineering benchmark:
• CV < 10% → excellent control
• 10–20% → acceptable
• CV > 30% → unstable process
Standard error
The sample mean is unlikely to be exactly equal to the population mean. A different sample would give a
different estimate, the difference being due to sampling variation. The size of this sample-to-sample variation is
measured by the standard error of the sample mean, which indicates how precisely the population mean is
estimated by the sample mean.
$$Se = \frac{s}{\sqrt{n}}$$
The size of the standard error depends both on how much variation there is in the population and on the size
of the sample. The larger the sample size n, the smaller is the standard error.
Example: In a medical equipment production company, samples from nine consecutive production batches
were tested for contamination. The contamination levels (MPN/g) obtained were: 0.593, 0.142, 0.329, 0.691,
0.231, 0.793, 0.519, 0.392, 0.418.
a) Using statistical methods, determine:
a. The range of the data.
b. The mean and variance of the sample.
c. The standard deviation.
d. The coefficient of variation (CV).
e. The standard error of the mean.
S/n          Sample (Xi)   Xi − X̄       (Xi − X̄)²
1            0.593         0.136556      0.018647
2            0.142         -0.31444      0.098875
3            0.329         -0.12744      0.016242
4            0.691         0.234556      0.055016
5            0.231         -0.22544      0.050825
6            0.793         0.336556      0.11327
7            0.519         0.062556      0.003913
8            0.392         -0.06444      0.004153
9            0.418         -0.03844      0.001478
Total (∑)    4.108                       0.36242
Soln.
a) Range = Highest value – Lowest value = 0.793 - 0.142 = 0.651 MPN/g
b) Mean: $\bar{x} = \frac{\sum x}{n} = \frac{4.108}{9} = 0.4564$ MPN/g
c) Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{0.3624}{8} = 0.0453$ (MPN/g)²
d) Standard deviation: $s = \sqrt{s^2} = \sqrt{0.0453} = 0.2128$ MPN/g
This tells us roughly how far, on average, each batch deviates from the mean contamination level. A
standard deviation of 0.213, compared with a mean of 0.456, is high. This means that the process
is not tightly controlled.
N.B: The wider the standard deviation, the wider your control limits, and the higher your defect
risk.
e) Coefficient of Variation: $CV = \frac{s}{\bar{x}} \times 100 = \frac{0.2128}{0.4564} \times 100 = 46.63\%$
Therefore, the CV of about 47% indicates that the production process is statistically unstable.
f) Standard Error of the Mean: $SEM = \frac{s}{\sqrt{n}} = \frac{0.2128}{\sqrt{9}} = \frac{0.2128}{3} = 0.0709$
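The dispersion measures in this example can be reproduced with a short Python sketch using only the standard library. The variable names are illustrative. The last lines also evaluate the second ("calculator") form of the standard deviation formula to show that both forms agree.

```python
import math
import statistics

# Contamination levels (MPN/g) from the nine batches
levels = [0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418]
n = len(levels)

data_range = max(levels) - min(levels)
mean = statistics.mean(levels)
variance = statistics.variance(levels)  # sample variance, divides by n - 1
sd = statistics.stdev(levels)           # square root of the sample variance
cv = sd / mean * 100                    # coefficient of variation, in percent
sem = sd / math.sqrt(n)                 # standard error of the mean

# Shortcut formula: uses the sum and sum of squares, no deviations needed
sum_x = sum(levels)
sum_x2 = sum(x * x for x in levels)
sd_shortcut = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

print(f"range = {data_range:.3f}, mean = {mean:.4f}, variance = {variance:.4f}")
print(f"sd = {sd:.4f} (shortcut {sd_shortcut:.4f}), CV = {cv:.2f}%, SEM = {sem:.4f}")
```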
HYPOTHESIS TESTING
Hypothesis testing is the use of organized statistical steps to assess whether sample data provide enough
evidence against a stated hypothesis. Hypothesis testing is designed to detect significant differences or significant
associations. Significant differences/associations here refer to differences/associations that are unlikely to have
occurred by random chance alone.
➢ Null (Ho): The statement of no difference or no association, presumed true until the data show otherwise.
This is similar to a law court, where the accused is presumed not guilty until proven otherwise.
➢ Alternative (Ha): The statement that contradicts the stated null hypothesis.
E.g., Ho: There is no significant association between oral contraceptive use and blood pressure
Ha: There is a significant association between oral contraceptive use and blood pressure
Ho: µ = 0
Special Note
• Tabulated values are found from statistical tables (Z, t, Chi-sq, F-test, etc).
• Calculated values are computed manually in the analysis.
• The p-value is usually supplied by the application software (SPSS, Epi-Info, STATA, SAS, etc).
• The p-value is the probability of getting a test statistic (distance) as extreme as, or more extreme than,
what was observed if the null hypothesis were true.
• Both the tabulated value and the p-value lead to the same conclusion.
• The p-value approach is the more modern of the two; comparison with tabulated values is the traditional approach.
Conclusion
• Evidence of a significant difference (or association) is established if Ho is rejected at the chosen
significance level, such as the 5% level of significance.
• No evidence of a significant difference (or association) is established if the null hypothesis Ho is
not rejected at the chosen significance level, such as the 5% level of significance.
Bio-statistical Quote: “ If the difference is not different enough to make a difference, What is the difference?”
A chi-square distribution with k degrees of freedom is the distribution of the sum of squares of k independent
standard normal random variables. A chi-square test is a statistical hypothesis test in which the test statistic
follows a chi-square distribution when the null hypothesis is true. While the chi-square
distribution was first introduced by the German statistician Friedrich Robert Helmert, the chi-square test was
first used by Karl Pearson in 1900. Hence Pearson’s chi-squared test (also called the ‘chi-squared’ test and
denoted by ‘χ²’) is the most popular type of chi-square test today. A classic example of a chi-square test is
the test for fairness of a die, where we test the hypothesis that all six possible outcomes are equally likely.
Chi Square Distribution with 1 degree of freedom
Contingency table
In many cases, the categorical variables of interest have at least two levels each.
The simplest case is a contingency table having two rows and two columns (i.e., nr = nc = 2). The general
form of a 2x2 table is shown below, where Oij and Eij are the observed and expected frequencies in row i and column j.
          Column 1        Column 2        Total
Row 1     a (O11, E11)    b (O12, E12)    a + b
Row 2     c (O21, E21)    d (O22, E22)    c + d
Total     a + c           b + d           N = (a + b) + (c + d)
▪ In this case, the chi-square statistic has the following simplified form:
$$\chi^2 = \frac{N(ad - bc)^2}{(a + b)(c + d)(a + c)(b + d)}$$
▪ Under the null hypothesis, the χ²-statistic has a chi-square distribution with (nr-1)x(nc-1) degrees of
freedom, where nr and nc represent the number of rows and number of columns respectively.
Testing equality of two population proportions using data from two samples
• Ho: p1 = p2, equivalently Ho: p1 - p2 = 0
• Ha: p1 ≠ p2, equivalently Ha: p1 - p2 ≠ 0
In the context of the 2x2 table, this is testing whether there is a relationship between the rows and
columns
Chi Square Test (χ²) for Independence
Useful common test for an association between two grouping (categorical) variables.
• Not a measure of effect size
• Has no outcome variable
The test statistic compares observed frequencies (Oi) with expected frequencies (Ei):
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_{(n_c - 1)(n_r - 1)}$$
Assumptions of Chi Square
1. No expected frequency should be less than 1 (it does not matter what the observed values are)
2. No more than one-fifth of expected frequencies should be less than 5.
If these Chi Square rules are violated, Fisher's Exact Test should be used, i.e.
$$P = \frac{(a + b)!\,(c + d)!\,(a + c)!\,(b + d)!}{a!\,b!\,c!\,d!\,N!}$$
Chi-square is used to analyze qualitative data, when your data are things like:
• Male / Female
• Pass / Fail
• Good / Fair / Poor
• Device working / Device faulty
• Blood group A / B / AB / O
Tomato Consumption Salmonella (+) Salmonella (–) Total
Ate tomato 41 89 130
Did not eat 19 151 170
Total 60 240 300
Solution:
Step 1: State hypothesis
H0: There is no significant difference in salmonella illness between tomato eaters and non-tomato
eaters.
H1: There is a significant difference in salmonella illness between tomato eaters and non-tomato eaters.
Level of significance = 5%
Test statistic: $\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_{(n_c - 1)(n_r - 1)}$
Step 2: Calculate the Expected Frequencies (Eij)
                      Salmonella illness
                      Yes    No     Total
Ate tomato            41     89     130
Did not eat tomato    19     151    170
Total                 60     240    300
The observed frequencies are given in the table. The corresponding expected frequency, Eij, for each cell is
obtained by multiplying the row total (RT) and column total (CT) and dividing this result by the overall
total (OT). The expected frequency is calculated as $E_{ij} = \frac{RT \times CT}{OT}$
Degree of freedom (df) = (nc-1)(nr-1) = (2-1)(2-1) = 1
nc = number of column variables and nr = number of row variables
At 5% level of significance, we obtain from the table: χ² tabulated at 5% (df = 1) = 3.8415
The probability value (p-value) is 0.000 on table (i.e. p < 0.0001), which is far less than 0.05, and the Chi-
square value is 19.089. A highly significant difference was found between the tomato eaters and the non-
tomato eaters in salmonella illness. It also indicates that an association was found between tomato
consumption and salmonella.
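The arithmetic behind the chi-square value of 19.089 can be sketched in plain Python as below. This is an illustrative check, not the SPSS procedure itself; the variable names are assumptions. The p-value line relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so it applies only to a 2x2 table.

```python
import math

# Observed frequencies: rows = ate tomato / did not eat, columns = Salmonella (+) / (-)
observed = [[41, 89],
            [19, 151]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency for each cell: (row total x column total) / overall total
expected = [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]

# Chi-square statistic: sum over all cells of (O - E)^2 / E
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Valid for df = 1 only: P(chi-square > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi_square / 2))

CRITICAL_5_PERCENT = 3.8415  # tabulated chi-square value for df = 1 at the 5% level
reject_h0 = chi_square > CRITICAL_5_PERCENT

print("expected:", expected)
print(f"chi-square = {chi_square:.3f}, df = {df}, p = {p_value:.6f}, reject H0: {reject_h0}")
```

All four expected frequencies are well above 5, so the chi-square assumptions listed earlier are satisfied for this table.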
Example 2: A hospital biomedical engineering department is evaluating the impact of operator training
level on the fault rate of infusion pumps used in clinical wards. The goal is to determine whether additional
technical training is necessary to reduce equipment downtime and improve patient safety.
Hypothesis
Calculation of Expected Frequencies (Eij )
The observed frequencies are given in the table. The corresponding expected frequencies, Eij,
for the cells are obtained by multiplying the row total (RT) and column total (CT) and dividing this result
by the overall total (OT).
The expected frequency is calculated as $E_{ij} = \frac{RT \times CT}{OT}$
Chi-Square Tests
Value df Asymp. Sig. (2-sided)
The probability value (p-value) is 0.995, which is greater than 0.05, and the Chi-square value is 0.196. No
significant association was found between handicap and performance in this study.
Conclusion: There is no evidence to conclude that the handicap of the workers has an influence on their performance.
Assignment:
A team installs smart hand hygiene sensors in an ICU. They record staff compliance (Yes/No) and whether
nosocomial infections occur in patients.
Compliance       Infection (+)   Infection (–)   Total
Compliant        5               95              100
Non-compliant    20              80              100
Question:
• Test if there is an association between hand hygiene compliance and patient infection rates.
Question 2:
A biomedical engineer compares two catheter materials (Silicone vs. PTFE) and the occurrence of catheter-
associated urinary tract infections (CAUTI).
Catheter Type CAUTI (+) CAUTI (–) Total
Silicone 8 42 50
PTFE 3 47 50
Total 11 89 100
Question:
• Is there a significant association between catheter type and CAUTI incidence?
MEAN COMPARISON TEST
Common mean comparison tests include the Z test, T-test (Student's T test) and one-way Analysis of Variance (ANOVA)
test. The "Z-test" compares the mean of a set of measurements to a given constant when the population variance is known.
The data are expected to satisfy the following conditions. If
i) the observed data X1, ..., Xn are independent,
ii) have a common mean µ, and
iii) have a constant variance σ2, then the sample average X̄ has mean µ and variance σ2 / n.
The null hypothesis is that the mean value of X is a given number µ0.
We can use X̄ as a test-statistic, rejecting the null hypothesis if X̄ − µ0 is large.
Test statistic Z = (X̄ − µ0) / s, where s is the standard deviation of the sample average, with s2 = σ2 / n.
• Use z scores for large sample sizes (n ≥ 30)
• Use the Student T test for small sample sizes (n < 30)
Types of T-test:
The T test comes in 3 types, namely the one sample t-test, independent sample t-test and paired sample t-test.
The one sample t-test is commonly used to test the following:
• Statistical difference between a sample mean and a known or hypothesized value of the mean in the
population.
• Statistical difference between the sample mean and the sample midpoint of the test variable.
• Statistical difference between the sample mean of the test variable and chance.
o This approach involves first calculating the chance level on the test variable. The chance level is
then used as the test value against which the sample mean of the test variable is compared.
• Statistical difference between a change score and zero.
o This approach involves creating a change score from two variables, and then comparing the mean
change score to zero, which will indicate whether any change occurred between the two time points
for the original measures. If the mean change score is not significantly different from zero, no
significant change occurred.
Formula: $t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$, with n − 1 degrees of freedom
H0: ($\bar{X}$ = μ): There is no difference between the sample mean and the population parameter
Ha: ($\bar{X}$ ≠ μ): There is a difference between the sample mean and the population parameter
Assumptions in One sample t test
• The dependent variable must be continuous (interval/ratio).
• The observations are independent of one another.
• The dependent variable should be approximately normally distributed.
• The dependent variable should not contain any outliers.
Examples
After several patients on hemodialysis developed infections, the hospital’s biomedical engineering team
wants to determine whether the water used in a particular model of dialysis machine has bacterial
contamination above a dangerous level. The safe limit of bacterial contamination in dialysis water is 0.3
MPN/mL. The team sampled 9 different dialysis machines from the same production batch and measured
bacterial levels (in MPN/mL): 0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418
Task: Use this data to test if the average bacterial contamination exceeds the safe limit of 0.3 MPN/mL.
Null hypothesis H0: μ = 0.3 (average bacterial level is safe)
Alternative hypothesis Ha: μ > 0.3 (average bacterial level is above safe limit)
Solution:
Let X̄ represent the mean and s represent the standard deviation; n = number of observations = 9, Xi = sample value.
S/n          Sample (Xi)   Xi − X̄       (Xi − X̄)²
1            0.593         0.136556      0.018647
2            0.142         -0.31444      0.098875
3            0.329         -0.12744      0.016242
4            0.691         0.234556      0.055016
5            0.231         -0.22544      0.050825
6            0.793         0.336556      0.11327
7            0.519         0.062556      0.003913
8            0.392         -0.06444      0.004153
9            0.418         -0.03844      0.001478
Total (∑)    4.108                       0.36242
SPSS SOFTWARE OUTPUT
One-Sample Statistics
N Mean Std. Deviation Std. Error Mean
sample 9 .456444 .2128439 .0709480
One-Sample Test
Test Value = 0.3
          t       df   Sig. (2-tailed)   Mean Difference   95% Confidence Interval of the Difference
                                                           Lower        Upper
sample    2.205   8    .059              .1564444          -.007162     .320051
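Note: Sig. = 0.059 is the two-tailed p-value in the table; for a two-sided test it is not significant, since the p-value is not less than 0.05. Because the alternative hypothesis stated above is one-sided (μ > 0.3), the corresponding one-tailed p-value is half of this, about 0.03, which is less than 0.05.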
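The SPSS figures above can be checked by hand with the short Python sketch below. It is an illustration under stated assumptions: the critical values 2.306 (two-sided, 5%) and 1.860 (one-sided, 5%) are taken from a standard t table for 8 degrees of freedom, and the variable names are chosen here for readability.

```python
import math
import statistics

# Bacterial levels (MPN/mL) from the nine dialysis machines
levels = [0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418]
mu0 = 0.3  # safe limit used as the test value

n = len(levels)
mean = statistics.mean(levels)
sd = statistics.stdev(levels)  # sample standard deviation (n - 1)
se = sd / math.sqrt(n)         # standard error of the mean
t_stat = (mean - mu0) / se
df = n - 1

# Tabulated t values for df = 8 (assumed from a standard t table)
T_CRIT_TWO_SIDED = 2.306  # alpha = 0.05, two-sided
T_CRIT_ONE_SIDED = 1.860  # alpha = 0.05, one-sided

# 95% confidence interval of the mean difference, as reported by SPSS
mean_diff = mean - mu0
ci = (mean_diff - T_CRIT_TWO_SIDED * se, mean_diff + T_CRIT_TWO_SIDED * se)

print(f"t = {t_stat:.3f}, df = {df}")
print(f"95% CI of difference = ({ci[0]:.4f}, {ci[1]:.4f})")
print("two-sided test rejects H0:", abs(t_stat) > T_CRIT_TWO_SIDED)
print("one-sided test (mu > 0.3) rejects H0:", t_stat > T_CRIT_ONE_SIDED)
```

The two comparisons give different decisions for these data, which is why the direction of the alternative hypothesis should be fixed before the data are analysed.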
Independent Samples T-Test compares the means of two samples which don’t directly influence each other
(samples are two different groups of people or things), e.g., average income for groups of males and
females, or mean weight of patients on an active drug versus those on placebo.
Practice Example: Two health quality-control technologists measured the surface finish of a
metal part, obtaining the data shown below.
Technologist 1 Technologist 2
1.45 1.54
1.37 1.41
1.21 1.56
1.54 1.37
1.48 1.20
1.29 1.31
1.34 1.27
1.35
Assume that the measurements are normally distributed and that the variances are equal. Is there a
difference in the mean surface finish measurements obtained by the two technologists, using α = 0.05?
Example 2: In an environmental health management setting, two catalysts are being analyzed to determine how they
affect the mean yield of a chemical process. Specifically, catalyst 1 is currently in use, but catalyst 2 is acceptable.
Since catalyst 2 is cheaper, it should be adopted, provided it does not change the process yield. An experiment is run
in the pilot plant and results in the data table below. Is there any difference between the mean yields? Use α = 0.05
and assume equal variances.
Number       1       2       3       4       5       6       7       8
Catalyst 1   91.50   94.18   92.18   95.39   91.79   89.07   94.72   89.12
Catalyst 2   89.19   90.95   90.46   93.21   97.19   97.04   91.07   92.75
Solution:
H0: x̄1 − x̄2 = 0    H1: x̄1 − x̄2 ≠ 0
Number   Catalyst 1   Catalyst 2   (Catalyst 1)²   (Catalyst 2)²
1        91.50        89.19        8372.25         7954.856
2        94.18        90.95        8869.872        8271.903
3        92.18        90.46        8497.152        8183.012
4        95.39        93.21        9099.252        8688.104
5        91.79        97.19        8425.404        9445.896
6        89.07        97.04        7933.465        9416.762
7        94.72        91.07        8971.878        8293.745
8        89.12        92.75        7942.374        8602.563
Total    737.95       741.86       68111.65        68856.84
Decision: Since -2.145 < -0.36 < 2.145, the null hypothesis cannot be rejected
Conclusion: we do not have strong evidence to conclude that catalyst 2 results in a mean yield that differs from the
mean yield when catalyst 1 is used.
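The test statistic of about -0.36 used in the decision can be reproduced with the pooled-variance (equal variances) t formula, sketched below in Python. The critical value 2.145 for 14 degrees of freedom is the one quoted in the decision; the variable names are illustrative.

```python
import math
import statistics

catalyst_1 = [91.50, 94.18, 92.18, 95.39, 91.79, 89.07, 94.72, 89.12]
catalyst_2 = [89.19, 90.95, 90.46, 93.21, 97.19, 97.04, 91.07, 92.75]

n1, n2 = len(catalyst_1), len(catalyst_2)
mean_1, mean_2 = statistics.mean(catalyst_1), statistics.mean(catalyst_2)
var_1, var_2 = statistics.variance(catalyst_1), statistics.variance(catalyst_2)

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n1 - 1) * var_1 + (n2 - 1) * var_2) / (n1 + n2 - 2)

# Standard error of the difference between the two means
se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

t_stat = (mean_1 - mean_2) / se_diff
df = n1 + n2 - 2

T_CRIT = 2.145  # two-sided 5% critical value for df = 14, as used in the decision
reject_h0 = abs(t_stat) > T_CRIT

print(f"means: {mean_1:.2f} vs {mean_2:.2f}")
print(f"t = {t_stat:.2f}, df = {df}, reject H0: {reject_h0}")
```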
SPSS SOFTWARE OUTPUT
Note:
i. We do not want the test for equality of variances to show significance, so that the t-test assumption of equal
variances is not violated.
ii. The p-value is not significant at 5% (p-value = 0.724). This implies that the means, 92.24 for catalyst 1 and 92.73 for
catalyst 2, did not differ significantly between the two groups.
In the paired or related sample T test, samples are collected from the same group of people. It compares the means
of two samples which you expect to be connected (often, data is from the same sample at two different times), e.g.,
left hand versus right hand, BP taken 2 times on a group of patients, or the mean before and after an intervention (for
the same group).
The test statistic is $t = \bar{D} / SE$, where $\bar{D}$ is the mean paired difference (i.e. first sample minus second sample,
averaged over the pairs) and SE is the standard error of the mean difference, $SE = s / \sqrt{n}$ (s is the standard deviation
of the paired differences).
Formal Hypothesis Statement: H0: diff = 0 versus H1: diff ≠ 0
Practical Example: A biomedical engineering research team developed a prototype non-invasive optical SpO₂
sensor for continuous patient monitoring. Its performance was evaluated by comparing measurements from the
prototype with those from a standard hospital pulse oximeter. Ten volunteers participated, with oxygen saturation
recorded for each subject using both devices. Because measurements were taken from the same individuals, a
paired parametric test was applied to determine whether the prototype sensor differed significantly from the
clinical standard.
Null hypothesis (H₀): There is no significant difference between measurements from the prototype
sensor and the reference device.
Alternative hypothesis (H₁): There is a significant difference between measurements.
Prototype Sensor   Reference Device
196                192
190                187
155                149
199                200
190                183
203                203
237                242
202                194
228                223
212                207
Solution
Prototype Sensor   Reference Device   di    di²
196                192                4     16
190                187                3     9
155                149                6     36
199                200                -1    1
190                183                7     49
203                203                0     0
237                242                -5    25
202                194                8     64
228                223                5     25
212                207                5     25
Total (∑)                             32    250
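From the column totals ∑di = 32 and ∑di² = 250, the paired t statistic can be computed as in the Python sketch below. It is a worked illustration only: the critical value 2.262 (two-sided, 5%, 9 degrees of freedom) is assumed from a standard t table, and the variable names are chosen here.

```python
import math
import statistics

prototype = [196, 190, 155, 199, 190, 203, 237, 202, 228, 212]
reference = [192, 187, 149, 200, 183, 203, 242, 194, 223, 207]

# Paired differences: prototype minus reference for each volunteer
diffs = [p - r for p, r in zip(prototype, reference)]
n = len(diffs)

mean_diff = statistics.mean(diffs)  # D-bar = 32 / 10
sd_diff = statistics.stdev(diffs)   # standard deviation of the differences
se_diff = sd_diff / math.sqrt(n)    # standard error of the mean difference
t_stat = mean_diff / se_diff
df = n - 1

T_CRIT = 2.262  # two-sided 5% critical value for df = 9 (assumed from a t table)
reject_h0 = abs(t_stat) > T_CRIT

print(f"sum d = {sum(diffs)}, sum d^2 = {sum(d * d for d in diffs)}")
print(f"mean difference = {mean_diff:.1f}, sd = {sd_diff:.3f}, se = {se_diff:.3f}")
print(f"t = {t_stat:.2f}, df = {df}, reject H0: {reject_h0}")
```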
Practice Exc1
The following shows the heights in centimetres of 24 two-year-old Nigerian boys with homozygous sickle cell
disease.
84.4, 80.6, 85.0, 89.9, 80.0, 82.5, 89.0, 81.3, 80.7, 81.9, 86.8, 84.3,
87.0, 83.4, 85.4, 78.5, 89.8, 85.0, 84.1, 85.4, 85.5, 86.3, 80.6, 81.9
(i) Calculate the mean,
(ii) Calculate the standard deviation and standard error of the heights
(iii) If the average height of a two-year-old in the UK is 86.5 cm, what can you say about the effect of sickle cell
disease on height?
Practice Ex 2
Patients who underwent an operation were randomized to two groups: one (Group 1) receiving post-operative care in
hospital and the other (Group 2) receiving post-operative care at home. The patients were asked to rate their
satisfaction with the care received on a scale of 0-100. The results are given below. Use the T test to compare the
satisfaction levels in the two treatment groups.
Group 1 77.1 100 75.4 79.5 78.6 99.0 84.9 92.7 78.2 100 72.2
Group 2 75.8 68.1 70.2 72.1 91.3 74.2 76.8 76.2 60.2 87.0 90.1
Practice Ex 3
The following data were collected in a small clinical trial intended to reduce blood pressure.
Before: blood pressure before treatment. After: blood pressure after treatment.
Note: a placebo is a chemically inert substance which is known to have no physical effect but which is similar in
appearance to a conventional medicine.
Compare the before and after blood pressures in the treatment (calcium) group. (This can be reframed as: does blood
pressure reduce in the treatment (calcium) group?)
Treatment    Before    After
calcium      107       100
calcium      110       114
calcium      123       105
calcium      129       112
calcium      112       115
calcium      111       116
calcium      107       106
calcium      112       102
calcium      136       125
calcium      102       104
Practice Ex 4: The following data were collected in a small trial intended to test the effect of calcium
in the reduction of blood pressure. Two groups of people with high blood pressure were used: one group
received the calcium treatment and the other received a placebo. Test whether there are differences in
BP reduction between the two groups. Note: a placebo is an inactive substance which is similar in
appearance to a conventional medicine.
CORRELATION
Correlation quantifies (puts a number to) the strength of the linear relationship between two variables and also
indicates the direction of the relationship. A correlation simply indicates that there is a relationship between the two
variables; it does not by itself show that one causes the other. The correlation coefficient, r, measures the strength of
the linear relationship. Three types of linear correlation can be considered: positive correlation, negative correlation,
and an absence of linear correlation (no correlation).
Value of the Correlation Coefficient (r)    Strength of the Correlation
1 or -1                                     Perfect
0.8 to 0.99 (or -0.8 to -0.99)              Very Strong
0.6 to 0.79 (or -0.6 to -0.79)              Strong
0.4 to 0.59 (or -0.4 to -0.59)              Moderate
0.1 to 0.39 (or -0.1 to -0.39)              Weak
0                                           Zero (no correlation)
Note
▪ The value of r lies between -1 and +1. Values of r close to +1 represent a strong positive linear relationship, and
values of r close to -1 represent a strong negative linear relationship. r equals exactly +1 or -1 only when the data
points fall exactly on a straight line.
▪ The correlation becomes weaker as the data points become more scattered. A value of r close to 0 means that the
linear association is very weak.
▪ If the data points fall in a random pattern, the correlation is close to zero. It could be that there is NO association
at all, or that the relationship is non-linear.
▪ Correlation is affected by outliers. Compare the first scatterplot with the last scatterplot: the single outlier in the
last plot greatly reduces the correlation (from 1.00 to 0.71).
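Common Ways to Calculate a Correlation Coefficient
1. Pearson product-moment correlation (parametric method):
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} \]
2. Spearman's rank correlation coefficient (non-parametric method):
\[ r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} \]
where d_i is the difference between the ranks of the i-th pair and n is the number of pairs.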
Example: A tobacco company wishes to know whether heavy smoking is related to longevity. From a sample of
recently deceased smokers, the number of cigarettes smoked per day over their last five years (estimated from visits
with their surviving relatives) is paired with the number of years that they lived.
Cigarette (x) 25 35 10 40 85 75 60 45 50
Years lived (y) 63 68 72 62 65 46 51 60 55
i. Obtain the Pearson correlation coefficient (or Spearman's rank correlation coefficient).
ii. Interpret the correlation result.
iii. Test if the coefficient is significant at α = 5%.
Solution
Method 2
Spearman's rank correlation:
\[ r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} \]
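ii. r = -0.61 (Pearson) or rs = -0.62 (Spearman) indicates a strong negative correlation (i.e. more smoking implies reduced longevity).
iii. The test procedure is as follows. H₀: ρ = 0 vs H₁: ρ ≠ 0
Test statistic:
\[ t_{calculated} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{-0.6111\sqrt{9-2}}{\sqrt{1-(-0.6111)^2}} = -2.0425 \]
Significance level α = 5%. The statistic has n − 2 = 7 degrees of freedom, and from the t table t(0.025, 7) = 2.365.
Decision criterion: Reject H₀ if |t_calculated| > t(α/2, n−2).
Conclusion: Since |−2.0425| = 2.0425 < 2.365, H₀ is not rejected. Although the sample correlation is negative and fairly strong, with only nine pairs it is not statistically significant at the 5% level (two-sided), so the data do not provide sufficient evidence of a linear association between smoking and years lived.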
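The three parts of this example can be checked with a short Python sketch, given here as an illustration rather than the worked solution from the original notes. It applies the Pearson and Spearman formulas listed earlier to the smoking data, then computes the t statistic for testing ρ = 0. The helper names (`pearson_r`, `ranks`, `spearman_rs`) are illustrative, the ranking helper assumes no tied values (true for this data set), and the critical value t(0.025, 7) = 2.365 is taken from a standard t table.

```python
import math

# Data copied from the example above.
cigarettes = [25, 35, 10, 40, 85, 75, 60, 45, 50]
years_lived = [63, 68, 72, 62, 65, 46, 51, 60, 55]

def pearson_r(x, y):
    """Pearson product-moment correlation from raw sums."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return num / den

def ranks(values):
    """Rank 1 = smallest value. Assumes no ties (illustrative helper)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_rs(x, y):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

r = pearson_r(cigarettes, years_lived)
rs = spearman_rs(cigarettes, years_lived)

# t statistic for H0: rho = 0, with n - 2 degrees of freedom.
n = len(cigarettes)
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
T_CRIT = 2.365  # t(0.025, 7) from a standard t table
significant = abs(t_stat) > T_CRIT

print(f"r = {r:.4f}, rs = {rs:.4f}, t = {t_stat:.4f}, significant: {significant}")
```

The sketch gives r ≈ -0.611, rs ≈ -0.617 and t ≈ -2.04, whose absolute value is below the two-sided 5% critical value of 2.365.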
REGRESSION ANALYSIS
Regression analysis is a statistical methodology that utilizes the relation between two or more quantitative variables so that
a response or outcome variable can be predicted from the other, or others. This methodology is widely used in health and
biological sciences, and many other disciplines. An example of its application is predicting the length of hospital
stay of a surgical patient by utilizing the relationship between the time in the hospital and the severity of the operation.
Regression analysis serves three major purposes: (1) description, (2) control, and (3) prediction.
Note:
• The primary goal of quantitative analysis is to use current information about a phenomenon to predict its future behavior.
• Current information is usually in the form of a set of data.
• In a simple case, when the data form a set of pairs of numbers, we may interpret them as representing the observed values
of an independent (or predictor) variable X and a dependent (or response) variable Y.
• The goal of the analyst who studies the data is to find a functional relation between the response variable y and the predictor
variable x.
Historical Origin of Regression
Historically, regression analysis was first developed by Sir Francis Galton in the latter part of the 19th century. Galton studied
the relation between heights of sons and fathers. Heights of sons of both tall and short fathers appeared to “revert” or
“regress” to the mean of the group. Galton considered this tendency to be a regression to “mediocrity”. He developed a
mathematical description of this regression tendency. The term regression persists to this day to describe statistical relations
between variables.
Statistical Relation between Two Variables
A statistical relation, unlike a functional relation, is not a perfect one. In general, the observations for a statistical
relation do not fall directly on the curve of relationship.
The goal is to find the best-fit line, that is, the line that minimizes the sum of the squared error terms (least squares).
It is always worth viewing your data (if possible) before performing regressions to get an idea of the type of
relationship (e.g. whether it is best described by a straight line or a curve).
A simple regression model could be written as
Yi = β0 + β1xi + εi
The fitted value for an observation is the estimated mean of y at x: ŷ = μ̂{y|x} = b0 + b1x, where b0 and b1 are the
sample estimates of β0 and β1.
For a given value of x, say x1, there will be a difference between the observed value y1 and the corresponding value
determined by the “best fitting” curve. This difference, D1, is referred to as a residual.
A residual is the difference between the actual y-value and the value obtained by plugging the corresponding
x-value into the regression equation. We will write an estimated regression line based
Age (x) 39 40 41 41 45 49 52 47 61 65 58 59
VC (y) 4.26 5.29 5.52 3.71 4.02 5.09 2.70 4.31 2.70 3.03 2.73 3.67
ŷ = b0 + b1x
Solution
SN          Age (x)    VC (y)      xy         x²       y²
1           39         4.26        166.14     1521     18.1476
2           40         5.29        211.6      1600     27.9841
3           41         5.52        226.32     1681     30.4704
4           41         3.71        152.11     1681     13.7641
5           45         4.02        180.9      2025     16.1604
6           49         5.09        249.41     2401     25.9081
7           52         2.7         140.4      2704     7.29
8           47         4.31        202.57     2209     18.5761
9           61         2.7         164.7      3721     7.29
10          65         3.03        196.95     4225     9.1809
11          58         2.73        158.34     3364     7.4529
12          59         3.67        216.53     3481     13.4689
Total (∑)   597        47.03       2265.97    30613    195.6935
Mean        49.75      3.919167
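The totals in this table are enough to obtain the least-squares estimates. The sketch below is one way to do the arithmetic in Python, using the standard simple-regression formulas b1 = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²) and b0 = ȳ − b1x̄. These formulas are stated here as an assumption, since the worked steps are not reproduced in this part of the notes, and the function name `predict_vc` is illustrative.

```python
# Data copied from the table above.
age = [39, 40, 41, 41, 45, 49, 52, 47, 61, 65, 58, 59]
vc = [4.26, 5.29, 5.52, 3.71, 4.02, 5.09, 2.70, 4.31, 2.70, 3.03, 2.73, 3.67]

n = len(age)
sum_x = sum(age)                                  # 597
sum_y = sum(vc)                                   # 47.03
sum_xy = sum(x * y for x, y in zip(age, vc))      # 2265.97
sum_x2 = sum(x * x for x in age)                  # 30613

# Least-squares slope and intercept.
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * sum_x / n

def predict_vc(x):
    """Fitted vital capacity (litres) at age x, from y-hat = b0 + b1*x."""
    return b0 + b1 * x

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"predicted VC at age 50: {predict_vc(50):.2f} litres")
```

The estimates come out at b0 ≈ 7.942 and b1 ≈ -0.081, which agree with the coefficients reported in the SPSS output.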
iii. Interpretations
b0, the intercept, can be interpreted as the predicted value of vital capacity (Y) when age (X) = 0. Here the predicted
vital capacity would be 7.94 litres, although age 0 lies far outside the observed ages (39 to 65), so this value is an
extrapolation. Since X is a continuous variable, b1 represents the difference in the predicted value of Y for each
one-unit difference in X. This means that each 1-year increase in age is associated with a change of -0.081 litres in
vital capacity (i.e. an increase in age is likely to lead to a reduction in vital capacity of 0.081 litres per year).
SPSS OUTPUT
Coefficients(a)
Model           Unstandardized Coefficients    Standardized Coefficients    t         Sig.
                B         Std. Error           Beta
1  (Constant)   7.942     1.230                                             6.458     .000
   age          -.081     .024                 -.724                        -3.321    .008
a. Dependent Variable: vc
Note: Age is significant at the 5% significance level, with a p-value (Sig. in the table) of 0.008, which is less than
0.05.
A contingency table in the chi-square test organizes data to show the frequency distribution of variables and assess their independence. It provides a structured way to visualize the relationship between categorical variables. The chi-square statistic is calculated by summing, over all cells in the table, the squared difference between observed and expected frequencies divided by the expected frequency: χ² = Σ((O_i - E_i)² / E_i), where the expected frequency of a cell under independence is (row total × column total) / grand total. A significant chi-square statistic suggests a potential association between the variables in question.
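As a small illustration of the χ² formula, the following Python sketch computes the statistic for a 2 × 2 contingency table. The counts are hypothetical and are not taken from these notes. The function name `chi_square` is illustrative, and the critical value 3.841 (5% level, 1 degree of freedom) comes from a standard chi-square table.

```python
def chi_square(observed):
    """Chi-square statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under independence.
            e = row_totals[i] * col_totals[j] / grand_total
            chi2 += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

# Hypothetical counts: rows = two groups, columns = outcome present / absent.
table = [[20, 30],
         [30, 20]]
chi2, df = chi_square(table)

CHI2_CRIT = 3.841  # 5% critical value for 1 degree of freedom
print(f"chi-square = {chi2:.2f}, df = {df}, reject H0: {chi2 > CHI2_CRIT}")
```

Every expected count in this hypothetical table is 25, so each cell contributes (5)² / 25 = 1 and the statistic is 4.0 on 1 degree of freedom.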
Differentiating between null and alternative hypotheses is crucial in hypothesis testing as they represent competing claims about a population parameter. The null hypothesis (H0) is typically formulated to reflect no effect or no difference, serving as a baseline that is assumed true until evidence suggests otherwise. For example, H0: p1 = p2 for proportions. The alternative hypothesis (Ha) is what researchers aim to support, proposing that an effect exists or a difference is significant, such as Ha: p1 ≠ p2. This distinction forms the basis for statistical tests, which determine whether the observed data provide enough evidence to reject the null hypothesis in favor of the alternative.
Hypothesis testing involves several organized steps to detect significant differences or associations between variables. First, formulate the null hypothesis (H0) expressing no effect or association, and the alternative hypothesis (Ha) suggesting a significant effect. Next, choose an appropriate statistical test (e.g., Z-test, T-test, chi-square test) based on data type and distribution. Calculate the test statistic from sample data and compare it against a critical value from the relevant statistical distribution at a specified significance level (e.g., α = 0.05). If the test statistic exceeds the critical value (in absolute value, for a two-sided test), reject H0, indicating a significant result. Otherwise, fail to reject H0, meaning the data do not provide sufficient evidence of an effect.
Degrees of freedom are crucial in calculating sample variance as they account for the estimation of one parameter (the sample mean) from the data. Using (n-1) instead of n, where n is the sample size, corrects the bias in estimation, providing a more accurate estimate of the population variance. This adjustment arises because only (n-1) deviations are independent; the nth deviation can be deduced since all deviations from the sample mean sum to zero. Therefore, dividing by (n-1) rather than n yields an unbiased estimator of the population variance.
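The effect of dividing by (n-1) rather than n can be seen in a short Python sketch. It uses the ten paired differences from the sensor example earlier in these notes and the standard-library functions `statistics.variance` (divides by n-1) and `statistics.pvariance` (divides by n). The manual calculation is included only to show that the deviations sum to zero.

```python
import statistics

# Paired differences d_i from the sensor example earlier in these notes.
d = [4, 3, 6, -1, 7, 0, -5, 8, 5, 5]
n = len(d)
mean_d = sum(d) / n

# Deviations from the sample mean always sum to zero,
# which is why only n - 1 of them are free to vary.
deviations = [x - mean_d for x in d]
sum_sq = sum(dev ** 2 for dev in deviations)

var_n_minus_1 = sum_sq / (n - 1)   # unbiased sample variance
var_n = sum_sq / n                 # divides by n, biased low as an estimate

print(f"sum of deviations = {sum(deviations):.10f}")
print(f"divide by n-1: {var_n_minus_1:.2f} (statistics.variance: {statistics.variance(d):.2f})")
print(f"divide by n:   {var_n:.2f} (statistics.pvariance: {statistics.pvariance(d):.2f})")
```

For these data the (n-1) divisor gives 16.40 and the n divisor gives 14.76, so dividing by n understates the spread.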
Variance and standard deviation improve understanding of variability in data by considering all observations rather than just the two extreme values used in calculating the range. While the range provides a sense of spread by the difference between the maximum and minimum values, variance and standard deviation reflect how each data point deviates from the mean. Variance averages the squared deviations, so it captures variability more comprehensively and does not depend on a single pair of extreme values as the range does, although squaring means that outliers still carry considerable weight. Standard deviation, the square root of variance, expresses variability in the original units, making it more intuitive to interpret than variance, which is in squared units.
The standard error of the mean (SEM) is crucial for interpreting sample data, as it quantifies how precisely the sample mean estimates the population mean. It is calculated as SEM = s / √n, where s is the sample standard deviation and n is the sample size, and it indicates the degree of sampling variability in the mean. The SEM decreases with larger sample sizes, enhancing the estimate's precision, because larger samples tend to better capture population characteristics and reduce uncertainty in extrapolating the sample mean to the population. Therefore, the SEM is key in hypothesis testing and constructing confidence intervals.
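A minimal Python sketch of the SEM calculation, SEM = s / √n, is given below using the twelve vital capacity readings from the regression example in these notes. The second part illustrates how the SEM would shrink if the same standard deviation were observed in larger samples. Those larger sample sizes are hypothetical.

```python
import math
import statistics

# Vital capacity readings (litres) from the regression example in these notes.
vc = [4.26, 5.29, 5.52, 3.71, 4.02, 5.09, 2.70, 4.31, 2.70, 3.03, 2.73, 3.67]

n = len(vc)
mean_vc = statistics.mean(vc)
sd_vc = statistics.stdev(vc)       # sample standard deviation (n - 1 divisor)
sem_vc = sd_vc / math.sqrt(n)      # standard error of the mean

print(f"n = {n}, mean = {mean_vc:.3f}, s = {sd_vc:.3f}, SEM = {sem_vc:.3f}")

# Hypothetical illustration: same s, larger n gives a smaller SEM.
for bigger_n in (48, 192):
    print(f"n = {bigger_n}: SEM would be {sd_vc / math.sqrt(bigger_n):.3f}")
```

Quadrupling the sample size halves the SEM, because the SEM is proportional to 1/√n.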
The coefficient of variation (CV) is defined as the standard deviation of a dataset expressed as a percentage of the mean: CV = (s / x̄) × 100%. It is useful for comparing variability across different datasets, even those with different units or scales, because it is dimensionless. By expressing variability relative to the mean, the CV highlights the relative dispersion irrespective of the units of measurement. As a rough rule of thumb used in some process-control settings, a CV below 10% is taken to indicate excellent control, whereas a CV above 30% suggests an unstable process; such thresholds depend on the application.
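Because the CV is dimensionless, it can compare variables measured in different units. The Python sketch below illustrates this with the age (years) and vital capacity (litres) data from the regression example in these notes. The helper name `coefficient_of_variation` is illustrative.

```python
import statistics

# Data from the regression example in these notes.
age = [39, 40, 41, 41, 45, 49, 52, 47, 61, 65, 58, 59]                       # years
vc = [4.26, 5.29, 5.52, 3.71, 4.02, 5.09, 2.70, 4.31, 2.70, 3.03, 2.73, 3.67]  # litres

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

cv_age = coefficient_of_variation(age)
cv_vc = coefficient_of_variation(vc)

print(f"CV of age: {cv_age:.1f}%")
print(f"CV of vital capacity: {cv_vc:.1f}%")
```

Although the standard deviations (about 9.1 years and 1.0 litres) cannot be compared directly, the CVs show that vital capacity is relatively more variable than age in this sample.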
The T-distribution is applied in hypothesis testing for small sample sizes (n < 30), for example to assess whether the sample mean differs significantly from a known population mean when the population standard deviation is unknown. Assuming the data are normally distributed, the T-distribution accounts for the extra variability in smaller samples, providing a more accurate model than the normal distribution. Underlying assumptions include independence of observations, normality of data, and, when two groups are compared, homogeneity of variance. The shape of the T-distribution depends on the degrees of freedom: its tails are heavier for smaller samples, reflecting the greater uncertainty inherent in smaller amounts of data, and it approaches the normal distribution as the degrees of freedom increase.
The chi-square test assesses associations between categorical variables by comparing observed frequencies in contingency tables to expected frequencies under the null hypothesis of no association. It calculates a test statistic that follows a chi-square distribution, with statistical significance indicating a potential relationship between variables. Key assumptions include: observations must be independent and each must fall into exactly one category, expected frequencies in cells should not be less than one, and no more than 20% of cell expected frequencies should be less than five. Violating these assumptions can affect the validity of the test results, necessitating alternatives like Fisher's Exact Test when conditions aren't met.
The geometric mean is preferable to the arithmetic mean in datasets that are positively skewed or involve rates of change, such as growth rates, because it reduces the impact of extreme values. It provides a measure that is more representative of the central tendency in approximately log-normal distributions, offering a robust average when positive data span several orders of magnitude. In contrast, the arithmetic mean can be heavily influenced by large outliers, distorting the average in such distributions.
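The contrast between the two means can be illustrated with a short Python sketch. The values below are hypothetical and were chosen to span several orders of magnitude. The sketch uses `statistics.geometric_mean` (available from Python 3.8) and checks it against the definition as the exponential of the mean of the logarithms.

```python
import math
import statistics

# Hypothetical positive measurements spanning several orders of magnitude.
values = [1, 10, 100]

arithmetic = statistics.mean(values)
geometric = statistics.geometric_mean(values)

# Equivalent definition: exp of the average of the natural logs.
geometric_via_logs = math.exp(sum(math.log(v) for v in values) / len(values))

print(f"arithmetic mean = {arithmetic}")
print(f"geometric mean  = {geometric:.4f}")
```

The arithmetic mean (37) is pulled towards the largest value, while the geometric mean (10) sits at the middle of the data on a logarithmic scale.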