Environmental Engineering Research Workshop
Statistical Methods in Research
Sarath Raj, PhD
I. What is Statistics?
Science of collecting, organizing, analyzing, and
interpreting data
It helps make informed decisions using data
Types
Descriptive Statistics:
Summarizes data (mean, median, charts).
Inferential Statistics:
Makes predictions or inferences from a sample to a population
Descriptive Statistics
A generic term for the measures of central tendency and
dispersion that also feed into inferential statistics.
🟥 These statistics describe or summarize the qualities of data.
Another name is “summary statistics”, which are univariate.
Mean
Median
Mode
Range
Standard Deviation, etc.
Measures of Central Tendency
These measures describe the typical or central value
of a set of scores or values in the data.
Mean
Median
Mode
The Mean
The “mean” of some data is the average score or
value, such as the average age of an MPA student or
average weight of professors that like to eat donuts.
Inferential mean of a sample: $\bar{X} = (\sum X)/n$
Mean of a population: $\mu = (\sum X)/N$
The Mean Problem
The main problem associated with the mean value of
some data is that it is sensitive to outliers.
The average weight of people might be
affected if there was one in the group that
weighed 600 pounds.
The Median
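Because the mean can be sensitive to extreme
values, the median is sometimes more useful
and more representative of a typical case.
The median is simply the middle value
among the sorted scores of a variable. (There is no
algebraic formula for it: sort the values and take the
middle one, or the average of the two middle values
when the number of scores is even.)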
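A minimal sketch of the outlier effect described above. The weights are hypothetical values chosen for illustration; only the 600-pound case comes from the slide.

```python
from statistics import mean, median

# Hypothetical body weights in pounds (illustrative values only).
weights = [150, 160, 170, 180, 190]
# Same group plus one person weighing 600 pounds, as in the slide's example.
weights_with_outlier = weights + [600]

mean_before = mean(weights)
median_before = median(weights)
mean_after = mean(weights_with_outlier)
median_after = median(weights_with_outlier)

print(f"without outlier: mean={mean_before}, median={median_before}")
print(f"with outlier:    mean={mean_after:.1f}, median={median_after}")
```

The mean shifts by more than 70 pounds while the median moves by only 5, which is why the median is often preferred for skewed data.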
Percentiles
If we know the median, then we can go up or
down and rank the data as being above or
below certain thresholds.
You may be familiar with this from standardized tests:
if you scored at the 90ᵗʰ percentile, your score was
higher than 90% of the rest of the sample.
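One way to compute a percentile rank, as a sketch. The function name `percentile_rank` and the scores are assumptions made for illustration, and several slightly different percentile definitions are in common use; this one counts the share of scores strictly below a given value.

```python
def percentile_rank(scores, value):
    """Return the percentage of scores that are strictly below `value`."""
    below = sum(1 for s in scores if s < value)
    return 100 * below / len(scores)

# Hypothetical test scores (illustrative values only).
scores = [52, 58, 61, 64, 67, 70, 73, 77, 82, 95]

rank_of_95 = percentile_rank(scores, 95)  # 9 of the 10 scores are below 95
rank_of_70 = percentile_rank(scores, 70)  # 5 of the 10 scores are below 70

print(rank_of_95, rank_of_70)
```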
The Mode
The most frequent response or
value for a variable.
Multiple Modes
Bimodal
Multimodal
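A short sketch of finding one or several modes with Python's standard library. The response lists are hypothetical.

```python
from statistics import multimode

# Hypothetical survey responses (illustrative values only).
unimodal = [1, 2, 2, 3]
bimodal = [1, 1, 2, 3, 3]

# multimode returns every value that shares the highest frequency.
modes_unimodal = multimode(unimodal)
modes_bimodal = multimode(bimodal)

print(modes_unimodal)  # a single mode
print(modes_bimodal)   # two modes, so the data are bimodal
```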
Measures of dispersion
Measures of dispersion tell us
about variability in the data.
How much do values differ for a variable from the min
to max, and distance among scores in between. We
use:
Range
Standard Deviation
Variance (standard deviation squared)
Measures of dispersion
To glean information from data, i.e. to make an inference,
we need to see variability in our variables.
Measures of dispersion give us information about how
much our variables vary from the mean, because if they
don’t, it is difficult to infer anything from the data.
Dispersion is also known as the spread or range of
variability.
The Range
r = h – l, where h is the highest value and l is the lowest
In other words, the range gives us the difference
between the minimum and maximum values
of a variable.
Understanding this statistic is important in
understanding your data, especially for
management and diagnostic purposes.
Standard Deviation
A standardized measure of
distance from the mean.
In other words, it allows you to know how
far some cases are located from the mean.
How extreme are your data?
For normally distributed data, about 68% of cases
fall within one standard deviation of the mean,
and about 95% within two.
Standard Deviation
Sample standard deviation: $s = \sqrt{\sum (X - \bar{X})^2 / (n - 1)}$
X = score for each point in data
$\bar{X}$ = mean of scores for the variable
n = sample size (number of observations or
cases)
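A sketch of the three dispersion measures on one small data set. The scores are hypothetical, and the sample (n – 1) form of the standard deviation is assumed.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical scores (illustrative values only).
scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: highest value minus lowest value.
data_range = max(scores) - min(scores)

# Sample standard deviation written out step by step.
n = len(scores)
x_bar = mean(scores)
sum_sq_dev = sum((x - x_bar) ** 2 for x in scores)
s_manual = sqrt(sum_sq_dev / (n - 1))

# The library functions use the same n - 1 denominator.
s_library = stdev(scores)
var_library = variance(scores)

print(f"range={data_range}, s={s_manual:.3f}, variance={var_library:.3f}")
```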
Confidence Intervals
Gives a range of values that is likely to contain the true population
parameter (like the mean or proportion). It reflects the precision of
your estimate.
🟥 The level C of a confidence interval gives the probability that
the interval produced by the method employed includes the true
value of the parameter.
A study finds the average exam score of a sample of
students is 75, with a 95% confidence interval of [72, 78].
Interpretation: “We are 95% confident that the true
average score of all students lies between 72 and 78.”
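A rough sketch of how such an interval can be computed. It assumes a normal approximation (z critical value); for small samples a t critical value gives a somewhat wider interval. The function name `mean_confidence_interval` and the exam scores are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, level=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    n = len(sample)
    centre = mean(sample)
    std_error = stdev(sample) / sqrt(n)
    # z critical value, about 1.96 for a 95% level.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return centre - z * std_error, centre + z * std_error

# Hypothetical exam scores (illustrative values only).
exam_scores = [72, 78, 75, 80, 69, 74, 77, 73, 76, 76]

low_95, high_95 = mean_confidence_interval(exam_scores, 0.95)
low_99, high_99 = mean_confidence_interval(exam_scores, 0.99)

print(f"95% CI: [{low_95:.1f}, {high_95:.1f}]")
print(f"99% CI: [{low_99:.1f}, {high_99:.1f}]")
```

A higher confidence level produces a wider interval from the same data.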
Inferential statistics
While descriptive statistics summarize the characteristics
of a data set, inferential statistics help you come to
conclusions and make predictions based on your data.
When you have collected data from a sample, you can use
inferential statistics to understand the larger population
from which the sample is taken.
Inferential statistics have two main uses:
✅ making estimates about populations.
✅ testing hypotheses to draw conclusions about
populations.
Statistical Significance
A result is called statistically significant if it is unlikely to have
occurred by chance. A “statistically significant difference” means
there is statistical evidence that there is a difference.
In simple cases, the significance level (α) is defined as the probability of
making a decision to reject the null hypothesis when the null hypothesis is actually true.
The decision is often made using the p-value: the p-value is the
probability of obtaining a value of the test statistic at least as extreme
as the one that was actually observed, given that the null hypothesis is
true.
If the p-value is less than the significance level, then the null hypothesis
is rejected. The smaller the p-value, the more significant the result is
said to be.
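The p-value definition can be made concrete with an exact permutation test. This technique is not covered in these slides; it is used here only because it computes "at least as extreme as observed" by direct counting. The two groups are hypothetical.

```python
from itertools import combinations
from statistics import mean

# Hypothetical scores for two small groups (illustrative values only).
group_a = [70, 75, 72, 74]
group_b = [78, 80, 79, 81]

pooled = group_a + group_b
n_a = len(group_a)
observed = mean(group_a) - mean(group_b)

# Under the null hypothesis the group labels are interchangeable, so every
# way of splitting the pooled scores into two groups is equally likely.
extreme = 0
total = 0
for idx in combinations(range(len(pooled)), n_a):
    chosen = set(idx)
    a = [pooled[i] for i in chosen]
    b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
    total += 1
    # Count splits whose difference is at least as extreme as the observed one.
    if abs(mean(a) - mean(b)) >= abs(observed) - 1e-12:
        extreme += 1

p_value = extreme / total
alpha = 0.05  # significance level
reject_null = p_value < alpha

print(f"p = {p_value:.4f}, reject null: {reject_null}")
```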
Degrees of freedom
Degrees of freedom, often represented by df, is the number of
independent pieces of information used to calculate a statistic.
It’s calculated as the sample size minus the number of
restrictions.
Simple Analogy
Imagine you have 3 test scores that must average to 70.
You pick the first score: 65
You pick the second score: 75
The third score? It’s fixed, it must be 70 to keep the average at 70.
✅ You were free to choose only 2 values.
🔒 The third one is not free—it’s constrained by the total.
So in this case, the degrees of freedom = 2.
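The analogy above as a small sketch. The helper name `last_value_for_mean` is hypothetical.

```python
def last_value_for_mean(chosen, target_mean, n):
    """Return the one remaining value that forces n values to hit target_mean."""
    return target_mean * n - sum(chosen)

n_scores = 3
free_choices = [65, 75]  # the two scores picked freely

# The third score is fully determined by the required average of 70.
third_score = last_value_for_mean(free_choices, 70, n_scores)

# One restriction (the fixed mean) removes one degree of freedom.
degrees_of_freedom = n_scores - 1

print(third_score, degrees_of_freedom)
```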
Student’s t-test
A t-test compares two means (from two different groups, or from
the same group measured twice) to determine if they are
significantly different.
Type                 When to Use                                           Example
Independent t-test   Compare two different groups                          Group A vs Group B test scores
Paired t-test        Compare the same group before and after a treatment   Pre-test vs Post-test scores
Student’s t-test
Group A Group B
70 78
75 80
72 79
74 81
73 77
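A sketch of the independent t-test on the Group A and Group B scores above, computed by hand. It assumes equal variances (the pooled, Student form), and the critical value is hard-coded from a standard t table rather than computed.

```python
from math import sqrt
from statistics import mean, variance

group_a = [70, 75, 72, 74, 73]
group_b = [78, 80, 79, 81, 77]

n_a, n_b = len(group_a), len(group_b)
df = n_a + n_b - 2  # degrees of freedom

# Pooled variance: weighted average of the two sample variances.
pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
std_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
t_stat = (mean(group_a) - mean(group_b)) / std_error

# Two-sided critical value for alpha = 0.05 and df = 8, from a standard t table.
T_CRITICAL = 2.306
significant = abs(t_stat) > T_CRITICAL

print(f"t = {t_stat:.3f}, df = {df}, significant: {significant}")
```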
Paired t-test (Before and After)
Used to compare two measurements taken from the same
subject, before and after a treatment.
Student Before After Difference (D)
1 65 70 5
2 67 72 5
3 70 75 5
4 72 76 4
5 68 73 5
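A sketch of the paired t-test on the before and after scores above. The test works on the differences D only, and the critical value is hard-coded from a standard t table.

```python
from math import sqrt
from statistics import mean, stdev

before = [65, 67, 70, 72, 68]
after = [70, 72, 75, 76, 73]

# Work on the per-student differences D = after - before.
differences = [a - b for a, b in zip(after, before)]
n = len(differences)
df = n - 1

mean_diff = mean(differences)
std_error = stdev(differences) / sqrt(n)
t_stat = mean_diff / std_error

# Two-sided critical value for alpha = 0.05 and df = 4, from a standard t table.
T_CRITICAL = 2.776
significant = abs(t_stat) > T_CRITICAL

print(f"mean D = {mean_diff}, t = {t_stat:.2f}, df = {df}, significant: {significant}")
```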
ANOVA
Analysis of variance (ANOVA) is a statistical test used to
assess the difference between the means of more than two
groups.
At its core, ANOVA allows you to simultaneously compare
arithmetic means across groups.
You can determine whether the differences observed are
due to random chance or if they reflect genuine, meaningful
differences.
Type            When to Use                                 Example
One-way ANOVA   Uses one independent variable or factor     Group A vs Group B vs Group C test scores
Two-way ANOVA   Uses two independent variables or factors   Previous groups with different species
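A sketch of a one-way ANOVA computed by hand. Groups A and B reuse the t-test scores from the earlier slide; Group C is hypothetical, added only so there are three groups. The critical value is hard-coded from a standard F table.

```python
from statistics import mean

groups = {
    "A": [70, 75, 72, 74, 73],
    "B": [78, 80, 79, 81, 77],
    "C": [85, 88, 86, 87, 84],  # hypothetical third group, not from the slides
}

all_values = [v for g in groups.values() for v in g]
grand_mean = mean(all_values)
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

# Between-group variability: how far each group mean is from the grand mean.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
# Within-group variability: how far each score is from its own group mean.
ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_stat = ms_between / ms_within

# Critical value for alpha = 0.05 with (2, 12) degrees of freedom, from an F table.
F_CRITICAL = 3.89
significant = f_stat > F_CRITICAL

print(f"F({k - 1}, {n_total - k}) = {f_stat:.2f}, significant: {significant}")
```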
Correlation & Regression
Is there a relationship between x and y?
What is the strength of this relationship?
Pearson’s r
Can we describe this relationship and use this to
predict y from x?
Regression
Is the relationship we have described statistically
significant?
t test
The relationship between x and y
Correlation: is there a relationship between 2 variables?
Regression: how well does a certain independent variable predict
the dependent variable?
Correlation ≠ Causation
In order to infer causality: manipulate independent variable
and observe effect on dependent variable
Scattergrams
[Figure: three scattergrams of Y against X, showing positive correlation, negative correlation, and no correlation]
Variance vs Covariance
Notes on your sample:
If you wish to assume that your sample is
representative of the general population (RANDOM
EFFECTS MODEL), use the degrees of freedom (n – 1) in
your calculations of variance or covariance.
But if you simply want to assess your current sample
(FIXED EFFECTS MODEL), substitute n for the degrees of
freedom.
Variance vs Covariance
Do two variables change together?
Covariance: gives information on the degree to which two variables vary together.
Variance: gives information on variability of a single variable.
Note how similar the covariance is to variance: the equation
simply multiplies x’s error scores by y’s error scores as
opposed to squaring x’s error scores.
Covariance
When X increases and Y increases: cov (x,y) = pos.
When X increases and Y decreases: cov (x,y) = neg.
When no constant relationship: cov (x,y) = 0
Example of Covariance
x    y    xᵢ – x̄    yᵢ – ȳ    (xᵢ – x̄)(yᵢ – ȳ)
0    3    -3        0         0
2    2    -1        -1        1
3    4    0         1         0
4    0    1         -3        -3
6    6    3         3         9
x̄ = 3, ȳ = 3, Σ(xᵢ – x̄)(yᵢ – ȳ) = 7
What does this number tell us?
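A sketch that finishes the calculation for the example data. The function name `covariance` and its `sample` flag are illustrative; the flag switches between the n – 1 and n denominators discussed in the earlier notes on your sample.

```python
from statistics import mean

x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]

def covariance(xs, ys, sample=True):
    """Covariance of xs and ys, dividing by n - 1 (sample=True) or by n."""
    x_bar, y_bar = mean(xs), mean(ys)
    cross_products = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    denominator = len(xs) - 1 if sample else len(xs)
    return cross_products / denominator

cov_sample = covariance(x, y)                    # sum of products 7, divided by 4
cov_population = covariance(x, y, sample=False)  # sum of products 7, divided by 5

print(cov_sample, cov_population)
```

The positive value says x and y tend to rise together, but its size alone does not say how strong the relationship is.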
Problem with Covariance
The value obtained by covariance is dependent on the size of
the data’s standard deviations:
if they are large, the covariance will be greater than
if they are small,
even if the relationship between x and y is exactly the same in
the large versus small standard deviation datasets.
Solution: Pearson’s r
Covariance on its own does not tell us how strong the relationship is
Solution: standardise this measure
Pearson’s r: standardises the covariance value.
Divides the covariance by the product of the standard
deviations of X and Y: $r = cov(x, y) / (s_x s_y)$
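A sketch of Pearson's r on the covariance example data, written out by hand. The rescaled variable `x_scaled` is an illustrative addition that shows why standardising helps: changing the units of x changes the covariance but not r.

```python
from statistics import mean, stdev

x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]

def sample_covariance(xs, ys):
    """Covariance with the n - 1 denominator."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys)) / (len(xs) - 1)

def pearson_r(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return sample_covariance(xs, ys) / (stdev(xs) * stdev(ys))

r = pearson_r(x, y)

# Rescale x (for example, a change of units): covariance grows tenfold, r does not.
x_scaled = [10 * v for v in x]
cov_original = sample_covariance(x, y)
cov_scaled = sample_covariance(x_scaled, y)
r_scaled = pearson_r(x_scaled, y)

print(f"r = {r:.2f}; cov = {cov_original}, cov after rescaling = {cov_scaled}")
```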
Regression
Correlation tells you if there is an association between
x and y but it doesn’t describe the relationship or allow
you to predict one variable from the other.
To do this we need REGRESSION!
Best - Fit Line
Aim of linear regression is to fit a straight line, ŷ = ax + b, to data
that gives best prediction of y for any value of x
This will be the line that minimises the distance between
the data and the fitted line, i.e. the residuals
In ŷ = ax + b, a is the slope and b is the intercept
ŷ = predicted value
yᵢ = true value
ε = residual error
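A sketch of fitting the best-fit line by ordinary least squares, reusing the x and y values from the covariance example. It assumes the usual closed-form solution, in which the slope is the sum of cross-products divided by the sum of squared deviations of x.

```python
from statistics import mean

x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]

x_bar, y_bar = mean(x), mean(y)

# Slope a: cross-products of deviations over squared deviations of x.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
a = s_xy / s_xx

# Intercept b: the fitted line passes through the point (x_bar, y_bar).
b = y_bar - a * x_bar

# Predicted values and residuals (true value minus predicted value).
y_hat = [a * xi + b for xi in x]
residuals = [yi - yh for yi, yh in zip(y, y_hat)]

print(f"y_hat = {a:.2f}x + {b:.2f}")
```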
General Linear Model
Linear regression is actually a form of the
General Linear Model where the parameters
are a, the slope of the line, and b, the
intercept.
y = ax + b +ε
A General Linear Model is just any model that
describes the data in terms of a straight line
Multiple Regression
Multiple regression is used to determine the effect of a number of
independent variables, x₁, x₂, x₃ etc, on a single dependent
variable, y
The different x variables are combined in a linear way and each
has its own regression coefficient:
y = a₁x₁ + a₂x₂ + … + aₙxₙ + b + ε
The a parameters reflect the independent contribution of each
independent variable, x, to the value of the dependent variable, y.
i.e. the amount of variance in y that is accounted for by each x
variable after all the other x variables have been accounted for
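A sketch of multiple regression with two independent variables, solved through the normal equations in plain Python. The function names and the data are hypothetical; the data are generated without noise (ε = 0) from y = 2x₁ + 3x₂ + 1 so the recovered coefficients can be checked. In practice a numerical library would normally do this step.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    solution = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        known = sum(M[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (M[i][n] - known) / M[i][i]
    return solution

def fit_multiple_regression(X, y):
    """Least-squares fit of y = a1*x1 + ... + an*xn + b; returns ([a1..an], b)."""
    rows = [list(r) + [1.0] for r in X]  # constant column for the intercept b
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coefficients = solve_linear_system(xtx, xty)
    return coefficients[:-1], coefficients[-1]

# Hypothetical, noise-free data generated from y = 2*x1 + 3*x2 + 1.
X = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [2 * x1 + 3 * x2 + 1 for x1, x2 in X]

slopes, intercept = fit_multiple_regression(X, y)
print([round(s, 6) for s in slopes], round(intercept, 6))
```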