Module – 3
Resampling Methods
3.1 What Are Resampling Methods?
Resampling methods are computational techniques in which repeated samples are drawn
from the observed data itself to understand how a statistic behaves.
In traditional statistics, we rely heavily on formulas, assumptions (normality, independence),
and theoretical distributions.
But resampling uses the data itself to estimate:
• Variability
• Standard error
• Confidence intervals
• Significance of results
Resampling is especially powerful when:
• Theoretical formulas are complex or unavailable
• Sample sizes are small
• Distribution assumptions are violated
Modern computing makes the required repeated sampling cheap.
Resampling includes two major techniques:
1. Bootstrap Resampling
2. Permutation Testing
3.2 Bootstrap Resampling
3.2.1 Meaning / Definition
Bootstrap is a data-driven resampling technique where we repeatedly draw samples with
replacement from the original dataset.
Each bootstrap sample is the same size as the original data.
Purpose:
• Estimate standard error
• Build confidence intervals
• Understand sampling variability
• Make inferences without strong assumptions
Bootstrap is especially useful when:
• Sample size is small
• Distribution is unknown
• Traditional formulas do not apply
3.2.2 How Bootstrap Works (Step-by-Step)
Assume we have a dataset of size n.
Step 1 – Collect original sample
Example: daily sales for 10 days.
Step 2 – Draw a bootstrap sample
Randomly pick n observations with replacement:
• Some values will repeat
• Some may be absent
Step 3 – Compute the statistic
Mean, median, standard deviation, etc.
Step 4 – Repeat many times
Usually 1,000 to 10,000 bootstrap samples.
Step 5 – Build a bootstrap distribution
The distribution of all computed statistics approximates the sampling distribution.
Step 6 – Use distribution for inference
• 95% Confidence interval
• Standard error
• Bias estimate
3.2.3 Why “With Replacement”?
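Because the observed sample is treated as a stand-in for the population.
Drawing with replacement mimics drawing fresh samples from that population; drawing n values without replacement would simply return the original data every time.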
This helps approximate:
• Sampling variability
• True population characteristics
3.2.4 Simple Example
Original data (5 numbers):
12,15,20,10,18
One bootstrap sample might look like:
15,20,20,12,10
Another:
12,12,18,20,15
Each time, we compute the statistic (mean, median, etc.).
If we repeat this 1,000 times:
• The distribution of means gives us the bootstrap sampling distribution.
• The spread of this distribution gives standard error.
• Percentiles give confidence intervals.
This avoids reliance on formulas.
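The example above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `bootstrap_means`, the fixed seed, and the choice of 1,000 resamples are assumptions made for the sketch, and the interval uses the simple percentile method.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Return the means of n_resamples bootstrap samples drawn from data."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    n = len(data)
    means = []
    for _ in range(n_resamples):
        # Draw n values with replacement: some repeat, some are absent.
        resample = rng.choices(data, k=n)
        means.append(statistics.mean(resample))
    return means

data = [12, 15, 20, 10, 18]           # original data from the example
boot_means = sorted(bootstrap_means(data))
n_boot = len(boot_means)

# The spread of the bootstrap distribution estimates the standard error.
standard_error = statistics.stdev(boot_means)

# Percentile method: keep the middle 95% of the sorted bootstrap means.
lower = boot_means[(25 * n_boot) // 1000]
upper = boot_means[(975 * n_boot) // 1000 - 1]

print(f"sample mean  = {statistics.mean(data)}")
print(f"bootstrap SE = {standard_error:.2f}")
print(f"95% CI       = ({lower}, {upper})")
```

With only five observations the interval is wide, which is the point: the bootstrap reports the uncertainty rather than hiding it.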
3.2.5 Advantages of Bootstrap
• Does not require normality
• Works for small datasets
• Works for complicated statistics
• Robust and widely used
• Requires only the observed data
3.2.6 Applications of Bootstrap
Bootstrap is used in:
1. Analytics & Data Science
• Estimating uncertainty in model performance
• Constructing confidence intervals for regression coefficients
• Estimating standard errors of medians, quantiles
2. Finance
• Risk estimation
• Portfolio volatility
3. Machine Learning
• Evaluating model stability
• Bagging (bootstrap aggregating)
4. Medical research
• Estimating effects when sample sizes are small
3.3 Permutation Tests
3.3.1 Meaning / Definition
A permutation test is a non-parametric statistical test that evaluates whether two groups
differ significantly by randomly shuffling group labels.
Unlike bootstrap, permutation does NOT sample with replacement.
Purpose:
• Test significance without assuming normality
• Compare group means, medians, or any statistic
• Useful for small samples and unknown distributions
3.3.2 How Permutation Test Works
Suppose you want to test whether two groups have different means.
Step 1 — Combine both groups into one dataset
Step 2 — Shuffle (permute) all values randomly
Step 3 — Split the shuffled data back into two groups of the same sizes as original
Step 4 — Compute the difference in means
Step 5 — Repeat 1,000+ times
Step 6 — Compare actual difference to permutation differences
• If actual difference is extremely rare among permutation results → significant
• If common → not significant
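The six steps can be sketched as a Monte Carlo permutation test. The data below are hypothetical values invented for illustration, and the function name, seed, and number of permutations are assumptions. Adding 1 to the numerator and denominator is a common convention that keeps the estimated p-value from being exactly zero.

```python
import random
import statistics

def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = statistics.mean(group_b) - statistics.mean(group_a)
    pooled = list(group_a) + list(group_b)           # Step 1: combine
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):                  # Step 5: repeat
        rng.shuffle(pooled)                          # Step 2: shuffle
        perm_a, perm_b = pooled[:n_a], pooled[n_a:]  # Step 3: split
        diff = statistics.mean(perm_b) - statistics.mean(perm_a)  # Step 4
        if abs(diff) >= abs(observed):               # Step 6: compare
            extreme += 1
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value

# Hypothetical measurements, not taken from the text.
control = [23, 25, 21, 27, 24, 22, 26, 25]
treatment = [28, 30, 27, 31, 29, 26, 32, 30]

observed_diff, p = permutation_test(control, treatment)
print(f"observed difference = {observed_diff}, p-value = {p:.4f}")
```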
3.3.3 Why Permutation Tests Are Powerful
• They make no distributional assumptions
• They reflect the idea of “random assignment” from experiments
• They directly simulate the null hypothesis:
“Both groups come from the same distribution.”
3.3.4 Simple Example
Group A scores: 10, 12, 14
Group B scores: 18, 20, 22
Observed difference = 20 − 12 = 8
Now combine all values, shuffle, and repeatedly compute differences.
If very few shuffled differences are at least as large as 8, the result is significant.
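Because this example has only six values, every possible relabelling can be enumerated instead of shuffling at random. The sketch below does that with `itertools.combinations`; the variable names are illustrative.

```python
from itertools import combinations
from statistics import mean

group_a = [10, 12, 14]
group_b = [18, 20, 22]
pooled = group_a + group_b
observed = mean(group_b) - mean(group_a)   # 20 - 12 = 8

diffs = []
# Every way of choosing which 3 of the 6 positions are labelled "A".
for a_idx in combinations(range(len(pooled)), len(group_a)):
    perm_a = [pooled[i] for i in a_idx]
    perm_b = [pooled[i] for i in range(len(pooled)) if i not in a_idx]
    diffs.append(mean(perm_b) - mean(perm_a))

n_splits = len(diffs)                      # C(6, 3) = 20 possible splits
one_sided = sum(d >= observed for d in diffs) / n_splits
two_sided = sum(abs(d) >= abs(observed) for d in diffs) / n_splits

print(n_splits, one_sided, two_sided)
```

Only the original labelling produces a difference of 8, so the one-sided p-value is 1/20 = 0.05 and the two-sided p-value is 2/20 = 0.10. With three observations per group, no smaller p-value is attainable, which shows how tiny samples limit what a permutation test can conclude.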
3.3.5 When to Use Bootstrap vs Permutation Test
Aspect | Bootstrap | Permutation
Purpose | Estimate variability | Hypothesis testing
Sampling | With replacement | Without replacement
Assumptions | Minimal | Minimal
Output | CI, SE, bias | p-value
Used for | Estimation | Group comparison
3.4 Key Points to Remember
• Bootstrap = with replacement
• Permutation = shuffling labels
• Bootstrap estimates uncertainty
• Permutation tests significance
• Both are distribution-free and rely on computation
• Useful when classical formulas fail
Statistical Significance & p-values
4.1 Meaning of Statistical Significance
Statistical significance tells us whether an observed effect (difference in means, conversion
rate, performance metric, etc.) is likely to be real or whether it could have occurred just by
random chance.
When we run an experiment or test (such as A/B testing, t-test, or permutation test), we
observe a difference in outcomes. But not every difference is meaningful — some variations
naturally occur due to random sampling.
Statistical significance answers the question:
Is this difference strong enough that it is unlikely to occur accidentally?
It is determined through the p-value.
4.2 What Is a p-Value?
A p-value is the probability of observing a result at least as extreme as the one obtained,
assuming the null hypothesis (H₀) is true.
In simple words:
p-value tells how surprising the data is if the null hypothesis is correct.
• Small p-value → data is very unlikely under H₀ → strong evidence against H₀.
• Large p-value → data is consistent with H₀.
It does NOT tell:
• Probability that H₀ is true
• Probability that H₁ is true
• Degree of effect
It only tells how compatible the observed data is with the null hypothesis.
4.3 Significance Level (α)
The significance level α defines the cutoff for deciding whether a result is statistically
significant.
Common α values:
• α = 0.05 → 5% risk of rejecting a true H₀ (a false positive)
• α = 0.01 → 1% risk of rejecting a true H₀
If p-value < α → Reject H₀ → result is statistically significant.
If p-value ≥ α → Fail to reject H₀ → result is not significant.
4.4 Interpreting p-Values (Textbook Style)
p-value | Interpretation
< 0.01 | Strong evidence against H₀
0.01 to 0.05 | Moderate evidence against H₀
0.05 to 0.10 | Weak evidence
> 0.10 | No evidence against H₀
Important:
Even if p < 0.05, the effect might be very small.
Statistical significance ≠ practical significance.
4.5 How p-Values Are Computed (High Level)
p-values come from:
• t-distribution (for t-tests)
• z-distribution (for large samples or known SD)
• Permutation distribution (for non-parametric testing)
They all compare:
• Observed test statistic vs
• Expected distribution under H₀
If the observed value is in the extreme tail → small p-value.
4.6 Example: A/B Testing Interpretation
Suppose:
• Version A CTR = 12%
• Version B CTR = 15%
Statistical test gives p = 0.03.
Interpretation:
• If α = 0.05 → 0.03 < 0.05, so the improvement is significant.
• If both versions were truly equal, there would be only a 3% chance of seeing a difference at least this large.
So Version B performs better in a statistically meaningful way.
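The text does not say which test produced p = 0.03 or how many impressions each version received. As one plausible reconstruction, the sketch below runs a two-proportion z-test and assumes 1,200 impressions per version; the sample sizes, click counts, and function name are hypothetical.

```python
import math

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    rate_a, rate_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)   # common rate under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))     # two-sided normal tail area
    return z, p_value

# Hypothetical counts: 12% and 15% CTR on 1,200 impressions each.
z, p = two_proportion_z_test(clicks_a=144, n_a=1200, clicks_b=180, n_b=1200)
print(f"z = {z:.2f}, p-value = {p:.3f}")
```

Under these assumed sample sizes the p-value lands close to 0.03. With fewer impressions the same 3-point gap would not be significant, which is why the p-value depends on sample size as well as on the size of the effect.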
4.7 Why Statistical Significance Matters
It helps us:
• Avoid false conclusions
• Make reliable decisions
• Understand whether observed differences reflect real effects
• Ensure experiments (like A/B tests) are meaningful
In business and data science:
• Marketing teams test ads
• Product teams test UI changes
• Data scientists test model improvements
• Researchers validate scientific claims
Statistical significance makes it unlikely, though never impossible, that an observed improvement is due to randomness alone.
4.8 Key Points to Remember
• Significance tells if effect is beyond random variation
• p-value measures compatibility with null hypothesis
• p-value does NOT measure effect size
• α is pre-chosen threshold (commonly 0.05)
• Statistical significance ≠ practical significance
• Small sample sizes → unstable p-values
• Large samples → even tiny true effects yield small p-values (significance is easier to achieve)
t-Tests (One-Sample, Two-Sample, Paired)
5.1 Introduction to t-Tests
A t-test is a statistical method used to determine whether means of one or two groups differ
significantly.
It is used when:
• The sample size is small (n < 30)
• The population standard deviation (σ) is unknown
• Data is approximately normally distributed
• We want to compare averages (means)
t-tests rely on the t-distribution, which adjusts for the uncertainty that comes from
estimating population variance using sample data.
t-tests are very common in:
• A/B testing
• Business analytics
• Medical studies
• Machine learning model comparisons
• Psychology and social science experiments
5.2 Why t-Tests Are Needed
We use t-tests when:
• We want to compare mean performance between groups
• We cannot use z-tests because σ is unknown
• Sample sizes are not large enough to assume normality automatically
The t-distribution helps us determine whether the difference between observed means is
truly meaningful or due to sampling randomness.
5.3 Types of t-Tests
There are three major t-tests:
Type | Purpose
One-Sample t-Test | Compare sample mean to a known/claimed population mean
Two-Sample Independent t-Test | Compare means of two independent groups
Paired t-Test | Compare means of the same group before and after a treatment
Each has a different formula, different use-case, and different assumptions.
5.4 ONE-SAMPLE t-TEST
5.4.1 Meaning / Use-Case
Used when we want to test whether the mean of a single sample is equal to a known or
claimed value.
Examples:
• Does the average weight of a packet match the manufacturer claim?
• Is the average time on app > 10 minutes?
• Is the average test score equal to 50 marks?
5.4.2 Formula
$$t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}$$
Where:
• $\bar{X}$ = sample mean
• $\mu_0$ = claimed population mean
• $s$ = sample standard deviation
• $n$ = sample size
5.4.3 Assumptions
• Data roughly follows a normal distribution
• σ is unknown
• Sample is random
5.4.4 Simple Example
A company claims the average battery life = 10 hours.
We test 8 devices and get mean = 9.2 hours, s = 0.7.
Apply the formula → compute t → compare with t-table.
If |t| > t-critical → reject the claim.
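Working the battery example through with the one-sample formula gives the following sketch. The critical value 2.365 is the standard t-table entry for a two-sided test at α = 0.05 with df = 7; choosing a two-sided test at that level is an assumption, since the example does not state either.

```python
import math

mu_0 = 10.0    # claimed mean battery life (hours)
x_bar = 9.2    # sample mean
s = 0.7        # sample standard deviation
n = 8          # number of devices tested

t_stat = (x_bar - mu_0) / (s / math.sqrt(n))
df = n - 1
t_critical = 2.365   # two-sided, alpha = 0.05, df = 7 (from a t-table)

reject = abs(t_stat) > t_critical
print(f"t = {t_stat:.3f}, df = {df}, reject claim: {reject}")
```

Here t is about −3.23, so |t| exceeds 2.365 and the 10-hour claim is rejected at the 5% level.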
5.5 TWO-SAMPLE INDEPENDENT t-TEST
5.5.1 Meaning / Use-Case
Used when comparing means of two different independent groups.
Examples:
• Do male and female students score differently?
• Does method A perform better than method B?
• Does website version A outperform version B?
This is commonly used in A/B testing.
5.5.2 Formula
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
Where:
• $\bar{X}_1, \bar{X}_2$ = means of the two groups
• $s_1, s_2$ = sample standard deviations
• $n_1, n_2$ = sample sizes
5.5.3 Assumptions
• Both groups are independent
• Data in each group roughly normal
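• Variances are equal (for the pooled t-test, which replaces s₁² and s₂² with a single pooled variance)
• If variances are unequal → Welch’s t-test (the separate-variance form, with df from the Welch–Satterthwaite approximation instead of n₁ + n₂ − 2)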
5.5.4 Interpretation
If |t| > t-critical → difference in means is significant.
5.5.5 Example (Conceptual)
Group A avg = 78
Group B avg = 74
Compute t-statistic → If p < 0.05 → A performs significantly better.
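The conceptual example gives only the two means, so the sketch below fills in hypothetical standard deviations and sample sizes to make the formula computable. Those values and the two-sided α = 0.05 test are assumptions, not part of the original example. Equal group sizes are chosen so that the pooled and separate-variance standard errors coincide.

```python
import math

# Means come from the example; s and n are hypothetical values.
mean_a, mean_b = 78.0, 74.0
s_a, s_b = 5.0, 4.0
n_a, n_b = 16, 16

# Standard error of the difference in means.
se = math.sqrt(s_a**2 / n_a + s_b**2 / n_b)
t_stat = (mean_a - mean_b) / se

df = n_a + n_b - 2    # 30 for the pooled test
t_critical = 2.042    # two-sided, alpha = 0.05, df = 30 (from a t-table)

significant = abs(t_stat) > t_critical
print(f"t = {t_stat:.3f}, df = {df}, significant: {significant}")
```

With larger standard deviations or smaller groups, the same 4-point gap would not be significant; the means alone never settle the question.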
5.6 PAIRED t-TEST
5.6.1 Meaning / Use-Case
Used when the same subjects are measured twice:
• Before vs After training
• Weight before vs after diet
• Marks before vs after coaching
• App performance before vs after optimization
The test compares differences within each pair.
5.6.2 Key Idea
Instead of comparing two separate groups, we compare difference values (d):
$$d = X_{\text{after}} - X_{\text{before}}$$
Then we analyze whether the mean of d is significantly different from zero.
5.6.3 Formula
$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$
Where:
• $\bar{d}$ = mean of differences
• $s_d$ = standard deviation of differences
• $n$ = number of pairs
5.6.4 Why Paired t-Test Is Powerful
• Removes individual variation
• More sensitive to detecting changes
• Requires smaller sample sizes
5.6.5 Example
Before training: 60, 55, 58
After training: 65, 59, 62
Differences = 5, 4, 4
Compute $\bar{d}$, $s_d$, then t.
If t is significant → training improved performance.
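The training example can be computed directly. The sketch below uses the standard library; the critical value 4.303 is the t-table entry for a two-sided test at α = 0.05 with df = 2, and that choice of test is an assumption.

```python
import math
import statistics

before = [60, 55, 58]
after = [65, 59, 62]

# Per-subject differences: after minus before.
d = [a - b for a, b in zip(after, before)]   # [5, 4, 4]

d_bar = statistics.mean(d)     # mean of the differences
s_d = statistics.stdev(d)      # sample standard deviation (n - 1 denominator)
n = len(d)

t_stat = d_bar / (s_d / math.sqrt(n))
df = n - 1
t_critical = 4.303             # two-sided, alpha = 0.05, df = 2 (from a t-table)

print(f"d_bar = {d_bar:.3f}, s_d = {s_d:.3f}, t = {t_stat:.2f}, df = {df}")
print("significant:", abs(t_stat) > t_critical)
```

The differences are large relative to their spread, giving t ≈ 13, which exceeds the critical value even with only three pairs.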
5.7 Degrees of Freedom in t-Tests
Degrees of freedom (df) for t-tests:
• One-sample: df = n − 1
• Two-sample: df = n₁ + n₂ − 2
• Paired t-test: df = n − 1
df determines which t-distribution curve to use.
5.8 When to Use Each t-Test
Scenario | t-Test
Compare sample mean to claimed value | One-sample t-test
Compare two independent groups | Two-sample t-test
Same group measured twice | Paired t-test
5.9 Applications of t-Tests
Business & Product Analytics
• Conversion rate comparison
• A/B test analysis
• Customer satisfaction changes
Machine Learning
• Comparing model performance
• Feature importance testing
Healthcare
• Treatment effectiveness
• Drug trials
Education
• Performance improvement
• Teaching method comparison
5.10 Key Points to Remember
• Use t-test when σ is unknown
• Sample size small → t-distribution more accurate
• Always check assumptions
• Paired t-test removes individual variation
• Welch’s t-test is used when variances differ
• t-tests rely on p-values for significance
Multiple Testing & Bonferroni Correction
6.1 Introduction to Multiple Testing
In real-world data science and analytics, it is common to perform several hypothesis tests at
the same time.
Examples:
• Checking which of 20 website features improve conversion
• Comparing performance of multiple marketing ads
• Testing many variables to see which correlates with sales
• Exploratory data analysis on large datasets
When multiple tests are conducted, the risk of making a Type I error (false positive)
increases significantly.
A Type I error happens when we incorrectly reject a true null hypothesis.
For a single test:
$$P(\text{Type I error}) = \alpha$$
But when m independent tests are run simultaneously:
$$P(\text{at least one false positive}) = 1 - (1 - \alpha)^m$$
This becomes large very quickly.
Example:
If α = 0.05 and we run 20 tests:
$$P = 1 - (0.95)^{20} \approx 0.642$$
That is a 64.2% chance of getting at least one false positive even when all null hypotheses are true.
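To see how quickly this probability grows, the formula can be evaluated for several values of m. The sketch assumes independent tests, as the formula does; the function name is illustrative.

```python
def family_wise_error_rate(alpha, m):
    """Probability of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m

for m in (1, 5, 10, 20, 50):
    rate = family_wise_error_rate(0.05, m)
    print(f"m = {m:>2}: P(at least one false positive) = {rate:.3f}")
```

At m = 20 the function reproduces the 0.642 figure from the example.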
This problem is called the:
Multiple Testing Problem
(or Multiple Comparisons Problem)
6.2 Why Multiple Testing Is a Problem
When many tests are run:
• Some statistically significant results occur just by chance
• Gives misleading conclusions
• Researchers may incorrectly claim a variable is important
• Businesses might implement changes thinking something “worked”
Multiple testing therefore increases:
• False positives
• Overfitting
• Misinterpretation of significance
To avoid this, corrections are applied to control error rates.
6.3 Bonferroni Correction — Concept
The Bonferroni Correction is a simple, conservative method used to control the Family-Wise
Error Rate (FWER).
Family-Wise Error Rate (FWER)
Probability of making at least one false positive among all tests.
To control FWER at α (usually 0.05), Bonferroni adjusts the significance threshold for each
individual test.
6.4 Bonferroni Correction — Formula
$$\alpha_{\text{corrected}} = \frac{\alpha}{m}$$
Where:
• α = overall significance level (e.g., 0.05)
• m = number of hypothesis tests
This makes the threshold per test much smaller.
Example:
If m = 10 tests:
$$\alpha_{\text{corrected}} = \frac{0.05}{10} = 0.005$$
Now only p-values < 0.005 will be considered significant.
6.5 Why This Works
Bonferroni ensures:
$$P(\text{any false positive}) \le \alpha$$
By lowering α per test, it compensates for the fact that many tests are being conducted.
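The guarantee follows from the union bound (Boole's inequality). As a short derivation, let $A_i$ denote the event that test $i$ produces a false positive when each test is run at level $\alpha/m$, so that $P(A_i) \le \alpha/m$. The event notation $A_i$ is introduced here for illustration.

```latex
P\left(\bigcup_{i=1}^{m} A_i\right)
  \;\le\; \sum_{i=1}^{m} P(A_i)
  \;\le\; m \cdot \frac{\alpha}{m}
  \;=\; \alpha
```

This step uses only the union bound, so it holds whether or not the tests are independent.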
6.6 Example — Practical Interpretation
A product team tests 5 different email subject lines.
Subject Line | p-value
A | 0.03
B | 0.01
C | 0.04
D | 0.20
E | 0.07
Using normal α = 0.05 → we would claim A, B, C are significant.
But m = 5 tests → Bonferroni corrected α:
$$\alpha_{\text{corrected}} = \frac{0.05}{5} = 0.01$$
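Now only p < 0.01 is significant.
So:
• A: 0.03 → Not significant
• B: 0.01 → Borderline: equal to the threshold, so not significant under the strict rule p < α
• C: 0.04 → Not significant
Bonferroni prevents us from declaring three subject lines “winners” when, after correction, none clearly passes the threshold and only B comes close.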
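The subject-line example can be checked with a few lines of code. The sketch applies the strict rule p < α from section 4.3, under which a p-value exactly equal to the corrected threshold (subject line B) does not pass; with a rule of p ≤ α it would. `Decimal` is used so that the comparison at the threshold is exact rather than subject to floating-point rounding. The variable names are illustrative.

```python
from decimal import Decimal

# p-values from the subject-line table, kept as decimal strings.
p_values = {"A": "0.03", "B": "0.01", "C": "0.04", "D": "0.20", "E": "0.07"}

alpha = Decimal("0.05")
m = len(p_values)
alpha_corrected = alpha / m    # 0.05 / 5 = 0.01

# Strict rule: significant only if p < threshold.
uncorrected = [k for k, p in p_values.items() if Decimal(p) < alpha]
corrected = [k for k, p in p_values.items() if Decimal(p) < alpha_corrected]

print("threshold per test:", alpha_corrected)
print("significant without correction:", uncorrected)
print("significant with Bonferroni:   ", corrected)
```

Without correction A, B, and C pass; with the corrected threshold none passes under the strict rule, and B sits exactly on the boundary.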
6.7 When to Use Bonferroni Correction
Use Bonferroni when:
• Number of tests is small or moderate (m ≤ 20)
• You want strong protection against false positives
• Making even one false positive is costly (medicine, finance, experiments)
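• Tests may be independent or correlated (the guarantee holds either way, but the correction is more conservative when tests are correlated)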
Bonferroni is reliable but conservative.
6.8 Limitations of Bonferroni
Bonferroni is very strict:
• Reduces the chance of false positives
• But increases the chance of false negatives
• You may miss real effects because the threshold becomes too small
• Not suitable when hundreds of tests are involved (e.g., ML feature selection)
For Module 3, Bonferroni is the only correction method you need to study.
6.9 Key Concepts to Remember
• Multiple Testing increases the probability of false positives
• Bonferroni adjusts α → making tests more conservative
• New threshold = α/m
• Controls FWER
• Best used for small number of comparisons
• Prevents misleading “significant” results caused by random chance
6.10 Applications in Data Science
• A/B testing multiple versions (A vs B vs C vs D…)
• Feature selection (checking many variables for correlation)
• Clinical trials testing multiple outcomes
• Marketing experiments
• Experimental research
Applying a correction in these settings lowers the risk that reported results are false positives.
Degrees of Freedom (df)
7.1 Meaning of Degrees of Freedom
Degrees of Freedom (df) represent the number of independent values that can vary freely
when calculating a statistic.
Whenever we estimate a parameter (like mean or variance), we lose one or more degrees of
freedom because certain values become fixed due to constraints.
Simple intuition:
If you know:
• You have 5 numbers
• And their average is 10
You can choose any 4 values freely.
The 5th value is automatically determined by the mean constraint.
So df = 5 − 1 = 4
This general idea applies to many statistical tests.
7.2 Why Are Degrees of Freedom Important?
Degrees of freedom determine:
• The shape of the t-distribution
• The critical t-values used for hypothesis testing
• How much uncertainty exists in variance estimation
• Whether a test becomes more strict or more lenient
Key idea:
Smaller df → more uncertainty → heavier tails → larger critical values
As df increases:
• The t-distribution becomes narrower
• It begins to resemble the normal distribution
• Tests become more reliable and accurate
7.3 Degrees of Freedom in t-Tests
Different t-tests have different df formulas.
A) One-Sample t-Test
𝑑𝑓 = 𝑛 − 1
Reason:
• We estimate 1 parameter (the sample mean), which uses up 1 degree of freedom.
B) Two-Sample Independent t-Test (Equal Variances)
df = n₁ + n₂ − 2
Reason:
• We estimate 1 mean per group → 2 parameters estimated → lose 2 df.
This is the pooled variance version (classical two-sample t-test).
C) Paired t-Test
𝑑𝑓 = 𝑛 − 1
Reason:
• We work with differences (d-values)
• Only 1 mean (mean of differences) is estimated → lose 1 df
7.4 Why Low Degrees of Freedom Produce Heavier Tails
When df is small:
• The sample size is small
• Variance estimate is less reliable
• There is more uncertainty in the data
To reflect this uncertainty, the t-distribution assigns more probability to extreme values.
Hence:
• Tails become “heavier”
• Critical t values become larger
• You need stronger evidence to reject H₀
As df → ∞:
• The t-distribution converges to the standard normal distribution.
7.5 Example for Understanding
Imagine you take the test scores of 5 students:
Scores: 70, 75, 80, 85, 90
Sample mean = 80
If you change any 4 values freely, the 5th value must adjust so the mean remains 80.
Thus df = 4.
When you calculate sample variance:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
Division is by n − 1, not n — because 1 degree of freedom is lost.
This is the most direct and visible use of df in statistics.
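Both ideas in this example, the constrained fifth value and the n − 1 divisor, can be verified numerically. The sketch below uses the five scores from the example; the variable names are illustrative.

```python
import statistics

scores = [70, 75, 80, 85, 90]
n = len(scores)
x_bar = statistics.mean(scores)                  # 80

# Once the mean is fixed, the last value is determined by the other four.
fifth = n * x_bar - sum(scores[:4])              # 5 * 80 - 310 = 90

# Sum of squared deviations from the sample mean.
ss = sum((x - x_bar) ** 2 for x in scores)       # 250

sample_variance = ss / (n - 1)   # divides by df = 4
divide_by_n = ss / n             # divides by n = 5, for comparison

print(f"fifth value forced by the mean: {fifth}")
print(f"s^2 with n - 1: {sample_variance}")
print(f"variance with n: {divide_by_n}")
```

Dividing by n − 1 gives 62.5, which matches `statistics.variance(scores)`; dividing by n gives the smaller value 50.0.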
7.6 Degrees of Freedom in Other Tests
Although Module-3 focuses on t-tests, degrees of freedom also appear in:
1. Chi-Square Test
$df = (r - 1)(c - 1)$, where r and c are the numbers of rows and columns in the contingency table
2. ANOVA (F-test)
• Between-groups df: k − 1
• Within-groups df: N − k
3. Regression
df = n − (number of parameters estimated)
7.7 Applications of Degrees of Freedom
Degrees of freedom are essential whenever we:
✓ Calculate t-statistics
✓ Look up t-critical values in t-tables
✓ Construct confidence intervals
✓ Estimate sample variance
✓ Perform regression
✓ Conduct ANOVA or chi-square tests
✓ Run any test that uses a distribution dependent on df
Without df, we cannot choose the correct statistical distribution.
7.8 Key Points to Remember
• Degrees of freedom measure the number of free independent values
• df decreases when parameters are estimated
• t-distribution shape depends on df
• Low df → wider/heavier tails → more conservative tests
• High df → distribution close to normal
• Each t-test has its own formula for df