Resampling Methods in Statistics

The document discusses resampling methods, specifically bootstrap resampling and permutation tests, which are used to estimate variability and test significance without relying on traditional statistical assumptions. Bootstrap involves drawing samples with replacement to estimate standard errors and confidence intervals, while permutation tests shuffle group labels to compare means without replacement. It also covers the concept of statistical significance, p-values, and t-tests, emphasizing their applications in various fields such as data science, finance, and healthcare.

Uploaded by

sg7893699

Module – 3

Resampling Methods
3.1 What Are Resampling Methods?

Resampling methods are computational techniques in which repeated samples are drawn
from the observed data itself to understand how a statistic behaves.

In traditional statistics, we rely heavily on formulas, assumptions (normality, independence),
and theoretical distributions.
But resampling uses the data itself to estimate:

• Variability

• Standard error

• Confidence intervals

• Significance of results

Resampling is especially powerful when:

• Theoretical formulas are complex or unavailable

• Sample sizes are small

• Distribution assumptions are violated

• Enough computing power is available to draw thousands of resamples cheaply

Resampling includes two major techniques:

1. Bootstrap Resampling

2. Permutation Testing

3.2 Bootstrap Resampling

3.2.1 Meaning / Definition

Bootstrap is a data-driven resampling technique where we repeatedly draw samples with
replacement from the original dataset.
Each bootstrap sample is the same size as the original data.

Purpose:

• Estimate standard error


• Build confidence intervals

• Understand sampling variability

• Make inferences without strong assumptions

Bootstrap is especially useful when:

• Sample size is small

• Distribution is unknown

• Traditional formulas do not apply

3.2.2 How Bootstrap Works (Step-by-Step)

Assume we have a dataset of size n.

Step 1 – Collect original sample

Example: daily sales for 10 days.

Step 2 – Draw a bootstrap sample

Randomly pick n observations with replacement:

• Some values will repeat

• Some may be absent

Step 3 – Compute the statistic

Mean, median, standard deviation, etc.

Step 4 – Repeat many times

Usually 1,000 to 10,000 bootstrap samples.

Step 5 – Build a bootstrap distribution

The distribution of all computed statistics approximates the sampling distribution.

Step 6 – Use distribution for inference

• 95% Confidence interval

• Standard error

• Bias estimate

3.2.3 Why “With Replacement”?


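Because in real life, samples vary randomly, and the observed sample is our best stand-in for the population.
Bootstrapping simulates this randomness by allowing repeated selections. Drawing n values without replacement would simply reproduce the original data every time, so the statistic would never vary.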

This helps approximate:

• Sampling variability

• True population characteristics

3.2.4 Simple Example

Original data (5 numbers):

12, 15, 20, 10, 18

One bootstrap sample might look like:

15, 20, 20, 12, 10

Another:

12, 12, 18, 20, 15

Each time, we compute the statistic (mean, median, etc.).

If we repeat this 1,000 times:

• The distribution of means gives us the bootstrap sampling distribution.

• The spread of this distribution gives standard error.

• Percentiles give confidence intervals.

This avoids reliance on formulas.
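
The example above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a definitive implementation: the function names `bootstrap_means` and `percentile_interval`, the fixed seed, and the simple percentile rule are all assumptions made for the sketch.

```python
import random
import statistics


def bootstrap_means(data, n_resamples=1000, seed=0):
    """Return the means of n_resamples bootstrap samples drawn from data."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    n = len(data)
    means = []
    for _ in range(n_resamples):
        # Draw n values WITH replacement: some repeat, some are left out.
        resample = rng.choices(data, k=n)
        means.append(statistics.mean(resample))
    return means


def percentile_interval(values, level=0.95):
    """Simple percentile interval: cut off the same share from each tail."""
    ordered = sorted(values)
    k = round((1 - level) / 2 * len(ordered))  # number of values trimmed per tail
    return ordered[k], ordered[-k - 1]


data = [12, 15, 20, 10, 18]              # original data from the example
boot = bootstrap_means(data)             # bootstrap sampling distribution
std_error = statistics.stdev(boot)       # spread of that distribution
ci_low, ci_high = percentile_interval(boot)

print(f"sample mean  = {statistics.mean(data)}")
print(f"bootstrap SE = {std_error:.2f}")
print(f"95% CI       = ({ci_low}, {ci_high})")
```

The exact numbers depend on the seed and on the number of resamples, so they are best read as approximate.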

3.2.5 Advantages of Bootstrap

• Does not require normality

• Works for small datasets, provided the sample is reasonably representative of the population

• Works for complicated statistics

• Robust and widely used

• Requires only the observed data

3.2.6 Applications of Bootstrap

Bootstrap is used in:


1. Analytics & Data Science

• Estimating uncertainty in model performance

• Constructing confidence intervals for regression coefficients

• Estimating standard errors of medians, quantiles

2. Finance

• Risk estimation

• Portfolio volatility

3. Machine Learning

• Evaluating model stability

• Bagging (bootstrap aggregating)

4. Medical research

• Estimating effects when sample sizes are small

3.3 Permutation Tests

3.3.1 Meaning / Definition

A permutation test is a non-parametric statistical test that evaluates whether two groups
differ significantly by randomly shuffling group labels.

Unlike bootstrap, permutation does NOT sample with replacement: every observed value is used exactly once in each shuffle.

Purpose:

• Test significance without assuming normality

• Compare group means, medians, or any statistic

• Useful for small samples and unknown distributions

3.3.2 How Permutation Test Works

Suppose you want to test whether two groups have different means.

Step 1 – Combine both groups into one dataset

Step 2 – Shuffle (permute) all values randomly

Step 3 – Split the shuffled data back into two groups of the same sizes as original

Step 4 – Compute the difference in means

Step 5 – Repeat 1,000+ times

Step 6 – Compare actual difference to permutation differences

• If actual difference is extremely rare among permutation results → significant

• If common → not significant
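
The six steps can be sketched as a small Python function. This is an illustrative sketch under stated assumptions: the name `permutation_test`, the two-sided comparison, the add-one adjustment to the p-value (a common convention that keeps the estimate away from exactly zero), and the sample scores are all hypothetical and do not come from the text.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = statistics.mean(group_b) - statistics.mean(group_a)
    pooled = list(group_a) + list(group_b)          # Step 1: combine
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):                 # Step 5: repeat
        rng.shuffle(pooled)                         # Step 2: shuffle
        new_a, new_b = pooled[:n_a], pooled[n_a:]   # Step 3: split, same sizes
        diff = statistics.mean(new_b) - statistics.mean(new_a)  # Step 4
        if abs(diff) >= abs(observed):              # Step 6: as extreme as observed?
            extreme += 1
    # Add-one adjustment: counts the observed arrangement as one permutation.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical scores, used only to exercise the function
control = [72, 75, 78, 71, 74]
treatment = [80, 83, 79, 85, 82]
observed_diff, p_value = permutation_test(control, treatment)
print(f"observed difference = {observed_diff:.1f}, p-value = {p_value:.4f}")
```

A small p-value means the observed difference is rare among the shuffled differences, which matches the decision rule in Step 6.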

3.3.3 Why Permutation Tests Are Powerful

• They make no distributional assumptions

• They reflect the idea of “random assignment” from experiments

• They directly simulate the null hypothesis:


“Both groups come from the same distribution.”

3.3.4 Simple Example

Group A scores: 10, 12, 14


Group B scores: 18, 20, 22

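Observed difference in means = mean(B) − mean(A) = 20 − 12 = 8

Now combine all values, shuffle, and repeatedly compute the difference in means.

If very few shuffled differences are at least as large as 8, the result is significant. With only 3 values per group there are just 20 distinct ways to split the data, so the smallest possible one-sided p-value is 1/20 = 0.05.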
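
Because this example is so small, every possible split can be listed instead of shuffling at random. The sketch below does this with the standard library. Exact enumeration is a variation on the shuffle-and-repeat procedure and is practical only for tiny samples.

```python
from itertools import combinations
from statistics import mean

group_a = [10, 12, 14]
group_b = [18, 20, 22]
pooled = group_a + group_b
observed = mean(group_b) - mean(group_a)   # 20 - 12 = 8

diffs = []
# Choose which 3 of the 6 positions form the new group A; the rest form group B.
for idx in combinations(range(len(pooled)), len(group_a)):
    new_a = [pooled[i] for i in idx]
    new_b = [pooled[i] for i in range(len(pooled)) if i not in idx]
    diffs.append(mean(new_b) - mean(new_a))

n_splits = len(diffs)                                            # C(6, 3) = 20
one_sided = sum(d >= observed for d in diffs) / n_splits         # B larger than A
two_sided = sum(abs(d) >= abs(observed) for d in diffs) / n_splits
print(n_splits, one_sided, two_sided)   # prints: 20 0.05 0.1
```

Only the original arrangement reaches a difference of 8, and only its mirror image reaches −8. This gives a one-sided p-value of 1/20 and a two-sided p-value of 2/20.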

3.3.5 When to Use Bootstrap vs Permutation Test

Aspect        Bootstrap              Permutation
Purpose       Estimate variability   Hypothesis testing
Sampling      With replacement       Without replacement
Assumptions   Minimal                Minimal
Output        CI, SE, bias           p-value
Used for      Estimation             Group comparison


3.4 Key Points to Remember

• Bootstrap = with replacement

• Permutation = shuffling labels

• Bootstrap estimates uncertainty

• Permutation tests significance

• Both are distribution-free and rely on computation

• Useful when classical formulas fail

Statistical Significance & p-values


4.1 Meaning of Statistical Significance

Statistical significance tells us whether an observed effect (difference in means, conversion
rate, performance metric, etc.) is likely to be real or whether it could have occurred just by
random chance.

When we run an experiment or test (such as A/B testing, t-test, or permutation test), we
observe a difference in outcomes. But not every difference is meaningful — some variations
naturally occur due to random sampling.

Statistical significance answers the question:


Is this difference strong enough that it is unlikely to occur accidentally?

It is determined through the p-value.

4.2 What Is a p-Value?

A p-value is the probability of observing a result at least as extreme as the one obtained,
assuming the null hypothesis (H₀) is true.

In simple words:
p-value tells how surprising the data is if the null hypothesis is correct.

• Small p-value → data is very unlikely under H₀ → strong evidence against H₀.

• Large p-value → data is consistent with H₀.

It does NOT tell:

• Probability that H₀ is true

• Probability that H₁ is true


• Degree of effect

It only tells how compatible the observed data is with the null hypothesis.

4.3 Significance Level (α)

The significance level α defines the cutoff for deciding whether a result is statistically
significant.

Common α values:

• α = 0.05 → 5% risk of a false positive (rejecting H₀ when it is actually true)

• α = 0.01 → 1% risk of a false positive

If p-value < α → Reject H₀ → result is statistically significant.


If p-value ≥ α → Fail to reject H₀ → result is not significant.

4.4 Interpreting p-Values (Textbook Style)

p-value        Interpretation
< 0.01         Strong evidence against H₀
0.01 to 0.05   Moderate evidence against H₀
0.05 to 0.10   Weak evidence
> 0.10         No evidence against H₀

Important:
Even if p < 0.05, the effect might be very small.
Statistical significance ≠ practical significance.

4.5 How p-Values Are Computed (High Level)

p-values come from:

• t-distribution (for t-tests)

• z-distribution (for large samples or known SD)

• Permutation distribution (for non-parametric testing)

They all compare:


• Observed test statistic vs

• Expected distribution under H₀

If the observed value is in the extreme tail → small p-value.

4.6 Example: A/B Testing Interpretation

Suppose:

• Version A CTR = 12%

• Version B CTR = 15%

Statistical test gives p = 0.03.

Interpretation:

• If α = 0.05 → 0.03 < 0.05, so the improvement is significant.

• If both versions were truly equal, a difference at least this large would appear only about 3% of
the time.

So Version B’s higher CTR is statistically significant at the 5% level.
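
The text does not say which test produced p = 0.03 or how many users saw each version. As a hedged illustration, the sketch below runs a two-proportion z-test, one common choice for comparing click-through rates. The 1,200 impressions per version are a hypothetical figure chosen only to match the stated CTRs, and the function name is likewise an assumption.

```python
import math


def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for equal proportions, using the pooled standard error."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)   # common rate under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))     # two-sided normal tail area
    return z, p_value


# Hypothetical counts: 144/1200 = 12% CTR for A, 180/1200 = 15% CTR for B
z, p_value = two_proportion_z_test(144, 1200, 180, 1200)
alpha = 0.05
print(f"z = {z:.2f}, p = {p_value:.3f}, significant: {p_value < alpha}")
```

With these assumed counts the p-value comes out close to 0.03. The same CTRs with smaller samples would give a larger p-value, so the conclusion depends on sample size as well as on the size of the difference.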

4.7 Why Statistical Significance Matters

It helps us:

• Avoid false conclusions

• Make reliable decisions

• Understand whether observed differences reflect real effects

• Ensure experiments (like A/B tests) are meaningful

In business and data science:

• Marketing teams test ads

• Product teams test UI changes

• Data scientists test model improvements

• Researchers validate scientific claims

Statistical significance helps guard against mistaking random variation for a real improvement.


4.8 Key Points to Remember

• Significance tells if effect is beyond random variation

• p-value measures compatibility with null hypothesis

• p-value does NOT measure effect size

• α is pre-chosen threshold (commonly 0.05)

• Statistical significance ≠ practical significance

• Small sample sizes → unstable p-values

• Large samples → even tiny real effects produce small p-values (significance is easier to achieve)

t-Tests (One-Sample, Two-Sample, Paired)


5.1 Introduction to t-Tests

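A t-test is a statistical method used to determine whether the mean of one group differs
significantly from a reference value, or whether the means of two groups differ significantly
from each other.
It is used when:

• The sample size is small (a common rule of thumb is n < 30)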

• The population standard deviation (σ) is unknown

• Data is approximately normally distributed

• We want to compare averages (means)

t-tests rely on the t-distribution, which adjusts for the uncertainty that comes from
estimating population variance using sample data.

t-tests are very common in:

• A/B testing

• Business analytics

• Medical studies

• Machine learning model comparisons

• Psychology and social science experiments

5.2 Why t-Tests Are Needed

We use t-tests when:


• We want to compare mean performance between groups

• We cannot use z-tests because σ is unknown

• Sample sizes are not large enough to assume normality automatically

The t-distribution helps us determine whether the difference between observed means is
truly meaningful or due to sampling randomness.

5.3 Types of t-Tests

There are three major t-tests:

Type                            Purpose
One-Sample t-Test               Compare sample mean to a known/claimed population mean
Two-Sample Independent t-Test   Compare means of two independent groups
Paired t-Test                   Compare means of the same group before and after a treatment

Each has a different formula, different use-case, and different assumptions.

5.4 ONE-SAMPLE t-TEST

5.4.1 Meaning / Use-Case

Used when we want to test whether the mean of a single sample is equal to a known or
claimed value.

Examples:

• Does the average weight of a packet match the manufacturer claim?

• Is the average time on app > 10 minutes?

• Is the average test score equal to 50 marks?

5.4.2 Formula

$$t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}$$

Where:

• $\bar{X}$ = sample mean

• $\mu_0$ = claimed population mean

• $s$ = sample standard deviation

• $n$ = sample size

5.4.3 Assumptions

• Data roughly follows a normal distribution

• σ is unknown

• Sample is random

5.4.4 Simple Example

A company claims the average battery life = 10 hours.


We test 8 devices and get mean = 9.2 hours, s = 0.7.

Apply the formula → compute t → compare with t-table.


If |t| > t-critical → reject the claim.
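
The battery example can be worked through numerically. The sketch below assumes a two-sided test at α = 0.05, which the text does not state. The critical value 2.365 is the standard t-table entry for df = 7 at that level.

```python
import math

mu_0 = 10.0    # claimed mean battery life (hours)
x_bar = 9.2    # sample mean
s = 0.7        # sample standard deviation
n = 8          # number of devices tested

t_stat = (x_bar - mu_0) / (s / math.sqrt(n))
df = n - 1
t_critical = 2.365   # assumed: two-sided test, alpha = 0.05, df = 7 (t-table)

reject = abs(t_stat) > t_critical
print(f"t = {t_stat:.2f}, df = {df}, reject claim: {reject}")
# prints: t = -3.23, df = 7, reject claim: True
```

Since |t| ≈ 3.23 exceeds 2.365, the sample is inconsistent with the claimed 10 hours under these assumptions.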

5.5 TWO-SAMPLE INDEPENDENT t-TEST

5.5.1 Meaning / Use-Case

Used when comparing means of two different independent groups.

Examples:

• Do male and female students score differently?

• Does method A perform better than method B?

• Does website version A outperform version B?

This is commonly used in A/B testing.

5.5.2 Formula

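$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

Where:

• $\bar{X}_1, \bar{X}_2$ = means of the two groups

• $s_1, s_2$ = sample standard deviations

• $n_1, n_2$ = sample sizes

This form of the standard error keeps each group’s variance separate. The pooled version replaces both $s_1^2$ and $s_2^2$ with the pooled variance $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$.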

5.5.3 Assumptions

• Both groups are independent

• Data in each group roughly normal

• Variances are equal (for pooled t-test)

• If variances unequal → Welch’s t-test

5.5.4 Interpretation

If |t| > t-critical → difference in means is significant.

5.5.5 Example (Conceptual)

Group A avg = 78
Group B avg = 74
Compute t-statistic → If p < 0.05 → A performs significantly better.
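
The conceptual example gives only the two averages, so a t-statistic cannot be computed from it directly. The sketch below uses hypothetical raw scores, chosen so that the group averages are 78 and 74, and computes both the pooled and the Welch versions. The function name and the data are assumptions for illustration.

```python
import math
import statistics


def two_sample_t(a, b):
    """Return (pooled_t, pooled_df, welch_t, welch_df) for two independent samples."""
    n1, n2 = len(a), len(b)
    m1, m2 = statistics.mean(a), statistics.mean(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)  # divide by n - 1

    # Pooled version: assumes both groups share one variance.
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    pooled_t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    pooled_df = n1 + n2 - 2

    # Welch version: keeps the variances separate and approximates df.
    se2 = v1 / n1 + v2 / n2
    welch_t = (m1 - m2) / math.sqrt(se2)
    welch_df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return pooled_t, pooled_df, welch_t, welch_df


# Hypothetical scores with averages 78 (group A) and 74 (group B)
group_a = [80, 75, 82, 76, 77]
group_b = [72, 76, 70, 78, 74]
pooled_t, pooled_df, welch_t, welch_df = two_sample_t(group_a, group_b)
print(f"pooled: t = {pooled_t:.3f}, df = {pooled_df}")
print(f"Welch:  t = {welch_t:.3f}, df = {welch_df:.2f}")
```

When the two groups have the same size, the two t-statistics coincide and only the degrees of freedom differ. With unequal group sizes and unequal variances they can differ noticeably.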

5.6 PAIRED t-TEST

5.6.1 Meaning / Use-Case

Used when the same subjects are measured twice:

• Before vs After training

• Weight before vs after diet

• Marks before vs after coaching

• App performance before vs after optimization

The test compares differences within each pair.

5.6.2 Key Idea

Instead of comparing two separate groups, we compare difference values (d):

$$d = X_{\text{after}} - X_{\text{before}}$$

Then we analyze whether the mean of d is significantly different from zero.

5.6.3 Formula

$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$

Where:

• $\bar{d}$ = mean of differences

• $s_d$ = standard deviation of differences

• $n$ = number of pairs

5.6.4 Why Paired t-Test Is Powerful

• Removes individual variation

• More sensitive to detecting changes

• Often needs fewer subjects than an independent two-sample design to detect the same change

5.6.5 Example

Before training: 60, 55, 58

After training: 65, 59, 62

Differences = 5, 4, 4

Compute $\bar{d}$, $s_d$, then t.

If t is significant → training improved performance.
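
The training example is small enough to compute in full. The sketch below assumes a two-sided test at α = 0.05, which the text does not state. The critical value 4.303 is the standard t-table entry for df = 2 at that level.

```python
import math
import statistics

before = [60, 55, 58]
after = [65, 59, 62]

d = [a - b for a, b in zip(after, before)]   # differences: [5, 4, 4]
d_bar = statistics.mean(d)                   # mean of differences
s_d = statistics.stdev(d)                    # sample SD of differences (n - 1)
n = len(d)

t_stat = d_bar / (s_d / math.sqrt(n))
df = n - 1
t_critical = 4.303   # assumed: two-sided test, alpha = 0.05, df = 2 (t-table)

print(f"d_bar = {d_bar:.3f}, s_d = {s_d:.3f}, t = {t_stat:.2f}, df = {df}")
# prints: d_bar = 4.333, s_d = 0.577, t = 13.00, df = 2
print("significant:", abs(t_stat) > t_critical)
```

The differences are very consistent (5, 4, 4), so the standard error is tiny and t is large even with only three pairs.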

5.7 Degrees of Freedom in t-Tests

Degrees of freedom (df) for t-tests:

• One-sample: df = n − 1

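• Two-sample: df = n₁ + n₂ − 2 (pooled, equal-variance version; Welch’s t-test uses an approximate df that is never larger than this)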

• Paired t-test: df = n − 1

df determines which t-distribution curve to use.

5.8 When to Use Each t-Test

Scenario                               t-Test
Compare sample mean to claimed value   One-sample t-test
Compare two independent groups         Two-sample t-test
Same group measured twice              Paired t-test

5.9 Applications of t-Tests

Business & Product Analytics

• Conversion rate comparison

• A/B test analysis

• Customer satisfaction changes

Machine Learning

• Comparing model performance

• Feature importance testing

Healthcare

• Treatment effectiveness

• Drug trials

Education

• Performance improvement

• Teaching method comparison

5.10 Key Points to Remember

• Use t-test when σ is unknown

• Sample size small → t-distribution more accurate than the normal (z) distribution

• Always check assumptions

• Paired t-test removes individual variation

• Welch’s t-test is used when variances differ

• t-tests rely on p-values for significance

Multiple Testing & Bonferroni Correction


6.1 Introduction to Multiple Testing
In real-world data science and analytics, it is common to perform several hypothesis tests at
the same time.

Examples:

• Checking which of 20 website features improve conversion

• Comparing performance of multiple marketing ads

• Testing many variables to see which correlates with sales

• Exploratory data analysis on large datasets

When multiple tests are conducted, the risk of making a Type I error (false positive)
increases significantly.

A Type I error happens when we incorrectly reject a true null hypothesis.

For a single test:

𝑃(Type I Error) = 𝛼

But when m independent tests are run simultaneously, each at level α:

$$P(\text{at least one false positive}) = 1 - (1 - \alpha)^m$$

This becomes large very quickly.

Example:

If α = 0.05 and we run 20 tests:

$$P = 1 - (0.95)^{20} \approx 0.642$$

64.2% chance of getting at least one false positive even when all null hypotheses are
true.
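
To see how quickly this probability grows, the formula can be evaluated for several values of m. The sketch assumes independent tests with all null hypotheses true, as in the formula. The function name is an assumption for illustration.

```python
def family_wise_error_rate(alpha, m):
    """P(at least one false positive) across m independent tests, all nulls true."""
    return 1 - (1 - alpha) ** m


for m in (1, 5, 10, 20, 50):
    rate = family_wise_error_rate(0.05, m)
    print(f"m = {m:2d} tests -> P(at least one false positive) = {rate:.3f}")
```

The row for m = 20 reproduces the 0.642 figure from the example.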

This problem is called the Multiple Testing Problem (or Multiple Comparisons Problem).

6.2 Why Multiple Testing Is a Problem

When many tests are run:

• Some statistically significant results occur just by chance


• Conclusions drawn from those chance results are misleading

• Researchers may incorrectly claim a variable is important

• Businesses might implement changes thinking something “worked”

Multiple testing therefore increases:

• False positives

• Overfitting

• Misinterpretation of significance

To avoid this, corrections are applied to control error rates.

6.3 Bonferroni Correction — Concept

The Bonferroni Correction is a simple, conservative method used to control the Family-Wise
Error Rate (FWER).

Family-Wise Error Rate (FWER)

Probability of making at least one false positive among all tests.

To control FWER at α (usually 0.05), Bonferroni adjusts the significance threshold for each
individual test.

6.4 Bonferroni Correction — Formula


$$\alpha_{\text{corrected}} = \frac{\alpha}{m}$$

Where:

• α = overall significance level (e.g., 0.05)

• m = number of hypothesis tests

This makes the threshold per test much smaller.

Example:

If m = 10 tests:

$$\alpha_{\text{corrected}} = \frac{0.05}{10} = 0.005$$

Now only p-values < 0.005 will be considered significant.

6.5 Why This Works

Bonferroni ensures:

𝑃(any false positive) ≤ 𝛼

By lowering α per test, it compensates for the fact that many tests are being conducted.
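
One way to see why the bound holds is the union bound (Boole's inequality). As a sketch, let $A_i$ denote the event that test $i$ produces a false positive when each test is run at level $\alpha/m$. The symbol $A_i$ is introduced here only for illustration.

$$P\left(\bigcup_{i=1}^{m} A_i\right) \le \sum_{i=1}^{m} P(A_i) \le m \cdot \frac{\alpha}{m} = \alpha$$

The inequality does not use independence, so the guarantee holds however the tests are related. The price is that the bound can be loose, which is why Bonferroni is described as conservative.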

6.6 Example — Practical Interpretation

A product team tests 5 different email subject lines.

Subject Line   p-value
A              0.03
B              0.01
C              0.04
D              0.20
E              0.07

Using normal α = 0.05 → we would claim A, B, C are significant.

But m = 5 tests → Bonferroni corrected α:


$$\alpha_{\text{corrected}} = \frac{0.05}{5} = 0.01$$

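Now only p < 0.01 is significant.

So:

• A: 0.03 → Not significant

• B: 0.01 → Not significant under the strict rule p < α from section 4.3, because it equals the threshold exactly (it would just qualify under the p ≤ α convention that some texts use)

• C: 0.04 → Not significant

• D: 0.20 and E: 0.07 → Not significant with or without the correction

Bonferroni prevents us from declaring three subject lines “winners” when at most one of them
survives the correction.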
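
The subject-line example can be checked with a short sketch. It applies the strict rule p < α from section 4.3 and, for comparison, the inclusive rule p ≤ α that some texts use, because subject line B sits exactly on the corrected threshold. The function name and the `strict` flag are assumptions for illustration. `Decimal` is used so that the comparison at the boundary is exact.

```python
from decimal import Decimal


def bonferroni(p_values, alpha=Decimal("0.05"), strict=True):
    """Return the per-test threshold alpha/m and a significance flag per test."""
    threshold = alpha / len(p_values)
    if strict:
        flags = {name: p < threshold for name, p in p_values.items()}
    else:
        flags = {name: p <= threshold for name, p in p_values.items()}
    return threshold, flags


# p-values from the subject-line table
p_values = {
    "A": Decimal("0.03"),
    "B": Decimal("0.01"),
    "C": Decimal("0.04"),
    "D": Decimal("0.20"),
    "E": Decimal("0.07"),
}

threshold, strict_flags = bonferroni(p_values)
_, inclusive_flags = bonferroni(p_values, strict=False)

print("corrected alpha:", threshold)                                   # 0.01
print("strict (p < a): ", [n for n, s in strict_flags.items() if s])     # []
print("inclusive (<=): ", [n for n, s in inclusive_flags.items() if s])  # ['B']
```

Under either rule, A and C lose the significance they had at the uncorrected α = 0.05.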
6.7 When to Use Bonferroni Correction

Use Bonferroni when:

• Number of tests is small or moderate (m ≤ 20)

• You want strong protection against false positives

• Making even one false positive is costly (medicine, finance, experiments)

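• Tests are independent or correlated (the bound holds either way, but the correction is most conservative when tests are strongly correlated)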

Bonferroni is reliable but conservative.

6.8 Limitations of Bonferroni

Bonferroni is very strict:

• Reduces the chance of false positives

• But increases the chance of false negatives

• You may miss real effects because the threshold becomes too small

• Not suitable when hundreds of tests are involved (e.g., ML feature selection)

But for Module-3 textbook content → this is the only correction method you must study.

6.9 Key Concepts to Remember

• Multiple Testing increases the probability of false positives

• Bonferroni adjusts α → making tests more conservative

• New threshold = α/m

• Controls FWER

• Best used for small number of comparisons

• Prevents misleading “significant” results caused by random chance

6.10 Applications in Data Science

• A/B testing multiple versions (A vs B vs C vs D…)

• Feature selection (checking many variables for correlation)


• Clinical trials testing multiple outcomes

• Marketing experiments

• Experimental research

In each case, the correction limits the chance that a reported finding is a false positive.

Degrees of Freedom (df)


7.1 Meaning of Degrees of Freedom

Degrees of Freedom (df) represent the number of independent values that can vary freely
when calculating a statistic.

Whenever we estimate a parameter (like mean or variance), we lose one or more degrees of
freedom because certain values become fixed due to constraints.

Simple intuition:

If you know:

• You have 5 numbers

• And their average is 10

You can choose any 4 values freely.


The 5th value is automatically determined by the mean constraint.

So df = 5 − 1 = 4

This general idea applies to many statistical tests.

7.2 Why Are Degrees of Freedom Important?

Degrees of freedom determine:

• The shape of the t-distribution

• The critical t-values used for hypothesis testing

• How much uncertainty exists in variance estimation

• Whether a test becomes more strict or more lenient

Key idea:

Smaller df → more uncertainty → heavier tails → larger critical values

As df increases:
• The t-distribution becomes narrower

• It begins to resemble the normal distribution

• Tests become more reliable and accurate

7.3 Degrees of Freedom in t-Tests

Different t-tests have different df formulas.

A) One-Sample t-Test

𝑑𝑓 = 𝑛 − 1

Reason:

• We estimate 1 parameter (the sample mean), which uses up 1 degree of freedom.

B) Two-Sample Independent t-Test (Equal Variances)

𝑑𝑓 = 𝑛1 + 𝑛2 − 2

Reason:

• We estimate 1 mean per group → 2 parameters estimated → lose 2 df.

This is the pooled variance version (classical two-sample t-test).

C) Paired t-Test

𝑑𝑓 = 𝑛 − 1

Reason:

• We work with differences (d-values)

• Only 1 mean (mean of differences) is estimated → lose 1 df

7.4 Why Low Degrees of Freedom Produce Heavier Tails

When df is small:
• The sample size is small

• Variance estimate is less reliable

• There is more uncertainty in the data

To reflect this uncertainty, the t-distribution assigns more probability to extreme values.

Hence:

• Tails become “heavier”

• Critical t values become larger

• You need stronger evidence to reject H₀

As df → ∞:

• The t-distribution converges to the standard normal distribution.

7.5 Example for Understanding

Imagine you take the test scores of 5 students:

Scores: 70, 75, 80, 85, 90

Sample mean = 80
If you change any 4 values freely, the 5th value must adjust so the mean remains 80.

Thus df = 4.

When you calculate sample variance:

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

Division is by n − 1, not n, because 1 degree of freedom is used up by estimating the mean from the same data.

This is the most direct and visible use of df in statistics.
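
The five scores above make the n − 1 divisor concrete. This is a minimal sketch using the standard library. `statistics.variance` divides by n − 1 and `statistics.pvariance` divides by n.

```python
import statistics

scores = [70, 75, 80, 85, 90]
n = len(scores)
mean = sum(scores) / n                       # 80.0

deviations = [x - mean for x in scores]      # [-10, -5, 0, 5, 10]
# The deviations always sum to zero: this is the constraint that costs 1 df.
print("sum of deviations:", sum(deviations))

sum_sq = sum(dev ** 2 for dev in deviations)   # 250.0
sample_var = sum_sq / (n - 1)                  # divide by df = 4 -> 62.5
divide_by_n = sum_sq / n                       # divide by n = 5  -> 50.0

print("sample variance (n - 1):", sample_var, statistics.variance(scores))
print("divide-by-n variance:   ", divide_by_n, statistics.pvariance(scores))
```

Dividing by n gives the smaller value (50 versus 62.5). Dividing by n − 1 corrects for the fact that deviations are measured from the sample mean rather than the true population mean.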

7.6 Degrees of Freedom in Other Tests

Although Module-3 focuses on t-tests, degrees of freedom also appear in:

1. Chi-Square Test

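df = (r − 1)(c − 1), where r and c are the numbers of rows and columns of the contingency table

2. ANOVA (F-test)

• Between-groups df: k − 1 (k = number of groups)

• Within-groups df: N − k (N = total number of observations)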

3. Regression

df = n − (number of parameters estimated)

7.7 Applications of Degrees of Freedom

Degrees of freedom are essential whenever we:

✓ Calculate t-statistics
✓ Lookup t-critical values in t-tables
✓ Construct confidence intervals
✓ Estimate sample variance
✓ Perform regression
✓ Conduct ANOVA or chi-square tests
✓ Run any test that uses a distribution dependent on df

Without df, we cannot choose the correct statistical distribution.

7.8 Key Points to Remember

• Degrees of freedom measure the number of free independent values

• df decreases when parameters are estimated

• t-distribution shape depends on df

• Low df → wider, heavier tails → more conservative tests

• High df → distribution close to normal

• Each t-test has its own formula for df
