Testing Hypotheses
By
Dr. Yang Zhang
1
Overview
● Models that involve chance
● Assessing models
● Comparing distributions
● Hypothesis testing and p-values
● Making decisions with incomplete information
● Error probabilities
2
Review: Distributions
● Any random quantity—aka, random variable (RV)—has a probability
distribution, which captures
○ all the possible values the RV can take; and
○ the probability that the RV takes each value.
● After repeated draws, the RV has an empirical distribution, which
captures
○ all the observed values the RV took; and
○ the proportion of times the RV took each value.
● Law of Large Numbers: With increasing number of independent
draws, the empirical distribution looks more and more like the probability
distribution.
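As a rough illustration of the Law of Large Numbers, the sketch below rolls a fair six-sided die with NumPy and reports how far the empirical distribution is from the probability distribution as the number of draws grows. The die, the seed, and the helper name `empirical_distribution` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

faces = np.arange(1, 7)
probability_dist = np.full(6, 1 / 6)  # fair die: each face has chance 1/6

def empirical_distribution(num_draws):
    """Proportion of times each face appears in num_draws independent rolls."""
    rolls = rng.choice(faces, size=num_draws)
    counts = np.array([np.count_nonzero(rolls == face) for face in faces])
    return counts / num_draws

for num_draws in (10, 1_000, 100_000):
    emp = empirical_distribution(num_draws)
    largest_gap = np.max(np.abs(emp - probability_dist))
    print(num_draws, "draws: largest gap =", round(largest_gap, 4))
```

The largest gap between the two distributions should generally shrink as the number of draws increases.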
3
Inference
4
Terminology
● Parameter
○ A number that characterizes an aspect of the population
○ Generally, impractical to determine directly
○ Example:
■ A coin of unknown, but fixed (deterministic), bias b in favor of a Head:
■ b = Probability of a Head
■ 1-b = Probability of a Tail
5
Inference
● Statistical Inference:
Draw conclusions (draw inferences) based on data in random
samples
● Example:
Use the data to guess the value of an unknown, but fixed (deterministic),
number (parameter).
Create an estimate of the unknown number using a statistic,
which depends on the random sample,
and is, therefore, itself random.
6
Terminology
● Statistic is a number
○ calculated from the sample
○ descriptive of the entire sample
○ serves as an estimate of the unknown parameter
● Example
○ Flip the coin n times. Count the number H of Heads in those n flips.
○ An estimate of the coin’s bias b is the statistic given by $\hat{B} = H / n$
Note: H is random (capital letter). Hence, so is $\hat{B}$.
But b and n are deterministic (small letters).
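A minimal sketch of the coin example, assuming the estimate is the proportion of Heads, H / n. The bias value `b = 0.6` and the number of flips are made up for illustration; in practice b is unknown.

```python
import numpy as np

rng = np.random.default_rng(1)

b = 0.6    # hypothetical true bias; unknown in practice
n = 1_000  # number of flips (deterministic)

flips = rng.random(n) < b        # True marks a Head, which occurs with chance b
H = np.count_nonzero(flips)      # random: changes from sample to sample
estimate = H / n                 # the statistic: proportion of Heads

print("H =", H, "estimate of b =", estimate)
```

Changing the seed gives a different H and therefore a different estimate, while b and n stay fixed.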
7
Probability Distribution of a Statistic
● Values of a statistic vary, because random samples vary
● “Sampling distribution” or “probability distribution” of the
statistic captures
○ all possible values of the statistic; and
○ all the corresponding probabilities of those values.
● Typically hard to calculate:
○ must either do the math (often intractable); or
○ must generate all possible samples and calculate the
statistic based on each sample (impractical).
8
Empirical Distribution of a Statistic
● Empirical distribution of a statistic:
○ Based on simulated and/or sampled values (i.e.,
observed values) of the statistic;
○ Consists of all the observed values of the statistic; and
○ Shows proportion of times each value appeared (was
observed).
● Good approximation to the probability distribution of the
statistic
○ IF the number of repetitions in the simulation is large.
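The sketch below builds an empirical distribution of a statistic by simulation, using the proportion of Heads in n coin flips as the statistic. The bias, the number of flips, and the number of repetitions are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

b, n = 0.6, 100            # hypothetical bias and number of flips
repetitions = 10_000

# Each entry is one simulated value of the statistic (proportion of Heads).
simulated_estimates = rng.binomial(n, b, size=repetitions) / n

# Empirical distribution: the observed values and the proportion of
# repetitions in which each value appeared.
values, counts = np.unique(simulated_estimates, return_counts=True)
proportions = counts / repetitions

most_common = values[np.argmax(proportions)]
print("most frequently observed value:", most_common)
```

With more repetitions, `proportions` should track the true sampling distribution more closely.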
9
Assessing Models
10
Models
● A model is a set of assumptions about the data
● In data science, many models involve assumptions
about processes that involve randomness
○ “Chance models”
● Key question: Does the model fit the data?
11
Approach to Assessment
● If we can simulate data according to the model’s
assumptions, we can learn what the model predicts.
● Can then compare the predictions with the observed
data.
● If data and model’s predictions are inconsistent, we
have evidence against the model.
12
Minecraft (!?)
(An Example)
13
Scene
● Minecraft is a video game
● People called “Speed Runners” race to win the game as fast as
possible
● One famous speed runner has the username “Dream”
○ 3.7 million followers on Twitter
○ 5 million followers on the video game streaming site Twitch
○ $$$$$
● October 2020
○ 6 live streams on Twitch
○ Dream is trying to win the game as quickly as possible
○ Accused of cheating
14
Scene Continued
● In the game, there are characters called
“Piglin” who may give you items that help
you win faster. They may give you the
item when you initiate a trade with them.
● Every time you initiate a trade, they
have a 5% chance to give you an item
called an “Ender Pearl” that helps you
win.
15
Summary
16
The data
● Data: across the 6 streams, Dream initiated trades 262
times, and received an “Ender Pearl” 42 times.
● Question: is 42/262 (16%) a realistic outcome, if
Dream was playing an unmodified version of the game?
● Recall: in the unmodified game, there is a 5% chance
of receiving an “Ender Pearl” in a trade.
17
Sampling from a Distribution
● Sample at random from a categorical distribution
sample_proportions(sample_size, pop_distribution)
● Samples at random from the population
○ Returns an array containing the distribution of the
categories in the sample
(Demo)
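The following sketch shows one way the Minecraft question could be simulated. It uses NumPy's multinomial sampler as a stand-in for `sample_proportions` so that it runs without the `datascience` library; the helper names `sample_proportions_np` and `simulated_pearls` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_proportions_np(sample_size, pop_distribution):
    """Stand-in for sample_proportions: category proportions in one random sample."""
    return rng.multinomial(sample_size, pop_distribution) / sample_size

num_trades = 262
model = np.array([0.05, 0.95])   # [Ender Pearl, anything else]
observed_pearls = 42

def simulated_pearls():
    """Number of Ender Pearls in 262 trades simulated under the 5% model."""
    proportions = sample_proportions_np(num_trades, model)
    return int(round(proportions[0] * num_trades))

simulated = np.array([simulated_pearls() for _ in range(10_000)])
print("largest simulated count:", simulated.max())
print("simulations with at least 42 pearls:",
      np.count_nonzero(simulated >= observed_pearls))
```

Comparing the observed count of 42 with the simulated counts shows what the 5% model predicts and how far the data fall from that prediction.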
18
19
A Genetic Model
20
Gregor Mendel, 1822-1884
21
A Model
● Pea plants of a particular kind
● Each one has either purple flowers or white flowers
● Mendel’s model:
○ Each plant is purple-flowering with chance 75%,
○ regardless of the colors of the other plants
● Question:
○ Is the model good, or not?
22
Choosing a Statistic
● Take a sample, see what percent are purple-flowering
● If that percent is much larger or much smaller than 75,
that is evidence against the model
● Distance from 75 is the key
● Statistic:
| sample percent of purple-flowering plants - 75 |
● If the statistic is large, that is evidence against the
model
(Demo)
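A minimal sketch of simulating this statistic under Mendel's model. The sample size is not stated in the text, so `sample_size = 1_000` is a hypothetical value, and the function name is an assumption.

```python
import numpy as np

rng = np.random.default_rng(4)

sample_size = 1_000    # hypothetical number of plants; not given in the text
model_percent = 75

def one_simulated_distance():
    """| sample percent of purple-flowering plants - 75 | for one simulated sample."""
    num_purple = rng.binomial(sample_size, model_percent / 100)
    sample_percent = 100 * num_purple / sample_size
    return abs(sample_percent - model_percent)

distances = np.array([one_simulated_distance() for _ in range(10_000)])
print("median simulated distance:", round(float(np.median(distances)), 2))
```

An observed distance far to the right of most values in `distances` would count as evidence against the model.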
23
Two Viewpoints
24
Model and Alternative
● Minecraft:
○ Model: Each trade has a 5% chance of returning an
ender pearl
○ Alternative viewpoint: The chance was higher than
5% in the version Dream played
● Genetics:
○ Model: Each plant has a 75% chance of having
purple flowers
○ Alternative viewpoint: No, it doesn’t
25
Steps in Assessing a Model
● Choose a statistic to measure discrepancy between
model and data
● Simulate the statistic under the model’s assumptions
● Compare the data to the model’s predictions:
○ Draw a histogram of simulated values of the statistic
○ Compute the observed statistic from the real sample
● If the observed statistic is far from the histogram, that is
evidence against the model
26
Comparing Distributions
27
Example: Haribo Goldbears
● There are 5 different flavors in a
pack of Haribo Goldbears (gummy
bears)
● Let’s say that Haribo claims that
each pack has the same proportion
of each flavor of gummy bears
● Yanay buys a pack of gummy bears
and notices that the proportions of
the flavors seem different from
what he expected. Is this difference
due to chance, or is there something
else going on?
(Demo)
28
A New Statistic:
Total Variation Distance
(TVD)
29
Distance Between Distributions
● Distribution of flavors is categorical
● To see whether the distribution of flavors in the bag is
close to that of what Haribo claims, we have to measure
the distance between two categorical distributions
30
Total Variation Distance
Every distance has a computational recipe
Total Variation Distance (TVD):
● For each category, compute the difference in
proportions between two distributions
● Take the absolute value of each difference
● Sum, and then divide the sum by 2
(Demo)
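The recipe above can be sketched as a short function. The pack proportions below are made up for illustration, and the function name `total_variation_distance` is an assumption.

```python
import numpy as np

def total_variation_distance(dist1, dist2):
    """TVD between two categorical distributions given as arrays of proportions."""
    differences = np.asarray(dist1) - np.asarray(dist2)   # difference per category
    return np.sum(np.abs(differences)) / 2                # absolute values, sum, halve

claimed = np.array([0.20, 0.20, 0.20, 0.20, 0.20])   # equal share per flavor
in_pack = np.array([0.30, 0.10, 0.20, 0.25, 0.15])   # hypothetical pack

print(total_variation_distance(claimed, in_pack))
```

For these made-up proportions the absolute differences are 0.10, 0.10, 0, 0.05, and 0.05, which sum to 0.30, so the TVD is about 0.15.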
31
Summary of the Method
To assess whether a sample was drawn randomly from a known
categorical distribution:
● Use TVD as the statistic to measure the distance between categorical
distributions
● Sample at random from the population
● Compute the TVD between the random sample and the known distribution
● Repeat numerous times (e.g., 1,000 times)
● Compare:
○ Empirical distribution of simulated TVDs
○ Actual TVD from the sample in the study
32
Testing Hypotheses
33
Testing Hypotheses
● A test chooses between two views of how data were
generated
● The views are called hypotheses
● The test picks the hypothesis that is better supported by the
observed data
34
Null and Alternative
The method only works if we can simulate data under one
of the hypotheses.
● Null Hypothesis
○ A well-defined chance model about how the data
were generated
○ We can simulate data under the assumptions of this
model – “under the null hypothesis”
● Alternative Hypothesis
○ A different view about the origin of the data
35
Haribo Example
Null Hypothesis: The flavors of gummy bears are equally
likely, with probability ⅕ per flavor. Any
difference is due to chance alone.
Alternative Hypothesis: The difference is not due to
chance: the gummy bears are not evenly
distributed among the flavors, with some flavors being
more prevalent than others.
36
Test Statistic
● Test Statistic: Statistic we choose to simulate and
decide between the two hypotheses
Questions before choosing the statistic:
● What values of the statistic will make us lean toward the
null hypothesis?
● What values will make us lean toward the alternative?
○ Preferably, the answer should be just “high”.
Try to avoid “both high and low,” if possible.
37
Haribo Example
● Test Statistic: Total variation distance
def tvd(dist1, dist2):
    # Half the sum of absolute differences between two arrays of proportions
    return sum(abs(dist1 - dist2)) / 2
38
Prediction Under the Null Hypothesis
● Simulate the test statistic under the null hypothesis many times
(e.g., 1,000 times)
● Draw the histogram of the simulated values
● This displays the empirical distribution of the statistic under
the null hypothesis
● It’s a prediction about the statistic, made by the null hypothesis
○ It shows all the likely values of the statistic
○ Shows how likely they are (if the null hypothesis is true)
● The probabilities are approximate, because we can’t generate
all the possible random samples
39
Haribo Example
● Simulate the test statistic under the null hypothesis many times
(e.g., 10,000 times, matching num_simulations below)
● Draw the histogram of the simulated values
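import numpy as np
from datascience import make_array

# simulated_tvd() is assumed to be defined earlier (in the demo)
tvds = make_array()
num_simulations = 10000
for i in np.arange(num_simulations):
    new_tvd = simulated_tvd()
    tvds = np.append(tvds, new_tvd)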
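The loop above relies on a `simulated_tvd` function that is not shown on the slide. The sketch below is one plausible version, written with NumPy only; the pack size of 100 bears is a hypothetical value, and the real demo may define the function differently.

```python
import numpy as np

rng = np.random.default_rng(5)

null_dist = np.full(5, 1 / 5)   # null hypothesis: 1/5 chance per flavor
pack_size = 100                 # hypothetical number of bears in a pack

def tvd(dist1, dist2):
    return np.sum(np.abs(dist1 - dist2)) / 2

def simulated_tvd():
    """TVD between one pack simulated under the null and the null distribution."""
    simulated_dist = rng.multinomial(pack_size, null_dist) / pack_size
    return tvd(simulated_dist, null_dist)

num_simulations = 10_000
tvds = np.array([simulated_tvd() for _ in range(num_simulations)])
print("95th percentile of simulated TVDs:", np.percentile(tvds, 95))
```

Collecting the values in a list comprehension avoids repeatedly copying the array, which is what appending inside a loop does; the result is the same set of simulated statistics.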
40
Conclusion of the Test
Resolve choice between null and alternative hypotheses
● Compare the observed test statistic and its empirical
distribution under the null hypothesis
● If the observed value is not consistent with the distribution,
then the test favors the alternative (“data more consistent with
the alternative”)
Whether a value is consistent with a distribution:
● A visualization may be sufficient
● If not, there are conventions about “consistency” (stay tuned)
41
Hypothesis Testing with
Python
42
Defining Hypotheses
● First, identify the scientific question to be answered and
formulate it as a Null Hypothesis (H₀) and an Alternative
Hypothesis (H₁ or Hₐ).
● H₀ and H₁ must be mutually exclusive, and H₁
must not contain equality:
○ H₀: μ=x, H₁: μ≠x
○ H₀: μ≤x, H₁: μ>x
○ H₀: μ≥x, H₁: μ<x
43
Assumption Check
● To decide whether to use the parametric or nonparametric version
of the test, check the requirements listed below:
○ Observations in each sample are independent and identically
distributed (IID).
○ Observations in each sample are normally distributed.
○ The samples being compared have equal variances.
44
Selecting the Proper Test
● Then we select the appropriate test to be used.
● When choosing the proper test, it is essential to analyze
how many groups are being compared and whether the
data are paired or not.
● To determine whether the data are paired (matched),
consider whether the measurements were collected
from the same individuals.
45
Selecting the Proper Test
● Accordingly, you can decide on the appropriate test
using the chart below.
46
Decision and Conclusion
● After performing the hypothesis test, we obtain a
p-value that measures the strength of the evidence against H₀.
● If the p-value is smaller than alpha (the significance
level), there is enough evidence against
H₀, and you can reject H₀.
47
Decision and Conclusion
● Otherwise, you fail to reject
H₀. Please remember that
rejecting H₀ supports H₁; it does not prove H₁.
● However, failing to reject H₀
does not mean H₀ is valid,
nor does it mean H₁ is wrong.
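The decision rule can be written as a small helper. This is only a sketch: the function name `decide` and the default cutoff of 0.05 are assumptions, and the wording of the returned strings follows the slide.

```python
def decide(p_value, alpha=0.05):
    """Apply the cutoff rule: reject H0 only when the p-value is below alpha."""
    if not 0 <= p_value <= 1:
        raise ValueError("a p-value must lie between 0 and 1")
    return "reject H0" if p_value < alpha else "fail to reject H0"

print(decide(0.003))
print(decide(0.20))
print(decide(0.03, alpha=0.01))
```

Note that the helper never returns "accept H0": failing to reject is not the same as showing that H₀ is true.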
(Demo)
48
Decisions and Uncertainty
49
Incomplete Information
● Try to choose between two worldviews (hypotheses) based on
data in samples (we rarely have access to the entire population).
● Not always clear whether the data are consistent with one
hypothesis or the other.
● Easier (More Obvious) Decision:
Observed data can turn out quite extreme.
Unlikely, but possible.
● Harder (Less Obvious) Decision:
Observed data can turn out in the proverbial ‘gray area’ —
within reach of each of the two hypotheses.
50
Another Example
(“Gray Area” Type)
51
The Problem
● Large(ish) Statistics class divided into 12 discussion
sections
● Graduate Student Instructors (GSIs) lead the sections
● After midterm, students in Sec. 3 notice average score
in their section lower than in others!
52
The GSI’s Defense
Sec. 3 GSI Position (Null Hypothesis):
● Had we picked my section at random from the whole
class, we could’ve gotten an average like this one.
Alternative Hypothesis:
● No! Sec. 3’s average score too low.
Randomness not the only reason for lower scores.
(Demo)
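A sketch of how the GSI's claim could be simulated. The class scores, the section size, and the Section 3 average are not given in the text, so every number below is hypothetical and generated only to make the example runnable.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical midterm scores for a class of 360; real scores are not in the text.
class_scores = rng.integers(0, 26, size=360)
section_size = 30                 # hypothetical size of Section 3
observed_average = 13.0           # hypothetical Section 3 average

def random_section_average():
    """Average score of section_size students drawn without replacement."""
    section = rng.choice(class_scores, size=section_size, replace=False)
    return section.mean()

averages = np.array([random_section_average() for _ in range(10_000)])

# Low averages point toward the alternative, so the tail of interest is the left one.
proportion_as_low = np.count_nonzero(averages <= observed_average) / len(averages)
print("proportion of random sections averaging this low or lower:", proportion_as_low)
```

Sampling without replacement mirrors how a real section is formed: no student appears in the same section twice.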
53
Statistical Significance
54
Tail Areas
Minecraft Ender Pearls: Observed Number (42)
Haribo Goldbears: Observed TVD (0.13)
Mendel’s Pea Plants: Observed Distance (1.32)
To quantify reasonableness of observation relative to the
random samples, look at tail probabilities.
55
Conventions About Inconsistency
● “Inconsistent with the null”: The test statistic is in the tail
of the empirical distribution—under the null hypothesis
○ The farther out in the tail the test statistic lies, the more
inconsistent it is with the null hypothesis
● “In the tail,” first convention:
○ The area in the tail is less than 5%
○ The result is “statistically significant”
● “In the tail,” second convention:
○ The area in the tail is less than 1%
○ The result is “highly statistically significant”
56
The p-Value as an Area
Figure: Distribution under the Null Hypothesis
● Empirical distribution
of the test statistic
under the null
hypothesis.
● Red dot denotes the
observed statistic.
● Yellow area denotes
the tail probability (p-value).
(Demo)
57
Definition of the p-value
Formal name: observed significance level
The p-value is the chance (probability),
● under the null hypothesis,
● that the test statistic
● is equal to the value that was observed in the data
● or is even further in the direction of the alternative.
● Last two bullets mean: “test statistic is at least as
extreme as the observed value.”
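In a simulation, the p-value is approximated by the proportion of simulated statistics that are at least as extreme as the observed one. The sketch below assumes that large values of the statistic point toward the alternative; the function name and the tiny data set are made up for illustration.

```python
import numpy as np

def empirical_p_value(simulated_statistics, observed_statistic):
    """Proportion of simulated values at least as large as the observed one.

    Assumes large values of the statistic point toward the alternative.
    """
    simulated_statistics = np.asarray(simulated_statistics)
    at_least_as_extreme = np.count_nonzero(simulated_statistics >= observed_statistic)
    return at_least_as_extreme / len(simulated_statistics)

# Tiny made-up example: 2 of the 8 simulated values are >= 0.13.
simulated = np.array([0.02, 0.05, 0.04, 0.13, 0.08, 0.20, 0.06, 0.03])
print(empirical_p_value(simulated, 0.13))
```

If small values favored the alternative instead, the comparison would flip to `<=`.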
58
P-Values and Error Probabilities
59
Can the Conclusion be Wrong?
Yes.
                              Null is true    Alternative is true
Test favors the null          correct         wrong
Test favors the alternative   wrong           correct
60
An Error Probability
● The cutoff for the P-value is an error probability.
● If:
○ your cutoff is 5%
○ and the null hypothesis happens to be true
● then there is about a 5% chance that your test will
reject the null hypothesis.
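The claim can be checked by simulation: run many tests on data that really do come from the null model and count how often the p-value falls below the cutoff. The statistic used here (distance of a sample mean from 0.5, for 30 uniform draws) is an arbitrary choice made for this sketch, as are the function name and the number of repetitions.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_statistics(how_many):
    """Simulate |sample mean - 0.5| for samples of 30 uniform(0, 1) draws."""
    samples = rng.random((how_many, 30))
    return np.abs(samples.mean(axis=1) - 0.5)

null_distribution = simulate_statistics(20_000)   # prediction under the null
cutoff = 0.05

# "Observed" statistics from 5,000 data sets generated under a true null.
observed_statistics = simulate_statistics(5_000)

# p-value = proportion of null values at least as large as the observed one.
sorted_null = np.sort(null_distribution)
num_smaller = np.searchsorted(sorted_null, observed_statistics, side="left")
p_values = 1 - num_smaller / len(sorted_null)

rejection_rate = np.mean(p_values < cutoff)
print("proportion of true nulls rejected:", rejection_rate)
```

The printed proportion should land near the 5% cutoff, which is the sense in which the cutoff is an error probability.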
61
P-value cutoff vs P-value
● P-value cutoff
○ Does not depend on observed data or simulation
○ Decide on it before seeing the results
○ Conventional values at 5% and 1%
○ Probability of rejecting the null hypothesis when it is true
● P-value (empirical)
○ Depends on the observed data and simulation
○ Probability under the null hypothesis that the test
statistic is the observed value or further towards the
alternative
62
How We’ve Tested Thus Far
63
Hypothesis Testing Review
● One Category (e.g. percent of flowers that are purple)
○ Test Statistic (1): empirical_percentage
○ Test Statistic (2): abs(empirical_percentage - null_percentage)
○ How to Simulate: sample_proportions(n, null_dist)
● Multiple Categories (e.g. flavor distribution of gummy bears)
○ Test Statistic: tvd(empirical_dist, null_dist)
○ How to Simulate: sample_proportions(n, null_dist)
● Numerical Data (e.g. scores in a lab section)
○ Test Statistic: empirical_mean
○ How to Simulate: population_data.sample(n, with_replacement=False)
64