Advanced Statistics & Hypothesis Testing
Hypothesis Testing
Hypothesis testing is a fundamental statistical method used to assess claims or hypotheses about populations based
on sample data. It involves two key hypotheses:
Null Hypothesis (H0): This is the default assumption or status quo. It suggests that there is no significant
effect, difference, or relationship in the population.
Alternative Hypothesis (Ha): This contradicts the null hypothesis and suggests that there is a significant
effect, difference, or relationship in the population.
Example
Scenario: Imagine you are a teacher, and you believe that giving students a snack before a test will improve their
performance. However, your friend suggests that snacks might not make a difference.
Hypotheses:
Null Hypothesis (H0): Giving students a snack has no effect on test performance.
Alternative Hypothesis (Ha): Giving students a snack improves test performance.
Testing: To test this, you decide to conduct a small experiment. You randomly select two groups of students: one
group receives a snack before the test, and the other does not. After the test, you compare the average scores
between the two groups.
Results: If the group that received snacks shows a significantly higher average score, you might reject the null
hypothesis in favor of the alternative hypothesis, concluding that snacks do have an impact on test performance.
Conclusion: If there’s no significant difference in the average scores between the two groups, you would fail to reject
the null hypothesis, suggesting that snacks may not have a noticeable effect on test performance.
Hypothesis Testing Steps
Example:
Suppose a coffee shop owner wants to check whether the average wait time is 3 minutes. The owner collects data on the
wait times for a random sample of 30 customers, and the wait times (in minutes) are as follows:
4.2, 3.8 ,3.5, 4.0 ,4.5, 3.3, 4.2, 4.1, 3.9, 4.3, 3.7, 4.0, 4.2, 3.8, 3.6, 4.1, 3.9, 4.4, 4.0, 4.2, 3.7, 3.9, 4.0, 4.1, 4.3, 3.6, 3.9,
4.0, 4.2, 4.2, 3.8, 3.5, 4.0, 4.5, 3.3, 4.2, 4.1, 3.9, 4.3, 3.7, 4.0, 4.2, 3.8, 3.6, 4.1, 3.9, 4.4, 4.0, 4.2, 3.7, 3.9, 4.0, 4.1,
4.3, 3.6, 3.9, 4.0, 4.2
Now, let’s perform the hypothesis test based on this data:
Hypotheses:
Null Hypothesis (H0): The average wait time is 3 minutes (μ = 3).
Alternative Hypothesis (Ha): The average wait time is not 3 minutes (μ ≠ 3).
(We’re using a two-tailed test because we want to test if the average wait time is different from 3 minutes.)
Steps:
Collect Data: We have the wait times for the 30 customers in the sample.
Set Significance Level: Let’s choose a significance level (α) of 0.05.
Calculate Sample Mean: The sample mean (x̄) is calculated as:
x̄ = (4.2 + 3.8 + 3.5 + … + 4.2) / 30 ≈ 4.0 minutes
Calculate Standard Error: The standard error is the sample standard deviation divided by the square root of the
sample size (SE = s / √n). For this walkthrough, let’s assume a standard error of 0.2 minutes.
Calculate Test Statistic: Using the formula t = (x̄ − μ0) / SE, where μ0 is the hypothesized mean:
t = (4.0 − 3) / 0.2 = 5
Find Critical Values: For a two-tailed test with α = 0.05 and 29 degrees of freedom (df = n – 1 = 30 – 1 = 29),
the critical values are approximately ±2.045.
Calculate P-Value: The p-value is the probability of observing a t-statistic at least as extreme as 5 in a
t-distribution with 29 degrees of freedom, assuming the null hypothesis is true. Here the p-value is extremely
small, indicating strong evidence against the null hypothesis.
Make a Decision: Since the p-value (very small) is less than α (0.05), we reject the null hypothesis. This
suggests that there is enough evidence to conclude that the average wait time for customers to receive their
coffee is not 3 minutes.
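The arithmetic in these steps can be reproduced in a few lines. The sketch below is a minimal illustration that assumes SciPy is installed and takes the summary values from the walkthrough (sample mean 4.0, hypothesized mean 3, standard error 0.2, df = 29) as given; the variable names are illustrative.

```python
from scipy import stats

# Summary values taken from the walkthrough above
sample_mean = 4.0      # x̄, in minutes
mu_0 = 3.0             # hypothesized mean under H0
standard_error = 0.2   # assumed in the text
df = 29                # n - 1 for n = 30
alpha = 0.05

# Test statistic: how many standard errors x̄ lies from mu_0
t_stat = (sample_mean - mu_0) / standard_error

# Two-tailed critical value and p-value from the t-distribution
t_critical = stats.t.ppf(1 - alpha / 2, df)
p_value = 2 * stats.t.sf(abs(t_stat), df)

reject_null = abs(t_stat) > t_critical
print(f"t = {t_stat:.2f}, critical = ±{t_critical:.3f}, p = {p_value:.2e}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

Comparing |t| with the critical value and comparing the p-value with α are two views of the same decision rule, so they always agree.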
Significance Level (α)
The significance level, denoted as α (alpha), is a predetermined threshold used in hypothesis testing to
determine the level of evidence required to reject the null hypothesis. It represents the maximum acceptable
probability of making a Type I error, which is the error of rejecting a true null hypothesis.
Commonly used significance levels include α = 0.05 (5%), α = 0.01 (1%), and α = 0.10 (10%).
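The meaning of α as a Type I error rate can be checked by simulation. The sketch below is illustrative only: it assumes NumPy and SciPy are available, draws many samples from a population where the null hypothesis is true by construction, and counts how often a one-sample t-test rejects at α = 0.05. The population (standard normal), the seed, and the number of experiments are arbitrary choices, not values from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
alpha = 0.05
n_experiments = 2000
sample_size = 30

# Each row is one experiment drawn from a population whose true mean is 0,
# so the null hypothesis (mu = 0) is true in every experiment.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, sample_size))

# One-sample t-test per row against the true mean
result = stats.ttest_1samp(samples, popmean=0.0, axis=1)

# Fraction of experiments that (wrongly) reject the true null hypothesis
type_i_error_rate = np.mean(result.pvalue < alpha)
print(f"Observed Type I error rate: {type_i_error_rate:.3f}")
```

The observed rate should land close to α, which is what "maximum acceptable probability of a Type I error" means in practice.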
P-Value
The p-value is a statistical measure that quantifies the strength of evidence against the null hypothesis. It represents
the probability of observing sample data as extreme as, or more extreme than, the data observed, assuming the null
hypothesis is true. A smaller p-value indicates stronger evidence against the null hypothesis.
Definition: A p-value measures the strength of evidence against the null hypothesis in hypothesis testing.
Interpretation: Smaller p-values indicate stronger evidence against the null hypothesis.
Decision: If p ≤ α (chosen significance level, e.g., 0.05), you reject the null hypothesis; if p > α, you fail to reject it.
Example
Setting Alpha at 0.05 and Interpreting a P-Value of 0.03:
Suppose you conduct a hypothesis test with a significance level (α) of 0.05. After performing the test, you obtain a p-
value of 0.03.
Significance Level (α):
You chose a significance level of α = 0.05.
This means accepting a 5% chance of a Type I error (rejecting a true null hypothesis).
P-Value Interpretation:
The test produced a p-value of 0.03, which is less than α (0.05).
If the null hypothesis were true, there would be a 3% probability of observing data at least as extreme as yours.
Decision:
Because the p-value is less than α, reject the null hypothesis in favor of the alternative hypothesis.
The result is statistically significant at the 5% level.
Conclusion:
There is a statistical basis to reject the null hypothesis.
The observed data would be unlikely if the null hypothesis were true. Note that the p-value is not the
probability that the null hypothesis itself is true.
In this example, by setting α at 0.05 and obtaining a p-value of 0.03, you have a statistical basis to reject the
null hypothesis at the 5% level and draw a conclusion from the evidence in the data.
_________________________________________________________________________
T-Tests
A t-test is a statistical test used to determine if there is a significant difference between the means of two groups or
to test if the mean of a single sample is significantly different from a known or hypothesized population mean.
One Sample T-Test
Explanation: A one-sample t-test is used to determine if the mean of a single sample is significantly different
from a known or hypothesized population mean.
Assumptions: It assumes that the sample data is approximately normally distributed and that the
observations are independent.
When to Use: Use a one-sample t-test when you have one sample and want to test if its mean is different
from a specified value (population mean).
Hypothesis Testing – One Sample T Test
Example
Suppose a coffee shop owner wants to check whether the average wait time is 3 minutes. The owner collects data on the
wait times for a random sample of 30 customers, and the wait times (in minutes) are as follows:
4.2, 3.8 ,3.5, 4.0 ,4.5, 3.3, 4.2, 4.1, 3.9, 4.3, 3.7, 4.0, 4.2, 3.8, 3.6, 4.1, 3.9, 4.4, 4.0, 4.2, 3.7, 3.9, 4.0, 4.1, 4.3, 3.6, 3.9,
4.0, 4.2, 4.2, 3.8, 3.5, 4.0, 4.5, 3.3, 4.2, 4.1, 3.9, 4.3, 3.7, 4.0, 4.2, 3.8, 3.6, 4.1, 3.9, 4.4, 4.0, 4.2, 3.7, 3.9, 4.0, 4.1,
4.3, 3.6, 3.9, 4.0, 4.2
Now, let’s perform the hypothesis test based on this data:
Hypotheses
Null Hypothesis (H0): The average wait time is 3 minutes (μ = 3).
Alternative Hypothesis (Ha): The average wait time is not 3 minutes (μ ≠ 3). (We’re using a two-tailed test
because we want to test if the average wait time is different from 3 minutes.)
Steps of Hypothesis Testing
Collect Data: We have the wait times for the 30 customers in the sample.
Set Significance Level: Let’s choose a significance level (α) of 0.05.
Calculate Sample Mean & Standard Deviation:
Sample Mean (x̄) = (4.2 + 3.8 + 3.5 + … + 4.2) / 30 ≈ 3.979 minutes
Sample Standard Deviation (s) ≈ 0.276 minutes
Calculate Test Statistic: Using the formula t = (x̄ − μ0) / (s / √n):
t = (3.979 − 3) / (0.276 / √30)
t ≈ 19.43
Find Critical Values: For a two-tailed test with α = 0.05 and 29 degrees of freedom (df = n – 1 = 30 – 1 = 29),
the critical values are approximately ± 2.045.
Calculate P-Value: The p-value is the probability of observing a t-statistic at least as extreme as the calculated
one if the null hypothesis were true. Because the calculated t-statistic (19.43) is far beyond the critical
values, the p-value is extremely small (p < 0.0001).
Make a Decision: Since the p-value is less than α (0.05), we reject the null hypothesis. This suggests that there
is strong statistical evidence to conclude that the average wait time for customers is significantly different
from 3 minutes.
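The test statistic above can be recomputed directly from the summary statistics. This is a minimal sketch that assumes SciPy is installed and uses the rounded values quoted in the example (mean 3.979, standard deviation 0.276, n = 30); the variable names are illustrative.

```python
import math
from scipy import stats

# Summary statistics quoted in the example above
sample_mean = 3.979   # minutes
sample_std = 0.276    # minutes
n = 30
mu_0 = 3.0            # hypothesized mean under H0

standard_error = sample_std / math.sqrt(n)
t_stat = (sample_mean - mu_0) / standard_error
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)   # two-tailed

print(f"SE = {standard_error:.4f}, t = {t_stat:.2f}, p = {p_value:.1e}")
```

When the raw observations are available as a list, `scipy.stats.ttest_1samp(values, popmean=3)` performs the same test in one call.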
Two-Sample T-Test
Explanation: A two-sample t-test is used to compare the means of two independent samples to determine if
they are significantly different from each other.
Assumptions: It assumes that the data in both samples are approximately normally distributed, and the
observations in each sample are independent.
When to Use: Use a two-sample t-test when you have two separate groups or samples, and you want to test
if their means are significantly different from each other.
Hypothesis Testing – Two Sample T Test
Example
Imagine you are a teacher, and you want to determine if two different teaching methods, Method A and Method B,
have a significant impact on students’ test scores. You have two groups of students, one taught using Method A and
the other using Method B, and you want to compare their test scores to see if there’s a significant difference.
Hypotheses:
Null Hypothesis (H0): There is no significant difference in the mean test scores between the two teaching
methods (μA = μB).
Alternative Hypothesis (Ha): There is a significant difference in the mean test scores between the two
teaching methods (μA ≠ μB).
Steps of the Two-Sample T-Test
Collect Data: Collect test scores from two groups of students. Let’s assume:
Group A (Method A): [85, 88, 92, 78, 90]
Group B (Method B): [91, 89, 82, 87, 88]
Set Significance Level (α): Choose a significance level, such as α = 0.05, to determine the threshold for
statistical significance.
Calculate Sample Means: Calculate the sample means for both groups:
Sample Mean of Group A (x̄ A) = (85 + 88 + 92 + 78 + 90) / 5 = 86.6
Sample Mean of Group B (x̄ B) = (91 + 89 + 82 + 87 + 88) / 5 = 87.4
Calculate Standard Deviations: Calculate the sample standard deviations for both groups, which measure the
variability within each group.
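Standard Deviation for Group A: ≈ 5.459
Standard Deviation for Group B: ≈ 3.362
Pooled Standard Deviation (sp): ≈ 4.533
Calculate Test Statistic: Using the pooled standard deviation:
t = (x̄A − x̄B) / (sp × √(1/nA + 1/nB)) = (86.6 − 87.4) / (4.533 × √(2/5)) ≈ −0.279
Determine Degrees of Freedom (df)
df = n1 + n2 – 2 = 5 + 5 – 2
df = 8
Find Critical Values and P-Value: For a two-tailed test with α = 0.05 and df = 8, the critical values are
approximately ±2.306. The two-tailed p-value for t ≈ −0.279 is approximately 0.79.
Make a Decision: Compare the p-value to α. If p ≤ α, reject the null hypothesis, indicating a significant
difference in test scores between the two teaching methods. If p > α, fail to reject the null hypothesis.
Here, |t-statistic| ≈ 0.279 is smaller than the critical value 2.306 (equivalently, p ≈ 0.79 > 0.05), so you fail to
reject the null hypothesis. The data do not show a significant difference between the two teaching methods.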
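These numbers can be checked with SciPy. The sketch below assumes SciPy is installed and runs a pooled-variance (Student's) two-sample t-test on the two groups listed in the example; `equal_var=True` is chosen to match the pooled standard deviation and df = n1 + n2 − 2 used in the steps.

```python
from scipy import stats

group_a = [85, 88, 92, 78, 90]   # Method A
group_b = [91, 89, 82, 87, 88]   # Method B
alpha = 0.05

# Student's t-test with pooled variance, matching df = n1 + n2 - 2
result = stats.ttest_ind(group_a, group_b, equal_var=True)
t_stat, p_value = result.statistic, result.pvalue

df = len(group_a) + len(group_b) - 2
t_critical = stats.t.ppf(1 - alpha / 2, df)

print(f"t = {t_stat:.3f}, p = {p_value:.3f}, critical = ±{t_critical:.3f}")
print("Reject H0" if p_value <= alpha else "Fail to reject H0")
```

If the two groups cannot be assumed to share a variance, passing `equal_var=False` switches to Welch's t-test, which adjusts the degrees of freedom.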
Visualization Plots for Data Exploration
1. Histogram:
Purpose:
Illustrates the distribution of a single numerical variable.
Usage:
Identifies patterns, central tendency, and spread.
Helps detect skewness, outliers, and potential data issues.
Implementation:
The x-axis represents the variable values, and the y-axis represents the frequency of occurrences.
Example Code:
import matplotlib.pyplot as plt
# `data` is assumed to be a 1-D sequence of numeric values
plt.hist(data, bins=10, color='skyblue', edgecolor='black')
plt.title('Histogram of Data')
plt.xlabel('Variable')
plt.ylabel('Frequency')
plt.show()
2. Box Plot:
Purpose:
Displays the summary of the distribution, including median, quartiles, and potential outliers.
Usage:
Facilitates comparisons of distributions between different groups.
Implementation:
Utilizes a rectangular box to represent the interquartile range (IQR) and “whiskers” to show
variability.
Example Code:
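import seaborn as sns
import matplotlib.pyplot as plt
# `df` is assumed to be a pandas DataFrame with 'Group' and 'Variable' columns
sns.boxplot(x='Group', y='Variable', data=df)
plt.title('Box Plot of Variable by Group')
plt.show()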
3. Scatter Plot:
Purpose:
Reveals the relationship between two numerical variables.
Usage:
Identifies patterns, trends, and correlations between variables.
Implementation:
Each point represents a data observation with x and y coordinates.
Example Code:
import matplotlib.pyplot as plt
plt.scatter(df['X'], df['Y'], color='green', alpha=0.7)
plt.title('Scatter Plot of X vs Y')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
4. Pair Plot:
Purpose:
Displays scatter plots for multiple pairs of variables in a dataset.
Usage:
Identifies relationships and distributions across multiple variables simultaneously.
Implementation:
Diagonal shows the distribution of each variable, and off-diagonal plots show scatter plots.
Example Code:
import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(df, hue='Category', diag_kind='kde')
plt.suptitle('Pair Plot of Variables', y=1.02)  # figure-level title, placed above the grid
plt.show()
5. Heatmap:
Purpose:
Visualizes the correlation matrix between numerical variables.
Usage:
Identifies relationships and multicollinearity between variables.
Implementation:
Cells are colored based on the strength and direction of correlation.
Example Code:
import seaborn as sns
import matplotlib.pyplot as plt
correlation_matrix = df.corr(numeric_only=True)  # numeric columns only (pandas 1.5+)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
6. Violin Plot:
Purpose:
Combines aspects of box plots and kernel density plots to show the distribution of a numerical
variable across different categories.
Usage:
Provides a compact way to compare distributions.
Implementation:
Consists of a series of vertical violin-shaped plots.
Example Code:
import seaborn as sns
import matplotlib.pyplot as plt
sns.violinplot(x='Category', y='Variable', data=df, inner='quartile')
plt.title('Violin Plot of Variable by Category')
plt.show()
7. Bar Chart:
Purpose:
Displays the frequency or count of categorical variables.
Usage:
Provides a visual representation of category counts.
Implementation:
Bars represent the count or frequency of each category.
Example Code:
import matplotlib.pyplot as plt
df['Category'].value_counts().plot(kind='bar', color='orange')
plt.title('Bar Chart of Category Counts')
plt.xlabel('Category')
plt.ylabel('Count')
plt.show()
Interpretation of Visualization
Guidelines: Understand plot-specific conventions.
Patterns: Identify trends, relationships, or groupings.
Outliers: Spot data points deviating from the pattern.
Example: Scatter Plot Analysis:
Pattern: Look for upward or downward trends.
Outliers: Identify exceptional data points.
Interpreting visualizations helps extract insights and make informed decisions.
Scatter Plot Analysis
Example:
Scatter Plot Analysis – Relationship between Study Hours and Exam Scores
Data: Suppose you have collected data on the study hours and corresponding exam scores for a group of students:
Interpretation:
Pattern: The scatter plot shows a clear upward trend, indicating a positive correlation between study hours
and exam scores. More study hours are generally associated with better exam performance.
Outliers: No significant outliers are observed; all data points align with the trend.
This interpretation highlights the positive relationship between study hours and exam scores. A scatter plot shows
association only, so it cannot by itself establish that extra study time causes the higher scores.
Correlation and Regression
Correlation – Measure of Association:
Correlation is a statistical measure that quantifies the degree and direction of a relationship between two variables. It
assesses how changes in one variable are associated with changes in another.
Linear Regression – Modeling Relationships:
Linear regression is a statistical technique used to model the relationship between a dependent variable (response)
and one or more independent variables (predictors). It assumes a linear relationship and helps predict the dependent
variable based on the independent variables.
Example: Calculating and Interpreting Correlation Coefficients:
Suppose you are analyzing the relationship between study hours and exam scores for a group of students. You want
to calculate and interpret the correlation coefficient between these two variables:
Data: You have data on study hours (X) and corresponding exam scores (Y) for several students.
Calculate Correlation: You calculate the correlation coefficient, which can be Pearson’s correlation coefficient
(r), Spearman’s rank correlation coefficient, or others, depending on the data characteristics.
Interpretation:
If r is close to 1: Strong positive correlation; as study hours increase, exam scores tend to increase
significantly.
If r is close to -1: Strong negative correlation; as study hours increase, exam scores tend to decrease
significantly.
If r is close to 0: Weak or no linear correlation; study hours and exam scores are not strongly associated.
Correlation helps quantify the strength and direction of the relationship between study hours and exam scores.
Linear regression, on the other hand, can further model and predict exam scores based on study hours and may
provide insights into the extent of predictability.
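As a rough illustration of both ideas, the sketch below computes Pearson's r and fits a simple linear regression with SciPy. The study-hours and exam-score values are hypothetical (the text does not list any data), and the variable names are illustrative.

```python
from scipy import stats

# Hypothetical data for eight students (not from the text)
study_hours = [1, 2, 3, 4, 5, 6, 7, 8]
exam_scores = [52, 55, 61, 64, 70, 74, 79, 85]

# Pearson's r measures the strength and direction of the linear association
r, r_p_value = stats.pearsonr(study_hours, exam_scores)

# Simple linear regression: exam_score ≈ intercept + slope * study_hours
fit = stats.linregress(study_hours, exam_scores)

predicted_for_5_5_hours = fit.intercept + fit.slope * 5.5
print(f"r = {r:.3f}")
print(f"score ≈ {fit.intercept:.1f} + {fit.slope:.2f} × hours")
print(f"Predicted score for 5.5 study hours: {predicted_for_5_5_hours:.1f}")
```

For ranked or clearly non-linear data, `scipy.stats.spearmanr` can be swapped in for `pearsonr` with the same call pattern.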
__________________________________________________________________________
Confidence Interval
A confidence interval expresses the uncertainty in an estimate as a range of plausible values. For example, suppose
you construct a confidence interval with a 95% confidence level. If you repeated the sampling many times and built an
interval each time, about 95 out of 100 of those intervals would contain the true population parameter.
Your desired confidence level is usually one minus the alpha (α) value you used in your statistical test:
Confidence level = 1 − α
So, if you use an alpha value of 0.05 for statistical significance, then your confidence level would be 1 − 0.05 =
0.95, or 95%.
Calculation and Interpretation:
Calculation: Confidence intervals are typically calculated using a formula that incorporates the sample
statistic, the standard error (a measure of variability), and a critical value from a statistical distribution (e.g.,
the normal distribution or t-distribution). The formula is:
CI = Sample Statistic ± Margin of Error
Interpretation: A confidence interval consists of two parts:
The point estimate (sample statistic), which is the best guess for the population parameter.
The margin of error, which represents the range around the point estimate. It reflects the estimate’s
precision and is determined by the chosen confidence level.
For a z statistic, some of the most common values are shown in this table:
Confidence level | 90% | 95% | 99%
alpha for one-tailed CI | 0.1 | 0.05 | 0.01
alpha per tail for two-tailed CI | 0.05 | 0.025 | 0.005
z statistic (two-tailed CI) | 1.645 | 1.96 | 2.576
Example: Determining the Confidence Interval for the Mean Height of a Population
Scenario: Imagine you want to estimate the average height of all adults in a country.
Data: You collect a sample of heights from 100 individuals and calculate the sample mean height to be 170
cm with a standard deviation of 5 cm.
Goal: Determine a 95% confidence interval for the mean height.
Step 1: Calculation of Margin of Error
To calculate the margin of error for the confidence interval, you need two components:
Critical Value: For a 95% confidence level, the z critical value is 1.96. Strictly, the t-distribution applies
because the standard deviation is estimated from the sample, but with n = 100 the t critical value (about 1.98)
is very close to 1.96, so this example uses 1.96. The critical value represents how many standard errors the
margin of error should cover to achieve a 95% confidence level.
Standard Error: The standard error measures the variability of the sample mean. It’s calculated as the
standard deviation divided by the square root of the sample size: SE = s / √n = 5 / √100 = 0.5 cm.
Margin of Error Calculation:
Margin of Error = Critical Value × Standard Error = 1.96 × 0.5 = 0.98 cm
Step 2: Constructing the Confidence Interval
Now that you have the margin of error, you can construct the confidence interval. The confidence interval formula
is as follows:
CI = Sample Mean ± Margin of Error
Substituting the values:
CI = 170 cm ± 0.98 cm = (169.02 cm, 170.98 cm)
Interpretation: This means you are 95% confident that the true average height of all adults in the country falls within
the range of 169.02 cm to 170.98 cm.
Explanation: By calculating a confidence interval, you provide a range of heights within which you believe the true
population average height is likely to be. In this case, with 95% confidence, you estimate that the average height of all
adults in the country lies between 169.02 cm and 170.98 cm. The margin of error (0.98 cm) accounts for the
uncertainty associated with estimating the population mean from a sample.
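The same interval can be computed programmatically. This is a minimal sketch that assumes SciPy is installed and uses the summary values from the example (mean 170 cm, standard deviation 5 cm, n = 100) with a normal critical value; the variable names are illustrative.

```python
import math
from scipy import stats

sample_mean = 170.0   # cm
sample_std = 5.0      # cm
n = 100
confidence_level = 0.95

# Two-sided critical value from the standard normal distribution (about 1.96)
z_critical = stats.norm.ppf(1 - (1 - confidence_level) / 2)

standard_error = sample_std / math.sqrt(n)
margin_of_error = z_critical * standard_error

lower = sample_mean - margin_of_error
upper = sample_mean + margin_of_error
print(f"95% CI: ({lower:.2f} cm, {upper:.2f} cm)")
```

For small samples, replacing `stats.norm.ppf(...)` with `stats.t.ppf(..., df=n - 1)` gives the slightly wider t-based interval.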
Hypothesis Testing with Z-Test
Introduction to the Z-Test for Large Sample Sizes:
Z-Test: The Z-test is a statistical hypothesis test used to assess whether a sample mean is significantly different from a
known population mean when the sample size is sufficiently large. It relies on the standard normal distribution and is
suitable when the population standard deviation is known.
Comparison with the t-Test:
Comparison: The Z-test and t-test are both used for hypothesis testing, but they differ in their assumptions about the
population standard deviation and sample size:
Z-Test: Assumes a known population standard deviation and is appropriate for large sample sizes (typically n
> 30).
t-Test: Assumes an unknown population standard deviation and is suitable for smaller sample sizes.
Aspect | T-Test | Z-Test
Purpose | Compare a sample mean to a hypothesized mean, or the means of two groups (small samples) | Compare a sample mean to a known population mean (large samples)
Sample Size | Small sample sizes (typically < 30) | Large sample sizes (typically ≥ 30)
Population SD | Unknown and estimated from the sample | Known or estimated from a large sample
Distribution | Follows the t-distribution | Follows the standard normal (z) distribution
Formula | t = (x̄ − μ) / (s / √n) | z = (x̄ − μ) / (σ / √n)
Critical Values | Use t-distribution tables | Use standard normal (z) distribution tables
Common Use Case | Comparing means when SD is unknown | Comparing a sample mean to a known population mean
Example | Compare test scores of two groups | Compare the average height of a sample to a known population mean
Steps in Z-test
Step 1: Formulate Hypotheses
Step 2: Set Significance Level
Step 3: Calculate the Sample Mean and Standard Deviation
Step 4: Set Up the Z-Test Statistic
Step 5: Find Critical Value or P-Value
Step 6: Make a Decision
Example
Using a Z-Test to Assess Average Weight Loss:
Imagine you’re evaluating the effectiveness of a weight loss program. You collect data on the weight loss of 100
participants who completed the program and want to test if, on average, participants lost a significant amount of
weight.
Here’s a sample dataset generated randomly:
import random
# Generate sample weight loss data (in pounds)
random.seed(42) # For reproducibility
weight_loss_data = [random.uniform(0.5, 10) for _ in range(100)]
1. Formulate Hypotheses
Null Hypothesis (H0): The average weight loss in the program is equal to or less than zero (μ ≤ 0).
Alternative Hypothesis (Ha): The average weight loss in the program is greater than zero (μ > 0).
2. Calculate the Test Statistic (Z-Score)
Sample Mean (X̄ ): Calculate the sample mean of weight loss data.
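Population Standard Deviation (σ): Not known in this example, so the sample standard deviation (s) is used as
an estimate, which is reasonable for n = 100.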
Sample Size (n): 100 participants.
import numpy as np
# Calculate sample mean and sample standard deviation
sample_mean = np.mean(weight_loss_data)
sample_std = np.std(weight_loss_data, ddof=1) # Use Bessel's correction for sample std
# Calculate Z-score
population_mean = 0 # Hypothesized population mean
z_score = (sample_mean - population_mean) / (sample_std / np.sqrt(len(weight_loss_data)))
sample_mean, sample_std, z_score
Sample Mean (X̄): Approximately 5.44 pounds
Sample Standard Deviation (s): Approximately 2.51 pounds
Z-Score (Z): Approximately 21.67, since 5.44 / (2.51 / √100) ≈ 21.67
3. Determine the Critical Value
Significance Level (α): Typically, 0.05 (corresponding to 95% confidence level).
Degrees of Freedom: Not needed for a Z-test; the critical value comes from the standard normal distribution.
from scipy.stats import norm
alpha = 0.05
# Calculate the critical Z-value for a one-tailed (right-tailed) test
critical_value = norm.ppf(1 - alpha)
critical_value
Critical Value (Z_critical): Approximately 1.645 (for α = 0.05, one-tailed test)
4. Make a Decision
Compare the calculated Z-score with the critical value.
If Z-Score > Critical Value, reject the null hypothesis (H0) in favor of the alternative (Ha).
If Z-Score ≤ Critical Value, fail to reject the null hypothesis (H0).
# Make a decision
reject_null = z_score > critical_value
reject_null
Decision: Reject the null hypothesis (H0).
5. Interpret the Result
Interpret the decision in the context of the weight loss program.
Decision: Reject H0.
Interpretation: There is sufficient statistical evidence to conclude that, on average, participants in the weight
loss program experienced significant weight loss.
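The decision can also be reached through the p-value instead of the critical value. The helper below is a hypothetical sketch (the function name and signature are not from the text) that assumes SciPy is installed and runs a right-tailed z-test from summary statistics; it is called with the sample mean (5.44), sample standard deviation (2.51), and sample size (100) reported in the example.

```python
import math
from scipy.stats import norm

def one_sided_z_test(sample_mean, std_dev, n, mu_0=0.0, alpha=0.05):
    """Right-tailed z-test (Ha: mu > mu_0) from summary statistics."""
    z = (sample_mean - mu_0) / (std_dev / math.sqrt(n))
    p_value = norm.sf(z)          # P(Z >= z) under H0
    return z, p_value, p_value < alpha

# Summary values reported in the example: mean 5.44 lb, SD 2.51 lb, n = 100
z, p_value, reject_null = one_sided_z_test(5.44, 2.51, 100)
print(f"z = {z:.2f}, p = {p_value:.3g}, reject H0: {reject_null}")
```

Rejecting when p < α is equivalent to rejecting when the z-score exceeds the one-tailed critical value, so both routes lead to the same decision.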
Chi-Square Test for Categorical Data
Chi-Square Test for Independence:
Chi-Square Test: The chi-square test is a statistical test used to assess the independence (or association)
between two categorical variables. It helps determine whether there is a significant relationship between the
variables or if they are independent.
Independence: In the context of the chi-square test, independence means that the two categorical variables
are not related, and changes in one variable do not affect the distribution of the other variable.
Testing the Association Between Gender and Voting Preferences
Data: You have surveyed 200 voters in a local election to understand the association between their gender and voting
preferences. Here’s the dataset:
Using Statistics (Mathematically)
Using Python
import pandas as pd
import random
from scipy.stats import chi2_contingency
# Generate random data for 200 voters
random.seed(42) # For reproducibility
# Generate random gender data
gender = random.choices(['Male', 'Female'], k=200)
# Generate random voting preferences data
voting_preferences = random.choices(['Candidate A', 'Candidate B', 'Undecided'], k=200)
# Create a dataframe from the data
data = {
    'Gender': gender,
    'Voting_Preferences': voting_preferences
}
df = pd.DataFrame(data)
1. Hypotheses
You want to test whether there is a significant association between the gender of voters and their voting preferences.
Null Hypothesis (H0): Gender and voting preferences are independent.
Alternative Hypothesis (Ha): Gender and voting preferences are not independent; there is an association.
2. Create a Contingency Table
Create a contingency table that cross-tabulates the two categorical variables (gender and voting preferences):
contingency_table = pd.crosstab(df['Gender'], df['Voting_Preferences'])
contingency_table
3. Calculate the Chi-Square Statistic
Calculate the chi-square statistic, which measures the discrepancy between observed and expected frequencies.
from scipy.stats import chi2_contingency
# Calculate chi-square statistic, p-value, degrees of freedom, and expected frequencies
chi2, p, dof, expected = chi2_contingency(contingency_table)
chi2, p, dof, expected
Chi-Square Statistic (χ²): Approximately 0.1463
P-Value (p): Calculated based on data
Degrees of Freedom (df): 2
Expected Frequencies: Calculated values based on the data
4. Determine the Critical Value
Specify the desired level of significance (α) to determine the critical chi-square value. Let’s assume α = 0.05.
from scipy.stats import chi2 as chi2_dist  # alias so the chi2 statistic computed above is not overwritten
alpha = 0.05
critical_value = chi2_dist.ppf(1 - alpha, dof)
critical_value
Critical Value (χ²_critical): Approximately 5.991 (for α = 0.05 and df = 2)
5. Make a Decision
Compare the calculated chi-square statistic with the critical value.
If χ² > χ²_critical, reject the null hypothesis.
If χ² ≤ χ²_critical, fail to reject the null hypothesis.
# Make a decision
reject_null = chi2 > critical_value
reject_null
Decision: Fail to reject the null hypothesis (H0).
There is insufficient evidence to conclude that there is a significant association between the gender of voters and
their voting preferences in this local election. The data are consistent with gender and voting preferences being
independent.
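The chi-square statistic can also be worked out by hand from the contingency table. The sketch below is illustrative: the observed counts are hypothetical (they are not the counts produced by the random data above), and NumPy and SciPy are assumed to be installed. It builds the expected frequencies under independence, sums (observed − expected)² / expected, and cross-checks the result against `chi2_contingency`.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical observed counts (rows: Female, Male; columns: A, B, Undecided)
observed = np.array([[30, 40, 30],
                     [35, 35, 30]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected count under independence: (row total × column total) / grand total
expected = row_totals * col_totals / grand_total

chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(chi2_stat, dof)

# Cross-check against SciPy's implementation
scipy_stat, scipy_p, scipy_dof, _ = chi2_contingency(observed)
print(f"manual: chi2 = {chi2_stat:.4f}, dof = {dof}, p = {p_value:.4f}")
print(f"scipy:  chi2 = {scipy_stat:.4f}, dof = {scipy_dof}, p = {scipy_p:.4f}")
```

The degrees of freedom follow (rows − 1) × (columns − 1), which gives 2 for a 2 × 3 table.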
One-Way and Two-Way ANOVA
Introduction to Analysis of Variance (ANOVA):
Analysis of Variance (ANOVA): ANOVA is a statistical technique used to compare means among multiple groups or
treatments. It assesses whether there are any statistically significant differences between the means of the groups.
One-Way and Two-Way ANOVA:
One-Way ANOVA: One-way ANOVA is used when there is one categorical independent variable (with two or
more levels or groups) and one continuous dependent variable. It tests whether there are any significant
differences between the means of the groups.
Two-Way ANOVA: Two-way ANOVA is used when there are two categorical independent variables (factors)
and one continuous dependent variable. It assesses the interaction effect between the two factors and
whether they have a significant influence on the dependent variable.
Example
Performing a One-Way ANOVA to Compare Test Scores Across Different Schools
Formulate Hypotheses
Null Hypothesis (H0): There are no significant differences in test scores among the three schools.
Alternative Hypothesis (Ha): There are significant differences in test scores among the three schools.
Using Statistics (Mathematically)
Python Code for Above Calculations
import numpy as np
from scipy.stats import f
# Given data
school_1 = np.array([85, 88, 90, 92, 86])
school_2 = np.array([78, 82, 80, 85, 88])
school_3 = np.array([92, 94, 89, 88, 90])
schools = [school_1, school_2, school_3]
# Step 1: Calculate Grand Mean
all_scores = np.concatenate(schools)
grand_mean = np.mean(all_scores)
# Step 2: Calculate SST (total sum of squares)
sst = np.sum((all_scores - grand_mean)**2)
# Step 3: Calculate SSB (between-group sum of squares)
ssb = np.sum([len(school) * (np.mean(school) - grand_mean)**2 for school in schools])
# Step 4: Calculate SSW (within-group sum of squares)
ssw = np.sum([np.sum((school - np.mean(school))**2) for school in schools])
# Step 5: Degrees of Freedom
df_between = len(schools) - 1
df_within = len(all_scores) - len(schools)
# Step 6: Calculate Mean Squares
msb = ssb / df_between
msw = ssw / df_within
# Step 7: Calculate F-statistic
f_statistic = msb / msw
# Step 8: Compare with Critical Value or P-value
p_value = f.sf(f_statistic, df_between, df_within)  # upper-tail probability, same as 1 - cdf
# Print results
print(f"F-statistic: {f_statistic}")
print(f"P-value: {p_value}")
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: \nThere are significant differences in test scores among the three schools.")
else:
    print("Fail to reject the null hypothesis: \nThere are no significant differences in test scores among the three schools.")
Using Python Libraries
Perform the One-Way ANOVA
We’ll use the scipy.stats library to perform the one-way ANOVA.
import pandas as pd
from scipy.stats import f_oneway
# Assemble the three schools' scores (same data as above) into a long-format DataFrame
df = pd.DataFrame({
    'School': ['School 1'] * 5 + ['School 2'] * 5 + ['School 3'] * 5,
    'Test_Score': [85, 88, 90, 92, 86, 78, 82, 80, 85, 88, 92, 94, 89, 88, 90],
})
# Group the data by school
groups = [df['Test_Score'][df['School'] == school] for school in df['School'].unique()]
# Perform one-way ANOVA
f_statistic, p_value = f_oneway(*groups)
f_statistic, p_value
F-Statistic: Approximately 8.48
P-Value: Approximately 0.005
Determine the Critical Value
Specify the desired level of significance (α) to determine the critical F-value. Let’s assume α = 0.05.
Make a Decision
Compare the calculated F-statistic with the critical F-value.
alpha = 0.05 # Significance level
critical_value = 3.8853 # Critical F-value for α = 0.05 and df = (2, 12)
# Make a decision
reject_null = f_statistic > critical_value
reject_null
Interpret the Result
Interpret the decision in the context of test scores among the schools.
Decision: Reject H0, because the F-statistic (about 8.48) exceeds the critical F-value (3.8853).
Interpretation: Based on the analysis, we reject the null hypothesis (H0). There are significant
differences in test scores among the three schools.
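The critical F-value of 3.8853 does not have to be looked up in a table. As a small sketch, assuming SciPy is installed, it can be derived from the F-distribution for this design (3 groups, 15 observations in total); the variable names are illustrative.

```python
from scipy.stats import f

alpha = 0.05
num_groups = 3
num_observations = 15          # 3 schools × 5 scores each

df_between = num_groups - 1                  # 2
df_within = num_observations - num_groups    # 12

# Upper-tail critical value: P(F > critical_value) = alpha under H0
critical_value = f.ppf(1 - alpha, df_between, df_within)
print(f"Critical F({df_between}, {df_within}) at alpha = {alpha}: {critical_value:.4f}")
```

Deriving the value this way keeps the decision step correct if the number of groups or observations changes.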
Complete Python Code for the above Process
import scipy.stats as stats
import numpy as np
# Data: Test scores for three schools
school_1 = np.array([85, 88, 90, 92, 86])
school_2 = np.array([78, 82, 80, 85, 88])
school_3 = np.array([92, 94, 89, 88, 90])
# Perform one-way ANOVA
statistic, p_value = stats.f_oneway(school_1, school_2, school_3)
# Print results
print(f"One-Way ANOVA Statistic: {statistic}")
print(f"P-value: {p_value}")
# Check significance at a 0.05 significance level
alpha = 0.05
if p_value < alpha:
    print("\nReject the null hypothesis: There are significant differences in test scores among the three schools.")
else:
    print("\nFail to reject the null hypothesis: There are no significant differences in test scores among the three schools.")
Testing
1. Question 1: You are conducting a one-sample t-test with the following information:
Population mean = 70
Sample mean = 75
Sample standard deviation = 10
Sample size = 30
What is the t-statistic?
2.7
3.0
2.0
1.5
2. Question 2: In a chi-square test of independence, you have a contingency table with 3 rows and 4 columns.
How many degrees of freedom are there?
3
4
7
9
3. Question 3: You perform a two-sample t-test to compare the means of two groups. The p-value obtained is
0.034. If you set a significance level (alpha) of 0.05, what is your decision?
Reject the null hypothesis.
Fail to reject the null hypothesis.
Accept the null hypothesis.
Perform a z-test.
4. Question 4: You are analyzing the correlation between two variables. The correlation coefficient (r) you
calculated is -0.85. What does this indicate about the relationship between the variables?
Strong positive correlation.
No correlation.
Strong negative correlation.
Weak positive correlation.
5. Question 5: You conduct a one-way ANOVA test with three groups. The calculated F-statistic is 4.12. What is
the critical F-value at a 5% significance level (alpha = 0.05)?
2.42
3.88
3.24
5.62
6. Question 6: In a hypothesis test, you calculate a z-statistic of -1.98. What is the corresponding p-value if you
are conducting a two-tailed test?
0.0244
0.4761
0.9522
0.9768
7. Question 7: You are constructing a 95% confidence interval for a population mean. If the sample size is 50
and the standard error is 4, what is the margin of error?
1.96
0.08
0.98
7.84
8. Question 8: In a hypothesis test, you calculate a t-statistic of -2.15 with 15 degrees of freedom. What is the
corresponding p-value for a two-tailed test?
0.0246
0.0492
0.1048
0.2096
9. Question 9: You are conducting a chi-square test of independence with 4 rows and 3 columns in the
contingency table. What is the total degree of freedom for this test?
7
9
12
15
10. Question 10: You perform a paired-sample t-test and obtain a t-statistic of 3.42. If you have 29 pairs of data,
what is the degrees of freedom for this test?
28
29
30
58
11. Question 11: You are conducting a hypothesis test to compare the means of two independent groups. The test statistic you
calculate is -2.87. If you set a significance level (alpha) of 0.05, what is your decision?
Reject the null hypothesis.
Fail to reject the null hypothesis.
Accept the alternative hypothesis.
Conduct a two-tailed test.
12. Question 12: In a one-sample t-test, you are testing whether the population mean is greater than 50. If the calculated
t-statistic is 1.96 and the sample size is 30, what is the p-value for a one-tailed test?
0.0274
0.0548
0.0456
0.1096