Statistical inference concerning
two populations
Inferences concerning the difference
between two means
Independent random samples
- Samples that are completely unrelated to one another
- The samples are clearly delineated
- The selection of one sample is in no way influenced by the selection of the other
- Example: battery lives of brand A versus brand B
- Example: difference between male and female salaries
Sample statistics
- Used to estimate population parameters
- X̄ is the point estimator for the population mean μ
- X̄ has to be normally distributed or approximately normally distributed (the sample data come from a normal population, or the sample size is large, n > 30)
Two means sample stats
Population variances
Confidence interval
- d_0 is the hypothesized difference between μ_1 and μ_2
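A minimal sketch of the confidence interval and test statistic for μ_1-μ_2, assuming the simplest case in which both population variances are known, so that z values apply. The function names and the battery-life numbers are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_mean_z_interval(xbar1, xbar2, var1, var2, n1, n2, conf_level=0.95):
    """Interval for mu_1 - mu_2, assuming both population variances are known."""
    point_estimate = xbar1 - xbar2
    std_error = sqrt(var1 / n1 + var2 / n2)
    # z value that leaves alpha/2 in the upper tail
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    margin = z * std_error
    return point_estimate - margin, point_estimate + margin


def two_mean_z_statistic(xbar1, xbar2, var1, var2, n1, n2, d0=0.0):
    """Test statistic for H0: mu_1 - mu_2 = d0 (known variances assumed)."""
    return ((xbar1 - xbar2) - d0) / sqrt(var1 / n1 + var2 / n2)


# Hypothetical battery lives (hours) for brand A and brand B
low, high = two_mean_z_interval(42.0, 40.0, 16.0, 25.0, 50, 50)
z_stat = two_mean_z_statistic(42.0, 40.0, 16.0, 25.0, 50, 50, d0=0.0)
print(round(low, 3), round(high, 3), round(z_stat, 3))
```

When the population variances are unknown, the same structure applies with sample variances and a t value in place of z.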
Inference concerning mean differences
- A crucial assumption for inferences about μ_1-μ_2 is that the samples are drawn independently.
- A common case of dependent sampling is matched pairs (a comparison between apples and apples)
- Before and after studies
- Pairing of observations
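For matched pairs, the analysis works on the differences within each pair. The sketch below uses the usual textbook statistic t_df = (d̄ − d_0)/(s_D/√n) with df = n − 1, which is not spelled out in these notes. The before/after weights are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(before, after, d0=0.0):
    """Mean difference, its standard deviation, t statistic, and df for matched pairs."""
    if len(before) != len(after):
        raise ValueError("matched pairs need samples of equal length")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    d_bar = mean(diffs)   # point estimate of the mean difference
    s_d = stdev(diffs)    # sample standard deviation of the differences
    t_stat = (d_bar - d0) / (s_d / sqrt(n))
    return d_bar, s_d, t_stat, n - 1


# Hypothetical before-and-after study (weights in pounds)
before = [200, 195, 210, 188, 205, 199]
after = [194, 190, 205, 186, 200, 196]
d_bar, s_d, t_stat, df = paired_t_statistic(before, after)
print(round(d_bar, 3), round(t_stat, 3), df)
```

The t statistic is then compared with a t table value for df = n − 1.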
Inference concerning the difference between two proportions
- As with inferences for μ_1-μ_2, we consider inferences for p_1-p_2 under independent sampling.
- The estimator for p_1−p_2 is P̄_1−P̄_2.
- If n_1 and n_2 are sufficiently large, the sampling distribution of P̄_1−P̄_2 can be approximated by the normal distribution. (The general guideline is that n_1p_1, n_1(1−p_1), n_2p_2, and n_2(1−p_2) must all be greater than or equal to 5.)
- For independent samples, the standard error is se(P̄_1−P̄_2) = √(p_1(1−p_1)/n_1 + p_2(1−p_2)/n_2).
- Since the population proportions are not known, use the sample proportions to estimate se(P̄_1−P̄_2) in the margin of error.
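A sketch of a confidence interval for p_1-p_2 under independent sampling, using the sample proportions in the standard error. The function name and the counts are hypothetical, and the "at least 5" guideline is checked with the sample counts because the population proportions are unknown.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_interval(x1, n1, x2, n2, conf_level=0.95):
    """Interval for p_1 - p_2 from success counts x1, x2 and sample sizes n1, n2."""
    # Guideline: successes and failures in both samples should each be >= 5
    if min(x1, n1 - x1, x2, n2 - x2) < 5:
        raise ValueError("normal approximation guideline not met")
    p1_bar, p2_bar = x1 / n1, x2 / n2
    std_error = sqrt(p1_bar * (1 - p1_bar) / n1 + p2_bar * (1 - p2_bar) / n2)
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    diff = p1_bar - p2_bar
    return diff - z * std_error, diff + z * std_error


# Hypothetical survey: 120 of 200 in group 1 and 90 of 200 in group 2 say yes
low, high = two_proportion_interval(120, 200, 90, 200)
print(round(low, 4), round(high, 4))
```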
Statistical inference concerning variance
Introduction
- We use the sample variance S^2 as an estimator of the population variance σ^2.
- To make inferences regarding σ^2, we first need the sampling distribution of S^2.
- This requires a new distribution.
Statistical inferences regarding σ^2 are based on the χ^2 or chi-square distribution
The χ^2 distribution is a family of distributions.
- Like t_df, χ^2_df depends on df.
- Probability distribution for the sum of several independent squared standard normal random variables
- df is the number of squared standard normal random variables in the summation
S^2 is based on the squared differences between the sample values and the sample mean.
- Unlike in past chapters, we use the notation χ^2_df to represent a random variable as well as its value
- We sometimes need values in the left tail of the distribution
- χ^2_df is not symmetric (unlike z or t_df), so values in the lower tail are not the negative of values in the upper tail.
- Let χ^2_{1−α,df} represent a value such that the area to the right is 1 − α and the area to the left is α.
- P(χ^2_df ≥ χ^2_{1−α,df}) = 1 − α
- P(χ^2_df < χ^2_{1−α,df}) = α
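The definition of χ^2_df as a sum of squared standard normal variables can be checked with a small simulation. This is only an illustrative sketch; the function name, seed, and number of draws are arbitrary choices.

```python
import random
from statistics import mean


def simulate_chi_square(df, draws, seed=0):
    """Draw chi-square values as sums of df squared standard normal variables."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df)) for _ in range(draws)]


samples = simulate_chi_square(df=5, draws=20000)
sample_mean = mean(samples)  # should be close to df
# More than half of the values fall below the mean: the distribution is right-skewed
share_below_mean = sum(v < sample_mean for v in samples) / len(samples)
print(round(sample_mean, 2), round(share_below_mean, 2))
```

The values are never negative, and the long right tail is why lower-tail and upper-tail critical values must be looked up separately.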
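Confidence interval for population variance
- Take a sample of size n from a normal population with finite variance.
- χ^2_df = (n−1)S^2/σ^2 has a χ^2_df distribution with df = n − 1.
- Since χ^2_df is not symmetric, the confidence interval does not follow the form of point estimate ± margin of error.
- Start with P(χ^2_{1−α/2,df} ≤ χ^2_df ≤ χ^2_{α/2,df}) = 1 − α.
- Substituting χ^2_df = (n−1)S^2/σ^2 and solving for σ^2 gives the interval [(n−1)S^2/χ^2_{α/2,df}, (n−1)S^2/χ^2_{1−α/2,df}].
- For the standard deviation σ, take the square roots of the interval limits.
Hypothesis testing
- The test statistic is χ^2_df = (n−1)S^2/σ_0^2, where σ_0^2 is the hypothesized value of the population variance.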
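Solving the probability statement above for σ^2 gives the limits (n−1)S^2/χ^2_{α/2,df} and (n−1)S^2/χ^2_{1−α/2,df}. The sketch below applies them to a hypothetical sample of 11 measurements. The two critical values are passed in as chi-square table look-ups for df = 10 and α = 0.05, because the Python standard library has no chi-square quantile function.

```python
from math import sqrt
from statistics import variance


def variance_interval(sample, chi2_lower, chi2_upper):
    """Confidence interval for sigma^2.

    chi2_lower is chi^2_{1-alpha/2, df} (left-tail table value),
    chi2_upper is chi^2_{alpha/2, df} (right-tail table value).
    """
    n = len(sample)
    s2 = variance(sample)  # sample variance S^2 with divisor n - 1
    return (n - 1) * s2 / chi2_upper, (n - 1) * s2 / chi2_lower


# Hypothetical measurements, n = 11 so df = 10
sample = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9, 12.4, 11.7, 12.0]
# Table values for df = 10: chi^2_{0.975,10} = 3.247 and chi^2_{0.025,10} = 20.483
var_low, var_high = variance_interval(sample, 3.247, 20.483)
sd_low, sd_high = sqrt(var_low), sqrt(var_high)  # interval for the standard deviation
print(round(var_low, 4), round(var_high, 4))
```

The interval is not centered on S^2, which reflects the asymmetry of the chi-square distribution.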
Inference concerning the ratio of two
population variances
- We compare two population variances, σ_1^2 and σ_2^2, through the ratio σ_1^2/σ_2^2.
- If σ_1^2 = σ_2^2, then σ_1^2/σ_2^2 = 1.
Sampling distribution
The sampling distribution of S_1^2/S_2^2 is the F distribution.
The F distribution is a family of distributions.
- Like t_df and χ^2_df, F depends on degrees of freedom, but it has two of them
- df_1 is the numerator degrees of freedom
- df_2 is the denominator degrees of freedom
- F is the probability distribution of the ratio of two independent chi-square variables, each divided by its degrees of freedom.
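A short sketch of the sample statistic behind this comparison: the ratio of two sample variances with its numerator and denominator degrees of freedom. The data and the function name are hypothetical.

```python
from statistics import variance


def f_statistic(sample1, sample2):
    """Ratio S_1^2 / S_2^2 with numerator and denominator degrees of freedom."""
    s1_sq = variance(sample1)
    s2_sq = variance(sample2)
    return s1_sq / s2_sq, len(sample1) - 1, len(sample2) - 1


# Hypothetical measurements from two independent samples
brand_a = [4.1, 3.8, 4.6, 4.0, 3.5, 4.4]
brand_b = [4.0, 4.1, 3.9, 4.2, 4.0, 3.8, 4.1]
f_stat, df1, df2 = f_statistic(brand_a, brand_b)
print(round(f_stat, 2), df1, df2)
```

A ratio far from 1 is evidence against equal population variances; how far is judged against F table values for (df_1, df_2).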
Chi-square tests
Goodness of fit test for multinomial
experiment
Bernoulli process
- Also known as binomial experiment
- There is a series of n independent and
identical trials of an experiment.
- Each trial has only two outcomes:
success and failure.
- The probability of success is denoted
as 𝑝 and the probability of failure is
denoted as 1 − 𝑝.
- Let p_1 and p_2 represent these probabilities, so that p_1 + p_2 = 1.
Goodness of fit
- A test about the probabilities/proportions of the multinomial experiment
Choices for competing hypotheses
- Set all the population proportions equal to the same specific value, so that they are equal to one another.
- Set each population proportion equal to a different predetermined (hypothesized) value.
Ex. Suppose you have four different candidates.
- Are the proportions of voters who favor each candidate not all the same?
- Or contest a specific hypothesized value for each proportion
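The four-candidate example can be sketched with the usual goodness-of-fit statistic χ^2_df = ∑(o_i − e_i)^2/e_i, where e_i = n·p_i and df = k − 1. The formula is the standard one and is not written out in these notes; the vote counts are hypothetical, and the critical value 7.815 is the chi-square table value for df = 3 at the 5% level.

```python
def goodness_of_fit_statistic(observed, hypothesized_props):
    """Chi-square goodness-of-fit statistic and its degrees of freedom."""
    if abs(sum(hypothesized_props) - 1.0) > 1e-9:
        raise ValueError("hypothesized proportions must sum to 1")
    n = sum(observed)
    expected = [n * p for p in hypothesized_props]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1


# Hypothetical poll of 300 voters across four candidates
observed = [90, 70, 60, 80]
# H0: all four proportions are equal (0.25 each)
chi2_stat, df = goodness_of_fit_statistic(observed, [0.25] * 4)
critical_value = 7.815  # table value chi^2_{0.05,3}
reject_equal_proportions = chi2_stat > critical_value
print(round(chi2_stat, 3), df, reject_equal_proportions)
```

Passing different hypothesized proportions, for example [0.4, 0.3, 0.2, 0.1], covers the second choice of competing hypotheses.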
Regression analysis
Sample covariance
- Measures the direction of the linear relationship between two variables x and y
- Negative: negative linear relationship
- Positive: positive linear relationship
- Zero: no linear relationship
- Further interpretation is difficult because it is sensitive to units.
- Cannot be used to determine the strength of the linear relationship
- The sample correlation coefficient r_xy is easier to interpret.
Sample correlation coefficient r_xy (R)
- Describes both the direction and strength of the linear relationship between x and y
- The correlation is unit-free
- Negative: negative linear relationship
- Positive: positive linear relationship
- Zero: no linear relationship
- The correlation is between -1 and 1
- Correlation is -1: perfect negative linear relationship
- Correlation is 0: not linearly related
- Correlation is 1: perfect positive linear relationship
Hypothesis testing
Hypothesis test for the correlation coefficient
- To determine whether the linear relationship implied by the sample correlation coefficient is real or due to chance
- Let ρ_xy denote the population correlation coefficient.
Test statistic
- t_df = r_xy √(n−2) / √(1−r_xy^2), with df = n − 2
- The correlation coefficient captures only a linear relationship.
- The correlation coefficient may not be a reliable measure in the presence of outliers.
- Correlation does not imply causation.
- Even if two variables are highly correlated, one does not necessarily cause the other.
- The correlation cannot be used for prediction
- To predict values, we need a model.
The linear regression model
- Regression analysis is one of the most widely used methodologies in business.
- One variable, called the response variable, is influenced by other variables, called the explanatory variables.
- The input or predictor variables are denoted as x_1, x_2, ⋯, x_k
- Use information on the explanatory variables to predict and/or describe changes in the response variable.
- Allows us to make predictions regarding the response variable.
- Just like correlation, regression has limitations.
- A regression model appears to search for causality when it basically detects correlation.
- We cannot use standard regression analysis to establish cause-and-effect relationships.
- Causality can only be established through randomized experiments and/or advanced statistical models, which are outside the scope of this text.
- No matter the response, we cannot expect to predict its exact value.
- If the value of the response variable is uniquely determined by the values of the explanatory variables, we say that the relationship is deterministic.
- In most fields, we find that the relationship between the explanatory variables and the response is stochastic due to the omission of relevant factors (sometimes not measurable) that influence the response variable
- Develop a mathematical model that captures the relationship between the response variable and explanatory variables
- The target or response is denoted as y
Components of a linear regression model
- Deterministic: approximates the relationship we want to model
- Stochastic: random error term
Simple linear regression model
- Uses only one predictor/explanatory variable
- Multiple linear regression model for many explanatory variables
- y = β_0 + β_1 x + ε (compare with y = mx + b)
- The expected value of y for a given value of x lies on a straight line: E(y) = β_0 + β_1 x
- The slope parameter β_1 determines whether the linear relationship is positive or negative.
- The population parameters β_0 and β_1 are unknown and must be estimated.
- ŷ (read as y-hat) is the predicted value of the response variable given a value of x, from the sample regression equation ŷ = b_0 + b_1 x.
- The difference between the observed and the predicted values is the residual: e = y − ŷ.
- b_1 is the change in ŷ when x increases by one unit
- b_0 is the predicted value when x has a value of zero, which is not always meaningful
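A small sketch that ties these pieces together for one explanatory variable: the sample covariance and correlation, the fitted line ŷ = b_0 + b_1 x, and the residuals. It uses the standard least squares formulas b_1 = s_xy/s_x^2 and b_0 = ȳ − b_1 x̄. The data and function name are hypothetical.

```python
from math import sqrt
from statistics import mean


def fit_simple_regression(x, y):
    """Return b0, b1, and the sample correlation r for a simple linear regression."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_xx = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)  # sample variance of x
    s_yy = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)  # sample variance of y
    b1 = s_xy / s_xx          # slope: covariance scaled by the variance of x
    b0 = y_bar - b1 * x_bar   # intercept: the line passes through (x_bar, y_bar)
    r = s_xy / sqrt(s_xx * s_yy)
    return b0, b1, r


# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, r = fit_simple_regression(x, y)
predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, predicted)]
sse = sum(e ** 2 for e in residuals)
print(round(b0, 3), round(b1, 3), round(r, 4), round(sse, 3))
```

The residuals of a least squares line with an intercept sum to zero, which is a quick sanity check on any fit.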
Method of least squares
- Also referred to as ordinary least
squares (OLS)
- Common approach to fitting a line to
a scatterplot
- Used to obtain estimates of β_0 and β_1
- OLS chooses the line (b_0 and b_1) to minimize the error sum of squares, SSE = ∑(y − ŷ)^2 = ∑ e^2.
- SSE is the sum of the squared differences between the observed values and their predicted values, that is, the sum of squared deviations from the regression equation.
- OLS estimators have desirable properties if certain assumptions hold
- Gives an equation “closest” to the data
Multiple linear regression model
- Having only one explanatory variable might reduce the usefulness of the model
- A multiple linear regression model allows us to examine how the response is influenced by two or more explanatory variables.
- The choices of the explanatory variables are based on economic theory, intuition, and/or prior research.
- Has at least 2 predictor/explanatory variables
- b_j measures the change in the predicted value of the response given a unit increase in x_j, holding all other predictor variables constant.
- The b_j now represent the partial influence of x_j on ŷ.
Goodness of fit measures
- By simply observing the sample regression equation, we cannot assess how well the explanatory variables explain the variation in the response variable
- We rely on several objective “goodness-of-fit” measures that summarize how well the sample regression equation fits the data.
- If each predicted value is equal to its observed value, then we have a perfect fit.
- Since that almost never happens, we evaluate the models on a relative basis, that is, on the basis of the relative magnitude of the residuals.
- The sample regression equation provides a good fit when the dispersion of the residuals is relatively small
The standard error of the estimate measures the standard deviation of the residuals: s_e = √(SSE/(n − k − 1))
- s_e^2 = SSE/(n − k − 1) can be interpreted as the “average” squared residual.
- s_e can take any value between 0 and infinity.
- The less dispersion, the smaller the s_e and the better the model fits the data.
- We use the standard error of the estimate in conjunction with other measures to judge the overall usefulness of a model.
- For a given sample size n, increasing the number of explanatory variables k reduces both SSE and the denominator (n − k − 1).
- The net effect, shown by the value of s_e, allows us to determine if the added explanatory variables improve the fit of the model
Coefficient of determination (R^2)
- The coefficient of determination, denoted by R^2, quantifies the proportion of the sample variation in the response variable that is explained by the sample regression equation.
- How well a statistical model predicts an outcome
For example, R^2 = 0.72.
- 72% of the variation in the response is explained by the sample regression equation.
- Other factors not included in the model explain the remaining 28%
Use ANOVA in the context of the linear regression model to derive R^2.
- Take the total sum of squares and break it into two parts.
- Explained variation (model)
- Unexplained variation (error)
- R^2 cannot be used for comparing models that do not include the same number of explanatory variables.
- R^2 never decreases as you add more predictors.
- It is possible to increase R^2 by including a group of explanatory variables that have no foundation in the model.
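The decomposition of the total sum of squares into explained and unexplained variation can be sketched as follows. It assumes the predictions come from a least squares fit with an intercept, so that SST = SSR + SSE, and uses the standard definitions R^2 = SSR/SST and s_e = √(SSE/(n − k − 1)). The observed and predicted values are hypothetical.

```python
from math import sqrt


def fit_measures(y, y_hat, k):
    """Return SST, SSR, SSE, R^2, and s_e for k explanatory variables."""
    n = len(y)
    y_bar = sum(y) / n
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation
    ssr = sst - sse                                        # explained variation
    r2 = ssr / sst
    se = sqrt(sse / (n - k - 1))
    return sst, ssr, sse, r2, se


# Hypothetical observed values and predictions from a one-variable model (k = 1)
y = [2.1, 3.9, 6.2, 7.8, 10.1]
y_hat = [2.04, 4.03, 6.02, 8.01, 10.00]
sst, ssr, sse, r2, se = fit_measures(y, y_hat, k=1)
print(round(r2, 4), round(se, 4))
```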
Adjusted R^2
- Accounts for the number of explanatory variables in the model.
- Adjusted R^2 = 1 − (1 − R^2)((n − 1)/(n − k − 1))
- Penalizes for adding additional
explanatory variables in the model
- Is used to compare linear regressions
with different numbers of explanatory
variables.
- The higher the adjusted 𝑅 2, the better
the model.
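The adjusted R^2 formula is easy to evaluate directly. The sketch below compares two hypothetical models fitted to the same 30 observations, where the larger model has a slightly higher R^2 but a lower adjusted R^2.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical models: 2 explanatory variables versus 6
small_model = adjusted_r_squared(r2=0.72, n=30, k=2)
large_model = adjusted_r_squared(r2=0.73, n=30, k=6)
print(round(small_model, 4), round(large_model, 4))
```

Here the four extra variables raise R^2 by only 0.01, which is not enough to offset the penalty, so the smaller model is preferred on this measure.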
Inference with regression models
Test of significance
- We can conduct hypothesis tests
about the unknown parameters
- Joint test about all of the parameters
- Individual tests about a single
parameter
- For the tests to be valid, certain
conditions about the model must be
met.
- If all of the coefficients equal zero,
then all of the explanatory variables
drop out.
- If at least one coefficient does not
equal zero, then at least one
explanatory variable has a linear
relationship with the response.
- A test of joint significance is regarded
as a test of the overall usefulness of a
regression.
- We use the ANOVA Significance F as the p-value for the test of joint significance.
- In a test of individual significance, the hypothesized value β_j0 is typically 0, but it could be a nonzero value.
Confidence interval
- If the confidence interval for the slope coefficient contains zero, then the explanatory variable is not significant.
- If the confidence interval does not contain zero, then the explanatory variable is statistically significant.
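A sketch of the confidence interval rule for a slope coefficient, using the usual form b_j ± t_{α/2,df}·se(b_j) with df = n − k − 1. The estimate, its standard error, and the sample size are hypothetical; the t value is passed in as a table look-up because the Python standard library has no t quantile function.

```python
def slope_interval(b_j, se_b_j, t_critical):
    """Confidence interval b_j +/- t * se(b_j) for a slope coefficient."""
    margin = t_critical * se_b_j
    return b_j - margin, b_j + margin


def slope_t_statistic(b_j, se_b_j, beta_j0=0.0):
    """Test statistic for H0: beta_j = beta_j0 (beta_j0 is typically 0)."""
    return (b_j - beta_j0) / se_b_j


# Hypothetical regression output with n = 23 and k = 2, so df = 20
# Table value t_{0.025,20} = 2.086 for a 95% interval
low, high = slope_interval(b_j=1.50, se_b_j=0.60, t_critical=2.086)
t_stat = slope_t_statistic(b_j=1.50, se_b_j=0.60)
is_significant = not (low <= 0 <= high)
print(round(low, 4), round(high, 4), round(t_stat, 2), is_significant)
```

The interval excludes zero exactly when the absolute t statistic exceeds the same table value, so the two approaches agree.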
General test of linear restrictions
- The significance tests can also be
referred to as tests of linear
restrictions.
- The two-tailed t-test is a test of one
linear restriction about a single slope
coefficient.
- The F test of joint significance is a test of k linear restrictions, one for each of the slope coefficients.
- The partial F test is a general test of
linear restrictions.
A test for a nonzero slope coefficient
- We can apply this test to any subset
of the regression coefficients.
- The test statistic measures how well
the regression equation explains the
variability in the response variable
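The partial F test compares a restricted model (with the linear restrictions imposed) against the unrestricted model. These notes do not give the formula; the sketch assumes the usual textbook form F = ((SSE_R − SSE_U)/df_1)/(SSE_U/df_2), with df_1 equal to the number of restrictions and df_2 = n − k − 1. All numbers are hypothetical.

```python
def partial_f_statistic(sse_restricted, sse_unrestricted, num_restrictions, n, k):
    """Partial F statistic; k is the number of explanatory variables in the unrestricted model."""
    df1 = num_restrictions
    df2 = n - k - 1
    f_stat = ((sse_restricted - sse_unrestricted) / df1) / (sse_unrestricted / df2)
    return f_stat, df1, df2


# Hypothetical example: test whether 2 of the 4 slope coefficients are jointly zero
f_stat, df1, df2 = partial_f_statistic(
    sse_restricted=520.0, sse_unrestricted=400.0, num_restrictions=2, n=45, k=4
)
print(f_stat, df1, df2)
```

A large F value means that dropping the restricted variables raises SSE by more than chance alone would explain; the cutoff comes from an F table for (df_1, df_2).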
Interval estimates for the response variable
- Prediction intervals are always wider than the corresponding confidence intervals.
Cheat sheet
Fit of the model
Multiple R (correlation coefficient) r
- Measures the strength of the linear relationship between the predictor variables and the response variable
- A multiple R of 1 indicates a perfect linear relationship while a multiple R of 0 indicates no linear relationship whatsoever.
R-squared (Coefficient of determination) r^2
- It is the proportion of the variance in the response variable that can be explained by the predictor variables.
- A value of 1 indicates that the response variable can be perfectly explained without error by the predictor variables.
- Ranges from 0 to 1 and can be expressed as a percentage
Adjusted R-squared
- Modified version of R-squared that has been adjusted for the number of predictors in the model
- Never higher than the R-squared
- Useful for comparing regression models with different numbers of variables
Standard Error of the Regression
- The average distance that the observed values fall from the regression line
- The lower the better
- Ranges from 0 to infinity
Testing
F-statistic
- Indicates whether the regression model provides a better fit to the data than a model that contains no independent variables.
- Tests if the regression model as a whole is useful
- Generally, if none of the predictor variables in the model are statistically significant, the overall F statistic is also not statistically significant.
Significance of F (P-value)
- To see if the overall regression model is significant, you can compare the p-value to a significance level
- If the p-value is less than the significance level, there is sufficient evidence to conclude that the regression model fits the data better than the model with no predictor variables
- In this example, the p-value is 0.033, which is less than the common significance level of 0.05. This indicates that the regression model as a whole is statistically significant