Lecture 11: Introduction to Generalized Linear Models
Dipankar Bandyopadhyay, Ph.D.
BMTRY 711: Analysis of Categorical Data Spring 2011
Division of Biostatistics and Epidemiology
Medical University of South Carolina
Lecture 11: Introduction to Generalized Linear Models – p. 1/44
Outline
1. Introduction (motivation and history)
2. Review ordinary linear regression
3. Components of a GLM
4. Natural Exponential family
Lecture 11: Introduction to Generalized Linear Models – p. 2/44
Brief Overview of Statistics
Response Variable
Explanatory Variables Binary Nominal Continuous
Binary 2 × 2 table Contingency tables t-tests
logistic regression log-linear models
Nominal Logistic regression Contingency tables ANOVA
Log-linear models log-linear models
Continuous Dose-response models It depends Multiple regression
logistic regression
Some Continuous Logistic regression It depends ANCOVA
and some categorical Multiple regression
Note, in general, most common analyses can be approached from a “modelling” approach.
Some such as the log-linear and logistic are topics for this class.
Lecture 11: Introduction to Generalized Linear Models – p. 3/44
Motivation for Modeling
Why do we want to “model” data?
• The structural form of the model describes the patterns of interactions or associations
in data.
• Inference for the model parameters provides a way to evaluate which explanatory
variable(s) are related to the response variable(s) while statistically controlling for the
other variables.
• Estimated model parameters provide measures of the strength and importance of
effects.
• A model’s predicted values “smooth” the data - That is, they provide good estimates of
the mean of the response variable.
• Modeling enables use to examine general extensions to the methods we have studied
thus far.
Lecture 11: Introduction to Generalized Linear Models – p. 4/44
Review of Ordinary Linear Regression
Suppose you have a continuous response variable (Y ) and continuous and/or discrete
explanatory variables (X’s).
You want to model the responses as a function of the explanatory variables (covariates). For
example,
Yi = β0 + β1 x1i + β2 x2i + ei
where
1. Yi is the response for the ith subject
~ = (β0 , β1 , β2 )′ is a (column) vector of constants (or parameters) that describe the
2. β
shape of the regression “line” (line, curve, etc)
~ i = (1, x1i , x2i ) is the (row) vector of explanatory variables for the ith subject.
3. X
4. ei is the random error assumed to be distributed as N (0, σ 2 )
Lecture 11: Introduction to Generalized Linear Models – p. 5/44
In general, you can view the previous regression model as,
Y = E(Y ) + ǫ
Where
E(Y ) = β0 + β1 x1i + β2 x2i
or in more general terms
= Xn×p βp×1
Thus, E(Y ) is the n × 1 vector of expectations.
Note,
2 3 2 3
X1 x11 , x12 , . . . , x1p
6 X2 7 6 x21 , x22 , . . . , x2p 7
X=6 7=6
6 7 6 7
··· ···
7
4 5 4 5
Xn xn1 , xn2 , . . . , xnp
is called the design matrix.
Lecture 11: Introduction to Generalized Linear Models – p. 6/44
Common Models of this Type
The analysis of continuous data has relied heavily on the linear model presented. These
reflect just a few applications of the linear model.
1. Simple linear regression
2. Multiple regression
3. ANOVA
4. ANCOVA
Lecture 11: Introduction to Generalized Linear Models – p. 7/44
Estimators
The least squares estimator for β is
β̃ = (X ′ X)−1 X ′ Y
The predicted value of Y (denoted as Ŷ ) is
Ŷ = X β̃
Diagnostic of the regression fit can be accomplished with the Hat Matrix
H = X(X ′ X)−1 X ′
As we develop our ‘generalized’ approach, you will notice many similarities.
Lecture 11: Introduction to Generalized Linear Models – p. 8/44
Movement towards a GLM
1. For OLS, we are dependent on the distribution of Y being normal.
2. For categorical data (by definition), the normality assumption is rarely feasible.
3. We may also be interested in other relations of the Xβ with Y . Other mapping
functions that ensure the range of Y remains valid is one of the key justifications.
In terms of a GLM, we have three components related to these limitations.
Lecture 11: Introduction to Generalized Linear Models – p. 9/44
Three Components of a GLM
There are 3 components of a generalized linear model (or GLM)
1. Random Component (the outcome)
2. Systematic Component (the design matrix multiplied by the parameter vector)
3. Link function (the function, g(·) that “links” the systematic component to the random
component)
Nelder & Wedderburn (1972) are generally given credit for unifying a broad class (of existing)
models into the GLM definitions.
They showed that provided the random component was part of the ‘Exponential Class’, the
MLEs for all of the models could be obtained using the same algorithm.
Lecture 11: Introduction to Generalized Linear Models – p. 10/44
Random Component
• The random component of a GLM consists of a response variable Y with the
independent observations (y1 , y2 , . . . , yn ).
• For a GLM, Y needs to have a distribution in the natural exponential family.
• Recall from theory, an exponential class distribution is of the form
f (yi ; θi ) = a(θi )b(yi )exp[yi Q(θi )]
This, in terms of common language, is
• a(θi ) is a function only dependent on the unknown parameter
• b(yi ) is a function of the observed sample
• Q(θi ) is a function only dependent on the unknown parameter
Lecture 11: Introduction to Generalized Linear Models – p. 11/44
Easy Example Exponential Class Variable
Suppose,
Y ∼ Poisson(λ)
Then,
e−λ λy
f (y, λ) = y!
1
= e−λ ( y! )ey log λ
Here θ = λ, a(θ) = a(λ) = e−λ , b(y) = 1/y! and Q(π) = log λ
Thus,
A Poisson random variable is of the exponential class variable.
Lecture 11: Introduction to Generalized Linear Models – p. 12/44
Slightly more complicated example
Suppose,
Yi ∼ Bern(π)
where P (Yi = 1) = π and P (Yi = 0) = 1 − π
Then,
f (yi ; π) = π yi (1 − π)1−yi
π yi
= (1−π)yi (1−π)−1
π
= (1 − π)( 1−π )yi
π )
(yi log 1−π
= (1 − π)(1)e
Here θ = π, a(θ) = a(π) = (1 − π), b(y) = 1 and Q(π) = log(π/(1 − π))
Thus, a Bernoulli random variable is a member of the exponential class.
Lecture 11: Introduction to Generalized Linear Models – p. 13/44
Exponential Class Variables in General
In general,
• The majority of distributions we will be interested in are exponential class variables
• This includes the more common examples of
1. Normal
2. Poisson
3. Binomial
4. Multinomial
• It also includes the less common examples of
1. Gamma
2. Negative Binomial
Lecture 11: Introduction to Generalized Linear Models – p. 14/44
Examples
1. Dichotomous (binary) with a fixed number of trials
• MI / No MI
• Success/Failure
• Correct/Incorrect
• Agree/Disagree
These responses have a Bernoulli distribution.
2. Counts (including cells in a contingency table)
• Number of babies born at MUSC daily
• Number of car wrecks per year in Charleston County
These responses have a Poisson distribution.
Lecture 11: Introduction to Generalized Linear Models – p. 15/44
Most Common Distributions
Although, many distributions are members of the Exponential Class,
For the most part, we will focus on the
1. Binomial
2. Poisson
distributions.
However, the approach we will discuss works equally well for all exponential class
distributions.
Lecture 11: Introduction to Generalized Linear Models – p. 16/44
Systematic Component
Denote a new vector (η1 , η2 , . . . , ηn ) such that
P
ηi = jβj xij , i = 1, . . . , n
= Xi β
• Recall, previously, we let ηi be the E(Y ).
• However, this results in a linear regression model.
• If we want to minimize this dependency, let
η = f (E(Y ))
where f (·) is a function.
Lecture 11: Introduction to Generalized Linear Models – p. 17/44
Link Component
Denote,
E(Y ) = µi
Let, g(·) be a monotonic and differentiable function such that
ηi = g(µi )
Thus,
X
g(µi ) = βj xij , i = 1, . . . , N
j
In words, we are now modeling a function of the mean (or Expectation) as a combination of
linear predictors.
Lecture 11: Introduction to Generalized Linear Models – p. 18/44
Link Function
• The function g(·) is called the “link function” since it links the E(Yi ) to the set of
explanatory variables and their estimates.
• For example, if we let g(x) = x (the identify link) and Y is distributed as a Normal R.V.,
then, we are back to the linear model (either simple linear or multiple linear regression)
• For GLM, you generally have the flexibility to choose what ever link you desire.
• However, there is a Special link that we need to consider
Lecture 11: Introduction to Generalized Linear Models – p. 19/44
Canonical Link
f (yi ; θi ) = a(θi )b(yi )exp[yi Q(θi )]
If we revisit the density function for an exponential, we see a function Q(θi ) that looks
interesting.
Q(θ) is defined as the natural parameter of the distribution.
If we let g(·) be defined such that it transforms the mean to the natural parameter, we have
the Canonical link
Lecture 11: Introduction to Generalized Linear Models – p. 20/44
Example Canonical Link
Suppose
Yi ∼ Bern(π)
where P (Yi = 1) = π and P (Yi = 0) = 1 − π
Then we previously showed that
π )
(yi log 1−π
f (yi ; π) = (1 − π)(1)e
with Q(π) = log(π/(1 − π))
So, if we would let
X
g(π) = log(π/(1 − π)) = βj xj
j
We would have the canonical link of a Bernoulli/Binomial distribution.
Lecture 11: Introduction to Generalized Linear Models – p. 21/44
Recall, the function
g(π) = log(π/(1 − π))
was previously introduced as the ‘log odds’ and was called the logit.
Lets recap what we have just accomplished.
If we let the random component be Bernoulli/Binomial and consider the linking function as
logit, we can model the log odds ratio as a linear function of covariates using
X
g(π) = βj xj
j
Lecture 11: Introduction to Generalized Linear Models – p. 22/44
Since g(π) = log(π/1 − π), we can write the success probability as
π
1−π
= eXβ
π = eXβ − πeXβ
π(1 + eXβ ) = eXβ
eXβ
π = 1+eXβ
Lecture 11: Introduction to Generalized Linear Models – p. 23/44
What are the β’s and the X’s?
For the logistic model, we have
X
g(π) = log(π/(1 − π)) = βj xj
j
To answer the question, consider a model in which you have one predictor (i.e., treatment)
and you observe the response MI/No MI.
Let
(
1 if subject i received the active treatment
x1i =
0 else
and
Xi = [1, xi1 ]
Thus, if subject i is on active drug,
Xi = [1, 1]
and if on placebo
Xi = [1, 0]
Lecture 11: Introduction to Generalized Linear Models – p. 24/44
Then, the odds for a person on placebo would be
π
= eβ0 +β1 ·0 = eβ0
1−π
and for a subject on active drug, the log odds would be
π
= eβ0 +β1 ·1 = eβ0 +β1
1−π
Thus, the odds ratio of a success for comparing active treatment to placebo could be written
as
eβ0 +β1
OR = β
= eβ1
e 0
or that log(OR) = β1 .
If you recall, we introduced this notation when we introduced RD, RR and OR.
Lecture 11: Introduction to Generalized Linear Models – p. 25/44
Other Choices in the Link
An implied advantage of the GLM formulation is that you can specify other links to derive
additional parameter interpretations.
For example, suppose you used the “log” link (log(π) = Xβ) instead of the “logit” link.
Now, the log link is not the canonical link, but that is OKAY. Then,
log(π) = β0 + β1 x1i
or
π = eβ0 +β1 x1i
Therefore, the RR could be viewed as
π|x = 1 eβ0 +β1 β1
RR = = = e
π|x = 0 eβ0
β1 can now be interpreted as log Relative Risk.
Lecture 11: Introduction to Generalized Linear Models – p. 26/44
example
Recall our famous MI example.
Myocardial Infarction
Fatal Attack or No
Nonfatal attack Attack
Placebo 189 10845
Aspirin 104 10933
Previously, we estimated the OR to be
OR = (189 · 10933)/(104 · 10845) = 1.832
which indicates that subjects taking placebo had 1.8 times the odds of having an MI when
compared to subjects taking aspirin.
Lecture 11: Introduction to Generalized Linear Models – p. 27/44
Now using a GLM
For this analysis, we want to use the aspirin group as the reference group and estimate.
Therefore in terms of our regression dummy codes, we want
(
1 if subject i received PLACEBO
x1i =
0 if subject i received ASPIRIN
with the response coding of
(
1 if subject i has either a Fatal MI or a Non Fatal MI
Yi =
0 if subject i does not have an MI
Lecture 11: Introduction to Generalized Linear Models – p. 28/44
Inputting Data
And we could input this data into SAS as
data one;
input y x1 count;
cards;
1 1 189
1 0 104
0 1 10845
0 0 10933
;
run;
Lecture 11: Introduction to Generalized Linear Models – p. 29/44
Modeling Fitting Using SAS
And use PROC GENMOD (generalized linear models) to fit the data
proc genmod descending;
freq count;
model y = x1 /dist = bin link=logit;
estimate ’X1’ x1 1 /exp;
run;
Notes:
1. We have specified our design matrix to include just X1. GENMOD automatically
includes an intercept unless you tell it not to.
2. We used “dist=bin” to specify the distribution of Y as binomial
3. We used “link = logit” to fit the canonical link (logistic link)
4. The estimate statement invokes a contrast and exponentiates the parameter estimates
Don’t worry, we’ll become very familiar with GENMOD over the next few weeks.
Lecture 11: Introduction to Generalized Linear Models – p. 30/44
Selected Results
Response Profile
Ordered Total
Value y Frequency
1 1 293
2 0 21778
PROC GENMOD is modeling the probability that y=’1’.
Note: The most important line is the one that indicates what level of the response is
considered a success. In this case, we used “DESCENDING” to specify y=1 as the success.
(by default, SAS takes the first sorted response category as the success)
Lecture 11: Introduction to Generalized Linear Models – p. 31/44
Analysis Of Parameter Estimates
Standard Wald 95% Confidence
Parameter DF Estimate Error Limits
Intercept 1 -4.6552 0.0985 -4.8483 -4.4620
x1 1 0.6054 0.1228 0.3647 0.8462
Contrast Estimate Results
Standard
Label Estimate Error Alpha Confidence Limits
X1 0.6054 0.1228 0.05 0.3647 0.8462
Exp(X1) 1.8321 0.2251 0.05 1.4400 2.3308
Therefore our estimate of OR = 1.832 with a 95% CI of (1.44, 2.33).
Lecture 11: Introduction to Generalized Linear Models – p. 32/44
Using PROC LOGISTIC
proc logistic descending; freq count; model y = x1;
*estimate X1 x1 1 /exp;
run;
THE OUTPUT
Analysis of Maximum Likelihood Estimates
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -4.6551 0.0985 2232.4885 <.0001
x1 1 0.6054 0.1228 24.2911 <.0001
Odds Ratio Estimates
Point 95% Wald
Effect Estimate Confidence Limits
x1 1.832 1.440 2.331
Difference? GENMOD is moment-based, but LOGISTIC uses ML!
Lecture 11: Introduction to Generalized Linear Models – p. 33/44
Deviance
For a particular GLM for observations y = (y1 , . . . , yn ), let
l(µ, y)
denote the log likelihood function expressed in terms of the means µ = (µ1 , . . . , µn )
Let
l(µ̂, y)
denote the maximum of the log likelihood for the model.
If we have n observations and fit a model with n parameters, we have a saturated model.
We then have a ‘perfect fit’ of the data (i.e., no degrees of freedom).
Denote the likelihood under the saturated model as
l(y, y)
Lecture 11: Introduction to Generalized Linear Models – p. 34/44
Then, the DEVIANCE of a model is
D = −2[l(µ̂, y) − l(y, y)]
Note:
1. D is distributed as χ2 with df = N − p
2. p is the number of parameters estimated under the alternative (or fitted model).
3. Recall, N in the saturated model is the number of parameters included (one for each
observation).
4. Therefore, using the rules for calculating the df of a contingency table we developed
earlier, df equals the difference in parameters estimated under the null (saturated
model) and the alternative (at least one β not equal to zero)
5. For contingency tables D ≡ G2
6. As we proceed, the Deviance will be used to provide a measure of model fit.
Lecture 11: Introduction to Generalized Linear Models – p. 35/44
Calculation of Deviance by Hand -Okay, not really
If you recall, we used a Poisson log linear model to calculate the Pearson residuals for a
contingency table.
A saturated model for a contingency table is one that contains an interaction term. For
example;
proc genmod;
model count = x1 y x1*y /dist=poi link = log;
run;
produces a saturated model.
Lecture 11: Introduction to Generalized Linear Models – p. 36/44
Proof
The design matrix for this model would be
X = [1, x1i , yi , x1i yi ]
Since x1i = yi = (0, 1)
cell (1,1) has X = (1, 1, 1, 1)
cell (1,2) has X = (1, 1, 0, 0)
cell (2,1) has x = (1, 0, 1, 0)
cell (2,2) has x = (1, 0, 0, 0)
That is, counts for all cells are determined by a combination of Xβ.
Lecture 11: Introduction to Generalized Linear Models – p. 37/44
GENMOD Results
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance 0 0.0000 .
Scaled Deviance 0 0.0000 .
Pearson Chi-Square 0 0.0000 .
Scaled Pearson X2 0 0.0000 .
Log Likelihood 181840.4662
Note: df = 0 since we are fitting the saturated model (df = N − N )
l(y, y) = 181840.4662
Lecture 11: Introduction to Generalized Linear Models – p. 38/44
Analysis Of Parameter Estimates
Standard Wald 95% Confidence
Parameter DF Estimate Error Limits
Intercept 1 9.2995 0.0096 9.2808 9.3183
x1 1 -0.0081 0.0136 -0.0346 0.0185
y 1 -4.6552 0.0985 -4.8483 -4.4620
x1*y 1 0.6054 0.1228 0.3647 0.8462
Therefore, cell (1,1)’s count would be
log(count cell 1,1) = 9.2995 − 0.0081 − 4.6552 + 0.6054
= 5.2416
or
count cell 1,1 = e5.2416
= 189
Lecture 11: Introduction to Generalized Linear Models – p. 39/44
The Alternative Model
proc genmod;
model count = x1 y /dist=poi link = log;
run;
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance 1 25.3720 25.3720
Scaled Deviance 1 25.3720 25.3720
Pearson Chi-Square 1 25.0139 25.0139
Scaled Pearson X2 1 25.0139 25.0139
Log Likelihood 181827.7802
l(µ̂, y) = 181827.7802
Lecture 11: Introduction to Generalized Linear Models – p. 40/44
Analysis Of Parameter Estimates
Standard Wald 95% Confidence
Parameter DF Estimate Error Limits
Intercept 1 9.2956 0.0096 9.2769 9.3144
x1 1 -0.0003 0.0135 -0.0267 0.0261
y 1 -4.3085 0.0588 -4.4238 -4.1932
Thus, the expected cell count for cell (1,1) would be
log(expected count cell 1,1) = 9.2956 − 0.0003 − 4.3085
= 4.9868
or
count cell 1,1 = e4.9868
= 146.467
This is the same values as (with some rounding error) 293 ∗ 11034/22071 = 146.48
Lecture 11: Introduction to Generalized Linear Models – p. 41/44
Therefore,
D = −2(181827.7802 − 181840.4662)
= 25.37
on df = 4 − 3 = 1
Lecture 11: Introduction to Generalized Linear Models – p. 42/44
Summary of Canonical Links
Distribution Natural Parameter Canonical Link
Poisson log(λ) log
Normal µ identity
Binomial log(π/(1 − π)) logit
As stated before, just because the natural parameter suggests a certain link, there is no
requirement to model only using the canonical link.
Lecture 11: Introduction to Generalized Linear Models – p. 43/44
Recap
Some key summary points:
1. We generalized linear models by allowing for the specification of the distribution of Y
and the relationship of the expectation to the design matrix
2. We do not need normality for the regression model
3. GLMs provide a unified theory of modeling that encompasses most of the important
models for continuous and discrete variables
4. As we will see next, model parameters can be estimated by ML
5. By restricting the distributions to only exponential class distributions, we can use the
same algorithm for ML estimation
Lecture 11: Introduction to Generalized Linear Models – p. 44/44