Chapter 3
Hidden Markov Models (HMMs)
3.1 Hidden Markov Models (HMMs)
There is a variant of the notion of DFA with ouput, for ex-
ample a transducer such as a gsm (generalized sequential
machine), which is widely used in machine learning.
This machine model is known as hidden Markov model ,
for short HMM .
There are three new twists compared to traditional gsm
models:
(1) If the set of states is denoted Q = {q1, . . . , qn), then
the transitions between states are labeled with prob-
abilities rather that symbols from an alphabet. For
any two states qi and qj , the edge from qi to qj is
labeled with a probability A(i, j).
95
96 CHAPTER 3. HIDDEN MARKOV MODELS (HMMS)
(2) There is a finite set O = {O1, . . . , Om} of possible
outputs (called the observation space) that can be
emitted, and for every state qi, there is a probability
B(i, j) that output Oj is emitted (produced).
(3) Sequences of outputs O = (O1 , . . . , OT ) (with t
{1, . . . , m} for t = 1, . . . , T ) emitted by the model
are directly observable, but the sequences of states
S = (qi1 , . . . , qiT ) (with it {1, . . . , n} for t =
1, . . . , T ) that caused some sequence of output to be
emitted are not observable.
In this sense the states are hidden, and this is the
reason for calling this model a hidden Markov model.
HMMs are among the most eective tools to solve the
following types of problems:
3.1. HIDDEN MARKOV MODELS (HMMS) 97
(1) DNA and protein sequence alignment in the
face of mutations and other kinds of evolutionary change.
(2) Speech understanding systems: When we talk,
our mouths produce sequences of sounds from the sen-
tences that we want to say. This process is complex.
Multiple words may map to the same sound, words
are pronounced dierently as a function of the word
before and after them, we all form sounds slightly
dierently, and so on.
All a listener can hear (perhaps a computer system)
is the sequence of sounds, and the listener would like
to reconstruct the mapping (backward) in order to
determine what words we were attempting to say.
For example, when you talk to your TV to pick a
program, say game of thrones, you dont want to get
the price is right.
98 CHAPTER 3. HIDDEN MARKOV MODELS (HMMS)
(3) Optical character recognition (OCR). When
we write, our hands map from an idealized symbol to
some set of marks on a page (or screen).
The marks are observable, but the process that gen-
erates them isnt.
A system performing OCR, such as a system used by
the post oce to read addresses, must discover which
word is most likely to correspond to the mark it reads.
3.1. HIDDEN MARKOV MODELS (HMMS) 99
Example 3.1. Say we consider the following behavior
of some professor at some university.
On a hot day (denoted by Hot), the professor comes to
class with a drink (denoted D) with probability 0.7, and
with no drink (denoted N) with probability 0.3.
On the other hand, on a cold day (denoted Cold), the
professor comes to class with a drink with probability
0.2, and with no drink with probability 0.8.
Suppose a student intrigued by this behavior recorded
a sequence showing whether the professor came to class
with a drink or not, say NNND.
Several months later, the student would like to know
whether the weather was hot or cold the days he recorded
the drinking behavior of the professor.
100 CHAPTER 3. HIDDEN MARKOV MODELS (HMMS)
Now the student heard about machine learning, so he
constructs a probabilistic (hidden Markov) model of the
weather, namely based on some experiments, he deter-
mines the probability of a transition from a hot day to
another hot day, or to a cold day, and the probability of
a transition from a cold day to another cold day, or to a
hot day.
He also knows that when he started his observations, it
was a cold day with probability 0.45, and a hot day with
probability 0.55.
The above data determine an HMM depicted in Figure
3.1.
3.1. HIDDEN MARKOV MODELS (HMMS) 101
start
0.45 0.55
0.7 0.75
0.3
Cold Hot
0.25
0.2 0.3
0.8 0.7
N D
Figure 3.1: Example of an HMM modeling the drinking behavior of a professor at the
University of Pennsylvania.
The portion of the state diagram involving the states
Cold, Hot, is analogous to an NFA in which the tran-
sition labels are probabilities; it is the underlying Markov
model of the HMM.
For any given state, the probabilities sum to 1.
The start state is a convenient way to express the proba-
bilities of starting either in state Cold or in state Hot.
102 CHAPTER 3. HIDDEN MARKOV MODELS (HMMS)
Also, from each of the states Cold and Hot, we have emis-
sion probabilities of producing the ouput N or D, and
these probabilities also sum to 1.
We can also express these data using matrices.
If we associate Cold with the index 1 and Hot with the
index 2, then the matrix
! "
0.7 0.3
A=
0.25 0.75
describes the transitions of the Makov model.
The vector
! "
0.45
=
0.55
describes the probabilities of starting either in state Cold
or in state Hot.
3.1. HIDDEN MARKOV MODELS (HMMS) 103
If we associate N with the index 1 and D with the index
2, then the matrix
! "
0.8 0.2
B=
0.3 0.7
describes the emission probabilities.
The student would like to solve what is known as the
decoding problem.
Namely, given the output sequence NNND, find the
most likely state sequence of the Markov model that
produces the output sequence NNND.
Is it (Cold, Cold, Cold, Cold), or (Hot, Hot, Hot, Hot), or
(Hot, Cold, Cold, Hot), or (Cold, Cold, Cold, Hot)?
Given the probabilities of the HMM, it seems unlikely
that it is (Hot, Hot, Hot, Hot), but how can we find the
most likely one?
104 CHAPTER 3. HIDDEN MARKOV MODELS (HMMS)
Let us consider another example taken from Stamp [?].
Example 3.2. Suppose we want to determine the av-
erage annual temperature at a particular location over a
series of years in a distant past where thermometers did
not exist.
Since we cant go back in time, we look for indirect evi-
dence of the temperature, say in terms of the size of tree
growth rings.
For simplicity, assume that we consider the two tempera-
tures Cold and Hot, and three dierent sizes of tree rings:
small, medium and large, which we denote by S, M, L.
The HMM shown in Figure 3.2 is a model of the situation.
3.1. HIDDEN MARKOV MODELS (HMMS) 105
start
0.4 0.6
0.6 0.7
0.4
Cold Hot
0.3
0.1 0.1
0.2 0.4
0.7 0.5
S M L
Figure 3.2: Example of an HMM modeling the temperature in terms of tree growth rings.
Suppose we observe the sequence of tree growth rings
(S, M, S, L).
What is the most likely sequence of temperatures over a
four-year period which yields the observations
(S, M, S, L)?
Going back to Example 3.1, we need to figure out the
probability that a sequence of states S = (qi1 , qi2 , . . . , qiT )
produces the output sequence O = (O1 , O2 , . . . , OT ).
106 CHAPTER 3. HIDDEN MARKOV MODELS (HMMS)
Denote our set of states by Q = {q1, q2} = {Cold, Hot}
and our set of outputs by O = {O1, O2} = {N, D}.
Then the probability that we want is just the product of
the probability that we begin with state qi1 , times the
product of the probabilities of each of the transitions,
times the product of the emission probabilities, namely
T
#
Pr(S, O) = (i1)B(i1, 1) A(it1, it)B(it, t).
t=2
In our example, (1, 2, 3, 4) = (1, 1, 1, 2), which cor-
responds to NNND.
The brute-force method is to compute these probabilities
for all 24 = 16 sequences of states of length 4 (in general,
there are nT sequences of length T ).
3.1. HIDDEN MARKOV MODELS (HMMS) 107
For example, for the sequence S = (Cold, Cold, Cold, Hot),
which corresponds to the sequence of indices (1, 1, 1, 2),
we find that
Pr(S, NNND) = (1)B(1, 1)A(1, 1)B(1, 1)A(1, 1)B(1, 1)
A(1, 2)B(2, 2)
= 0.45 0.8 0.7 0.8 0.7 0.8
0.3 0.7 = 0.0237.
A much more ecient way to proceed is to use dynamic
programming.
For t = 1, 2, 3, 4, for every state i = 1, 2, we compute
score(i, t) to be the highest probability that a path of
length t 1 ending in qi produces the output sequence
(O1 , . . . , Ot ), and for t 2, we let pred(i, t) be the
state that precedes qi in a best path of length t 1
ending in qi.
108 CHAPTER 3. HIDDEN MARKOV MODELS (HMMS)
Initially, we set
score(j, 1) = (j)B(j, 1), j = 1, 2,
and since 1 = 1 we get score(1, 1) = 0.45 0.8 = 0.36
and score(2, 1) = 0.55 0.3 = 0.165.
Next we compute score(1, 2) and score(2, 2) as follows.
For j = 1, 2, for i = 1, 2, compute temporary scores
tscore(i, j) = score(i, 1)A(i, j)B(j, 2);
then pick the best of the temporary scores,
score(j, 2) = max tscore(i, j).
i
Since 2 = 1, we get tscore(1, 1) = 0.36 0.7 0.8 =
0.2016, tscore(2, 1) = 0.165 0.25 0.8 = 0.0330, and
tscore(1, 2) = 0.36 0.3 0.3 = 0.0324, tscore(2, 2) =
0.165 0.75 0.3 = 0.0371.
3.1. HIDDEN MARKOV MODELS (HMMS) 109
Then
score(1, 2) = max{tscore(1, 1), tscore(2, 1)}
= max{0.2016, 0.0330} = 0.2016,
and
score(2, 2) = max{tscore(1, 2), tscore(2, 2)}
= max{0.0324, 0.0371} = 0.0371.
Since the state that leads to the optimal score score(1, 2)
is 1, we let pred(1, 2) = 1, and since the state that leads
to the optimal score score(2, 2) is 2, we let pred(2, 2) = 2.
110 CHAPTER 3. HIDDEN MARKOV MODELS (HMMS)
We compute score(1, 3) and score(2, 3) in a similar way.
For j = 1, 2, for i = 1, 2, compute
tscore(i, j) = score(i, 2)A(i, j)B(j, 3);
then pick the best of the temporary scores,
score(j, 3) = max tscore(i, j).
i
Since 3 = 1, we get
score(1, 3) = max{0.1129, 0.074} = 0.1129,
and
score(2, 3) = max{0.0181, 0.0083} = 0.0181.
We also get pred(1, 3) = 1 and pred(2, 3) = 1.
Finally, we compute score(1, 4) and score(2, 4) in a sim-
ilar way.
3.1. HIDDEN MARKOV MODELS (HMMS) 111
For j = 1, 2, for i = 1, 2, compute
tscore(i, j) = score(i, 3)A(i, j)B(j, 4);
then pick the best of the temporary scores,
score(j, 4) = max tscore(i, j).
i
Since 4 = 2, we get
score(1, 4) = max{0.0158, 0.0009} = 0.0158,
and
score(2, 4) = max{0.0237, 0.0095} = 0.0237,
and pred(1, 4) = 1 and pred(2, 4) = 1.
Since max{score(1, 4), score(2, 4)} = 0.0237, the state
with the maximun score is q2 = Hot, and by following
the predecessor list we find the most likely path to pro-
duce the sequence NNND to be
(q1, q1, q1, q2) = (Cold, Cold, Cold, Hot).
The method we just described is known as the Viterbi
algorithm.
112 CHAPTER 3. HIDDEN MARKOV MODELS (HMMS)
Definition 3.1. A hidden Markov model , for short
HMM , is a quintuple M = (Q, O, , A, B) where
Q = {q1, . . . , qn} is a finite set of states, with n
states.
O = {O1, . . . , Om} is a finite output alphabet (also
called set of possible observations), with m observa-
tions.
A = (A(i, j)) is an n n matrix called the state
transition probability matrix , with
n
$
A(i, j) 0, 1 i, j n, and A(i, j) = 1,
j=1
for i = 1, . . . , n.
B = (B(i, j)) is an n m matrix called the state ob-
servation probability matrix (also called confusion
matrix ), with
m
$
B(i, j) 0, 1 i, j n, and B(i, j) = 1,
j=1
for i = 1, . . . , n.
3.1. HIDDEN MARKOV MODELS (HMMS) 113
Examples of HMMs are shown in Figure 3.1, Figure 3.2,
and Figure 3.3.
Note that an ouput is emitted when visiting a state, not
when making a transition, as in the case of a gsm. So the
analogy with the gsm model is only partial; it is meant
as a motivation for HMMs.
If we ignore the output components O and B, then we
have what is called a Markov chain.
114 CHAPTER 3. HIDDEN MARKOV MODELS (HMMS)
There are three types of problems that can be solved using
HMMs:
(1) The decoding problem: Given an HMM M =
(Q, O, , A, B), for any observed output sequence O =
(O1 , O2 , . . . , OT ) of length T , find a most likely
sequence of states S = (qi1 , qi2 , . . . , qiT ) that pro-
duces the output sequence O.
More precisely, this means finding a sequence S such
that the probability
T
#
Pr(S, O) = (i1)B(i1, 1) A(it1, it)B(it, t)
t=2
is maximal.
This problem is solved eectively by the Viterbi al-
gorithm.
3.1. HIDDEN MARKOV MODELS (HMMS) 115
(2) The evaluation problem: Given a finite collec-
tion {M1, . . . , ML} of HMMs with the same output
alphabet O, for any output sequence
O = (O1 , O2 , . . . , OT ) of length T , find which
model M is most likely to have generated O.
More precisely, given any model Mk , we compute the
probablity tprobk that Mk could have produced O
along any path.
Then we pick an HMM M for which tprob is max-
imal. We will return to this point after having de-
scribed the Viterbi algoritm.
A variation of the Viterbi algorithm called the for-
ward algorithm eectively solves the evaluation prob-
lem.
116 CHAPTER 3. HIDDEN MARKOV MODELS (HMMS)
(3) The training problem: Given a set {O1, . . . , Or }
of output sequences on the same output alpabet O,
usually called a set of training data, given Q, find the
best , A, and B for an HMM M that produces
all the sequences in the training set, in the sense
that the HMM M = (Q, O, , A, B) is the most likely
to have produced the sequences in the training set.
The technique used here is called expectation maxi-
mization, or EM . It is an iterative method that starts
with an initial triple , A, B, and tries to impove it.
There is such an algorithm known as the Baum-Welch
or forward-backward algorithm, but it is beyond the
scope of this introduction.
Let us now describe the Viterbi algorithm in more details.
3.2. THE VITERBI ALGORITHM AND THE FORWARD ALGORITHM 117
3.2 The Viterbi Algorithm and the Forward Algorithm
Given an HMM M = (Q, O, , A, B), for any observed
output sequence O = (O1 , O2 , . . ., OT ) of length T ,
we want to find a most likely sequence of states S =
(qi1 , qi2 , . . . , qiT ) that produces the output sequence O.
For this, we need to find a sequence S such that the
probability
T
#
Pr(S, O) = (i1)B(i1, 1) A(it1, it)B(it, t)
t=2
is maximal.
In general, there are nT sequences of length T .
This problem can be solved eciently by the method of
dynamic programming.
118 CHAPTER 3. HIDDEN MARKOV MODELS (HMMS)
For any t, 1 t T , for any state qj Q, we compute
score(j, t), which is the largest probability that a path
of length t 1 ending with qj has produced the output
sequence (O1 , O2 , . . . , Ot ).
The point is that if we know score(k, t 1) for k =
1, . . . , n (with t 2), then we can find score(j, t) for
j = 1, . . . , n, because the probability associated with the
path (qi1 , . . . , qit1 , qj ) is
tscore(it1, j) = score(it1, t 1)A(it1, j)B(j, t),
so to maximize this probability we just have to find the
maximum of the probabilities tscore(it1, j) over all
it1, that is, we must have
score(j, t) = max tscore(k, j).
k
To get started, we set score(j, 1) = (j)B(j, 1) for j =
1, . . . , n.
3.2. THE VITERBI ALGORITHM AND THE FORWARD ALGORITHM 119
The algorithm goes through a forward phase for t =
1, . . . , T , during which it computes the probabilities
score(j, t) for j = 1, . . . , n.
When t = T , we pick a state qiT = qj such that
score(j, T ) is maximal.
The machine learning community is fond of the notation
j = arg max score(k, T )
k
to express the above fact.
This gives us the last state in an optimal sequence that
yields the output sequence O.
The algorithm then goes through a backward phase.
120 CHAPTER 3. HIDDEN MARKOV MODELS (HMMS)
To to this, when we compute
score(j, t) = max tscore(k, j),
k
we also record which state qk = qit1 was the last state of
the best sequence (qi1 , . . . , qit1 , qj ) for which tscore(k, j)
is maximal, as pred(j, t) = k.
This state may not be unique, we just pick one of them.
Again, this can be expressed by
pred(j, t) = arg max tscore(k, j).
k
The predecessors pred(j, t) are only defined for t = 2, . . . , T ,
but we can let pred(j, 1) = 0.
3.2. THE VITERBI ALGORITHM AND THE FORWARD ALGORITHM 121
The Viterbi algorithm is shown below.
The input to the algorithm is M = (Q, O, , A, B) and
the observed sequence O = (O1 , O2 , . . . , OT ) of length
T.
The output is a sequence of states
(y1, . . . , yT ) = (qI1 , . . . , qIT ), where (I1, . . . , IT ) is the
sequence of indices of the states.
122 CHAPTER 3. HIDDEN MARKOV MODELS (HMMS)
The Viterbi Algorithm
begin
for j = 1 to n do
score(j, 1) = (j)B(j, 1)
endfor;
( forward phase to find the best (highest) scores )
for t = 2 to T do
for j = 1 to n do
for k = 1 to n do
tscore(k) = score(k, t 1)A(k, j)B(j, t)
endfor;
score(j, t) = maxk tscore(k);
pred(j, t) = arg maxk tscore(k)
endfor
endfor;
( backward phase to find the optimal path )
IT = arg maxj score(j, T );
yT = qIT ;
for t = T to 2 by 1 do
It1 = pred(It, t);
yt1 = qIt1
endfor
end
3.2. THE VITERBI ALGORITHM AND THE FORWARD ALGORITHM 123
If we run the Viterbi algorithm on the output sequence
(S, M, S, L) of Example 3.2, we find that the sequence
(Cold, Cold, Cold, Hot) has the highest probability, 0.00282,
among all sequences of length four.
One may have noticed that the numbers involved, being
products of probabilities, become quite small.
Indeed, underflow may arise in dynamic programming.
Fortunately, there is a simple way to avoid underflow by
taking logarithms.
It immediately verified that the time complexity of the
Viterbi algorithm is O(n2T ).
Let us now to turn to the second problem, the evaluation
problem.
124 CHAPTER 3. HIDDEN MARKOV MODELS (HMMS)
This time, given a finite collection {M1, . . . , ML} of HMMs
with the same output alphabet O, for any observed out-
put sequence O = (O1 , O2 , . . . , OT ) of length T , find
which model M is most likely to have generated O.
More precisely, given any model Mk , we compute the
probablity tprobk that Mk could have produced O along
any path.
Then we pick an HMM M for which tprob is maximal.
It is easy to adapt the Viterbi algorithm to compute
tprobk . This algorithm is called the forward algorithm.
Since we are not looking for an explicity path, there is
no need for the backward phase, and during the forward
phase, going from t 1 to t, rather than finding the
maximum of the scores tscore(k) for k = 1, . . . , n, we
just set score(j, t) to the sum over k of the temporary
scores tscore(k).
At the end, tprobk is the sum over j of the probabilities
score(j, T ).
3.2. THE VITERBI ALGORITHM AND THE FORWARD ALGORITHM 125
The input to the algorithm is M = (Q, O, , A, B) and
the observed sequence O = (O1 , O2 , . . . , OT ) of length
T.
The output is the probability tprob.
The Foward Algorithm
begin
for j = 1 to n do
score(j, 1) = (j)B(j, 1)
endfor;
for t = 2 to T do
for j = 1 to n do
for k = 1 to n do
tscore(k) = score(k, t 1)A(k, j)B(j, t)
endfor;
%
score(j, t) = k tscore(k)
endfor
endfor;
%
tprob = j score(j, T )
end
126 CHAPTER 3. HIDDEN MARKOV MODELS (HMMS)
We can now run the above algorithm on M1, . . . , ML to
compute tprob1, . . . , tprobL, and we pick the model M
for which tprob is maximum.
As for the Viterbi algorithm, the time complexity of the
forward algorithm is O(n2T ).
Example 3.3. To illustrate the forward algorithm, as-
sume that our observant student also recorded the drink-
ing behavior of a professor at Harvard, and that he came
up with the HHM shown in Figure 3.3.
start
0.13 0.87
0.33 0.9
0.67
Cold Hot
0.1
0.05 0.8
0.95 0.2
N D
Figure 3.3: Example of an HMM modeling the drinking behavior of a professor at Harvard.
3.2. THE VITERBI ALGORITHM AND THE FORWARD ALGORITHM 127
However, the student cant remember whether he ob-
served the sequence NNND at Penn or at Harvard.
So he runs the forward algorithm on both HMMs to find
the most likely model. Do it!
128 CHAPTER 3. HIDDEN MARKOV MODELS (HMMS)