0% found this document useful (0 votes)
21 views56 pages

Information Theory and Coding Basics

The document summarizes key concepts in information theory and coding. It discusses how information theory addresses questions about compressing data from a source and communicating over a noisy channel. The two fundamental theorems by Shannon state that the entropy of a source determines the minimum rate to compress it, and the capacity of a channel determines the maximum rate of error-free communication over it. The capacity is defined as the maximum mutual information between the input and output of the channel over all possible input distributions.
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
21 views56 pages

Information Theory and Coding Basics

The document summarizes key concepts in information theory and coding. It discusses how information theory addresses questions about compressing data from a source and communicating over a noisy channel. The two fundamental theorems by Shannon state that the entropy of a source determines the minimum rate to compress it, and the capacity of a channel determines the maximum rate of error-free communication over it. The capacity is defined as the maximum mutual information between the input and output of the channel over all possible input distributions.
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

SUB:INFORMATION THEORY & CODING

SUB CODE:15EC34
Communications-centric Motivation: Communication over noisy channels.

Noise

Source → Encoder → Channel → Decoder → Destination

Examples: Email over internet, Voice over phone network, Data over USB storage.

In these examples, the source would be the voice, email text, and data, the channel would be the (wired or wireless)
phone network, internet, and storage device.
Purpose: Reproduce source data at the destination (with presence of noise in channel)

• Channel can introduce noise (static), interference (multiple copies at destination arriving at dif-ferent times
from radio waves bouncing on buildings, etc), or distortion (amplitude, phase, modula-tion)

• Sources have redundancy, e.g. the english language, can reconstruct words from parts: “th_t” → ”that”, and
images, where individual pixel differences are not noticed by the human eye. Some sources can tolerate
distortion (images).

A typical example of an encoder is broken into three components:

From Source → Source Encoder → Channel Encoder → Modulator → to Channel

• The source encoder outputs a “source codeword”, an e fficient representation of the source, removing
redundancies. The output is taken to be a sequence of bits (binary)

• The channel encoder outputs a “channel codeword”, introducing “controlled” redundancy to tol-erate errors
that may be introduced by the channel. The controlled redundancy is taken into account while decoding.
Note the distinction from redundancies present in the source, where the redundancies may be arbitrary (many
pixels in an image). The codeword is designed by examining channel properties. The simplest channel
encoder involves just repeating the source codeword.

• The modulator converts the bits of the channel codeword into waveforms for transmission over the channel.
They may be modulated in amplitude or frequency, e.g. radio broadcasts.

The corresponding decoder can also be broken into three components that invert the operations of the source
encoder above:

From Channel → Demodulator → Channel Decoder → Source Decoder → To Destination

We call the output of the demodulator the “received word” and the output of the channel decoder the “estimated
source codeword”.

Remark. Later we will see that separating the source and channel encoder/decoder as above does not cause a loss in
performance under certain conditions. This allows us to concentrate on each component independently of the other
components. Called the Joint Source Channel Coding Theorem.

2
Information Theory addresses the following questions:

• Given a source, how much can I compress the data? Are there any limits?

• Given a channel, how noisy can the channel be, or how much redundancy is necessary to minimize error in
decoding? What is the maximum rate of communication?

Information Theory interacts with many other fields as well, probability/statistics, computer science, eco-nomics,
quantum theory, etc. See book for details.

There are two fundamental theorems by Shannon that address these questions. First we define a few quantities.

Typically, the source data is unknown, and we can treat the input data as a random variable, for instance, flipping
coins and sending that as input across the channel to test the communication scheme.

We define the entropy of a random variable X, taking values in the alphabet X as


X
H (X ) = − p(x) log2 p(x)
x∈X

The base 2 logarithm measures the entropy in bits. The intuition is that entropy describes the “compress-ibility” of
the source.

Example 1. Let X ∼ unif{1, , 16}. Note we need 4 bits to represent the values of X intuitively. The
entropy is
16
H (X ) = − x= 1 1 6log 2 16
1 1 = 4 bits
X

1 1 1 1 1 1 1 1
_ _
Example 2. 8 horses in a race with winning probabilities 2 , 4 , 8 , 16 , 64 , 64 , 64 , 64 . Note
1 1 1 1 1 1 1 1 4 1
H (X ) = − 2log2 2 − 4log2 4 − 8 log2 8 − 16 log2 16 − 64 log2 64 = 2 bits

The naive encoding for outputting the winner of the race uses 8 bits, labelling the horses from 000 to 111 in binary.
However, since some horses win more frequently, it makes sense to use shorter encoding words for more frequent
winners. For instance, label the horses (0,10,110,1110,111100,111101,111110,11111). This particular encoding has
the property that it is “prefix-free”, no codeword is a prefix of another code-word. This makes it easy to determine
how many bits to read until we can determine which horse is being referenced. Now note some codewords are
longer than 3 bits, but on the average, we see that the average description length is

1 1 1 1 4
2 (1) + 4 (2) + 8 (3) + 16 (4) + 64 (6) = 2 bits
which is exactly the entropy.

Now we are ready to state (roughly) Shannon’s First Theorem:

Theorem 3. (Source Code Theorem) Roughly speaking, H (X) is the minimum rate at which we can compress a
random variable X and recover it fully.

Intuition can be drawn from the previous two examples. We will visit this again in more detail later.

3
Now we define the mutual information between two random variables (X , Y ) ∼ p(x, y) defined as

X p(x, y)
IX Y p x, y
( ; )= ( ) log2 p(x)p(y)
x,y

Later we will show that I (X ; Y ) = H (X) − H (X |Y ), where H (X |Y ) is to be defined later. Here H (X ) is to be


interpreted as the compressibility (as stated earlier), or the amount of uncertainty in X. H (X |Y ) is the amount of
uncertainty in X, knowing Y , and finally I(X ; Y ) is interpreted as the reduction in uncer-tainty of X due to
knowledge of Y , or a measure of dependency between X and Y . For instance, if X and Y are completely dependent
(knowing Y gives all information about X ), then H (X |Y ) = 0 (no uncer-tainty given Y ), and so I(X ; Y ) = H (X )
(a complete reduction of uncertainty). On the other hand, if X and Y are independent, then H (X |Y ) = H (X) (still
uncertain) and I(X ; Y ) = 0 (no reduction of uncer-tainty).

Now we will characterize a communications channel as one for which the channel output Y depends prob-abilitstic
on the input X . that is, the channel is characterized by the conditional probability P (Y |X). For a simple example,
suppose the input is {0, 1}, and the channel flips the input bit with probability ε, then
P (Y = 1 |X = 0) = ε, P (Y = 0 |X = 0) = 1 − ε.
Define the capacity of the channel to be

C = max I(X ; Y )
p(x)

noting here that I(X ; Y ) depends on the joint p(x, y) = p(y |x) p(x), and thus depends on the input distri-bution as
well as the channel. The maximization is taken over all possible input distributions.
We can now roughly state Shannon’s Second Theorem:

Theorem 4. (Channel Coding Theorem) The maximum rate of information over a channel with arbitrarily low
probability of error is given by its capacity C.

Here “rate” is “number of bits per transmission”. This means that as long as the transmission rate is below the
capacity, the probability of error can be made arbitrarily small.
We can motivate the theorem above with a few examples:

Example 5. Consider a noiseless binary channel, where input is {0, input. We1} and the output is equal to the
would then expect to be able to send 1 bit of data in each with no probability transmission through the channel
of error. Indeed if we compute

I(X ; Y ) = H (X) − H (X |Y ) = H (X)

since knowing Y gives complete information about X , and thus

C = max H (X ) = max p log p (1 p) log (1 p)


p(x) p [− 2 − − 2 − ]

taking derivative and setting to zero gives us

− log p − 1 + log(1 − p) + 1 = 0
1
and so p = 1 − p and p = 2 . Then
1
C = − log2 2 = 1 bit

4
as expected.

Example 6. Consider a noisy 4 symbol channel, where the input and output X , Y take values in {0, 1, 2, 3} and

1
P (Y = j |X = j) = P (Y = j + 1 |X = j) = 2
where j + 1 is taken modulo 4. That is, a given input is either preserved or incremented upon output with equal
probability. To make use of this noisy channel, we note that we do not have to use all the inputs when sending
information. If we restrict the input to only 0 or 2, then we note that 0 is sent to 0 or 1, and 2 is sent to 2 or 3, and
we can easily deduce from the output whether we sent a 0 or a 2. This allows us to send one bit of information
through the channel with zero probability of error. It turns out that this is the best strategy. Given an input
distribution pX (x), we compute the joint distribution:

1
p(x, y) = 2 pX(x) y ∈ {x, x + 1

( 0 else }
Note that the marginal distribution of the output is

1
pY (y) = 2 (pX(y) + pX(y − 1))

Applying the definition of conditional entropy, we have


_
H (Y |X ) = x 1p(x) H_ 2
=1
X

Then the mutual information is

I(X ; Y ) = H (Y ) − 1
The entropy of Y is
3
H (Y ) = − y=0 pY (y) log2 pY (y)

X
Since given pY (y), we can find pX corresponding to pY , it suffices to maximize I(X ; Y ) over possibilities q, pY
for pY . Letting pY (0) = p, pY (1) =
(2) = r, pY (3) = 1 − p − q − r, we have the maximization problem
max − 1 − p log p − q log q − r log r − (1 − p − q − r) log(1 − p − q − r)
p, q,r

Taking the gradient and setting it to zero, we have the equations

− log p − 1 + log(1 − p − q − r) + 1 = 0
− log q − 1 + log(1 − p − q − r) + 1 = 0
− log r − 1 + log(1 − p − q − r) + 1 = 0
which reduces to

p = 1− p − q − r
q = 1− p − q − r
r = 1− q − q − r

5
1
so that p = q = r = 4 . Plugging this into the above gives
_
1 1
C=−1−4 4log 4 = 1 bit
as expected.

Remark. Above, to illustrate how to use the noisy 4-symbol channel, suppose the input string is 0101110. If the 4-
symbol channel has no noise, then we could send 2 bits of information per transmission by using the following
scheme:

Block the inputs two bits at a time: 01, 01, 11, 10, and send the messages 1, 1, 3, 2, across the channel. Then the
channel decoder inverts back to 01, 01, 11, 10 and reproduces the input string.

However, since the channel is noisy, we cannot do this without incurring some probability of error. Thus instead, we
decide to sent one bit at a time, and so we send 0, 2, 0, , and the output of the channel is 0 or 1, 2 or 3, 0 or 1, 2 or
3, which we can easily invert to recover the input.

Furthermore, by the theorem, 1 bit per transmission is the capacity of the channel, so this is the best scheme for this
channel. An additional bonus is that we can even operate at capacity with zero proba-bility of error!

In general, we can compute the capacity of a channel, but it is hard to find a nice scheme. For instance, consider the
binary symmetric channel, with inputs {0, 1} where the output is flipped with probability ε. Note that sending one bit
through this channel has a probability of error ε. One possibility for using the channel involves sending the same bit
repeatedly through the channel and taking the majority in the channel decoding. For instance, if we send the same bit
three times, the probability of error is

3 2 2 3 2
P (2 or 3 flips) = ε + 3ε (1 − ε) = 3ε − 2ε = O(ε )

1
whereas the data rate becomes 3 bits per transmission (we needed to send 3 bits to send 1 bit of informa-tion). To
1
lower the probability of error we would then need to send more bits. If we send n bits, then the data rate is n and the
probability of error is
_
n
P (n/2
n εj(1
flips or more) = j =n/2 _ j − ε)n −j

X
We can compute the capacity of the channel directly in this case as well. Note that

1
X

H (Y |X ) = pX(x) H (ε) = H (ε)


x=0

1
and I(X ; Y ) = H (X) − H (ε) = − H (ε) − p log p − (1 − p) log p, which is maximized at p = 2 , corresponding to a
channel capacity of 1 − H (ε). Note that the proposed repitition codes above do not achieve this capacity for
arbitrarily low probability of error. Later we will show how to find such codes.

Following Chapter 2, defining and examining properties of basic information-theoretic quantities.

Entropy is for a discrete random variable X, taking values in a countable alphabet X with probability mass function
p(x) = P (X = x), x ∈ X . Later we will discuss a notion of entropy for the continuous case.

6
Definition. The entropy of X ∼ p(x) is defined to be
X
H (X ) = − p(x) log p(x) = Ep(x)( − log p(x))
x∈X

If we use the base 2 logarithm, then the measure is in bits. If we use the natural logarithm, the measure is in “nats”.
Also, by convention (continuity), we take 0 log 0 = 0.
Note that H (X) is “independent on labeling.” It only depends on the masses in the pmf p(x).
Also, H (X ) ≥ 0, which follows from the fact that − log(p(x)) ≥ 0 for 0 ≤ p(x) ≤ 1.

Example 7. The entropy of a Bernoulli(p) random variable,


_
0 w.p. p
X=
1 w.p. 1 − p
is H (X) = − p log p − (1 − p) log(1 − p).

We will often use the notation


H (p) − p log p − (1 − p) log(1 − p)
(binary entropy function) for convenience. Note the graph of H (p):

1.2
(-x*log(x)-(1-x)*log(1-x))/log(2)

0.8

0.6

0.4

0.2

0
0 0.2 0.4 0.6 0.8 1

In particular, we note the following observations:

• H (p) is concave. Later we will show that entropy in general is a concave function of the underlying
probability distribution.

• H (p) = 0 for p = 0, 1. This reinforces the intuition that entropy is a measure of randomness, since in the case
of no randomness p = 0, 1, there is no entropy.

• H (p) is maximized when p = 1/2. Likewise, this is the situation where a binary distribution is the most
random.

7
• H (p) = H (1 − p).

Example 8. Let
1 w.p. 1
2

1
X= 2 w.p. 4
3 w.p. 1
4

1 1 1 1 1 3
In this case H (X) = − 2 log 2 − 2 · 4 log 4 = 2 + 1 = 2 bits. Last time we viewed entropy as the average
length of a binary code needed to encode X. We can interpret such binary codes as a sequence of yes/no questions
that are needed to determine the value of X. The entropy is then the average number of ques-tions needed using the
optimal strategy for determining X . In this case, we can use the following simple strategy, taking advantage of the
fact that X = 1 occurs more often than the other cases:

• Question 1: Is X = 1?

• Question 2: Is X = 2?
1
In this case, the number of questions needed is 1 if X = 1, which happens with probability 2 and 2 other-wise, with
1 1 1 3
probability 2 , so the expected number of questions needed is 1 · 2 + 2 · 2 = 2 , which coincides with the
entropy.

Example 9. Let X1, , Xn be i.i.d Bernoulli(p) distributed (0 with probability p, 1 with probability 1 − p). Then the
joint distribution is given by
# of 0′s # of 1′s P Xi n−P Xi
p(x1, , xn) = p (1 − p) = (1 − p) i
p i

n
Observe that if n is large, then by the law of large numbers X n p
P
that i=1 i≈ (1 − ) with high probability, so
n(1−p) np −nH(p)
p(x1, , xn) ≈ (1 − p) p =2
nH(p )
The interpretation is that for n large, the probability of the joint is concentrated in a set of size 2 on which it is
approximately uniformly distributed. In this case we can exploit this compression and use n H (p) bits to distinguish
sequences in this set (per random variable we are using H (p) bits. For this
nH(p)
reason 2 is the “effective alphabet size” of (X1, , Xn).

Definition. The joint entropy of (X , Y ) ∼ p(x, y) is given by


H (X , Y ) = − p(x, y) log p(x, y) = Ep(x,y)( − log p(x, y))
x∈X y ∈Y

X X
Definition. The conditional entropy of Y given X is
H (Y |X) = x∈X
p(x) H (Y |X = x)

= −
X p(x) p(y |x) log p(y |x)
x∈X y ∈Y

X X
X
=− p(x, y) log p(y |x)
x, y

= Ep(x, y)( − log p(y |x))


We expect that we can write the joint entropy in terms of the individual entropies, and this is indeed the case, in a
result we call the Chain Rule for Entropy:

8
Proposition 10. (Chain Rule for Entropy) For two random variables X , Y,

H (X , Y ) = H (X ) + H (Y |X ) = H (Y ) + H (X |Y )

For multiple random variables X1, , Xn, we have that


n
H (X1, , Xn) = i=1
H (Xi |X1, , Xi−1)

X
Proof. We just perform a simple computation
H (X , Y ) = − x∈X y ∈Y
p(x, y) log p(x, y)

= −
X X log p(x)
p(x, y) − p(x, y) log p(y |x)
x∈X y ∈Y x∈X y ∈Y

X X X X
= H (X) + H (Y |X)

by symmetry (switching the roles of X and Y ) we get the other equality. For n random variables X 1, , Xn we
proceed by induction. We have just proved the result for two random variables. Assume it is true for n − 1 random
variables. Then
H (X1, , Xn) = H (Xn |X1, , Xn −1) + H (X1, , Xn −1)
n−1
= H (Xn |X1, , Xn −1) +H (Xi |X1, , Xi−1)
i=1
n X
= i=1
H (Xi |X1, , Xi −1)

X
where we have used the usual chain rule and the induction hypothesis.

We also immediately get the chain rule for conditional entropy:

Corollary 11. (Chain Rule for Conditional Entropy) For three random variables X , Y , Z, we have

H (X , Y |Z) = H (Y |Z) + H (X |Y , Z) = H (X | Z) + H (Y |X , Z)

and for n random variables X1, , Xn and a random variable Z, we have


n
X

H (X1, , Xn |Z) = H (Xi |Z , X1, Xi−1)


i=1

Proof. This follows from the chain rule for entropy given Z = z, and averaging over z gives the result.

Example 12. Consider the following joint distribution for X , Y in tabular form with the marginals:

y \x 1 2 py
1 1 3
1
4 2 4
p(x, y) = 1 1
2 0
4 4
p 1 1
x 2 2

9
Note that H (X) = 1 bit and H (Y ) = H 1 4
HX,HY ,HY X,H X Y ,HX,Y
=− 1 log 1
4 4
− 3log 3
4 4
=2− 3log
4
3 bits. The conditional entropies
_ _
are We compute the different entropies ( )( )( | )( | )() below:
_ _
1
H (X) = H
2
= 1 bit _
1
H (Y ) = H
4
1
1 3 3
= − 4 log 4 − 4 log 4
3
= 2 − 4 log 3 bits
H (Y |X) = P (X = 1) · H (Y |X = 1) + P (X = 2) · H (Y |X = 2)
1 _ 1
= 2 · H 2 _
1
= bits
2
H (X |Y ) = P (Y = 1) · H (X |Y = 1) + P (Y = 2) · H (X |Y = 2)
3 _ 1
= 4 · H 3 _ _ _

3 2 1 1 2
= 4 −
3 log 3 − 3 log 3
3 1
= 4 log 3 − 2 bits
1 1 1 1
H (X , Y ) = − 2 · 4 log 4 − 2 log 2
3
= 2 bits

general H (Y | X) _
Above we note that H (Y |X = 2) = H (X |Y = 2) = H (0) = 0 since in each case the distribution becomes deterministic. Note that in
H (X | Y ) and in this example H (X | Y ) ≤ H (X ), H (Y |X ) ≤ H (Y ). We will show this in general later. Also,
H (Y |X = x) has no relation to H (Y ), can be larger or smaller.

The conditional entropy is related to a more general quantity called relative entropy, which we now define.

Definition. For two pmf’s p(x), q(x) on X the relative entropy (Kullback-Leibler distance) is defined by

X p(x)
D(p k q) = p(x) log q(x)
x∈X

_
0 0 p
using the convention that 0 log q = 0 log 0 = 0 and p log 0 = ∞. In particular, the distance is allowed to be infinite. Note that the

“distance” is not symmetric, so D(p k q) D(q k p) in general, and so it is not a metric. However it does satisfy positivity. We will

show that D(p k q) ≥ 0 with equality when p = q. Roughly speaking this describes the loss from using q instead of p for compression.

Definition. The mutual information between two random variables X , Y ∼ p(x, y) is

I(X ; Y ) = x∈X y ∈Y
p(x, y) log p(x, y) = d(p(x, y) k p(x) p(y))
p(x) p(y )

X X
10
Relationship between Entropy and Mutual Information

I(X ; Y ) = p(x, y) log p(x, y)


_ _
x,y p(x) p(y )

=
X
x,y p(x, y ) log
p(x, y) p(x ) −y
log p(y) x
p(x, y)
X X X

= − H (Y |X ) + H (Y )
= H (Y ) − H (Y |X)

and thus we conclude by symmetry that

I (X ; Y ) = H (Y ) − H (Y |X ) = H (X ) − H (X |Y )

Remark. As we said before, this is the reduction in entropy of X given knowledge of Y , and we will show later that
I (X ; Y ) ≥ 0 with equality if and only if X , Y are independent.
Also, note that I(X ; X ) = H (X) − H (X , X ) = H (X), and sometimes H (X) is referred to as “self-informa-tion”.

The relationships between the entropies and the mutual information can be represented by a Venn Dia-gram. H (X ),
H (Y ) represent the two circles, the intersection is I (X ; Y ), the union is H (X , Y ), and the differences X − Y and
Y − X represent H (X |Y ) and H (Y |X), respectively.

We will also establish chain rules for mutual information and relative entropy. First we define the mutual
information conditioned on a random variable Z:

Definition. The conditional mutual entropy of X , Y given Z for (X , Y , Z) ∼ p(x, y, z) is

p x, y z
_ ( | )
I(X ; Y |Z) = H (X | Z) − H (X |Y , Z) = Ep(x, y,z) log p(x |z)p( y |z) _

noting that this is also I (X ; Y |Z = z ) averaged over z.


The chain rule for mutual information is then

n
I(X1, , Xn ; Y ) = i=1 I(Xi ; Y | X1, , Xi −1)

X
The proof just uses the chain rules for entropy and conditional entropy:

X Y H X, ,X H X, , X Y
(
I(X1, n; ) = n 1 ( 1 n| n) −
)
= H (Xi |X1, , Xi−1) − H (Xi |Y , X1, , Xi −1)
i=1
n
X
X
= I(Xi ; Y |X1, , Xi−1)
i=1

Now we define the relative entropy conditioned on a random variable Z, and prove the corresponding chain rule:

11
Definition. Let X be a random variable with distribution given by pmf p(x). Let p(y |x), q(y |x) be two conditional
pmfs. The conditional relative entropy of p(y |x) and q(y |x) is then defined to be
D( p(y |x) k q(y |x)) = x
p(x) D(p(y |X = x) k q(y |X = x))

X
= p(x)p(y x) log p(y |x)
x,y
| q(y x)
X |
p(Y X)

_ |
= E p(x, y) log q(Y | X) _
Then the chain rule for relative entropy is

_ p(X , Y ) _
E log
D(p(x, y) k q(x, y)) = p(x,y) q(X , Y )
_ p(Y X ) p(X )
|
= Ep(x,y) log q(Y | X ) + log q(X ) _
_ p(Y X ) _ _ p(X )
|
E
= p(x,y) log q(Y | X ) + E p(x) log q(X ) _
= D(p(y |x) k q(y |x)) + D(p(x) k q(x))

The various chain rules will help simplify calculations later in the course.

Jensen’s Inequality
Convexity and Jensen’s Inequality will be very useful for us later. First a definition:
′ ′
Definition. A function f is said to be convex on [a, b] if for all [a , b ] ⊂ [a, b] and 0 ≤ λ ≤ 1 we have

′ ′ ′ ′
f (λa + (1 − λ)b ) ≤ λf (a ) + (1 − λ)f (b )

If the inequality is strict, then we say f is strictly convex.

Note that if we have a binary random variable X ∼ Bernoulli(λ), then we can restate convexity as

f (Ep(x)(X)) ≤ Ep(x)(f (X ))

We can generalize this to more complicated random variables, which leads to Jensen’s inequality:

Theorem 13. (Jensen’s Inequality) If f is convex, and X is a random variable, then

f (E(X)) ≤ E(f (X))

and if f is strictly convex, then equality above implies X = E(X ) with probability 1, i.e. X is constant.

Proof. There are many proofs, but for finite random variables, we can show this by induction and the fact that it is
already true by definition for binary random variables. To extend the proof to the general case, can use the idea of
supporting lines: we note that

f (x) + m(y − x) ≤ f (y)

12
for all x and some m dependent on x. Set x = E(X ) to get

f (E(X )) + m(y − E(X )) ≤ f (y)

and integrating in y gives the result

f (E(X)) ≤ E(f (X))

If we have equality, then

f (y) − f (E(X )) + m(y − E(X)) = 0

or

f (y) = f (E(X)) + m(y − E(X ))

for a.e. y. However, by strict convexity this can occur only if y = E(X ) a.e.

We continue with establishing information theoretic quantities and inequalities. We start with applica-tions of
Jensen’s inequality.

Theorem 14. (Information Inequality) For any p(x), q(x) pmfs, we have that

D(p kq) ≥ 0
with equality if and only if p = q.

Proof. First let A = {x: p(x) > 0}, and compute


D(p k q) = x∈A
p(x) log p(x) q(x)

X
_
= x ∈
A p(x) − log q(x) p(x)
_

X q(x) !
≥ − log x∈A
p(x) · p(x)

X !
≥ − log x ∈X q(x)

= 0 X

Examining the proof above, equality occurs if and only if


X X
q(x) = q(x)
x∈A x∈X

and equality is achieved in the application of Jensen’s inequality. The first condition leads to q(x) = 0 whenever p(x)
= 0.

13
Equality in the application of Jensen’s inequality with strict convexity of − log( · ) implies that on A,

q(x) = q(x) = 1
p(x)

x∈A

X
and therefore q = p.

This has quite a few implications:

Corollary 15. I(X ; Y ) ≥ 0 if and only if X , Y are independent.

Proof. Noting I(X ; Y ) = D(p(x, y) kp(x) p(y)), we have that I(X ; Y ) ≥ 0 with equality if and only if p(x,
y) = p(x)p(y), i.e. X , Y are independent.

Corollary 16. If |X | < ∞, then we have the following useful upper bound:

H (X) ≤ log |X |

1
Proof. Let u(x) ∼ unif(X ), i.e. u(x) = |X | for x ∈ X . Let X ∼ p(x) be any random variable taking values
in X . Then

0 ≤ D(p ku)
= p(x) log p(x) u(x)

=
X
− H (X) + p(x) log |X | x

X
= − H (X) + log |X |
from which the result follows. Equality holds if and only if p = u.

Theorem 17. (Conditioning Reduces Entropy)

H (X |Y ) ≤ H (X)

with equality if and only if X , Y are independent.

Proof. Note I(X ; Y ) = H (X ) − H (X |Y ) ≥ 0, where equality holds if and only if X , Y are indepen-
dent.

Remark. Do not forget that H (X |Y = y) is not necessarily ≤ H (X ).

The conditional entropy inequality allows us to show the concavity of the entropy function:

Corollary 18. Suppose X ∼ p(x) is a r.v. Then


H (p) H (X)

is a concave function of p(x), i.e.

H (λp1 + (1 − λ)p2) ≥ λH (p1) + (1 − λ)H (p2)

14
Note that pi are pmf’s, so H (pi) is not the binary entropy function.

Proof. Let Xi ∼ pi(x) for i = 1, 2 defined on the same alphabet X . Let

1 w.p. λ
_
θ= 2 w.p. (1 − λ)
and define
X1 w.p. λ
_
w.p. (1 − λ) Z = Xθ = X2
Now note that H (Z |θ) = λH (p1) + (1 − λ)H (p2), and by the conditional entropy inequality,

λH (p1) + (1 − λ)H (p2) = H (Z |θ) ≤ H (Z) = H (λp1 + (1 − λ)p2)

Note that P (Z = a) = λp1(a) + (1 − λ)p2(a).


n

P
Corollary 19. H (X1, , Xn) ≤ i=1 H (Xi)
Proof. By the chain rule,
n n
H (X1, , Xn) =H (Xi |X1, , X i−1) ≤ i=1 i=1
H (Xi)

X X
using the conditional entropy inequality. Equality holds if and only if X1, , Xn are mutually indepen-
dent, following from Xi being independent from the joint (X1, , Xi −1).

Another application of Jensen’s inequality will give us another set of useful inequalities.

Theorem 20. (Log Sum Inequality) Let a1, , an and b1, , bn be nonnegative numbers. Then

n n
ai ! ai
b
ai log bi ≥ ai log i
i=1 i=1
_ P
_
X X

P
Proof. This follows from the convexity of t log t (with second derivative 1 ). Then defining
t

ai bi

P
we have that X = bi w.p. bi
n
1 ai log ai = E(X log X )
bi bi

i=1

X
P ≥ E(X ) log E(X )
P
ai _ P
ai
= bi log bi _

P P ai ai
and multiplying by bi gives the result. Note that equality holds if and only if bi =
P
bi , or
a Cb
i= i.

P P
Then we have the following corollaries:

Corollary 21. D(p kq) is convex in the pair (p, q), i.e.

D(λp1 + (1 − λ)p2 kλq1 + (1 − λ)q2) ≤ λD(p1 k q1) + (1 − λ) D(p2 kq2)

Proof. Expanding the inequality above, we want to show that

(λp1(x) + (1 λ)p2(x)) log λp1(x) + (1 ≤− λ)p2(x) λ p1(x) logp1(x) + (1 λ) p2(x) log p2(x)
x
− λq1(x) + (1 λ)q2(x) x
q1(x) − q2(x)

X − X X
Then for a fixed x, it suffices to show that
!_
i=1,2
λi pi ≥
λp i=1,2 λipi i λi
P _
pi
i i log
λi qi log i λi qi
X X

P
noting λ1 = λ, λ2 = (1 − λ), which is exactly the log sum inequality. Summing over x gives the result.

Theorem 22. Let (X , Y ) ∼ p(x, y) = p(x) p(y |x). Then

1. I (X ; Y ) is a concave function of p(x) for a fixed conditional distribution p(y |x).

2. I (X ; Y ) is a convex function of p(y |x) for a fixed marginal distribution p(x).

Remark. This will be useful in studying the source-channel coding theorems, since p(y |x) captures infor-mation
about the channel. Concavity implies that local maxima are also global maxima, and when com-puting the capacity
of a channel (maximizing mutual information between input and output of channel) will allow us to look for local
maxima. Likewise, convexity implies that local minima are global minima, and when we study lossy compression
we will want to compute given a fixed input distribution, what is the worst case scenario, which involves
minimizing the mutual information.

Proof. For (1), note that I (X ; Y ) = H (Y ) − H (Y |X). Since p(y) = x p(x) p(y |x) is a linear function of
px p y x H Y

P p y see that H (Y ) is a concave function

( ) for a fixed ( | ), and ( ) is a concave function of ( ), we


of p(x), since a composition of a concave function φ with a linear function T is concave:

φ(T (λx1 + (1 − λ)x2)) = φ(λTx1 + (1 − λ)Tx2) ≥ λφ(Tx1) + (1 − λ) φ(Tx2)

As for the second term, note that


X
H (Y |X ) = p(x) H (Y |X = x)
x

P
where H (Y |X = x) = − p(y |x) log p(y |x) which is fixed for a fixed p(y |x), and thus H (Y |X ) is a linear
function of p(x), which in particular is concave. Thus I(X ; Y ) is a sum of two concave functions which is also
concave:
X
(φ1 + φ2)(λx + (1 − λ)y) ≤ (λφi(x) + (1 − λ)φi(y)) = λ(φ1 + φ2)(x) + (1 − λ)(φ1 + φ2)(y)
i

This proves (1).

16
To prove (2), consider p1(y | x), p2(y | x), two conditional distributions given p(x), with corresponding joints p1(x,
y), p2(x, y) and output marginals p1(y), p2(y). Then also consider

pλ(y |x) = λp1(y |x) + (1 − λ)p2(y |x)

with corresponding joint pλ(x, y) = λp1(x, y) + (1 − λ)p2(x, y) and marginals pλ(y) = λp1(y) + (1 − λ)p2(y). Then
the mutual information Iλ(X ; Y ) using pλ is

Iλ(X ; Y ) = D(pλ(x, y) kp(x) pλ(y))


≤ λD(p1(x, y) k p(x)p1(y)) + (1 − λ) D(p2(x, y) kp(x)p2(y)) =
λI1(X ; Y ) + (1 − λ) I2(X ; Y )

which shows that Iλ is a convex function of p(y |x) for a fixed p(x).

Data Processing Inequality


We now turn to proving a result about Markov chains, which we define first:

Definition 23. The random variables X , Y , Z form a Markov chain, denoted X → Y → Z if the joint fac-tors in the
following way:

p(x, y, z) = p(x) p(y |x) p(z | y)


for all x, y, z.

Properties.

1. X → Y → Z if and only if X , Z are conditionally independent given Y , i.e.

p(x, z | y) = p(x | y) p(z | y)

Proof. First assume X → Y → Z. Then


p(x, z y) = p(x, y, z) = p(x, y) p(z | y) = p(x y) p(z y)
| p(y) p(y) | |

Conversely, suppose p(x, z | y) = p(x | y) p(z | y). Then

p(x, y, z) = p(y) p(x, z | y) = p(y) p(x | y) p(z | y) = p(x) p(y |x) p(z | y)

as desired.

2. X → Y → Z _ Z → Y → X , which follows from the previous property.

3. If Z = f (Y ), then X → Y → Z. This also follows from the first property, since given Y , Z is inde-pendent of
X:
_
1 z = f (y)
p(z |x, y) = p(z | y) =
0 else
Now we can state the theorem:

Theorem 24. (Data Processing Inequality) Suppose X → Y → Z. Then I(X ; Y ) ≥ I(X ; Z).

17
Remark. This means that along the chain, the dependency decreases. Alternatively, if we interpet I (X ; Y ) as the
amount of knowledge Y gives about X , then any processing of Y to get Z may decrease the information that we can
get about X .

Proof. Expand I(X ; Y , Z) by chain rule in two ways:

I(X ; Y , Z) = I(X ; Y ) + I(X ; Z |Y )

I(X ; Y , Z) = I(X ; Z) + I(X ; Y |Z)

Note that I(X ; Z |Y ) = 0 since X , Z are independent given Y . Also I(X ; Y |Z) ≥ 0, and thus

I(X ; Y ) = I(X ; Y , Z) = I(X ; Z) + I(X ; Y |Z) ≥ I(X ; Z)

as desired.

Now we illustrate some examples.

Example 25. (Communications Example) Modeling the process of sending a (random) message W to some
destination:

W → Encoder → X → Channel → Y → Decoder → Z

Note that W → X → Y → Z is a Markov chain, each only depends directly on the previous. The data pro-cessing
inequality tells us that

I(X ; Y ) ≥ I(X ; Z)

Thus, if we are not careful, the decoder may decrease mutual information (and of course cannot add any information
about X ). Similarly,

I (X ; Y ) ≥ I (W ; Z)

Example 26. (Statistics Examples) In statistics, frequently we want to estimate some unknown dis-tribution X by
making measurements (observations) and then coming up with some estimator:

X → Measurement → Y → Estimator → Z

For instance, X could be the distribution of heights of a population (gaussian with unknown mean and standard
deviation). then Y would be some samples, and from samples we can come up with an estimator (say the sample
mean / standard deviation). X → Y → Z, and by the data processing inequality we then note that the estimator may
decrease information. An estimator that does not decrease information about X is called a “sufficient statistic”.

Here is a probabilistic setup. Suppose we have a family of probability distributions {pθ(x)}, and Θ is the parameter to
be estimated. X represents samples from some distribution p Θ in the family. Let T (X ) be any function of X (for
instance, the sample mean). Then Θ → X → T (X) form a Markov Chain. By the data processing inequality, I (Θ; T
(X )) ≤ I (Θ; X).

Definition 27. T (X ) is a sufficient statistic for Θ if Θ → T (X ) → X also forms a Markov Chain. Note that this
implies I(Θ; T (X )) = I(Θ; X ).

18
Asnan example, consider pθ(x) ∼ Bernoulli(θ). And consider X = (X1, , Xn) ∼ i.i.d pθ(x). Then T (X ) =
T (X ) counts the number of 1’s. T hen the joint
i=1 Xi is a sufficient statistic for Θ, i.e. given T (X), the distribution of X does not depend on Θ. Here

P
1
xi = k
P (X = (x1, , xn) |T (X ) = k, Θ = θ) = n
_k _
0 else
P
k n k

where we note that P (X = (x ,


k Θ = θ) =
n
k
k
θ (1 θ ) n −k. 1
, xn) ∧ T (X) = k | Θ = θ) = θ (1 − θ) − (if P
xi = k) and P (T (X) =

| _ _ −
As a second example, consider f (x) N (θ, 1) and consider X = (X , ,X ) [Link] (x). Then T (X ) =
1 n θ ∼ n 1 n ∼ θ
n i=1 Xi is a sufficient statistic for Θ. Note that i=1 Xi ∼ N (nθ, n) so that T (X ) ∼ N (θ, 1/n). Con-
sider the joint of (X , ,X ,T X
P 1 n ( )), which is P
1 n 1
x θ 2
f (x1, , xn , x¯ ) = fθ(x1, , xn) P (T (X) = x¯ x1, , xn) = exp 2 i=1 | i − | x¯ = n xi
| 0 _ P _ else P
n
P
Note the θ-dependent part is − i=1 xiθ. Also, n

f (x¯ ) =
_
2
exp 2 |x¯ − θ | _
1 n
and if x¯ = n i xi, then the θ-dependent part of f (x¯ ) is − n x¯θ = − i=1 xiθ. Thus the θ-dependent
P fx, ,x x, T X X and Θ are independent.

parts cancel in ( 1 n| ¯ Θ) so that given ( ), we have that P


As a final example, if gθ ∼ Unif(θ, θ + 1), and X = (X1, , Xn) ∼ i.i.d gθ, then a sufficient statistic for θ is
T X, ,X max X , min X
_
In this case we note that ( 1 n) = _ i i i i

n
P (x 1 ∈ I , , x 1 n
∈ I n | max X = M , min X = m) =
i i i i i=1
Ii ∩ [ m, M ]|
|

Y
which does not depend on the parameter θ. Given the max and min, we know that the X i are in between, so we do
not need to know θ to compute the conditional distribution.

Now consider the problem of estimating an unknown r.v. X ∼ p(x) by observing Y with conditional distri-
ˆ ˆ_
bution p(y |x). Use g(Y ) = X as the estimate for X, and consider the error P e = Pr(X X ). Fano’s
inequality relates this error to the conditional entropy H (X |Y ). This is intuitive, since the lower the
entropy, the lower we expect the probability of error to be.

Theorem 28. (Fano’s Inequality) Suppose we have a setup as above with X taking values in the finite alphabet X.
Then
H (Pe) + Pe log(|X | − 1) ≥ H (X |Y )

where H (Pe) is the binary entropy function Pe log Pe + (1 − Pe) log (1 − Pe). Note that a weaker form of the
inequality is (after using H (Pe) ≤ 1 and log(|X | − 1) ≤ log(|X |):

Pe ≥ H (X |Y ) − 1
log( )

|X |
Remark. Note in particular from the first inequality that if Pe = 0 then H (X |Y ) = 0, so that X is a function of Y .

ˆ
1
( X

ˆ
_ X

Proof. Let E = 0 X = X be the indicator of error, and note E ∼ Bernoulli(Pe). Examine H (E , X |Y )


using the chain rule: H (E |Y ) + H (X |E , Y ) = H (E , X |Y ) = H (X |Y ) + H (E |X , Y ). Note that
H (E |Y ) ≤ H (E) = H (Pe). Also,

H (X |E , Y ) = P (E = 0)H (X |E = 0, Y ) + P (E = 1) H (X |E = 1, Y )

where in the first component H (X |E = 0, Y ) = 0 because knowing Y and that there was no error gives us
information that X = g(Y ), so it becomes deterministic. Also, given E = 1, Y , we know that X takes
ˆ
values in X \{X }, an alphabet of size |X | − 1. Then using the estimate H (X) ≤ log |X |, we have that
H (X |E , Y ) ≤ Pe log(|X | − 1)

and finally since H (E |X , Y ) = 0 since knowing X , Y allows us to determine whether there is an error,

H (X |Y ) ≤ Pe log(|X | − 1) + H (Pe)
which is the desired inequality.

Remark. The theorem is also true with g(Y ) random. See 2.10.1 in book. In fact, we just make a slight
ˆ
adjustment. In the proof above we have actually showed the result with H (X | X ) instead of H (X |Y ).
Then the rest is a consequence of the data processing inequality, since
ˆ ˆ ˆ
I (X , X ) ≤ I(X , Y ) _H (X ) − H (X | X ) ≤ H (X ) − H (X |Y ) _H (X |Y ) ≤ H (X | X )
Also, the inequality may be sharp, there exist distributions for which equality is satisfied. See book.

2
Suppose we have X , X i.i.d with entropy H (X ). Note P (X X px
′ P
probability below: ′ = )= x∈X ( ) . We can bound this

′ −H(X)
Lemma 29. P (X = X ) ≥ 2 , with equality if and only if X is uniformly distributed.

x
Proof. Since 2 is convex, and − H (X ) = Ep(x) log p(x), we have that
E log p(x) 2 ′
2−H(X) = 2 p(x)
≤ Ep(x) p(x) = p(x) = P (X = X )
x∈X

With the same proof, we also get


Lemma 30. Let X ∼ p(x), X ∼ r(x). Then

P (X = X ′) ≥ 2−H(p)−D(p kr)
or

P (X = X ′) ≥ 2−H(r)−D(r kp)

20
Proof.
p(x)
Ep(x) log p(x)− log
2−H (p)−D(p kr) = 2 r(x)

≤ Ep(x)r(x) = P

(X = X )
and swapping the roles of p(x), r(x), the other inequality holds.

These match intuition, since the higher the entropy (uncertainty), the less likely the two random variables will agree.

Asymptotic Equipartition Property


Now we will shift gears and discuss an important result that allows for compression of random variables. First recall
the different notions of convergence of random variables:

• Usual convergence (of real numbers, or deterministic r.v.):

an → a _ ∀ε, ∃N s.t. for n > N , |an − a| ≤ ε

• Convergence a.e. (with probability 1):

Xn → X a.e. _ P (ω: Xn(ω) → X (ω)) = 1

• Convergence in probability (weaker than above)

p
Xn_X _lim P (|Xn(ω) − X (ω)| > ε) = 0 ∀ε
n

• Convergence in law/distribution:
d

Xn_X _P (Xn ∈ [a, b]) _ P (X ∈ [a, b]) for all [a, b] with P (X = a) = P (X = b) = 0

(or cdf’s converge at pts of continuity, or characteristic functions converge pt-wise, etc...)

Now we work towards a result known as the Law of Large Numbers for i.i.d r.v. (sample means tend to the
expectation in the limit). First,

2
Lemma 31. (Chebyshev’s Inequality) Let X be a r.v. with mean µ and variance σ . Then
2
σ
2
P (|X − µ| > ε) ≤ ε
Proof.

Z 2 Z 2
σ2 = |X − µ| dP ≥ |X −µ|2>ε |X − µ| dP
2 2
≥ ε P (|X − µ| > ε)

from which we conclude the result.

21
Theorem 32. (Weak Law of Large Numbers (WLLN)) Let X1, , Xn be i.i.d ∼ X with mean µ
2
and variance σ . Then
n
1
n

i=1
Xn _X
in probability, i.e. X
n
lim P 1 X µ>ε =0
n _n n− _ !
_ i=1
_
_
_
X __

2
1 n _ _ σ
Proof. Note that letting A = X , the mean of A is µ and the variance is
P
shev’s inequality, n n
i=1
n n n . Then by Cheby-
2
_ 1 n
_ ! σ
P n Xn − µ > ε ≤ nε
2
_0
_ i =1 _
_
_
X _
_

_ _
so that we have convergence in probability.

With a straightforward application of the law of large numbers, we arrive at the following:

Theorem 33. (Asymptotic Equipartition Property (AEP)) If X1, , Xn are i.i.d ∼ p(x), then
1
− n log p(x1, , xn) _ H (X )
in probability.

Proof. Since p(x1, , xn) = p(x1) p(xn), we have that

n
1 1
− n log p(x1, , xn) = − n log p(xi) _ − E( log p(X)) = H (X )

X
i=1

a.e. using the Law of Large Numbers on the random variable log p(Xi), which are still i.i.d, since (X , Y )
independent implies (f (X ), g(Y )) independent.

− nH( X)
Remark. Note that this shows that p(x1, , xn) ≈ 2 , so that the joint is approximately uniform on a smaller
space. This is where we get compression of the r.v., and motivates the following definition.

( n)
Definition 34. The typical set A ε with respect to p(x) is the set of sequences (x1, , xn) such that

2−n (H(X)+ε) ≤ p(x1, , xn) ≤ 2−n(H(X)−ε)


( n)
Properties of A ε :

1. Directly from the definition, we have that

(n) 1
(X1, , Xn) ∈ Aε _ H (X ) − ε ≤ − n log p(x1, , xn) ≤ H (X ) + ε
_ 1 − n log p(x1,
_
, xn) − H (X) <ε
_ _
_ _

_ _
_
( n)
2. P A ε ≥ 1 − ε for large n.

Proof. The AEP shows that for some


_
P_ 1 _
_
_
_

, Xn) − H (X) > ε < ε


_
_
− n log
p(X1,
_

_
and comparing with the first property, this gives a bound on the measure of the complement of the typical
set, and this gives the result.

_ (n)_
_
n(H(X )−ε) ≤ A ≤ 2n(H(X )+ε)
3. (1 − ε)2
_ ε _
Proof. p x , , xn) ≥
1 = (x1, ,xn) ( 1 Aε(n) p(x1, , xn) ≥ _ Aε
(n) _
2−n(H(X)+ε)
X X _ _

_ _
gives the RHS inequality, and
_=
1−ε<P _ Aε
(n) Aε(n) p(x1, , xn) ≤ _ Aε
(n) _
2−n(H(X)−ε)
X _ _

_ _
gives the LHS inequality (only true for large enough n).

Given these properties, we can come up with a scheme for compressing data. Given X 1, , Xn i.i.d ∼ p(x) over the
alphabet X , note that sequences from the typical set occur with very high probability > (1 − ε). Thus we expect to
n( H (X)+ε)
be able to use shorter codes for typical sequences. The scheme is very simple. Since there are at most 2
sequences in the typical set, we need at most n(H (X ) + ε) bits, and rounding up this is n(H (X) + ε) + 1 bits. To
indicate in the code that we are describing a typical sequence, we will append a 0 to any code word of a typical
sequence. Thus to describe a typical sequence we use at most n(H (X ) + ε) + 2 bits. For non-typical sequences,
since they occur rarely we can afford to use the lousy description with at most n log |X | + 1 bits, and appending a 1
at the beginning to indicate that it is not typical, we describe non-typical sequences with n log |X | + 2 bits. Note that
this code is 1-1: given the code word we will be able to decode by looking at the first digit to see whether it is a
typical sequence, and then looking in the appropriate table to recover the sequence.

n n
Now we compute the average length of the code. To fix notation, let x = (x1, , xn) and let l(x ) denote the length of
the corresponding codeword. Then the expected length of code is

n n n
E(l(x )) = n
p(x ) l(x )
x
X n n
p(x
≤ A (n) p(x )[ n(H (X) + ε) + 2] + A (n) c )[n log |X | + 2]
ε ε
X X
_ _
( n) ( n)
≤ P (A ε ) [ n(H (X) + ε)] + (1 − P (A ε ))[n log |X |] + 2
≤ n(H (X ) + ε) + εn log |X | + 2

≤ n(H (X ) + ε )

′ ′
with ε = ε + ε log |X | + 2/n. Note that ε can be made arbitrarily small, since ε → 0
_
_ n→∞ _ ′
ε → 0. Thus

1
n ′
E n l(x ) ≤ H (X) + ε
23
for large n, and on the average we have represented X using ≈ nH (X) bits.
Question: Can one do better than this? It turns out that the answer is no, but we will only be able to answer this
much later.

Remark. This scheme is also not very practical:

• For one, we cannot compress individual r.v., we must take n large and compress them together.

• The table to look up will be large and inefficient.

• This requires knowledge of the underlying distribution p(x).

We will later see schemes that can estimate the underlying distribution as the input comes, a sort of “uni-versal
encoding scheme”.

Next time: We will discuss entropy in the context of stochastic processes.

Definition 35. A stochastic process is an indexed set of r.v.’s with arbitrary dependence. To describe a
stochastic process we specify the joint p(x1, , xn) for all n .

And we will be focusing on stationary stochastic processes, i.e. ones for which

p(x1, , xn) = p(xt+1, , xt+ n)

for all n, t, i.e. independent of time shifts.


n
In the special case of i.i.d. random variables, we had that H (X1, , Xn ) = i=1 H (X) = nH (X ), and we
know how the entropy behaves for large n . We will ask how HX, , X ) grows for a general stationary
( 1 n P
process.

Entropy Rates for Stochastic Processes


Continuing on, we can now define an analogous quantity for entropy:

Definition 36. The entropy rate of a stochastic process (arbitrary) {Xi } is

H (X ) = nlim 1 H(X1,
n
, Xn )
→∞

when the limit exists. Also define a related quantity,



H ( ) = lim H (Xn X , , Xn )
X n | 1 −1
→∞
when the limit exists.


Note that H (X ) is the Cesaro limit of H (Xn |X1, , Xn −1) (using the chain rule), whereas H (X ) is the
direct limit.

24
Also, for i.i.d random variables, this agrees with the usual definition of entropy, since

1
n H (X1, , Xn) _ H (X)
in that case.


Theorem 37. For a stationary stochastic process, the limits H (X ) and H (X ) exist and are equal.

Proof. This follows from the fact that H (Xn |X1, , Xn −1) is monotone decreasing for a stationary pro-cess:

H (Xn+1 |X1, , Xn) ≥ H (Xn+1 |X2, , Xn)

= H (Xn |X1, , Xn−1)


where the inequality follows since conditioning reduces entropy and the equality follows since the process is
stationary. Then the sequence is monotone decreasing and also bounded below by 0 (entropy is nonneg-ative) and
thus the limit and the Cesaro limit exists and are equal.

Here is an important result that we will not prove (maybe later), by Shannon, McMillan, and Breiman:

Theorem 38. (Generalized AEP) For any stationary ergodic process,

1
− n log p(X1, , Xn) _ H (X ) a.e.
Thus, the results about compression and typical sets generalize to stationary ergodic process.
Now we introduce the following notation for Markov Chains (MC) and also some review:

1. We will take X to be a discrete, at most countable alphabet, called the state space. For instance,
X could be {0, 1, 2, }.
2. A stochastic process is a Markov Chain (MC) if

P (Xn = j |X1 = i, X2 = i2, , Xn −1 = in −1) = P (Xn = j |Xn−1 = i)

i.e. conditioning only depends on the previous time. We call this probability which we denote
n,n+1
Pij the one-step transition probability. If this is independent of n, then we call the process
a time-invariant MC.

Thus, we can write X0 → X1 → and Xn+1 → Xn → → X0 (since the joint on any finite seg-
ment factorizes with dependencies only on the previous random variable.

3. For a time-invariant MC, we express Pij in a compact probability transition matrix P (may
P

P
be “infinite dimensional”), where Pij is the entry of the i-th row, j-th column. Note that 1 for each j ij=
i.

Given a distribution µ on X , which we can write

µ = [ p(0), p(1), ]

25
if at time 0 the marginal distribution is µ, then at time 1 the probabilities can be found by the for-mula

X
P (X1 = j) = P (X0 = i) Pij = [µP ]j
i

where we use the notation row vector µ times matrix P . It turns out that a necessary and suffi-cient condition
for stationarity is that the marginal distributions all equal some distribution µ for every n, where µ = µP . For
a time-invariant MC, the joint depends only on the initial distribution P (X0 = i) and the time invariant
transition probability (which is fixed), and thus if P (X0 = i) = µi where µ = µP , stationarity follows.

4. Under some conditions on a general (not necessarily time-invariant) Markov chain (ergodicity, for instance),
no matter what initial distribution µ0 we start from µn → µ (convergence a.e. to a unique stationary
distribution µ).
Perron-Frobenius has such a result about conditions on P .

An example of a “not so nice” Markov chain is a Markov chain with two isolated components. Then the
limits exist but depend on the initial distribution.

Now we turn to results about Entropy Rates. For stationary MC’s, note that


H (X ) = H (X )
= lim H (Xn |X1, , Xn−1)
n→∞
= lim H (Xn |Xn −1)
n→∞
= H (X2 |X1)
= − i, j
µi Pij log Pij

X
Example 39. For a 2 state Markov chain with

P= 1−α α

β 1−β _ _
α β
we can spot a stationary distribution to be µ1 = α + β , µ2 = α + β . Then we can compute the entropy rate
accordingly
β α
H (X ) = H (X2 |X1) = α + β H (α) + α + β H (β)

Example 40. (Random Walk on a Weighted Graph)

For a graph with m nodes X = {1, , m}, assign weights wij ≥ 0 with wij = wji and wii = 0 to edges of the
graph. Let wi = j wi j, the total weight of edges leaving i, and W =i <j wij, the sum of all the
weights. Then define transition probabilities p resulting stationary distribution.
= wi j

P ij wi and examine the P


(An example random walk fixes a starting point, say 1, which corresponds to initial distribution µ0 = [1,
wi
0, ]). The stationary distribution (turns out to be unique) can be guessed: µi = 2W . To verify:
µp
[µP ]j = i
i ij =
i
wi · wij = 1
2W wi 2W i
wij = wj
2W

X X X
26
which shows that µP = µ. Now we can compute the entropy rate:

H (X ) = H (X2 |X1)
= − µi pij log pi j
ij
X
wi wij wij
= − ij
2W
· wi
log wi

X wij wij wij 2W


= − ij
2W
log wi + ij
2W
log wi

X X
wi j wi
_
= H 2 W _ + H_ 2 W _
which we can interpret as the entropy of the edge distribution plus the entropy of all the node distribu-tion, so to
speak.
In the special case where all the weights are equal, let Ei be the number of edges from node i, and let E
Ei
be the total number of edges. Then µi = 2E
and

E1 Em
_
H (X ) = log(2 E) − H 2 E , ,2 E _

Example 41. (Random Walk on a Chessboard) In the book, but is just a special case of the above.

Example 42. (Hidden Markov Model) This concerns a Markov Chain {Xn } which we cannot observe (is hidden),
and functions (observations) Yn(Xn) which we can observe. We can bound the entropy rate of Y n. Yn does not
necessarily form a Markov Chain.

Data Compression
We now work towards the goal of arguing that entropy is the limit of data compression. First let’s con-sider
properties of codes. For instance, a simple example is the Morse code, where each letter is encoded by a sequence of
dots and dashes (binary alphabet), where the idea is to put frequently used letters with shorter descriptions, and the
descriptions have arbitrary length.

We define a code C for some source (r.v.) X to be a mapping from X → D , finite length strings from a D-ary
alphabet. We denote C(x) to be the code word of x and l (x) to be the length of C(x). The expected codeword length
is then L(C) = E(l(X )).
Reasonable Properties of Codes.
We say a code is non-singular if it is one to one, i.e. each C(x) is easily decodable to recover x.
Now consider the extension C∗ of a code C to be defined by C(x1, , xn) = C(x1) C(xn) (concatenation of codes), for encoding many
realizations of X at once (for instance, to send across a channel). It would be good for the extension then to be non-singular as well. We
call a code uniquely decodable (UD) if the extension is nonsingular.

Even more specific, a code is prefix-free (or prefix) if no codeword is a prefix of another code word. In this case the
moment a code word is transmitted we can decode it immediately (without having to receive the next letter)

Example binary codes:


x singular nonsingular but UD UD but not prefix prefix
1 0 0 0 0
2 0 1 01 10
3 1 00 011 110
4 10 11 0111 111

27
The goal is to minimize L(C), and to this end we note a useful inequality.

Theorem 43. (Kraft Inequality) For any prefix code over an alphabet of size D, the codeword lengths l1, , lm must
satisfy
X
m

D−l ≤ 1i
i=1

Conversely, for any set of lengths satisfying this inequality, there exists a prefix code with the specified lengths.

Proof. The idea is to examine a D-ary tree, where from the root, each branch represents a letter from the alphabet.
The prefix condition implies that the codewords correspond to the leaves of the D-ary tree. The argument is then to
extend the tree to a full D-ary tree (add nodes until the depth is lmax, the max length). The number of leaves at lmax
l
for a full tree is D MAX, and if a codeword has length li, then the number of leaves at the bottom level that correspond
l −li
to this word is D MAX . Thus, the codewords of a prefix encoding satisfy
X

i
Dl −li
MAX
≤ Dl MAX

and the result follows. The converse is easy to see by reverseing the construction. Associate each code-word to
l −li
D MAX consecutive leaves (which have a common ancestor).

Next time we will show that the Kraft inequality needs to hold also for UD codes, in which case there is no loss or
benefit from focusing on prefix codes.

First we note that we can extend the Kraft inequality discussed last week to countably many lengths as well.

Theorem 44. (Extended Kraft Inequality) For any prefix code over an alphabet of size D, the code-word lengths
l1, l2, must satisfy
X

D−l ≤ 1i
i=1

Conversely, for any set of lengths satisfying this inequality, there exists a prefix code with the specified lengths.

Proof. Given a prefix code, we can again consider the “infinite” tree with the codewords at the leaves. We still
extend the tree so that the tree is “full” in a different sense, adding nodes until every node either has D children or no
children. Such a tree can then be associated to a probability space on the (count-ably-infinite) leaves by the formula
−depth(l)
P (X = l) = D , and the sum of all the probabilities is 1. Since we have added dummy nodes, we have only
li
increased the sum of D , and thus

X

D−l ≤ 1i
i=1

28
Conversely, as before we can reverse the procedure above to construct a D-ary tree with leaves at depth li.

It turns out that UD codes also need to satisfy Kraft’s inequality.

Theorem 45. (McMillan) The codeword lengths l1, l2, , lm of a UD code satisfy

m
X

D−l ≤ 1
i
i=1

and conversely, for any set of lengths satisfying this inequality, there exists a UD code with the specified lengths.

Proof. The converse is immediate from the previous Theorem 44, since we can find a prefix code satis-fying the
specified lengths, and prefix codes are a subset of UD codes.
Now given a UD code, examine Ck, the extension of the code, where C(x1, , xk) = C(x1) C(xk), and
k

,x P lx
l(x1, k) = m=1 ( m). Now look at k klMAX
! −l(x1, ,xk) −m
x ∈X D−l(x) = x1, ,xk D = m=1 a(m)D

X X X
−1 k
which is a polynomial in D , where a(m) is the number of sequences with code length m, and since C is
m
nonsingular, a(m) ≤ D . Thus

k klMAX
!
x ∈X D−l(x) m=1 1 = klmax

X X
Taking k-th roots, and taking the limit as k → ∞, we then have that
X
−l(x) 1/k
D ≤ (klmax) →1
x∈X

which gives the result.

Remark. Thus, there is no loss in considering prefix codes as opposed to UD codes, since for any UD code there is
a prefix code with the same lengths.

Optimal Codes
Given a source with probability distribution (p1, , pm), we would want to know how to find a prefix code with the
minimum expected length. The problem is then a minimization problem:
m
min L =pili
i=1
s.t.
m X D−li ≤ 1

X
i=1
+
li ∈ Z

29
where D is a fixed positive integer (the base). This is a linear program with integer constraints. First, we ignore the
integer constraint and assume equality in Kraft’s inequality. Then we have a Lagrange multi-plier problem. The
derivative of the Lagrangian gives the equations
p − li
i = λD log D
pi
D−l i = λ log D
m
pi
i=1 λ log D = 1
X 1
λ log D = 1
1
λ = log D
p
D−l i = i


Thus the optimal codeword lengths are li = − logD pi which yields the expected code length
X X
∗ ∗
L = pi li = − pi logD pi = H (X)
i i

the entropy being in base D. This gives a lower bound for the minimization problem.

Theorem 46. For any prefix code for a random variable X, the expected codeword length satisfies

L ≥ H (X)
−li
with equality if and only if D = p i.

Proof. This has already been proven partially via Lagrange multipliers (with equality constraint in
−l
D i
P − li
Kraft), but we now turn to an Information Theoretic approach. Let r i = c where c = iD . Then
X X

L − H (X ) = i
pili + i
pi log pi
−li
= pi log D +pi log pi

i i
X pi X
= i
pi log ri
i
− pilog c

X X
≥ D(p kr)
≥0

and thus L ≥ H (X) as desired. Equality holds if and only if log c = 0 and p = r, in which case we have equality in
− li
Kraft’s inequality (c = 1) and D = pi, which finishes the proof.

Now if pi _ D , we would like to know how close to the entropy lower bound we can get. A good subop-timal code is the Shannon-Fano code,
−li

which involves a simple rounding procedure.

∗ ∗ ∗ P ∗
Theorem 47. For optimal codeword lengths li , , lm for a random variable X, L = i pili satisfies

H (X) ≤ L ≤ H (X) + 1

30
Proof. Recall from Theorem 46 that lengths of log 1 achieve the lower bound if they are integers. We
pi
now suppose we used log 1 instead. This corresponds to the Shannon code, but first we note that the
pi
Kraft’s inequality :

lengths still satisfy l m


i
D−⌈log 1/pi⌉ ≤ i
D−log 1/pi ≤ 1

X X
and also that this gives the desired upper bound:
_
1 1 = H (X ) +
_ _
i pi log pi
_≤ i pi logpi
+ 11
X X

Since the optimal code is better than this Shannon code, this gives the desired upper bound H (X ) + 1. The lower
bound was already proved in the previous theorem 46.

Remark 48. The overhead of 1 in the previous theorem can be made insignificant if we encode blocks (X1, , XB) to
increase H (X1, , XB ), effectively spreading the overhead across a block. By the previous
theorem, we can find a code for (X 1, , XB) with lengths l(x1, , xn ) such that

H (X1, , Xn) ≤ E [l(x1, , xn)] ≤ H (X1, , Xn) + 1


Then per outcome, we have that

H (X1, , Xn) E [l(x1, , xn)] H (X1, , Xn) 1


n ≤ n ≤ n +n

E [l(x1, , xn)]
and letting Ln = n , we see that Ln → H (X ) if X1, , Xn is stationary. Note the connection to
the Shannon-McMillan-Breiman Theorem (or AEP in the case of i.i.d X i). There we saw that we can design codes
using typical sequences to get a compression rate arbitrarily close to the entropy. This remark uses a slightly diff
erent approach using Shannon codes.

Huffman Codes
Now we consider a particular prefix code for a probability distribution, which will turn out to be optimal. The idea is
essentially to construct a prefix code tree (as in the proofs of Kraft’s inequality) from the bottom up. We first look at
the case D = 2 (binary codes)
Huffman Procedure:

1. Order the symbols in decreasing order of probabilities, so x 1, , xm satisfies p1 ≥ p2 ≥ ≥ pm.

2. Merge the two symbols of lowest probability (last 2 symbols) into a single tree associated with probability
pm + pm −1.

p p
m −1 + m

x
xm −1 m


Denote this symbol xm −1.
3. Repeat this procedure until all symbols are merged into a giant tree.

4. Assign {0, 1} to the branches of the tree (each node should have one branch labeled 0 and one branch labeled
1).

If D > 2, then there is an obvious generalization, except that every merge step reduces the number of symbols by D
− 1, so that the procedure works without a hitch when the number of symbols is 1 mod D − 1. The appropriate way
to deal with this is to add dummy symbols associated with probability 0 until the number of symbols is 1 mod D −
1, and perform the procedure.

Example 49. Consider four symbols a, b, c, d with probabilities 0.5, 0.3, 0.1, 0.1 and D = 3. Then we add a fourth
dummy variable and perform the procedure. There are two steps, first merging c, d, dummy, and then merging the
rest:

0 2
1
a b
0 1 2
c d dummy

so that C(a) = 0, C(b) = 1, C(c) = 20, C(d) = 21. The expected length is 1.2 ternary digits, and the entropy is 1.06
ternary digits.

Remark 50. (Huffman vs Shannon codes) Note that there are examples where Shannon codes give longer or shorter
codewords than Huffman codes for particular codes, but in expected length, Huffman always performs better, as we
will soon prove. In the examples below, we will also see that Huffman’s pro-cedure does not give a unique tree in
cases involving ties, and how ties are broken can produce codes with different sets of lengths.

Consider four symbols a, b, c, d with probabilities 1/3, 1/3, 1/4, 1/12. There are two possible binary code trees:

b abcdcd

corresponding to different length codes. To c, corresponding to probability 1/4, Shannon’s code gives a code of
length 2, which is shorter than the code for c in the first code tree.

Also, an easy example shows that the Shannon code can be worse than optimal for a particular outcome. Consider a
binary random variable X distributed Bernoulli(1/10000). Huffman code uses just 1 bit, whereas the Shannon code
uses ⌈log 10000⌉ = 14 bits for one, and 1 bit for the other. But of course on the average it is off by at most 1 bit, as
proven previously.

Now we prove the optimality of the Huffman Code. First assume that p1 ≥ p2 ≥ ≥ pm.
Lemma 51. For any distribution, any optimal prefix code satisfies

1. If pj > pk then lj ≤ lk

2. The two largest codewords have the same length

and there exists an optimal prefix code satisfying

3. The largest codewords differ only in t he last bit and correspond to the least likely symbols.


Proof. First we prove (1). Given an optimal prefix code Cm, and let pj > pk corresponding to lj , lk, and let Cm be
the code resulting from swapping xj and xk. Then

L(Cm ) − L(Cm) = pj(lk − lj) + pk(lj − lk) = (pj
− pk)(lk − lj)

and since L(Cm) is optimal, L(Cm ) − L(Cm) ≥ 0, and since pj > pk, pj − pk > 0 and therefore lk − lj ≥ 0 as well,
proving the result.
To prove (2), suppose that the two largest codewords in an optimal code are not the same length. Then we can delete
the last bit of the longest codeword, preserving the prefix property but shortening the expected length, contradicting
optimality.

To prove (3), we suppose that the two largest codewords x, y differ in more than the last bit. This means that the two
are leaves corresponding to different parents. Note that both must have at least one sibling, or else we can trim the
code while preserving prefix property and shortening the expected length, which contradicts optimality. Then we
can swap one sibling of x with y and preserve the structure of the tree, preserving the prefix property and the
expected length (since x, y are at the same depth). This gives a new code with the same expected length but which
satisfies (3).

Now we are in a position to prove optimality.

Theorem 52. The Huffman code is optimal, i.e. any other code has larger expected length.

Proof. We will proceed by induction on the number of symbols. If there are only 2 symbols, then triv-ially the only
code has length 1 and agrees with the Huffman procedure. Now assume that the Huffman procedure gives the

optimal expected length for m − 1 symbols, which we denote L m −1. Now suppose that we have a code for m
symbols Cm with lengths li that is not generated from the Huffman procedure. By the lemma we can rearrange the
code so that the three properties are satisfied (in particular the two largest code words differ only by the last bit).
Now as in the Huffman procedure, we combine the two largest code words of lengths l m to a single code word of

length lm −1 = lm − 1 (truncating the last bit) corresponding to probability p m −1 + pm. This gives a code for m − 1

symbols, which we denote Cm −1. By the induction hypothesis L m −1 ≤ L(Cm −1). Also we have that
m
X

L(Cm) = pili
i=1 m
−2
′ ′
X

= pili + pm −1(lm −1 + 1) + pm(lm −1 + 1)


i=1
= L(Cm−1) + pm −1 + pm
∗ ∗ ∗
and thus L(Cm) ≥ L m −1 + pm −1 + pm = L m, where L m is the expected length of the Huffman code on m
symbols. This completes the proof.
Other Codes with Special Properties
Shannon-Fano-Elias Coding
Here we describe an explicit code with lengths ⌈log 1/ p i ⌉ + 1 (slightly worse than lengths in Shannon
code). Let X = {1, , m} and assume p(x) > 0 for all x. Let F (x) = a<x p(a) be the cdf (step function
in this case), and let F (x) be the same as F (x) except that F (x) = F (x) +
¯ ¯ 1
px

P 2 ( ). However instead of
1 ¯
using F (x) + 2 p(x), we round off this number for l(x) bits of precision, so that F (x) and its truncation
1 1 p(x)
differs by at most 2l(x) . Then if we take l(x) = ⌈log 1/px ⌉ + 1, then we have that 2l(x) < 2 , so that the
¯ ¯ ˜
truncation of F (x) does not drop below F (x). Denote the truncated version of F (x) by F (x).
˜ ˜
The claim is that the l(x) bits of F (x) form a prefix code. Note F (x) = 0.z1 zl(x), and the prefix condi-
tion tells us that no other code word is of the form 0.z1 zl(x) ∗ , or in other words, no other code word
˜ ˜ −l(x)
is in the interval [F (x), F (x) + 2 ). So in this way we can identify each codeword with an interval, and
the prefix condition is satisfied if and only if the intervals do not overlap. By construction, this is satis-
˜ ¯ ¯
fied, since the interval corresponding to F (x) is between F (x) = F (x) − p(x)/2 and F(x) + p(x)/2 = F (x +
1), and thus the intervals are disjoint. See diagram:

F (x + 1) −l(x)
¯ ˜( )+2
( ) F x
F x ˜
( )
F x
F (x)

Thus we have a prefix code, with expected length


__ _ _
L = x p(x) 1 log p(x)
+ 1 ≤ H (X ) + 2
X

So the overhead is slightly worse, and in fact there is a way to modify the procedure so that the overhead is reduced
to 1. The main advantage to this coding is that we do not need to rearrange the probabilities. The second advantage
is that this code can be slightly generalized to an “arithmetic coding”, a coding that is easily adapted to encoding n
n
realizations of the r.v. X (which is how we increase the entropy to make the overhead insignificant). More
n n n+1
specifically, it is already easy to see that F (x ) can be computed from p(x ), and more importantly, F (x ) can be
n
computed from F (x ). See Section 13.3 of text.
Lempel-Ziv Algorithm
This is from Section 13.4, 13.5. Can we compress information without knowing p(x)? It turns out there is an
algorithm which can compress information adaptively, as the information is received, in such a way to achieve rates
of compression close to the entropic lower bound. The algorithm is as follows:
Given a block in the source sequence, e.g. 01001110110001:

1. Parse the block with commas, so that every term in the sequence is unique. i.e. place next comma after the
shortest string not already in the list.
For the given example,

0, 1, 00, 11, 10, 110, 001

2. Letting n be the block length, suppose we end up with c(n) strings. We encode each string by an index
describing the prefix of the string (excluding the last bit), and the last bit. Possible prefixes in this case is the
null prefix ∅ and the first c(n) − 1 strings (the last string cannot be a prefix of the previous strings by
construction). Thus there are c(n) prefixes, so we need ⌈log c(n)⌉ bits to describe the prefix.

For the example, we have n = 14 and c(n) = 7 and ⌈log c(n)⌉ = 3. Thus, we index the prefixes:

Prefix ∅ 0 1 00 11 10 110
Index 000 001 010 011 100 101 110

and then we encode each string:

String 0 1 00 11 10 110 001


Code 000, 0 000, 1 001, 0 010, 1 010, 0 100, 0 011, 1

Then the block 01001110110001 is encoded roughly as

(000, 0)(000, 1)(001, 0)(010, 1)(010, 0)(100, 0)(011, 1)

(Need some scheme to either indicate block sizes at the beginning or represent parentheses and commas)
Note that knowing the algorithm used to construct this sequence, we can decode easily: First two are 0, 1,
the third says 10, etc. We can construct a table of prefixes as we decode.

Note that in this case the “compressed” sequence is actually longer! The original block is 14 bits and the compressed
sequence is around 28 bits. However, for a very long block we can start to see savings. In gen-eral, for a block of
size n, we achieve a rate of compression of

c(n)(log (⌈c(n)⌉) + 1) → H (X )
n

where the convergence can be shown for a stationary ergodic source sequence. Such codes are called uni-versal
codes, since we do not need to know p(x), yet the average lengths per outcome still become close to the entropy H
(X ).
Lossy Compression
We can then go further and allow for losses according to some distortion function. Given a source out-2
ˆ ˆ ˆ ˆ
come, we reproduce as X , so that d(X , X ) is minimized, and for instance d(X , X ) = (X − X ) . Letting D
be the maximum average distortion, and R the rate of compression, Rate distortion theory is about possible (R, D)
points. We may return to this later. This is very common, for instance in JPEG compres-sion, where we allow for a
slight blurring of the image to achieve a compressed representation.

Channel Capacity

We now turn to communication over a channel, working towards Shannon’s Second Theorem, (Theorem 4)
discussed in the overview at the beginning of the notes. Recall a channel takes input X and outputs Y according to
the transition probability p(y |x).

X → Channel → Y
n
We want to find the best way to make use of the channel. As in compression, it turns out to be useful to look at X
(sending n pieces of data at a time through the channel in some form). The picture looks like

Xn Yn

n n
For each point X ∈ X , the picture above describes possibilities for what the channel can output. The images may
certainly overlap, and the goal is the maximize the number of inputs such that the output sets do not intersect. Then
every output can be uniquely decoded without any error. We may also allow for a small probability of error as well.
In any case, the scheme relies on redundancy to achieve a robust communication code suitable for the channel.

First, we establish a few definitions:


A discrete channel processes a sequence of r.v. Xk and outputs Yk according to some probability distri-bution,
where Xk and Yk come from discrete alphabets. A channel is memoryless if the output only depends on input at a
particular time, i.e.

p(yk |xk , y0, , yk −1) = p(yk |xk)


and
n
Y

n n
p(y |x ) = p(yi |xi)
i=1
n
Recall that y is notation for (y1, , yn).
We will mainly be dealing with discrete memoryless channels (DMC), which is completely described by a
probability transition matrix p(y |x).
The “Information” Channel Capacity of a DMC p(y |x) is defined as

C = max I(X ; Y )
p(x)

noting that I(X ; Y ) depends entirely on the joint p(x, y) = p(x)p(y |x) and p(y |x) is fixed by the channel.
We will show that this quantity describes the max communication rate (in bits per channel use) of infor-mation we
can send through the channel with arbitrarily low probability of error. For now we first com-pute C for a few easy
examples.

_1 y = x
Example 53. Noiseless binary channel . This channel is entirely deterministic, and p(y |x) = 0 y_x , so
that Y = X (input always equals output).

0 0
X Y
1 1
We then expect to be able to send 1 bit perfectly across the channel at every use, and so we expect the channel
capacity C to be 1 bit. Then we note

I(X ; Y ) = H (Y ) − H (Y |X) = H (Y ) ≤ 1

noting that H (Y |X ) = 0 since Y = X (Y is a function of X ), and since Y is binary the entropy is upper bounded by
1, with equality if and only if Y ∼ Bern(1/2). Then equality can easily be achieved if we choose X ∼ Bern(1/2), and
this shows that C = 1 bit per transmission.

Example 54. Noisy channel with nonoverlapping outputs. Here we have

1
0
2
X Y
3
1
4

_ 1 w.p. 1/2
so that if X = 0 then Y = 2 w.p. 1/2 , and similarly for X = 1. Since given Y we can determine completely
X, we expect to be able to communicate 1 bit of information per transmission. This time we expand I(X ; Y ) a little
differently when computing C:

I(X ; Y ) = H (X) − H (X |Y ) = H (X) ≤ 1

Here we note that X is a function of Y , so H (X |Y ) = 0, and again the upper bound can be achieved with X ∼
Bern(1/2). Thus C = 1 bit per transmission. Note that the transition probabilites above can be arbi-trary and the
same result holds.

Example 55. Noisy typewriter .

0 0
1 1
X Y

2 2
3 3

Consider the diagram above, where each branch is labeled with probability 1/2. Note that if we restrict ourselves to
using only 0 or 2 (or 1 and 3), then this situation looks exactly like the previous example, which allows us to
communicate at 1 bit per transmission through this channel. It turns out that this is the best we can do.

I(X ; Y ) = H (Y ) − H (Y |X) ≤ 1
= 1 and therefore H (Y |X ) = 1, and H (Y ) ≤ 1,
2, 3}, which is easily achieved if we take X quite
37 match our intuition for how best to use
noting that given X, Y ∼ Bern(1/2) so that H (Y |X = x) log 4 =
2. Note that equality is achieved when Y ∼ unif{0, also to be
unif{0, 1, 2, 3}. Note that the X chosen does not the channel, but
it allows us to compute C easily.

Example 56. Binary symmetric channel . This is perhaps the most common channnel model used, and models a
channel that is a little noisy, where a bit is flipped with some probability.

1−α
0 α
0
X Y
α
1 1
1−α

Thus X is preserved with probability 1 − α and flipped with probability α. Computing the capacity, we have that

I(X ; Y ) = H (Y ) − H (Y |X)
1
X

= H (Y ) − p(x) H (Y |X = x)
0
= H (Y ) − H (ε)

≤ 1 − H (ε)

noting that given X, Y is Bern(ε) distributed. Equality is achieved if Y ∼ Bern(1/2). Again, we note that X ∼
Bern(1/2) works, and the capacity is 1 − H (ε). It is not clear how to use this channel to operate slightly below
capacity with arbitrarily low probaiblity of error.
Since we will use this channel often, we will often abbreviate with BSC(α).

Example 57. Binary erasure channel. This is perhaps the third most common, after the Gaussian channel (covered
later). Here the input is preserved, and with some probability is erased, denoted by Y = e.

1−α
0 0
α

X e Y
α
1 1
1−α

Computing the capacity, we have

I (X ; Y ) = H (Y ) − H (Y |X)
= H (Y ) − H (α)

It is tempting to use H (Y ) ≤ log 3, with equality when Y ∼ unif{0, e, 1}, however there is no distribution on X that
can achieve this. Instead, we compute more directly. Note that
_
e w.p. α
Y=
X w.p. 1 − α
38
which is a disjoint mixture, and therefore

H (Y ) = (1 − α)H (X) + H (α)

Plugging in, we then have

I(X ; Y ) = (1 − α) H (X ) ≤ 1 − α

with equality when X ∼ Bern(1/2). Thus the capacity is 1 − α bits per transmission.
Here we have a little bit of intuition. Noting that the channel drops α proportion of the input, the best we can do is
communicate at a rate of 1 − α, though it is not clear how to do so.

We can generalize the procedure used in a few of the examples above to a more general class of channels. Suppose
that our DMC has finite alphabet, and thus our transition probability can be written in matrix form P ij = P (Y = j |X
= i). We say that a DMC (finite) is symmetric if the rows of Pij are permuta-tions of each other, and the same for
the columns. For example,

0.1 0.4 0.1 0.4


_
Pij = 0.4 0.1 0.4 0.1 _
is such an example. Another example is the BSC(α), with transition matrix
_ _
1−α α
α 1−α

We then can state a result for symmetric channels:

Theorem 58. For symmetric channels with input alphabet X of size n and output alphabet Y of size m, the capacity
is

C = log |Y| − H (Pij , j = 1, , m)

i.e. log |Y| minus the entropy of a row of P.

Proof. As usual, we compute

I(X ; Y ) = H (Y ) − H (Y |X)
≤ log |Y| − H (Pij , j = 1, , m)

where we note that given X, the probabilities in each row are permutations, and thus correspond to the same entropy.
Equality holds if Y ∼ unif(Y). The most sensible candiate for X is then also X ∼ unif(X ), which will turn out to work
due to symmetry in columns:
X
p(y) = p(x)p(y |x)
x∈X X
1
= p(y |x)
|X | x∈X
C
=
|X |
39
where the last equality follows from symmetry (all column sums are the same). Thus Y is uniform, and
we have achieved the upper bound established earlier. Thus the capacity is log |Y| − H (Pij , j = 1, , m) as
desired.

Note that in the proof above we only needed the columns of P ij to sum to the same value, which is slightly weaker
than the symmetry condition. Thus if we then define a weakly symmetric channel to be one where the rows are
permutations and the columns have the same sum, the above result holds for this case as well. A simple example of
such a channel that is not symmetric is

1/3 1/6 1/2


_
Pi j = 1/3 1/2 1/6 _
Next time we will establish the connection between the channel capacity and the maximum rate of com-munication
across the channel.

The information channel capacity C = maxp(x) I(X ; Y ) satisfies the following properties:

• C ≥ 0, (since I(X ; Y ) ≥ 0)

• C ≤ log |X | and C ≤ log |Y| (since I(X ; Y ) ≤ H (X) ≤ log |X | and similarly for Y )

• I (X ; Y ) is a continuous, concave function of p(x), and thus a local maximum of I(X ; Y ) is the global
maximum.

Channel Coding Theorem


First, we provide some intuition for why I (X ; Y ) is the relevant quantity to study for channels. As we did with
n n
results about compression, we will make use of the typical sequences for X and Y in n uses of the channel.

n n
X Y
n n
typical set of Y given X

n
typical set of X typical set of Yn

Suppose we fix the distribution of X. Recall that to make the best use of the channel, we want to use as many
n
elements of X whose images do not overlap (in the best case scenario). Once we have chosen such elements the
n
situation looks like the noisy nonoverlapping outputs channel. Recall that the size of the typ-ical set of Y is around
nH(Y ) n n nH(Y |X )
2 and the size of the typical set of Y given X is 2 . We can then upper bound the maximal number
of non-overlapping images to be simply

2nH(Y )
nI(X ; Y )
2nH(Y |X) = 2

40
nI(X ; Y )
These correspond to 2 sequences, and using these we can communicate at a rate of

1 nI(X ; Y )
n log 2 = I(X ; Y )

bits per transmission. Thus we want to choose X that maximizes this rate.
Preliminaries for Channel Coding Theorem
Recall that when communicating across a channel, we have

n
X Channel _Yn ˆ
W (message) _ Encoder _ Decoder _ W (estimate of message)
p(y |x)

n n
W is drawn from the set {1, , M }, the encoder forms X (W ) and decoder receives Y (W ) with distribu-
n n ˆ _
tion p(y |x (W )). If W W , then we say there is an error.
Question: What is the maximum number of messages we can send with arbitrarily low probability of error?

n n n
Denote a discrete channel by ,py x, n-th extension of a DMC is given by ,p y x ,
n
Y ) where (X (| ) Y). The (X ( | )
k k −1
p(yk |x , y ) = p(yk |xk), k = 1, , n

We also impose a no feedback condition, i.e. p(xk


|x
k −1 , yk −1) = p(xk |xk −1) (cannot use previous output
n n Q n

to improve next input). Also, p(y |x ) = i=1 p(yi |xi).


Definition: An (M , n)-code for the channel (X , p(y |x), Y) consists of

1. An index set {1, , M }


n n n n
2. An encoding function for X : {1, , M } → X , yielding codewords X (1), , X (M ). We call a
codebook the collection of all codewords.

n
3. A decoding function g: Y → {1, ,M}
The probability of error for the error code for input i is given by

λi = Pr{g(Y n) _ i |Xn = X n(i)} = n


g(y ) _ i
n
p(y |X (i))
n

X
(n)
and the maximal probability of error, denoted λ , is defined to be max λ.
1≤i ≤M i
(n) 1 M
The average probability of error, denoted pe is defined to be M i=1 λi. In particular, we note that
(n) (n)
pe ≤ λ . This is the probability of error when I ∼ unif{1, , M }. P
The rate of an (M , n)-code is defined to be

log M
R= bits/transmission n

nR (n)
A rate R is achievable if there exists a sequence of (⌈2 ⌉, n)-code for which λ → 0. Then as n → ∞, the
(operational) capacity is the supremum of all achievable rates.

41
The channel coding theorem roughly follows the intuition described earlier, but we need a result about which
n n
sequences in the joint (X , Y ) are typical.
( n) n n
Definition: The set A ε of jointly typical sequences {(X , Y )} is given by:

(n) (n) (n) (n)


n n n n
Aε = {(x , y ) ∈ A(X ,Y ),ε: x ∈ AX ,ε , y ∈ AY ,ε }
n n
(same as usual definition of typical, except we require that the components are also typical). In other words, (x , y )
is typical if:
1 n _
1. _ − n log p(x ) − H (X) < ε
_ _

2. _ _
_
− n log p(y
1 n _< ε
) − H (Y )_
_

3. _ _
_
− n log p(x , y
1 n n _ ) − H (X , Y )_
_

We
_
have the following theorem: _

Theorem 59. (Joint AEP) Let (Xi, Yi) be i.i.d ∼ p(x, y) for i = 1, 2, Consider

n
Y

n n n n
(X , Y ) ∼ p(x , y ) = p(xi, yi)
i=1
Then

n n ( n)
1. Pr((X , Y ) ∈ A ε ) → 1 as n → ∞
( n)
2. |A ε | ≤ 2n(H(X ,Y )+ε)
n n
˜ ˜
3. Suppose X
n
,Y
n
are independent, but have the same marginals as X n , Y n , respectively. Then the
˜ ˜ n n
joint (X , Y ) ∼ p(x )p(y ), and

n n (n) −n [I (X ; Y )−3ε]
P ((X˜ , Y˜ ) ∈ Aε ) ≤ 2
and for large n,

n n (n) −n [I(X ; Y )+3ε]


P ((X˜ , Y˜ ) ∈ Aε ) ≥ (1 − ε)2

Proof. The first two properties follow immediately from the usual AEP. The third is slightly different, and tells us
n n n n
that taking a pair consisting of a typical X and a typical Y does not give a jointly typical sequence (X , Y ). The
n n
proof is a direct computation. Note that for elements (x , y ) in the typical set
(n) n −n [H(X)−ε] n −n [H(Y )−ε]
Aε , we have that p(x ) ≤ 2 and p(y ) ≤ 2 , and thus
n n
˜ ˜ (n) n n
P ((X , Y ) ∈ Aε ) = n n
p(x )p(y )
x ,y ) A(n)
( X
∈ ε

2−n[H(X)−ε]2−n[H(Y )−ε]
≤ n n
x ,y ) A(n)
( X ∈ ε

≤2 n [H(X ,Y )+ε] −n[H(X)−ε] −n[H(Y )−ε]


2 2
= 2−n [I(X ; Y )−3ε]
The lower bound is proved in the same way, using the lower bounds in the properties of the typical set, found below
( n) n (H(X ;Y )−ε) n n[H(X)+ε]
Theorem 33. In particular, |A ε | ≥ (1 − ε)2 and p(x ) ≥ 2 , and thus
n n (n) −n[I(X ; Y )+3ε
(( ˜ ˜ )∈ ) ≥ (1 − ) 2
P X ,Y Aε ε

Now we are ready to present the Channel Coding Theorem (Shannon’s Second Theorem, referenced in Theorem 4).

Theorem 60. (Channel Coding Theorem) Let C be the information capacity C = maxp(x) I(X ; Y ). Then all rates
nR (n)
below C are achievable, i.e. if R < C, then there exists a sequence of (⌈2 ⌉, n)-codes for which λ → 0 (the
maximum probability of error decays)
nR (n)
Conversely, if R > C, then R is not achievable, i.e. for any sequence of (⌈2 ⌉, n)-codes for which λ → 0, it must
be the case that R < C.

Ideas (due to Shannon):

• Allow for arbitrarily small probability of error

• Use the channel many times, exploiting AEP

• Construct a codebook at random, and compute the probability of error, and then show existence of a
codebook (nonconstructive proof)

nR
Proof. (of Theorem 60) First we prove achievability of any R < C. First we fix p(x), and generate an (⌈2 ⌉, n)-
nR n n Q
code as follows: For each i ∈ {1, , ⌈2 ⌉}, pick X (i) independently from the distribution p(x ) = i p(xi). We
then get the codebook

=_ Xn (1)_

C X n(⌈2nR ⌉) _
i.e. a random codebook. We will use this code and compute the probability of error of the following pro-cess:

nR n
1. Draw W ∼ unif{1, , ⌈2 ⌉}, and send X (W ) to channel
n n n n Q
2. Channel outputs Y from distribution p(y |x = X (W )) = i p(yi |xi(W )) (assuming DMC and no
feedback, as stated earlier).

ˆ n ˆ n
3. Decoder guesses W if (X (W ), Y ) are jointly typical, and no other codeword is jointly typical
n
with Y . If no such codeword exists, report “error”
ˆ
Then if _ W W , this is also an error. Then we compute the expected probability of error (averaged over

possible codebooks):

43
ˆ
If ε = {W
_W } then
(n)
Pr(ε) = Pr(C) Pe (C)
C nR
X 1 2
= C
Pr(C) 2nR
w=1
λw(C)

X
X X
= Pr(C) λ1(C)
C
= P (ε |W = 1)

where the third line follows since λw(C) is the same for all w by symmetry.
n n ( n) nR
Now let Ei = {(X (i), Y ) ∈ A ε } for i = 1, , 2 , the event that the i-th codeword is typical with the received
sequence. Then the event that there was an error given that the first codeword was sent is simply

c
ε = E1 ∪ E2 ∪ ∪ E2n R

which has probability upper bounded by the union bound:

2n R
c X

Pr(ε) ≤ Pr(E1 ) + Pr(Ei)


i=2

n n n n
Q W = 1, the input to the channel is X (1) and the output is Y . Note that (X (1), Y ) has distribu-tion given
Given
by p(xi)p(yi |xi). By Joint AEP, we know that for sufficiently large n, most sequences with this distribution are
n n ( n) c
jointly typical. More precisely, Pr{(X (1), Y ) ∈ A ε } ≥ 1 − ε for large n, which means that Pr(E1 ) < ε for large
n.
n n n n
As for other Ei, i > 1, we note that X (i) and X (1) are independent by construction, and thus X (i) and Y are
n n n n
independent. However, X (i) and Y have the same marginals as X (i) and Y , p(x) and p(y), and thus by Joint
AEP again, we have that

n n ( n) −n(I(X ; Y )−3ε)
Pr(Ei) = Pr{(X (i), Y ) ∈ A ε } ≤ 2
Now we have that

c
2n R
Pr(ε |W = 1) ≤ Pr(E1 ) +Pr(Ei) i=2

X
ε + 2−n(I(X ; Y )−R −3ε)

and if R < I(X ; Y ), then the second term can be made smaller than ε, for 3ε < I(X ; Y ) − R. Now we can choose
p(x) that maximizes I(X ; Y ), so that this is true for R < C. Thus we now have that

Pr(ε) ≤ 2ε

for sufficiently large n (and sufficiently small ε). Since this is an average probability over all possible code-books,
∗ ∗
there exists a codebook C such that Pr(ε | C ) ≤ 2ε (otherwise Pr(ε) > 2ε).

What remains is to show that we can make the maximum probability of error small using C . Although not all the
codewords will have a small probability of error, a fraction of them will. In fact, at least half of the codewords must
∗ ∗
satisfy λi(C ) ≤ 4ε. To show this, let r be the proportion of codewords satisfying λ i(C ) ≤ 4ε. Then we note that

∗ 1 ∗ ∗ !
Pr(ε | C ) = 2nR
λ
i ≤
4ε λi(C ) + λi >4ε λi(C ) ≥ (1 − r)4ε ≥ 2ε

X X
44
and thus r ≥ 1/2.
∗ n(R −1/n)
Now we simply throw away the codewords for which λi(C ) > 4ε. We now have 2 codewords for which
(n) ∗
λ (C ) ≤ 4ε (the max probabability is bounded by 4ε for the remaining codewords). Using this set of codewords
reduces the rate from R to R − 1/n, but taking n sufficiently large this reduction in rate can be made arbitrarily small.
Thus, any rate < C is achievable.
nR (n)
Conversely, suppose we have a sequence of (2 , n)-codes where λ → 0, then we show that R ≤ C.
(n) n
First consider a special case. Suppose that Pe = 0, i.e. λi = 0 for all codewords. Then g(Y ) = W with probability
n nR
1, and H (W |Y ) = 0. Assuming W ∼ unif{1, , 2 }, we have that

nR = H (W )
n n
= H (W |Y ) + I (W ; Y )
n n
= I (X ; Y )
n
X

≤ I(Xi ; Yi)
i=1
≤ nC

n
The third line follows since g(Y ) = W is a one-to-one function, and thus the mutual information does not change.
The fourth line follows by the fact that the channel is DMC:

n n
n n i− 1 i−1 n
I(X ; Y ) = H (Yi |Y )− H (Yi |Y ,X )
i=1 i=1
X
n
X
X
n X
≤ H (Yi ) − H (Yi |Xi)
i=1 i=1
n
X

≤ I(Xi ; Yi)
i=1

and the last line follows from the fact that the mutual information is bounded above by the capacity. This shows that
R ≤ C.
n
The general case follows from the fact that H (W |Y ) is close to 0 by Fano’s inequality: Above we upper bound H
n (n) (n)
(W |Y ) by 1 + Pe nR, which is o(n), (Pe → 0 as n → ∞).

This gives the existence of a code, and establishing good codes that achieve rates close to the capacity is quite diffi
cult. Read 7.11 in the texts for an explicit code, called Hamming Codes. In 7.12 the text dis-cusses what happens if
we allow feedback (using the output of the channel of the previous codeword when deciding how to encode the next
input). It turns out that the capacity is exactly the same!

In summary, we have established two results. The Source Coding Theorem (Theorem 46) tells us that the rate of
compression is lower bounded R > H by the entropy. The Channel Coding Theorem (Theorem 60) tells us that the
rate of communication across the channel is upper bounded R < C by the channel capacity. In general, to
communicate a source across a channel, we can first compress the source with some compression code, and then
encode it further with a channel code and send it across the channel. This is a two step process, that allows us to
communicate the source across the channel at any rate R with H < R < C. The question is whether there is another
scheme that takes into consideration the source and the channel together, and allows us to achieve better rates of
communication. The answer is no, given by the Joint Source-Channel Coding Theorem.

45
Theorem 61. (Joint Source-Channel Coding Theorem) If V1, , Vn is a finite-alphabet stochastic process
(n)
satisfying AEP, then there exists a joint-source channel code for which Pe → 0 if H (V) ≤ C. Conversely, if H (V)
> C, then the probability of error cannot be made arbitrarily small.
( n)
Proof. For achievability, we simply use the typical set. Since the source satisfies AEP, there is a typical set A ε of
n( H(V)+ε)
size bounded by 2 which contains most of the probability. We can use n(H (V) + ε) bits to describe the
elements of the typical set, and simply ignore non-typical elements (generate an error if we see a non-typical
element). As long as H (V) + ε < C, we can use the channel code to transmit the source.

The probability of error in this case is the probability that either the source sequence is not typical, or the
channel decoder makes an error. Each is bounded by ε, and thus the error is bounded by 2ε. Thus, if H (V) < C then
(n)
Pe → 0.
(n) (n) n
Conversely, suppose that Pe is small for a given code, where X (V ) is the input to the channel and
n n
ˆ n ˆ (n)
the output is some V . Then Fano’s inequality gives us that H (V | V ) ≤ 1 + nPe log |V|. Then using a
similar proof to converse of the previous theorem, we have that
n
H (V )
H (V) ≤ n
H (V | Vˆ ) + I(V ; Vˆ )
n n n n

≤ n n n
1 (n) I (V ; Y )
≤ n + Pe log |V| + n
→C

and thus H (V) ≤ C.

So the separation of source and channel codes in theory is good. But in practice, we may not want codes that use
block lengths that are so large. Also, the channel may vary with time. In these cases it may then be important to
consider the source and channel jointly.

Differential Entropy
Now we move onto continuous time random variables, to discuss the Gaussian channel (second most com-monly
used channel)
First a few definitions. Let F (x) = Pr(X ≤ x) be the cdf or a random variable X. If F is continuous, then

we say R
that X is a continuous random variable. Then f (x) = F (x) is the probability density function
(pdf) for X, and f = 1. Let SX = {x: f (x) > 0}, the support set of X .
Then we define the differential entropy of a continuous r.v. X with pdf f (x) to be

Z Z
+
h(X) = − f (x) log f (x) dx = − f (x) log f (x) dx
S

when the limit exists. Note this only depends on the density.

1
Example 62. For the uniform distribution f (x) = a 1[0,a], we have that
Za
1 1

h(X) = − 0 a log a dx = log a


In particular, if a < 1, then h(X ) < 0, so that differential entropy may be negative.

46
2
x
2 1 − 2σ
2

Example 63. For the normal distribution X ∼ N (0, σ ), with f (x) = √ 2πσ e 2
, we have that
2
Z ∞ 1 x
h(X ) = − 2 2
−∞ f (x) _ ln √ 2πσ − 2σ _
1 2 1 2
= 2 ln (2πσ ) + 2 σ2 E(X )
1 2 1
= 2 ln (2πσ )+ 2
1 2
= 2 ln(2πeσ )

in terms of nats. If we used the base 2 logarithm instead, we would divide the entire expression by ln 2 and this
1 2
gives us 2 log2(2πeσ ) bits.

Now many of the properties of entropy for discrete random variables carry over. For instance, AEP still follows
from the law of large numbers:

Theorem 64. Let X1, , Xn be i.i.d ∼ f (x). Then


1
− n log f (x1, , xn) _ E( − log f (X)) = h(X)
−nh(X)
Thus for large n, f (x1, , xn) ≈ 2 , i.e. almost constant.

( n)
Then as before we define the typical set A ε with respect to f (x) to be
_ _
(n) = (x1,

, xn): _ 1 − n log f (x1, , xn) − h(X)_
≤ε
_ _
_ _

_ _
To quantify the size of the typical set, we talk about the volume (Lebesgue measure) of the set instead of the
cardinality. Then we have the following properties of the typical set:

( n)
Proposition 65. The typical set A ε has the following properties:
( n)
1. Pr(A ε ) > 1 − ε for large n
( n) n(h(X)+ε)
2. Vol(A ε ) ≤ 2 for all n
( n) n(h(X)−ε)
3. Vol(A ε ) ≥ (1 − ε)2 for large n

The proofs are exactly the same as before (now dealing with the integral instead of the sum). So the volume of the
nh(X ) h(X)
typical set is roughly 2 for large n, and per dimension this gives an interpretation for 2 as an “effective
sidelength” or “radius”.
The joint differential entropy of (X , Y ) ∼ f (x, y) is as expected, h(X , Y ) = E( − log(f (X , Y ))), and follows the
chain rule h(X , Y ) = h (X |Y ) + h(Y ) = h(Y |X ) + h(X ).
Example 66. We compute the differential entropy of a multivariate Gaussian (X1, , Xn) ∼ N (µ, K) where K is the
covariance matrix, and the joint is

1 − 1
−1 _
f (x1, , xn) = f (X) = n e 2 X−µ,K (X−µ)

p
(2π) |K |
47
Then
1 n
h(X1, , Xn) = 2 log[(2πe) |K |] bits

The computation is very straightforward:

1 1
Z " _#
n
h(X1, , Xn) = − f (x1, , xn) ln (2π) |K | − 2 X − µ, K −1(X − µ)
n
=1 2 ln [(2
π K |] + 1 p E x
) |
µ x − µ j)] K −1
2 ij [( i − i)( j ij

1 n X
= 2 ln [(2
π K |] + 1
) | 2 ij
K ji
K −1 ij

=
1 ln[(2π)n |K |] + 1 X I
jj
2 2 j

1 X
n
= 2 ln[(2πe) |K |]

The relative entropy, or the Kullback Liebler distance between two pdfs f (x), g(x) is
Z
f (x)
D(f kg) = f (x)log dx
g(x)
supp f

Note that if the set where g(x) = 0 and f (x) > 0 has positive measure, then the distance is ∞. As with the discrete case, the
relative entopy still satisfies D(f k g) ≥ 0 with D(f k g) = 0 if and only if f = g
almost everywhere.

The mutual information is as before,

Z f (x, y)

I(X ; Y ) = D(f (x, y) k f (x)g(y)) = f (x, y) logf (x) g(y) dxdy

and satisfies I(X ; Y ) = h(Y ) − h(Y |X ) = h(X) − h(X |Y ). Also I(X ; Y ) ≥ 0 with equality if and only if X , Y are
independent. Furthermore, h (Y |X ) ≤ h(Y ), and so conditioning reduces differential entropy, and combined with
the chain rule we have that

n n
i−1
h(X1, , Xn) = i=1
h(Xi |X ) ≤h(Xi) i=1

X X

First we start with more properties about differential entropy:

Proposition 67.

1. h(X + c) = h(X), i.e. entropy is translation invariant.

2. h(aX ) = h(X) + log |a| for any constant a, and similarly, for vector valued r.v. X and invertible matrices A,
we have h(AX) = h(X) + log |A|.

48
Proof. The first statement follows from the fact that the density of X + c is the same as X , except shifted by c. Then
a simple change of variables verifies the first statement. For the second statement, we
1
simply use the fact that the density of aX is just f (x/a) where f is the density of X . Then the result
|a|
follows from change of variables. The same proof holds for vector valued matrices A.

The following proposition is useful for maximization problems that occur in computing capacity. First, we have
Hadamard’s inequality:

Theorem 68. (Hadamard’s Inequality) Let K be a symmetric, positive definite matrix. Then

n
Y

|K | ≤ Kii
i=1

Proof. Let X ∼
n
N (0, K) (which in particular means that Xi ∼ N (0, K ii )). Then we use the fact that
P
h(X) ≤ i=1 h(Xi) to get that n n
1
2 log(( 2πe)n | K | ) ≤ i=1
1
2 log( 2πeKii) =
1
2 log
(2πe)n i=1 Kii
!

X Y

n
Q K
and thus |K | ≤ i=1 i i.
The next result shows that among all vector valued random variables with covariance matrix K, the random variable
that maximizes entropy is the Gaussian N (0, K).

n T
Theorem 69. Suppose X ∈ R is a vector valued r.v. with 0 mean and covariance matrix K = E(X X ). Then

1 n
h(X) ≤ 2 log((2πe) |K |)

with equality if and only if X ∼ N (0, K)

Proof. The result follows from computing the relative entropy of the random variable compared with the Gaussian
random variable, resulting in

1 n
0 ≤ D(g k f ) = − h(X) + 2 ln[(2πe) |K |]

The computation is similar to the computation of the entropy of a joint Gaussian.

Gaussian Channels
First, let’s consider a noiseless continuous channel Y = X, where X ∈ X ⊂ R. If X is not discrete then the capacity
is ∞ (we can use arbitrarily many symbols, so the channel looks like a noiseless channel with n inputs, with capacity
log n).
2
Now suppose Y = X + Z where Z is Gaussian noise, distributed N (0, σN ), where X ∈ R. Note that the capacity
with no restrictions on X is still infinite. If we let X be discrete, evenly spaced on R, then we have a channel with
infinitely many inputs and whose probability of error decays as the spacing increases. However, this is not a realistic
model, since for instance we would require arbitrarily large voltages. A more realistic constraint is some sort of
2
power constraint, for instance a peak power constraint Xi ≤ P for i = 1, , n (n uses of the channel), or a more
1P 2
relaxed average power constraint n Xi ≤ P .
49
Example 70. Consider such a constraint for 1 use of the channel, and suppose further that we are
sending 1 bit, with the constraint that X 2 ≤ P . Then we use the symbols { ± √ P }, and given the output

Y = X + Z, we decode to the nearest symbol; that is, if ˆ √ P , otherwise


Y < 0, then decoder outputs X =−

ˆ √ √ √ P , and by symmetry the



X= P . The probability of error if X = − P is the probability that Z >

same probability of error holds for X = P . Thus



P
σ
2
P (error) = Q N !
∞ 1
−t /2
where Q(z) = x √ 2π e d t. Letting pe be the probability of error, we have reduces the channel to a
symmetric channel with crossover probability p . The capacity is 1 H (p ) 1.

Binary R e − e ≤
2
Consider now n uses of the channel. The output Yi = Xi + Zi where Zi are i.i.d distributed N (0, σN ) and
1 n 2
Xi, Zi are independent. We impose an average power constraint P on X , so n i=1 Xi ≤ P , and we want
Gaussian channel with average
to know what the capacity is. We define the information capacity of a P
power constraint P to be
C =max I (X ; Y )
2
X , E(X )≤P

and we will show the analogue to the Channel-Coding Theorem for Gaussian channels. First we compute what the
information capacity is for the Gaussian channel:

I(X ; Y ) = h(Y ) − h(Y |X )


= h(Y ) − h(X + Z |X )
= h(Y ) − h(Z |X ) (invariance of entropy)
= h(Y ) − h(Z) (X , Z independent)
1
2
= h(Y ) − 2 log(2πeσN )
2
Now we note that since E(X ) ≤ P , then
2 2
E(Y ) = E[(X + Z) ]
2 2
= E(X ) + E(Z ) + 2 E(X)E(Z) (X , Z independent)
2
≤ P + σN
1
2
and thus we have that h(Y ) ≤ 2 log(2πe(P + σN )) by Theorem 69, and so

I(X ; Y ) 1 log 1 + P
_
Y N ,P σ2
≤ 2 _ σN2 we choose X N (0, P ). Note that

with2 equality if and only if ∼ (0 + N ), which is the1 case if P P∼


E(X ) ≤ P so we have achieved the upper bound. Thus C = log 1 + 2 σN2
, and σN2
is the SNR (signal to

noise ratio). _ _
Read Theorem 9.1.1 for the full proof of the channel capacity theorem, but the definitions of codes and achievable
rates remain the same, and the proof idea is also the same.
We will instead sketch an intuitive reason for why the information capacity gives the operational capacity, and even
n n
this is the same as before. If we send input x , then by AEP y will lie in a set of volume
n
2nh(Y |X) with high probability and for large n. We want to choose inputs x so that the typical set for
n n n nh(Y )
y |x do not overlap. Since the typical set of y lies in a set of volume 2 , so the maximum number
of inputs with no overlap is bounded above by

2nH(Y )/2nH(Y |X) = 2nI(X ; Y )

50
which is maximized by the information capacity.

Parallel Gaussian Channels


Now we consider the problem of using Parallel Gaussian Channels, where we can send input X 1, , Xn
2
simultaneously through n Gaussian channels, and get output Y = X + Z i where Zj = N (0, σ ) and inde- i i Nj
2 2
pendent, and Xi, Zi are independent. Denote σ = j
σ . We impose a common power constraint
N NJ
k
j
j =1
E(X 2) P . Then the information capacity is P

P ≤ C= max I(X1, , X n ; Y 1, , Yn )
n 2

P
Now we have Xi , j =1E(Xj )≤P

I(X1, , Xn ; Y1, , Yn) = h(Y1, , Yn) − h(Y1, , Yn |X1, , Xn)


= h(Y , ,Y h(Z , , Z
1 n) − X
n 1 n)
= h(Y1, , Yn) − h(Zj)
j =1
n
X

≤ [h(Yj) − h(Zj)]
j =1
n
≤ j =1 2
log
1 1+
Pj2
σN j
!
X

where the second to last inequality is equality if Yj are independent, and the last line is equality if Yj are Gaussian N
2
(0, Pj + σN j), which is the case if we choose Xj ∼ N (0, Pj). The Pj are to be determined. Thus
_

P 1 Pj P
the problem reduces to maximizing j 2log 1 + σN2j such that j Pj ≤ P , Pj ≥ 0. Since we have a P

concave function that increases with Pj, we can reduce the constraint further to j Pj = P . Now Lagrange multipliers shows
that the optimal Pj satisfy

2
σN j + Pj = const

2 2 +
Since Pj = const − σN j may not be positive, we need a slight adjustment. It turns out that Pj = (const − σN j)
satisfies the Karush-Kuhn-Tucker conditions for optimality in the presence of the additional inequality constraint P j
2
≥ 0 (see wikipedia; in particular, “stationarity” constraint yields σ N j + Pj = const, with µ = 0, and this satisfies
“primal and dual feasibility”, etc). We then choose the constant so that
P

j Pj = P , and pictorally we can view this as a “water-filling procedure”:

const
P
P1 P3 4

2 2 2 2
σN 1 σN 2 σN 3 σN 4

Note that in the picture P2 = 0. We keep raising the constant until the area of the shaded regions gives P , hence the
name.

51
Non-white Gaussian noise channels
In this scenario, we again have n parallel Gaussian channels, but this time the noise is correlated, (Z1, ,
Z) N (0, K ), though we still have Z , X independent. We impose the usual power constraint
nn ∼ N j j
P 2
j =1 E(Xj ) ≤ P , which we can now write as tr(KX) ≤ P . Computing the capacity, we again have
I(X1, , Xn ; Y1, , Yn) = h(Y1, , Yn) − h(Z1, , Zn)
1
n
= h(Y1, , Yn) − 2log[(2πe) |K |]
T
Now we note that the covariance of Y is KY = E(YY ) = KX + KN using Y = X + Z and that X , Z are
independent. The entropy of Y is thus maximized when Y ∼ N (0, KX + KN ), which is achieved if we choose X ∼ N
1 n
(0, KX ). Then the problem reduces to maximizing h(Y ) = 2 log[(2πe) |KX + KN |] such that tr(KX) ≤ P . Now
T
we diagonalize KN = QΛQ with unitary Q, then
T
|KX + KN | = |Q KXQ + Λ|

T T
and noting that tr(Q KXQ) = tr(Q Q KX) = tr(KX) the problem reduces to maximizing |A + Λ| such that tr(A) ≤ P .
By Hadamard’s Theorem,
n
Y

|A + Λ| ≤ (Ajj + Λj j)
j =1

with equality if and only if A + Λ is diagonal. Thus we choose A to be diagonal, and the problem finally
n
reduces to maximizing k=1 (Akk + Λkk) such that j Ajj ≤ P , which is the same maximization
problem as in the parallel white Gaussian
Q
noise channel. Thus the maximum
P
is achieved when A jj =
+ P
(const − Λj j) where the constant is chosen so that Aj j = P .
Side Note for Continuous Time Channels: Consider a channel Y (t) = X(t) + Z(t) where Z is now white Gaussian
noise with power spectral density N0/2, and let W be the channel bandwidth, and P be the input power constraint.
Then the capacity of the channel is
_
P

C = W log 1 + N0W

bits per second. It is interesting to note that we still have a “water-filling” procedure to choose the power spectral
density of the input.

Rate Distortion Theory


We revisit source coding theorems, where we had the earlier assumption of lossless compression, where ˆ
we wanted ( ) to be small, i.e. we wanted exact recovery with small probability of error. In this

P X_X
case we know that the lower bound for compression rates is the entropy H (X ), and is achieved by Shannon-type
codes or AEP-based codes with arbitrarily small probability of error (zero error for Shannon/Huffman codes),
ˆ
Instead, we consider lossy compression, where we allow and to differ by a little. We use a distortion
X X
function d to determine how close two words are. Thus we now find a compression scheme so that P (d(X ,
ˆ
)

X <ε
) is small for small. Th is is a natural assumption for analog sources, for ins tance representing real

ε
numbers, which we may only represent with finitely many digits, and also for representing images, where smearing
a few pixels does not affect the overall image by too much.

52
2 2
Example 71. Suppose X ∼ N (0, σ ), and we consider the distortion function d(x, xˆ) = (x − xˆ) . First we consider
representing X with a single bit (compression rate R = 1 bit). This means that once we observe
ˆ ˆ
X, we encode as one of two values {a, b}, which we denote X . We want to minimize E(d(X , X )). By

symmetry considerations, we should choose b = − a, and choose


ax > 0
ˆ _
X (x) = −a x<0
In this case,
0

ˆ 2 Z ∞ 2 Z 2
E(|X − X (X )| ) = 0 (x − a) f (x) dx + −∞ (x + a) f (x) dx
and the optimal choice of a is
r
2
a = E(X |X > 0) = σ
π

(The conditional expectation minimizes mean square error. Linking this to the abstract probability theory
ˆ , which is constant on
and conditional expectation, we wish to find another estimator random variable X
2
ˆ ˆ
{X < 0} and {X > 0}, such that E(|X − X | ) is minimized. In other words, X is measurable with respect
ˆ
to the σ-field Σ = {∅, {X < 0}, {X > 0}, R}, and the minimizer is found by conditional expectation X =
E(X |Σ). In this case, this is exactly

ˆ (
E(X X > 0) = σ 2/π X > 0 |
X = E(X X < 0) = σ 2/π X < 0

) | − p

Now if we consider two bits instead, we consider the following procedure. Pick four points arbitrarily {a, b, c, d}.
This corresponds to four regions, determined by which of the four points is closest, A, B , C , D, where all points in
A are closest to a, and etc. However, given the four regions A, B, C , D, the four points
ˆ 2
a, b, c, d may not be the best encoding points that minimize E(|X − X | ). We know from above that this
is given by the conditional means over each region (E(X |Σ), where Σ = σ(A, B, C , D), the σ-algebra gen-erated by
′ ′ ′ ′ ′ ′ ′
the four sets). The conditional means gives us four new points a ,b ,c ,d , which correspond to new regions A ,B , C ,

D . We can then iterate this procedure ot obtain the best four points. This is known as the Lloyd Algorithm. We do
not discuss convergence here.

We now turn to the general setting, and introduce a few definitions. In the above example we dealt with encoding a
single random variable. In general we will work as in the lossless case to encode n i.i.d copies of a random variable,
or a stationary stochastic process. To simplify discussion, we consider the i.i.d case. Given X 1, , Xn i.i.d ∼ p(x), with
x ∈ X , we have

X
n
_ Encoder fn(X
n
) ∈ {1, , 2_nR
} Decoder gn(fn(X
compression function
n
)) = X
ˆ
n
_ _ decoding function

ˆ +
The distortion function is a mapping d: X × X → R where d(x,ˆ xˆ) is the cost of representing the
source outcome x as the reproduction outcome x. In most cases . Common distortion functions:
ˆ X=X
• Hamming Distortion
_
1 x _ xˆ

d(x, xˆ) =
0 x = xˆ

53
ˆ ˆ
and in this case ( ( )) = ( )
EdX,X P X_X

• Squared Error Distortion


2
d(x, xˆ) = (x − xˆ)

A distortion function is allowed to be ∞, in which case compression must avoid the ∞ cases. A bounded distortion
function is one which does not take ∞. We can extend distortion functions to look at the joint as well (or blocks of
random variables). The single-letter distortion function is
n n
d(X
n ,X
ˆ )= n
1 i=1 d(Xi, Xi)
ˆ

X
which is essentially the “average per-letter distortion”.
nR
Given the distortion function, a (2 , n) rate distortion code consists of an encoding function fn mapping
n
n nR nR ˆ

X → {1, , 2 } and a decoding function gn mapping {1, , 2 } → X . The distortion of the code is
n n n n n
D = E(d(X , gn(fn(X )))) = Xn
p(x ) d(x , gn(fn(x )))

X
ˆn nR
we represent gn(i) as X (i), i = 1, , 2 .

nR
Definition 72. A rate-distortion pair (R, D) is achievable if there exists a sequence of (2 , n) rate
n n
distortion codes with limn→∞ E(d(X , gn(fn(X )))) ≤ D. The rate distortion region is the closure of the set of
achievable (R, D) pairs. The rate distortion function R(D) is the infimum of R such that (R, D) is achievable and
the distortion rate function D(R) is the infimum of D such that (R, D) is achievable.

It is always the case that R(D) is convex, though we will not prove this here. As before, it turns out that
R(D) will coincide with the information rate distortion function

R(D) = min I (X , X )
ˆ
n n
E
p(xˆ |x ) ˆ

d(X ,X )≤D

This is the rate-distortion theorem, which we also do not prove, but the idea is similar to the channel coding theorem
proof with a random codebook. Instead we focus on how to compute the information rate distortion function. Now
we are fixing the source p(x) and minimizing the mutual information over all
ˆ
possible choices of the conditional distribution p(xˆ |x). The key is in the observation that I (X ; X ) is a
convex function of p(xˆ |x), so again we only need to look for critical points to find a global minimum.

Binary Source / Hamming Distribution

Let X ∼ Bern(p), where X = 1 with probability p, and let p < 1/2. We compute the mutual information
ˆ ˆ
I(X ; X ) = H (X ) − H (X | X )
ˆ ˆ
= H (p) − H (X ⊕ X |X)
ˆ
≥ H (p) − H (X ⊕ X )
ˆ ˆ
using Z = X ⊕ X since Z = 1 if and only if X _X (relates to distortion function), and we note that
ˆ ˆ

P (Z = 1) = P (X _ X ) = E(d(X , X )) ≤ D
54
ˆ
so that I(X ; X ) ≥ H (p) − H (D) if D ≤ 1/2 (recall H (p) is concave with maximum at 1/2). To meet the
ˆ ˆ
lower bound, we need to find p(xˆ |x) such that H (X | X ) = H (D). In order to talk about H (X | X ), we
ˆ
consider a “test” channel, with input X and output X. Note that Bayes’ rule allows us to invert p(xˆ |x) to
ˆ ˆ
get p(x | xˆ), which corresponds to the test channel and is relevant for H (X | X ). Note that H (X | X ) =
D _
_
x xˆ

H (D) in the case of BSC(D) channel. Then if we set p(x | xˆ) = 1 − D x = xˆ as in BSD, if the corre-
sponding p(xˆ) is a valid probability distribution, then we have achieved the minimum. We have the fol-lowing
picture:

(1 a) 0 1−D 0 (1 p)
ˆ − D −
X X
a1 1 p

ˆ p−D
so that a = P (X = 1). Then we compute that a(1 − D) + (1 − a)D = p, or a = 1 2D , which is valid if D ≤
1 −
p< 2 . In this case we have achieved the minimum mutual information, so the rate distortion function is
ˆ
0). In this case (
R(D) = H (p) − H (D). In the case D > p, we note that simply setting X = 0 always (i.e. p(xˆ |x) = 1 if xˆ =
ˆ
)= already , and we have achieved transmission at a rate of 0, which is the

P X_X p<D
ˆ
minimum possible value for I(X ; X ). Thus R(D) = 0 for D > p. In summary

R(D) = H (p) − H (D) 0 ≤ D ≤ min {p, 1 − p}

_0 else
Gaussian Source with Squared-Error Distortion

2
Let X ∼ N (0, σ ) as in the previous example. Computing the information rate distortion function, we have

ˆ ˆ
I(X ; X ) = h(X) − h(X | X )
1 2 ˆ ˆ
=2 log(2πeσ ) − h(X − X | X )
1 2 ˆ
≥ 2 log(2πeσ ) − h(X − X )
ˆ 2
considering Z = X −2 X , we note E(Z ) ≤ D from the constraint. We then maximize h(Z) subject to the

constraint that E(Z ) ≤ D. Then the Gaussian Z ∼ N (0, D) will maximize h(Z), and so
2
1 1 1 σ

ˆ 2
_
I(X ; X ) ≥ 2 log(2πeσ ) − 2 log(2πeD) = 2 log D _
ˆ ˆ
Studying the bounds above, we note that equality is achieved if Z = X − X is independent of X and Z ∼
2 ˆ 2 ˆ ˆ
N (0, D). We are given that X2 ∼ N (0, σ ). This shows that X ∼ N (0, σ − D) since X = Z −2 X . X is a
valid distribution so long as σ ≥ D, and in this case we can achieve the minimum. Now if σ < D, as in
ˆ 2
the previous case if we simply set X = 0 the distortion is σ < D already with a rate of 0. Thus
2
1 σ 2

R(D) = 2 log
0 _
D _ D ≤ σ2
D>σ

55
2 −2R
Note that D(R), computed by inverting R(D) is D(R) = σ 2 . This can be interpreted by saying that for “every
2
additional bit we want to send, the distortion lowers by a factor of 1/4”. For R = 1 we have D = σ /4, and in the
2
earlier example where we used 1-shot, 1-bit quantization, D ≈ 0.36σ (but this tells us that we can lower the
distortion using some other method).

Independent Gaussian Random Variables with Single-Letter Distortion


2
Now consider X1, , Xm independent with Xi ∼ N (0, σi ). The distortion function is

m m
m ˆ ˆ 2
d(X
,X )=
(Xi − Xi)
X
i=1

We can compute the information rate distortion function similarly to the channel capacity for independent Gaussian
noise. We have that
m m
ˆ ˆ
I(X m
,X ) = h(X m ) h(X m
| X ) ˆ )
m − m −1 m
X X i ,X
= h(Xi) − h(Xi |X
i=1 i=1
m m
ˆ
≥ h(Xi) − h(Xi | Xi)
i=1 i=1
m
X X
= i=1
I(Xi ; Xi)
ˆ

ˆ
X i −1 ˆ
where equality is achieved if given ,

X X is independent of and other . We can lower bound

X
i i X
j _i
further by
2
i=1
m
I X Xˆ
( i ; ) ≥ i=1
i
m
RD
i( i) = i=1
m
1 2 log+_ Dii
σ _
X X X

where equality is achieved if we use independent Gaussians X ˆ ∼ N (0 , D D are to be



X
i i
ˆ
2 i i), and the

D D
chosen. Since Di = E[(Xi − Xi) ], we have the constraint i i≤ .
2
m 1 + σi
P D D
The problem therefore reduces to minimizing i=1 2 log Di subject to the constraint i i ≤ .
+ yields D ˆ N , σ2 D
∼ i −
= λ. Recall that X
Ignoring the positive part by log with log, the Lagrangian _ _ i i (0 i),
P
2 λ ≥ σi2 P
2 σ , and choose λ such that
2
P
and so we need that λ ≤ σi. We then verify KKT for the solution Di = (λ
i
λ < σi
Di = D. This can be interpreted as a “reversed” water-filling procedure:

σ2 2
3 σ
4
λ
2
σ1 2
σ2

D
D1 D2 D3 4

2
1 σi

∗ +
P ∗
So we have Ri = 2log Di
∗ and R(D) = i Ri .

You might also like