Introduction to Information Theory
Information theory is a mathematical framework developed to quantify information, uncertainty,
and the efficiency of data transmission. It was pioneered by Claude Shannon in 1948, laying the
foundation for modern digital communication, data compression, and error correction
techniques. The theory is widely used in telecommunications, cryptography, machine learning,
and artificial intelligence.
At its core, information theory deals with the measurement of uncertainty in a system and how
efficiently information can be transmitted through a noisy channel.
1. Uncertainty and Information
In everyday communication, uncertainty refers to the unpredictability of an event. If an event is
certain, it carries no new information. Conversely, if an event is highly uncertain, it provides
more information when it occurs.
For example:
● If you check the weather forecast in a desert region and see "Sunny", it provides little to no new
information (low uncertainty).
● If the forecast predicts "Heavy Snowfall", it is surprising and conveys more information (high
uncertainty).
Mathematical Representation: The amount of information received from an event X = x can
be measured using the formula:
Interpretation:
If P(x) is high, I(x) is low (less information).
If P(x) is low, I(x) is high (more information).
Example:
● If a coin flip results in Heads or Tails, each with probability P(H)=P(T)=0.5, the
information content is:
● If an event is very rare, say P(x)=0.01, it carries more information:
2. Entropy (Shannon Entropy)
Entropy measures the average uncertainty in a random variable. It quantifies how much
information is required to describe an outcome.
For a discrete random variable X with probability distribution P(x), Shannon entropy is
defined as:
where:
3. Relative Entropy (Kullback-Leibler Divergence):
Problem 1: Compute the Entropy of a Fair Die
A fair six-sided die has equal probability for each face. Find the entropy of this system.
Solution:
Problem 2: Compute the Entropy of a Biased Coin
A coin is biased such that P(H)=0.7 and P(T)=0.3. Find the entropy.
Solution:
Problem 3: Calculate the Entropy of a 4-Sided Biased Dice
A four-sided die has probabilities:
P(1)=0.1, P(2)=0.2, P(3)=0.3, P(4)=0.4
Find the entropy H(X).
Solution:
Joint Entropy in Information Theory
Definition:
The joint entropy of two discrete random variables X and Y, denoted as H(X,Y), measures the
total amount of uncertainty (or information) contained in the pair of variables.
Mathematically, it is defined as:
Interpretation of Joint Entropy:
● Joint entropy measures the total information of the combined system (X,Y).
Example Calculation:
Suppose X and Y take the following values with the given joint probabilities:
Relationship to Other Entropy Measures
Joint entropy is connected to conditional entropy and mutual information:
1. Conditional Entropy:
H(X∣Y) = H(X,Y)−H(Y)
where H(X∣Y) is the uncertainty of X given Y.
2. Mutual Information:
I(X;Y)=H(X)+H(Y)−H(X,Y)
which measures the reduction in uncertainty of one variable given the others.
Marginal Entropy in Information Theory
Definition:
The marginal entropy of a single random variable X, denoted as H(X), measures the uncertainty
(or randomness) associated with X alone, ignoring any additional information about other
variables.
Mathematically, it is given by:
Interpretation of Marginal Entropy
● Marginal entropy measures the uncertainty of a single variable, ignoring other
variables.
● If X and Y are independent, then:
H(X,Y)=H(X)+H(Y)
● If X and Y are not independent, then:
H(X)≤H(X,Y)H(X)
because knowing Y reduces the uncertainty in X.
Problem : Marginal Entropy for Joint Distribution
Solution:
First, find the marginal probability distribution of X:
4. Mutual Information
Mutual information measures how much knowing one variable reduces uncertainty about
another.
For two random variables X and Y, mutual information is given by:
5. Average Mutual Information
Average mutual information measures the expected reduction in uncertainty of one variable due to
another.
It tells us how much on average we learn about X when observing Y.
Example:
Problem 4: Compute Mutual Information
Consider a communication system where a sender transmits bits X that can be received as Y. The
joint probability distribution is:
Solution:
Introduction to Lossless Coding
Lossless coding is a data compression technique that ensures the original data can be perfectly
reconstructed from the compressed data. Unlike lossy compression, which sacrifices some data
quality for higher compression rates, lossless coding retains all information, making it ideal for
applications such as text compression, medical imaging, and file storage.
Key Concepts in Lossless Coding
1. Redundancy Removal:
o Lossless compression reduces redundancy in data by identifying patterns and
encoding them efficiently.
o It relies on statistical methods to assign shorter codes to frequently occurring
symbols.
2. Entropy and Source Coding:
o According to Shannon's Information Theory, the entropy (H) of a source
represents the minimum average number of bits required to encode its symbols.
o Lossless coding techniques aim to achieve compression rates close to this
theoretical limit.
3. Prefix Codes:
o Lossless coding often uses prefix codes, where no codeword is a prefix of another.
o This ensures unique decodability without requiring delimiters between encoded
symbols.
Types of Lossless Coding Techniques
1. Huffman Coding:
o A widely used variable-length coding algorithm that assigns shorter codes to more
frequent symbols.
o Constructs a binary tree based on symbol frequencies.
o Used in applications like ZIP file compression and PNG images.
2. Arithmetic Coding:
o Assigns a single number between 0 and 1 to an entire message rather than
encoding symbols individually.
o Achieves higher compression efficiency compared to Huffman coding, especially
for sources with skewed probability distributions.
3. Run-Length Encoding (RLE):
o Efficient for data with repeated sequences, like simple images or text files with
repeated spaces.
o Example: "AAAAABBBCC" → "5A3B2C".
4. Lempel-Ziv (LZ) Compression:
o Dictionary-based coding method used in ZIP, GIF, and PNG formats.
o Examples include LZ77 and LZ78, which replace repeated sequences with
references.
5. Burrows-Wheeler Transform (BWT):
o A preprocessing step for compression that rearranges data to make it more
suitable for dictionary-based methods like LZ77.
o Used in Bzip2 compression.
Source Coding Theorem
The Source Coding Theorem is a fundamental result in Information Theory, introduced by
Claude Shannon in 1948. It establishes the theoretical limit for lossless data compression and
provides a foundation for efficient coding techniques.
Statement of the Theorem
The Source Coding Theorem states that:
"For a discrete memoryless source (DMS) with entropy H(X)), the minimum average length L of
any lossless encoding satisfies:"
where:
● L is the average code length per symbol.
● H(X) is the entropy of the source, which represents the average amount of uncertainty or
information in each symbol.
● The lower bound is asymptotically achievable using optimal coding schemes, such as
Huffman coding or Arithmetic coding.
Key Insights from the Theorem
1. Entropy as a Bound:
o The entropy H(X) sets the fundamental limit on how much a source can be compressed.
o If a source is compressed below H(X), some information will be lost, violating the
lossless condition.
2. Achievability:
o If L≈H(X), the encoding scheme is optimal.
o The closer L is to H(X), the more efficient the compression.
3. Uniqueness:
o No lossless compression scheme can achieve an average code length below H(X), making
entropy the absolute lower limit.
Example Calculation
Consider a source that emits three symbols A,B,C with probabilities:
Block Codes and Their Properties
A block code is a type of error-detecting and error-correcting code in which messages are
divided into fixed-length blocks of symbols and encoded into a longer sequence of symbols for
transmission. Block codes are used in digital communication systems to ensure reliable data
transmission over noisy channels.
Properties of Block Codes
1. Fixed-Length Encoding: Each input message of k bits is encoded into a codeword of n
bits (denoted as an (n,k) code).
2. Error Detection and Correction: Block codes can detect and correct errors based on the
redundancy added to the code.
3. Code Rate: The ratio of information bits to total transmitted bits is given by:
4. A higher rate means less redundancy and lower error correction capability.
5. Minimum Distance: The minimum Hamming distance between any two codewords
determines the error detection and correction ability.
6. Systematic and Non-Systematic Codes:
o Systematic Codes: The original message appears in the codeword, with
additional redundant bits.
o Non-Systematic Codes: The message is completely transformed into a different
codeword.
7. Linear Block Codes: A block code is linear if the sum of any two codewords is also a
valid codeword.
8. Hamming Weight and Hamming Distance:
o The Hamming weight of a codeword is the number of nonzero bits in it.
o The Hamming distance between two codewords is the number of positions at
which they differ.
Kraft-McMillan Inequality
The Kraft-McMillan inequality provides a necessary and sufficient condition for the existence
of a prefix-free or uniquely decodable code.
Statement of Kraft's Inequality
Implications of Kraft-McMillan Inequality
1. Prefix-Free Code: If a code satisfies Kraft’s inequality, then there exists a prefix-free
code with those codeword lengths.
2. Uniquely Decodable Code: For uniquely decodable codes, the inequality must also hold,
but not necessarily with equality.
3. Code Construction: Given a set of lengths satisfying Kraft’s inequality, we can construct
a prefix-free code.
Example of Kraft's Inequality
Example 1: Block Code Construction
Problem: Code Rate Calculation
A block code has code words of length n=7 and each message consists of k=4 information
bits. Find the code rate.
Solution:
So, the code rate is 0.571 or 57.1% efficiency.