Introduction to Data Compression
Overview
Data compression is the process of reducing the size of digital data by encoding information
using fewer bits than the original representation. It is an essential technique in multimedia,
networking, storage systems, and communication because it helps save storage space, reduce
transmission time, and improve system efficiency.
With the rapid growth of digital media such as images, audio, and video, large volumes of
data must be handled efficiently. Compression makes it possible to store and transmit this
data without excessive resource usage while maintaining acceptable quality.
Objectives of Data Compression
Reduce storage requirements
Minimize transmission bandwidth
Speed up data transfer
Lower cost of storage and communication
Improve system performance
Basic Principle of Compression
Compression works by eliminating redundancy in data. Redundancy refers to repeated or
predictable patterns that can be represented more efficiently.
Types of Redundancy Removed
1. Spatial Redundancy – Repetition within data (e.g., repeated pixels)
2. Temporal Redundancy – Repetition over time (e.g., video frames)
3. Coding Redundancy – Inefficient data representation
By identifying these patterns, algorithms encode the information more compactly.
Compression Model
A typical compression system consists of:
Encoder (Compressor)
Converts original data into compressed form
Decoder (Decompressor)
Reconstructs data from compressed form
Original Data → Encoder → Compressed Data → Decoder → Reconstructed Data
Categories of Compression
Lossless Compression
No information is lost
Original data perfectly restored
Used for text, documents, software
Lossy Compression
Some data removed permanently
Higher compression ratio
Used for audio, image, video
Applications of Data Compression
Multimedia streaming
File storage and archiving
Internet data transmission
Mobile communication
Medical imaging
Satellite communication
Advantages of Data Compression
Saves storage space
Reduces bandwidth usage
Improves transfer speed
Enables efficient multimedia distribution
Limitations
Compression requires processing power
Lossy methods reduce quality
Some algorithms are complex to implement
Lossless & Lossy Compression
Data compression techniques are broadly classified into two main categories based on
whether the original data can be perfectly reconstructed after decompression:
[Link] Compression
[Link] Compression
Each method serves different purposes depending on the type of data and required accuracy.
Lossless Compression
Overview
Lossless compression is a technique in which the original data can be reconstructed exactly
from the compressed data without any loss of information. Every bit of data remains intact
after decompression.
This type of compression is essential when accuracy and data integrity are critical.
Working Principle
Lossless methods remove redundancy by:
Identifying repeating patterns
Encoding frequently occurring symbols efficiently
Replacing repeated sequences with shorter representations
No data is discarded — only reorganized.
Common Techniques
Huffman Coding
Run-Length Encoding
Lempel–Ziv methods (LZ77, LZW)
(These will be discussed later in detail)
Applications
Text files
Program files
Databases
Medical images
Legal or financial documents
Advantages
Perfect reconstruction
No quality degradation
Suitable for critical data
Limitations
Lower compression ratio compared to lossy methods
Less efficient for multimedia data
Lossy Compression
Overview
Lossy compression reduces file size by permanently removing less important or less
noticeable data. The reconstructed data is an approximation of the original.
This method is widely used where slight quality loss is acceptable.
Working Principle
Lossy algorithms:
Analyze human perception limits
Remove inaudible or invisible details
Approximate original signals
Once removed, the discarded information cannot be recovered.
Common Techniques
Transform coding
Quantization
Perceptual encoding
Applications
Audio streaming
Image sharing
Video broadcasting
Multimedia storage
Advantages
High compression ratio
Reduced storage and bandwidth use
Efficient multimedia delivery
Limitations
Quality loss
Irreversible process
May introduce distortion
Comparison Between Lossless and Lossy Compression
Feature Lossless Lossy
Data Recovery Exact Approximate
Quality Preserved Reduced
Compression Ratio Lower Higher
Usage Text, software Audio, video
Reversibility Yes No
Huffman Coding & Algorithm — Detailed
Notes
Overview
Huffman Coding is a widely used lossless data compression technique developed by David
Huffman. It is based on assigning variable-length binary codes to symbols depending on their
frequency of occurrence. Symbols that appear more frequently are assigned shorter codes,
while less frequent symbols receive longer codes. This approach reduces the overall number
of bits required to represent data.
Huffman coding is used in many compression standards such as image, audio, and file
compression systems because it efficiently removes coding redundancy without losing
information.
Basic Principle
The algorithm works on the idea that:
Frequently occurring symbols should use fewer bits
Rare symbols should use more bits
No code should be a prefix of another code (Prefix Property)
This ensures unambiguous decoding.
Steps in Huffman Coding Algorithm
1. List Symbols and Frequencies
Determine how often each symbol appears in the data.
2. Create Leaf Nodes
Treat each symbol as a node labeled with its frequency.
3. Build the Huffman Tree
Select two nodes with the smallest frequencies
Combine them into a new node
Assign the sum of their frequencies
Repeat until one tree remains
4. Assign Binary Codes
Traverse tree from root
Assign:
Left branch → 0
Right branch → 1
The path to each symbol forms its code
5. Encode Data
Replace symbols with corresponding binary codes.
Example Concept (Illustrative)
If symbol frequencies are:
Symbol Frequency
A High
B Medium
C Low
Then codes might be:
A → Short code
B → Medium length
C → Longer code
This reduces total storage compared to fixed-length coding.
Decoding Process
Read encoded bits sequentially
Traverse Huffman tree accordingly
Output symbol when leaf node reached
Since codes follow prefix property, decoding is straightforward and accurate.
Characteristics of Huffman Coding
Variable-length coding
Prefix-free representation
Optimal for symbol-by-symbol coding
Based on probability distribution
Advantages
No loss of information
Efficient for data with unequal symbol frequencies
Simple decoding mechanism
Widely used in compression systems
Limitations
Requires frequency table storage
Less effective if symbol frequencies are uniform
Static Huffman needs preprocessing
Tree construction adds computation
Applications
File compression utilities
Image compression
Multimedia encoding
Data transmission systems
Shannon–Fano Algorithm — Detailed Notes
Overview
The Shannon–Fano algorithm is an early lossless data compression technique developed by
Claude Shannon and Robert Fano. It is based on assigning variable-length binary codes to
symbols according to their probabilities or frequencies of occurrence. Similar to Huffman
coding, symbols that appear more frequently receive shorter codes, while less frequent
symbols receive longer codes.
Although it is not always optimal like Huffman coding, the Shannon–Fano method
introduced the fundamental idea of entropy-based coding and laid the foundation for modern
compression techniques.
Basic Principle
The algorithm follows these principles:
Arrange symbols based on decreasing probability (or frequency).
Divide the set of symbols into two groups with nearly equal total probabilities.
Assign binary digits to each group:
o First group → 0
o Second group → 1
Repeat the division process recursively for each group until each symbol has a unique
code.
This results in prefix-free variable-length codes that enable correct decoding.
Steps of the Shannon–Fano Algorithm
[Link] Symbol Frequencies
Determine how often each symbol appears in the input data.
[Link] Symbols
Arrange symbols in descending order of frequency.
[Link] Symbols
Divide the list into two parts such that the total frequency of each part is as close as possible.
[Link] Codes
First group receives binary digit 0
Second group receives binary digit 1
[Link] Recursively
Apply the same partitioning and coding to each subgroup until each symbol has a distinct
code.
Encoding Process
Replace each symbol with its assigned binary code
Store or transmit encoded data
This reduces the total number of bits required compared to fixed-length encoding.
Decoding Process
Read binary sequence
Match prefix codes
Identify corresponding symbols
Since codes are prefix-free, decoding remains unambiguous.
Characteristics
Variable-length coding
Based on symbol probability
Prefix property maintained
Simpler conceptually than Huffman
Advantages
Reduces coding redundancy
Easy to understand and implement
Efficient for uneven symbol distributions
Historically important for compression theory
Limitations
Not always optimal (may produce longer codes than Huffman)
Efficiency depends on partition accuracy
Less commonly used in modern systems
Comparison with Huffman Coding
Feature Shannon–Fano Huffman
Optimal coding Not guaranteed Guaranteed optimal
Tree construction Top-down partitioning Bottom-up merging
Efficiency Slightly lower Higher
Usage Historical/educational Widely used
Adaptive & Arithmetic Coding
Overview
Adaptive coding and arithmetic coding are advanced techniques used in lossless data
compression. Unlike simple variable-length coding methods, these approaches improve
compression efficiency by dynamically adjusting symbol representation or encoding entire
messages as numerical ranges. They are widely applied in modern multimedia and
communication systems.
Adaptive Coding
Introduction
Adaptive coding refers to compression methods that adjust coding parameters dynamically
while processing data, rather than relying on predefined symbol frequencies. The encoder and
decoder update their models continuously based on the data being processed.
This makes adaptive coding suitable when symbol probabilities are unknown or change over
time.
Working Principle
Initially assume equal probability for symbols
As symbols are processed, update frequency statistics
Modify codes according to updated probabilities
Decoder performs identical updates to maintain synchronization
This allows the coding scheme to “learn” patterns from incoming data.
Characteristics
No prior knowledge of data distribution required
Model updates during encoding
Responds to changing data patterns
Often used with adaptive Huffman coding
Advantages
Better performance for dynamic data streams
Eliminates need to transmit frequency tables
Flexible and efficient for real-time processing
Limitations
Higher computational complexity
Encoder and decoder must remain synchronized
Implementation more complex than static methods
Arithmetic Coding
Introduction
Arithmetic coding is a powerful lossless compression technique that encodes an entire
message into a single fractional number between 0 and 1. Instead of assigning codes to
individual symbols, it represents the sequence as a range that becomes narrower as more
symbols are processed.
This method often achieves higher compression efficiency than Huffman coding.
Working Principle
[Link] probability ranges to symbols
[Link] with interval [0,1)
[Link] each symbol:
Subdivide interval according to probabilities
Select sub-interval corresponding to symbol
[Link] interval uniquely represents message
Any number within the final interval can represent the encoded data.
Example Concept
If symbols A, B, and C have assigned probability ranges:
A → first portion
B → middle portion
C → last portion
As symbols are processed, the interval becomes smaller and more precise.
Characteristics
Encodes entire message as one value
Uses probability modeling
Produces highly efficient compression
Works well with adaptive probability updates
Advantages
Higher compression efficiency
Handles fractional bit representation
Effective for skewed probability distributions
Limitations
Complex implementation
Slower processing than simpler methods
Requires high numerical precision
Comparison: Adaptive vs Arithmetic Coding
Feature Adaptive Coding Arithmetic Coding
Approach Updates symbol model Encodes message range
Complexity Moderate High
Efficiency Improved over static Very high
Usage Dynamic data streams Advanced compression systems
Dictionary-Based Compression (LZ77,
LZW)
Overview
Dictionary-based compression techniques are a class of lossless data compression methods
that reduce data size by replacing repeated patterns with shorter references stored in a
dictionary. Instead of encoding symbols individually, these methods identify recurring strings
or sequences and substitute them with references pointing to earlier occurrences.
These techniques are widely used because they are efficient, adaptable, and suitable for
different types of data such as text, images, and executable files.
Two well-known dictionary-based algorithms are:
LZ77 (Lempel–Ziv 1977)
LZW (Lempel–Ziv–Welch)
LZ77 Compression
Introduction
LZ77 is a sliding-window compression algorithm that replaces repeated data sequences with
references to previous occurrences within a fixed-size window of recently processed data.
It does not maintain a permanent dictionary; instead, it uses previously encoded data as its
reference source.
Working Principle
The algorithm divides data into two buffers:
[Link] Buffer
Contains recently encoded data
Used to find matching patterns
[Link]-Ahead Buffer
Contains upcoming data to encode
Process:
Compare look-ahead data with search buffer
Find longest matching sequence
Encode reference as a triple:
(offset, length, next symbol)
Offset → Distance to match
Length → Size of match
Next symbol → Following unmatched character
Characteristics
Uses sliding window
No explicit dictionary storage
Effective for repetitive data
Basis for many modern compressors
Advantages
Good compression for repeated patterns
No prior knowledge required
Adaptive to input data
Limitations
Performance depends on window size
Searching for matches can be slow
Less effective on non-repetitive data
LZW Compression
Introduction
LZW improves dictionary compression by building a dictionary dynamically during
encoding. Unlike LZ77, it stores patterns explicitly in a dictionary table and assigns codes to
sequences.
It is widely used in formats such as GIF images and file compression utilities.
Working Principle
[Link] dictionary with basic symbols
[Link] input sequence
[Link] longest sequence in dictionary
[Link] dictionary code
[Link] new sequence to dictionary
[Link] until input ends
The decoder rebuilds the same dictionary using identical logic.
Characteristics
Dictionary grows dynamically
Encodes sequences instead of single symbols
No need to transmit dictionary
Advantages
High compression efficiency
Fast encoding and decoding
Works well with structured data
Limitations
Dictionary size management required
Less efficient for random data
Memory usage increases
Comparison: LZ77 vs LZW
Feature LZ77 LZW
Dictionary Sliding window Explicit table
Encoding Offset-length reference Dictionary codes
Memory use Lower Higher
Speed Slower search Faster lookup
Efficiency Good Often better
Compression Ratio & Comparison
Compression ratio is an important metric used to evaluate the effectiveness of a data
compression technique. It measures how much the size of data has been reduced after
compression. By comparing different compression methods using this metric and other
performance factors, multimedia designers and system engineers can choose the most
appropriate algorithm for their needs.
Compression Ratio
Definition
Compression ratio expresses the relationship between the size of original data and the size of
compressed data.
Formula
Compression Ratio=Original SizeCompressed Size\text{Compression Ratio} = \frac{\
text{Original Size}}{\text{Compressed
Size}}Compression Ratio=Compressed SizeOriginal Size
Interpretation
Higher ratio → Better compression efficiency
Lower ratio → Less size reduction
Example
If a file of 100 MB is compressed to 25 MB:
Compression Ratio=10025=4:1Compression\ Ratio = \frac{100}{25} =
4:1Compression Ratio=25100=4:1
This means the compressed file is four times smaller than the original.
Compression Efficiency Measures
• Bit Rate Reduction
Indicates how much data transmission requirement has been reduced.
• Space Saving Percentage
Shows percentage of storage saved.
Space Saving=Original−CompressedOriginal×100\text{Space Saving} = \frac{Original -
Compressed}{Original} \times 100Space Saving=OriginalOriginal−Compressed×100
• Quality Retention
Important in lossy compression to evaluate how much original quality is preserved.
Factors Affecting Compression Ratio
Nature of data (text, audio, video)
Redundancy level in data
Algorithm used
Compression settings
Acceptable quality loss
Highly repetitive data typically compresses better than random data.
Comparison of Compression Techniques
Technique Type Compression Efficiency Complexity Typical Use
Huffman Lossless Moderate Low Text/data files
Shannon–Fano Lossless Moderate Low Educational use
Arithmetic Coding Lossless High High Advanced systems
LZ77 Lossless Good Moderate File compression
LZW Lossless Good/High Moderate Images/files
Lossy Methods Lossy Very High Moderate Multimedia
Trade-Offs in Compression
Choosing a compression method often involves balancing:
Compression ratio
Processing speed
Memory usage
Reconstruction quality
Application requirements
For example:
Text documents require lossless accuracy
Video streaming prioritizes higher compression ratios
Importance in Multimedia Systems
Compression ratio analysis helps:
Optimize storage resources
Improve streaming performance
Reduce network bandwidth usage
Enhance user experience
It plays a crucial role in multimedia design, communication systems, and cloud storage.