Unit – 2
Hadoop Distributed File System
HDFS (Hadoop Distributed File System) The Design of HDFS, HDFS Concepts, Command Line Interface,
Hadoop file system interfaces, Data flow, Data Ingest with Flume and Sqoop and Hadoop archives, Hadoop
I/O: Compression, Serialization, Avro and File-Based Data structures.
Hadoop - HDFS (Hadoop Distributed File System)
Before learning about HDFS (Hadoop Distributed File System), it’s important to understand what a file
system is. A file system is a way an operating system organizes and manages files on disk storage. It helps
users store, maintain, and retrieve data from the disk.
Example: Windows uses file systems like NTFS (New Technology File System) and FAT32 (File
Allocation Table 32). FAT32 is an older file system but is still supported on versions like Windows XP.
Similarly, Linux uses file systems such as ext3 and ext4.
Distributed File System
DFS stands for Distributed File System. It is a way of storing files across multiple nodes in a distributed
manner. A DFS provides the abstraction of a single large system whose storage capacity equals the sum of
the storage of the nodes in the cluster.
Why We Need DFS?
Storing very large files (e.g., 30TB) on a single system is impractical because:
Disk capacity of one machine is limited and can only grow so much.
Processing huge datasets on a single machine is inefficient and slow.
Distributed File Systems (DFS) overcome these issues by storing data across multiple machines, enabling
faster and scalable processing.
Example:
Suppose you have a 40TB file to process. On a single machine, it might take about 4 hours to complete.
However, using a Distributed File System (DFS), as shown in the image below, the 40TB file is split across 4
nodes in a cluster, with each node storing 10TB. Since all nodes work simultaneously, processing time
ideally drops to about 1 hour (4 hours ÷ 4 nodes), ignoring coordination and network overhead. This
demonstrates why DFS is essential for faster and more efficient big data processing.
[Figure: Local File System Processing]
[Figure: Distributed File System Processing]
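The arithmetic behind this example can be written out as a small sketch. It assumes the work divides evenly and that parallelism is perfect (no network or coordination cost), which real clusters only approximate. The function names are illustrative, not part of Hadoop.

```python
def data_per_node_tb(total_tb, num_nodes):
    """Data held by each node when a file is split evenly across the cluster."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return total_tb / num_nodes


def ideal_processing_hours(single_node_hours, num_nodes):
    """Best-case processing time, assuming perfect parallelism (an idealization)."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return single_node_hours / num_nodes


# The 40TB example from the text: 4 nodes, 4 hours on one machine.
print(data_per_node_tb(40, 4))        # 10.0
print(ideal_processing_hours(4, 4))   # 1.0
```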
HDFS
HDFS (Hadoop Distributed File System) is a distributed file system designed to store and manage large
datasets across multiple machines (nodes) in a cluster.
It is a core component of the Apache Hadoop ecosystem and the main storage system in Hadoop. It stores
large files by breaking them into blocks (default 128 MB in Hadoop 2 and later) and distributing them
across multiple low-cost machines.
HDFS ensures fault tolerance by keeping copies of each data block on different machines. This makes it
reliable, scalable and well suited to handling big data efficiently.
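To make the block idea concrete, here is a minimal sketch of how a file size maps to blocks and how much raw disk space its replicas use. It assumes the 128 MB default block size and the HDFS default replication factor of 3. The function names are illustrative and are not Hadoop APIs.

```python
MB = 1024 * 1024
DEFAULT_BLOCK_SIZE = 128 * MB   # default block size
DEFAULT_REPLICATION = 3         # default number of copies of each block


def split_into_blocks(file_size, block_size=DEFAULT_BLOCK_SIZE):
    """Return the size in bytes of each block for a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only occupies as much space as the data it holds.
        sizes.append(remainder)
    return sizes


def raw_storage(file_size, replication=DEFAULT_REPLICATION):
    """Total bytes used across the cluster once every block is replicated."""
    return file_size * replication


blocks = split_into_blocks(300 * MB)
print(len(blocks))                    # 3
print([b // MB for b in blocks])      # [128, 128, 44]
print(raw_storage(300 * MB) // MB)    # 900
```

A 300 MB file therefore becomes two full blocks plus one 44 MB block, and its three copies occupy 900 MB of raw disk across the cluster.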
HDFS Architecture
HDFS (Hadoop Distributed File System) follows a Master–Slave Architecture.
Master → NameNode
Slaves → DataNodes
Optional helper → Secondary NameNode / Checkpoint Node
It is designed for high fault tolerance, scalability, and large data storage.
Main Components
NameNode (Master) : The brain of HDFS.
Responsibilities:
Maintains metadata
o File names
o Directory structure
o Permissions
o Block locations
Decides where blocks are stored
Monitors DataNodes (via heartbeats)
Important:
Does NOT store actual file data
Stores metadata in:
o FsImage (a snapshot of the file system namespace)
o EditLog (a log of the changes made since the last FsImage)
DataNodes (Slaves)
These store the actual data blocks.
Responsibilities:
Store file blocks
Send heartbeat to NameNode
Send block reports
Handle read/write requests from clients
If a DataNode fails:
The NameNode notices the missing heartbeats and re-replicates that node's blocks from the surviving copies onto other DataNodes.
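The failure-handling idea can be sketched as follows: a node whose last heartbeat is too old is treated as dead, and any block with fewer live copies than the target needs new replicas. This is a simplified model, not the NameNode's actual code. The timeout value, node names, and function names are assumptions chosen for illustration.

```python
def find_dead_nodes(last_heartbeat, now, timeout):
    """Nodes whose most recent heartbeat is older than `timeout` seconds."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}


def under_replicated(block_locations, dead_nodes, target=3):
    """Map each block id to the number of extra replicas it needs."""
    needed = {}
    for block_id, nodes in block_locations.items():
        live = [n for n in nodes if n not in dead_nodes]
        if len(live) < target:
            needed[block_id] = target - len(live)
    return needed


# Hypothetical cluster state: dn3 has been silent for 61 seconds.
last_heartbeat = {"dn1": 100.0, "dn2": 98.0, "dn3": 40.0, "dn4": 99.0}
block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn1", "dn2", "dn4"],
}

dead = find_dead_nodes(last_heartbeat, now=101.0, timeout=30.0)
print(dead)                                     # {'dn3'}
print(under_replicated(block_locations, dead))  # {'blk_1': 1}
```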
Secondary NameNode (Checkpoint Node)
Merges FsImage + EditLog
Creates updated checkpoints
Helps reduce NameNode load
❗ Not a backup NameNode: it cannot take over if the NameNode fails
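A checkpoint can be pictured as replaying the EditLog on top of the last FsImage to produce a new, up-to-date image. The sketch below models the namespace as a dictionary from path to block ids and the EditLog as a list of tuples. Both representations and the operation names are assumptions made for illustration, not the real on-disk formats.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to the in-memory namespace."""
    op = edit[0]
    if op == "create":
        _, path, blocks = edit
        namespace[path] = list(blocks)
    elif op == "delete":
        _, path = edit
        namespace.pop(path, None)
    elif op == "rename":
        _, src, dst = edit
        namespace[dst] = namespace.pop(src)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edit_log):
    """Merge FsImage + EditLog into a new image without modifying the old one."""
    new_image = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image


fsimage = {"/logs/a.txt": ["blk_1"]}
edit_log = [
    ("create", "/logs/b.txt", ["blk_2", "blk_3"]),
    ("rename", "/logs/a.txt", "/logs/old.txt"),
    ("delete", "/logs/b.txt"),
]

new_image = checkpoint(fsimage, edit_log)
print(new_image)   # {'/logs/old.txt': ['blk_1']}
```

Once the new image is saved, the edits it already contains no longer need to be replayed at startup, which is how checkpointing keeps the EditLog short.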
How Data is Stored in HDFS
Step-by-step process:
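1. Client asks the NameNode to create the file (the file data itself never passes through the NameNode).
2. Client splits the file into blocks (default 128MB).
3. NameNode selects DataNodes for each block.
4. Blocks are stored with replication (default = 3).
5. Data is written from the client to the DataNodes in a pipeline:
o DN1 → DN2 → DN3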
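The pipeline in step 5 can be sketched as a chain in which each DataNode stores the block and passes it to the next one. This is a toy model: real HDFS streams each block in small packets and sends acknowledgements back up the pipeline, and the file data goes from the client straight to the DataNodes. The `DataNode` class, `write_file`, and `choose_targets` are hypothetical names, and `choose_targets` stands in for the NameNode's placement decision.

```python
class DataNode:
    """Toy DataNode that keeps blocks in memory."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def receive(self, block_id, data, downstream):
        # Store the block locally, then forward it to the next node, if any.
        self.blocks[block_id] = data
        if downstream:
            downstream[0].receive(block_id, data, downstream[1:])


def write_file(data, block_size, choose_targets):
    """Cut `data` into blocks and push each one through its pipeline."""
    placement = {}
    for index, start in enumerate(range(0, len(data), block_size)):
        block_id = f"blk_{index}"
        pipeline = choose_targets(block_id)   # e.g. [DN1, DN2, DN3]
        chunk = data[start:start + block_size]
        pipeline[0].receive(block_id, chunk, pipeline[1:])
        placement[block_id] = [dn.name for dn in pipeline]
    return placement


nodes = [DataNode(f"DN{i}") for i in range(1, 5)]


def choose_targets(block_id):
    # Hypothetical policy: always use the first three nodes.
    return nodes[:3]


# A 10-byte "file" with a 4-byte block size gives three blocks.
placement = write_file(b"abcdefghij", 4, choose_targets)
print(placement["blk_0"])          # ['DN1', 'DN2', 'DN3']
print(nodes[2].blocks["blk_2"])    # b'ij'
```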
Read Operation
1. Client requests file from NameNode.
2. NameNode provides block locations.
3. Client reads data directly from nearest DataNode.
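Choosing the "nearest" DataNode in step 3 can be sketched with a simple distance measure: 0 for the same node, 2 for another node in the same rack, and 4 for a node in a different rack. This is a simplified view of Hadoop's network-topology distance. The function names and the (rack, node) tuples are assumptions for illustration.

```python
def distance(a, b):
    """Distance between two (rack, node) locations."""
    if a == b:
        return 0        # same node
    if a[0] == b[0]:
        return 2        # same rack, different node
    return 4            # different rack


def nearest_replica(client, replicas):
    """Pick the replica location closest to the client."""
    return min(replicas, key=lambda location: distance(client, location))


client = ("rack1", "node3")
replicas = [("rack2", "node7"), ("rack1", "node5"), ("rack2", "node8")]
print(nearest_replica(client, replicas))   # ('rack1', 'node5')
```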
Rack Awareness
HDFS places replicas across different racks
Prevents data loss if one rack fails
Improves fault tolerance
Example:
1 replica in the local rack (on the writer's node when the client runs on a DataNode)
2 replicas on different nodes in one other rack
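The example placement above (one local replica, two in another rack) can be sketched as follows. It is a simplified illustration of the idea rather than Hadoop's actual placement code. The `topology` dictionary, the rack and node names, and `place_replicas` are all hypothetical.

```python
import random


def place_replicas(topology, writer_rack, writer_node, rng=random):
    """Return three (rack, node) locations for one block.

    topology maps each rack name to the list of node names in that rack.
    """
    first = (writer_rack, writer_node)
    # A remote rack needs at least two nodes to hold the other two replicas.
    remote_racks = [
        rack for rack, rack_nodes in topology.items()
        if rack != writer_rack and len(rack_nodes) >= 2
    ]
    if not remote_racks:
        raise ValueError("no remote rack with at least two nodes")
    remote = rng.choice(remote_racks)
    second_node, third_node = rng.sample(topology[remote], 2)
    return [first, (remote, second_node), (remote, third_node)]


topology = {
    "rackA": ["a1", "a2"],
    "rackB": ["b1", "b2", "b3"],
    "rackC": ["c1"],
}

replicas = place_replicas(topology, "rackA", "a1", random.Random(0))
print(replicas[0])                         # ('rackA', 'a1')
print(replicas[1][0], replicas[2][0])      # rackB rackB
```

If rackA fails entirely, two copies survive in rackB. If rackB fails, the local copy survives.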
Features of HDFS
It's easy to access the files stored in HDFS.
HDFS provides high availability and fault tolerance.
Provides scalability: nodes can be added or removed (scaled up or down) as required.
Data is stored in a distributed manner, i.e. various DataNodes are responsible for storing the data.
HDFS provides replication, which protects against data loss when a node fails.
HDFS provides high reliability and can store data at the scale of petabytes.
The NameNode and DataNodes have built-in web servers that make it easy to check the status of the
cluster.
Provides high throughput.