
Unit - 2 - Hadoop Distributed File System

HDFS (Hadoop Distributed File System) is a distributed file system designed to manage large datasets across multiple machines, ensuring fault tolerance and scalability. It operates on a Master-Slave architecture with a NameNode managing metadata and DataNodes storing actual data blocks. HDFS allows for efficient big data processing by splitting files into blocks and replicating them across nodes, enhancing reliability and throughput.

Unit – 2

Hadoop Distributed File System

HDFS (Hadoop Distributed File System): The Design of HDFS, HDFS Concepts, Command Line Interface,
Hadoop file system interfaces, Data flow, Data Ingest with Flume and Sqoop, Hadoop archives, Hadoop
I/O: Compression, Serialization, Avro and File-Based Data structures.

Hadoop - HDFS (Hadoop Distributed File System)

Before learning about HDFS (Hadoop Distributed File System), it’s important to understand what a file
system is. A file system is a way an operating system organizes and manages files on disk storage. It helps
users store, maintain, and retrieve data from the disk.

Example: Windows uses file systems such as NTFS (New Technology File System) and FAT32 (File
Allocation Table 32). FAT32 is the older of the two but is still supported by Windows.
Similarly, Linux uses file systems such as ext3 and ext4.

Distributed File System

DFS stands for Distributed File System. It is the concept of storing files across multiple nodes in a
distributed manner. A DFS provides the abstraction of a single large system whose storage equals the
combined storage of the nodes in the cluster.

Why Do We Need DFS?

Storing very large files (e.g., 30TB) on a single system is impractical because:

 Disk capacity of one machine is limited and can only grow so much.

 Processing huge datasets on a single machine is inefficient and slow.

Distributed File Systems (DFS) overcome these issues by storing data across multiple machines, enabling
faster and scalable processing.

Example:

Suppose you have a 40TB file to process. On a single machine, it might take about 4 hours to complete.
With a Distributed File System (DFS), the 40TB file is split across 4 nodes in a cluster, with each node
storing 10TB. Since all nodes work simultaneously, processing time drops to about 1 hour, assuming the
work divides evenly. This is why a DFS is essential for fast and efficient big data processing.
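
The arithmetic behind this example can be sketched as follows. The rate of 10 TB per hour per node is an assumption implied by the numbers above (40 TB in 4 hours on one machine), and the function name `processing_hours` is only illustrative. The model is idealized: it ignores network and coordination overhead, which a real cluster always pays.

```python
def processing_hours(total_tb: float, nodes: int,
                     tb_per_hour_per_node: float = 10.0) -> float:
    """Idealized time to process a dataset split evenly across nodes.

    The default rate is an assumption taken from the example
    (one machine processes 40 TB in 4 hours).
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    per_node_tb = total_tb / nodes  # each node handles an equal share
    return per_node_tb / tb_per_hour_per_node


single = processing_hours(40, nodes=1)   # one machine holds all 40 TB
cluster = processing_hours(40, nodes=4)  # four nodes hold 10 TB each
print(f"single machine: {single} h, 4-node cluster: {cluster} h")
```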

[Figures: Local File System Processing; Distributed File System Processing]

HDFS

HDFS (Hadoop Distributed File System) is a distributed file system designed to store and manage large
datasets across multiple machines (nodes) in a cluster. It is a core component of the Apache Hadoop
ecosystem and serves as Hadoop's main storage system.

HDFS stores large files by breaking them into blocks (128 MB by default) and distributing them across
multiple low-cost machines.

HDFS ensures fault tolerance by keeping copies of each data block on different machines. This makes it
reliable, scalable and well suited to handling big data efficiently.

HDFS Architecture

HDFS (Hadoop Distributed File System) follows a Master–Slave Architecture.

 Master → NameNode

 Slaves → DataNodes

 Optional helper → Secondary NameNode / Checkpoint Node

It is designed for high fault tolerance, scalability, and large data storage.

Main Components

NameNode (Master): The brain of HDFS.

Responsibilities:

 Maintains metadata

o File names

o Directory structure

o Permissions

o Block locations

 Decides where blocks are stored


 Monitors DataNodes (via heartbeats)

Important:

 Does NOT store actual file data

 Stores metadata in:

o FsImage (a snapshot of the file system namespace)

o EditLog (a log of the changes made since the last FsImage)

DataNodes (Slaves)

These store the actual data blocks.

Responsibilities:

 Store file blocks

 Send heartbeat to NameNode

 Send block reports

 Handle read/write requests from clients

If a DataNode fails:

 The NameNode stops receiving its heartbeats, marks it as dead, and re-replicates its blocks from the remaining copies onto other DataNodes.
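
The failure handling described above can be illustrated with a small toy model. It is a sketch, not HDFS code: the names `dead_nodes` and `under_replicated`, the timestamps, and the 30-second timeout are all assumptions chosen for the example.

```python
from typing import Dict, Set


def dead_nodes(last_heartbeat: Dict[str, float], now: float,
               timeout: float) -> Set[str]:
    """Return the DataNodes whose last heartbeat is older than the timeout."""
    return {dn for dn, ts in last_heartbeat.items() if now - ts > timeout}


def under_replicated(block_map: Dict[str, Set[str]], dead: Set[str],
                     replication: int = 3) -> Dict[str, int]:
    """Map each block to the number of extra copies it needs."""
    needs = {}
    for block, holders in block_map.items():
        live = holders - dead  # replicas on nodes that are still alive
        if len(live) < replication:
            needs[block] = replication - len(live)
    return needs


# Hypothetical cluster state: dn3 has been silent for 61 seconds.
heartbeats = {"dn1": 100.0, "dn2": 98.0, "dn3": 40.0, "dn4": 99.0}
block_map = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn1", "dn2", "dn4"},
}

dead = dead_nodes(heartbeats, now=101.0, timeout=30.0)
todo = under_replicated(block_map, dead)
print(dead, todo)
```

Here only `blk_1` lost a replica, so it is the only block that would be copied to another DataNode.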

Secondary NameNode (Checkpoint Node)

 Merges FsImage + EditLog

 Creates updated checkpoints

 Helps reduce NameNode load by keeping the EditLog small

 ❗ Not a backup NameNode: it cannot take over if the NameNode fails
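
A checkpoint can be pictured as replaying the logged changes over the last snapshot. The sketch below is a toy model under stated assumptions: the dictionary snapshot, the `(operation, path)` tuples and the function `apply_edits` are hypothetical and do not reflect the real FsImage or EditLog file formats.

```python
from typing import Dict, List, Tuple

Edit = Tuple[str, str]  # (operation, path), a made-up log record format


def apply_edits(fsimage: Dict[str, str], edit_log: List[Edit]) -> Dict[str, str]:
    """Replay logged operations over a namespace snapshot.

    Returns a new snapshot; the input snapshot is left unchanged.
    """
    merged = dict(fsimage)
    for op, path in edit_log:
        if op == "mkdir":
            merged[path] = "dir"
        elif op == "create":
            merged[path] = "file"
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


fsimage = {"/": "dir", "/logs": "dir"}
edit_log = [("create", "/logs/a.txt"), ("mkdir", "/tmp"), ("delete", "/logs/a.txt")]

checkpoint = apply_edits(fsimage, edit_log)
edit_log = []  # once the checkpoint exists, the log can start over
print(checkpoint)
```

Because the merged snapshot already contains every logged change, the log can be truncated, which is why checkpointing keeps the EditLog from growing without bound.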

How Data is Stored in HDFS


Step-by-step process:

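1. Client asks the NameNode to create the file.

2. Client splits the file into blocks (default 128MB).

3. For each block, the NameNode selects the DataNodes that will hold it.

4. Blocks are stored with replication (default = 3).

5. The client writes each block directly to the DataNodes in a pipeline; the data does not pass through the NameNode:

o DN1 → DN2 → DN3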
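
The block and replication arithmetic in the steps above can be sketched as follows. The 500 MB file size and the function name `split_into_blocks` are assumptions for the example; the block size and replication factor are the defaults stated above.

```python
from typing import List

BLOCK_SIZE_MB = 128  # default block size stated above
REPLICATION = 3      # default replication factor stated above


def split_into_blocks(file_size_mb: int,
                      block_size_mb: int = BLOCK_SIZE_MB) -> List[int]:
    """Return the size of each block; only the last block may be smaller."""
    if file_size_mb <= 0:
        return []
    full, rest = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full
    if rest:
        sizes.append(rest)
    return sizes


blocks = split_into_blocks(500)               # a hypothetical 500 MB file
raw_storage_mb = sum(blocks) * REPLICATION    # space used across the cluster
print(blocks, raw_storage_mb)
```

A 500 MB file therefore becomes three full blocks plus one 116 MB block, and with three replicas it occupies 1500 MB of raw cluster storage.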

Read Operation

1. Client requests file from NameNode.

2. NameNode provides block locations.

3. Client reads each block directly from the nearest DataNode that holds a replica.
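
"Nearest" can be made concrete by treating the cluster as a tree of racks and nodes and counting the hops between two locations. The sketch below assumes locations are written as `/rack/node` strings; the functions `distance` and `nearest_replica` and the sample locations are illustrative, not part of any HDFS API.

```python
from typing import List


def distance(a: str, b: str) -> int:
    """Tree distance between two locations written like '/rack1/dn1'.

    0 = same node, 2 = same rack, 4 = different racks.
    """
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    # hops up from a to the common ancestor, then down to b
    return (len(pa) - common) + (len(pb) - common)


def nearest_replica(client: str, replicas: List[str]) -> str:
    """Pick the replica location closest to the client."""
    return min(replicas, key=lambda r: distance(client, r))


client = "/rack1/dn1"
replicas = ["/rack2/dn5", "/rack1/dn2", "/rack2/dn6"]
chosen = nearest_replica(client, replicas)
print(chosen)
```

The client on `rack1` reads from the replica on its own rack rather than crossing to `rack2`.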

Rack Awareness

 HDFS places replicas across different racks

 Prevents data loss if one rack fails

 Improves fault tolerance

Example (replication factor 3):

 1 replica in the local rack

 2 replicas on different nodes in one other rack
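
The placement in this example can be sketched as a small function. This is a simplified, deterministic illustration: the function `place_replicas` and the rack layout are assumptions, and a real cluster chooses among eligible nodes rather than always taking the first ones.

```python
from typing import Dict, List


def place_replicas(writer: str, racks: Dict[str, List[str]]) -> List[str]:
    """Pick three DataNodes: the writer's node, plus two nodes on one other rack.

    Raises StopIteration if the writer is unknown or no other rack
    has at least two nodes.
    """
    local_rack = next(r for r, nodes in racks.items() if writer in nodes)
    remote_rack = next(
        r for r, nodes in racks.items()
        if r != local_rack and len(nodes) >= 2
    )
    second, third = racks[remote_rack][:2]
    return [writer, second, third]


# Hypothetical two-rack cluster.
racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
placement = place_replicas("dn1", racks)
print(placement)
```

If `rack1` fails entirely, two copies of the block survive on `rack2`; if `rack2` fails, the copy on `dn1` survives.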

Features of HDFS

 Files stored in HDFS are easy to access.

 HDFS provides high availability and fault tolerance.

 It scales up or down by adding or removing nodes as required.

 Data is stored in a distributed manner, i.e. multiple DataNodes share responsibility for storing it.

 Replication protects against data loss when a node fails.

 It can store very large volumes of data, in the range of petabytes.

 The NameNode and DataNodes include built-in servers that make it easy to retrieve cluster information.

 It provides high throughput.
