HDFS Overview and Key Features

Hadoop Distributed File System (HDFS) is designed for storing extremely large files across a network of commodity hardware, providing reliable and fault-tolerant data storage. It uses a master-slave architecture with a NameNode managing metadata and DataNodes storing the actual data, which is divided into blocks for efficient access and replicated for fault tolerance. While HDFS excels in high-throughput data access, it struggles with low-latency requirements and inefficiencies related to handling numerous small files.

Uploaded by

rajeshmeheto.ica

Introduction to Hadoop Distributed File System (HDFS)
Last Updated : 04 Apr, 2025




As data velocity grows, the data size easily outgrows the storage limit of a
single machine. One solution is to store the data across a network of machines.
Such filesystems are called distributed filesystems. Since the data is stored
across a network, all the complications of a network come into play.
This is where Hadoop comes in. It provides one of the most reliable
filesystems. HDFS (Hadoop Distributed File System) is a filesystem designed to
store extremely large files with a streaming data access pattern, and it runs
on commodity hardware. Let's elaborate on these terms:
• Extremely large files: Here we are talking about data in the range of
petabytes (1 PB = 1000 TB).
• Streaming data access pattern: HDFS is designed on the principle of write
once, read many times. Once data is written, large portions of the dataset
can be processed any number of times.
• Commodity hardware: Hardware that is inexpensive and easily available in
the market. This is one of the features that especially distinguishes HDFS
from other file systems.
Nodes: Master-slave nodes typically form the HDFS cluster.
1. NameNode (MasterNode):
• Manages all the slave nodes and assigns work to them.
• Executes filesystem namespace operations like opening, closing, and
renaming files and directories.
• Should be deployed on reliable, high-configuration hardware, not on
commodity hardware.
2. DataNode (SlaveNode):
• The actual worker nodes, which do the actual work like reading, writing,
processing, etc.
• They also perform creation, deletion, and replication upon instruction
from the master.
• They can be deployed on commodity hardware.
HDFS daemons: Daemons are the processes running in the background.
• NameNode:
o Runs on the master node.
o Stores metadata (data about data) like the file path, the number of
blocks, block IDs, etc.
o Requires a large amount of RAM.
o Stores metadata in RAM for fast retrieval, i.e. to reduce seek time,
though a persistent copy of it is kept on disk.
• DataNodes:
o Run on slave nodes.
o Require a large amount of storage, as the data is actually stored here.
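The kind of metadata listed above can be pictured as a pair of in-memory lookup tables. The following is only a toy sketch: the class names `FileEntry` and `Namespace`, the method names, and the sample path are hypothetical and do not reflect the actual NameNode data structures.

```python
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    """Hypothetical per-file record: the path and its ordered block IDs."""
    path: str
    block_ids: list[int] = field(default_factory=list)


@dataclass
class Namespace:
    """Toy stand-in for NameNode metadata held in RAM (an assumption, not HDFS code)."""
    files: dict[str, FileEntry] = field(default_factory=dict)
    # block ID -> names of the DataNodes that reported holding a replica
    block_locations: dict[int, set[str]] = field(default_factory=dict)

    def add_file(self, path: str, block_ids: list[int]) -> None:
        self.files[path] = FileEntry(path, list(block_ids))
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set())

    def report_block(self, datanode: str, block_id: int) -> None:
        """Record that a DataNode told the master it stores this block."""
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path: str) -> list[tuple[int, list[str]]]:
        """Return (block ID, sorted DataNode names) for each block of a file."""
        entry = self.files[path]
        return [(b, sorted(self.block_locations[b])) for b in entry.block_ids]


ns = Namespace()
ns.add_file("/logs/day1.txt", [1, 2])
ns.report_block("D1", 1)
ns.report_block("D2", 1)
ns.report_block("D3", 2)
print(ns.locate("/logs/day1.txt"))  # [(1, ['D1', 'D2']), (2, ['D3'])]
```

Because every lookup is a dictionary access in memory, answering "where are the blocks of this file?" needs no disk seek, which is the reason given above for keeping metadata in RAM.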
Data storage in HDFS: Now let's see how the data is stored in a distributed
manner.

Let's assume that a 100 TB file is inserted. The file is first divided into
blocks, say of 10 TB each for this example (the default block size is 128 MB
in Hadoop 2.x and above), and the masternode (namenode) assigns these blocks
to different datanodes (slavenodes). The datanodes replicate the blocks among
themselves, and the information about which blocks they contain is sent to the
master. The default replication factor is 3, which means that for each block
3 replicas exist (including the original). In [Link] we can increase or
decrease the replication factor, i.e. we can edit its configuration there.
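To make the numbers concrete, here is a small arithmetic sketch of what the 128 MB default block size and a replication factor of 3 imply for a 100 TB file. The function name `storage_plan` is made up for illustration, and binary units (1 TB = 1024^4 bytes) are an assumption.

```python
TB = 1024 ** 4  # assumption: binary units
MB = 1024 ** 2


def storage_plan(file_size, block_size=128 * MB, replication=3):
    """Count blocks, total replicas, and raw bytes stored for one file."""
    blocks = (file_size + block_size - 1) // block_size  # integer ceiling
    return {
        "blocks": blocks,
        "replicas": blocks * replication,
        "raw_bytes": file_size * replication,
    }


plan = storage_plan(100 * TB)
print(plan["blocks"])           # 819200
print(plan["replicas"])         # 2457600
print(plan["raw_bytes"] // TB)  # 300
```

Under these assumptions the cluster tracks 819,200 blocks for the file and needs 300 TB of raw disk, which is the storage cost paid for fault tolerance.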

Note: The MasterNode has a record of everything: it knows the location and
details of every datanode and the blocks it contains, i.e. nothing is done
without the permission of the masternode.

Why divide the file into blocks?

Answer: Let's assume that we don't divide the file. It is very difficult to
store a 100 TB file on a single machine. Even if we could store it, each read
and write operation on that whole file would incur a very high seek time. But
if we have multiple blocks of size 128 MB, it becomes easy to perform various
read and write operations on them compared to doing it on the whole file at
once. So we divide the file to get faster data access, i.e. to reduce seek
time.
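One way to see the benefit is to work out which blocks a given read actually touches. This is a minimal sketch assuming fixed-size 128 MB blocks; `blocks_for_range` is a hypothetical helper, not an HDFS API.

```python
MB = 1024 ** 2
BLOCK_SIZE = 128 * MB


def blocks_for_range(offset, length, block_size=BLOCK_SIZE):
    """Return the indices of the blocks overlapping bytes [offset, offset + length)."""
    if length <= 0:
        return []
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return list(range(first, last + 1))


print(blocks_for_range(0, 1 * MB))          # [0]
print(blocks_for_range(127 * MB, 2 * MB))   # [0, 1]
print(blocks_for_range(300 * MB, 10 * MB))  # [2]
```

A 10 MB read in the middle of a huge file only involves one block, so only the datanodes holding that block take part, and different blocks can be read from different datanodes at the same time.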

Why replicate the blocks in data nodes while storing?

Answer: Let's assume we don't replicate and only one copy of a block is
present on datanode D1. If datanode D1 crashes, we lose the block, which makes
the overall data inconsistent and faulty. So we replicate the blocks to
achieve fault tolerance.

Terms related to HDFS:

• HeartBeat: The signal that a datanode continuously sends to the namenode.
If the namenode doesn't receive a heartbeat from a datanode, it considers
that datanode dead.
• Balancing: If a datanode crashes, the blocks present on it are gone too,
and those blocks become under-replicated compared to the remaining blocks.
The master node (namenode) then signals the datanodes that hold replicas of
the lost blocks to replicate them, so that the overall distribution of
blocks is balanced.
• Replication: It is done by the datanodes.

Note: No two replicas of the same block are present on the same datanode.
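The heartbeat and balancing terms above, together with the rule that a datanode never holds two replicas of the same block, can be sketched as follows. This is a simplified illustration, not the real NameNode logic: the function names, the timeout value, and the timestamps are all made up, and real HDFS placement also considers factors not covered in this article.

```python
def find_dead_nodes(last_heartbeat, now, timeout_seconds):
    """Return datanodes whose last heartbeat is older than the timeout."""
    return sorted(
        node for node, seen in last_heartbeat.items()
        if now - seen > timeout_seconds
    )


def plan_rereplication(block_locations, live_nodes, replication=3):
    """List (block, source, target) copies needed to restore the replication factor.

    A target is only chosen among live nodes that do not already hold the
    block, so no datanode ends up with two replicas of the same block.
    """
    plan = []
    for block_id in sorted(block_locations):
        holders = sorted(block_locations[block_id] & live_nodes)
        if not holders:
            continue  # every replica is gone; nothing left to copy from
        candidates = sorted(live_nodes - set(holders))
        missing = max(replication - len(holders), 0)
        for target in candidates[:missing]:
            plan.append((block_id, holders[0], target))
    return plan


# Illustrative timestamps in seconds; the timeout is not an HDFS default.
last_heartbeat = {"D1": 100.0, "D2": 128.0, "D3": 129.5, "D4": 129.0, "D5": 127.0}
dead = find_dead_nodes(last_heartbeat, now=130.0, timeout_seconds=10.0)
live = set(last_heartbeat) - set(dead)

locations = {"b1": {"D1", "D2", "D3"}, "b2": {"D2", "D3", "D4"}}
print(dead)                                 # ['D1']
print(plan_rereplication(locations, live))  # [('b1', 'D2', 'D4')]
```

Here D1 is declared dead because its heartbeat is 30 seconds old, block b1 drops to two live replicas, and one new copy is scheduled onto D4, which did not hold b1 before.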

Features:
• Distributed data storage.
• Blocks reduce seek time.
• The data is highly available, as the same block is present on multiple
datanodes.
• Even if multiple datanodes are down we can still do our work, which makes
it highly reliable.
• High fault tolerance.

Limitations: Though HDFS provides many features, there are some areas
where it doesn't work well.
• Low-latency data access: Applications that require low-latency access to
data, i.e. in the range of milliseconds, will not work well with HDFS,
because HDFS is designed for high throughput of data, even at the cost of
latency.
• Small file problem: Having lots of small files results in lots of seeks and
lots of movement from one datanode to another to retrieve each small file.
This whole process is a very inefficient data access pattern.
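The small file problem can be illustrated by counting blocks. Blocks are not shared between files, so every non-empty file needs at least one block of its own. The sketch below compares 1 GB stored as a single file with the same 1 GB stored as 64 KB files; the sizes and the helper name `blocks_needed` are chosen only for illustration.

```python
MB = 1024 ** 2
KB = 1024


def blocks_needed(file_sizes, block_size=128 * MB):
    """Total blocks for a set of files; each file is split into its own blocks."""
    return sum((size + block_size - 1) // block_size for size in file_sizes)


one_big_file = [1024 * MB]             # 1 GB in a single file
many_small_files = [64 * KB] * 16384   # the same 1 GB as 64 KB files

print(sum(one_big_file) == sum(many_small_files))  # True
print(blocks_needed(one_big_file))                 # 8
print(blocks_needed(many_small_files))             # 16384
```

The same volume of data turns into 16,384 separate blocks to locate and fetch instead of 8, which is where the extra seeks and hops between datanodes come from.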

Common questions

What architectural features of HDFS enable it to provide high data availability and reliability?

HDFS offers high data availability and reliability through its distributed architecture. Key features include data replication across multiple nodes, which allows HDFS to recover from node failures without data loss. Block placement ensures that no two replicas of a block reside on the same DataNode, minimizing risk in cases of hardware failure. Splitting large files into smaller, manageable blocks reduces seek times and lets processing be spread across available resources, further improving data access efficiency. Together, these features make HDFS a robust system that keeps operating through failures.

How does the structure of HDFS facilitate storage of large datasets efficiently compared to traditional filesystems?

HDFS facilitates efficient storage of large datasets by dividing files into smaller blocks, which are then stored across multiple DataNodes. This approach allows the system to handle petabyte-scale datasets that a single machine could not hold. Block division not only makes storage manageable across distributed nodes but also improves data access speed and reliability through parallel operations and replication. By spreading storage across nodes, HDFS overcomes the limitation of traditional filesystems, which are constrained by the physical storage capacity of an individual machine, and improves both fault tolerance and data availability at a massive scale.

What role do replication and block distribution play in ensuring fault tolerance within HDFS?

Replication and block distribution are central to HDFS's fault tolerance. By replicating each data block across multiple DataNodes (three by default), the system ensures that even if a DataNode fails, the data can still be retrieved from other nodes that hold replicas. This redundancy prevents data loss and maintains consistency despite hardware failures. Additionally, no two replicas of the same block are stored on the same node, which further improves fault tolerance by spreading the data out and limiting the impact of any single node failure. Replication also allows load to be balanced across nodes and supports continued operation even if some nodes encounter issues.

Discuss how HDFS handles data inconsistencies that might arise during node failures.

HDFS handles potential data inconsistencies due to node failures using block replication and the HeartBeat mechanism. When a node fails, the NameNode, upon not receiving a HeartBeat, marks the node as dead and starts re-replicating the blocks it stored, using the surviving DataNodes that hold copies. By restoring the preset replication factor, HDFS ensures that a consistent state is recoverable even after hardware failures. Furthermore, the NameNode keeps track of block locations and statuses, coordinating replication to keep the cluster balanced and prevent data inconsistencies. This approach enables HDFS to preserve the consistency and integrity of data across failures.

Why is it important for the NameNode to be deployed on reliable hardware in an HDFS setup?

The NameNode is crucial in an HDFS cluster because it operates as the master node, managing namespace operations and coordinating the DataNodes by maintaining all filesystem metadata. Its failure can therefore render an entire HDFS cluster inoperative. It is important to deploy the NameNode on reliable, high-performance hardware to ensure system reliability and fast retrieval of the metadata stored in RAM. Reliable hardware reduces the risk that this single point of failure goes down and supports the NameNode's intensive demands for memory and processing power.

Why is the manual adjustment of the replication factor in HDFS configuration important, and what impact can it have on system performance?

Adjusting the replication factor in the HDFS configuration is a trade-off between storage efficiency and fault tolerance. A higher replication factor increases data reliability by providing more backup copies, while a lower factor saves storage resources. The replication factor also affects system performance: higher replication reduces the risk of data loss during node failures at the cost of additional storage and network use, which can affect overall cluster throughput. Conversely, a lower factor economizes on storage and network bandwidth but reduces data reliability. Balancing the replication factor against system needs and resource availability is therefore critical.

In what scenarios might HDFS not be the ideal storage solution, and why?

HDFS may not be ideal in scenarios that require low-latency data access or involve numerous small files. The architecture is designed to prioritize high throughput over low latency, which makes it inefficient for applications needing access times measured in milliseconds. Moreover, the "small file problem" arises because HDFS handles many small files inefficiently, causing a large number of seek operations and movement between nodes, which leads to poor data retrieval performance. Applications that need efficient handling of numerous small files or extremely low-latency access are better served by storage solutions optimized for such workloads.

How does HDFS balance the load across DataNodes, and what occurs when a DataNode becomes unavailable?

HDFS balances the load across DataNodes by distributing file blocks across the available nodes and keeping multiple replicas of each block. When a DataNode becomes unavailable, the NameNode detects the loss through the missing HeartBeat signals and restores balance by instructing other DataNodes that hold replicas of the lost blocks to create new replicas. This re-replication restores the configured replication factor and prevents blocks from staying under-replicated. The result is efficient resource utilization and preserved data availability across the cluster despite individual node failures.

Explain why DataNodes in HDFS can be deployed on commodity hardware, unlike the NameNode.

DataNodes in HDFS are responsible for storing the actual data and executing read, write, and replication operations. Because HDFS replicates data and distributes blocks across multiple nodes, it can use inexpensive commodity hardware for DataNodes without compromising system reliability or data availability. If a DataNode fails, its data is still accessible from other nodes that hold replicas. In contrast, the NameNode requires more robust hardware due to its critical role in maintaining metadata and coordinating node activities. Deploying the NameNode on high-performance hardware reduces the likelihood of failure in this crucial coordination role.

What is the role of the HeartBeat mechanism in maintaining the integrity of HDFS operations?

The HeartBeat mechanism in HDFS is critical for maintaining integrity and coordination within the cluster. Each DataNode sends regular HeartBeat signals to the NameNode, confirming that it is active. These signals help the NameNode identify failed DataNodes promptly, since the absence of a heartbeat indicates that a node might be down. Upon identifying inactive nodes, the NameNode can initiate re-replication of their blocks from surviving nodes, ensuring data availability and system stability despite hardware failures. This mechanism enables dynamic cluster management and preserves data consistency throughout distributed operations.


Common questions

What architectural features of HDFS enable it to provide high data availability and reliability?

How does the structure of HDFS facilitate storage of large datasets efficiently compared to traditional filesystems?

What role do replication and block distribution play in ensuring fault tolerance within HDFS?

Discuss how HDFS handles data inconsistencies that might arise during node failures.

Why is it important for the NameNode to be deployed on reliable hardware in an HDFS setup?

Why is the manual adjustment of the replication factor in HDFS configuration important, and what impact can it have on system performance?

In what scenarios might HDFS not be the ideal storage solution, and why?

How does HDFS balance the load across DataNodes, and what occurs when a DataNode becomes unavailable?

Explain why DataNodes in HDFS can be deployed on commodity hardware, unlike the NameNode.

What is the role of the HeartBeat mechanism in maintaining the integrity of HDFS operations?
