MODULE-2 HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
Introduction to Distributed Filesystems
When a dataset exceeds the storage capacity of a single physical machine, it becomes necessary to partition
and distribute it across multiple machines. Distributed filesystems manage this distributed storage across a
network of machines, allowing for scalability and fault tolerance.
Key Points:
• Definition: Distributed filesystems are filesystems that operate over a network, managing storage
across multiple machines.
• Complexity: They are more complex than traditional disk filesystems due to the challenges of network
programming and ensuring data integrity across nodes.
• Challenges:
o Node Failure: Ensuring the filesystem can tolerate the failure of nodes without data loss.
o Network Issues: Handling latency, bandwidth limitations, and reliability of the network.
2. Hadoop Distributed Filesystem (HDFS)
HDFS is a distributed filesystem designed to store and manage large datasets across a cluster of machines. It
is a core component of the Hadoop ecosystem.
Key Features:
• Scalability: HDFS is designed to scale out by adding more nodes to the cluster, which increases
storage capacity and computational power.
• Fault Tolerance: It replicates data across multiple nodes to ensure data is not lost in case of hardware
failures.
• High Throughput: Optimized for high throughput rather than low latency, making it suitable for
processing large files.
Key Components:
• NameNode: Manages the metadata of the filesystem, such as file names, directory structure, and file-
to-block mapping. It is a single point of failure and crucial for the filesystem's operation.
• DataNode: Stores the actual data blocks. Data is replicated across multiple DataNodes for fault
tolerance.
• Secondary NameNode: Performs periodic checkpoints of the NameNode's metadata to provide a
backup in case of failures.
HDFS Operation:
• Data Storage: Files are split into blocks (default size is 128 MB or 256 MB) and distributed across the
cluster. Each block is replicated multiple times (default replication factor is 3) to ensure data reliability.
• Access: HDFS is designed for large-scale data processing tasks and provides write-once, read-many
access patterns. It is not optimized for small file storage or frequent updates.
Integration with Other Storage Systems
Hadoop’s filesystem abstraction allows it to integrate with various storage systems beyond HDFS:
• Local Filesystem: Hadoop can use the local filesystem for development or smaller datasets.
• Amazon S3: Hadoop can interact with Amazon S3 as a storage backend, allowing for scalable storage
in the cloud. This integration uses the S3A file system client to read and write data.
• RDBMS, DateLake, DataWarehouses, streaming systems, cloud systems and so on
Key Considerations:
• Data Consistency: When integrating with other storage systems, ensure that the data consistency
models are compatible with your use case.
• Performance: Consider the performance implications of different storage backends, particularly in
terms of latency and throughput.
Design and Use Cases of HDFS
1. Overview of HDFS Design
HDFS (Hadoop Distributed File System) is specifically designed to address the needs of storing and processing
very large files in a distributed computing environment. Here’s a breakdown of its key design principles:
Key Features:
• Very Large Files:
o Definition: In HDFS, "very large" refers to files that range from hundreds of megabytes to
terabytes, and even petabytes in some cases.
o Purpose: It is optimized for storing such large datasets efficiently, which is essential for big
data applications and analytics.
• Streaming Data Access:
o Access Pattern: HDFS is built around a write-once, read-many-times model. This model suits
scenarios where data is initially ingested and then analyzed multiple times.
o Performance: The design prioritizes high throughput for reading large datasets over low-
latency access. The time to read the entire dataset is more critical than the time to read the first
record.
• Commodity Hardware:
o Cost Efficiency: HDFS is designed to run on clusters of inexpensive, commonly available
hardware. This design choice makes it cost-effective and scalable.
o Fault Tolerance: It anticipates hardware failures and is designed to continue operating with
minimal user impact, leveraging the redundancy built into the system.
2. Challenges and Limitations
While HDFS is powerful for many use cases, there are specific scenarios where it may not be the best fit:
Low-Latency Data Access:
• Latency Constraints: Applications requiring quick, sub-second data access (in the tens of
milliseconds range) may not perform well with HDFS. Its optimization for high throughput rather than
low latency means it may not meet the performance needs of such applications.
• Alternative Solutions: HBase, which is built on top of HDFS, is often used for scenarios requiring
low-latency data access. HBase provides capabilities for fast random access to data.
Handling Lots of Small Files:
• Metadata Management: HDFS stores metadata (such as file and directory information) in memory
on the NameNode. This means the scalability of HDFS in terms of the number of files is limited by the
memory capacity of the NameNode.
• Scalability Constraints: Each file, directory, and block take up about 150 bytes of memory. For
example, handling one million files requires at least 300 MB of memory. While managing millions of
files is feasible, managing billions of files can exceed current hardware capabilities.
Multiple Writers and Arbitrary Modifications:
• Write Restrictions: HDFS supports a single writer per file, with data being appended to the end of the
file. It does not support multiple writers or modifications at arbitrary file offsets.
• Future Considerations: Although support for multiple writers and arbitrary modifications might be
introduced in the future, such features are likely to be less efficient compared to the current append-
only model.
HDFS Concepts
Blocks
1. Understanding Blocks in Filesystems
In both traditional filesystems and HDFS, the concept of "blocks" is fundamental. Here's a detailed look at
what blocks are and their significance:
Filesystem Blocks vs. HDFS Blocks:
• Traditional Filesystems:
o Disk Blocks: The smallest unit of storage on a disk, usually 512 bytes.
o Filesystem Blocks: Typically larger, a few kilobytes in size, used by the filesystem to manage
data. The filesystem block size is an integral multiple of the disk block size.
o Tools: Commands like df and fsck operate at the filesystem block level for maintenance and
checking.
• HDFS Blocks:
o Size: Much larger than traditional filesystem blocks, with a default size of 128 MB.
o Function: Files in HDFS are divided into blocks of this size, which are stored independently
across the cluster.
o Storage Efficiency: If a file is smaller than a block, only the space needed for the file is used
(e.g., a 1 MB file on a 128 MB block uses only 1 MB of disk space).
2. Why Are HDFS Blocks So Large?
The choice of large block sizes in HDFS is driven by several practical considerations:
• Minimizing Seek Time:
o Seek Time vs. Transfer Rate: The time to seek to the start of a block can be minimized relative
to the time taken to transfer the data. Large blocks ensure that the transfer time dominates over
the seek time.
• MapReduce Considerations:
o Map Tasks: In MapReduce, tasks typically process one block at a time. Having too few blocks
compared to the number of nodes can lead to slower job execution due to insufficient
parallelism.
3. Benefits of Block Abstraction
The block abstraction in HDFS provides several advantages:
• Handling Large Files:
o Scalability: HDFS allows files to exceed the size of any single disk in the cluster. Blocks can
be distributed across all available disks, enabling the storage of very large files.
• Simplified Storage Management:
o Fixed Size: Blocks are of a fixed size, making it straightforward to calculate storage
requirements and manage space on disks.
o Metadata Management: Since blocks are only chunks of data, file metadata (like permissions)
is managed separately from the blocks, simplifying the storage subsystem.
• Replication and Fault Tolerance:
o Replication: To ensure fault tolerance, each block is replicated across multiple machines
(typically three). This redundancy allows for recovery from disk or machine failures.
o Automatic Recovery: If a block becomes unavailable due to corruption or a failure, it can be
replicated from other available copies to restore the required replication factor.
• Load Distribution:
o Read Load: Applications may set a higher replication factor for blocks in frequently accessed
files to distribute the read load across the cluster.
4. Filesystem Check (fsck) with HDFS
HDFS provides a command to understand and manage blocks:
• Command: hdfs fsck / -files -blocks
o Function: Lists the blocks that make up each file in the HDFS filesystem.
o Purpose: Useful for checking the health and integrity of the filesystem and its blocks.
Namenodes and Datanodes in HDFS
HDFS (Hadoop Distributed File System) operates with a master-worker architecture involving two main types
of nodes: Namenodes and Datanodes. Understanding their roles and mechanisms is essential for maintaining
the health and performance of an HDFS cluster.
1. Namenode
The Namenode is the master node in the HDFS architecture, responsible for managing the filesystem
namespace and metadata.
Responsibilities:
• Filesystem Namespace Management:
o Maintains the directory tree of the filesystem and metadata for all files and directories.
o Stores metadata persistently on local disks in two files: the namespace image and the edit log.
• Block Management:
o Keeps track of which Datanodes store the blocks for each file.
o Does not persistently store block locations; instead, this information is reconstructed from
Datanodes during system startup.
• Client Interaction:
o Clients interact with the Namenode to perform filesystem operations.
o Provides a filesystem interface, abstracting the complexities of the Namenode and Datanodes
from the user.
Failure and Recovery:
• Critical Role:
o If the Namenode fails, the filesystem cannot be used, and data loss can occur because the system
would not know how to reconstruct files from the blocks on Datanodes.
• Resilience Mechanisms:
o Backup:
▪ Namenode's persistent state is backed up to multiple filesystems. This includes
synchronous and atomic writes to local disks and remote NFS mounts.
o Secondary Namenode:
▪ Role: Periodically merges the namespace image with the edit log to prevent the edit log
from becoming too large.
▪ Operation: Runs on a separate physical machine with sufficient CPU and memory. It
keeps a copy of the merged namespace image.
▪ Limitations: The state of the Secondary Namenode lags behind the primary, so there is
a risk of data loss if the primary fails. In such cases, the primary's metadata files are
copied to the Secondary Namenode, which then acts as the new primary.
▪ Alternative: A hot standby Namenode can be used for high availability, which provides
a more robust failover solution.
2. Datanodes
Datanodes are the worker nodes in the HDFS architecture, responsible for storing and managing the data
blocks.
Responsibilities:
• Block Storage and Retrieval:
o Store and retrieve blocks as instructed by clients or the Namenode.
o Report periodically to the Namenode with lists of blocks they are storing.
Operation:
• Data Management:
o Datanodes handle the actual data blocks, and their efficiency directly impacts the performance
of HDFS.
o They ensure data redundancy and availability by storing multiple replicas of each block.
3. Interaction Between Namenode and Datanodes
• Client Operations:
o Clients interact with the Namenode to get information about where to find the blocks of a file.
o The Namenode directs clients to the appropriate Datanodes for block retrieval or storage.
• Datanode Reporting:
o Datanodes regularly send heartbeat signals and block reports to the Namenode.
o This reporting helps the Namenode monitor the health of the Datanodes and manage data
replication and block recovery.
Block Caching
Purpose:
• Block caching improves read performance by storing frequently accessed blocks in the memory of
Datanodes.
• It allows quick access to data without having to read from disk repeatedly.
Mechanics:
• Caching Location: Blocks are cached in an off-heap memory area on the Datanodes. Off-heap
memory is used to avoid garbage collection overhead.
• Cache Scope: By default, each block is cached in one Datanode’s memory. This is configurable on a
per-file basis, allowing multiple Datanodes to cache the same block if needed.
• Cache Management: Administrators can specify which files should be cached and the duration of
caching using cache directives added to a cache pool.
• Cache Pools: Cache pools are used to manage cache permissions and resource usage. They help in
organizing and controlling access to cached data.
Benefits:
• Performance Improvement: Increases read performance by reducing disk I/O. For instance, job
schedulers in frameworks like MapReduce or Spark can schedule tasks on Datanodes that have relevant
blocks cached, leading to faster data access.
• Use Cases: Ideal for use cases such as small lookup tables used in joins, where data is frequently
accessed.
HDFS Federation
Purpose:
• HDFS Federation enhances the scalability of HDFS by allowing the use of multiple Namenodes.
Mechanics:
• Namespace Management: In HDFS Federation, the filesystem namespace is split across multiple
Namenodes. Each Namenode manages a portion of the namespace and its associated block pool.
• Namespace Volumes: Each Namenode is responsible for a namespace volume, which includes
metadata for its portion of the namespace. These volumes are independent, meaning that the failure of
one Namenode does not affect others.
• Block Pool Storage: Datanodes register with all Namenodes in the cluster and store blocks from
multiple block pools. This ensures that blocks from different namespaces can be stored and managed
efficiently.
Access:
• Client Interaction: Clients use client-side mount tables to map file paths to the appropriate
Namenodes. Configuration is managed using ViewFileSystem and the viewfs:// URIs.
Benefits:
• Scalability: Allows the HDFS cluster to scale beyond the limitations of a single Namenode's memory
by distributing the namespace management.
• Fault Tolerance: Improves fault tolerance by isolating namespace management across multiple
Namenodes.
HDFS High Availability (HA)
Purpose:
• HDFS HA addresses the single point of failure issue associated with the Namenode by providing a pair
of Namenodes in an active-standby configuration.
Mechanics:
• Active-Standby Configuration: One Namenode acts as the active Namenode, handling client
requests, while the other acts as the standby, ready to take over if the active Namenode fails.
• Shared Storage: Namenodes use highly available shared storage to keep the edit log. Two main
choices for this storage are:
o NFS Filer: Traditional Network File System for shared storage.
o Quorum Journal Manager (QJM): A specialized HDFS component designed to provide
highly available edit logs. It uses a group of journal nodes where each edit must be written to a
majority of nodes (e.g., three nodes, allowing for one node failure).
• Datanode Reporting: Datanodes must report block information to both the active and standby
Namenodes.
• Client Configuration: Clients must be configured to handle Namenode failover transparently.
Failover Process:
• Quick Failover: The standby Namenode can take over quickly (within seconds) as it maintains up-to-
date state in memory, including the latest edit log and block mappings.
• Recovery: In case the standby Namenode is down when the active fails, the administrator can start the
standby from cold. While this process is better than the non-HA scenario, it still requires standard
operational procedures.
Advantages:
• Reduced Downtime: Provides high availability and reduces downtime by enabling a rapid failover
mechanism.
• Operational Efficiency: Standardizes the failover process, making it more predictable and
manageable.
Failover Process
Failover Controller:
• Role: Manages the transition between the active and standby Namenodes. It ensures that only one
Namenode is active at a time.
• Default Implementation: Uses ZooKeeper, which coordinates the failover process by monitoring the
health of the Namenodes and triggering failover if needed.
• Process:
o Heartbeat Mechanism: Each Namenode runs a lightweight failover controller that sends
heartbeats to check the status of the other Namenode.
o Graceful Failover: Can be initiated manually by an administrator, such as during routine
maintenance. This involves an orderly transition where both Namenodes switch roles smoothly.
o Ungraceful Failover: Occurs automatically if the active Namenode fails unexpectedly. This
can happen due to issues like network partitions or slow networks, where the active Namenode
might still be running but is unreachable.
Client Failover Handling:
• Transparent to Clients: The client library manages failover transparently. Clients are configured with
a logical hostname that maps to a pair of Namenode addresses.
• Failover Mechanism: The client library attempts connections to each Namenode address in turn until
it succeeds. This ensures continuous service availability even during failover.
Fencing Mechanisms
Purpose:
• Prevent Data Corruption: Ensures that the previously active Namenode, which might still be running
or reachable, does not interfere with the cluster operations or cause data corruption.
Fencing Techniques:
• SSH Fencing Command: A simple method where an SSH command is used to kill the process of the
previously active Namenode. This is effective if the failover controller is unsure if the old Namenode
has stopped completely.
• NFS Filer: When using an NFS filer for shared edit logs, stronger fencing methods are required
because NFS filers do not enforce exclusive write access as effectively as QJM.
o Revoking Access: Commands to revoke the Namenode’s access to the shared storage directory
can be used.
o Disabling Network Port: The Namenode’s network port can be disabled via remote
management commands to prevent it from accepting requests.
o STONITH (Shoot The Other Node In The Head): This drastic measure involves using a
specialized power distribution unit (PDU) to forcibly power down the host machine of the failed
Namenode, ensuring it can no longer affect the system.
Hadoop Filesystems: Overview and Usage
Hadoop’s flexible filesystem abstraction allows it to interact with various storage systems, each designed for
specific use cases. Understanding these filesystems can help in choosing the right one for your needs, whether
for local testing, distributed processing, or cloud integration.
1. Local FileSystem ([Link])
• URI Scheme: [Link]
• Description: Represents local disk storage. Ideal for small-scale testing or development on a single
machine.
• Purpose:
o Testing and development in a local environment.
o When data integrity through client-side checksums is needed.
• How to Use:
o Access local files directly using the [Link] scheme.
o For environments where checksums are not required, use RawLocalFileSystem.
2. Hadoop Distributed File System (HDFS) ([Link])
• URI Scheme: hdfs://
• Description: A distributed storage system designed for high-throughput access to large datasets.
Provides fault tolerance through data replication.
• Purpose:
o Handling large-scale data storage and processing.
o Optimized for use with MapReduce and other Hadoop processing frameworks.
o Ensures data reliability and fault tolerance.
• How to Use:
o Access files using the hdfs:// scheme.
o Ideal for processing large volumes of data across multiple nodes.
3. WebHDFS ([Link])
• URI Scheme: webhdfs://