Module 2

The document provides an overview of the Hadoop Distributed File System (HDFS), emphasizing its role in managing large datasets across multiple machines for scalability and fault tolerance. It details the architecture, including the roles of Namenodes and Datanodes, and discusses key features such as block management, replication, and integration with other storage systems. Additionally, it highlights challenges and limitations of HDFS, including low-latency access and handling small files, while also addressing high availability and federation mechanisms to enhance performance and reliability.

Uploaded by

rxthgowda12

MODULE-2 HDFS (HADOOP DISTRIBUTED FILE SYSTEM)

1. Introduction to Distributed Filesystems

When a dataset exceeds the storage capacity of a single physical machine, it becomes necessary to partition
and distribute it across multiple machines. Distributed filesystems manage this distributed storage across a
network of machines, allowing for scalability and fault tolerance.

Key Points:

• Definition: Distributed filesystems are filesystems that operate over a network, managing storage
across multiple machines.
• Complexity: They are more complex than traditional disk filesystems due to the challenges of network
programming and ensuring data integrity across nodes.
• Challenges:
o Node Failure: Ensuring the filesystem can tolerate the failure of nodes without data loss.
o Network Issues: Handling latency, bandwidth limitations, and reliability of the network.

2. Hadoop Distributed Filesystem (HDFS)

HDFS is a distributed filesystem designed to store and manage large datasets across a cluster of machines. It
is a core component of the Hadoop ecosystem.

Key Features:

• Scalability: HDFS is designed to scale out by adding more nodes to the cluster, which increases
storage capacity and computational power.
• Fault Tolerance: It replicates data across multiple nodes to ensure data is not lost in case of hardware
failures.
• High Throughput: Optimized for high throughput rather than low latency, making it suitable for
processing large files.

Key Components:

• NameNode: Manages the metadata of the filesystem, such as file names, directory structure, and file-
to-block mapping. It is a single point of failure and crucial for the filesystem's operation.
• DataNode: Stores the actual data blocks. Data is replicated across multiple DataNodes for fault
tolerance.
• Secondary NameNode: Periodically checkpoints the NameNode's metadata. Despite its name, it is not a
standby NameNode; its checkpoint lags behind the primary and serves only as a partial backup.

HDFS Operation:

• Data Storage: Files are split into blocks (128 MB by default in Hadoop 2 and later; larger sizes such
as 256 MB are often configured) and distributed across the cluster. Each block is replicated multiple
times (default replication factor is 3) to ensure data reliability.
• Access: HDFS is designed for large-scale data processing tasks and provides write-once, read-many
access patterns. It is not optimized for small file storage or frequent updates.
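
The block and replication arithmetic above can be sketched in a few lines of Python. This is only an illustrative calculation, not HDFS code: the function name block_layout is hypothetical, and the defaults assume a 128 MB block size and a replication factor of 3.

```python
import math

MB = 1024 * 1024

def block_layout(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster)."""
    if file_size_bytes == 0:
        return 0, 0
    num_blocks = math.ceil(file_size_bytes / block_size_bytes)
    # The final block only occupies as much disk as it holds, so raw usage
    # is file size times replication, not blocks times block size.
    raw_bytes = file_size_bytes * replication
    return num_blocks, raw_bytes

blocks, raw = block_layout(1000 * MB)
print(blocks, raw // MB)  # 8 3000
```

Under these assumptions a 1000 MB file is split into 8 blocks (seven full blocks and one partial block) and consumes about 3000 MB of raw disk across the cluster.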

Integration with Other Storage Systems

Hadoop’s filesystem abstraction allows it to integrate with various storage systems beyond HDFS:

• Local Filesystem: Hadoop can use the local filesystem for development or smaller datasets.
• Amazon S3: Hadoop can interact with Amazon S3 as a storage backend, allowing for scalable storage
in the cloud. This integration uses the S3A file system client to read and write data.
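• Other Systems: Through connectors and ecosystem tools, Hadoop can also exchange data with relational
databases (RDBMS), data lakes, data warehouses, streaming systems, and cloud storage services.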

Key Considerations:

• Data Consistency: When integrating with other storage systems, ensure that the data consistency
models are compatible with your use case.
• Performance: Consider the performance implications of different storage backends, particularly in
terms of latency and throughput.

Design and Use Cases of HDFS

1. Overview of HDFS Design

HDFS (Hadoop Distributed File System) is specifically designed to address the needs of storing and processing
very large files in a distributed computing environment. Here’s a breakdown of its key design principles:

Key Features:

• Very Large Files:

o Definition: In HDFS, "very large" refers to files that range from hundreds of megabytes to
terabytes, and even petabytes in some cases.
o Purpose: It is optimized for storing such large datasets efficiently, which is essential for big
data applications and analytics.
• Streaming Data Access:
o Access Pattern: HDFS is built around a write-once, read-many-times model. This model suits
scenarios where data is initially ingested and then analyzed multiple times.
o Performance: The design prioritizes high throughput for reading large datasets over low-
latency access. The time to read the entire dataset is more critical than the time to read the first
record.
• Commodity Hardware:
o Cost Efficiency: HDFS is designed to run on clusters of inexpensive, commonly available
hardware. This design choice makes it cost-effective and scalable.
o Fault Tolerance: It anticipates hardware failures and is designed to continue operating with
minimal user impact, leveraging the redundancy built into the system.

2. Challenges and Limitations

While HDFS is powerful for many use cases, there are specific scenarios where it may not be the best fit:

Low-Latency Data Access:

• Latency Constraints: Applications requiring quick, sub-second data access (in the tens of
milliseconds range) may not perform well with HDFS. Its optimization for high throughput rather than
low latency means it may not meet the performance needs of such applications.
• Alternative Solutions: HBase, which is built on top of HDFS, is often used for scenarios requiring
low-latency data access. HBase provides capabilities for fast random access to data.

Handling Lots of Small Files:

• Metadata Management: HDFS stores metadata (such as file and directory information) in memory
on the NameNode. This means the scalability of HDFS in terms of the number of files is limited by the
memory capacity of the NameNode.
• Scalability Constraints: As a rule of thumb, each file, directory, and block takes up about 150 bytes
of memory. For example, one million files, each occupying one block, require at least 300 MB of memory
(two million objects at 150 bytes each). While managing millions of files is feasible, managing
billions of files can exceed current hardware capabilities.
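
The memory estimate can be reproduced with a short calculation. This is a rough sketch built on the 150-bytes-per-object rule of thumb quoted above; the function name and the assumption that each file occupies a single block are illustrative, not part of HDFS.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb figure per file, directory, or block

def namenode_memory_bytes(num_files, blocks_per_file=1, num_dirs=0):
    """Estimate NameNode heap used by namespace objects."""
    # One object per file, one per block, one per directory.
    objects = num_files + num_files * blocks_per_file + num_dirs
    return objects * BYTES_PER_OBJECT

print(namenode_memory_bytes(1_000_000) / 1e6)      # 300.0 (MB)
print(namenode_memory_bytes(1_000_000_000) / 1e9)  # 300.0 (GB)
```

The same data stored as fewer, larger files needs far fewer objects, which is why many small files are a poor fit for HDFS.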

Multiple Writers and Arbitrary Modifications:

• Write Restrictions: HDFS supports a single writer per file, with data being appended to the end of the
file. It does not support multiple writers or modifications at arbitrary file offsets.
• Future Considerations: Although support for multiple writers and arbitrary modifications might be
introduced in the future, such features are likely to be less efficient compared to the current append-
only model.

HDFS Concepts
Blocks

1. Understanding Blocks in Filesystems

In both traditional filesystems and HDFS, the concept of "blocks" is fundamental. Here's a detailed look at
what blocks are and their significance:

Filesystem Blocks vs. HDFS Blocks:

• Traditional Filesystems:
o Disk Blocks: The smallest unit of storage on a disk, usually 512 bytes.
o Filesystem Blocks: Typically larger, a few kilobytes in size, used by the filesystem to manage
data. The filesystem block size is an integral multiple of the disk block size.
o Tools: Commands like df and fsck operate at the filesystem block level for maintenance and
checking.
• HDFS Blocks:
o Size: Much larger than traditional filesystem blocks, with a default size of 128 MB.
o Function: Files in HDFS are divided into blocks of this size, which are stored independently
across the cluster.
o Storage Efficiency: If a file is smaller than a block, only the space needed for the file is used
(e.g., a 1 MB file on a 128 MB block uses only 1 MB of disk space).

2. Why Are HDFS Blocks So Large?

The choice of large block sizes in HDFS is driven by several practical considerations:

• Minimizing Seek Time:

o Seek Time vs. Transfer Rate: With a large enough block, the time to transfer the block's data from
disk is much longer than the time to seek to its start. Reading a large file made of many blocks
then proceeds at close to the disk transfer rate.
• MapReduce Considerations:
o Map Tasks: In MapReduce, tasks typically process one block at a time. Having too few blocks
compared to the number of nodes can lead to slower job execution due to insufficient
parallelism.
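
The seek-time argument can be made concrete with a small worked calculation. The figures used here (a 10 ms seek time, a 100 MB/s transfer rate, and a target of 1% seek overhead) are illustrative assumptions rather than measurements, and the function name is hypothetical.

```python
def min_block_size_mb(seek_ms, transfer_rate_mb_s, max_seek_percent):
    """Smallest block size (MB) keeping seek time within the given share of transfer time."""
    # seek <= (percent / 100) * (size / rate)  =>  size >= seek * rate * 100 / percent
    return seek_ms * transfer_rate_mb_s * 100 / (1000 * max_seek_percent)

print(min_block_size_mb(10, 100, 1))  # 100.0
```

With these numbers a block of roughly 100 MB is needed, which is in line with the 128 MB default. Faster disks push the suitable block size higher.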

3. Benefits of Block Abstraction

The block abstraction in HDFS provides several advantages:

• Handling Large Files:

o Scalability: HDFS allows files to exceed the size of any single disk in the cluster. Blocks can
be distributed across all available disks, enabling the storage of very large files.
• Simplified Storage Management:
o Fixed Size: Blocks are of a fixed size, making it straightforward to calculate storage
requirements and manage space on disks.
o Metadata Management: Since blocks are only chunks of data, file metadata (like permissions)
is managed separately from the blocks, simplifying the storage subsystem.
• Replication and Fault Tolerance:
o Replication: To ensure fault tolerance, each block is replicated across multiple machines
(typically three). This redundancy allows for recovery from disk or machine failures.
o Automatic Recovery: If a block becomes unavailable due to corruption or a failure, it can be
replicated from other available copies to restore the required replication factor.
• Load Distribution:
o Read Load: Applications may set a higher replication factor for blocks in frequently accessed
files to distribute the read load across the cluster.
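
The automatic recovery described above can be illustrated with a toy model. This is a simplified sketch, not the HDFS implementation: block and Datanode names are made up, and replica targets are chosen alphabetically, whereas the real placement policy also takes rack topology into account.

```python
def under_replicated(block_locations, live_nodes, target=3):
    """Map each under-replicated block to the number of replicas it is missing."""
    missing = {}
    for block, nodes in block_locations.items():
        live = [n for n in nodes if n in live_nodes]
        if len(live) < target:
            missing[block] = target - len(live)
    return missing

def re_replicate(block_locations, live_nodes, target=3):
    """Return a new location map with replicas restored where possible."""
    repaired = {}
    for block, nodes in block_locations.items():
        live = [n for n in nodes if n in live_nodes]
        # Candidate targets: live nodes that do not already hold the block.
        candidates = sorted(live_nodes - set(live))
        if live:  # a block with no surviving replica cannot be copied
            live += candidates[: max(0, target - len(live))]
        repaired[block] = live
    return repaired

locations = {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn2", "dn3", "dn4"]}
live = {"dn2", "dn3", "dn4", "dn5"}  # dn1 has failed
print(under_replicated(locations, live))  # {'blk_1': 1}
print(re_replicate(locations, live)["blk_1"])  # ['dn2', 'dn3', 'dn4']
```

Only blk_1 lost a replica when dn1 failed, so only that block is copied to another live node.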

4. Filesystem Check (fsck) with HDFS

HDFS provides a command to understand and manage blocks:

• Command: hdfs fsck / -files -blocks

o Function: Lists the blocks that make up each file in the HDFS filesystem.
o Purpose: Useful for checking the health and integrity of the filesystem and its blocks.

Namenodes and Datanodes in HDFS

HDFS (Hadoop Distributed File System) operates with a master-worker architecture involving two main types
of nodes: Namenodes and Datanodes. Understanding their roles and mechanisms is essential for maintaining
the health and performance of an HDFS cluster.

1. Namenode

The Namenode is the master node in the HDFS architecture, responsible for managing the filesystem
namespace and metadata.

Responsibilities:

• Filesystem Namespace Management:

o Maintains the directory tree of the filesystem and metadata for all files and directories.
o Stores metadata persistently on local disks in two files: the namespace image and the edit log.
• Block Management:
o Keeps track of which Datanodes store the blocks for each file.
o Does not persistently store block locations; instead, this information is reconstructed from
Datanodes during system startup.
• Client Interaction:
o Clients interact with the Namenode to perform filesystem operations.
o Provides a filesystem interface, abstracting the complexities of the Namenode and Datanodes
from the user.

Failure and Recovery:

• Critical Role:
o If the Namenode fails, the filesystem cannot be used, and data loss can occur because the system
would not know how to reconstruct files from the blocks on Datanodes.
• Resilience Mechanisms:
o Backup:
▪ Namenode's persistent state is backed up to multiple filesystems. This includes
synchronous and atomic writes to local disks and remote NFS mounts.
o Secondary Namenode:
▪ Role: Periodically merges the namespace image with the edit log to prevent the edit log
from becoming too large.
▪ Operation: Runs on a separate physical machine with sufficient CPU and memory. It
keeps a copy of the merged namespace image.
▪ Limitations: The state of the Secondary Namenode lags behind the primary, so some data
loss is likely if the primary fails completely. The usual recovery is to copy the Namenode's
metadata files from the remote NFS backup to the Secondary Namenode and run it as the new primary.
▪ Alternative: A hot standby Namenode can be used for high availability, which provides
a more robust failover solution.
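
The checkpoint performed by the Secondary Namenode, merging the namespace image with the edit log, can be sketched as replaying logged operations onto a snapshot. The data layout and operation names below are hypothetical simplifications; the real image and edit log are binary formats with many more operation types.

```python
def apply_edits(namespace_image, edit_log):
    """Replay logged operations onto a copy of the namespace image."""
    image = dict(namespace_image)
    for op, path, *args in edit_log:
        if op == "create":
            image[path] = {"replication": args[0]}
        elif op == "delete":
            image.pop(path, None)
        elif op == "rename":
            image[args[0]] = image.pop(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return image

fsimage = {"/logs/a.txt": {"replication": 3}}
edits = [
    ("create", "/logs/b.txt", 3),
    ("rename", "/logs/a.txt", "/archive/a.txt"),
    ("delete", "/logs/b.txt"),
]
new_fsimage = apply_edits(fsimage, edits)  # the checkpoint
print(new_fsimage)  # {'/archive/a.txt': {'replication': 3}}
```

Once the merged image has been saved, the edits it covers no longer need to be replayed at startup, which is what keeps the edit log from growing without bound.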

2. Datanodes

Datanodes are the worker nodes in the HDFS architecture, responsible for storing and managing the data
blocks.

Responsibilities:

• Block Storage and Retrieval:

o Store and retrieve blocks as instructed by clients or the Namenode.
o Report periodically to the Namenode with lists of blocks they are storing.

Operation:

• Data Management:
o Datanodes handle the actual data blocks, and their efficiency directly impacts the performance
of HDFS.
o They ensure data redundancy and availability by storing multiple replicas of each block.

3. Interaction Between Namenode and Datanodes

• Client Operations:
o Clients interact with the Namenode to get information about where to find the blocks of a file.
o The Namenode directs clients to the appropriate Datanodes for block retrieval or storage.
• Datanode Reporting:
o Datanodes regularly send heartbeat signals and block reports to the Namenode.
o This reporting helps the Namenode monitor the health of the Datanodes and manage data
replication and block recovery.
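
The reporting loop can be modelled with a small in-memory class. This is a toy sketch under stated assumptions: the class name, method names, and the 30-second timeout are illustrative and do not correspond to HDFS classes or default settings.

```python
class ToyNamenode:
    def __init__(self, heartbeat_timeout_s=30.0):
        self.heartbeat_timeout_s = heartbeat_timeout_s
        self.last_heartbeat = {}   # datanode id -> time of last heartbeat
        self.block_locations = {}  # block id -> set of datanode ids

    def heartbeat(self, datanode, now):
        self.last_heartbeat[datanode] = now

    def block_report(self, datanode, blocks, now):
        self.heartbeat(datanode, now)
        # A full report replaces whatever this datanode reported earlier.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block in blocks:
            self.block_locations.setdefault(block, set()).add(datanode)

    def live_datanodes(self, now):
        return {dn for dn, t in self.last_heartbeat.items()
                if now - t <= self.heartbeat_timeout_s}

    def locate(self, block, now):
        """Datanodes that hold the block and are still considered alive."""
        holders = self.block_locations.get(block, set())
        return sorted(holders & self.live_datanodes(now))

nn = ToyNamenode()
nn.block_report("dn1", ["blk_1", "blk_2"], now=0.0)
nn.block_report("dn2", ["blk_1"], now=0.0)
nn.heartbeat("dn2", now=25.0)
print(nn.locate("blk_1", now=40.0))  # ['dn2']
```

Because block locations are rebuilt from reports rather than stored on disk, a Datanode that stops sending heartbeats simply drops out of the answers given to clients.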

Block Caching

Purpose:

• Block caching improves read performance by storing frequently accessed blocks in the memory of
Datanodes.
• It allows quick access to data without having to read from disk repeatedly.

Mechanics:

• Caching Location: Blocks are cached in an off-heap memory area on the Datanodes. Off-heap
memory is used to avoid garbage collection overhead.
• Cache Scope: By default, each block is cached in one Datanode’s memory. This is configurable on a
per-file basis, allowing multiple Datanodes to cache the same block if needed.
• Cache Management: Administrators can specify which files should be cached and the duration of
caching using cache directives added to a cache pool.
• Cache Pools: Cache pools are used to manage cache permissions and resource usage. They help in
organizing and controlling access to cached data.

Benefits:

• Performance Improvement: Increases read performance by reducing disk I/O. For instance, job
schedulers in frameworks like MapReduce or Spark can schedule tasks on Datanodes that have relevant
blocks cached, leading to faster data access.
• Use Cases: Ideal for use cases such as small lookup tables used in joins, where data is frequently
accessed.

HDFS Federation

Purpose:

• HDFS Federation enhances the scalability of HDFS by allowing the use of multiple Namenodes.

Mechanics:

• Namespace Management: In HDFS Federation, the filesystem namespace is split across multiple
Namenodes. Each Namenode manages a portion of the namespace and its associated block pool.
• Namespace Volumes: Each Namenode is responsible for a namespace volume, which includes
metadata for its portion of the namespace. These volumes are independent, meaning that the failure of
one Namenode does not affect others.
• Block Pool Storage: Datanodes register with all Namenodes in the cluster and store blocks from
multiple block pools. This ensures that blocks from different namespaces can be stored and managed
efficiently.

Access:

• Client Interaction: Clients use client-side mount tables to map file paths to the appropriate
Namenodes. Configuration is managed using ViewFileSystem and the viewfs:// URIs.
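
The idea of a client-side mount table can be sketched as a prefix lookup from path to Namenode. This is an illustration of the concept only, not the ViewFileSystem implementation; the mount points and host names are hypothetical.

```python
MOUNT_TABLE = {
    "/user": "hdfs://nn1.example.com:8020",
    "/data": "hdfs://nn2.example.com:8020",
}

def resolve(path, mount_table=MOUNT_TABLE):
    """Return (namenode URI, path) for the mount point that contains path."""
    for mount, namenode in mount_table.items():
        # Match whole path components so "/database" does not match "/data".
        if path == mount or path.startswith(mount + "/"):
            return namenode, path
    raise KeyError(f"no mount point for {path}")

print(resolve("/user/alice/report.csv"))
# ('hdfs://nn1.example.com:8020', '/user/alice/report.csv')
```

The client sees one namespace, while each request is routed to whichever Namenode owns that part of the tree.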

Benefits:

• Scalability: Allows the HDFS cluster to scale beyond the limitations of a single Namenode's memory
by distributing the namespace management.
• Fault Tolerance: Improves fault tolerance by isolating namespace management across multiple
Namenodes.

HDFS High Availability (HA)

Purpose:

• HDFS HA addresses the single point of failure issue associated with the Namenode by providing a pair
of Namenodes in an active-standby configuration.

Mechanics:

• Active-Standby Configuration: One Namenode acts as the active Namenode, handling client
requests, while the other acts as the standby, ready to take over if the active Namenode fails.
• Shared Storage: Namenodes use highly available shared storage to keep the edit log. Two main
choices for this storage are:
o NFS Filer: Traditional Network File System for shared storage.
o Quorum Journal Manager (QJM): A specialized HDFS component designed to provide
highly available edit logs. It uses a group of journal nodes where each edit must be written to a
majority of nodes (e.g., three nodes, allowing for one node failure).
• Datanode Reporting: Datanodes must report block information to both the active and standby
Namenodes.
• Client Configuration: Clients must be configured to handle Namenode failover transparently.
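
The majority rule used by the Quorum Journal Manager can be illustrated with a toy model. This sketch only shows the quorum arithmetic; the dictionaries standing in for journal nodes are hypothetical, and real journal nodes also handle recovery and writer epochs, which are omitted here.

```python
def quorum_write(journal_nodes, edit):
    """Append edit to every reachable journal; succeed only on a majority of acks."""
    acks = 0
    for node in journal_nodes:
        if node["up"]:
            node["log"].append(edit)
            acks += 1
    majority = len(journal_nodes) // 2 + 1
    return acks >= majority

journals = [
    {"up": True, "log": []},
    {"up": True, "log": []},
    {"up": False, "log": []},
]
print(quorum_write(journals, "mkdir /tmp/x"))  # True: 2 of 3 acknowledged
journals[1]["up"] = False
print(quorum_write(journals, "mkdir /tmp/y"))  # False: only 1 of 3 acknowledged
```

With three journal nodes the majority is two, so the edit log stays writable after one node fails but not after two.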

Failover Process:

• Quick Failover: The standby Namenode can take over quickly (within a few tens of seconds) because it
maintains up-to-date state in memory, including the latest edit log entries and block mappings.
• Recovery: If the standby Namenode is down when the active fails, the administrator can still start the
standby from cold. This is no worse than the non-HA case, and it is operationally simpler because the
procedure is a standard one built into Hadoop.

Advantages:

• Reduced Downtime: Provides high availability and reduces downtime by enabling a rapid failover
mechanism.
• Operational Efficiency: Standardizes the failover process, making it more predictable and
manageable.

Failover Process

Failover Controller:
• Role: Manages the transition between the active and standby Namenodes. It ensures that only one
Namenode is active at a time.
• Default Implementation: Uses ZooKeeper, which coordinates the failover process by monitoring the
health of the Namenodes and triggering failover if needed.
• Process:
o Heartbeat Mechanism: Each Namenode runs a lightweight failover controller process that
monitors its own Namenode for failures using a simple heartbeat mechanism and triggers a
failover if that Namenode fails.
o Graceful Failover: Can be initiated manually by an administrator, such as during routine
maintenance. This involves an orderly transition where both Namenodes switch roles smoothly.
o Ungraceful Failover: Occurs automatically if the active Namenode fails unexpectedly. This
can happen due to issues like network partitions or slow networks, where the active Namenode
might still be running but is unreachable.

Client Failover Handling:

• Transparent to Clients: The client library manages failover transparently. Clients are configured with
a logical hostname that maps to a pair of Namenode addresses.
• Failover Mechanism: The client library attempts connections to each Namenode address in turn until
it succeeds. This ensures continuous service availability even during failover.
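
The try-each-address behaviour can be sketched as a simple retry loop. This is a conceptual illustration rather than the Hadoop client library: the exception class, the logical name mycluster, the host names, and the fake RPC function are all hypothetical.

```python
class NamenodeUnavailable(Exception):
    pass

def call_with_failover(addresses, rpc):
    """Try rpc(address) against each Namenode address in turn."""
    last_error = None
    for address in addresses:
        try:
            return rpc(address)
        except NamenodeUnavailable as err:
            last_error = err  # remember the failure and try the next address
    raise NamenodeUnavailable(f"no Namenode reachable: {last_error}")

# A logical name mapped to a pair of Namenode addresses.
LOGICAL_NAME = {"mycluster": ["nn1.example.com:8020", "nn2.example.com:8020"]}

def fake_list_status(address):
    # Stand-in for a real RPC: nn1 is pretending to be down or in standby.
    if address.startswith("nn1"):
        raise NamenodeUnavailable(f"{address} is in standby or down")
    return f"listing served by {address}"

print(call_with_failover(LOGICAL_NAME["mycluster"], fake_list_status))
# listing served by nn2.example.com:8020
```

The application only ever refers to the logical name, so it does not need to know which Namenode is currently active.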

Fencing Mechanisms

Purpose:

• Prevent Data Corruption: Ensures that the previously active Namenode, which might still be running
or reachable, does not interfere with the cluster operations or cause data corruption.

Fencing Techniques:

• SSH Fencing Command: A simple method where an SSH command is used to kill the process of the
previously active Namenode. This is useful when the failover controller cannot be sure that the old
Namenode has stopped completely.
• NFS Filer: When using an NFS filer for the shared edit log, stronger fencing methods are required
because, unlike QJM, an NFS filer cannot ensure that only one Namenode writes to the edit log at a time.
o Revoking Access: Commands to revoke the Namenode’s access to the shared storage directory
can be used.
o Disabling Network Port: The Namenode’s network port can be disabled via remote
management commands to prevent it from accepting requests.
o STONITH (Shoot The Other Node In The Head): This drastic measure involves using a
specialized power distribution unit (PDU) to forcibly power down the host machine of the failed
Namenode, ensuring it can no longer affect the system.

Hadoop Filesystems: Overview and Usage

Hadoop’s flexible filesystem abstraction allows it to interact with various storage systems, each designed for
specific use cases. Understanding these filesystems can help in choosing the right one for your needs, whether
for local testing, distributed processing, or cloud integration.

1. Local FileSystem ([Link])

• URI Scheme: [Link]

• Description: Represents local disk storage. Ideal for small-scale testing or development on a single
machine.
• Purpose:
o Testing and development in a local environment.
o When data integrity through client-side checksums is needed.
• How to Use:
o Access local files directly using the [Link] scheme.
o For environments where checksums are not required, use RawLocalFileSystem.

2. Hadoop Distributed File System (HDFS) ([Link])

• URI Scheme: hdfs://

• Description: A distributed storage system designed for high-throughput access to large datasets.
Provides fault tolerance through data replication.
• Purpose:
o Handling large-scale data storage and processing.
o Optimized for use with MapReduce and other Hadoop processing frameworks.
o Ensures data reliability and fault tolerance.
• How to Use:
o Access files using the hdfs:// scheme.
o Ideal for processing large volumes of data across multiple nodes.

3. WebHDFS ([Link])

• URI Scheme: webhdfs://
