Unit - 2 - Hadoop Distributed File System

HDFS (Hadoop Distributed File System) is a distributed file system designed to manage large datasets across multiple machines, ensuring fault tolerance and scalability. It operates on a Master-Slave architecture with a NameNode managing metadata and DataNodes storing actual data blocks. HDFS allows for efficient big data processing by splitting files into blocks and replicating them across nodes, enhancing reliability and throughput.

Uploaded by

sanjaykumarsonkar.cs

Unit – 2

Hadoop Distributed File System

HDFS (Hadoop Distributed File System): The Design of HDFS, HDFS Concepts, Command Line Interface,
Hadoop file system interfaces, Data flow, Data Ingest with Flume and Sqoop, Hadoop archives, Hadoop
I/O: Compression, Serialization, Avro and File-Based Data structures.

Hadoop - HDFS (Hadoop Distributed File System)

Before learning about HDFS (Hadoop Distributed File System), it’s important to understand what a file
system is. A file system is a way an operating system organizes and manages files on disk storage. It helps
users store, maintain, and retrieve data from the disk.

Example: Windows uses file systems like NTFS (New Technology File System) and FAT32 (File
Allocation Table 32). FAT32 is an older file system that Windows still supports, mainly for removable
media. Similarly, Linux uses file systems such as ext3 and ext4.

Distributed File System

DFS stands for distributed file system: the concept of storing files across multiple nodes in a distributed
manner. A DFS presents the abstraction of a single large system whose storage capacity is the sum of the
storage of the nodes in the cluster.

Why We Need DFS?

Storing very large files (e.g., 30TB) on a single system is impractical because:

 Disk capacity of one machine is limited and can only grow so much.

 Processing huge datasets on a single machine is inefficient and slow.

Distributed File Systems (DFS) overcome these issues by storing data across multiple machines, enabling
faster and scalable processing.

Example:

Suppose you have a 40TB file to process. On a single machine, it might take about 4 hours to complete.
With a Distributed File System (DFS), the 40TB file is split across 4 nodes in a cluster, with each node
storing 10TB. If the work divides evenly and all nodes work simultaneously, processing time drops to
about 1 hour. This is why a DFS is essential for fast, efficient big data processing.
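
The arithmetic in this example can be written out as a small sketch. It assumes the idealized case where the data and the work divide evenly across nodes and there is no coordination overhead; the function names are illustrative and not part of Hadoop.

```python
def data_per_node_tb(total_tb: float, nodes: int) -> float:
    """Data each node stores when the file is split evenly (no replication)."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return total_tb / nodes


def ideal_processing_hours(single_machine_hours: float, nodes: int) -> float:
    """Idealized processing time: perfect parallelism, no overhead (an assumption)."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return single_machine_hours / nodes


print(data_per_node_tb(40, 4))       # 10.0 TB per node
print(ideal_processing_hours(4, 4))  # 1.0 hour
```

Real clusters rarely reach this ideal, because scheduling, network transfer and uneven data distribution all add time.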

[Figure: Local File System Processing]

[Figure: Distributed File System Processing]

HDFS

HDFS (Hadoop Distributed File System) is a distributed file system designed to store and manage large
datasets across multiple machines (nodes) in a cluster. It is a core component of the Apache Hadoop
ecosystem and the main storage system in Hadoop.

It stores large files by breaking them into blocks (default 128 MB) and distributing them across multiple
low-cost machines.

HDFS ensures fault-tolerance by keeping copies of data blocks on different machines. This makes it
reliable, scalable and ideal for handling big data efficiently.
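
As a rough illustration of block splitting and replication, the sketch below computes the block sizes for a file and the raw storage its replicas consume. The 128 MB block size and replication factor of 3 are the defaults quoted in these notes; the function names and the 500 MB file are made up for the example.

```python
from __future__ import annotations

BLOCK_SIZE_MB = 128  # default block size quoted in these notes
REPLICATION = 3      # default replication factor quoted in these notes


def split_into_blocks(file_size_mb: int, block_size_mb: int = BLOCK_SIZE_MB) -> list[int]:
    """Return the size of each block; the last block may be smaller than the rest."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("sizes must be positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


def raw_storage_mb(file_size_mb: int, replication: int = REPLICATION) -> int:
    """Total cluster storage consumed once every block is replicated."""
    return file_size_mb * replication


print(split_into_blocks(500))  # [128, 128, 128, 116]
print(raw_storage_mb(500))     # 1500
```

The last block takes up only as much space as the data it holds, so a 500 MB file needs four blocks and not four full 128 MB allocations.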

HDFS Architecture

HDFS (Hadoop Distributed File System) follows a Master–Slave Architecture.

 Master → NameNode

 Slaves → DataNodes

 Optional helper → Secondary NameNode / Checkpoint Node

It is designed for high fault tolerance, scalability, and large data storage.

Main Components

NameNode (Master) : The brain of HDFS.

Responsibilities:

 Maintains metadata

o File names

o Directory structure

o Permissions

o Block locations

 Decides where blocks are stored

 Monitors DataNodes (via heartbeats)

Important:

 Does NOT store actual file data

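 Persists namespace metadata in:

o FsImage (a snapshot of the file system namespace)

o EditLog (changes made since the last FsImage)

 Keeps block locations in memory and rebuilds them from DataNode block reports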

DataNodes (Slaves)

These store the actual data blocks.

Responsibilities:

 Store file blocks

 Send heartbeat to NameNode

 Send block reports

 Handle read/write requests from clients

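If a DataNode fails:

 The NameNode stops receiving its heartbeats, marks it as dead, and has the affected blocks re-replicated from the surviving replicas to other nodes.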
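
The failure handling described above can be sketched as two steps: find DataNodes whose heartbeats have stopped, then find blocks that have dropped below the target replication. This is a simplified model, not the NameNode's actual code; the timeout value, node names and block ids are illustrative assumptions.

```python
from __future__ import annotations


def dead_datanodes(last_heartbeat: dict[str, float], now: float, timeout_s: float) -> set[str]:
    """DataNodes whose most recent heartbeat is older than the timeout."""
    return {dn for dn, seen in last_heartbeat.items() if now - seen > timeout_s}


def replicas_needed(block_locations: dict[str, set[str]], dead: set[str],
                    target: int = 3) -> dict[str, int]:
    """Map each under-replicated block to the number of new copies it needs."""
    needed = {}
    for block, nodes in block_locations.items():
        live = nodes - dead
        if len(live) < target:
            needed[block] = target - len(live)
    return needed


# Hypothetical cluster state: dn3 last reported 60 seconds ago.
heartbeats = {"dn1": 100.0, "dn2": 98.0, "dn3": 40.0, "dn4": 99.0}
locations = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn1", "dn2", "dn4"}}

dead = dead_datanodes(heartbeats, now=100.0, timeout_s=30.0)
print(dead)                              # {'dn3'}
print(replicas_needed(locations, dead))  # {'blk_1': 1}
```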

Secondary NameNode (Checkpoint Node)

 Merges FsImage + EditLog

 Creates updated checkpoints

 Helps reduce NameNode load

 ❗ Not a backup NameNode
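
A checkpoint can be pictured as replaying the EditLog on top of the last FsImage to produce a new FsImage and an empty log. The sketch below models the namespace as a plain dictionary and uses three made-up operation names; the real FsImage and EditLog formats are richer than this.

```python
from __future__ import annotations

import copy


def apply_edit(namespace: dict[str, dict], edit: tuple) -> None:
    """Apply one logged change to the in-memory namespace (hypothetical op names)."""
    op, path, *args = edit
    if op == "create":
        namespace[path] = {"replication": args[0]}
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "set_replication":
        namespace[path]["replication"] = args[0]
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage: dict[str, dict], edit_log: list[tuple]) -> tuple[dict[str, dict], list[tuple]]:
    """Merge FsImage + EditLog; return the new FsImage and an emptied EditLog."""
    merged = copy.deepcopy(fsimage)  # leave the old image untouched
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged, []


old_image = {"/data/a.txt": {"replication": 3}}
log = [
    ("create", "/data/b.txt", 3),
    ("set_replication", "/data/a.txt", 2),
    ("delete", "/data/b.txt"),
]
new_image, new_log = checkpoint(old_image, log)
print(new_image)  # {'/data/a.txt': {'replication': 2}}
print(new_log)    # []
```

Because the merged image already contains every logged change, the NameNode has far fewer edits to replay when it restarts.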

How Data is Stored in HDFS

Step-by-step process:

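1. Client asks the NameNode to create the file. The file data itself never passes through the NameNode.

2. The client splits the file into blocks (default 128MB).

3. For each block, the NameNode selects the DataNodes that will store it.

4. Blocks are stored with replication (default = 3).

5. Data is written in a pipeline:

o Client → DN1 → DN2 → DN3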
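
The write pipeline can be simulated in a few lines. In this sketch each DataNode stores a packet and forwards it to the next node, and acknowledgements travel back in reverse order. The packet size, node names and in-memory storage are assumptions made for illustration; this is not the HDFS client API.

```python
from __future__ import annotations


def send_packet(packet: bytes, pipeline: list[str], storage: dict[str, bytearray]) -> list[str]:
    """Store the packet on the first node, forward it downstream, return the acks."""
    if not pipeline:
        return []
    head, rest = pipeline[0], pipeline[1:]
    storage.setdefault(head, bytearray()).extend(packet)
    acks = send_packet(packet, rest, storage)  # forward to the next DataNode
    acks.append(head)                          # acks flow back: last node first
    return acks


def write_block(block: bytes, pipeline: list[str], packet_size: int = 4) -> dict[str, bytearray]:
    """Send a block through the pipeline packet by packet."""
    storage: dict[str, bytearray] = {}
    for start in range(0, len(block), packet_size):
        packet = block[start:start + packet_size]
        acks = send_packet(packet, pipeline, storage)
        if acks != list(reversed(pipeline)):
            raise RuntimeError("pipeline did not acknowledge the packet")
    return storage


stored = write_block(b"hello hdfs", ["DN1", "DN2", "DN3"])
print({dn: bytes(data) for dn, data in stored.items()})
# {'DN1': b'hello hdfs', 'DN2': b'hello hdfs', 'DN3': b'hello hdfs'}
```

The client sends each packet only once, to DN1, so the cost of replication is spread across the DataNodes and not carried by the client's network link.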

Read Operation

1. Client requests file from NameNode.

2. NameNode provides block locations.

3. Client reads each block directly from the nearest DataNode that holds a replica.
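
Choosing the nearest DataNode in the last step can be sketched with a simple distance measure. The values used here (0 for the same node, 2 for the same rack, 4 for a different rack) follow a commonly described convention and are an assumption of this sketch, as are the rack and host names.

```python
from __future__ import annotations

Node = tuple[str, str]  # (rack, host)


def distance(client: Node, datanode: Node) -> int:
    """Illustrative network distance between the client and a DataNode."""
    if client == datanode:
        return 0  # replica is on the client's own machine
    if client[0] == datanode[0]:
        return 2  # same rack, different host
    return 4      # different rack


def nearest_replica(client: Node, replicas: list[Node]) -> Node:
    """Pick the replica with the smallest distance to the client."""
    return min(replicas, key=lambda dn: distance(client, dn))


client = ("rack1", "host1")
replicas = [("rack2", "host5"), ("rack1", "host2"), ("rack2", "host6")]
print(nearest_replica(client, replicas))  # ('rack1', 'host2')
```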

Rack Awareness

 HDFS places replicas across different racks

 Prevents data loss if one rack fails

 Improves fault tolerance

Example:

 1 replica on the local rack (on the writer's own node when the client runs on a DataNode)

 2 replicas on two different nodes of one other rack
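
The example placement above (one replica on the local rack, two on a different rack) can be sketched as follows. The sketch picks nodes deterministically so the result is predictable, whereas a real cluster would choose among eligible nodes; the topology and names are hypothetical.

```python
from __future__ import annotations

Node = tuple[str, str]  # (rack, host)


def place_replicas(writer: Node, topology: dict[str, list[str]]) -> list[Node]:
    """Return three placements: the writer's node plus two nodes on one remote rack."""
    local_rack, _ = writer
    # Candidate remote racks must have at least two hosts to take two replicas.
    remote_racks = sorted(
        rack for rack, hosts in topology.items()
        if rack != local_rack and len(hosts) >= 2
    )
    if not remote_racks:
        raise ValueError("no remote rack with two available nodes")
    remote = remote_racks[0]
    second, third = [(remote, host) for host in topology[remote][:2]]
    return [writer, second, third]


topology = {"rack1": ["host1", "host2"], "rack2": ["host3", "host4"]}
print(place_replicas(("rack1", "host1"), topology))
# [('rack1', 'host1'), ('rack2', 'host3'), ('rack2', 'host4')]
```

With this layout the data survives the loss of either rack, while only one copy of each block has to cross between racks during the write.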

Features of HDFS

 Files stored in HDFS are easy to access.

 HDFS provides high availability and fault tolerance.

 Nodes can be added or removed to scale the cluster up or down as required.

 Data is stored in a distributed manner: many DataNodes share responsibility for storing it.

 Replication protects against data loss when a node fails.

 HDFS is highly reliable and can store data at petabyte scale.

 The NameNode and DataNodes have built-in servers that make it easy to retrieve cluster
information.

 Provides high throughput.
