0% found this document useful (0 votes)

6 views36 pages

Hadoop Framework: HDFS & MapReduce Concepts

Unit II covers the Hadoop framework, focusing on distributed file systems like HDFS, which allows for scalable and fault-tolerant storage of large datasets across multiple machines. It explains the architecture of HDFS, including the roles of NameNode and DataNode, as well as the MapReduce processing framework that enables distributed data processing through map and reduce functions. Key features include data replication for fault tolerance, high throughput, and the ability to handle large files efficiently.

Uploaded by

codegeass0047

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

6 views36 pages

Hadoop Framework: HDFS & MapReduce Concepts

Uploaded by

codegeass0047

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

UNIT II - HADOOP FRAMEWORK

Distributed File Systems - Large-Scale File System Organization – HDFS

concepts - MapReduce Execution, Algorithms using MapReduce, Matrix
Vector Multiplication – Hadoop YARN

Distributed File Systems / Large-Scale File-System Organization

A file system is a method used by an operating system to organize, store, retrieve,

and manage files on a storage device, such as a hard drive or SSD. It provides a
way to keep track of where files are located, how they are named, and who can
access them.

Types of local File Systems:

Different operating systems use different file systems. Examples

1)NTFS (New Technology File System) for Windows,

2)ext (Extended File System) for Linux,

3)APFS (Apple File System) for macOS,

4)FAT32 (File Allocation Table) for various systems.

Local file systems are simple and effective for single-machine environment,

But it has several disadvantages compared to distributed file systems:

1. Limited Scalability: Local file systems are confined to the storage

capacity and resources of a single machine. Distributed file systems can
scale out by adding more machines, allowing them to handle larger
amounts of data and more simultaneous users.
2. Single Point of Failure: A local file system is susceptible to hardware
failures, such as disk crashes, which can lead to data loss. Distributed file
systems often have built-in redundancy and fault tolerance mechanisms,
spreading data across multiple nodes to prevent data loss from a single
point of failure.

3. Accessibility: Files on a local file system are only accessible from the
machine where they are stored. Distributed file systems provide remote
access, allowing users to access files from any machine connected to the
network, which is crucial for collaboration and remote work.

4. Resource Utilization: A local file system can lead to inefficient use of

resources since the storage and processing capabilities of a single
machine can become bottlenecks. Distributed file systems leverage
multiple machines, balancing the load and improving resource utilization.

A distributed file system (DFS) is a type of file system that allows access to
files from multiple hosts sharing via a computer network.

What Are the Different Types of Distributed File Systems?

These are the most common DFS implementations:

NFS (Network File System): Developed by Sun Microsystems, NFS allows

users to access files over a network as easily as if they were on their local disks.
It is widely used in UNIX and Linux environments.

AFS (Andrew File System): Developed by Carnegie Mellon University,

AFS uses a set of trusted servers to present a homogeneous, location-transparent
file namespace to all the client workstations.

HDFS (Hadoop Distributed File System): Part of the Apache Hadoop

project, HDFS is designed for storing large data sets reliably and streaming
those data sets at high bandwidth to user applications. It is particularly suited
for big data analytics.

Google File System (GFS): Designed by Google to meet the company's

rapidly growing data processing needs, GFS is highly scalable and fault-
tolerant, optimized for handling large amounts of data across many servers.

Amazon S3 (Simple Storage Service): While not a traditional file system,

S3 provides a distributed storage system designed to make web-scale computing
easier by offering object storage through a web service interface.

Cloud Store, an open-source DFS originally developed by Kosmix.

How Does a Distributed File System Work?

A distributed file system works as follows:

• Distribution: First, a DFS distributes datasets across multiple clusters or nodes.

Each node provides its own computing power, which enables a DFS to process
the datasets in parallel.

• Replication: A DFS will also replicate datasets onto different clusters by

copying the same pieces of information into multiple clusters. This helps the
distributed file system to achieve fault tolerance—to recover the data in case of
a node or cluster failure—as well as high concurrency, which enables the same
piece of data to be processed at the same time.
Files are divided into chunks, which are typically 64 megabytes in size. Chunks
are replicated, perhaps three times, at three different compute nodes. Moreover,
the nodes holding copies of one chunk should be located on different racks, so
we don’t lose all copies due to a rack failure. Normally, both the chunk size and
the degree of replication can be decided by the user.

Advantages of DFS:

1. Data Distribution: In a DFS, data is spread across multiple servers or

locations, which can improve reliability and availability.

2. Transparency: Users can access and manage files as if they were on

their local machines, without needing to know the physical location of
the data.

3. Scalability: Distributed file systems can easily scale by adding more

servers to handle increased data loads and user requests.

4. Fault Tolerance: By replicating data across multiple servers, a DFS can

continue to function even if one or more servers fail.

5. Performance: Distributed file systems often provide better performance

for large-scale data processing because they can handle multiple requests
simultaneously.
HDFS - Hadoop Distributed File System

HDFS stands for Hadoop Distributed File System. It is a distributed file system
designed to store large amounts of data across many machines, providing high
availability and fault tolerance.

Key features of HDFS:

1. Scalability: HDFS can handle petabytes of data by distributing it across

multiple servers, making it easy to scale out by adding more machines.

2. Fault Tolerance: Data is replicated across multiple nodes (typically three

copies by default), so if one node fails, the data can still be accessed from
another node.

3. High Throughput: HDFS is optimized for high throughput rather than

low latency, making it suitable for applications that need to process large
volumes of data.

4. Streaming Data Access: HDFS is designed for high-throughput data

access and is optimized for reading large files sequentially.

5. Large File Support: HDFS is capable of handling very large files,

making it ideal for big data applications.

6. Cost-Effective Storage: By using commodity hardware, HDFS offers a

cost-effective solution for storing large datasets. Hadoop doesn’t require
expensive, highly reliable hardware to run on. It’s designed to run on
clusters of commodity hardware (commonly available hardware available
from multiple vendors).
HDFS Architecture:

The following diagram illustrates the Hadoop concepts

Components:

1)Client:

o The interface through which users or applications interact with the

HDFS.

2)Name Node

Name Node plays a Master role in Master/Slaves Architecture

It manages the metadata and regulates access to files in HDFS,

❖ Name Node is the single point of contact for accessing files in HDFS So,
whereas Data Nodes acts as slaves. File System metadata is stored on
Name Node.

❖ Name node has Metadata. It contains

1)Number of blocks based on the block size of the file system

2)Replication information

3)block location based on Rack Awareness Algorithm

Thus, Metadata is relatively small in size and fits into Main Memory of a
computer machine. So, it is stored in Main Memory of Name Node to
allow fast access.

3)Data Node

❖ Name Node plays a slave role in Master/Slaves Architecture

❖ In data node, HDFS files are stored in the form of fixed size chunks of data
which are called blocks.

❖ Data Nodes serve read and write requests of clients on HDFS files
and also perform block creation, replication and deletions.

HDFS Blocks

❖ HDFS is a block structured file system. Each HDFS file is broken into
blocks of fixed size. The default size of the block in Hadoop 1.0 is 64
MB and the Hadoop 2.0 is 128 MB which are stored across various data
nodes on the cluster. Each of these blocks is stored as a separate file on
local file system on data nodes (Commodity machines on cluster).

Hadoop Commands :

Hadoop commands are used to upload / distributing the data / downloading

the data from HDFS file system.

❖ HDFS’s fsck command is useful to get the files and blocks details of file
system.

❖ Example: The following command list the blocks that make up each file in
the file system.
$ hadoop fsck -files -blocks

Advantages of Blocks

1. Quick Seek Time:

By default, HDFS Block Size is 128 MB which is much larger than any other
file system. In HDFS, large block size is maintained to reduce the seek time for
block access.

2. Ability to Store Large Files:

Another benefit of this block structure is that, there is no need to store all blocks
of a file on the same disk or node. So, a file’s size can be larger than the size of a
disk or node.

3. How Fault Tolerance is achieved with HDFS Blocks:

HDFS blocks feature suits well with the replication for providing fault tolerance
and availability.

By default, each block is replicated to three separate machines. This feature

insures blocks against corrupted blocks or disk or machine failure. If a block
becomes unavailable, a copy can be read from another machine. And a block that
is no longer available due to corruption or machine failure can be replicated from
its alternative machines to other live machines to bring the replication factor back
to the normal level (3 by default).

Rack Awareness Algorithm

1) First replica will get stored on the local rack.

2) Second replica will get stored on the other Datanode in the same rack.
3) Third replica will get stored on the different rack.
Replication:

• The diagram shows data replication where blocks written to DataNodes

in one rack are replicated to DataNodes in another rack based on Rack
Awareness Algorithm.

• This ensures data availability and reliability in case of node or rack

failure.

Note: No two replicas of the same block are present on the same datanode.

Heartbeat:

In the Hadoop Distributed File System (HDFS), a heartbeat is a mechanism

used to monitor the status and health of the DataNodes within the HDFS cluster.

DataNodes send heartbeat signals to the NameNode at regular intervals,

typically every 3 seconds by default.

Command-Line Interface

Action Command

Create a directory named /foodir bin/hadoop dfs -mkdir /foodir

View the contents of a file named bin/hadoop dfs –cat

/foodir/[Link]

Hadoop Filesystems

The Hadoop File System (HFS) is the primary data storage system used by
Hadoop applications. It employs a NameNode and DataNode architecture to
implement a distributed file system that provides high-performance access to data
across highly scalable Hadoop clusters.
The Java abstract class [Link] represents a filesystem
in Hadoop, and there are several concrete implementations, which are described
in Table

Anatomy of File Read :

Read Process:

o The client requests to read a file.

o The client first contacts the NameNode to retrieve the metadata,

including the locations of the blocks of the file.

o The NameNode provides the block locations.

o The client then directly reads the data blocks from the respective
DataNodes.
1)Open File: The HDFS client opens the file by contacting the NameNode.

2)Get Block Locations: The NameNode returns the block locations for the
file.

3)Establish FS Data Input Stream: The client sets up an input stream to read
data.

4)Read Data from DataNodes: The client reads the first block from the
DataNode.

5)Continue Reading: The client reads remaining blocks from the respective
DataNodes.

6)Close FS Data Input Stream: The client closes the input stream after reading
all data.
Anatomy of File Write

Write Process:

o The client requests to write a file.

o The client contacts the NameNode to get block allocations.

o The NameNode provides the DataNode locations for the new

blocks.

o The client writes the data blocks to a series of DataNodes, typically

in a pipeline fashion (one after another).

o Data blocks are replicated to ensure fault tolerance, typically

across different racks for redundancy (e.g., Rack 1 to Rack 2).

Steps in the File Write Operation:

1)Create (1): - Create File:

o The client initiates a file creation request.

o This is done by calling the [Link] method.

o This request is sent to the NameNode to create a new file in the

HDFS namespace.

2. Create (2) - NameNode Response

The NameNode responds by acknowledging the file creation request

o The NameNode allocates the necessary metadata and responds to

the client.

o This includes the list of DataNodes where the data blocks will be
stored.

3. Write (3): - Write Data

o The client starts writing data to the HDFS using the

FSDataOutputStream object.

o The data is written in packets and sent to the first DataNode in the
pipeline.

4. Write Packet (4): - Datanode Pipeline

o The first DataNode receives the data packet and writes it to its
local storage.

o It then forwards the packet to the next DataNode in the pipeline.

This process continues until all DataNodes in the pipeline have received
and stored the data packet

[Link] Packet (5): - Acknowledgement Packet

o Each DataNode, after writing the packet, sends an
acknowledgment back to the previous DataNode.

o This acknowledgment propagates back to the client.

6. Close (6): - Close the Stream

1. Once all data is written, the client closes the FSDataOutputStream.

2. This action indicates that the write operation is complete.

7. Complete (7): - Completion Acknowledgment

1. The NameNode receives a notification that the file write is

complete.

2. The NameNode finalizes the file creation and updates the file
system metadata.

MapReduce

➢ MapReduce is a processing framework for handling large data sets with a

distributed algorithm on a cluster.

➢ It provides two interfaces in the form of two functions:

➢ Map and Reduce. Users can override these two functions to interact with
and manipulate the data flow of running their programs.

MapReduce Architecture
User
Program

fork
fork fork
Master

assign
assign
Map
Reduce

Worker
Worker

Input Worker
Data Output
Intermediate File

Components of MapReduce

1. Map Phase - Processes input data to produce intermediate key-value

pairs.

2. Shuffle and Sort Phase - Groups intermediate key-value pairs by key.

3. Reduce Phase - Aggregates the values for each key to produce the final
output.
Step-by-Step Process

1. Input Splitting

The input data is divided into smaller chunks. In this case, each document can
be a chunk.

2)Map Phase

The mapper reads the input files line by line and emits each word as a key with
a count of 1 as the value.

3)Shuffle and Sort Phase

The framework sorts and groups the intermediate key-value pairs generated by
the mapper so that all values associated with the same key are grouped together.

4)Reduce Phase

The reducer processes each group of intermediate key-value pairs and sums up
the values to produce the final count for each word.
Algorithm for Map-Reduce

1. Map Function:

o The Map function processes input data (e.g., lines of text from a
document).

o It outputs key-value pairs, where the key is a word and the value is
1 (indicating one occurrence of the word).

2. Shuffle and Sort:

o The intermediate key-value pairs generated by the Map function

are grouped by key.

o All values for the same key are brought together.

3. Reduce Function:

o The Reduce function processes the grouped key-value pairs.

o It aggregates the values for each key to produce the final result,
which in this case is the total count of each word.

CODE:

Mapper:

def map (document)

for word in document. Split ()

emit (word, 1)

Reducer:

def reduce (word, counts)

total = sum(counts)

emit (word, total)

MapReduce Logical Data Flow

➢ The input data to both the Map and the Reduce functions has a particular
structure.

➢ The input data to the Map function is in the form of a (key, value) pair.

➢ The output data from the Map function is structured as (key, value) pairs
called intermediate (key, value) pairs.

➢ In other words, the user-defined Map function processes each input (key,
value) pair and produces a number of (zero, one, or more) intermediate
(key, value) pairs.

➢ Here, the goal is to process all input (key, value) pairs to the Map function
in parallel (Figure 6.2).
➢ In turn, the Reduce function receives the intermediate (key, value) pairs in
the form of a group of intermediate values associated with one intermediate
key, (key, [set of values]).

➢ The Map-Reduce framework forms these groups by first sorting the

intermediate (key, value) pairs and then grouping values with the same
key. It should be noted that the data is sorted to simplify the grouping
process. The Reduce function processes each (key, [set of values]) group
and produces a set of (key, value) pairs as output.

➢ To clarify the data flow in a sample Map-Reduce application, one of the

well-known Map-Reduce problems, namely word count, to count the
number of occurrences of each word in a collection of documents.

➢ Figure 6.3 demonstrates the data flow of the word-count problem for a
simple input file containing only two lines as follows:

(1) “most people ignore most poetry” and

(2) “most poetry ignores most people.”

➢ In this case, the Map function simultaneously produces a number of

intermediate (key, value) pairs for each line of content so that each word
is the intermediate key with 1 as its intermediate value; for example,
(ignore, 1).

➢ Then the Map-Reduce library collects all the generated intermediate (key,
value) pairs and sorts them to group the 1's for identical words; for
example, (people, [1,1]).

➢ Groups are then sent to the Reduce function in parallel so that it can sum
up the 1 values for each word and generate the actual number of occurrence
for each word in the file; for example, (people, 2).
Architecture of Map-Reduce in Hadoop

➢ Similar to HDFS, the Map-Reduce engine also has a master/slave

architecture consisting of a single JobTracker as the master and a number
of TaskTrackers as the slaves (workers).
Job Tracker

• The Job Tracker is the master daemon that manages and coordinates
MapReduce jobs. It is responsible for resource management, monitoring
and job scheduling.

Role of the JobTracker

1. Resource Management: It monitors the availability of resources such as

CPU and memory across the cluster nodes.

2. Job Scheduling: The JobTracker schedules jobs by assigning map and

reduce tasks to TaskTrackers in the Hadoop cluster.

3. Monitoring: It tracks the progress of each job, reporting the status and
performance metrics to the user.
4. Fault Tolerance: The JobTracker reassigns tasks that fail during
execution, ensuring the job completes even in the face of hardware
failures.

Task Tracker

• The Task Tracker is the slave daemon that executes individual tasks as
directed by the Job Tracker.

Role of the TaskTracker

1. Task Execution: The primary role of the TaskTracker is to execute the

map and reduce tasks assigned to it by the JobTracker.

2. Resource Management: Manages the resources (CPU, memory)

allocated to each task to ensure they run efficiently.

3. Progress Reporting: Regularly sends progress updates and status reports

to the JobTracker via heartbeats.

4. Fault Tolerance: Detects and handles task failures, restarting failed tasks
as necessary.

Workflow

1. Job Submission: A client submits a MapReduce job to the Job Tracker.

2. Job Initialization: The Job Tracker splits the job into smaller tasks and
assigns them to Task Trackers.

3. Task Execution: Task Trackers execute the tasks and process the data.

4. Progress Monitoring: Task Trackers send progress reports to the Job

Tracker.

5. Task Completion: Once tasks are completed, Task Trackers notify the
Job Tracker.
6. Job Completion: The Job Tracker consolidates the results and marks the
job as complete.

Example Use Cases

• Indexing web pages in search engines.

• Analyzing log files.

• Processing large sets of images or videos.

• Machine learning algorithms that require large-scale data processing.

Benefits of Map-Reduce

• Scalability: Can process petabytes of data by distributing the work across

many nodes.

• Fault Tolerance: Automatically handles node failures by re-executing

failed tasks.

• Simplicity: Abstracts the complexity of parallel and distributed

processing.
Matrix-vector multiplication using Map Reduce:

Matrix-vector multiplication can be efficiently performed using the Map Reduce

programming model, which is particularly useful for large-scale data processing.
Here’s a general outline of how to implement matrix-vector multiplication in
Map Reduce:

Matrix Representation

Assume the matrix A is represented as a set of key-value pairs where each key is
a tuple (i, j) representing the position in the matrix, and the value is the matrix
entry Aij

The vector x is represented as a set of key-value pairs where each key is the index
j and the value is the vector entry xj.

Map Reduce – Algorithm:

1. Input Format:

• Matrix A is represented as a set of key-value pairs (i,j) → Aij

• Vector x is represented as key-value pairs j→xj

2. Mapper Phase:

o The mapper takes pairs from the matrix A and the vector x.

o For each matrix entry (i, j , Aij) and corresponding vector entry
(j,xj), the mapper emits a key-value pair (i,Aij × xj)

3. Shuffle and Sort Phase:

o The framework sorts and groups all values by key i. This means all
partial products for each row i are grouped together.
4. Reducer Phase:

o The reducer sums the partial products for each key i to produce the
final entry of the resulting vector y.

o The reducer emits key-value pairs (i,yi)

Example

Let's consider the matrix A and vector x:

Mapper Output

For matrix entry A11=1 and vector entry x1=5

• Emit (1,1 × 5) = (1,5)

For matrix entry A12=2 and vector entry x2=6

• Emit (1,2 × 6 ) = (1,12)

For matrix entry A21=3 and vector entry x1=5

• Emit (2,3×5) = (2,15)

For matrix entry A22=4and vector entry x2=6

• Emit (2,4×6) = (2,24)

Shuffle and Sort Phase

The intermediate key-value pairs after shuffle and sort:

Intermediate results = {(1,[5,12]) , (2,[15,24])}

Reducer Output

For key 1:

• Sum 5+12=17

• Emit (1,17)

For key 2:

• Sum 15+24=39

• Emit (2,39)

The final result is the vector y:

This is how matrix-vector multiplication can be done using MapReduce.

In which scenario Matrix-vector multiplication in MapReduce is useful ?

When the matrix and vector are too large to fit into the memory of a
single machine.

Here are some key reasons why you might use MapReduce for this task:

1. Scalability

• Large-scale Data: When dealing with very large matrices (e.g., millions
of rows and columns), a single machine may not have enough memory or
processing power to handle the entire dataset. MapReduce allows the data
to be distributed across many machines in a cluster.

• Distributed Processing: MapReduce leverages the distributed

computing power of multiple machines to handle large datasets more
efficiently.

2. Fault Tolerance

• Redundancy: MapReduce provides fault tolerance by distributing the

data and processing tasks across multiple nodes. If a node fails, the
system can reassign the task to another node without losing data or
computational progress.

• Automatic Recovery: The MapReduce framework automatically handles

machine failures, ensuring the job completes successfully without manual
intervention.

3. Parallel Processing

• Simultaneous Computation: By dividing the task into smaller chunks

(map tasks), multiple parts of the matrix can be processed in parallel.
This significantly speeds up the computation compared to a serial
approach on a single machine.

• Optimal Resource Utilization: Efficiently uses the resources of a

cluster, maximizing throughput and minimizing latency.

4. Simplified Programming Model

• Abstraction: MapReduce abstracts the complexity of distributed

computing. Programmers can focus on writing map and reduce functions
without worrying about the underlying details of parallelism, data
distribution, and fault tolerance.

• Standardization: Provides a standardized way to process large-scale

data, making it easier to develop, maintain, and port data processing
applications.

5. Applications

• Machine Learning: Many machine learning algorithms, such as those

for training models, involve matrix-vector multiplications. Using
MapReduce allows these algorithms to scale to large datasets.

• Graph Processing: Algorithms for graph analysis, such as PageRank,

often involve repeated matrix-vector multiplications.

• Data Mining: Tasks such as computing recommendations, search

indices, and other data mining operations frequently involve matrix-
vector operations.

Real-life applications - Matrix-vector multiplication using MapReduce

Matrix-vector multiplication using MapReduce has several real-life

applications, especially in fields that involve large-scale data processing and
analysis. Here are some notable examples:

1. PageRank Algorithm

Application:

The PageRank algorithm, used by Google Search to rank web pages, involves
iteratively computing the rank of each page based on the ranks of the pages
linking to it.
Details:

• The web can be represented as a directed graph where nodes are web
pages, and edges are hyperlinks.

• The transition matrix, representing the probabilities of moving from one

page to another, is multiplied by the PageRank vector in each iteration.

• MapReduce is used to handle the large-scale computation of these

matrix-vector multiplications efficiently.

2. Recommendation Systems

Application:

Recommendation systems, such as those used by Netflix, Amazon, and Spotify,

use matrix factorization techniques to predict user preferences based on past
behavior.

Details:

• The user-item interaction matrix is multiplied by user and item feature

vectors.

• For large-scale datasets, MapReduce can distribute the computation

across a cluster, making it feasible to process millions of users and items.

By using MapReduce for matrix-vector multiplication, organizations can handle

very large datasets efficiently and effectively, leveraging the power of distributed
computing to overcome the limitations of single-machine processing.
Hadoop YARN

• YARN “Yet Another Resource Negotiator” is the resource

management layer of Hadoop. The Yarn was introduced in Hadoop 2.x. YARN
supports multiple processing frameworks in addition to Map-Reduce, such as
Spark.

Hadoop 2.x provides a general purpose data processing platform which is

not just limited to the Map-Reduce.

• Spark for micro – batch and iterative processing

• Storm for stream processing

• Hadoop for batch processing,

• Yarn allows different data processing engines like graph processing,

interactive processing, stream processing as well as batch processing to run
and process data stored in HDFS (Hadoop Distributed File System).

• Apart from resource management, Yarn also does job Scheduling.

• It enables Hadoop to process other data processing system other than Map-
Reduce. It allows running several different frameworks on the same
hardware where Hadoop is deployed.

YARN vs. Hadoop 1.x

In Hadoop 1.x, the Job Tracker was responsible for both resource
management and job scheduling, which led to scalability and resource utilization
issues.

YARN addresses these issues by splitting the responsibilities into separate

components:
Architecture – YARN

Apache Yarn Framework consists of a master daemon known as “Resource

Manager”, slave daemon called “node manager” (one per slave node) and
Application Master (one per application).

1. Resource Manager (RM) - Manages resources across the cluster

It is the master daemon of Yarn. RM manages the global assignments of

resources (CPU and memory) among all the applications. Resource Manager has
two Main components

•Scheduler

•Application manager

a) Scheduler

The scheduler is responsible for allocating the resources to the running

application. The scheduler is pure scheduler it means that it performs no
monitoring no tracking for the application and even doesn’t guarantees about
restarting failed tasks either due to application failure or hardware failures.

b) Application Manager

It manages running Application Masters in the cluster, i.e., it is responsible

for starting application masters and for monitoring and restarting them on
different nodes in case of failures.

2. Node Manager (NM) - Manages resources on individual nodes

It is the slave daemon of Yarn. NM is responsible for containers monitoring

their resource usage and reporting the same to the Resource Manager. It manages
the user process on that machine. Node Manager has two Main components

• Container
• Application Master (AM)

a) Container

The smallest unit of resource allocation in YARN. Each container runs a

part of an application and is mana ged by the Node Manager. A container is a
fundamental resource management unit. It encapsulates a set of resources such as
memory, CPU, and storage that are allocated by the Resource Manager (RM) to
a particular application for executing tasks.

Need of Container in YARN:

Isolation: Containers provide an isolated environment for running

applications. This isolation ensures that different applications or tasks do not
interfere with each other, which helps maintain system stability and security.

b). Application Master (AM) - it handles job scheduling and monitoring.

One application master runs per application. It negotiates resources

from the resource manager and works with the node manager. It Manages the
application life cycle, including resource requests and monitoring progress
The AM acquires containers from the RM’s Scheduler before contacting
the corresponding NMs to start the application’s individual tasks.

YARN additional Components:

1) Resource Manager Restart

2) Yarn Resource Manager High availability - - Standby Resource

Manager

3) Yarn Web Application Proxy

4) Yarn Docker Container Executor

1)Resource Manager Restart

Restarting the Resource Manager in Hadoop YARN is a crucial operation

for maintaining and managing a YARN cluster. The Resource Manager is
responsible for allocating resources to applications and managing their execution.

What Happens During a Resource Manager Restart?

Resource Management Disruption: During the restart, the Resource

Manager is temporarily unavailable. This means that no new applications can be
submitted, and existing applications might experience temporary interruptions in
resource allocation.

Types of Resource Manager Restart

a) Non-work-preserving RM restart

A non-work-preserving restart of the Resource Manager in Hadoop YARN

means that the restart does not guarantee the preservation of the state of running
applications or jobs. This type of restart can lead to the following consequences:

Application State Loss: During a non-work-preserving restart, there is a

risk that the Resource Manager may lose information about the currently running
applications, their states, and progress. This can result in the need to re-submit or
re-schedule applications after the Resource Manager comes back online.

Non-work-preserving restarts are typically used in scenarios where the

Resource Manager needs to be restarted due to issues like configuration changes,
software upgrades, or troubleshooting, but where the state preservation of running
applications is not feasible.

b) Work-preserving RM restart

A work-preserving restart of the Resource Manager in Hadoop YARN is

designed to maintain the state and progress of applications running on the cluster
during the restart process. This approach minimizes disruptions and ensures that
applications continue to run smoothly. Here’s what it entails:

State Preservation: The Resource Manager retains the state of running

applications, including their progress and resource allocations, during the restart.
This means that applications do not need to be re-submitted or restarted from
scratch.

2)Yarn Resource Manager High availability

Failover Mechanism: Work-preserving restarts often involve mechanisms

such as high-availability (HA) setups. In a high-availability configuration, there
are typically two Resource Managers: an active one and a standby one. If the
active Resource Manager fails or needs to restart, the standby Resource Manager
takes over, ensuring continuous operation and state preservation.

Before to Hadoop v2.4, the master (RM)was the SPOF (single point of
failure). The High Availability feature adds redundancy in the form of an
Active/Standby Resource Manager pair to remove this single point of failure.
Resource Manager HA is realized through an Active/Standby architecture
- at any point of time, one of the RMs is Active, and one or more RMs are in
Standby mode waiting to take over should anything happen to the Active.

Zookeeper is a centralized service for maintaining configuration

information, naming, providing distributed synchronization, and providing group
services. In yarn with HA, Automatic failover is set up via Zookeeper.

3)Yarn Web Application Proxy

A proxy is an intermediary server or service that acts as a gateway between

a client and a destination server.

By default, it runs as a part of RM but we can configure and run in a

standalone mode. Hence, the reason of the proxy is to reduce the possibility of
the web-based attack through Yarn.

4)Yarn Docker Container Executor

Docker containers are created from Docker images, which are built using
Docker files. The Docker Engine manages and runs these containers, enabling
efficient and scalable application deployment.
A Docker container is a lightweight, standalone, and executable package
that contains everything needed to run a piece of software, including the code,
runtime, libraries, and settings. It provides a consistent and isolated environment
across different system

Docker generates light weight virtual machines. The Docker Container

Executor allows the Yarn Node Manager to launch yarn container to Docker
container. These containers provide a custom software environment in which
user’s code run, isolated from a software environment of Node Manager.

Benefits of YARN

1. Scalability: - Decouples resource management from job scheduling,

allowing for more scalable and efficient resource utilization.

2. Flexibility: - Supports different types of distributed applications

beyond Map-Reduce, such as Apache Spark, Apache Flink, and more.

3. Resource Utilization: - Better resource utilization by dynamically

allocating resources based on the needs of the applications.

4. Fault Tolerance: Improved fault tolerance by monitoring and

handling task failures more efficiently.

5. Multi-tenancy: Supports multiple users and applications, making it

suitable for shared environments like cloud services.

Hadoop Architecture and HDFS Overview
No ratings yet
Hadoop Architecture and HDFS Overview
258 pages
Hadoop Framework: HDFS & DFS Concepts
No ratings yet
Hadoop Framework: HDFS & DFS Concepts
21 pages
Hadoop Architecture and HDFS Overview
No ratings yet
Hadoop Architecture and HDFS Overview
248 pages
Overview of File Systems and Hadoop
No ratings yet
Overview of File Systems and Hadoop
34 pages
Hadoop and MapReduce Overview
No ratings yet
Hadoop and MapReduce Overview
16 pages
Module 2
No ratings yet
Module 2
12 pages
Understanding Block Abstraction in HDFS
No ratings yet
Understanding Block Abstraction in HDFS
22 pages
Hadoop Framework: HDFS & MapReduce Overview
No ratings yet
Hadoop Framework: HDFS & MapReduce Overview
25 pages
RDBMS vs Hadoop: Key Differences Explained
No ratings yet
RDBMS vs Hadoop: Key Differences Explained
56 pages
Overview of HDFS and AFS Features
No ratings yet
Overview of HDFS and AFS Features
9 pages
Understanding Hadoop's HDFS Architecture
No ratings yet
Understanding Hadoop's HDFS Architecture
41 pages
Unit2 HDFS 6feb2026
No ratings yet
Unit2 HDFS 6feb2026
24 pages
Understanding Hadoop HDFS Architecture
No ratings yet
Understanding Hadoop HDFS Architecture
20 pages
Big Data (Unit 3)
No ratings yet
Big Data (Unit 3)
27 pages
HDFS Architecture and Features Explained
No ratings yet
HDFS Architecture and Features Explained
6 pages
Overview of HDFS Features and Operations
No ratings yet
Overview of HDFS Features and Operations
51 pages
Understanding Big Data and HDFS
No ratings yet
Understanding Big Data and HDFS
421 pages
Understanding Hadoop's HDFS Architecture
No ratings yet
Understanding Hadoop's HDFS Architecture
15 pages
Overview of Hadoop Distributed File System
No ratings yet
Overview of Hadoop Distributed File System
23 pages
Understanding Hadoop's HDFS Architecture
No ratings yet
Understanding Hadoop's HDFS Architecture
17 pages
Overview of Hadoop HDFS Features
No ratings yet
Overview of Hadoop HDFS Features
90 pages
Introduction to Hadoop and DFS Concepts
No ratings yet
Introduction to Hadoop and DFS Concepts
120 pages
HDFS Architecture and Concepts Explained
No ratings yet
HDFS Architecture and Concepts Explained
20 pages
HDFS Fault Tolerance Mechanisms
No ratings yet
HDFS Fault Tolerance Mechanisms
9 pages
Overview of Hadoop Architecture
No ratings yet
Overview of Hadoop Architecture
48 pages
Big Data Unit III
No ratings yet
Big Data Unit III
38 pages
Unit - 2 - Hadoop Distributed File System
No ratings yet
Unit - 2 - Hadoop Distributed File System
6 pages
Big Data Storage with Hadoop DFS
No ratings yet
Big Data Storage with Hadoop DFS
14 pages
HDFS Data Replication Explained
No ratings yet
HDFS Data Replication Explained
65 pages
RDBMS vs Hadoop: Key Differences
No ratings yet
RDBMS vs Hadoop: Key Differences
19 pages
Understanding Hadoop as a Framework
No ratings yet
Understanding Hadoop as a Framework
34 pages
Understanding HDFS Architecture and Benefits
No ratings yet
Understanding HDFS Architecture and Benefits
39 pages
Overview of Hadoop Ecosystem
No ratings yet
Overview of Hadoop Ecosystem
24 pages
Big Data Ecosystem and GFS Overview
No ratings yet
Big Data Ecosystem and GFS Overview
55 pages
Understanding HDFS: Features & Architecture
No ratings yet
Understanding HDFS: Features & Architecture
16 pages
Overview of Hadoop and HDFS Architecture
100% (2)
Overview of Hadoop and HDFS Architecture
14 pages
HDFS Understanding
No ratings yet
HDFS Understanding
8 pages
Overview of Hadoop HDFS Features and Architecture
No ratings yet
Overview of Hadoop HDFS Features and Architecture
89 pages
Overview of Hadoop HDFS Architecture
No ratings yet
Overview of Hadoop HDFS Architecture
15 pages
Big Data Unit 3
No ratings yet
Big Data Unit 3
21 pages
Introduction to the Hadoop Ecosystem
No ratings yet
Introduction to the Hadoop Ecosystem
46 pages
Introduction to Hadoop and DFS
No ratings yet
Introduction to Hadoop and DFS
34 pages
Overview of Storage Systems in Cloud Computing
No ratings yet
Overview of Storage Systems in Cloud Computing
14 pages
Understanding HDFS in Big Data
No ratings yet
Understanding HDFS in Big Data
21 pages
Cloud Computing Mid 2
No ratings yet
Cloud Computing Mid 2
8 pages
Overview of Apache Hadoop Framework
No ratings yet
Overview of Apache Hadoop Framework
57 pages
Understanding Hadoop HDFS Architecture
No ratings yet
Understanding Hadoop HDFS Architecture
183 pages
Understanding Distributed File Systems
No ratings yet
Understanding Distributed File Systems
15 pages
HDFS Overview by Neha Pathipati
No ratings yet
HDFS Overview by Neha Pathipati
25 pages
Hadoop Modules and MapReduce Overview
No ratings yet
Hadoop Modules and MapReduce Overview
46 pages
Overview of HDFS Architecture and Features
No ratings yet
Overview of HDFS Architecture and Features
5 pages
HDFS Overview: Design, Benefits, and Operations
No ratings yet
HDFS Overview: Design, Benefits, and Operations
27 pages
HDFS Architecture and Features Explained
No ratings yet
HDFS Architecture and Features Explained
39 pages
HDFS: Overview of Hadoop Storage System
No ratings yet
HDFS: Overview of Hadoop Storage System
148 pages
Overview of HDFS Architecture and Features
No ratings yet
Overview of HDFS Architecture and Features
20 pages
Understanding HDFS: Design & Concepts
No ratings yet
Understanding HDFS: Design & Concepts
46 pages
Understanding Hadoop HDFS Architecture
No ratings yet
Understanding Hadoop HDFS Architecture
22 pages
HDFS Architecture and Components Overview
No ratings yet
HDFS Architecture and Components Overview
30 pages
Understanding Hadoop HDFS Architecture
No ratings yet
Understanding Hadoop HDFS Architecture
15 pages
The Arab Uprisings Explained New Contentious Politics in The Middle East Marc Lynch (Editor) Available Instanly
100% (2)
The Arab Uprisings Explained New Contentious Politics in The Middle East Marc Lynch (Editor) Available Instanly
114 pages
Velocity and Acceleration in Mechanisms
No ratings yet
Velocity and Acceleration in Mechanisms
16 pages
Decree Absolute Search Revised - Aug - 16
No ratings yet
Decree Absolute Search Revised - Aug - 16
2 pages
Defining Components of Tourist Destinations
No ratings yet
Defining Components of Tourist Destinations
40 pages
CPD551 Video Surveillance System Specs
No ratings yet
CPD551 Video Surveillance System Specs
15 pages
Cycle vs. Scooter: Commuting Comparison
No ratings yet
Cycle vs. Scooter: Commuting Comparison
5 pages
SPPC2000 Security Solutions Overview
No ratings yet
SPPC2000 Security Solutions Overview
9 pages
MD2U Stepper Motor Driver Manual
No ratings yet
MD2U Stepper Motor Driver Manual
1 page
Silver Jubilee Celebration (Responses) - Google Sheets
No ratings yet
Silver Jubilee Celebration (Responses) - Google Sheets
16 pages
Urban Village Development Plan Framework
No ratings yet
Urban Village Development Plan Framework
27 pages
Jeremy Newcomb: Leadership & Sales Expertise
No ratings yet
Jeremy Newcomb: Leadership & Sales Expertise
2 pages
Jinnah's Fourteen Points Explained
No ratings yet
Jinnah's Fourteen Points Explained
1 page
Understanding Mitigation Blocks in Trading
No ratings yet
Understanding Mitigation Blocks in Trading
2 pages
Jio Jio Customer Support Executive Chat Voice Work From Home Earn Upto 40k February 6 2026
No ratings yet
Jio Jio Customer Support Executive Chat Voice Work From Home Earn Upto 40k February 6 2026
1 page
HTML Project for Vocational Training
No ratings yet
HTML Project for Vocational Training
18 pages
Plan Regulador Transporte Huancayo
No ratings yet
Plan Regulador Transporte Huancayo
9 pages
Quatro Modbus Registers Overview
No ratings yet
Quatro Modbus Registers Overview
6 pages
Risk Assessment and Production Plan
No ratings yet
Risk Assessment and Production Plan
105 pages
Clarion Apa450 Channel 4-3-2 Car Amplifier
No ratings yet
Clarion Apa450 Channel 4-3-2 Car Amplifier
17 pages
ICHI: New Global Health Intervention Codes
No ratings yet
ICHI: New Global Health Intervention Codes
4 pages
JnanaBhumi Scholarship Status Check
No ratings yet
JnanaBhumi Scholarship Status Check
3 pages
Understanding RMS Voltage in AC Circuits
No ratings yet
Understanding RMS Voltage in AC Circuits
11 pages
Managing Blood Pressure 110/60
No ratings yet
Managing Blood Pressure 110/60
13 pages
Sodium Cyanide Safety Training Guide
No ratings yet
Sodium Cyanide Safety Training Guide
21 pages
e-Governance Systems: Features & Frameworks
No ratings yet
e-Governance Systems: Features & Frameworks
14 pages
SAS Certification Exam Study Guide
No ratings yet
SAS Certification Exam Study Guide
10 pages
Mallya Extradition Case Judgment 2020
No ratings yet
Mallya Extradition Case Judgment 2020
46 pages
Cross-Listing and Accounting Quality Insights
No ratings yet
Cross-Listing and Accounting Quality Insights
14 pages
Overview of HVAC System Components
No ratings yet
Overview of HVAC System Components
58 pages