0% found this document useful (0 votes)
6 views36 pages

Hadoop Framework: HDFS & MapReduce Concepts

Unit II covers the Hadoop framework, focusing on distributed file systems like HDFS, which allows for scalable and fault-tolerant storage of large datasets across multiple machines. It explains the architecture of HDFS, including the roles of NameNode and DataNode, as well as the MapReduce processing framework that enables distributed data processing through map and reduce functions. Key features include data replication for fault tolerance, high throughput, and the ability to handle large files efficiently.

Uploaded by

codegeass0047
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views36 pages

Hadoop Framework: HDFS & MapReduce Concepts

Unit II covers the Hadoop framework, focusing on distributed file systems like HDFS, which allows for scalable and fault-tolerant storage of large datasets across multiple machines. It explains the architecture of HDFS, including the roles of NameNode and DataNode, as well as the MapReduce processing framework that enables distributed data processing through map and reduce functions. Key features include data replication for fault tolerance, high throughput, and the ability to handle large files efficiently.

Uploaded by

codegeass0047
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

UNIT II - HADOOP FRAMEWORK

Distributed File Systems - Large-Scale File System Organization – HDFS


concepts - MapReduce Execution, Algorithms using MapReduce, Matrix
Vector Multiplication – Hadoop YARN

Distributed File Systems / Large-Scale File-System Organization

A file system is a method used by an operating system to organize, store, retrieve,


and manage files on a storage device, such as a hard drive or SSD. It provides a
way to keep track of where files are located, how they are named, and who can
access them.

Types of local File Systems:

Different operating systems use different file systems. Examples

1)NTFS (New Technology File System) for Windows,

2)ext (Extended File System) for Linux,

3)APFS (Apple File System) for macOS,

4)FAT32 (File Allocation Table) for various systems.

Local file systems are simple and effective for single-machine environment,

But it has several disadvantages compared to distributed file systems:

1. Limited Scalability: Local file systems are confined to the storage


capacity and resources of a single machine. Distributed file systems can
scale out by adding more machines, allowing them to handle larger
amounts of data and more simultaneous users.
2. Single Point of Failure: A local file system is susceptible to hardware
failures, such as disk crashes, which can lead to data loss. Distributed file
systems often have built-in redundancy and fault tolerance mechanisms,
spreading data across multiple nodes to prevent data loss from a single
point of failure.

3. Accessibility: Files on a local file system are only accessible from the
machine where they are stored. Distributed file systems provide remote
access, allowing users to access files from any machine connected to the
network, which is crucial for collaboration and remote work.

4. Resource Utilization: A local file system can lead to inefficient use of


resources since the storage and processing capabilities of a single
machine can become bottlenecks. Distributed file systems leverage
multiple machines, balancing the load and improving resource utilization.

A distributed file system (DFS) is a type of file system that allows access to
files from multiple hosts sharing via a computer network.

What Are the Different Types of Distributed File Systems?

These are the most common DFS implementations:

NFS (Network File System): Developed by Sun Microsystems, NFS allows


users to access files over a network as easily as if they were on their local disks.
It is widely used in UNIX and Linux environments.

AFS (Andrew File System): Developed by Carnegie Mellon University,


AFS uses a set of trusted servers to present a homogeneous, location-transparent
file namespace to all the client workstations.

HDFS (Hadoop Distributed File System): Part of the Apache Hadoop


project, HDFS is designed for storing large data sets reliably and streaming
those data sets at high bandwidth to user applications. It is particularly suited
for big data analytics.

Google File System (GFS): Designed by Google to meet the company's


rapidly growing data processing needs, GFS is highly scalable and fault-
tolerant, optimized for handling large amounts of data across many servers.

Amazon S3 (Simple Storage Service): While not a traditional file system,


S3 provides a distributed storage system designed to make web-scale computing
easier by offering object storage through a web service interface.

Cloud Store, an open-source DFS originally developed by Kosmix.

How Does a Distributed File System Work?

A distributed file system works as follows:

• Distribution: First, a DFS distributes datasets across multiple clusters or nodes.


Each node provides its own computing power, which enables a DFS to process
the datasets in parallel.

• Replication: A DFS will also replicate datasets onto different clusters by


copying the same pieces of information into multiple clusters. This helps the
distributed file system to achieve fault tolerance—to recover the data in case of
a node or cluster failure—as well as high concurrency, which enables the same
piece of data to be processed at the same time.
Files are divided into chunks, which are typically 64 megabytes in size. Chunks
are replicated, perhaps three times, at three different compute nodes. Moreover,
the nodes holding copies of one chunk should be located on different racks, so
we don’t lose all copies due to a rack failure. Normally, both the chunk size and
the degree of replication can be decided by the user.

Advantages of DFS:

1. Data Distribution: In a DFS, data is spread across multiple servers or


locations, which can improve reliability and availability.

2. Transparency: Users can access and manage files as if they were on


their local machines, without needing to know the physical location of
the data.

3. Scalability: Distributed file systems can easily scale by adding more


servers to handle increased data loads and user requests.

4. Fault Tolerance: By replicating data across multiple servers, a DFS can


continue to function even if one or more servers fail.

5. Performance: Distributed file systems often provide better performance


for large-scale data processing because they can handle multiple requests
simultaneously.
HDFS - Hadoop Distributed File System

HDFS stands for Hadoop Distributed File System. It is a distributed file system
designed to store large amounts of data across many machines, providing high
availability and fault tolerance.

Key features of HDFS:

1. Scalability: HDFS can handle petabytes of data by distributing it across


multiple servers, making it easy to scale out by adding more machines.

2. Fault Tolerance: Data is replicated across multiple nodes (typically three


copies by default), so if one node fails, the data can still be accessed from
another node.

3. High Throughput: HDFS is optimized for high throughput rather than


low latency, making it suitable for applications that need to process large
volumes of data.

4. Streaming Data Access: HDFS is designed for high-throughput data


access and is optimized for reading large files sequentially.

5. Large File Support: HDFS is capable of handling very large files,


making it ideal for big data applications.

6. Cost-Effective Storage: By using commodity hardware, HDFS offers a


cost-effective solution for storing large datasets. Hadoop doesn’t require
expensive, highly reliable hardware to run on. It’s designed to run on
clusters of commodity hardware (commonly available hardware available
from multiple vendors).
HDFS Architecture:

The following diagram illustrates the Hadoop concepts

Components:

1)Client:

o The interface through which users or applications interact with the


HDFS.

2)Name Node

Name Node plays a Master role in Master/Slaves Architecture

It manages the metadata and regulates access to files in HDFS,

❖ Name Node is the single point of contact for accessing files in HDFS So,
whereas Data Nodes acts as slaves. File System metadata is stored on
Name Node.

❖ Name node has Metadata. It contains


1)Number of blocks based on the block size of the file system

2)Replication information

3)block location based on Rack Awareness Algorithm

Thus, Metadata is relatively small in size and fits into Main Memory of a
computer machine. So, it is stored in Main Memory of Name Node to
allow fast access.

3)Data Node

❖ Name Node plays a slave role in Master/Slaves Architecture

❖ In data node, HDFS files are stored in the form of fixed size chunks of data
which are called blocks.

❖ Data Nodes serve read and write requests of clients on HDFS files
and also perform block creation, replication and deletions.

HDFS Blocks

❖ HDFS is a block structured file system. Each HDFS file is broken into
blocks of fixed size. The default size of the block in Hadoop 1.0 is 64
MB and the Hadoop 2.0 is 128 MB which are stored across various data
nodes on the cluster. Each of these blocks is stored as a separate file on
local file system on data nodes (Commodity machines on cluster).

Hadoop Commands :

Hadoop commands are used to upload / distributing the data / downloading


the data from HDFS file system.

❖ HDFS’s fsck command is useful to get the files and blocks details of file
system.

❖ Example: The following command list the blocks that make up each file in
the file system.
$ hadoop fsck -files -blocks

Advantages of Blocks

1. Quick Seek Time:

By default, HDFS Block Size is 128 MB which is much larger than any other
file system. In HDFS, large block size is maintained to reduce the seek time for
block access.

2. Ability to Store Large Files:

Another benefit of this block structure is that, there is no need to store all blocks
of a file on the same disk or node. So, a file’s size can be larger than the size of a
disk or node.

3. How Fault Tolerance is achieved with HDFS Blocks:

HDFS blocks feature suits well with the replication for providing fault tolerance
and availability.

By default, each block is replicated to three separate machines. This feature


insures blocks against corrupted blocks or disk or machine failure. If a block
becomes unavailable, a copy can be read from another machine. And a block that
is no longer available due to corruption or machine failure can be replicated from
its alternative machines to other live machines to bring the replication factor back
to the normal level (3 by default).

Rack Awareness Algorithm

1) First replica will get stored on the local rack.


2) Second replica will get stored on the other Datanode in the same rack.
3) Third replica will get stored on the different rack.
Replication:

• The diagram shows data replication where blocks written to DataNodes


in one rack are replicated to DataNodes in another rack based on Rack
Awareness Algorithm.

• This ensures data availability and reliability in case of node or rack


failure.

Note: No two replicas of the same block are present on the same datanode.

Heartbeat:

In the Hadoop Distributed File System (HDFS), a heartbeat is a mechanism


used to monitor the status and health of the DataNodes within the HDFS cluster.

DataNodes send heartbeat signals to the NameNode at regular intervals,


typically every 3 seconds by default.

Command-Line Interface

Action Command

Create a directory named /foodir bin/hadoop dfs -mkdir /foodir

View the contents of a file named bin/hadoop dfs –cat


/foodir/[Link]

Hadoop Filesystems

The Hadoop File System (HFS) is the primary data storage system used by
Hadoop applications. It employs a NameNode and DataNode architecture to
implement a distributed file system that provides high-performance access to data
across highly scalable Hadoop clusters.
The Java abstract class [Link] represents a filesystem
in Hadoop, and there are several concrete implementations, which are described
in Table

Anatomy of File Read :

Read Process:

o The client requests to read a file.

o The client first contacts the NameNode to retrieve the metadata,


including the locations of the blocks of the file.

o The NameNode provides the block locations.

o The client then directly reads the data blocks from the respective
DataNodes.
1)Open File: The HDFS client opens the file by contacting the NameNode.

2)Get Block Locations: The NameNode returns the block locations for the
file.

3)Establish FS Data Input Stream: The client sets up an input stream to read
data.

4)Read Data from DataNodes: The client reads the first block from the
DataNode.

5)Continue Reading: The client reads remaining blocks from the respective
DataNodes.

6)Close FS Data Input Stream: The client closes the input stream after reading
all data.
Anatomy of File Write

Write Process:

o The client requests to write a file.

o The client contacts the NameNode to get block allocations.

o The NameNode provides the DataNode locations for the new


blocks.

o The client writes the data blocks to a series of DataNodes, typically


in a pipeline fashion (one after another).

o Data blocks are replicated to ensure fault tolerance, typically


across different racks for redundancy (e.g., Rack 1 to Rack 2).

Steps in the File Write Operation:


1)Create (1): - Create File:

o The client initiates a file creation request.

o This is done by calling the [Link] method.

o This request is sent to the NameNode to create a new file in the


HDFS namespace.

2. Create (2) - NameNode Response

The NameNode responds by acknowledging the file creation request

o The NameNode allocates the necessary metadata and responds to


the client.

o This includes the list of DataNodes where the data blocks will be
stored.

3. Write (3): - Write Data

o The client starts writing data to the HDFS using the


FSDataOutputStream object.

o The data is written in packets and sent to the first DataNode in the
pipeline.

4. Write Packet (4): - Datanode Pipeline

o The first DataNode receives the data packet and writes it to its
local storage.

o It then forwards the packet to the next DataNode in the pipeline.

This process continues until all DataNodes in the pipeline have received
and stored the data packet

[Link] Packet (5): - Acknowledgement Packet


o Each DataNode, after writing the packet, sends an
acknowledgment back to the previous DataNode.

o This acknowledgment propagates back to the client.

6. Close (6): - Close the Stream

1. Once all data is written, the client closes the FSDataOutputStream.

2. This action indicates that the write operation is complete.

7. Complete (7): - Completion Acknowledgment

1. The NameNode receives a notification that the file write is


complete.

2. The NameNode finalizes the file creation and updates the file
system metadata.

MapReduce

➢ MapReduce is a processing framework for handling large data sets with a


distributed algorithm on a cluster.

➢ It provides two interfaces in the form of two functions:

➢ Map and Reduce. Users can override these two functions to interact with
and manipulate the data flow of running their programs.

MapReduce Architecture
User
Program

fork
fork fork
Master

assign
assign
Map
Reduce

Worker
Worker

Worker
Worker

Input Worker
Data Output
Intermediate File

Components of MapReduce

1. Map Phase - Processes input data to produce intermediate key-value


pairs.

2. Shuffle and Sort Phase - Groups intermediate key-value pairs by key.

3. Reduce Phase - Aggregates the values for each key to produce the final
output.
Step-by-Step Process

1. Input Splitting

The input data is divided into smaller chunks. In this case, each document can
be a chunk.

2)Map Phase

The mapper reads the input files line by line and emits each word as a key with
a count of 1 as the value.

3)Shuffle and Sort Phase

The framework sorts and groups the intermediate key-value pairs generated by
the mapper so that all values associated with the same key are grouped together.

4)Reduce Phase

The reducer processes each group of intermediate key-value pairs and sums up
the values to produce the final count for each word.
Algorithm for Map-Reduce

1. Map Function:

o The Map function processes input data (e.g., lines of text from a
document).

o It outputs key-value pairs, where the key is a word and the value is
1 (indicating one occurrence of the word).

2. Shuffle and Sort:

o The intermediate key-value pairs generated by the Map function


are grouped by key.

o All values for the same key are brought together.

3. Reduce Function:

o The Reduce function processes the grouped key-value pairs.

o It aggregates the values for each key to produce the final result,
which in this case is the total count of each word.

CODE:

Mapper:

def map (document)

for word in document. Split ()

emit (word, 1)

Reducer:

def reduce (word, counts)

total = sum(counts)

emit (word, total)


MapReduce Logical Data Flow

➢ The input data to both the Map and the Reduce functions has a particular
structure.

➢ The input data to the Map function is in the form of a (key, value) pair.

➢ The output data from the Map function is structured as (key, value) pairs
called intermediate (key, value) pairs.

➢ In other words, the user-defined Map function processes each input (key,
value) pair and produces a number of (zero, one, or more) intermediate
(key, value) pairs.

➢ Here, the goal is to process all input (key, value) pairs to the Map function
in parallel (Figure 6.2).
➢ In turn, the Reduce function receives the intermediate (key, value) pairs in
the form of a group of intermediate values associated with one intermediate
key, (key, [set of values]).

➢ The Map-Reduce framework forms these groups by first sorting the


intermediate (key, value) pairs and then grouping values with the same
key. It should be noted that the data is sorted to simplify the grouping
process. The Reduce function processes each (key, [set of values]) group
and produces a set of (key, value) pairs as output.

➢ To clarify the data flow in a sample Map-Reduce application, one of the


well-known Map-Reduce problems, namely word count, to count the
number of occurrences of each word in a collection of documents.

➢ Figure 6.3 demonstrates the data flow of the word-count problem for a
simple input file containing only two lines as follows:

(1) “most people ignore most poetry” and

(2) “most poetry ignores most people.”

➢ In this case, the Map function simultaneously produces a number of


intermediate (key, value) pairs for each line of content so that each word
is the intermediate key with 1 as its intermediate value; for example,
(ignore, 1).

➢ Then the Map-Reduce library collects all the generated intermediate (key,
value) pairs and sorts them to group the 1's for identical words; for
example, (people, [1,1]).

➢ Groups are then sent to the Reduce function in parallel so that it can sum
up the 1 values for each word and generate the actual number of occurrence
for each word in the file; for example, (people, 2).
Architecture of Map-Reduce in Hadoop

➢ Similar to HDFS, the Map-Reduce engine also has a master/slave


architecture consisting of a single JobTracker as the master and a number
of TaskTrackers as the slaves (workers).
Job Tracker

• The Job Tracker is the master daemon that manages and coordinates
MapReduce jobs. It is responsible for resource management, monitoring
and job scheduling.

Role of the JobTracker

1. Resource Management: It monitors the availability of resources such as


CPU and memory across the cluster nodes.

2. Job Scheduling: The JobTracker schedules jobs by assigning map and


reduce tasks to TaskTrackers in the Hadoop cluster.

3. Monitoring: It tracks the progress of each job, reporting the status and
performance metrics to the user.
4. Fault Tolerance: The JobTracker reassigns tasks that fail during
execution, ensuring the job completes even in the face of hardware
failures.

Task Tracker

• The Task Tracker is the slave daemon that executes individual tasks as
directed by the Job Tracker.

Role of the TaskTracker

1. Task Execution: The primary role of the TaskTracker is to execute the


map and reduce tasks assigned to it by the JobTracker.

2. Resource Management: Manages the resources (CPU, memory)


allocated to each task to ensure they run efficiently.

3. Progress Reporting: Regularly sends progress updates and status reports


to the JobTracker via heartbeats.

4. Fault Tolerance: Detects and handles task failures, restarting failed tasks
as necessary.

Workflow

1. Job Submission: A client submits a MapReduce job to the Job Tracker.

2. Job Initialization: The Job Tracker splits the job into smaller tasks and
assigns them to Task Trackers.

3. Task Execution: Task Trackers execute the tasks and process the data.

4. Progress Monitoring: Task Trackers send progress reports to the Job


Tracker.

5. Task Completion: Once tasks are completed, Task Trackers notify the
Job Tracker.
6. Job Completion: The Job Tracker consolidates the results and marks the
job as complete.

Example Use Cases

• Indexing web pages in search engines.

• Analyzing log files.

• Processing large sets of images or videos.

• Machine learning algorithms that require large-scale data processing.

Benefits of Map-Reduce

• Scalability: Can process petabytes of data by distributing the work across


many nodes.

• Fault Tolerance: Automatically handles node failures by re-executing


failed tasks.

• Simplicity: Abstracts the complexity of parallel and distributed


processing.
Matrix-vector multiplication using Map Reduce:

Matrix-vector multiplication can be efficiently performed using the Map Reduce


programming model, which is particularly useful for large-scale data processing.
Here’s a general outline of how to implement matrix-vector multiplication in
Map Reduce:

Matrix Representation

Assume the matrix A is represented as a set of key-value pairs where each key is
a tuple (i, j) representing the position in the matrix, and the value is the matrix
entry Aij

The vector x is represented as a set of key-value pairs where each key is the index
j and the value is the vector entry xj.

Map Reduce – Algorithm:

1. Input Format:

• Matrix A is represented as a set of key-value pairs (i,j) → Aij

• Vector x is represented as key-value pairs j→xj

2. Mapper Phase:

o The mapper takes pairs from the matrix A and the vector x.

o For each matrix entry (i, j , Aij) and corresponding vector entry
(j,xj), the mapper emits a key-value pair (i,Aij × xj)

3. Shuffle and Sort Phase:

o The framework sorts and groups all values by key i. This means all
partial products for each row i are grouped together.
4. Reducer Phase:

o The reducer sums the partial products for each key i to produce the
final entry of the resulting vector y.

o The reducer emits key-value pairs (i,yi)

Example

Let's consider the matrix A and vector x:

Mapper Output

For matrix entry A11=1 and vector entry x1=5

• Emit (1,1 × 5) = (1,5)

For matrix entry A12=2 and vector entry x2=6

• Emit (1,2 × 6 ) = (1,12)

For matrix entry A21=3 and vector entry x1=5

• Emit (2,3×5) = (2,15)

For matrix entry A22=4and vector entry x2=6

• Emit (2,4×6) = (2,24)

Shuffle and Sort Phase

The intermediate key-value pairs after shuffle and sort:


Intermediate results = {(1,[5,12]) , (2,[15,24])}

Reducer Output

For key 1:

• Sum 5+12=17

• Emit (1,17)

For key 2:

• Sum 15+24=39

• Emit (2,39)

The final result is the vector y:

This is how matrix-vector multiplication can be done using MapReduce.

In which scenario Matrix-vector multiplication in MapReduce is useful ?

When the matrix and vector are too large to fit into the memory of a
single machine.

Here are some key reasons why you might use MapReduce for this task:

1. Scalability

• Large-scale Data: When dealing with very large matrices (e.g., millions
of rows and columns), a single machine may not have enough memory or
processing power to handle the entire dataset. MapReduce allows the data
to be distributed across many machines in a cluster.

• Distributed Processing: MapReduce leverages the distributed


computing power of multiple machines to handle large datasets more
efficiently.

2. Fault Tolerance

• Redundancy: MapReduce provides fault tolerance by distributing the


data and processing tasks across multiple nodes. If a node fails, the
system can reassign the task to another node without losing data or
computational progress.

• Automatic Recovery: The MapReduce framework automatically handles


machine failures, ensuring the job completes successfully without manual
intervention.

3. Parallel Processing

• Simultaneous Computation: By dividing the task into smaller chunks


(map tasks), multiple parts of the matrix can be processed in parallel.
This significantly speeds up the computation compared to a serial
approach on a single machine.

• Optimal Resource Utilization: Efficiently uses the resources of a


cluster, maximizing throughput and minimizing latency.

4. Simplified Programming Model

• Abstraction: MapReduce abstracts the complexity of distributed


computing. Programmers can focus on writing map and reduce functions
without worrying about the underlying details of parallelism, data
distribution, and fault tolerance.

• Standardization: Provides a standardized way to process large-scale


data, making it easier to develop, maintain, and port data processing
applications.

5. Applications

• Machine Learning: Many machine learning algorithms, such as those


for training models, involve matrix-vector multiplications. Using
MapReduce allows these algorithms to scale to large datasets.

• Graph Processing: Algorithms for graph analysis, such as PageRank,


often involve repeated matrix-vector multiplications.

• Data Mining: Tasks such as computing recommendations, search


indices, and other data mining operations frequently involve matrix-
vector operations.

Real-life applications - Matrix-vector multiplication using MapReduce

Matrix-vector multiplication using MapReduce has several real-life


applications, especially in fields that involve large-scale data processing and
analysis. Here are some notable examples:

1. PageRank Algorithm

Application:

The PageRank algorithm, used by Google Search to rank web pages, involves
iteratively computing the rank of each page based on the ranks of the pages
linking to it.
Details:

• The web can be represented as a directed graph where nodes are web
pages, and edges are hyperlinks.

• The transition matrix, representing the probabilities of moving from one


page to another, is multiplied by the PageRank vector in each iteration.

• MapReduce is used to handle the large-scale computation of these


matrix-vector multiplications efficiently.

2. Recommendation Systems

Application:

Recommendation systems, such as those used by Netflix, Amazon, and Spotify,


use matrix factorization techniques to predict user preferences based on past
behavior.

Details:

• The user-item interaction matrix is multiplied by user and item feature


vectors.

• For large-scale datasets, MapReduce can distribute the computation


across a cluster, making it feasible to process millions of users and items.

By using MapReduce for matrix-vector multiplication, organizations can handle


very large datasets efficiently and effectively, leveraging the power of distributed
computing to overcome the limitations of single-machine processing.
Hadoop YARN

• YARN “Yet Another Resource Negotiator” is the resource


management layer of Hadoop. The Yarn was introduced in Hadoop 2.x. YARN
supports multiple processing frameworks in addition to Map-Reduce, such as
Spark.

Hadoop 2.x provides a general purpose data processing platform which is


not just limited to the Map-Reduce.

• Spark for micro – batch and iterative processing

• Storm for stream processing

• Hadoop for batch processing,

• Yarn allows different data processing engines like graph processing,


interactive processing, stream processing as well as batch processing to run
and process data stored in HDFS (Hadoop Distributed File System).

• Apart from resource management, Yarn also does job Scheduling.

• It enables Hadoop to process other data processing system other than Map-
Reduce. It allows running several different frameworks on the same
hardware where Hadoop is deployed.

YARN vs. Hadoop 1.x

In Hadoop 1.x, the Job Tracker was responsible for both resource
management and job scheduling, which led to scalability and resource utilization
issues.

YARN addresses these issues by splitting the responsibilities into separate


components:
Architecture – YARN

Apache Yarn Framework consists of a master daemon known as “Resource


Manager”, slave daemon called “node manager” (one per slave node) and
Application Master (one per application).

1. Resource Manager (RM) - Manages resources across the cluster

It is the master daemon of Yarn. RM manages the global assignments of


resources (CPU and memory) among all the applications. Resource Manager has
two Main components

•Scheduler

•Application manager

a) Scheduler

The scheduler is responsible for allocating the resources to the running


application. The scheduler is pure scheduler it means that it performs no
monitoring no tracking for the application and even doesn’t guarantees about
restarting failed tasks either due to application failure or hardware failures.

b) Application Manager

It manages running Application Masters in the cluster, i.e., it is responsible


for starting application masters and for monitoring and restarting them on
different nodes in case of failures.

2. Node Manager (NM) - Manages resources on individual nodes

It is the slave daemon of Yarn. NM is responsible for containers monitoring


their resource usage and reporting the same to the Resource Manager. It manages
the user process on that machine. Node Manager has two Main components

• Container
• Application Master (AM)

a) Container

The smallest unit of resource allocation in YARN. Each container runs a


part of an application and is mana ged by the Node Manager. A container is a
fundamental resource management unit. It encapsulates a set of resources such as
memory, CPU, and storage that are allocated by the Resource Manager (RM) to
a particular application for executing tasks.

Need of Container in YARN:

Isolation: Containers provide an isolated environment for running


applications. This isolation ensures that different applications or tasks do not
interfere with each other, which helps maintain system stability and security.

b). Application Master (AM) - it handles job scheduling and monitoring.

One application master runs per application. It negotiates resources


from the resource manager and works with the node manager. It Manages the
application life cycle, including resource requests and monitoring progress
The AM acquires containers from the RM’s Scheduler before contacting
the corresponding NMs to start the application’s individual tasks.

YARN additional Components:

1) Resource Manager Restart

2) Yarn Resource Manager High availability - - Standby Resource


Manager

3) Yarn Web Application Proxy

4) Yarn Docker Container Executor

1)Resource Manager Restart

Restarting the Resource Manager in Hadoop YARN is a crucial operation


for maintaining and managing a YARN cluster. The Resource Manager is
responsible for allocating resources to applications and managing their execution.

What Happens During a Resource Manager Restart?

Resource Management Disruption: During the restart, the Resource


Manager is temporarily unavailable. This means that no new applications can be
submitted, and existing applications might experience temporary interruptions in
resource allocation.

Types of Resource Manager Restart

a) Non-work-preserving RM restart

A non-work-preserving restart of the Resource Manager in Hadoop YARN


means that the restart does not guarantee the preservation of the state of running
applications or jobs. This type of restart can lead to the following consequences:

Application State Loss: During a non-work-preserving restart, there is a


risk that the Resource Manager may lose information about the currently running
applications, their states, and progress. This can result in the need to re-submit or
re-schedule applications after the Resource Manager comes back online.

Non-work-preserving restarts are typically used in scenarios where the


Resource Manager needs to be restarted due to issues like configuration changes,
software upgrades, or troubleshooting, but where the state preservation of running
applications is not feasible.

b) Work-preserving RM restart

A work-preserving restart of the Resource Manager in Hadoop YARN is


designed to maintain the state and progress of applications running on the cluster
during the restart process. This approach minimizes disruptions and ensures that
applications continue to run smoothly. Here’s what it entails:

State Preservation: The Resource Manager retains the state of running


applications, including their progress and resource allocations, during the restart.
This means that applications do not need to be re-submitted or restarted from
scratch.

2)Yarn Resource Manager High availability

Failover Mechanism: Work-preserving restarts often involve mechanisms


such as high-availability (HA) setups. In a high-availability configuration, there
are typically two Resource Managers: an active one and a standby one. If the
active Resource Manager fails or needs to restart, the standby Resource Manager
takes over, ensuring continuous operation and state preservation.

Before to Hadoop v2.4, the master (RM)was the SPOF (single point of
failure). The High Availability feature adds redundancy in the form of an
Active/Standby Resource Manager pair to remove this single point of failure.
Resource Manager HA is realized through an Active/Standby architecture
- at any point of time, one of the RMs is Active, and one or more RMs are in
Standby mode waiting to take over should anything happen to the Active.

Zookeeper is a centralized service for maintaining configuration


information, naming, providing distributed synchronization, and providing group
services. In yarn with HA, Automatic failover is set up via Zookeeper.

3)Yarn Web Application Proxy

A proxy is an intermediary server or service that acts as a gateway between


a client and a destination server.

By default, it runs as a part of RM but we can configure and run in a


standalone mode. Hence, the reason of the proxy is to reduce the possibility of
the web-based attack through Yarn.

4)Yarn Docker Container Executor

Docker containers are created from Docker images, which are built using
Docker files. The Docker Engine manages and runs these containers, enabling
efficient and scalable application deployment.
A Docker container is a lightweight, standalone, and executable package
that contains everything needed to run a piece of software, including the code,
runtime, libraries, and settings. It provides a consistent and isolated environment
across different system

Docker generates light weight virtual machines. The Docker Container


Executor allows the Yarn Node Manager to launch yarn container to Docker
container. These containers provide a custom software environment in which
user’s code run, isolated from a software environment of Node Manager.

Benefits of YARN

1. Scalability: - Decouples resource management from job scheduling,


allowing for more scalable and efficient resource utilization.

2. Flexibility: - Supports different types of distributed applications


beyond Map-Reduce, such as Apache Spark, Apache Flink, and more.

3. Resource Utilization: - Better resource utilization by dynamically


allocating resources based on the needs of the applications.

4. Fault Tolerance: Improved fault tolerance by monitoring and


handling task failures more efficiently.

5. Multi-tenancy: Supports multiple users and applications, making it


suitable for shared environments like cloud services.

You might also like