UNIT II - HADOOP FRAMEWORK
Distributed File Systems - Large-Scale File System Organization – HDFS
concepts - MapReduce Execution, Algorithms using MapReduce, Matrix
Vector Multiplication – Hadoop YARN
Distributed File Systems / Large-Scale File-System Organization
A file system is a method used by an operating system to organize, store, retrieve,
and manage files on a storage device, such as a hard drive or SSD. It provides a
way to keep track of where files are located, how they are named, and who can
access them.
Types of local File Systems:
Different operating systems use different file systems. Examples
1)NTFS (New Technology File System) for Windows,
2)ext (Extended File System) for Linux,
3)APFS (Apple File System) for macOS,
4)FAT32 (File Allocation Table) for various systems.
Local file systems are simple and effective for single-machine environment,
But it has several disadvantages compared to distributed file systems:
1. Limited Scalability: Local file systems are confined to the storage
capacity and resources of a single machine. Distributed file systems can
scale out by adding more machines, allowing them to handle larger
amounts of data and more simultaneous users.
2. Single Point of Failure: A local file system is susceptible to hardware
failures, such as disk crashes, which can lead to data loss. Distributed file
systems often have built-in redundancy and fault tolerance mechanisms,
spreading data across multiple nodes to prevent data loss from a single
point of failure.
3. Accessibility: Files on a local file system are only accessible from the
machine where they are stored. Distributed file systems provide remote
access, allowing users to access files from any machine connected to the
network, which is crucial for collaboration and remote work.
4. Resource Utilization: A local file system can lead to inefficient use of
resources since the storage and processing capabilities of a single
machine can become bottlenecks. Distributed file systems leverage
multiple machines, balancing the load and improving resource utilization.
A distributed file system (DFS) is a type of file system that allows access to
files from multiple hosts sharing via a computer network.
What Are the Different Types of Distributed File Systems?
These are the most common DFS implementations:
NFS (Network File System): Developed by Sun Microsystems, NFS allows
users to access files over a network as easily as if they were on their local disks.
It is widely used in UNIX and Linux environments.
AFS (Andrew File System): Developed by Carnegie Mellon University,
AFS uses a set of trusted servers to present a homogeneous, location-transparent
file namespace to all the client workstations.
HDFS (Hadoop Distributed File System): Part of the Apache Hadoop
project, HDFS is designed for storing large data sets reliably and streaming
those data sets at high bandwidth to user applications. It is particularly suited
for big data analytics.
Google File System (GFS): Designed by Google to meet the company's
rapidly growing data processing needs, GFS is highly scalable and fault-
tolerant, optimized for handling large amounts of data across many servers.
Amazon S3 (Simple Storage Service): While not a traditional file system,
S3 provides a distributed storage system designed to make web-scale computing
easier by offering object storage through a web service interface.
Cloud Store, an open-source DFS originally developed by Kosmix.
How Does a Distributed File System Work?
A distributed file system works as follows:
• Distribution: First, a DFS distributes datasets across multiple clusters or nodes.
Each node provides its own computing power, which enables a DFS to process
the datasets in parallel.
• Replication: A DFS will also replicate datasets onto different clusters by
copying the same pieces of information into multiple clusters. This helps the
distributed file system to achieve fault tolerance—to recover the data in case of
a node or cluster failure—as well as high concurrency, which enables the same
piece of data to be processed at the same time.
Files are divided into chunks, which are typically 64 megabytes in size. Chunks
are replicated, perhaps three times, at three different compute nodes. Moreover,
the nodes holding copies of one chunk should be located on different racks, so
we don’t lose all copies due to a rack failure. Normally, both the chunk size and
the degree of replication can be decided by the user.
Advantages of DFS:
1. Data Distribution: In a DFS, data is spread across multiple servers or
locations, which can improve reliability and availability.
2. Transparency: Users can access and manage files as if they were on
their local machines, without needing to know the physical location of
the data.
3. Scalability: Distributed file systems can easily scale by adding more
servers to handle increased data loads and user requests.
4. Fault Tolerance: By replicating data across multiple servers, a DFS can
continue to function even if one or more servers fail.
5. Performance: Distributed file systems often provide better performance
for large-scale data processing because they can handle multiple requests
simultaneously.
HDFS - Hadoop Distributed File System
HDFS stands for Hadoop Distributed File System. It is a distributed file system
designed to store large amounts of data across many machines, providing high
availability and fault tolerance.
Key features of HDFS:
1. Scalability: HDFS can handle petabytes of data by distributing it across
multiple servers, making it easy to scale out by adding more machines.
2. Fault Tolerance: Data is replicated across multiple nodes (typically three
copies by default), so if one node fails, the data can still be accessed from
another node.
3. High Throughput: HDFS is optimized for high throughput rather than
low latency, making it suitable for applications that need to process large
volumes of data.
4. Streaming Data Access: HDFS is designed for high-throughput data
access and is optimized for reading large files sequentially.
5. Large File Support: HDFS is capable of handling very large files,
making it ideal for big data applications.
6. Cost-Effective Storage: By using commodity hardware, HDFS offers a
cost-effective solution for storing large datasets. Hadoop doesn’t require
expensive, highly reliable hardware to run on. It’s designed to run on
clusters of commodity hardware (commonly available hardware available
from multiple vendors).
HDFS Architecture:
The following diagram illustrates the Hadoop concepts
Components:
1)Client:
o The interface through which users or applications interact with the
HDFS.
2)Name Node
Name Node plays a Master role in Master/Slaves Architecture
It manages the metadata and regulates access to files in HDFS,
❖ Name Node is the single point of contact for accessing files in HDFS So,
whereas Data Nodes acts as slaves. File System metadata is stored on
Name Node.
❖ Name node has Metadata. It contains
1)Number of blocks based on the block size of the file system
2)Replication information
3)block location based on Rack Awareness Algorithm
Thus, Metadata is relatively small in size and fits into Main Memory of a
computer machine. So, it is stored in Main Memory of Name Node to
allow fast access.
3)Data Node
❖ Name Node plays a slave role in Master/Slaves Architecture
❖ In data node, HDFS files are stored in the form of fixed size chunks of data
which are called blocks.
❖ Data Nodes serve read and write requests of clients on HDFS files
and also perform block creation, replication and deletions.
HDFS Blocks
❖ HDFS is a block structured file system. Each HDFS file is broken into
blocks of fixed size. The default size of the block in Hadoop 1.0 is 64
MB and the Hadoop 2.0 is 128 MB which are stored across various data
nodes on the cluster. Each of these blocks is stored as a separate file on
local file system on data nodes (Commodity machines on cluster).
Hadoop Commands :
Hadoop commands are used to upload / distributing the data / downloading
the data from HDFS file system.
❖ HDFS’s fsck command is useful to get the files and blocks details of file
system.
❖ Example: The following command list the blocks that make up each file in
the file system.
$ hadoop fsck -files -blocks
Advantages of Blocks
1. Quick Seek Time:
By default, HDFS Block Size is 128 MB which is much larger than any other
file system. In HDFS, large block size is maintained to reduce the seek time for
block access.
2. Ability to Store Large Files:
Another benefit of this block structure is that, there is no need to store all blocks
of a file on the same disk or node. So, a file’s size can be larger than the size of a
disk or node.
3. How Fault Tolerance is achieved with HDFS Blocks:
HDFS blocks feature suits well with the replication for providing fault tolerance
and availability.
By default, each block is replicated to three separate machines. This feature
insures blocks against corrupted blocks or disk or machine failure. If a block
becomes unavailable, a copy can be read from another machine. And a block that
is no longer available due to corruption or machine failure can be replicated from
its alternative machines to other live machines to bring the replication factor back
to the normal level (3 by default).
Rack Awareness Algorithm
1) First replica will get stored on the local rack.
2) Second replica will get stored on the other Datanode in the same rack.
3) Third replica will get stored on the different rack.
Replication:
• The diagram shows data replication where blocks written to DataNodes
in one rack are replicated to DataNodes in another rack based on Rack
Awareness Algorithm.
• This ensures data availability and reliability in case of node or rack
failure.
Note: No two replicas of the same block are present on the same datanode.
Heartbeat:
In the Hadoop Distributed File System (HDFS), a heartbeat is a mechanism
used to monitor the status and health of the DataNodes within the HDFS cluster.
DataNodes send heartbeat signals to the NameNode at regular intervals,
typically every 3 seconds by default.
Command-Line Interface
Action Command
Create a directory named /foodir bin/hadoop dfs -mkdir /foodir
View the contents of a file named bin/hadoop dfs –cat
/foodir/[Link]
Hadoop Filesystems
The Hadoop File System (HFS) is the primary data storage system used by
Hadoop applications. It employs a NameNode and DataNode architecture to
implement a distributed file system that provides high-performance access to data
across highly scalable Hadoop clusters.
The Java abstract class [Link] represents a filesystem
in Hadoop, and there are several concrete implementations, which are described
in Table
Anatomy of File Read :
Read Process:
o The client requests to read a file.
o The client first contacts the NameNode to retrieve the metadata,
including the locations of the blocks of the file.
o The NameNode provides the block locations.
o The client then directly reads the data blocks from the respective
DataNodes.
1)Open File: The HDFS client opens the file by contacting the NameNode.
2)Get Block Locations: The NameNode returns the block locations for the
file.
3)Establish FS Data Input Stream: The client sets up an input stream to read
data.
4)Read Data from DataNodes: The client reads the first block from the
DataNode.
5)Continue Reading: The client reads remaining blocks from the respective
DataNodes.
6)Close FS Data Input Stream: The client closes the input stream after reading
all data.
Anatomy of File Write
Write Process:
o The client requests to write a file.
o The client contacts the NameNode to get block allocations.
o The NameNode provides the DataNode locations for the new
blocks.
o The client writes the data blocks to a series of DataNodes, typically
in a pipeline fashion (one after another).
o Data blocks are replicated to ensure fault tolerance, typically
across different racks for redundancy (e.g., Rack 1 to Rack 2).
Steps in the File Write Operation:
1)Create (1): - Create File:
o The client initiates a file creation request.
o This is done by calling the [Link] method.
o This request is sent to the NameNode to create a new file in the
HDFS namespace.
2. Create (2) - NameNode Response
The NameNode responds by acknowledging the file creation request
o The NameNode allocates the necessary metadata and responds to
the client.
o This includes the list of DataNodes where the data blocks will be
stored.
3. Write (3): - Write Data
o The client starts writing data to the HDFS using the
FSDataOutputStream object.
o The data is written in packets and sent to the first DataNode in the
pipeline.
4. Write Packet (4): - Datanode Pipeline
o The first DataNode receives the data packet and writes it to its
local storage.
o It then forwards the packet to the next DataNode in the pipeline.
This process continues until all DataNodes in the pipeline have received
and stored the data packet
[Link] Packet (5): - Acknowledgement Packet
o Each DataNode, after writing the packet, sends an
acknowledgment back to the previous DataNode.
o This acknowledgment propagates back to the client.
6. Close (6): - Close the Stream
1. Once all data is written, the client closes the FSDataOutputStream.
2. This action indicates that the write operation is complete.
7. Complete (7): - Completion Acknowledgment
1. The NameNode receives a notification that the file write is
complete.
2. The NameNode finalizes the file creation and updates the file
system metadata.
MapReduce
➢ MapReduce is a processing framework for handling large data sets with a
distributed algorithm on a cluster.
➢ It provides two interfaces in the form of two functions:
➢ Map and Reduce. Users can override these two functions to interact with
and manipulate the data flow of running their programs.
MapReduce Architecture
User
Program
fork
fork fork
Master
assign
assign
Map
Reduce
Worker
Worker
Worker
Worker
Input Worker
Data Output
Intermediate File
Components of MapReduce
1. Map Phase - Processes input data to produce intermediate key-value
pairs.
2. Shuffle and Sort Phase - Groups intermediate key-value pairs by key.
3. Reduce Phase - Aggregates the values for each key to produce the final
output.
Step-by-Step Process
1. Input Splitting
The input data is divided into smaller chunks. In this case, each document can
be a chunk.
2)Map Phase
The mapper reads the input files line by line and emits each word as a key with
a count of 1 as the value.
3)Shuffle and Sort Phase
The framework sorts and groups the intermediate key-value pairs generated by
the mapper so that all values associated with the same key are grouped together.
4)Reduce Phase
The reducer processes each group of intermediate key-value pairs and sums up
the values to produce the final count for each word.
Algorithm for Map-Reduce
1. Map Function:
o The Map function processes input data (e.g., lines of text from a
document).
o It outputs key-value pairs, where the key is a word and the value is
1 (indicating one occurrence of the word).
2. Shuffle and Sort:
o The intermediate key-value pairs generated by the Map function
are grouped by key.
o All values for the same key are brought together.
3. Reduce Function:
o The Reduce function processes the grouped key-value pairs.
o It aggregates the values for each key to produce the final result,
which in this case is the total count of each word.
CODE:
Mapper:
def map (document)
for word in document. Split ()
emit (word, 1)
Reducer:
def reduce (word, counts)
total = sum(counts)
emit (word, total)
MapReduce Logical Data Flow
➢ The input data to both the Map and the Reduce functions has a particular
structure.
➢ The input data to the Map function is in the form of a (key, value) pair.
➢ The output data from the Map function is structured as (key, value) pairs
called intermediate (key, value) pairs.
➢ In other words, the user-defined Map function processes each input (key,
value) pair and produces a number of (zero, one, or more) intermediate
(key, value) pairs.
➢ Here, the goal is to process all input (key, value) pairs to the Map function
in parallel (Figure 6.2).
➢ In turn, the Reduce function receives the intermediate (key, value) pairs in
the form of a group of intermediate values associated with one intermediate
key, (key, [set of values]).
➢ The Map-Reduce framework forms these groups by first sorting the
intermediate (key, value) pairs and then grouping values with the same
key. It should be noted that the data is sorted to simplify the grouping
process. The Reduce function processes each (key, [set of values]) group
and produces a set of (key, value) pairs as output.
➢ To clarify the data flow in a sample Map-Reduce application, one of the
well-known Map-Reduce problems, namely word count, to count the
number of occurrences of each word in a collection of documents.
➢ Figure 6.3 demonstrates the data flow of the word-count problem for a
simple input file containing only two lines as follows:
(1) “most people ignore most poetry” and
(2) “most poetry ignores most people.”
➢ In this case, the Map function simultaneously produces a number of
intermediate (key, value) pairs for each line of content so that each word
is the intermediate key with 1 as its intermediate value; for example,
(ignore, 1).
➢ Then the Map-Reduce library collects all the generated intermediate (key,
value) pairs and sorts them to group the 1's for identical words; for
example, (people, [1,1]).
➢ Groups are then sent to the Reduce function in parallel so that it can sum
up the 1 values for each word and generate the actual number of occurrence
for each word in the file; for example, (people, 2).
Architecture of Map-Reduce in Hadoop
➢ Similar to HDFS, the Map-Reduce engine also has a master/slave
architecture consisting of a single JobTracker as the master and a number
of TaskTrackers as the slaves (workers).
Job Tracker
• The Job Tracker is the master daemon that manages and coordinates
MapReduce jobs. It is responsible for resource management, monitoring
and job scheduling.
Role of the JobTracker
1. Resource Management: It monitors the availability of resources such as
CPU and memory across the cluster nodes.
2. Job Scheduling: The JobTracker schedules jobs by assigning map and
reduce tasks to TaskTrackers in the Hadoop cluster.
3. Monitoring: It tracks the progress of each job, reporting the status and
performance metrics to the user.
4. Fault Tolerance: The JobTracker reassigns tasks that fail during
execution, ensuring the job completes even in the face of hardware
failures.
Task Tracker
• The Task Tracker is the slave daemon that executes individual tasks as
directed by the Job Tracker.
Role of the TaskTracker
1. Task Execution: The primary role of the TaskTracker is to execute the
map and reduce tasks assigned to it by the JobTracker.
2. Resource Management: Manages the resources (CPU, memory)
allocated to each task to ensure they run efficiently.
3. Progress Reporting: Regularly sends progress updates and status reports
to the JobTracker via heartbeats.
4. Fault Tolerance: Detects and handles task failures, restarting failed tasks
as necessary.
Workflow
1. Job Submission: A client submits a MapReduce job to the Job Tracker.
2. Job Initialization: The Job Tracker splits the job into smaller tasks and
assigns them to Task Trackers.
3. Task Execution: Task Trackers execute the tasks and process the data.
4. Progress Monitoring: Task Trackers send progress reports to the Job
Tracker.
5. Task Completion: Once tasks are completed, Task Trackers notify the
Job Tracker.
6. Job Completion: The Job Tracker consolidates the results and marks the
job as complete.
Example Use Cases
• Indexing web pages in search engines.
• Analyzing log files.
• Processing large sets of images or videos.
• Machine learning algorithms that require large-scale data processing.
Benefits of Map-Reduce
• Scalability: Can process petabytes of data by distributing the work across
many nodes.
• Fault Tolerance: Automatically handles node failures by re-executing
failed tasks.
• Simplicity: Abstracts the complexity of parallel and distributed
processing.
Matrix-vector multiplication using Map Reduce:
Matrix-vector multiplication can be efficiently performed using the Map Reduce
programming model, which is particularly useful for large-scale data processing.
Here’s a general outline of how to implement matrix-vector multiplication in
Map Reduce:
Matrix Representation
Assume the matrix A is represented as a set of key-value pairs where each key is
a tuple (i, j) representing the position in the matrix, and the value is the matrix
entry Aij
The vector x is represented as a set of key-value pairs where each key is the index
j and the value is the vector entry xj.
Map Reduce – Algorithm:
1. Input Format:
• Matrix A is represented as a set of key-value pairs (i,j) → Aij
• Vector x is represented as key-value pairs j→xj
2. Mapper Phase:
o The mapper takes pairs from the matrix A and the vector x.
o For each matrix entry (i, j , Aij) and corresponding vector entry
(j,xj), the mapper emits a key-value pair (i,Aij × xj)
3. Shuffle and Sort Phase:
o The framework sorts and groups all values by key i. This means all
partial products for each row i are grouped together.
4. Reducer Phase:
o The reducer sums the partial products for each key i to produce the
final entry of the resulting vector y.
o The reducer emits key-value pairs (i,yi)
Example
Let's consider the matrix A and vector x:
Mapper Output
For matrix entry A11=1 and vector entry x1=5
• Emit (1,1 × 5) = (1,5)
For matrix entry A12=2 and vector entry x2=6
• Emit (1,2 × 6 ) = (1,12)
For matrix entry A21=3 and vector entry x1=5
• Emit (2,3×5) = (2,15)
For matrix entry A22=4and vector entry x2=6
• Emit (2,4×6) = (2,24)
Shuffle and Sort Phase
The intermediate key-value pairs after shuffle and sort:
Intermediate results = {(1,[5,12]) , (2,[15,24])}
Reducer Output
For key 1:
• Sum 5+12=17
• Emit (1,17)
For key 2:
• Sum 15+24=39
• Emit (2,39)
The final result is the vector y:
This is how matrix-vector multiplication can be done using MapReduce.
In which scenario Matrix-vector multiplication in MapReduce is useful ?
When the matrix and vector are too large to fit into the memory of a
single machine.
Here are some key reasons why you might use MapReduce for this task:
1. Scalability
• Large-scale Data: When dealing with very large matrices (e.g., millions
of rows and columns), a single machine may not have enough memory or
processing power to handle the entire dataset. MapReduce allows the data
to be distributed across many machines in a cluster.
• Distributed Processing: MapReduce leverages the distributed
computing power of multiple machines to handle large datasets more
efficiently.
2. Fault Tolerance
• Redundancy: MapReduce provides fault tolerance by distributing the
data and processing tasks across multiple nodes. If a node fails, the
system can reassign the task to another node without losing data or
computational progress.
• Automatic Recovery: The MapReduce framework automatically handles
machine failures, ensuring the job completes successfully without manual
intervention.
3. Parallel Processing
• Simultaneous Computation: By dividing the task into smaller chunks
(map tasks), multiple parts of the matrix can be processed in parallel.
This significantly speeds up the computation compared to a serial
approach on a single machine.
• Optimal Resource Utilization: Efficiently uses the resources of a
cluster, maximizing throughput and minimizing latency.
4. Simplified Programming Model
• Abstraction: MapReduce abstracts the complexity of distributed
computing. Programmers can focus on writing map and reduce functions
without worrying about the underlying details of parallelism, data
distribution, and fault tolerance.
• Standardization: Provides a standardized way to process large-scale
data, making it easier to develop, maintain, and port data processing
applications.
5. Applications
• Machine Learning: Many machine learning algorithms, such as those
for training models, involve matrix-vector multiplications. Using
MapReduce allows these algorithms to scale to large datasets.
• Graph Processing: Algorithms for graph analysis, such as PageRank,
often involve repeated matrix-vector multiplications.
• Data Mining: Tasks such as computing recommendations, search
indices, and other data mining operations frequently involve matrix-
vector operations.
Real-life applications - Matrix-vector multiplication using MapReduce
Matrix-vector multiplication using MapReduce has several real-life
applications, especially in fields that involve large-scale data processing and
analysis. Here are some notable examples:
1. PageRank Algorithm
Application:
The PageRank algorithm, used by Google Search to rank web pages, involves
iteratively computing the rank of each page based on the ranks of the pages
linking to it.
Details:
• The web can be represented as a directed graph where nodes are web
pages, and edges are hyperlinks.
• The transition matrix, representing the probabilities of moving from one
page to another, is multiplied by the PageRank vector in each iteration.
• MapReduce is used to handle the large-scale computation of these
matrix-vector multiplications efficiently.
2. Recommendation Systems
Application:
Recommendation systems, such as those used by Netflix, Amazon, and Spotify,
use matrix factorization techniques to predict user preferences based on past
behavior.
Details:
• The user-item interaction matrix is multiplied by user and item feature
vectors.
• For large-scale datasets, MapReduce can distribute the computation
across a cluster, making it feasible to process millions of users and items.
By using MapReduce for matrix-vector multiplication, organizations can handle
very large datasets efficiently and effectively, leveraging the power of distributed
computing to overcome the limitations of single-machine processing.
Hadoop YARN
• YARN “Yet Another Resource Negotiator” is the resource
management layer of Hadoop. The Yarn was introduced in Hadoop 2.x. YARN
supports multiple processing frameworks in addition to Map-Reduce, such as
Spark.
Hadoop 2.x provides a general purpose data processing platform which is
not just limited to the Map-Reduce.
• Spark for micro – batch and iterative processing
• Storm for stream processing
• Hadoop for batch processing,
• Yarn allows different data processing engines like graph processing,
interactive processing, stream processing as well as batch processing to run
and process data stored in HDFS (Hadoop Distributed File System).
• Apart from resource management, Yarn also does job Scheduling.
• It enables Hadoop to process other data processing system other than Map-
Reduce. It allows running several different frameworks on the same
hardware where Hadoop is deployed.
YARN vs. Hadoop 1.x
In Hadoop 1.x, the Job Tracker was responsible for both resource
management and job scheduling, which led to scalability and resource utilization
issues.
YARN addresses these issues by splitting the responsibilities into separate
components:
Architecture – YARN
Apache Yarn Framework consists of a master daemon known as “Resource
Manager”, slave daemon called “node manager” (one per slave node) and
Application Master (one per application).
1. Resource Manager (RM) - Manages resources across the cluster
It is the master daemon of Yarn. RM manages the global assignments of
resources (CPU and memory) among all the applications. Resource Manager has
two Main components
•Scheduler
•Application manager
a) Scheduler
The scheduler is responsible for allocating the resources to the running
application. The scheduler is pure scheduler it means that it performs no
monitoring no tracking for the application and even doesn’t guarantees about
restarting failed tasks either due to application failure or hardware failures.
b) Application Manager
It manages running Application Masters in the cluster, i.e., it is responsible
for starting application masters and for monitoring and restarting them on
different nodes in case of failures.
2. Node Manager (NM) - Manages resources on individual nodes
It is the slave daemon of Yarn. NM is responsible for containers monitoring
their resource usage and reporting the same to the Resource Manager. It manages
the user process on that machine. Node Manager has two Main components
• Container
• Application Master (AM)
a) Container
The smallest unit of resource allocation in YARN. Each container runs a
part of an application and is mana ged by the Node Manager. A container is a
fundamental resource management unit. It encapsulates a set of resources such as
memory, CPU, and storage that are allocated by the Resource Manager (RM) to
a particular application for executing tasks.
Need of Container in YARN:
Isolation: Containers provide an isolated environment for running
applications. This isolation ensures that different applications or tasks do not
interfere with each other, which helps maintain system stability and security.
b). Application Master (AM) - it handles job scheduling and monitoring.
One application master runs per application. It negotiates resources
from the resource manager and works with the node manager. It Manages the
application life cycle, including resource requests and monitoring progress
The AM acquires containers from the RM’s Scheduler before contacting
the corresponding NMs to start the application’s individual tasks.
YARN additional Components:
1) Resource Manager Restart
2) Yarn Resource Manager High availability - - Standby Resource
Manager
3) Yarn Web Application Proxy
4) Yarn Docker Container Executor
1)Resource Manager Restart
Restarting the Resource Manager in Hadoop YARN is a crucial operation
for maintaining and managing a YARN cluster. The Resource Manager is
responsible for allocating resources to applications and managing their execution.
What Happens During a Resource Manager Restart?
Resource Management Disruption: During the restart, the Resource
Manager is temporarily unavailable. This means that no new applications can be
submitted, and existing applications might experience temporary interruptions in
resource allocation.
Types of Resource Manager Restart
a) Non-work-preserving RM restart
A non-work-preserving restart of the Resource Manager in Hadoop YARN
means that the restart does not guarantee the preservation of the state of running
applications or jobs. This type of restart can lead to the following consequences:
Application State Loss: During a non-work-preserving restart, there is a
risk that the Resource Manager may lose information about the currently running
applications, their states, and progress. This can result in the need to re-submit or
re-schedule applications after the Resource Manager comes back online.
Non-work-preserving restarts are typically used in scenarios where the
Resource Manager needs to be restarted due to issues like configuration changes,
software upgrades, or troubleshooting, but where the state preservation of running
applications is not feasible.
b) Work-preserving RM restart
A work-preserving restart of the Resource Manager in Hadoop YARN is
designed to maintain the state and progress of applications running on the cluster
during the restart process. This approach minimizes disruptions and ensures that
applications continue to run smoothly. Here’s what it entails:
State Preservation: The Resource Manager retains the state of running
applications, including their progress and resource allocations, during the restart.
This means that applications do not need to be re-submitted or restarted from
scratch.
2)Yarn Resource Manager High availability
Failover Mechanism: Work-preserving restarts often involve mechanisms
such as high-availability (HA) setups. In a high-availability configuration, there
are typically two Resource Managers: an active one and a standby one. If the
active Resource Manager fails or needs to restart, the standby Resource Manager
takes over, ensuring continuous operation and state preservation.
Before to Hadoop v2.4, the master (RM)was the SPOF (single point of
failure). The High Availability feature adds redundancy in the form of an
Active/Standby Resource Manager pair to remove this single point of failure.
Resource Manager HA is realized through an Active/Standby architecture
- at any point of time, one of the RMs is Active, and one or more RMs are in
Standby mode waiting to take over should anything happen to the Active.
Zookeeper is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and providing group
services. In yarn with HA, Automatic failover is set up via Zookeeper.
3)Yarn Web Application Proxy
A proxy is an intermediary server or service that acts as a gateway between
a client and a destination server.
By default, it runs as a part of RM but we can configure and run in a
standalone mode. Hence, the reason of the proxy is to reduce the possibility of
the web-based attack through Yarn.
4)Yarn Docker Container Executor
Docker containers are created from Docker images, which are built using
Docker files. The Docker Engine manages and runs these containers, enabling
efficient and scalable application deployment.
A Docker container is a lightweight, standalone, and executable package
that contains everything needed to run a piece of software, including the code,
runtime, libraries, and settings. It provides a consistent and isolated environment
across different system
Docker generates light weight virtual machines. The Docker Container
Executor allows the Yarn Node Manager to launch yarn container to Docker
container. These containers provide a custom software environment in which
user’s code run, isolated from a software environment of Node Manager.
Benefits of YARN
1. Scalability: - Decouples resource management from job scheduling,
allowing for more scalable and efficient resource utilization.
2. Flexibility: - Supports different types of distributed applications
beyond Map-Reduce, such as Apache Spark, Apache Flink, and more.
3. Resource Utilization: - Better resource utilization by dynamically
allocating resources based on the needs of the applications.
4. Fault Tolerance: Improved fault tolerance by monitoring and
handling task failures more efficiently.
5. Multi-tenancy: Supports multiple users and applications, making it
suitable for shared environments like cloud services.