0% found this document useful (0 votes)

60 views23 pages

Anatomy of MapReduce Job Execution

The document provides a comprehensive overview of the MapReduce job run process, including job submission, initialization, task assignment, execution phases, and job completion. It also discusses failure types and handling mechanisms, job scheduling strategies, and the importance of shuffle and sort phases in data processing. Additionally, it outlines task execution components and lifecycle in the Hadoop environment.

Uploaded by

Ragul S

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

60 views23 pages

Anatomy of MapReduce Job Execution

Uploaded by

Ragul S

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

UNIT 3

Anatomy of a MapReduce Job Run

MapReduce is a programming model developed by Google for processing large data sets
with a distributed algorithm on a cluster. In Hadoop, the MapReduce engine breaks down a
job into smaller tasks and runs them in a fault-tolerant and parallel manner. Understanding
the anatomy of a MapReduce job helps in optimizing its performance and debugging.

1. Job Submission

● The process begins when a user submits a MapReduce job via the JobClient.

● The JobClient reads the job configuration, including input/output paths,

mapper/reducer classes, and the number of reducers.

● It submits the job to the JobTracker (in Hadoop 1.x) or ResourceManager (in
Hadoop 2.x YARN).

2. Job Initialization

● The JobTracker/ResourceManager receives the job and initializes it.

● The input data is split into logical InputSplits.

● Each InputSplit corresponds to a Map Task.

● The metadata of the job is stored and JobTracker/ResourceManager assigns it a Job

ID.

3. Task Assignment

● TaskTrackers (Hadoop 1.x) or NodeManagers (Hadoop 2.x) send heartbeat

signals indicating availability.

● The JobTracker/ResourceManager assigns tasks (map/reduce) to these nodes

based on data locality (placing computation where the data resides).

● Each task is encapsulated in a TaskAttempt for fault tolerance.

UNIT 3

4. Map Task Execution

● The Mapper processes one InputSplit at a time.

● Data is read using RecordReader, converting raw input into key-value pairs.

● The map() function is applied to each key-value pair, generating intermediate

key-value pairs.

● The output is written to local disk, sorted, and partitioned (based on the number of
reducers).

● A combiner may be used here for local aggregation to reduce data transfer.

5. Shuffle and Sort Phase

● The intermediate data is shuffled, i.e., it is sent across the network to reducers
based on key partitioning.

● The data is sorted by key before it reaches the reducer.

● This is the most data-intensive phase of the MapReduce job.

6. Reduce Task Execution

● Reducers fetch their respective intermediate data from the mappers.

● The reduce() function processes each key and its list of values to produce the final
output.

● The output is written to HDFS.

7. Output Commit

● Once all reduce tasks finish successfully, the output is committed to the final output
directory in HDFS.

● Temporary files are cleaned up.

8. Job Completion

● The JobTracker/ResourceManager marks the job as SUCCESS or FAILED.

● The JobClient is notified, and the user can view job logs and counters.
UNIT 3

Failures in MapReduce
MapReduce is designed to run on commodity hardware, where failures are common. To
ensure reliability and fault tolerance, Hadoop's MapReduce framework includes mechanisms
to detect, handle, and recover from failures automatically. Understanding these failures is
essential for efficient debugging and performance optimization.

Types of Failures in MapReduce

MapReduce jobs can encounter failures at different stages and components during
execution. The major types of failures are:

1. Task Failure

a. Mapper Failure

● A map task may fail due to:

○ Data corruption in HDFS.

○ Logic errors or exceptions in user-defined map() function.

○ Hardware failure or memory error on the node.

● Handling:

○ The JobTracker/ResourceManager detects failure through heartbeat loss.

○ The failed task is rescheduled on another node, possibly with a replica of the
data block.

b. Reducer Failure

● A reduce task may fail due to:

○ Errors in the reduce() logic.

○ Memory overflow or disk space issues.

○ Network failures during shuffle.

UNIT 3

● Handling:

○ The reducer is retried on a different node using the same intermediate data
from map outputs.

2. Node Failure

● A TaskTracker (Hadoop 1.x) or NodeManager (Hadoop 2.x) may go down due to:

○ Power outage, hardware crash, or network disconnect.

● Impact:

○ All tasks running on the node are considered failed.

○ The JobTracker/ResourceManager reschedules the tasks on other healthy

nodes.

● Detection:

○ If the node doesn’t respond for a predefined timeout (e.g., 10 minutes), it's
marked as dead.

3. JobTracker / ResourceManager Failure

● If the JobTracker (in Hadoop 1.x) fails:

○ Entire job needs to be resubmitted since it's a single point of failure.

● In Hadoop 2.x (YARN):

○ ResourceManager is the central authority, and its failure affects the whole
cluster.

○ However, YARN supports ResourceManager High Availability (HA), where

a standby can take over.

4. Disk Failure

● Occurs when:

○ Local disk on a node crashes.

○ Temporary data, map outputs, or HDFS blocks are lost.

UNIT 3

● Handling:

○ HDFS replicates each block (default is 3 replicas), so another replica is used.

○ Failed tasks are re-executed on different nodes.

5. Network Failures

● Issues during:

○ Shuffle phase (mapper to reducer data transfer).

○ Communication between nodes.

● Handling:

○ Retries are attempted.

○ If persistent, tasks are rescheduled.

6. Data Corruption

● Happens due to:

○ Damaged input files.

○ Bit errors during data transfer or storage.

● Detection:

○ HDFS uses checksums to detect corruption.

● Handling:

○ Corrupt blocks are discarded and fetched from healthy replicas.

7. Speculative Execution Issues

● Sometimes slow tasks (stragglers) are executed redundantly on other nodes.

● May cause resource contention or inconsistent output if not managed properly.

UNIT 3

Job Scheduling in MapReduce

In Hadoop MapReduce, job scheduling is a crucial process that determines how and when
the submitted jobs (and their respective tasks) are executed in a distributed environment.
The goal of job scheduling is to maximize resource utilization, ensure fairness, and
optimize job performance.

1. Introduction to Job Scheduling

● Hadoop clusters are multi-user environments where many jobs may be submitted
concurrently.

● Each job consists of multiple map and reduce tasks, and these tasks must be
scheduled across the available nodes.

● Job scheduling decides:

○ The order of job execution.

○ The resources allocated to each job.

○ How to handle priority, fairness, and deadlines.

2. Default Job Scheduler: FIFO (First-In-First-Out)

● FIFO Scheduler is the simplest and default scheduler in Hadoop 1.x.

● Jobs are executed in the order they are submitted.

● Limitations:

○ No fairness in multi-user environments.

○ Long jobs block short jobs, leading to poor utilization.

3. Fair Scheduler

● Developed by Facebook for better resource sharing.

● Aims to give each user or job an equal share of the cluster over time.

● Features:

○ Jobs are organized into pools.

UNIT 3

○ Each pool is assigned minimum and maximum resource limits.

○ Jobs get resources such that all pools get a fair share.

● Benefits:

○ Short jobs don't get blocked.

○ Ensures better cluster utilization and job turnaround time.

4. Capacity Scheduler

● Developed by Yahoo for large shared clusters.

● Designed to allow multiple organizations or departments to share a Hadoop

cluster.

● Cluster is divided into queues with configurable capacity.

● Each queue gets a guaranteed minimum of resources.

● Resources not used by a queue can be temporarily borrowed by others.

● Key features:

○ Multi-tenancy support.

○ Security and access control per queue.

5. YARN and Scheduler Evolution

In Hadoop 2.x and later (YARN - Yet Another Resource Negotiator):

● The ResourceManager handles resource allocation.

● Schedulers in YARN:

○ FIFO Scheduler

○ Capacity Scheduler

○ Fair Scheduler

● Applications are run inside ApplicationMasters, which negotiate for resources with
the ResourceManager.
UNIT 3

6. Scheduler Comparison
Feature FIFO Scheduler Fair Scheduler Capacity
Scheduler

Job Ordering Submission time Resource fairness Queue capacity

Multi-user Support Poor Good Excellent

Resource No Optional Yes

Guarantees

Best For Small clusters Shared multi-user Large enterprise use

use

7. Speculative Execution & Scheduling

● Hadoop supports speculative execution for slow tasks (stragglers).

● Scheduler may run duplicate tasks on other nodes.

● The task which finishes first is accepted; others are killed.

● Improves overall job performance but can increase resource usage.

8. Factors Affecting Scheduling

● Data locality: Prefer nodes where data resides.

● Resource availability: CPU, memory, disk, network.

● Job priorities: High-priority jobs get more resources.

● User quotas: Limits for multi-user environments.

9. Use Case Examples

● FIFO: Suitable for single-user, single-job clusters.

● Fair Scheduler: Ideal for development/testing environments.

● Capacity Scheduler: Best for large enterprises with multiple teams.

UNIT 3

Shuffle and Sort in MapReduce

In Hadoop MapReduce, Shuffle and Sort are critical intermediate phases that occur
between the Map and Reduce phases. These stages are automatically managed by the
framework and are essential for correctly grouping intermediate key-value pairs so they can
be processed efficiently by reducers.

1. Introduction

● The Map phase outputs intermediate key-value pairs.

● The Reduce phase processes these pairs grouped by key.

● But before reducing, the system must:

○ Shuffle: Transfer and group intermediate data from mappers to reducers.

○ Sort: Sort intermediate keys so reducers receive data in a sorted order.

● Together, these ensure correctness and performance in distributed data processing.

2. Phases in Shuffle and Sort

A. Map Side Processing

i) Spill Phase

● When the mapper buffer exceeds a threshold (default: 80% of allocated memory), a
spill to disk is triggered.

● During spill:

○ Output is sorted by key.

○ Optional combiner function is applied (if defined).

○ Output is written to a temporary spill file.

ii) Merge Phase

● If multiple spill files are created, they are merged into a single sorted output file.

● The final output of the mapper is a sorted list of key-value pairs, ready for shuffling.
UNIT 3

B. Shuffle Phase (Between Map and Reduce)

● Shuffle is the process of transferring intermediate data from the map tasks to the
reduce tasks.

● Each reducer:

○ Requests data related to its assigned key range from all mappers.

○ Data is fetched over the network.

● Hadoop ensures that data locality and efficiency are maintained during transfer.

C. Sort Phase (Reduce Side)

● Once a reducer receives all its data:

○ It performs an external merge sort on the keys.

○ Ensures that all values associated with the same key are grouped together.

● This is necessary because the reduce() function is called once per unique key.

3. Role of Partitioner

● The Partitioner determines which reducer will process which keys.

● Default: HashPartitioner.

● Ensures that:

○ Keys with the same hash go to the same reducer.

○ Data is balanced across reducers.

4. Importance of Shuffle and Sort

Feature Purpose

Shuffle Moves and groups data from mappers to appropriate reducers

Sort Orders keys before reduction

UNIT 3

Grouping Ensures correct input for reduce()

Efficiency Helps in optimization via combiner and compression

5. Performance Considerations

● Shuffle is network-intensive → Often the most expensive part of MapReduce.

● Techniques to optimize:

○ Combiner to reduce data volume.

○ Compression of map outputs.

○ Tuning buffer sizes.

○ Using efficient partitioning.

6. Diagram: Shuffle and Sort

+---------+ +-----------+ +----------+

| Mapper1 |-----> | | | |
+---------+ | | | |
| |----> | Reducer1 |
+---------+ | Shuffle | | |
| Mapper2 |-----> | & |----> | Reducer2 |
+---------+ | Sort | | |
| | +----------+
+---------+ | |
| Mapper3 |-----> | |
+---------+ +-----------+

7. Real-World Example

Imagine a word count program:

● Map phase: emits (word, 1) for each word.

● Shuffle phase: sends all (word, 1) pairs to the same reducer.

● Sort phase: groups and sorts these pairs by word.

● Reduce phase: sums the counts for each word.

UNIT 3

Task Execution in MapReduce

MapReduce is a programming model for processing large data sets with a parallel,
distributed algorithm on a cluster. One of the core components of this model is task
execution, which defines how a job's tasks are created, managed, scheduled, and
executed on Hadoop’s distributed environment.

1. Introduction

In the MapReduce framework, a job is split into tasks. There are two types of tasks:

● Map tasks: Process input splits and produce intermediate key-value pairs.

● Reduce tasks: Process intermediate data to produce final output.

Task execution involves:

● Job submission

● Task division

● Scheduling

● Execution

● Monitoring

● Retrying (if needed)

2. Components Involved in Task Execution

Component Role

JobTracker (Hadoop 1) Coordinates job execution, task scheduling, and monitoring

TaskTracker Executes individual map/reduce tasks

ApplicationMaster (YARN-based Hadoop 2+) Manages application execution per job

NodeManager Launches containers to run tasks on individual nodes

UNIT 3

3. Task Execution Lifecycle

A. Job Submission

● The client submits a job to the ResourceManager (YARN) or JobTracker (Hadoop

1).

● The system validates the input and output paths.

B. Input Splits

● Input data is split into InputSplits, each assigned to a Map Task.

● Example: A 1 GB file might be split into 64 MB chunks → 16 map tasks.

C. Task Assignment

● ApplicationMaster assigns map and reduce tasks to available nodes.

● Each node executes its task in an isolated container (YARN).

D. Map Task Execution

● Each map task:

○ Reads data from HDFS.

○ Processes data using user-defined map function.

○ Buffers output in memory.

○ Periodically writes intermediate output to local disk (spill).

○ Optionally applies a Combiner.

E. Shuffle and Sort

● After map tasks complete, intermediate data is shuffled to reducers.

● Data is sorted by key and grouped before feeding to reduce tasks.

F. Reduce Task Execution

UNIT 3

● Each reduce task:

○ Fetches relevant intermediate data from map task outputs.

○ Sorts and merges the data.

○ Applies the reduce function.

○ Writes final output to HDFS.

4. Types of Tasks in MapReduce

Task Type Description

Map Task Processes input split and produces intermediate key-value pairs

Reduce Task Aggregates intermediate data and produces final output

Setup Task Initializes resources before map/reduce tasks run

Cleanup Task Cleans up after successful/failed map/reduce execution

5. Task Execution Control

Task Monitoring

● Tasks report progress to the ApplicationMaster.

● Heartbeat signals are sent to monitor liveness.

Failure Handling

● Failed tasks are retried (default: 4 times).

● Speculative execution may launch duplicate tasks for slow-running instances.

6. Speculative Execution

● Sometimes, a task runs slower than others → "straggler".

● Hadoop may launch a duplicate task on another node.

● The first to finish is accepted; the other is killed.

UNIT 3

● Helps to speed up overall job completion.

7. Diagram: Task Execution in MapReduce

+-----------------+
| Job Client |
+-----------------+
|
v
+-----------------+ +---------------------+
| ResourceManager | <----> | ApplicationMaster |
+-----------------+ +---------------------+
| |
v v
+------------------+ +---------------------+
| Map Task (Node 1)| | Reduce Task (Node 2)|
+------------------+ +---------------------+

8. Optimization Techniques in Task Execution

● Use of Combiner: Reduces volume of intermediate data.

● Data Locality: Assign map tasks closer to where data resides.

● Proper partitioning: Avoids data skew in reducers.

● Speculative execution: Handles stragglers efficiently.

9. Example: Word Count

In a word count job:

● Map Task: Emits (word, 1) for each word.

● Shuffle: Groups all occurrences of the same word.

● Reduce Task: Sums the values → total count per word.

UNIT 3

MapReduce Types and Formats

1. Introduction

MapReduce is a programming paradigm used to process large-scale data in parallel across

a distributed cluster. The effectiveness of MapReduce heavily depends on the data types
and input/output formats used in the processing pipeline. Understanding these is critical
for writing optimized and flexible MapReduce applications.

2. MapReduce Data Types

In Hadoop, data flows through key-value pairs. Each phase of the job (Map, Shuffle,
Reduce) involves processing these pairs.

Java Type Hadoop Writable Equivalent

int IntWritable

long LongWritable

float FloatWritable

double DoubleWritable

boolean BooleanWritable

String Text

Array/Map/Set ArrayWritable, MapWritable, SetWritable

Example:

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {.....}

3. Custom Writable Types

When built-in Writable types are insufficient, Hadoop allows custom Writable types by
implementing:

public class MyWritable implements Writable {

private IntWritable id;

UNIT 3

private Text name;

public void write(DataOutput out) {...}

public void readFields(DataInput in) {...}

Custom types must implement:

● Writable or WritableComparable

● readFields() and write() for serialization

4. Input Formats in MapReduce

InputFormat defines how input files are split and read by MapReduce.

Role of InputFormat:

● Splits input data into logical InputSplits

● Provides RecordReader to read key-value pairs from splits

Common InputFormats:

InputFormat Description

TextInputFormat Default format. Treats each line as a value, key is line offset

KeyValueTextInputFormat Splits line into key/value based on separator (like tab)

SequenceFileInputFormat Reads Hadoop-specific binary SequenceFiles

NLineInputFormat Assigns N lines per split

CombineFileInputFormat Combines small files into a larger split

DBInputFormat Reads data from a database

XMLInputFormat Custom input format for XML files

Example:

[Link]([Link]);
UNIT 3

5. Output Formats in MapReduce

OutputFormat defines how the results of MapReduce jobs are written to storage.

Common OutputFormats:

OutputFormat Description

TextOutputFormat Default output, writes key-value as strings

SequenceFileOutputFormat Writes binary output

MultipleOutputs Allows writing to multiple outputs

NullOutputFormat Discards output (used for benchmarking)

DBOutputFormat Writes to relational database

MapFileOutputFormat Writes output in MapFile format (sorted SequenceFile)

Example:

[Link]([Link]);

6. Input/Output Key-Value Format Examples

TextInputFormat:

Input:
0 Hadoop is fun
17 MapReduce is powerful

Mapper Input:
Key = 0, Value = "Hadoop is fun"
Key = 17, Value = "MapReduce is powerful"

TextOutputFormat:

Output:
Hadoop 1
is 2
fun 1
MapReduce 1
powerful 1
UNIT 3

7. Use of FileInputFormat and FileOutputFormat

Hadoop provides base classes:

● FileInputFormat – handles file splits and data reading.

● FileOutputFormat – handles writing results to HDFS.

These can be extended for customized processing like handling CSV, JSON, XML formats.

8. Custom Input/Output Formats

For non-standard data sources/formats (e.g., XML, images), we can define:

● Custom InputFormat class

● Custom RecordReader

● Custom OutputFormat class

This allows advanced use cases like:

● Reading zipped data

● Writing to multiple files

● Exporting structured formats like JSON or Avro

9. Diagram: Input/Output Format Flow

[HDFS Input File]

MapReduce Features:
1. Introduction

MapReduce is a powerful data processing model introduced by Google and later adopted by
Apache Hadoop to handle large-scale data processing in a distributed and parallel
manner. It simplifies the task of processing terabytes and petabytes of data by splitting it
into manageable pieces and executing them across a cluster.

2. Key Features of MapReduce

A. Simplicity and Abstraction

● MapReduce abstracts the complexity of parallel and distributed computing.

● Developers only need to implement two functions:

○ map() – for processing input data.

○ reduce() – for aggregating intermediate output.

● Example: Word count program with a few lines of code.

B. Scalability

● Can scale to thousands of commodity machines.

● Efficiently handles massive data (multi-terabyte or petabyte scale).

● Data is automatically partitioned and distributed across nodes.

C. Fault Tolerance

● Detects and handles failures automatically.

● Failed tasks are re-executed on other available nodes.

● Intermediate results are written to local disk, ensuring job recovery.

D. Data Locality Optimization

● Tasks are scheduled close to the data (same node or rack).

UNIT 3

● Reduces network congestion and improves performance.

● Brings computation to data instead of moving data.

E. Parallel Processing

● Map tasks and reduce tasks are executed in parallel across nodes.

● Takes advantage of multi-core and multi-node environments.

● Greatly reduces overall job execution time.

F. High Throughput

● Designed to process vast amounts of data in parallel.

● High throughput achieved due to pipelined data flow:

○ Input → Map → Shuffle → Reduce → Output

G. Automatic Task Scheduling

● JobTracker (Hadoop 1) or ResourceManager (YARN) handles:

○ Task assignment

○ Load balancing

○ Monitoring

● Developer doesn’t need to manage resource allocation.

H. Flexible Data Input and Output

● Supports multiple input and output formats:

○ Text, SequenceFile, KeyValue, DBInput, XML, JSON, etc.

● Custom InputFormat and OutputFormat can be defined as needed.

UNIT 3

I. Language Independence

● Though implemented in Java, MapReduce can be used with:

○ Python, C++, Ruby, and more via Hadoop Streaming API

● Makes it adaptable for a wide range of developers.

J. Speculative Execution

● Handles slow-running tasks (stragglers) by launching backup tasks.

● Improves job completion time by finishing with the faster task.

K. Combiner Support

● A mini-reducer that runs after the mapper to reduce data transfer.

● Used to optimize bandwidth during the shuffle phase.

L. Counters and Logs

● Built-in support for custom counters to track progress and metrics.

● Logs and status updates aid in debugging and monitoring jobs.

3. Real-world Use Cases of MapReduce

Domain Use Case

Search Engines Index building, log analysis

E-commerce Product recommendations, transaction analysis

Bioinformatics DNA sequence analysis

Social Media Hashtag frequency, trend detection

UNIT 3

4. Advantages of MapReduce

● Handles big data efficiently

● Reduces development effort

● Enhances resource utilization

● Enables distributed fault-tolerant processing

5. Limitations (for completeness)

Limitation Description

High latency Not ideal for real-time processing

Complex debugging Distributed environment can be hard to debug

Disk I/O heavy Writes intermediate data to disk

Limited iterative support Not optimized for iterative ML tasks

MapReduce Job Architecture Overview
100% (1)
MapReduce Job Architecture Overview
46 pages
Hadoop I/O: Data Integrity & Compression
No ratings yet
Hadoop I/O: Data Integrity & Compression
27 pages
HDFS Design and Concepts Overview
No ratings yet
HDFS Design and Concepts Overview
16 pages
Developing a MapReduce Application Guide
No ratings yet
Developing a MapReduce Application Guide
52 pages
Hadoop I/O: Compression & Serialization
No ratings yet
Hadoop I/O: Compression & Serialization
20 pages
Overview of Apache Spark Architecture
No ratings yet
Overview of Apache Spark Architecture
44 pages
HDFS Read and Write Operations Explained
No ratings yet
HDFS Read and Write Operations Explained
7 pages
Hadoop and Cloud Computing Overview
100% (1)
Hadoop and Cloud Computing Overview
33 pages
Data Analysis Techniques with Hadoop
100% (1)
Data Analysis Techniques with Hadoop
2 pages
Efficient Data Archiving in Hadoop
No ratings yet
Efficient Data Archiving in Hadoop
32 pages
Job Scheduling Mechanisms in MapReduce
100% (1)
Job Scheduling Mechanisms in MapReduce
3 pages
Understanding YARN in Hadoop 2
No ratings yet
Understanding YARN in Hadoop 2
16 pages
Introduction to Stream Processing Concepts
No ratings yet
Introduction to Stream Processing Concepts
19 pages
Overview of Hadoop Distributed File System
No ratings yet
Overview of Hadoop Distributed File System
5 pages
Understanding Digital Data Types and Big Data
No ratings yet
Understanding Digital Data Types and Big Data
136 pages
Unit Testing MapReduce with MRUnit
No ratings yet
Unit Testing MapReduce with MRUnit
31 pages
Bda - Unit 3
No ratings yet
Bda - Unit 3
29 pages
Hadoop Distributed File System Overview
100% (1)
Hadoop Distributed File System Overview
148 pages
Understanding MapReduce Execution Pipeline
No ratings yet
Understanding MapReduce Execution Pipeline
27 pages
HDFS and Big Data Analytics Overview
No ratings yet
HDFS and Big Data Analytics Overview
18 pages
Big Data Applications with Pig and Hive
No ratings yet
Big Data Applications with Pig and Hive
10 pages
Hadoop Ecosystem Overview and Spark Insights
No ratings yet
Hadoop Ecosystem Overview and Spark Insights
96 pages
Data Analysis Using Hadoop Framework
No ratings yet
Data Analysis Using Hadoop Framework
4 pages
HDFS Interface Features and Operations
0% (1)
HDFS Interface Features and Operations
21 pages
Node.js and Express Framework Overview
No ratings yet
Node.js and Express Framework Overview
127 pages
Cloud Computing Unit 4 Notes
No ratings yet
Cloud Computing Unit 4 Notes
41 pages
MapReduce Job Execution Overview
No ratings yet
MapReduce Job Execution Overview
24 pages
Introduction to Hadoop Framework
No ratings yet
Introduction to Hadoop Framework
5 pages
Big Data Processing with Pig and Hive
No ratings yet
Big Data Processing with Pig and Hive
29 pages
Assessing MapReduce Output Quality
No ratings yet
Assessing MapReduce Output Quality
41 pages
Counting Distinct Elements in Data Streams
No ratings yet
Counting Distinct Elements in Data Streams
19 pages
Basics of Hadoop in Big Data Analytics
No ratings yet
Basics of Hadoop in Big Data Analytics
22 pages
Essential HDFS Commands Guide
No ratings yet
Essential HDFS Commands Guide
10 pages
Installing Hadoop on Ubuntu
No ratings yet
Installing Hadoop on Ubuntu
16 pages
Understanding Hive as a NoSQL Database
No ratings yet
Understanding Hive as a NoSQL Database
9 pages
Understanding MapReduce Counters in Hadoop
100% (1)
Understanding MapReduce Counters in Hadoop
31 pages
Overview of Hadoop Ecosystem Components
No ratings yet
Overview of Hadoop Ecosystem Components
38 pages
Hadoop Pipes for Big Data Analysis
No ratings yet
Hadoop Pipes for Big Data Analysis
15 pages
Understanding CORBA Middleware Architecture
No ratings yet
Understanding CORBA Middleware Architecture
38 pages
Big Data Analytics: Pig and Hive Overview
No ratings yet
Big Data Analytics: Pig and Hive Overview
19 pages
HDFS Architecture and MapReduce Overview
No ratings yet
HDFS Architecture and MapReduce Overview
33 pages
Hadoop Overview for Big Data Students
No ratings yet
Hadoop Overview for Big Data Students
30 pages
HDFS Data Blocks and High Availability
No ratings yet
HDFS Data Blocks and High Availability
10 pages
Big Data Processing with Pig and Hive
No ratings yet
Big Data Processing with Pig and Hive
17 pages
Cloud Enabling Technologies Overview
No ratings yet
Cloud Enabling Technologies Overview
53 pages
Hadoop and Big Data Exam Papers
No ratings yet
Hadoop and Big Data Exam Papers
4 pages
HBase Data Model and Implementation Overview
No ratings yet
HBase Data Model and Implementation Overview
34 pages
Big Data Processing with Pig and Hive
No ratings yet
Big Data Processing with Pig and Hive
18 pages
Hive Overview and Key Features
No ratings yet
Hive Overview and Key Features
57 pages
Overview of Apache YARN Architecture
No ratings yet
Overview of Apache YARN Architecture
3 pages
Unit 4: Transaction Management Notes
No ratings yet
Unit 4: Transaction Management Notes
20 pages
Anatomy of MapReduce Job Execution
No ratings yet
Anatomy of MapReduce Job Execution
28 pages
MapReduce: Types, Features, and Formats
No ratings yet
MapReduce: Types, Features, and Formats
26 pages
Understanding Big Data Characteristics
No ratings yet
Understanding Big Data Characteristics
18 pages
NoSQL Databases: Concepts and Cassandra
No ratings yet
NoSQL Databases: Concepts and Cassandra
31 pages
Programming Models in Cloud Computing
No ratings yet
Programming Models in Cloud Computing
88 pages
Data Structures in Java: Linked Lists
No ratings yet
Data Structures in Java: Linked Lists
23 pages
Overview of Cloud Platforms and Services
No ratings yet
Overview of Cloud Platforms and Services
21 pages
Understanding MapReduce Framework Basics
No ratings yet
Understanding MapReduce Framework Basics
14 pages
Understanding MapReduce Job Execution
No ratings yet
Understanding MapReduce Job Execution
17 pages
List of Primary Schools in Maharashtra
No ratings yet
List of Primary Schools in Maharashtra
48 pages
Impact of Monetary Policy Uncertainty
No ratings yet
Impact of Monetary Policy Uncertainty
16 pages
A Lover of The Cross-Alfred Zahir PDF
100% (1)
A Lover of The Cross-Alfred Zahir PDF
105 pages
Overview of Distributed Computing Models
No ratings yet
Overview of Distributed Computing Models
5 pages
Engineering Chemistry Syllabus Overview
No ratings yet
Engineering Chemistry Syllabus Overview
56 pages
MultiFlow AC/DC Hybrid Pump Manual
No ratings yet
MultiFlow AC/DC Hybrid Pump Manual
15 pages
Understanding Articles: "The" and "A/An"
No ratings yet
Understanding Articles: "The" and "A/An"
4 pages
DP TV Aver 15022 Drivers
No ratings yet
DP TV Aver 15022 Drivers
94 pages
Grade 10 English Daily Lesson Log
No ratings yet
Grade 10 English Daily Lesson Log
5 pages
B2B Contract Terms for Work Services
No ratings yet
B2B Contract Terms for Work Services
3 pages
Appeal Against DMCI Property Developer
No ratings yet
Appeal Against DMCI Property Developer
5 pages
Teaching The Kite Runner: Lesson Plan
No ratings yet
Teaching The Kite Runner: Lesson Plan
10 pages
HSR 2021 Analysis for Eastern Province
No ratings yet
HSR 2021 Analysis for Eastern Province
202 pages
Importance of Business Correspondence
No ratings yet
Importance of Business Correspondence
15 pages
MGVCL AT&C Loss Reduction Strategies
No ratings yet
MGVCL AT&C Loss Reduction Strategies
46 pages
General Well-Being Assessment Tool
No ratings yet
General Well-Being Assessment Tool
5 pages
ADG 200 Special Section 11:17 PDF
No ratings yet
ADG 200 Special Section 11:17 PDF
213 pages
Moringa: Names and Benefits Overview
No ratings yet
Moringa: Names and Benefits Overview
2 pages
Fresh Graduate in Biochemistry & Research
No ratings yet
Fresh Graduate in Biochemistry & Research
2 pages
Vertex DTRman
No ratings yet
Vertex DTRman
124 pages
Snap Fitness Personal Trainer Profiles
No ratings yet
Snap Fitness Personal Trainer Profiles
7 pages
Vitrified Tile Flooring Guide
No ratings yet
Vitrified Tile Flooring Guide
10 pages
Fit Lab Procedures and Timetable Guide
No ratings yet
Fit Lab Procedures and Timetable Guide
38 pages
HubSpot Solutions Partner Program 2024 Guide
No ratings yet
HubSpot Solutions Partner Program 2024 Guide
29 pages
Questionnaire For Low Back Pain in The Garment Industry Workers
100% (1)
Questionnaire For Low Back Pain in The Garment Industry Workers
11 pages
Advances in Desalination Technologies
50% (2)
Advances in Desalination Technologies
31 pages
Female Reproductive System Exam Guide
No ratings yet
Female Reproductive System Exam Guide
3 pages
Urban Gardening: Environmental Benefits
No ratings yet
Urban Gardening: Environmental Benefits
3 pages
Niharika Sonowal Semester Grade Report
No ratings yet
Niharika Sonowal Semester Grade Report
2 pages

Anatomy of MapReduce Job Execution

Uploaded by

Anatomy of MapReduce Job Execution

Uploaded by

UNIT 3

Anatomy of a MapReduce Job Run

●​ The JobClient reads the job configuration, including input/output paths,

●​ The JobTracker/ResourceManager receives the job and initializes it.​

●​ The input data is split into logical InputSplits.​

●​ Each InputSplit corresponds to a Map Task.​

●​ The metadata of the job is stored and JobTracker/ResourceManager assigns it a Job

●​ TaskTrackers (Hadoop 1.x) or NodeManagers (Hadoop 2.x) send heartbeat

●​ The JobTracker/ResourceManager assigns tasks (map/reduce) to these nodes

●​ Each task is encapsulated in a TaskAttempt for fault tolerance.​

4. Map Task Execution

●​ The Mapper processes one InputSplit at a time.​

●​ The map() function is applied to each key-value pair, generating intermediate

5. Shuffle and Sort Phase

●​ The data is sorted by key before it reaches the reducer.​

●​ This is the most data-intensive phase of the MapReduce job.​

6. Reduce Task Execution

●​ Reducers fetch their respective intermediate data from the mappers.​

●​ The output is written to HDFS.​

●​ Temporary files are cleaned up.​

●​ The JobTracker/ResourceManager marks the job as SUCCESS or FAILED.​

Types of Failures in MapReduce

●​ A map task may fail due to:​

○​ Data corruption in HDFS.​

○​ Logic errors or exceptions in user-defined map() function.​

○​ Hardware failure or memory error on the node.​

○​ The JobTracker/ResourceManager detects failure through heartbeat loss.​

●​ A reduce task may fail due to:​

○​ Errors in the reduce() logic.​

○​ Memory overflow or disk space issues.​

○​ Network failures during shuffle.​

○​ Power outage, hardware crash, or network disconnect.​

○​ All tasks running on the node are considered failed.​

○​ The JobTracker/ResourceManager reschedules the tasks on other healthy

3. JobTracker / ResourceManager Failure

●​ If the JobTracker (in Hadoop 1.x) fails:​

○​ Entire job needs to be resubmitted since it's a single point of failure.​

●​ In Hadoop 2.x (YARN):​

○​ However, YARN supports ResourceManager High Availability (HA), where

○​ Local disk on a node crashes.​

○​ Temporary data, map outputs, or HDFS blocks are lost.​

○​ HDFS replicates each block (default is 3 replicas), so another replica is used.​

○​ Failed tasks are re-executed on different nodes.​

○​ Shuffle phase (mapper to reducer data transfer).​

○​ Communication between nodes.​

○​ Retries are attempted.​

○​ If persistent, tasks are rescheduled.​

●​ Happens due to:​

○​ Damaged input files.​

○​ Bit errors during data transfer or storage.​

○​ HDFS uses checksums to detect corruption.​

○​ Corrupt blocks are discarded and fetched from healthy replicas.​

7. Speculative Execution Issues

●​ Sometimes slow tasks (stragglers) are executed redundantly on other nodes.​

●​ May cause resource contention or inconsistent output if not managed properly.

Job Scheduling in MapReduce

1. Introduction to Job Scheduling

●​ Job scheduling decides:​

○​ The order of job execution.​

○​ The resources allocated to each job.​

○​ How to handle priority, fairness, and deadlines.​

2. Default Job Scheduler: FIFO (First-In-First-Out)

●​ FIFO Scheduler is the simplest and default scheduler in Hadoop 1.x.​

●​ Jobs are executed in the order they are submitted.​

○​ No fairness in multi-user environments.​

○​ Long jobs block short jobs, leading to poor utilization.​

●​ Developed by Facebook for better resource sharing.​

○​ Jobs are organized into pools.​

○​ Each pool is assigned minimum and maximum resource limits.​

○​ Short jobs don't get blocked.​

○​ Ensures better cluster utilization and job turnaround time.​

●​ Developed by Yahoo for large shared clusters.​

●​ Designed to allow multiple organizations or departments to share a Hadoop

●​ Cluster is divided into queues with configurable capacity.​

●​ Each queue gets a guaranteed minimum of resources.​

●​ Resources not used by a queue can be temporarily borrowed by others.​

● The JobClient reads the job configuration, including input/output paths,

● The JobTracker/ResourceManager receives the job and initializes it.

● The input data is split into logical InputSplits.

● Each InputSplit corresponds to a Map Task.

● The metadata of the job is stored and JobTracker/ResourceManager assigns it a Job

● TaskTrackers (Hadoop 1.x) or NodeManagers (Hadoop 2.x) send heartbeat

● The JobTracker/ResourceManager assigns tasks (map/reduce) to these nodes

● Each task is encapsulated in a TaskAttempt for fault tolerance.

● The Mapper processes one InputSplit at a time.

● The map() function is applied to each key-value pair, generating intermediate

● The data is sorted by key before it reaches the reducer.

● This is the most data-intensive phase of the MapReduce job.

● Reducers fetch their respective intermediate data from the mappers.

● The output is written to HDFS.

● Temporary files are cleaned up.

● The JobTracker/ResourceManager marks the job as SUCCESS or FAILED.

● A map task may fail due to:

○ Data corruption in HDFS.

○ Logic errors or exceptions in user-defined map() function.

○ Hardware failure or memory error on the node.

○ The JobTracker/ResourceManager detects failure through heartbeat loss.

● A reduce task may fail due to:

○ Errors in the reduce() logic.

○ Memory overflow or disk space issues.

○ Network failures during shuffle.

○ Power outage, hardware crash, or network disconnect.

○ All tasks running on the node are considered failed.

○ The JobTracker/ResourceManager reschedules the tasks on other healthy

● If the JobTracker (in Hadoop 1.x) fails:

○ Entire job needs to be resubmitted since it's a single point of failure.

● In Hadoop 2.x (YARN):

○ However, YARN supports ResourceManager High Availability (HA), where

○ Local disk on a node crashes.

○ Temporary data, map outputs, or HDFS blocks are lost.

○ HDFS replicates each block (default is 3 replicas), so another replica is used.

○ Failed tasks are re-executed on different nodes.

○ Shuffle phase (mapper to reducer data transfer).

○ Communication between nodes.

○ Retries are attempted.

○ If persistent, tasks are rescheduled.

● Happens due to:

○ Damaged input files.

○ Bit errors during data transfer or storage.

○ HDFS uses checksums to detect corruption.

○ Corrupt blocks are discarded and fetched from healthy replicas.

● Sometimes slow tasks (stragglers) are executed redundantly on other nodes.

● May cause resource contention or inconsistent output if not managed properly.

● Job scheduling decides:

○ The order of job execution.

○ The resources allocated to each job.

○ How to handle priority, fairness, and deadlines.

● FIFO Scheduler is the simplest and default scheduler in Hadoop 1.x.

● Jobs are executed in the order they are submitted.

○ No fairness in multi-user environments.

○ Long jobs block short jobs, leading to poor utilization.

● Developed by Facebook for better resource sharing.

○ Jobs are organized into pools.

○ Each pool is assigned minimum and maximum resource limits.

○ Short jobs don't get blocked.

○ Ensures better cluster utilization and job turnaround time.

● Developed by Yahoo for large shared clusters.

● Designed to allow multiple organizations or departments to share a Hadoop

● Cluster is divided into queues with configurable capacity.

● Each queue gets a guaranteed minimum of resources.

● Resources not used by a queue can be temporarily borrowed by others.

○ Security and access control per queue.

● The ResourceManager handles resource allocation.

● Hadoop supports speculative execution for slow tasks (stragglers).

● Scheduler may run duplicate tasks on other nodes.

● The task which finishes first is accepted; others are killed.

● Improves overall job performance but can increase resource usage.

● Data locality: Prefer nodes where data resides.

● Resource availability: CPU, memory, disk, network.

● Job priorities: High-priority jobs get more resources.

● User quotas: Limits for multi-user environments.

● FIFO: Suitable for single-user, single-job clusters.

● Fair Scheduler: Ideal for development/testing environments.

● Capacity Scheduler: Best for large enterprises with multiple teams.

● The Map phase outputs intermediate key-value pairs.

● The Reduce phase processes these pairs grouped by key.