
Anatomy of MapReduce Job Execution

The document provides a comprehensive overview of the MapReduce job run process, including job submission, initialization, task assignment, execution phases, and job completion. It also discusses failure types and handling mechanisms, job scheduling strategies, and the importance of shuffle and sort phases in data processing. Additionally, it outlines task execution components and lifecycle in the Hadoop environment.

Uploaded by

Ragul S

UNIT 3

Anatomy of a MapReduce Job Run


MapReduce is a programming model developed by Google for processing large data sets
with a distributed algorithm on a cluster. In Hadoop, the MapReduce engine breaks down a
job into smaller tasks and runs them in a fault-tolerant and parallel manner. Understanding
the anatomy of a MapReduce job helps in optimizing its performance and debugging.

1. Job Submission

●​ The process begins when a user submits a MapReduce job via the JobClient.​

● The JobClient reads the job configuration, including input/output paths, mapper/reducer classes, and the number of reducers.

● It submits the job to the JobTracker (in Hadoop 1.x) or the ResourceManager (in Hadoop 2.x with YARN).

2. Job Initialization

● The JobTracker (Hadoop 1.x) receives the job and initializes it; under YARN, the ResourceManager starts a per-job ApplicationMaster that performs the initialization.

●​ The input data is split into logical InputSplits.​

●​ Each InputSplit corresponds to a Map Task.​

● The metadata of the job is stored, and the JobTracker/ResourceManager assigns it a Job ID.

3. Task Assignment

● TaskTrackers (Hadoop 1.x) or NodeManagers (Hadoop 2.x) send periodic heartbeat signals indicating availability.

● The JobTracker/ResourceManager assigns map and reduce tasks to these nodes, preferring data locality (placing computation where the data resides).

● Each run of a task is tracked as a task attempt, so a failed attempt can be retried without restarting the whole job.



4. Map Task Execution

●​ The Mapper processes one InputSplit at a time.​

●​ Data is read using RecordReader, converting raw input into key-value pairs.​

●​ The map() function is applied to each key-value pair, generating intermediate


key-value pairs.​

● The output is buffered in memory, partitioned (one partition per reducer), sorted by key within each partition, and spilled to local disk.

●​ A combiner may be used here for local aggregation to reduce data transfer.​

5. Shuffle and Sort Phase

●​ The intermediate data is shuffled, i.e., it is sent across the network to reducers
based on key partitioning.​

●​ The data is sorted by key before it reaches the reducer.​

● This is often the most network-intensive phase of a MapReduce job.

6. Reduce Task Execution

●​ Reducers fetch their respective intermediate data from the mappers.​

●​ The reduce() function processes each key and its list of values to produce the final
output.​

●​ The output is written to HDFS.​

7. Output Commit

●​ Once all reduce tasks finish successfully, the output is committed to the final output
directory in HDFS.​

●​ Temporary files are cleaned up.​

8. Job Completion

● The JobTracker/ResourceManager marks the job as SUCCEEDED or FAILED.

●​ The JobClient is notified, and the user can view job logs and counters.

Failures in MapReduce
MapReduce is designed to run on commodity hardware, where failures are common. To
ensure reliability and fault tolerance, Hadoop's MapReduce framework includes mechanisms
to detect, handle, and recover from failures automatically. Understanding these failures is
essential for efficient debugging and performance optimization.

Types of Failures in MapReduce

MapReduce jobs can encounter failures at different stages and components during
execution. The major types of failures are:

1. Task Failure

a. Mapper Failure

●​ A map task may fail due to:​

○​ Data corruption in HDFS.​

○​ Logic errors or exceptions in user-defined map() function.​

○​ Hardware failure or memory error on the node.​

●​ Handling:​

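○ A task that throws an exception or crashes is reported as failed to its TaskTracker (Hadoop 1.x) or ApplicationMaster (YARN); a hung task is detected when it stops reporting progress for a timeout period.

○ The failed task is rescheduled, preferably on another node that holds a replica of the data block.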

b. Reducer Failure

●​ A reduce task may fail due to:​

○​ Errors in the reduce() logic.​

○​ Memory overflow or disk space issues.​

○​ Network failures during shuffle.​



●​ Handling:​

○​ The reducer is retried on a different node using the same intermediate data
from map outputs.​

2. Node Failure

●​ A TaskTracker (Hadoop 1.x) or NodeManager (Hadoop 2.x) may go down due to:​

○​ Power outage, hardware crash, or network disconnect.​

●​ Impact:​

○​ All tasks running on the node are considered failed.​

○ The JobTracker/ResourceManager reschedules the tasks on other healthy nodes.

●​ Detection:​

○​ If the node doesn’t respond for a predefined timeout (e.g., 10 minutes), it's
marked as dead.​
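The timeout-based detection described above can be sketched in a few lines. This is only an illustration under stated assumptions: `HeartbeatMonitor` and its methods are hypothetical names, not Hadoop classes, and times are passed in explicitly as milliseconds so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical liveness monitor; not part of Hadoop.
final class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Record the time at which a node last reported in.
    void heartbeat(String node, long nowMillis) {
        lastHeartbeat.put(node, nowMillis);
    }

    // A node counts as dead once it has been silent for longer than the timeout.
    List<String> deadNodes(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                dead.add(e.getKey());
            }
        }
        return dead;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(10 * 60 * 1000L); // 10 minutes
        monitor.heartbeat("node-1", 0L);
        monitor.heartbeat("node-2", 0L);
        monitor.heartbeat("node-2", 9 * 60 * 1000L); // node-2 reports again
        System.out.println(monitor.deadNodes(11 * 60 * 1000L)); // only node-1 has been silent too long
    }
}
```

Tasks that were running on the nodes returned by `deadNodes` would then be rescheduled elsewhere.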

3. JobTracker / ResourceManager Failure

●​ If the JobTracker (in Hadoop 1.x) fails:​

○ Running jobs fail and must be resubmitted, because the JobTracker is a single point of failure.

●​ In Hadoop 2.x (YARN):​

○​ ResourceManager is the central authority, and its failure affects the whole
cluster.​

○ However, YARN supports ResourceManager High Availability (HA), where a standby can take over.

4. Disk Failure

●​ Occurs when:​

○​ Local disk on a node crashes.​

○​ Temporary data, map outputs, or HDFS blocks are lost.​



●​ Handling:​

○​ HDFS replicates each block (default is 3 replicas), so another replica is used.​

○​ Failed tasks are re-executed on different nodes.​

5. Network Failures

●​ Issues during:​

○​ Shuffle phase (mapper to reducer data transfer).​

○​ Communication between nodes.​

●​ Handling:​

○​ Retries are attempted.​

○​ If persistent, tasks are rescheduled.​

6. Data Corruption

●​ Happens due to:​

○​ Damaged input files.​

○​ Bit errors during data transfer or storage.​

●​ Detection:​

○​ HDFS uses checksums to detect corruption.​

●​ Handling:​

○​ Corrupt blocks are discarded and fetched from healthy replicas.​
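Checksum-based detection can be illustrated with the JDK's `CRC32` class. This is a minimal sketch, assuming a single checksum over a whole byte array; HDFS's own checksum type and chunking are not reproduced here, and `ChecksumCheck` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper; illustrates the idea, not HDFS's implementation.
final class ChecksumCheck {
    // Compute a CRC32 checksum over a chunk of bytes.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // A chunk is treated as corrupt when its checksum no longer matches the stored one.
    static boolean isCorrupt(byte[] data, long storedChecksum) {
        return checksum(data) != storedChecksum;
    }

    public static void main(String[] args) {
        byte[] block = "Hadoop is fun".getBytes(StandardCharsets.UTF_8);
        long stored = checksum(block); // recorded when the data was written
        block[0] ^= 0x01;              // simulate a single flipped bit
        System.out.println(isCorrupt(block, stored)); // the flipped bit changes the checksum
    }
}
```

When a mismatch like this is found, the reader would fall back to another replica of the block.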

7. Speculative Execution Issues

●​ Sometimes slow tasks (stragglers) are executed redundantly on other nodes.​

● May cause resource contention, and inconsistent output if tasks have non-idempotent side effects (for example, writing directly to an external system).



Job Scheduling in MapReduce


In Hadoop MapReduce, job scheduling is a crucial process that determines how and when
the submitted jobs (and their respective tasks) are executed in a distributed environment.
The goal of job scheduling is to maximize resource utilization, ensure fairness, and
optimize job performance.

1. Introduction to Job Scheduling

●​ Hadoop clusters are multi-user environments where many jobs may be submitted
concurrently.​

●​ Each job consists of multiple map and reduce tasks, and these tasks must be
scheduled across the available nodes.​

●​ Job scheduling decides:​

○​ The order of job execution.​

○​ The resources allocated to each job.​

○​ How to handle priority, fairness, and deadlines.​

2. Default Job Scheduler: FIFO (First-In-First-Out)

●​ FIFO Scheduler is the simplest and default scheduler in Hadoop 1.x.​

●​ Jobs are executed in the order they are submitted.​

●​ Limitations:​

○​ No fairness in multi-user environments.​

○ Long jobs block short jobs, leading to long waiting times for small jobs.

3. Fair Scheduler

●​ Developed by Facebook for better resource sharing.​

●​ Aims to give each user or job an equal share of the cluster over time.​

●​ Features:​

○​ Jobs are organized into pools.​



○​ Each pool is assigned minimum and maximum resource limits.​

○​ Jobs get resources such that all pools get a fair share.​

●​ Benefits:​

○​ Short jobs don't get blocked.​

○​ Ensures better cluster utilization and job turnaround time.​
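One common way to formalize a "fair share" is max-min fairness: pools that need less than an equal split get what they ask for, and the surplus is divided among the rest. The sketch below shows that idea only. It is an assumption made for illustration, not the Fair Scheduler's actual code, and it ignores minimum/maximum limits and weights; `FairShare` is a hypothetical name.

```java
import java.util.Arrays;

// Hypothetical max-min fair allocator; not Hadoop's Fair Scheduler.
final class FairShare {
    // Divide `capacity` slots among pools with the given demands.
    static double[] allocate(double capacity, double[] demand) {
        double[] share = new double[demand.length];
        boolean[] satisfied = new boolean[demand.length];
        double remaining = capacity;
        int unsatisfied = demand.length;
        while (unsatisfied > 0 && remaining > 1e-9) {
            double equal = remaining / unsatisfied; // equal split of what is left
            boolean anyCapped = false;
            for (int i = 0; i < demand.length; i++) {
                // Pools that need no more than the equal split are fully satisfied.
                if (!satisfied[i] && demand[i] - share[i] <= equal) {
                    remaining -= demand[i] - share[i];
                    share[i] = demand[i];
                    satisfied[i] = true;
                    unsatisfied--;
                    anyCapped = true;
                }
            }
            if (!anyCapped) {
                // Everyone left wants more than the equal split: hand it out and stop.
                for (int i = 0; i < demand.length; i++) {
                    if (!satisfied[i]) share[i] += equal;
                }
                remaining = 0;
            }
        }
        return share;
    }

    public static void main(String[] args) {
        // 100 slots, three pools wanting 10, 50 and 80 slots.
        System.out.println(Arrays.toString(allocate(100, new double[] {10, 50, 80})));
    }
}
```

With these inputs the small pool receives its full demand of 10, and the remaining 90 slots are split evenly (45 each) between the two larger pools, which is why short jobs are not starved by long ones.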

4. Capacity Scheduler

●​ Developed by Yahoo for large shared clusters.​

● Designed to allow multiple organizations or departments to share a Hadoop cluster.

●​ Cluster is divided into queues with configurable capacity.​

●​ Each queue gets a guaranteed minimum of resources.​

●​ Resources not used by a queue can be temporarily borrowed by others.​

●​ Key features:​

○​ Multi-tenancy support.​

○​ Security and access control per queue.​

5. YARN and Scheduler Evolution

In Hadoop 2.x and later (YARN - Yet Another Resource Negotiator):

●​ The ResourceManager handles resource allocation.​

●​ Schedulers in YARN:​

○​ FIFO Scheduler​

○​ Capacity Scheduler​

○​ Fair Scheduler​

● Each application gets its own ApplicationMaster, which negotiates resources with the ResourceManager and runs the application's tasks in containers.

6. Scheduler Comparison
Feature               FIFO Scheduler    Fair Scheduler          Capacity Scheduler
Job Ordering          Submission time   Resource fairness       Queue capacity
Multi-user Support    Poor              Good                    Excellent
Resource Guarantees   No                Optional                Yes
Best For              Small clusters    Shared multi-user use   Large enterprise use

7. Speculative Execution & Scheduling

●​ Hadoop supports speculative execution for slow tasks (stragglers).​

●​ Scheduler may run duplicate tasks on other nodes.​

●​ The task which finishes first is accepted; others are killed.​

●​ Improves overall job performance but can increase resource usage.​
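Deciding which tasks count as stragglers needs some rule. The sketch below uses a deliberately simple one, assumed purely for illustration: a task is a straggler if its progress trails the average progress by more than a fixed gap. `StragglerDetector`, `findStragglers`, and the 0.2 gap are hypothetical choices, not Hadoop's actual policy.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical straggler heuristic; not Hadoop's speculation policy.
final class StragglerDetector {
    // Returns indices of tasks whose progress (0.0 to 1.0) trails the average by more than `gap`.
    static List<Integer> findStragglers(double[] progress, double gap) {
        double sum = 0.0;
        for (double p : progress) sum += p;
        double average = progress.length == 0 ? 0.0 : sum / progress.length;
        List<Integer> stragglers = new ArrayList<>();
        for (int i = 0; i < progress.length; i++) {
            if (progress[i] < average - gap) stragglers.add(i);
        }
        return stragglers;
    }

    public static void main(String[] args) {
        double[] progress = {0.90, 0.85, 0.30, 0.95};
        // Average progress is about 0.75, so only the task at 0.30 falls below 0.55.
        System.out.println(findStragglers(progress, 0.2));
    }
}
```

A scheduler would launch a duplicate attempt for each index returned, then keep whichever attempt finishes first.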

8. Factors Affecting Scheduling

●​ Data locality: Prefer nodes where data resides.​

●​ Resource availability: CPU, memory, disk, network.​

●​ Job priorities: High-priority jobs get more resources.​

●​ User quotas: Limits for multi-user environments.​

9. Use Case Examples

●​ FIFO: Suitable for single-user, single-job clusters.​

● Fair Scheduler: Ideal for shared clusters where short jobs (for example, development and testing) run alongside long ones.

●​ Capacity Scheduler: Best for large enterprises with multiple teams.



Shuffle and Sort in MapReduce


In Hadoop MapReduce, Shuffle and Sort are critical intermediate phases that occur
between the Map and Reduce phases. These stages are automatically managed by the
framework and are essential for correctly grouping intermediate key-value pairs so they can
be processed efficiently by reducers.

1. Introduction

●​ The Map phase outputs intermediate key-value pairs.​

●​ The Reduce phase processes these pairs grouped by key.​

●​ But before reducing, the system must:​

○​ Shuffle: Transfer and group intermediate data from mappers to reducers.​

○​ Sort: Sort intermediate keys so reducers receive data in a sorted order.​

●​ Together, these ensure correctness and performance in distributed data processing.​

2. Phases in Shuffle and Sort

A. Map Side Processing

i) Spill Phase

● When the mapper's in-memory buffer fills past a threshold (default: 80% of the buffer size), a spill to disk is triggered.

●​ During spill:​

○ Output is partitioned by reducer and sorted by key within each partition.

○​ Optional combiner function is applied (if defined).​

○​ Output is written to a temporary spill file.​
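The spill behaviour above can be mimicked with a small in-memory model. This is a simplified sketch under stated assumptions: the buffer is measured in records rather than bytes, there is a single partition, no combiner, and spilling happens synchronously. `SpillBuffer` and its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model of a map-side buffer; sizes are counted in records, not bytes.
final class SpillBuffer {
    private final int threshold;                          // records that trigger a spill
    private final List<String> buffer = new ArrayList<>();
    final List<List<String>> spills = new ArrayList<>();  // each entry stands for one spill file

    SpillBuffer(int capacityRecords, double spillPercent) {
        this.threshold = Math.max(1, (int) Math.round(capacityRecords * spillPercent));
    }

    // Buffer one map output key; spill once the threshold is reached.
    void collect(String key) {
        buffer.add(key);
        if (buffer.size() >= threshold) spill();
    }

    // Flush whatever is left when the map task finishes.
    void close() {
        if (!buffer.isEmpty()) spill();
    }

    // Sort the buffered keys and write them out as one sorted run.
    private void spill() {
        List<String> run = new ArrayList<>(buffer);
        Collections.sort(run);
        spills.add(run);
        buffer.clear();
    }

    public static void main(String[] args) {
        SpillBuffer buf = new SpillBuffer(10, 0.80); // spill at 8 of 10 records
        for (String k : "j i h g f e d c b a".split(" ")) buf.collect(k);
        buf.close();
        System.out.println(buf.spills); // two sorted runs: 8 records, then 2
    }
}
```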

ii) Merge Phase

● If multiple spill files are created, they are merged into a single output file that is partitioned and sorted by key.

●​ The final output of the mapper is a sorted list of key-value pairs, ready for shuffling.​
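Merging several sorted spill files into one sorted output is a k-way merge. The following sketch shows the technique with a min-heap over in-memory lists; treating each spill file as a `List<String>` is an assumption for illustration, and `SpillMerger` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical k-way merge over in-memory sorted runs.
final class SpillMerger {
    // Merge already-sorted runs into one sorted list using a min-heap of cursors.
    static List<String> merge(List<List<String>> runs) {
        // Each heap entry is {run index, position within that run}, ordered by the key it points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> runs.get(a[0]).get(a[1]).compareTo(runs.get(b[0]).get(b[1])));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) heap.add(new int[] {r, 0});
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] cursor = heap.poll();            // smallest remaining key across all runs
            List<String> run = runs.get(cursor[0]);
            merged.add(run.get(cursor[1]));
            if (cursor[1] + 1 < run.size()) {
                heap.add(new int[] {cursor[0], cursor[1] + 1}); // advance within the same run
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> runs = new ArrayList<>();
        runs.add(List.of("c", "d", "j"));
        runs.add(List.of("a", "b"));
        System.out.println(merge(runs));
    }
}
```

Because every run is already sorted, the merge only ever compares the current head of each run, which is what makes it cheap even when the runs live on disk.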

B. Shuffle Phase (Between Map and Reduce)

●​ Shuffle is the process of transferring intermediate data from the map tasks to the
reduce tasks.​

●​ Each reducer:​

○ Requests the data for its assigned partition from all mappers.

○ Data is fetched over the network.

● Unlike map input, shuffle traffic cannot rely on data locality, so the volume of map output largely determines the cost of this phase.

C. Sort Phase (Reduce Side)

● As a reducer receives its data:

○ It merges the already-sorted map outputs, spilling to disk when they do not fit in memory (an external merge sort).

○​ Ensures that all values associated with the same key are grouped together.​

●​ This is necessary because the reduce() function is called once per unique key.​

3. Role of Partitioner

●​ The Partitioner determines which reducer will process which keys.​

●​ Default: HashPartitioner.​

●​ Ensures that:​

○ All records with the same key go to the same reducer.

○ Data is spread across reducers, although a skewed key distribution can still overload some of them.
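Hash partitioning boils down to one expression: take the key's hash code, clear the sign bit, and take the remainder modulo the number of reducers. The sketch below applies that formula to plain Java objects; note that Hadoop key types such as Text compute their own hash codes, so the partition numbers here are illustrative only. `HashPartitionSketch` is a hypothetical name.

```java
// Hypothetical stand-alone version of hash partitioning.
final class HashPartitionSketch {
    // Mask off the sign bit so the result is non-negative, then take the remainder.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        for (String word : new String[] {"hadoop", "is", "fun", "is"}) {
            // The same word always maps to the same partition number.
            System.out.println(word + " -> reducer " + getPartition(word, reducers));
        }
    }
}
```

The mask matters because `hashCode()` may be negative, and a negative remainder would not be a valid reducer index.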

4. Importance of Shuffle and Sort


Feature      Purpose
Shuffle      Moves and groups data from mappers to appropriate reducers
Sort         Orders keys before reduction
Grouping     Ensures correct input for reduce()
Efficiency   Helps in optimization via combiner and compression

5. Performance Considerations

●​ Shuffle is network-intensive → Often the most expensive part of MapReduce.​

●​ Techniques to optimize:​

○​ Combiner to reduce data volume.​

○​ Compression of map outputs.​

○​ Tuning buffer sizes.​

○​ Using efficient partitioning.​

6. Diagram: Shuffle and Sort

+---------+        +-----------+       +----------+
| Mapper1 |----->  |           |       |          |
+---------+        |           |       |          |
                   |           |---->  | Reducer1 |
+---------+        |  Shuffle  |       |          |
| Mapper2 |----->  |     &     |---->  | Reducer2 |
+---------+        |   Sort    |       |          |
                   |           |       +----------+
+---------+        |           |
| Mapper3 |----->  |           |
+---------+        +-----------+

7. Real-World Example

Imagine a word count program:

●​ Map phase: emits (word, 1) for each word.​

● Shuffle phase: sends all (word, 1) pairs for a given word to the same reducer.

● Sort phase: sorts these pairs by word so that equal words are grouped together.

●​ Reduce phase: sums the counts for each word.
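The four phases of the word count example can be traced in a single-process simulation. This sketch uses only JDK collections, with a `TreeMap` standing in for the shuffle and sort; it is not a Hadoop job, and `WordCountSim` is a hypothetical name.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory walk-through of map, shuffle/sort, and reduce.
final class WordCountSim {
    static Map<String, Integer> run(List<String> lines) {
        // Map phase: emit one (word, 1) pair per word.
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) emitted.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        // Shuffle and sort: group values by key; TreeMap keeps the keys sorted.
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : emitted) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        // Reduce phase: called once per key, summing that key's values.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("Hadoop is fun", "MapReduce is powerful")));
    }
}
```

In a real cluster the grouping step is spread across machines, but each reducer sees the same shape of input: a key followed by all of its values.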



Task Execution in MapReduce


MapReduce is a programming model for processing large data sets with a parallel,
distributed algorithm on a cluster. One of the core components of this model is task
execution, which defines how a job's tasks are created, managed, scheduled, and
executed on Hadoop’s distributed environment.

1. Introduction

In the MapReduce framework, a job is split into tasks. There are two types of tasks:

●​ Map tasks: Process input splits and produce intermediate key-value pairs.​

●​ Reduce tasks: Process intermediate data to produce final output.​

Task execution involves:

●​ Job submission​

●​ Task division​

●​ Scheduling​

●​ Execution​

●​ Monitoring​

●​ Retrying (if needed)​

2. Components Involved in Task Execution

Component                                  Role
JobTracker (Hadoop 1)                      Coordinates job execution, task scheduling, and monitoring
TaskTracker                                Executes individual map/reduce tasks
ApplicationMaster (YARN-based Hadoop 2+)   Manages application execution per job
NodeManager                                Launches containers to run tasks on individual nodes



3. Task Execution Lifecycle

A. Job Submission

● The client submits a job to the ResourceManager (YARN) or JobTracker (Hadoop 1).

●​ The system validates the input and output paths.​

B. Input Splits

●​ Input data is split into InputSplits, each assigned to a Map Task.​

●​ Example: A 1 GB file might be split into 64 MB chunks → 16 map tasks.​
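The arithmetic behind the example is a ceiling division of file size by split size. The sketch below assumes splits of exactly one fixed size; the real split computation has extra rules that are ignored here, and `SplitCount` is a hypothetical name.

```java
// Hypothetical helper showing the split-count arithmetic only.
final class SplitCount {
    // Ceiling division: a trailing partial chunk still needs its own split.
    static long numSplits(long fileSizeBytes, long splitSizeBytes) {
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        long mb = 1024L * 1024;
        System.out.println(numSplits(oneGb, 64 * mb));  // 1024 MB / 64 MB
        System.out.println(numSplits(oneGb, 128 * mb)); // 1024 MB / 128 MB
    }
}
```

Doubling the split size halves the number of map tasks, which trades parallelism against per-task startup overhead.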

C. Task Assignment

● The ApplicationMaster requests containers from the ResourceManager and assigns map and reduce tasks to them.

●​ Each node executes its task in an isolated container (YARN).​

D. Map Task Execution

●​ Each map task:​

○​ Reads data from HDFS.​

○​ Processes data using user-defined map function.​

○​ Buffers output in memory.​

○​ Periodically writes intermediate output to local disk (spill).​

○​ Optionally applies a Combiner.​

E. Shuffle and Sort

● As map tasks complete, their intermediate data is shuffled to reducers.

●​ Data is sorted by key and grouped before feeding to reduce tasks.​

F. Reduce Task Execution



●​ Each reduce task:​

○​ Fetches relevant intermediate data from map task outputs.​

○​ Sorts and merges the data.​

○​ Applies the reduce function.​

○​ Writes final output to HDFS.​

4. Types of Tasks in MapReduce

Task Type      Description
Map Task       Processes input split and produces intermediate key-value pairs
Reduce Task    Aggregates intermediate data and produces final output
Setup Task     Initializes resources before map/reduce tasks run
Cleanup Task   Cleans up after successful/failed map/reduce execution

5. Task Execution Control

Task Monitoring

●​ Tasks report progress to the ApplicationMaster.​

●​ Heartbeat signals are sent to monitor liveness.​

Failure Handling

● Failed tasks are retried (default: up to 4 attempts per task) before the job is marked as failed.

●​ Speculative execution may launch duplicate tasks for slow-running instances.​
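The retry rule can be expressed as a bounded loop over attempt numbers. In this sketch an attempt is modelled as a predicate that receives the attempt number and reports success; `TaskRetry` and `runWithRetries` are hypothetical names, and the limit of 4 simply mirrors the default quoted above.

```java
import java.util.function.IntPredicate;

// Hypothetical retry loop; a real framework would also pick a different node per attempt.
final class TaskRetry {
    // Runs attempts 1..maxAttempts until one succeeds.
    // Returns the number of the successful attempt, or -1 if every attempt failed.
    static int runWithRetries(IntPredicate attempt, int maxAttempts) {
        for (int n = 1; n <= maxAttempts; n++) {
            if (attempt.test(n)) return n;
        }
        return -1; // all attempts exhausted: the task is reported as failed
    }

    public static void main(String[] args) {
        // A task that fails twice and succeeds on its third attempt.
        System.out.println(runWithRetries(n -> n == 3, 4));
        // A task that never succeeds.
        System.out.println(runWithRetries(n -> false, 4));
    }
}
```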

6. Speculative Execution

●​ Sometimes, a task runs slower than others → "straggler".​

●​ Hadoop may launch a duplicate task on another node.​

●​ The first to finish is accepted; the other is killed.​



●​ Helps to speed up overall job completion.​

7. Diagram: Task Execution in MapReduce

+-----------------+
|   Job Client    |
+-----------------+
         |
         v
+-----------------+        +---------------------+
| ResourceManager | <----> |  ApplicationMaster  |
+-----------------+        +---------------------+
         |                            |
         v                            v
+------------------+       +---------------------+
| Map Task (Node 1)|       | Reduce Task (Node 2)|
+------------------+       +---------------------+

8. Optimization Techniques in Task Execution

●​ Use of Combiner: Reduces volume of intermediate data.​

●​ Data Locality: Assign map tasks closer to where data resides.​

●​ Proper partitioning: Avoids data skew in reducers.​

●​ Speculative execution: Handles stragglers efficiently.​

9. Example: Word Count

In a word count job:

●​ Map Task: Emits (word, 1) for each word.​

●​ Shuffle: Groups all occurrences of the same word.​

●​ Reduce Task: Sums the values → total count per word.



MapReduce Types and Formats


1. Introduction

MapReduce is a programming paradigm used to process large-scale data in parallel across
a distributed cluster. The effectiveness of MapReduce heavily depends on the data types
and input/output formats used in the processing pipeline. Understanding these is critical
for writing optimized and flexible MapReduce applications.

2. MapReduce Data Types

In Hadoop, data flows through key-value pairs. Each phase of the job (Map, Shuffle,
Reduce) involves processing these pairs.

Java Type   Hadoop Writable Equivalent
int         IntWritable
long        LongWritable
float       FloatWritable
double      DoubleWritable
boolean     BooleanWritable
String      Text
Array/Map   ArrayWritable, MapWritable, SortedMapWritable

Example:

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {.....}

3. Custom Writable Types

When built-in Writable types are insufficient, Hadoop allows custom Writable types by
implementing:

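import java.io.*;
import org.apache.hadoop.io.*;

public class MyWritable implements Writable {

    private IntWritable id = new IntWritable();

    private Text name = new Text();

    // Serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException { id.write(out); name.write(out); }

    // Deserialize the fields in the same order they were written.
    public void readFields(DataInput in) throws IOException { id.readFields(in); name.readFields(in); }

}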

Custom types must implement:

●​ Writable or WritableComparable​

●​ readFields() and write() for serialization
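The write/readFields contract can be tried out without a Hadoop installation. The sketch below declares a local stand-in interface, `SimpleWritable`, with the same two methods, and round-trips an object through a byte array. Both `SimpleWritable` and `EmployeeWritable` are hypothetical names used only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-in for Hadoop's Writable, declared locally so the sketch has no dependencies.
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

final class EmployeeWritable implements SimpleWritable {
    int id;
    String name = "";

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeUTF(name);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Fields must be read in exactly the order they were written.
        id = in.readInt();
        name = in.readUTF();
    }

    // Serialize to bytes and deserialize into a fresh object.
    static EmployeeWritable roundTrip(EmployeeWritable original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            EmployeeWritable copy = new EmployeeWritable();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        EmployeeWritable e = new EmployeeWritable();
        e.id = 7;
        e.name = "Asha";
        EmployeeWritable copy = roundTrip(e);
        System.out.println(copy.id + " " + copy.name);
    }
}
```

A type used as a key would additionally need a comparison method, which is what WritableComparable adds.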

4. Input Formats in MapReduce

InputFormat defines how input files are split and read by MapReduce.

Role of InputFormat:

●​ Splits input data into logical InputSplits​

●​ Provides RecordReader to read key-value pairs from splits​

Common InputFormats:

InputFormat               Description
TextInputFormat           Default format. Treats each line as a value, key is line offset
KeyValueTextInputFormat   Splits line into key/value based on separator (like tab)
SequenceFileInputFormat   Reads Hadoop-specific binary SequenceFiles
NLineInputFormat          Assigns N lines per split
CombineFileInputFormat    Combines small files into a larger split
DBInputFormat             Reads data from a database
XMLInputFormat            Custom input format for XML files

Example:

[Link]([Link]);

5. Output Formats in MapReduce

OutputFormat defines how the results of MapReduce jobs are written to storage.

Common OutputFormats:

OutputFormat               Description
TextOutputFormat           Default output, writes key-value as strings
SequenceFileOutputFormat   Writes binary output
MultipleOutputs            Helper class (not itself an OutputFormat) that allows writing to multiple outputs
NullOutputFormat           Discards output (used for benchmarking)
DBOutputFormat             Writes to relational database
MapFileOutputFormat        Writes output in MapFile format (sorted SequenceFile)

Example:

[Link]([Link]);

6. Input/Output Key-Value Format Examples

TextInputFormat:

Input (the leading number is each line's byte offset, shown for reference; it is not stored in the file):
0 Hadoop is fun
14 MapReduce is powerful

Mapper Input:
Key = 0, Value = "Hadoop is fun"
Key = 14, Value = "MapReduce is powerful"
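The keys TextInputFormat hands to the mapper are byte offsets of line starts, which can be computed by hand. The sketch below does so for a small string, assuming UTF-8 text and single-character `\n` line endings; `LineOffsets` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that computes where each line starts, in bytes.
final class LineOffsets {
    // Byte offset at which each line starts, assuming UTF-8 text and '\n' line endings.
    static List<Long> offsets(String text) {
        List<Long> result = new ArrayList<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            result.add(offset);
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for the newline
        }
        return result;
    }

    public static void main(String[] args) {
        // "Hadoop is fun" is 13 bytes, so with its newline the next line starts at byte 14.
        System.out.println(offsets("Hadoop is fun\nMapReduce is powerful\n"));
    }
}
```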

TextOutputFormat:

Output (key and value separated by a tab; keys in sorted order when a single reducer is used):
Hadoop 1
MapReduce 1
fun 1
is 2
powerful 1

7. Use of FileInputFormat and FileOutputFormat

Hadoop provides base classes:

●​ FileInputFormat – handles file splits and data reading.​

●​ FileOutputFormat – handles writing results to HDFS.​

These can be extended for customized processing like handling CSV, JSON, XML formats.

8. Custom Input/Output Formats

For non-standard data sources/formats (e.g., XML, images), we can define:

●​ Custom InputFormat class​

●​ Custom RecordReader​

●​ Custom OutputFormat class​

This allows advanced use cases like:

●​ Reading zipped data​

●​ Writing to multiple files​

●​ Exporting structured formats like JSON or Avro

9. Diagram: Input/Output Format Flow

[HDFS Input File]
        |
        v
[InputFormat] --> [InputSplit]
        |
        v
[RecordReader] --> (key, value)
        |
        v
    [Mapper]
        |
        v
    [Reducer]
        |
        v
[OutputFormat] --> [HDFS Output File]

MapReduce Features:
1. Introduction

MapReduce is a powerful data processing model introduced by Google and later adopted by
Apache Hadoop to handle large-scale data processing in a distributed and parallel
manner. It simplifies the task of processing terabytes and petabytes of data by splitting it
into manageable pieces and executing them across a cluster.

2. Key Features of MapReduce

A. Simplicity and Abstraction

●​ MapReduce abstracts the complexity of parallel and distributed computing.​

●​ Developers only need to implement two functions:​

○​ map() – for processing input data.​

○​ reduce() – for aggregating intermediate output.​

●​ Example: Word count program with a few lines of code.​

B. Scalability

●​ Can scale to thousands of commodity machines.​

●​ Efficiently handles massive data (multi-terabyte or petabyte scale).​

●​ Data is automatically partitioned and distributed across nodes.​

C. Fault Tolerance

●​ Detects and handles failures automatically.​

●​ Failed tasks are re-executed on other available nodes.​

● Intermediate results are written to local disk, so a failed reduce task can fetch them again; if the node holding them is lost, the affected map tasks are re-run.

D. Data Locality Optimization

●​ Tasks are scheduled close to the data (same node or rack).​



●​ Reduces network congestion and improves performance.​

●​ Brings computation to data instead of moving data.​

E. Parallel Processing

●​ Map tasks and reduce tasks are executed in parallel across nodes.​

●​ Takes advantage of multi-core and multi-node environments.​

●​ Greatly reduces overall job execution time.​

F. High Throughput

●​ Designed to process vast amounts of data in parallel.​

●​ High throughput achieved due to pipelined data flow:​

○​ Input → Map → Shuffle → Reduce → Output​

G. Automatic Task Scheduling

●​ JobTracker (Hadoop 1) or ResourceManager (YARN) handles:​

○​ Task assignment​

○​ Load balancing​

○​ Monitoring​

●​ Developer doesn’t need to manage resource allocation.​

H. Flexible Data Input and Output

●​ Supports multiple input and output formats:​

○​ Text, SequenceFile, KeyValue, DBInput, XML, JSON, etc.​

●​ Custom InputFormat and OutputFormat can be defined as needed.​



I. Language Independence

●​ Though implemented in Java, MapReduce can be used with:​

○​ Python, C++, Ruby, and more via Hadoop Streaming API​

●​ Makes it adaptable for a wide range of developers.​

J. Speculative Execution

●​ Handles slow-running tasks (stragglers) by launching backup tasks.​

●​ Improves job completion time by finishing with the faster task.​

K. Combiner Support

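● A mini-reducer that runs on map output, before the shuffle, to reduce data transfer. The framework may apply it zero or more times, so the operation should be commutative and associative.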

●​ Used to optimize bandwidth during the shuffle phase.​
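The saving from a combiner is easy to see for word count: instead of shuffling one (word, 1) record per occurrence, each mapper shuffles one partial sum per distinct word. The sketch below models a single mapper's output as a list of words; `CombinerSketch` is a hypothetical name and the code is not a Hadoop Reducer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local aggregation over one mapper's output.
final class CombinerSketch {
    // Sums the counts emitted by one mapper, leaving one record per distinct word.
    static Map<String, Integer> combine(List<String> mapperWords) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String word : mapperWords) {
            partial.merge(word, 1, Integer::sum); // same effect as summing (word, 1) pairs
        }
        return partial;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("is", "fun", "is", "is");
        Map<String, Integer> combined = combine(words);
        System.out.println("records before combine: " + words.size());
        System.out.println("records after combine:  " + combined.size());
        System.out.println(combined);
    }
}
```

Summation works as a combiner because partial sums can be added in any order and grouping; an operation such as a plain average would not give the same result if combined this way.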

L. Counters and Logs

●​ Built-in support for custom counters to track progress and metrics.​

●​ Logs and status updates aid in debugging and monitoring jobs.

3. Real-world Use Cases of MapReduce

Domain           Use Case
Search Engines   Index building, log analysis
E-commerce       Product recommendations, transaction analysis
Bioinformatics   DNA sequence analysis
Social Media     Hashtag frequency, trend detection



4. Advantages of MapReduce

●​ Handles big data efficiently​

●​ Reduces development effort​

●​ Enhances resource utilization​

●​ Enables distributed fault-tolerant processing​

5. Limitations (for completeness)

Limitation                  Description
High latency                Not ideal for real-time processing
Complex debugging           Distributed environment can be hard to debug
Disk I/O heavy              Writes intermediate data to disk
Limited iterative support   Not optimized for iterative ML tasks
