Understanding MapReduce Execution Pipeline

The MapReduce framework was introduced by Google in 2004 to support distributed processing of large data sets across computer clusters. It is based on divide-and-conquer principles where input data is split into independent chunks processed in parallel by mappers. The framework then sorts the outputs of the mappers and uses them as input for reducers, handling tasks like scheduling and monitoring across nodes.

Uploaded by

Radhamani V

MapReduce

The MapReduce framework was inspired by the map and reduce functions of functional programming and introduced by Google in 2004
to support distributed processing of large data sets spread over clusters of computers.

MapReduce was introduced to solve large-data computational problems, and is specifically designed
to run on commodity hardware.

It is based on divide-and-conquer principles — the input data sets are split into independent chunks,
which are processed by the mappers in parallel.

Additionally, execution of the maps is typically co-located with the data.

The framework then sorts the outputs of the maps, and uses them as an input to the reducers.
Slide 1
Functionality of Mappers and Reducers

Single Hadoop Job = Mapper + [Reducer] + Driver

Here the driver is the main application that controls some aspects of the execution, and the brackets mark the reducer as optional.
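
The split of a job into mapper, reducer, and driver can be sketched in plain Java without any Hadoop dependency. This is only an illustration under stated assumptions: the class MiniWordCountJob and its methods are hypothetical names, the chunks are processed sequentially rather than in parallel, and a TreeMap stands in for the framework's sort step.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for "job = mapper + reducer + driver". Not the Hadoop API.
class MiniWordCountJob {

    // Mapper: turns one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // Reducer: folds all values seen for one key into a single count.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: wires mapper and reducer together and plays the role of the framework.
    static Map<String, Integer> run(List<String> chunks) {
        // Group the mapper output by key; the TreeMap keeps the keys sorted.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String chunk : chunks) {   // each chunk is independent of the others
            for (Map.Entry<String, Integer> kv : map(chunk)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("to be or", "not to be")));
    }
}
```

Because no chunk depends on another, the loop over chunks is the part a real cluster would run in parallel on different nodes.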

Slide 2
Responsibility of MapReduce Framework

1. Choosing appropriate machines (nodes) for running mappers

2. Starting and monitoring the mapper’s execution
3. Choosing appropriate locations for the reducer’s execution
4. Sorting and shuffling the output of mappers and delivering the output to reducer nodes
5. Starting and monitoring the reducer’s execution

In addition, the framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.

Slide 3
The three key Methods in Mapper Class

The Mapper class has three key methods (which you can override):
setup, cleanup, and map

Both setup and cleanup methods are invoked only once during a specific mapper life cycle — at the
beginning and end of mapper execution, respectively.

The setup method is used to implement the mapper’s initialization (for example, reading shared resources,
connecting to HBase tables, and so on), whereas cleanup is used for cleaning up the mapper’s resources
and, optionally, if the mapper implements an associative array or counter, to write out the information.

The business functionality (that is, the application-specific logic) of the mapper is implemented in the map
function. Typically, given a key/value pair, this method processes the pair and writes out (using a context
object) one or more resulting key/value pairs.
Slide 4
Hollywood Principle in Mapper Class

A context object passed to the map method allows it to get additional information about the
execution environment and to report on its execution. An important thing to note is that a map function does
not read data. Following the “Hollywood principle,” the framework invokes it every time a reader reads (and optionally
parses) a new record, and the reader passes the data to it through the context.

Hollywood principle: “Don’t call us, we’ll call you”

It is a useful software development technique in which an object’s (or component’s) initial condition and
ongoing life cycle are handled by its environment, rather than by the object itself. This principle is typically used
for implementing a class/component that must fit into the constraints of an existing framework.
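
The mapper life cycle and the Hollywood principle can be sketched together in plain Java. The names below (MapperLifecycleSketch, SketchMapper, CountingMapper) are hypothetical and the signatures are simplified; Hadoop's real Mapper passes a context object instead of a list. The point is only that the framework owns the read loop and calls setup once, map once per record, and cleanup once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for the framework: it owns the read loop and calls into the mapper.
class MapperLifecycleSketch {
    static List<String> run(SketchMapper mapper, List<String> records) {
        List<String> out = new ArrayList<>();
        mapper.setup(out);                 // once, before the first record
        for (String record : records) {    // the reader side pulls records; map() never does
            mapper.map(record, out);
        }
        mapper.cleanup(out);               // once, after the last record
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(new CountingMapper(), Arrays.asList("a", "b")));
    }
}

// Hypothetical, simplified mapper base class (not Hadoop's Mapper).
abstract class SketchMapper {
    void setup(List<String> out) {}
    abstract void map(String record, List<String> out);
    void cleanup(List<String> out) {}
}

// User code: keeps an in-memory counter and writes it out in cleanup.
class CountingMapper extends SketchMapper {
    private int seen;

    @Override
    void setup(List<String> out) {
        seen = 0;
    }

    @Override
    void map(String record, List<String> out) {
        seen++;
        out.add(record.toUpperCase());
    }

    @Override
    void cleanup(List<String> out) {
        out.add("records=" + seen);
    }
}
```

CountingMapper never loops over the input itself, which is what lets the same class fit into whatever reader the framework chooses.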

Slide 5
MapReduce Execution Architecture

Slide 6
MapReduce Execution Architecture (Contd.,)

1. Choosing appropriate machines (nodes) for running mappers
2. Starting and monitoring the mapper’s execution
3. Choosing appropriate locations for the reducer’s execution
4. Sorting and shuffling the output of mappers and delivering the output to reducer nodes
5. Starting and monitoring the reducer’s execution

In addition, the framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
Slide 7
Main Components of MapReduce Execution Pipelines

1. Driver
2. Context
3. InputData
4. InputFormat
5. InputSplit
6. RecordReader
7. Mapper
8. Partition
9. Shuffle
10. Sort
11. Reducer
12. OutputFormat
13. RecordWriter
14. [Combiner]
15. [DistributedCache]
Slide 8
Main Components of MapReduce Execution Pipelines (Contd.,)
1. Driver - The main program.
- It defines the job-specific configuration and specifies all of the components.

2. Context - A coordinator object for the phases of the job, which run as processes on different machines.
- It provides system-wide and job-wide information.

3. InputData - Incoming data (tens of gigabytes) is split and stored in HDFS, HBase, or other storage.

4. InputFormat - It is invoked by the job driver to decide the number and location of the map task executions.
- It defines the InputSplits that are used to split the input data.

Slide 9
Main Components of MapReduce Execution Pipelines (Contd.,)
5. InputSplit - It defines a unit of work for a single map task in a MapReduce program.

6. RecordReader - It reads the data from its source.
- It runs inside the mapper task.
- It converts the data into key/value pairs and delivers them to the map method.

7. Mapper - The first phase of user-defined work.
- It takes input data in the form of a series of key/value pairs (k1, v1), which are used for individual map executions.
- It typically transforms the input pair into an output pair (k2, v2), which is used as an input for shuffle and sort.
- Individual mappers can’t communicate with each other.

Slide 10
Main Components of MapReduce Execution Pipelines (Contd.,)
8. Partition - A partition is a subset of the intermediate key space (k2, v2). Each map task may emit key/value pairs to any partition.
- The Partitioner class computes a hash value for the key and assigns the partition to a specific reducer based on the result.

9. Shuffle - It moves the map outputs to the reducers.

10. Sort - The set of intermediate key/value pairs for a given reducer is automatically sorted by Hadoop to form keys/values (k2, {v2, v2,…}) before they are presented to the reducer.
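
Partitioning, shuffling, and sorting can be sketched together in plain Java. This is a minimal illustration, not Hadoop code: ShuffleSortSketch, pair, partitionFor, and shuffle are hypothetical names. The hash expression masks the sign bit before taking the remainder, so the partition index is never negative, and a TreeMap per reducer stands in for the sort.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShuffleSortSketch {

    // Small helper to build a (k2, v2) pair.
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    // Partition: hash the key, clear the sign bit, and pick one of numReducers partitions.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Shuffle + sort: route every pair to its reducer and group values under sorted keys,
    // producing (k2, {v2, v2, ...}) for each reducer.
    static List<Map<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> mapOutput, int numReducers) {
        List<Map<String, List<Integer>>> perReducer = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            perReducer.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> kv : mapOutput) {
            perReducer.get(partitionFor(kv.getKey(), numReducers))
                      .computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
        }
        return perReducer;
    }

    public static void main(String[] args) {
        System.out.println(shuffle(Arrays.asList(pair("a", 1), pair("b", 2), pair("a", 3)), 2));
    }
}
```

Since the partition depends only on the key, all values for the same key reach the same reducer regardless of which mapper emitted them.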

Slide 11
YARN – Map Reduce Phases

Slide 12
Main Components of MapReduce Execution Pipelines (Contd.,)
11. Reducer - For each key assigned to a given reducer, its reduce() method is called once.
- Its iterator returns the values associated with a key in an undefined order.
- The reducer typically transforms the input key/value pairs into output pairs (k3, v3).
12. OutputFormat - It defines the location of the output data and the RecordWriter used for storing the resulting data.
13. RecordWriter - It defines how individual output records are written.

Slide 13
Main Components of MapReduce Execution Pipelines (Contd.,)
14. [Combiner] - If present, a combiner runs after the mapper and before the reducer.
- It receives all data emitted by the mapper instances on a given node as input, and tries to combine values with the same key, thus reducing the keys’ space and decreasing the number of keys (not necessarily data) that must be sorted.

15. [DistributedCache] - It enables the sharing of data globally by all nodes on the cluster.
- The shared data can be a library to be accessed by each task, a global lookup file holding key/value pairs, jar files (or archives) containing executable code, and so on.
- The cache copies the file(s) to the machines where the actual execution occurs, and makes them available for local use.
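
The effect of a combiner can be sketched in plain Java. CombinerSketch and combine are hypothetical names, not Hadoop classes. The sketch assumes a word-count style job where the values are summed; summing is safe to apply early because addition is associative and commutative.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerSketch {

    // Combine: sum the values per key within one node's mapper output,
    // before anything is sent over the network to the reducers.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapperOutput) {
            combined.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String word : "to be or not to be".split(" ")) {
            emitted.add(new SimpleEntry<>(word, 1));
        }
        Map<String, Integer> combined = combine(emitted);
        System.out.println(emitted.size() + " records before, " + combined.size() + " after: " + combined);
    }
}
```

The reducer still has to sum the partial counts arriving from different nodes, so the combiner shrinks the shuffle without changing the final result.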
Slide 14
Features of MapReduce

1. It completely hides the complexity of managing a large distributed cluster of machines, and coordination of
job execution between these nodes.

2. The developer’s programming model is very simple: implement the mapper and reducer functionality, plus
a driver that brings them together as a single job and configures the required parameters.

3. All user code is then packaged into a single jar file (in reality, the MapReduce framework can operate on
multiple jar files) that can be submitted for execution on the MapReduce cluster.

Slide 15
Runtime Coordination and Task Management in MapReduce

Scheduling, Synchronization, and Error and fault handling

Scheduling - The framework ensures that multiple tasks from multiple jobs are executed on the cluster.
Different schedulers provide different scheduling strategies.
1. Scheduling strategy
Strategies range from first come, first served to ensuring that all the jobs from all users get their fair share of a cluster’s execution.

2. Speculative execution
If a task runs much slower than expected, the framework can start a duplicate instance of it on another node. This ensures that
unanticipated slowness of a given machine will not slow down execution of the task. It is enabled by default ([Link] = true).

Slide 16
Runtime Coordination and Task Management in MapReduce (Contd.,)

Synchronization - The reduce phase cannot start until all of the maps’ key/value pairs have been emitted,
so synchronization is required between the map and reduce phases of processing. At this point, the
intermediate key/value pairs are sorted and grouped by key.

Error and fault handling - To accomplish job execution in an environment where errors and faults are the
norm, the JobTracker attempts to restart failed task executions.
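
Re-executing a failed task can be sketched as a bounded retry loop in plain Java. TaskRetrySketch and runWithRetries are hypothetical names, and the attempt limit is an assumed parameter rather than a value taken from the slides. A real JobTracker would also reschedule the task on a different node, which this sketch does not model.

```java
import java.util.function.Supplier;

class TaskRetrySketch {

    // Runs the task and re-executes it after a failure, up to maxAttempts times in total.
    static <T> T runWithRetries(Supplier<T> task, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e;   // remember the failure and try again
            }
        }
        throw last;         // every attempt failed, so the failure is reported
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = runWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated node failure");
            }
            return "succeeded on attempt " + calls[0];
        }, 4);
        System.out.println(result);
    }
}
```

Retrying is only safe because map and reduce tasks are expected to be repeatable: running one twice on the same input must give the same output.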

Slide 17
MapReduce Execution

Slide 18
MapReduce Execution

Initial Process
Job Driver → Job Tracker → Job Client
Job Driver (uses InputFormat to partition the data, and communicates with the scheduler to get map details) → Job
Client
Regular Process
Job Tracker → Task Tracker
Job Tracker (receives the job and InputSplits from the Job Driver and submits them to the Task Tracker on the allocated node,
then starts tracking the Job Client’s submitted job, and monitors the Task Tracker to learn the job status. It
also creates the set of reducer tasks)
Task Tracker (processes its allocated job, and starts a loop to send periodic heartbeat messages)
Task Runner (uses the distributed cache to create a child JVM for the mapper and reducer)
Job Tracker (analyses the received message) → Scheduler (updates cluster status information)
Slide 19
MapReduce Execution (Contd.,)

A single node can run multiple map tasks and multiple reduce tasks.
A single JVM can be allocated to a single task or to multiple tasks.
Configuration file: [Link] = 1 → single task
= -1 → multiple tasks
= above 1 → [Link]().setInt(Job.JVM_Num_Tasks.To_Run,int)

Slide 20
MapReduce Application:
Simple implementation of a word count MapReduce job - 1
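import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;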

Slide 21
MapReduce Application:
Simple implementation of a word count MapReduce job - 2
public class WordCount extends Configured implements Tool {
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emits (word, 1) for every whitespace-separated token in the input line.
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

Slide 22
MapReduce Application:
Simple implementation of a word count MapReduce job - 3
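    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Word Count");
        job.setJarByClass(WordCount.class);
        // Set up the input
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Mapper
        job.setMapperClass(Map.class);
        // Reducer (the reducer class is not shown on these slides; the name Reduce is assumed)
        job.setReducerClass(Reduce.class);
        // Output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Execute
        boolean res = job.waitForCompletion(true);
        return res ? 0 : -1;
    }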
Slide 24
MapReduce Application:
Simple implementation of a word count MapReduce job - 4
    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new WordCount(), args);
        System.exit(res);
    }
}

Slide 25
MapReduce Job Execution – Worker-driven Load Balancing Approach

In this approach, all of the execution requests are written to a queue. Each worker tries to read a new
request from the queue, and then executes it. Once execution is complete, the worker tries to read the next
request. This type of load balancing is called worker-driven load balancing. The requester
does not know anything about execution capabilities, or even the number of workers. A worker reads a
new request only after the current one is completed, thus ensuring effective utilization of resources.
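
Worker-driven load balancing can be sketched with a shared queue and a few threads in plain Java. WorkerPullSketch and process are hypothetical names. The sketch assumes all requests are queued before the workers start, so an empty queue means the work is finished.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

class WorkerPullSketch {

    // Runs all requests on `workers` threads and returns how many requests each worker handled.
    static Map<String, Integer> process(List<Runnable> requests, int workers) {
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>(requests);
        Map<String, Integer> handled = new ConcurrentHashMap<>();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            String name = "worker-" + i;
            Thread t = new Thread(() -> {
                Runnable next;
                // A worker pulls the next request only after finishing the current one.
                // poll() returns null once the queue is drained, which ends the worker.
                while ((next = queue.poll()) != null) {
                    next.run();
                    handled.merge(name, 1, Integer::sum);
                }
            }, name);
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        AtomicInteger done = new AtomicInteger();
        List<Runnable> requests = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            requests.add(done::incrementAndGet);
        }
        System.out.println(process(requests, 3) + " total=" + done.get());
    }
}
```

The code that fills the queue never refers to a worker, so workers can be added or removed without changing the requester, and a faster worker simply ends up handling more requests.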
Slide 26
Implementing InputFormat for Multiple HBase Tables
For MapReduce jobs leveraging HBase-based data sources, Hadoop provides TableInputFormat. The
limitation of this implementation is that it supports only a single table.

Because Hadoop’s implementation of TableInputFormat supports a single table/scan pair, all of the information
about the table and scan is contained in the TableInputFormat implementation, and does not need to be
defined in the InputSplit class. With multiple tables, different splits can refer to different table/scan pairs. As a result,
you must extend the table split class to contain not only table-related information (name, start and end row,
region server location), but also a scan for a given table.
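
The information such an extended split has to carry can be sketched as a plain Java class. MultiTableSplitSketch is a hypothetical stand-in with no HBase dependency. A real implementation would extend HBase's table split class and hold an HBase Scan object, which is represented here by a String. The table names and locations in main are invented sample values.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical plain-Java stand-in for a split that carries its own scan.
class MultiTableSplitSketch {
    final String tableName;       // table-related information
    final String startRow;
    final String endRow;
    final String regionLocation;  // region server hosting this row range
    final String scanSpec;        // stand-in for the scan that belongs to this table

    MultiTableSplitSketch(String tableName, String startRow, String endRow,
                          String regionLocation, String scanSpec) {
        this.tableName = tableName;
        this.startRow = startRow;
        this.endRow = endRow;
        this.regionLocation = regionLocation;
        this.scanSpec = scanSpec;
    }

    @Override
    public String toString() {
        return tableName + "[" + startRow + ".." + endRow + "]@" + regionLocation
                + " scan(" + scanSpec + ")";
    }

    public static void main(String[] args) {
        // Two splits of one job that refer to different table/scan pairs.
        List<MultiTableSplitSketch> splits = Arrays.asList(
                new MultiTableSplitSketch("users", "a", "m", "rs1.example", "columns=info:name"),
                new MultiTableSplitSketch("orders", "0", "5", "rs2.example", "columns=d:total"));
        for (MultiTableSplitSketch split : splits) {
            System.out.println(split);
        }
    }
}
```

Keeping the scan inside the split means a record reader can open the right table with the right scan using nothing but the split it was handed.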

Slide 27
