Understanding MapReduce in Hadoop

MR(BDA)

Uploaded by

mbhavya5867

BIG DATA & ANALYTICS

UNIT III – Map Reduce

INTRODUCTION

In MapReduce programming, jobs (applications) are split into a set of map tasks and reduce tasks.

These tasks are then executed in a distributed fashion on a Hadoop cluster.

Each task processes the small subset of the data that has been assigned to it.

This way, Hadoop distributes the load across the cluster.

A MapReduce job takes a set of files stored in HDFS (Hadoop Distributed File System) as input.

The map task takes care of loading, parsing, transforming, and filtering.

The reduce task takes care of grouping and aggregating the data produced by the map tasks to generate the final output.
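
The division of work between map and reduce tasks can be sketched with a word count. This is a minimal in-memory illustration in Python, not Hadoop code: the function names `map_task` and `reduce_task` and the two input splits are made up for the example.

```python
from collections import defaultdict

def map_task(lines):
    """Parse and transform: emit (word, 1) for every token in the input lines."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def reduce_task(pairs):
    """Group intermediate pairs by key and aggregate the values of each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in sorted(groups.items())}

# Hypothetical input: two small splits, each handled by its own map task.
splits = [["big data big"], ["data big cluster"]]
intermediate = [pair for split in splits for pair in map_task(split)]
result = reduce_task(intermediate)
print(result)
```

On a real cluster the map tasks would run on different nodes and the grouping would be done by the framework; here everything runs in one process to keep the data flow visible.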

Each map task is broken into the following phases:

1. InputSplit.

2. RecordReader.

3. Mapper.

4. Combiner.

5. Partitioner.

The output produced by a map task is known as intermediate keys and values.

These intermediate keys and values are sent to the reducer.

The reduce tasks are broken into the following phases:

1. Shuffle.

2. Sort.

3. Reducer.

4. Output Format.

Hadoop tries to assign each map task to a DataNode where the data to be processed resides. This way, Hadoop achieves data locality.

Data locality means that the data is not moved over the network; only the computational code is moved to the data, which saves network bandwidth.
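
The placement rule can be sketched as follows. This is a simplified illustration, not Hadoop's scheduler: the function `assign_map_task` and the node names are hypothetical, and the sketch assumes at least one node is free.

```python
def assign_map_task(split_hosts, free_nodes):
    """Prefer a free node that already stores the split; otherwise fall back.

    Returns (node, is_data_local). Assumes free_nodes is not empty.
    """
    for host in split_hosts:
        if host in free_nodes:
            return host, True          # code moves to the data
    fallback = min(free_nodes)         # any free node; data must cross the network
    return fallback, False

# Hypothetical cluster state
local_choice = assign_map_task(["node2", "node3"], {"node1", "node3"})
remote_choice = assign_map_task(["node2"], {"node1", "node4"})
print(local_choice, remote_choice)
```

A real scheduler also weighs intermediate options such as a node on the same rack; the sketch only distinguishes local from non-local placement.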
Advantages of Hadoop Data locality

i. Faster Execution: With data locality, the program is moved to the node where the data resides instead of moving large data to the program. The program is typically far smaller than the data it processes, so moving the data would make network transfer the bottleneck.

ii. High Throughput: Data locality increases the overall throughput of the system.

Inputs and outputs for the map and reduce functions are key-value pairs.
MAPPER
A mapper maps the input key-value pairs into a set of intermediate key-value pairs.
Maps are the individual tasks that transform input records into intermediate key-value pairs.
1. InputSplit – It is the logical representation of data. It describes the unit of work that is processed by a single map task in a MapReduce program.
A Hadoop InputSplit represents the data which is processed by an individual Mapper.
The split is divided into records, and the mapper processes each record (which is a key-value pair).

The length of an InputSplit is measured in bytes, and every InputSplit has storage locations (hostname strings).
The MapReduce system uses the storage locations to place map tasks as close to the split's data as possible.
Note:
An InputSplit does not contain the input data; it is just a reference to the data.
The InputSplit in Hadoop can be user defined.
The user can control the split size according to the size of the data in the MapReduce program.
The number of map tasks is equal to the number of InputSplits.
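
The idea that a split is only a reference (file, byte range, hostnames) can be sketched as below. The `InputSplit` class and `compute_splits` function here are illustrative Python stand-ins, not Hadoop's classes, and the file path, sizes, and host placement are invented. Real split computation has additional rules that this sketch leaves out.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class InputSplit:
    path: str          # file the split refers to (the data itself is not copied)
    start: int         # byte offset where the split begins
    length: int        # split length in bytes
    hosts: List[str]   # hostnames that store the split's data

def compute_splits(path, file_size, split_size, hosts_for_offset):
    """Cut a file of file_size bytes into splits of at most split_size bytes."""
    splits = []
    start = 0
    while start < file_size:
        length = min(split_size, file_size - start)
        splits.append(InputSplit(path, start, length, hosts_for_offset(start)))
        start += length
    return splits

# Hypothetical 300-byte file, 128-byte splits, made-up host placement
def hosts(offset):
    return ["node%d" % (offset // 128 + 1)]

splits = compute_splits("/data/input.txt", 300, 128, hosts)
num_map_tasks = len(splits)   # one map task per split
print(num_map_tasks, splits)
```

Changing `split_size` changes the number of splits, and with it the number of map tasks.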
2. RecordReader: The RecordReader loads data from its source and converts it into key-value pairs suitable for reading by the mapper. The “start” is the byte position in the file where the RecordReader should start generating key/value pairs and the “end” is where it should stop reading records. A RecordReader acts as an iterator over records.

Types of Hadoop RecordReader in MapReduce: two commonly used RecordReaders are:
i. LineRecordReader (used by TextInputFormat)
ii. SequenceFileRecordReader (used by SequenceFileInputFormat)
i. LineRecordReader
LineRecordReader is the default RecordReader in Hadoop (it is the one used by TextInputFormat).
It treats each line of the input file as a new value; the associated key is the byte offset of the line.
LineRecordReader always skips the first line in the split (or the part of it that falls in the split) if it is not the first split. At the end, it reads one line past the boundary of the split (if data is available, that is, if it is not the last split).
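
The boundary rule can be sketched in a few lines of Python. This is an illustration of the rule described above, not the Hadoop implementation: `read_split` is a hypothetical helper that works on an in-memory byte string with `\n` line endings.

```python
def read_split(data: bytes, start: int, length: int):
    """Yield (byte_offset, line) records for one split of a newline-delimited file."""
    end = start + length
    pos = start
    if start != 0:
        # Not the first split: skip the (possibly partial) first line,
        # because the reader of the previous split is responsible for it.
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1
    # Read every line that starts at or before the split boundary, which may
    # run one line past the end of the split.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        yield pos, data[pos:stop].decode("utf-8")
        pos = stop + 1

# Hypothetical 23-byte file cut into splits that fall in the middle of lines
data = b"alpha\nbeta\ngamma\ndelta\n"
split_records = [list(read_split(data, s, n)) for s, n in [(0, 8), (8, 8), (16, 7)]]
for records in split_records:
    print(records)
```

Because each reader skips its first line and reads one line past its end, every line is produced exactly once even though the split boundaries cut through lines.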

ii. SequenceFileRecordReader: It reads data as specified by the header of a sequence file.
It presents the tasks with keys and values.
The key is the positional information and value is a chunk of data that constitutes the
record.
3. Map: The map function works on the key-value pairs produced by the RecordReader and generates zero or more intermediate key-value pairs. What the key and value are is decided by the MapReduce program, based on the context. The output of the mapper is called intermediate data (key-value pairs) and is in a form the reducer can consume.
4. Combiner:
Large chunks of intermediate data are generated by the Mapper.
This intermediate data is passed on to the Reducer for further processing, which can lead to enormous network congestion.
The Hadoop Combiner plays a key role in reducing network congestion.

The combiner in MapReduce is also known as a ‘Mini-reducer’ or local reducer. The primary job of the Combiner is to process the output data from the Mapper before passing it to the Reducer. It runs after the Mapper and before the Reducer, and its use is optional.
MapReduce program without Combiner
The input is split between two mappers, which together generate 9 keys. We now have 9 key/value pairs of intermediate data, and the mappers send this data directly to the reducer. Sending it consumes network bandwidth, and the larger the data, the longer the transfer to the reducer takes.

MapReduce program with Combiner

If we use a Hadoop combiner, the combiner aggregates the intermediate data (9 key/value pairs) locally before it is sent to the reducer and generates 4 key/value pairs as output.

Advantages of MapReduce Combiner

• The Hadoop Combiner reduces the time taken for data transfer between mapper and reducer.
• It decreases the amount of data that needs to be processed by the reducer.
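
The reduction from 9 to 4 pairs can be reproduced with a small sketch. The keys and counts below are invented for illustration (the original figure's data is not available), and `combine` is a hypothetical stand-in for a summing combiner.

```python
from collections import Counter

def combine(pairs):
    """Mini-reducer: sum the values of equal keys within one mapper's output."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

# Hypothetical intermediate output of two mappers (9 key/value pairs in total)
mapper1 = [("car", 1), ("bus", 1), ("car", 1), ("bus", 1), ("car", 1)]
mapper2 = [("train", 1), ("car", 1), ("train", 1), ("train", 1)]

without_combiner = mapper1 + mapper2                   # 9 pairs cross the network
with_combiner = combine(mapper1) + combine(mapper2)    # 4 pairs cross the network
print(len(without_combiner), len(with_combiner), with_combiner)
```

Summing works as a combiner because addition gives the same total no matter how the values are grouped; an operation without that property (a plain average, for example) cannot simply be reused as a combiner.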
5. Partitioner: The partitioner takes the intermediate key-value pairs produced by the mapper, splits them into shards, and sends each shard to a particular reducer as per the user-specified code.

Usually, all pairs with the same key go to the same reducer. The partitioned data of each map task is written to the local disk of that machine and pulled by the respective reducer.
The Partitioner in MapReduce controls the partitioning of the keys of the intermediate mapper output. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is equal to the number of reduce tasks.

Each mapper's output is partitioned according to the key: records having the same key go into the same partition (within each mapper), and then each partition is sent to a reducer. The Partitioner class determines which partition a given (key, value) pair will go to.

The partition phase takes place after the map phase and before the reduce phase.
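
Hash partitioning can be sketched as "hash the key, clear the sign bit, take the remainder by the number of reduce tasks". The Python below is an illustration under assumptions: `java_string_hash` emulates Java's `String.hashCode()` only to get a deterministic hash, and Hadoop's own key types may hash differently; the sample pairs are made up.

```python
def java_string_hash(s: str) -> int:
    """Emulate Java's String.hashCode() as a signed 32-bit integer."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def partition(key: str, num_reduce_tasks: int) -> int:
    """Hash partitioning: clear the sign bit, then take the remainder."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduce_tasks

# Hypothetical mapper output distributed over 3 reduce tasks
pairs = [("a", 1), ("b", 1), ("a", 1), ("c", 1)]
shards = {r: [] for r in range(3)}
for key, value in pairs:
    shards[partition(key, 3)].append((key, value))
print(shards)
```

Both `("a", 1)` pairs land in the same shard, which is what guarantees that a single reducer sees every value for a given key.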

REDUCER
The primary task of the Reducer is to reduce a set of intermediate values (the ones that share a common key) to a smaller set of values.

Shuffling and Sorting in Hadoop MapReduce

The Reducer has three primary phases: Shuffle and Sort, Reduce, and
Output Format.

1. Shuffle and Sort:

This phase takes the output of all the partitioners and downloads it onto the local machine where the reducer is running.
These individual pieces of data are then sorted by key and merged into one larger data list.
The main purpose of this sort is to group equal keys together so that their values can be easily iterated over by the reduce task.
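
The merge-then-group step can be sketched with the Python standard library. This is an in-memory illustration, assuming each fetched partition is already sorted by key; the function name `shuffle_and_sort` and the sample partitions are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(map_outputs):
    """Merge the sorted outputs fetched from every map task, then group by key."""
    merged = heapq.merge(*map_outputs, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, [value for _, value in group]

# Hypothetical partitions pulled from three map tasks, each sorted by key
map_outputs = [
    [("bus", 2), ("car", 3)],
    [("car", 1), ("train", 3)],
    [("bus", 1)],
]
grouped = list(shuffle_and_sort(map_outputs))
for key, values in grouped:
    print(key, values)
```

After this step each key appears once with all of its values, which is the shape the reduce function iterates over.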
Analogy: manual sorting and counting.

2. Reduce: The reducer takes the grouped data produced by the shuffle and sort phase and applies the reduce function, processing one group at a time. The reduce function iterates over all the values associated with a key. The reduce function can perform various operations such as aggregation, filtering, and combining data. Once it is done, the output of the reducer (zero or more key-value pairs) is sent to the output format.
3. Output Format: The output format separates each key and value with a tab (by default) and writes the pair out to a file using a record writer.
The Hadoop OutputFormat checks the output specification of the job.
It determines which RecordWriter implementation is used to write the output to the output files.

i. Hadoop RecordWriter
The Reducer takes as input the set of intermediate key-value pairs produced by the mapper and runs a reduce function on them to generate output that is again zero or more key-value pairs.
The RecordWriter writes these output key-value pairs from the Reducer phase to the output files.
ii. Hadoop Output Format
The Hadoop RecordWriter takes output data from the Reducer and writes it to output files. The way these output key-value pairs are written to the output files by the RecordWriter is determined by the OutputFormat. The OutputFormat and InputFormat functions are alike. OutputFormat instances provided by Hadoop are used to write to files on HDFS or the local disk. OutputFormat describes the output specification for a MapReduce job. On the basis of the output specification:
• The MapReduce job checks that the output directory does not already exist.
• OutputFormat provides the RecordWriter implementation to be used to write the output files of the job.
• Output files are stored in a FileSystem.
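
The two responsibilities listed above (checking the output directory, then writing tab-separated pairs) can be sketched on the local file system. This is a hypothetical Python stand-in, not Hadoop's OutputFormat API: `write_output` is an invented helper, and the file name `part-r-00000` simply mirrors the usual naming of reducer output files.

```python
import os
import tempfile

def write_output(output_dir, pairs, separator="\t"):
    """Refuse to overwrite an existing directory, then write one pair per line."""
    if os.path.exists(output_dir):
        raise FileExistsError("Output directory %s already exists" % output_dir)
    os.makedirs(output_dir)
    part_file = os.path.join(output_dir, "part-r-00000")
    with open(part_file, "w", encoding="utf-8") as writer:
        for key, value in pairs:
            writer.write("%s%s%s\n" % (key, separator, value))
    return part_file

# Hypothetical reducer output written into a fresh temporary location
base = tempfile.mkdtemp()
out_dir = os.path.join(base, "wordcount-out")
path = write_output(out_dir, [("bus", 3), ("car", 4)])
print(path)
```

Calling `write_output` a second time with the same directory raises an error, which mirrors why a job fails when its output directory already exists.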
Hadoop MapReduce: Combined working of Map and Reduce
