BIG DATA & ANALYTICS
UNIT III – Map Reduce
INTRODUCTION
In MapReduce programming, jobs (applications) are split into a set of map tasks
and reduce tasks.
These tasks are then executed in a distributed fashion on a Hadoop cluster.
Each task processes the small subset of the data that has been assigned to it.
This way, Hadoop distributes the load across the cluster.
A MapReduce job takes a set of files stored in HDFS (Hadoop Distributed
File System) as input.
The map task takes care of loading, parsing, transforming, and filtering the data.
The reduce task takes care of grouping and aggregating the data produced by the
map tasks to generate the final output.
Each map task is broken into the following phases:
1. InputSplit
2. RecordReader
3. Mapper
4. Combiner
5. Partitioner
The output produced by a map task is known as intermediate keys and values.
These intermediate keys and values are sent to the reducers.
The reduce tasks are broken into the following phases:
1. Shuffle
2. Sort
3. Reducer
4. Output Format
Hadoop tries to assign each map task to a DataNode where the data to be processed
resides. This way, Hadoop achieves data locality.
Data locality means that the data is not moved over the network;
only the computational code is moved to the data, which saves network bandwidth.
Advantages of Hadoop Data locality
i. Faster Execution: With data locality, the program is moved to the node where the data
resides instead of moving large data to the program, which makes Hadoop faster.
The program is usually far smaller than the data it processes, so moving the data
would make the network a bottleneck.
ii. High Throughput: Data locality increases the overall throughput of the system.
Inputs and outputs of the map and reduce functions are key-value pairs.
MAPPER
A mapper maps the input key-value pairs into a set of intermediate key-value pairs.
Maps are the individual tasks that transform input records into intermediate
key-value pairs.
1. InputSplit – It is the logical representation of data. It describes the unit of work
that is processed by a single map task in a MapReduce program.
A Hadoop InputSplit represents the data that is processed by an individual Mapper.
The split is divided into records, and the mapper processes each record (which is a
key-value pair).
The length of an InputSplit is measured in bytes, and every InputSplit has storage
locations (hostname strings).
The MapReduce system uses the storage locations to place map tasks as close to the
split’s data as possible.
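The placement idea can be sketched in a few lines of Python. This is a simplified illustration and not Hadoop's scheduler: the function `pick_node`, the host names, and the free-slot map are all hypothetical.

```python
def pick_node(split_hosts, free_slots):
    """Return (node, kind) for a map task, preferring hosts that store the split.

    split_hosts: hostnames recorded in the InputSplit (where the data lives).
    free_slots:  dict of hostname -> number of free task slots (hypothetical).
    """
    # First choice: a node that holds the data and has a free slot (data-local).
    for host in split_hosts:
        if free_slots.get(host, 0) > 0:
            return host, "data-local"
    # Fallback: any node with capacity; the data must then cross the network.
    for host, slots in free_slots.items():
        if slots > 0:
            return host, "remote"
    return None, "no capacity"


free = {"node1": 0, "node2": 1, "node3": 2}
print(pick_node(["node1", "node2"], free))  # ('node2', 'data-local')
print(pick_node(["node1"], free))           # ('node2', 'remote')
```

In the second call the only host that stores the split is busy, so the task runs elsewhere and loses the locality benefit.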
Note:
An InputSplit does not contain the input data; it is just a reference to the data.
InputSplits in Hadoop can be user defined (through a custom InputFormat).
The user can control the split size according to the size of the data in a MapReduce program.
The number of map tasks is equal to the number of InputSplits.
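As a rough illustration of the note above, the following Python sketch cuts a file into logical splits that hold only a path, a byte range, and host names. `Split`, `compute_splits`, the path, and the sizes are hypothetical; Hadoop's own split calculation applies additional rules that are not modelled here.

```python
from collections import namedtuple

# Hypothetical stand-in for an InputSplit: a reference to data, not the data itself.
Split = namedtuple("Split", ["path", "start", "length", "hosts"])


def compute_splits(path, file_size, split_size, hosts):
    """Cut a file of file_size bytes into logical splits of at most split_size bytes."""
    splits = []
    start = 0
    while start < file_size:
        length = min(split_size, file_size - start)
        # Simplification: every split reports the same hosts.
        splits.append(Split(path, start, length, hosts))
        start += length
    return splits


# A 300-byte file with a 128-byte split size yields three splits,
# so the job would run three map tasks.
splits = compute_splits("/data/input.txt", 300, 128, ["node1", "node2"])
for s in splits:
    print(s.start, s.length)
print("map tasks:", len(splits))
```

Changing `split_size` changes the number of splits, and therefore the number of map tasks.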
2. RecordReader: The RecordReader loads data from its source and converts it into
key-value pairs suitable for reading by the mapper. The “start” is the byte position in the
file where the RecordReader should start generating key/value pairs and the “end” is where
it should stop reading records. A RecordReader acts as an iterator over records.
Types of Hadoop RecordReader in MapReduce: two commonly used RecordReaders are:
i. LineRecordReader (used by TextInputFormat)
ii. SequenceFileRecordReader (used by SequenceFileInputFormat)
i. LineRecordReader
LineRecordReader is the default RecordReader in Hadoop.
It treats each line of the input file as a value; the associated key is the byte
offset of the line in the file.
If the split is not the first split, LineRecordReader always skips the first line (or
partial line) in the split. At the end of the split it reads one line past the split
boundary (if data is available, that is, if it is not the last split).
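The skip-first-line and read-past-the-boundary rules can be modelled in a few lines. The sketch below is a simplified Python model of the behaviour described above, not Hadoop's LineRecordReader source; `read_split` and the sample data are hypothetical.

```python
def read_split(data, start, end):
    """Yield (byte_offset, line) pairs for the split covering bytes [start, end)."""
    pos = start
    if start != 0:
        # Not the first split: discard everything up to and including the first
        # newline; that line is read by the reader of the previous split.
        nl = data.find(b"\n", start)
        pos = len(data) if nl == -1 else nl + 1
    # Keep starting new lines while the line begins at or before `end`,
    # so the last line read may extend past the split boundary.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        line_end = len(data) if nl == -1 else nl
        yield pos, data[pos:line_end].decode("utf-8")
        pos = line_end + 1


data = b"alpha\nbravo\ncharlie\ndelta\n"
# The boundary at byte 13 falls in the middle of "charlie".
print(list(read_split(data, 0, 13)))          # [(0, 'alpha'), (6, 'bravo'), (12, 'charlie')]
print(list(read_split(data, 13, len(data))))  # [(20, 'delta')]
```

The two rules together mean that a line cut by a split boundary is read exactly once, by the split in which it starts.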
ii. SequenceFileRecordReader: It reads data as specified by the header of a sequence
file.
It presents the tasks with keys and values.
The key is the positional information and the value is the chunk of data that constitutes
the record.
3. Map: The map function works on each key-value pair produced by the RecordReader and
generates zero or more intermediate key-value pairs. What the key and the value are
depends on the context, that is, on the problem being solved. The output of the mapper
is called intermediate data (key-value pairs), which serves as input to the
reduce side.
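A word-count map function is a common way to picture this step. The sketch below is plain Python rather than the Hadoop Mapper API; `word_count_map` and the sample lines are hypothetical.

```python
def word_count_map(offset, line):
    """Map one (byte offset, line) record to zero or more (word, 1) pairs."""
    # The key (offset) is ignored; only the line content matters for word count.
    for word in line.lower().split():
        yield word, 1


print(list(word_count_map(0, "Deer Bear River")))  # [('deer', 1), ('bear', 1), ('river', 1)]
print(list(word_count_map(16, "")))                # []
```

The empty line shows the "zero or more" case: a record may produce no intermediate pairs at all.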
4. Combiner:
The Mapper generates large chunks of intermediate data.
This intermediate data is passed on to the Reducer for further processing, which can lead
to enormous network congestion.
The Hadoop Combiner plays a key role in reducing this network congestion.
The combiner in MapReduce is also known as a ‘Mini-reducer’ or local reducer. The
primary job of the Combiner is to process the output data from the Mapper before passing
it to the Reducer. It runs after the Mapper and before the Reducer, and its use is optional.
Figure: MapReduce program without Combiner vs. MapReduce program with Combiner
MapReduce program without Combiner
The input is split between two mappers, which together generate 9 intermediate key/value
pairs. Without a combiner, the mappers send these 9 pairs directly to the reducer, and
the transfer consumes network bandwidth (the capacity for moving data between
machines). The larger the data, the more time the transfer to the reducer takes.
MapReduce program with Combiner
If we use a Hadoop combiner, the combiner aggregates the intermediate data (9 key/value
pairs) locally on the mapper side before it is sent to the reducer and generates 4 key/value
pairs as output.
Advantages of MapReduce Combiner
•Hadoop Combiner reduces the time taken for data transfer between mapper and
reducer.
•It decreases the amount of data that needs to be processed by the reducer.
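The reduction from 9 pairs to 4 can be reproduced with a small Python sketch. The mapper outputs below are made-up sample data chosen to match those counts, and `combine` is a hypothetical helper, not Hadoop's Combiner API.

```python
from collections import defaultdict


def combine(pairs):
    """Locally sum the values of one mapper's output, key by key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


# Hypothetical outputs of two mappers: 5 + 4 = 9 intermediate pairs.
mapper1 = [("car", 1), ("bus", 1), ("car", 1), ("car", 1), ("bus", 1)]
mapper2 = [("train", 1), ("tram", 1), ("train", 1), ("tram", 1)]

combined = [combine(mapper1), combine(mapper2)]
print(combined)  # [[('bus', 2), ('car', 3)], [('train', 2), ('tram', 2)]]
print("pairs sent to reducer:", sum(len(c) for c in combined))  # 4 instead of 9
```

Summing is safe to apply locally because addition is associative and commutative; an operation without those properties would generally not be a good fit for a combiner.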
5. Partitioner: The partitioner takes the intermediate key-value pairs produced by
the mapper, splits them into shards, and sends each shard to a particular reducer as
per the user-specified code.
Usually, all values with the same key go to the same reducer. The partitioned data of
each map task is written to the local disk of that machine and pulled by the respective
reducer.
The Partitioner in MapReduce controls the partitioning of the keys of the
intermediate mapper output. A hash function is applied to the key (or a subset of the
key) to derive the partition. The total number of partitions equals the number of reduce
tasks.
Each mapper output is partitioned according to its key: records having the
same key go into the same partition (within each mapper), and then each
partition is sent to a reducer. The Partitioner class determines which partition a given
(key, value) pair goes to.
The partition phase takes place after the map phase and before the reduce phase.
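Hash partitioning can be sketched as follows. Hadoop's default HashPartitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; the Python sketch below substitutes `zlib.crc32` as a stand-in hash purely so that results are stable between runs. `partition` and `partition_output` are hypothetical names.

```python
import zlib


def partition(key, num_reducers):
    """Return the reducer index for a key using a deterministic hash."""
    # crc32 is an assumption of this sketch (stable across runs);
    # it is not the hash that Hadoop uses.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_output(pairs, num_reducers):
    """Split one mapper's output into num_reducers shards."""
    shards = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        shards[partition(key, num_reducers)].append((key, value))
    return shards


pairs = [("car", 3), ("bus", 2), ("car", 1), ("train", 2)]
shards = partition_output(pairs, 3)
for i, shard in enumerate(shards):
    print(i, shard)
```

Whatever the hash, the property that matters is that equal keys always map to the same index, so both "car" records end up in the same shard.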
REDUCER
The primary chore of the Reducer is to reduce a set of intermediate values (the ones
that share a common key) to a smaller set of values.
Shuffling and Sorting in Hadoop MapReduce
The Reducer has three primary phases: Shuffle and Sort, Reduce, and
Output Format.
1. Shuffle and Sort:
This phase takes the output of all the partitioners and downloads it to
the local machine where the reducer is running.
These individual data streams are then sorted by key, producing one larger
data list.
The main purpose of this sort is to group identical keys so that their values
can be easily iterated over by the reduce task.
Analogy:
Manual Sorting & Counting
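The merge-then-group behaviour of this phase can be sketched with the Python standard library. This is an in-memory illustration only; `shuffle_and_sort` and the sample shards are hypothetical, and the sketch assumes each mapper's shard already arrives sorted by key.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(map_outputs):
    """Merge sorted per-mapper shards and group the values by key."""
    # Each shard is sorted by key, so a k-way merge yields one sorted stream.
    merged = heapq.merge(*map_outputs, key=itemgetter(0))
    # Adjacent records with equal keys form one group for the reduce function.
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, [value for _, value in group]


# Hypothetical shards fetched from two mappers for the same reducer.
from_mapper1 = [("bus", 2), ("car", 3)]
from_mapper2 = [("bus", 1), ("tram", 2)]

print(list(shuffle_and_sort([from_mapper1, from_mapper2])))
# [('bus', [2, 1]), ('car', [3]), ('tram', [2])]
```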
2. Reduce: The reducer takes the grouped data produced by the shuffle and
sort phase and applies the reduce function to one group at a time. The
reduce function iterates over all the values associated with a key. The reduce
function can perform operations such as aggregation, filtering, and
combining data. Once it is done, the output of the reducer (zero or more key-value
pairs) is sent to the output format.
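A summing reducer illustrates the one-group-at-a-time processing. The names `sum_reduce` and `run_reducer` and the grouped input are hypothetical; this is plain Python, not the Hadoop Reducer API.

```python
def sum_reduce(key, values):
    """Reduce one group: emit the key with the sum of its values."""
    yield key, sum(values)


def run_reducer(grouped, reduce_fn):
    """Apply reduce_fn to one group at a time, collecting all output pairs."""
    output = []
    for key, values in grouped:
        output.extend(reduce_fn(key, values))
    return output


grouped = [("bus", [2, 1]), ("car", [3]), ("tram", [2])]
print(run_reducer(grouped, sum_reduce))
# [('bus', 3), ('car', 3), ('tram', 2)]
```

Because `reduce_fn` is a generator, a reduce function that filters can simply yield nothing for a group.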
3. Output Format: The output format separates each key and value with a tab (the default)
and writes the pair out to a file using a RecordWriter.
The Hadoop Output Format checks the output specification of the job.
It determines which RecordWriter implementation is used to write output to the output files.
i. Hadoop RecordWriter
As we know, the Reducer takes as input a set of intermediate key-value pairs produced by
the mapper and runs a reduce function on them to generate output that is again zero or
more key-value pairs.
The RecordWriter writes these output key-value pairs from the Reducer phase to the output
files.
ii. Hadoop Output Format
The Hadoop RecordWriter takes output data from the Reducer and writes this data to output
files. The way these output key-value pairs are written to the output files by the
RecordWriter is determined by the Output Format. The OutputFormat and InputFormat
functions are alike. OutputFormat instances provided by Hadoop are used to write to files on
HDFS or the local disk. OutputFormat describes the output specification for a MapReduce
job. On the basis of the output specification:
• MapReduce job checks that the output directory does not already exist.
• OutputFormat provides the RecordWriter implementation to be used to write
the output files of the job.
• Output files are stored in a FileSystem.
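The two responsibilities listed above, checking the output directory and writing tab-separated records, can be mimicked on the local filesystem. The sketch below is an assumption-laden stand-in: `write_output` is a hypothetical function, it writes to a temporary local directory rather than HDFS, and it reuses the `part-r-00000` file name that Hadoop gives to the first reducer's output.

```python
import os
import tempfile


def write_output(pairs, out_dir, separator="\t"):
    """Write key-value pairs to out_dir/part-r-00000, one pair per line."""
    # Mirror the output-specification check: refuse to overwrite existing output.
    if os.path.exists(out_dir):
        raise FileExistsError(f"output directory already exists: {out_dir}")
    os.makedirs(out_dir)
    path = os.path.join(out_dir, "part-r-00000")
    with open(path, "w", encoding="utf-8") as f:
        for key, value in pairs:
            f.write(f"{key}{separator}{value}\n")
    return path


base = tempfile.mkdtemp()
out_dir = os.path.join(base, "wordcount-out")
path = write_output([("bus", 3), ("car", 3)], out_dir)
with open(path, encoding="utf-8") as f:
    print(f.read())
```

Calling `write_output` a second time with the same `out_dir` raises `FileExistsError`, which corresponds to a job failing because its output directory already exists.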
Hadoop MapReduce : Combined working of Map and Reduce
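To see the phases working together, the following Python sketch simulates a complete word-count job in memory. It is an illustration only: `run_job` and the sample input are hypothetical, the map and combine steps are fused into one loop for brevity, and `zlib.crc32` stands in for the partitioning hash.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter
import zlib


def run_job(mapper_inputs, num_reducers=2):
    """Simulate word count: map, combine, partition, shuffle/sort, reduce."""
    # shards[r] collects what every map task sends to reducer r.
    shards = [[] for _ in range(num_reducers)]
    for lines in mapper_inputs:                  # one map task per input split
        local = defaultdict(int)
        for line in lines:                       # map: each word counts as 1
            for word in line.lower().split():
                local[word] += 1                 # combine: local sum per key
        for word, count in local.items():        # partition by hash of the key
            r = zlib.crc32(word.encode("utf-8")) % num_reducers
            shards[r].append((word, count))

    result = {}
    for shard in shards:                         # one reduce task per shard
        shard.sort(key=itemgetter(0))            # sort brings equal keys together
        for word, group in groupby(shard, key=itemgetter(0)):
            result[word] = sum(count for _, count in group)  # reduce: global sum
    return result


inputs = [["deer bear river", "car car river"], ["deer car bear"]]
print(sorted(run_job(inputs).items()))
# [('bear', 2), ('car', 3), ('deer', 2), ('river', 2)]
```

The final counts do not depend on `num_reducers`; the number of reducers only changes which reduce task handles which keys.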