Fundamentals of MapReduce with Example
MapReduce is one of the core building blocks of processing in the Hadoop
framework, and it was the genesis of the Hadoop processing model. MapReduce is a
programming model that allows us to perform parallel and distributed processing
on huge data sets.
MapReduce consists of two distinct tasks: Map and Reduce. As the name
MapReduce suggests, the reducer phase takes place after the mapper phase has been
completed. The first is the map job, where a block of data is read and processed
to produce key-value pairs as intermediate outputs. The output of a mapper or map
job (key-value pairs) is the input to the reducer. The reducer then aggregates those
intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples
or key-value pairs, which is the final output.
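The two phases described above can be sketched in a few lines of plain Python. This is only a single-machine illustration, not Hadoop's API: the names map_phase, shuffle, reduce_phase, line_mapper and sum_reducer are hypothetical, and the explicit grouping step stands in for the work a real framework does between the mapper and the reducer.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record and collect the intermediate pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group intermediate values by key so each key reaches the reducer once."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def line_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    return [(word, 1) for word in line.split()]


def sum_reducer(key, values):
    """Collapse all values for one key into a single total."""
    return sum(values)


# Made-up input, used only to exercise the functions.
lines = ["a b a", "b c"]
intermediate = map_phase(lines, line_mapper)
result = reduce_phase(shuffle(intermediate), sum_reducer)
print(result)  # {'a': 2, 'b': 2, 'c': 1}
```

Five intermediate pairs go in and three key-value pairs come out, which is the "smaller set of tuples" the reducer is expected to produce.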
But why did MapReduce come into the picture? The answer is pretty simple. Traditional
enterprise systems normally have a centralized server to store and process data.
This approach is not suitable for data that has one or more of the
following aspects: velocity, variety, volume and complexity, because the single
centralized server becomes a bottleneck.
Google solved this bottleneck using an algorithm called MapReduce.
MapReduce divides a task into small parts and assigns them to many computers.
Later, the results are collected in one place and integrated to form the result
dataset.
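The divide, assign and collect idea can be sketched as follows. This is a minimal sketch under stated assumptions: a thread pool stands in for the "many computers", the helper names (split_into_chunks, count_chunk, distributed_count) are hypothetical, and the input list is made up for illustration. A real cluster would also handle data placement and failures, which are not modelled here.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, num_chunks):
    """Divide the input into roughly equal parts, one per worker."""
    size = max(1, -(-len(items) // num_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_chunk(chunk):
    """Work done by one worker: count the items in its own chunk only."""
    return Counter(chunk)


def distributed_count(items, num_workers=3):
    """Assign chunks to workers, then integrate the partial results in one place."""
    chunks = split_into_chunks(items, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(count_chunk, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # update() adds counts, it does not overwrite them
    return total


# Made-up input, used only to exercise the functions.
words = ["x", "y", "x", "z", "y", "x"]
totals = distributed_count(words)
print(totals)
```

Each worker sees only its own chunk, so the per-chunk counts are incomplete until they are merged in the final loop.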
For example, a MapReduce job that counts the words in a set of tweets performs the
following actions:
• Tokenize – Tokenizes the tweets into maps of tokens and writes them as key-value
pairs.
• Filter – Filters unwanted words from the maps of tokens and writes the filtered
maps as key-value pairs.
• Count – Generates a token counter per word.
• Aggregate Counters – Prepares an aggregate of similar counter values into small
manageable units.
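The four actions above can be sketched as four small functions. Everything specific here is an assumption made for illustration: the sample tweets, the STOP_WORDS set of "unwanted words", lower-casing during tokenization, and the reading of "Aggregate Counters" as merging per-tweet counters into one total.

```python
from collections import Counter

# Assumed list of unwanted words; a real job would use its own list.
STOP_WORDS = {"the", "is", "a", "to"}


def tokenize(tweet):
    """Tokenize: split a tweet into tokens and emit (token, 1) pairs."""
    return [(token, 1) for token in tweet.lower().split()]


def filter_tokens(pairs):
    """Filter: drop pairs whose token is an unwanted word."""
    return [(token, n) for token, n in pairs if token not in STOP_WORDS]


def count(pairs):
    """Count: build a counter per word from the filtered pairs."""
    counter = Counter()
    for token, n in pairs:
        counter[token] += n
    return counter


def aggregate(counters):
    """Aggregate Counters: merge the individual counters into one summary."""
    total = Counter()
    for counter in counters:
        total.update(counter)
    return total


# Made-up tweets, used only to exercise the pipeline.
tweets = ["Hadoop is a framework", "the Hadoop framework scales"]
per_tweet = [count(filter_tokens(tokenize(tweet))) for tweet in tweets]
totals = aggregate(per_tweet)
print(totals)
```

In MapReduce terms, tokenizing and filtering are natural work for the mapper, while counting and aggregating are natural work for the reducer.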
MapReduce consists of two steps:
• Map Function – It takes a set of data and converts it into another set of data,
where individual elements are broken down into tuples (key-value pairs).
Example -
Input - Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS,
caR, CAR, car, BUS, TRAIN.
Convert into another set of data (Key, Value) - (Bus,1), (Car,1), (bus,1),
(car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1), (TRAIN,1),
(BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1).
• Reduce Function – Takes the output from Map as an input and combines
those data tuples into a smaller set of tuples.
Example -
Input – Set of tuples from previous step.
Output – Smaller set of tuples, with keys compared case-insensitively –
(BUS,7), (CAR,7), (TRAIN,4)
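The Bus/Car/Train example can be reproduced with a short sketch. The function names map_fn and reduce_fn are hypothetical, and upper-casing the key inside the reducer is an assumption about how the differently cased words are meant to be merged. Tallying the 18 input words this way gives 7 for BUS, 7 for CAR and 4 for TRAIN.

```python
from collections import defaultdict

# The input words from the example, as one comma-separated string.
raw = ("Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, "
       "caR, CAR, car, BUS, TRAIN")


def map_fn(text):
    """Map: break the input into words and emit one (word, 1) tuple per word."""
    return [(word.strip(), 1) for word in text.split(",") if word.strip()]


def reduce_fn(pairs):
    """Reduce: sum the values per key, treating keys case-insensitively."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key.upper()] += value
    return dict(totals)


pairs = map_fn(raw)
counts = reduce_fn(pairs)
print(len(pairs))  # 18
print(counts)      # {'BUS': 7, 'CAR': 7, 'TRAIN': 4}
```

The mapper keeps the original casing, exactly as in the (Key, Value) list above; the case folding happens only when the reducer combines the tuples.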