
Unit 3 Final

MapReduce is a programming model for processing large datasets in parallel across a Hadoop cluster, consisting of two main phases: Map, which converts input data into key-value pairs, and Reduce, which aggregates values for each key. It allows for efficient data processing by dividing large datasets into smaller parts that can be processed simultaneously across multiple nodes, enhancing speed and reliability. The architecture includes components like Input Data Storage (HDFS), Mapper, Reducer, and the optional Combiner, facilitating distributed processing and fault tolerance.

What is MapReduce?

Definition: MapReduce is a distributed programming model used in Hadoop to process large amounts of
data in parallel across multiple computers (nodes) in a cluster. It is mainly used for Big Data
processing. It consists of two phases: Map, which converts input data into key-value pairs, and
Reduce, which aggregates the values for each key.
Why MapReduce is needed?
When the data size becomes very large, such as Gigabytes (GBs), Terabytes (TBs), or Petabytes
(PBs), processing it on a single computer becomes difficult. A single system may not have enough
memory, storage, or processing power to handle such huge data efficiently. As a result, the processing
speed becomes very slow. Additionally, if that single system fails, the entire data processing task may
stop, which increases the risk of data loss or system failure.
How MapReduce solves this:
MapReduce divides large data into smaller parts so that each part can be handled separately.
These smaller data blocks are processed simultaneously on multiple machines in parallel, which
significantly increases processing speed and efficiency. After all the parts are processed, the individual
results are combined together to produce the final output. This approach makes it possible to handle
very large datasets quickly and reliably.
Basic Idea of MapReduce
The basic idea of MapReduce comes from two main functions: Map and Reduce. These two phases
work together to process large amounts of data efficiently in a distributed environment.
1. Map Phase
In the Map phase, the input data is taken and divided into smaller pieces. Each piece of data is
processed independently, and the mapper converts the data into key-value pairs.
For example, if the input is:
Hello Hadoop
Hello World
The mapper produces the following output:
Hello 1
Hadoop 1
Hello 1
World 1
Each word is treated as a key, and the value “1” represents one occurrence of that word.
2. Shuffle and Sort Phase (Automatic)
After the Map phase, the Shuffle and Sort phase occurs automatically. In this phase, all identical keys
are grouped together. The system organizes the intermediate output so that values belonging to the
same key are combined. After shuffling, the data becomes:
Hello → [1,1]
Hadoop → [1]
World → [1]
3. Reduce Phase
In the Reduce phase, the grouped data from the Shuffle and Sort phase is processed. The reducer takes
each key along with its list of values and performs aggregation operations such as sum, count, average,
minimum, or maximum. The main purpose of this phase is to combine the intermediate results and
produce the final output.
For example, after shuffling we have:
Hello → [1,1]
Hadoop → [1]
World → [1]
The reducer adds the values for each key:
Hello 2
Hadoop 1
World 1

Thus, the Reduce phase generates the final summarized result.
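The three phases above can be mimicked on a single machine. The following is a minimal sketch in plain Java with no Hadoop classes involved; the class and method names (WordCountSketch, map, shuffle, reduce, wordCount) are made up for illustration, and a real job would spread these steps across many nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-machine illustration of Map, Shuffle and Sort, and Reduce (not Hadoop code).
final class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle and sort: group all values by key; TreeMap keeps the keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the list of values that belongs to one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in order over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(intermediate).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("Hello Hadoop", "Hello World")));
    }
}
```

Running main prints {Hadoop=1, Hello=2, World=1}: the same counts as the reducer output shown above, with the keys in sorted order.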
Architecture of MapReduce:
MapReduce is a distributed data processing model used in Apache Hadoop to process large datasets
across multiple machines. It works closely with Hadoop Distributed File System (HDFS) and
follows a structured sequence of steps from input storage to final output generation.

1. Input Data Storage (HDFS)


The first step in MapReduce architecture is storing the data in HDFS. When a large file is placed into
HDFS, it is automatically divided into fixed-size blocks (commonly 128 MB or 256 MB). These blocks
are distributed across different machines called DataNodes in a cluster. The NameNode manages
metadata such as block locations and file structure. This distributed storage ensures fault tolerance and
parallel processing capability.
2. Input Splitting
After storing the file in HDFS, the MapReduce framework divides the file into logical units called
input splits. Each input split is assigned to one Mapper for processing. The number of splits generally
depends on the size of the file and the block size in HDFS. Input splitting allows multiple mappers to
work simultaneously, improving processing speed through parallelism.
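As a rough illustration of how file size and block size determine the number of splits, the sketch below simply rounds the file size up to a whole number of blocks. This is a simplification: the helper name countSplits is hypothetical, and the real framework applies further rules (such as configurable minimum and maximum split sizes), so treat the result as an approximation.

```java
// Simplified estimate of the number of input splits (and therefore map tasks).
final class InputSplitSketch {

    // Assumes one split per block; the last split may be smaller than the others.
    static long countSplits(long fileSizeBytes, long splitSizeBytes) {
        if (fileSizeBytes < 0 || splitSizeBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        // Integer ceiling division: round up to a whole number of splits.
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(countSplits(1024 * mb, 128 * mb)); // 1 GB file, 128 MB blocks
        System.out.println(countSplits(300 * mb, 128 * mb));  // 300 MB file, 128 MB blocks
    }
}
```

Under these assumptions a 1 GB file yields 8 splits and a 300 MB file yields 3, so 8 and 3 mappers respectively could run in parallel.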
3. Record Reader
The Record Reader converts each input split into key-value pairs that can be processed by the Mapper.
By default, the key is the byte offset of the line in the file, and the value is the actual line of text. This
step acts as a bridge between raw data in HDFS and the Mapper logic. It ensures that data is structured
properly before processing begins.
4. Mapper Phase
In the Mapper phase, each mapper processes its assigned input split independently. The mapper reads
the key-value pairs generated by the Record Reader and applies the user-defined map function. It then
produces intermediate key-value pairs as output. Since multiple mappers run in parallel on different
nodes, large datasets can be processed efficiently and quickly.
5. Combiner Phase (Optional)
The Combiner acts as a mini-reducer that runs on the mapper node before data is sent across the
network. Its purpose is to reduce the amount of intermediate data transferred to the reducers. By
performing local aggregation of data with the same key, it minimizes network traffic and improves
overall performance. However, using a combiner is optional and depends on the problem being solved.
6. Shuffle and Sort Phase
The shuffle and sort phase is a critical stage in MapReduce. During this phase, the framework groups
all intermediate values associated with the same key and transfers them to the appropriate reducer. The
system automatically sorts the keys before sending them to reducers. This ensures that all values related
to a particular key are processed together. Shuffle and sort happen internally within the Hadoop
framework and require no manual coding.
7. Reducer Phase

In the Reducer phase, each reducer receives a key and a list of associated values from the shuffle phase.
The reducer applies a user-defined reduce function to aggregate or summarize the data. Common
operations include counting, summing, averaging, or filtering data. The reducer processes each key
independently and generates the final output key-value pairs.
8. Output Storage (HDFS)
The final step in MapReduce architecture is writing the reducer output back to HDFS. Each reducer
creates one output file, typically named part-00000, part-00001, and so on. These output files are stored
in distributed form across the cluster. The result can then be used for further analysis or processing.
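The naming pattern of the output files can be illustrated with a small sketch. The helper partFileName is hypothetical and only reproduces the part-nnnnn pattern mentioned above (a zero-padded, five-digit reducer index); it does not call any Hadoop API.

```java
import java.util.Locale;

// Illustrates the part-nnnnn naming pattern for reducer output files.
final class PartFileNameSketch {

    // Reducer 0 writes part-00000, reducer 1 writes part-00001, and so on.
    static String partFileName(int reducerIndex) {
        return String.format(Locale.ROOT, "part-%05d", reducerIndex);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            System.out.println(partFileName(i));
        }
    }
}
```

With three reducers this prints part-00000, part-00001, and part-00002, one file per reducer.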

Features of MapReduce
✔ Parallel processing
✔ Fault tolerance
✔ Scalable
✔ Works on distributed systems
✔ Processes structured & unstructured data
Advantages
• Handles very large datasets
• Automatically manages failures
• Distributed processing
• High performance
Disadvantages
• Not suitable for real-time processing
• High disk I/O
• Complex programming in Java
Real-Life Example
Suppose a company wants to count how many times each product name appears in millions of reviews.
Map → Extract each product name and emit (product name, 1)
Reduce → Sum the counts for each product name

Writing MapReduce Programs


MapReduce programs are written to process large-scale data using parallel computation. In Apache
Hadoop, a job is divided into small tasks called Map tasks and Reduce tasks, which run across many
machines.
MapReduce Working Model

Figure: The MapReduce concept illustrated with the word count process.

Step-by-step flow:

1. Input data is stored in HDFS.


2. Data is split into smaller parts.
3. Mapper processes each split and produces key–value pairs.
4. Shuffle & Sort groups similar keys together.
5. Reducer processes grouped data.
6. Final output is written back to HDFS.
Main Components of a MapReduce Program

1. Driver Program

The driver controls the job execution.
It:
• Sets input and output paths
• Sets mapper and reducer classes
• Configures job settings
• Starts the job using waitForCompletion()
Note: Think of it as the main() method of the program.

2. Mapper

The mapper reads input data and converts it into key–value pairs.
Example:
• Input → Weather record
• Output → (Year, Temperature)
The mapper:
• Works on one record at a time
• Produces intermediate output
Note: It performs filtering and transformation.
3. Shuffle and Sort
This is done automatically by Hadoop.
It:
• Groups all values with the same key together
• Sends grouped data to reducers
Example:
(2020, 30)
(2020, 32)
(2021, 29)
Becomes:
2020 → [30, 32]
2021 → [29]

4. Reducer

The reducer processes grouped values and produces final results.


Example:
• Input → (2020, [30, 32])
• Output → (2020, 32), the maximum temperature for that year
Note: It performs aggregation or summary operations like:
• Max
• Min
• Count
• Sum
• Average

5. RecordReader

• Converts input files into key–value pairs
• Sends them to the mapper
• Example: reads file line-by-line

6. Combiner

• Works like a mini-reducer
• Runs after mapper
• Reduces data before sending to reducer
• Improves performance
Note: Used to reduce network traffic

7. Partitioner

• Decides which reducer gets which key
• Default: hash partition
• Helps balance the load across reducers

Weather Dataset

Weather sensors are installed at many locations around the world. These sensors collect information
such as temperature, humidity, wind speed, and rainfall every hour. Because the data is collected
continuously, it becomes very large in size.
This large amount of data is called log data. Since the data is semi-structured and arranged record
by record, it is very suitable for processing using Apache Hadoop and the MapReduce programming
model. MapReduce can easily split this big dataset and process it in parallel.

Data Format

The dataset is provided by the National Climatic Data Center (NCDC).


The data is stored in a line-oriented ASCII text format:
• Each line = one record
• Each record contains weather details for a specific time and place
• Example fields: station ID, date, year, temperature, quality code
Example:
012345678901234201001011230+0025...
Because each line is independent, it is easy for the Mapper to read and process one record at a time.

Why is this dataset good for MapReduce?

• Very large volume of data
• Simple text format
• Records are independent
• Easy to split into blocks
• Can be processed in parallel
Therefore, weather datasets are a perfect example for learning MapReduce programs.

Classic MapReduce (MapReduce 1):


Classic MapReduce, also known as MapReduce 1, is the original processing model used in Apache
Hadoop. It is designed to process large datasets by dividing the work into smaller tasks and running
them on multiple machines in a cluster. When a job is submitted, the system automatically splits the
work into map and reduce tasks and executes them in parallel.
At the highest level, the system consists of four main components.
The first component is the client, which submits the MapReduce job to the cluster. The client
prepares the job configuration, specifies the input and output paths, and starts the execution process.
The second component is the JobTracker. It is the master node that coordinates the entire job. It
manages scheduling, assigns tasks to worker machines, monitors progress, and handles failures. In
simple terms, the JobTracker controls how and where the job runs.
The third component is the TaskTrackers, which are worker nodes. They receive tasks from the
JobTracker and execute the map and reduce operations. Each TaskTracker runs multiple tasks and
regularly sends status updates back to the JobTracker.
The fourth component is the distributed storage system, usually Hadoop Distributed File System
(HDFS). It stores input data, program files, and output results. All components use this filesystem to
share data across the cluster.
Together, these components allow Classic MapReduce to process large amounts of data efficiently.
However, as clusters grew bigger, this design faced scalability issues, which later led to the
development of YARN (MapReduce 2).

In summary, Classic MapReduce (MapReduce 1) is the original data processing model used in Apache
Hadoop. It processes large datasets by dividing the work into smaller map and reduce tasks and running
them in parallel across many machines in a cluster. This parallel execution helps complete big data
processing quickly and efficiently.
The system has four main components. The client submits the job and provides configuration details.
The JobTracker acts as the master and manages scheduling, task assignment, and monitoring. The
TaskTrackers are worker nodes that execute map and reduce tasks. Data is stored and shared using
Hadoop Distributed File System (HDFS).
Together, these components enable efficient distributed processing, but scalability limitations in large
clusters later led to the development of YARN (MapReduce 2).

Figure 1. How Hadoop runs a MapReduce job using the classic framework
Job Submission
In Apache Hadoop, a job starts when the submit() method is called. This creates an internal
JobSubmitter that prepares the job for execution. After submission, waitForCompletion() continuously
checks the job’s progress and displays updates on the console.
During submission, the system first requests a new job ID from the JobTracker. It then checks whether
the output path is valid and ensures the input files exist. Next, the input data is divided into splits. After
that, all required resources such as the job JAR file, configuration file, and split information are copied
to shared storage. Finally, the JobTracker is informed that the job is ready to run.
Job Initialization
Once the JobTracker receives the job, it places it in a queue and prepares it for execution. It reads the
input splits and creates one map task for each split. It also creates the required number of reduce tasks
based on the job configuration. In addition, setup and cleanup tasks are created to prepare the
environment before execution and clean resources after completion.
Task Assignment
TaskTrackers continuously send heartbeat messages to the JobTracker to show they are active. When
a TaskTracker is free, it requests a new task. The JobTracker then assigns a map or reduce task. Each
TaskTracker has a fixed number of slots, so it can run only a limited number of tasks at the same time
depending on CPU and memory.
Task Execution
After receiving a task, the TaskTracker copies the necessary files from Hadoop Distributed File System
(HDFS) to its local system. It creates a working directory and starts a separate Java Virtual Machine
to execute the task. Running each task in a separate JVM ensures that errors in one task do not affect
others.
Progress and Status Updates
MapReduce jobs may run for a long time, so Hadoop tracks their progress. Each task regularly reports
its completion status and performance details to the JobTracker. Users can monitor how much of the
map and reduce work has been completed.
Job Completion
When all tasks finish successfully, the JobTracker marks the job as complete. The client receives a
success message, and temporary files are cleaned up. For very large clusters, this classic system faced
scalability problems, which led to the introduction of YARN (MapReduce 2) for better resource
management.

YARN (MapReduce 2)
YARN (Yet Another Resource Negotiator) is the resource management layer introduced in Apache Hadoop
to overcome the scalability problems of Classic MapReduce; MapReduce running on YARN is commonly
called MapReduce 2. In the old system, the JobTracker handled both resource management and job
monitoring, which caused performance issues in large clusters. YARN solves this by separating these
responsibilities into different components.
In YARN, the work of the JobTracker is divided between two main services: the Resource
Manager and the Application Master.
The Resource Manager controls and allocates resources for the entire cluster, while the
Application Master manages the execution of a specific application or job.
When a job is submitted, the Application Master requests resources from the Resource
Manager. These resources are provided in the form of containers, each with a fixed amount of memory
and CPU. The Application Master then runs the required tasks inside these containers. This ensures
that every job uses only the resources assigned to it.
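The idea of granting fixed-size containers out of each node's free resources can be pictured with a toy model. The sketch below is an assumption-heavy illustration, not YARN's scheduler: it tracks memory only, uses a simple first-fit rule, and the class, method, and node names (ContainerAllocationSketch, addNode, allocate, node1, node2) are invented for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Toy model of a resource manager handing out memory-based containers (not YARN code).
final class ContainerAllocationSketch {

    // Free memory in MB for each worker node, in registration order.
    private final Map<String, Integer> freeMemoryMb = new LinkedHashMap<>();

    void addNode(String nodeName, int memoryMb) {
        freeMemoryMb.put(nodeName, memoryMb);
    }

    // First-fit rule (an assumption): grant the container on the first node with
    // enough free memory, or return empty when no node can host it.
    Optional<String> allocate(int containerMemoryMb) {
        for (Map.Entry<String, Integer> node : freeMemoryMb.entrySet()) {
            if (node.getValue() >= containerMemoryMb) {
                node.setValue(node.getValue() - containerMemoryMb);
                return Optional.of(node.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        ContainerAllocationSketch resourceManager = new ContainerAllocationSketch();
        resourceManager.addNode("node1", 2048);
        resourceManager.addNode("node2", 1024);
        // An application master asking for four 1024 MB containers.
        for (int i = 0; i < 4; i++) {
            System.out.println(resourceManager.allocate(1024).orElse("no capacity"));
        }
    }
}
```

In this model the first two requests land on node1, the third on node2, and the fourth is refused because no node has 1024 MB left, which mirrors the point that a job can only use the resources it has been granted.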
The containers are managed by Node Managers, which run on each worker machine. They monitor
the containers and make sure applications do not exceed their resource limits. Unlike Classic
MapReduce, where one JobTracker manages all jobs, YARN gives each application its own dedicated
Application Master. This improves control and reliability.
One major advantage of YARN is that multiple applications can run on the same cluster at the same
time. For example, MapReduce jobs, streaming jobs, or other distributed applications can work
together.
This leads to better scalability, improved performance, and efficient use of cluster resources. It also
makes upgrading and managing different versions of MapReduce easier.

Figure: How Hadoop runs a MapReduce job using YARN MapReduce

Main Entities in YARN


YARN involves more components than Classic MapReduce. The client submits the job. The Resource
Manager allocates resources across the cluster. The Node Managers launch and monitor containers
on each machine. The Application Master coordinates the MapReduce tasks for a specific job. All
job files and data are stored and shared using Hadoop Distributed File System (HDFS).
Together, these components make YARN more flexible, scalable, and efficient than Classic
MapReduce.
YARN (MapReduce 2) – Job Execution Steps
In Apache Hadoop, YARN (MapReduce 2) follows a structured process to submit, initialize, and
execute jobs. Although the user API is similar to MapReduce 1, the internal execution is handled
differently for better scalability and resource management.

Job Submission
Jobs in MapReduce 2 are submitted using the same API as MapReduce 1. When the framework is set
to YARN, the client communicates with the Resource Manager instead of the JobTracker. The client
uploads the job files, configuration, and other resources to the shared storage and requests a new
application (job) ID from the Resource Manager. The submission process is therefore similar to the
classic model but handled by YARN components.

Job Initialization
After submission, the Resource Manager forwards the request to its scheduler. The scheduler allocates
a container where the Application Master is launched under the control of a Node Manager.
The Application Master initializes the job by setting up tracking structures to monitor task progress
and completion. It then retrieves the input splits from the shared storage, usually Hadoop Distributed
File System (HDFS). Based on these splits, it creates one map task for each split and the required
number of reduce tasks.

Task Assignment

Next, the Application Master requests containers from the Resource Manager to run the map and
reduce tasks. These requests include information about data locality, meaning tasks are scheduled close
to where the data is stored. This reduces network usage and improves performance.

Task Execution

Once containers are allocated, the Node Manager starts them. Inside each container, a Java process
called YarnChild is launched. It downloads the required job resources such as configuration files,
JAR files, and cached data. After setup, it executes the map or reduce task.

Understanding Hadoop API for MapReduce Framework (Old and New)


Apache Hadoop provides two Java MapReduce APIs: old API and new API. The differences are
given below in clear point-wise sentences:
1. The new API uses abstract classes instead of interfaces, making it easier to extend and
modify.
2. The new API is located in the org.apache.hadoop.mapreduce package, while the old API is in
org.apache.hadoop.mapred.
3. The new API uses a Context object to interact with the framework, replacing JobConf,
OutputCollector, and Reporter.
4. The new API allows both Mapper and Reducer to control execution by overriding the
run() method.
5. The new API uses a single Configuration class for job settings, while the old API uses
JobConf.
6. The new API controls jobs using the Job class, while the old API uses JobClient.
7. The output file naming is different: old API uses part-nnnnn, while the new API uses
part-m-nnnnn (map outputs) and part-r-nnnnn (reduce outputs).
8. The new API passes reducer values as Iterable, which supports the for-each loop, while the
old API uses Iterator.
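The practical effect of point 8 can be seen in plain Java. The sketch below uses ordinary Integer values in place of Hadoop's IntWritable, and the class and method names are hypothetical; it only contrasts the two looping styles and is not taken from either API.

```java
import java.util.Iterator;
import java.util.List;

// Contrasts Iterator-style and Iterable-style loops over reducer values.
final class ReducerValuesSketch {

    // Old-API style: the values arrive as an Iterator, so an explicit loop is needed.
    static int maxWithIterator(Iterator<Integer> values) {
        int max = Integer.MIN_VALUE;
        while (values.hasNext()) {
            max = Math.max(max, values.next());
        }
        return max;
    }

    // New-API style: the values arrive as an Iterable, so for-each works directly.
    static int maxWithIterable(Iterable<Integer> values) {
        int max = Integer.MIN_VALUE;
        for (int value : values) {
            max = Math.max(max, value);
        }
        return max;
    }

    public static void main(String[] args) {
        List<Integer> temperatures = List.of(30, 32, 29);
        System.out.println(maxWithIterator(temperatures.iterator()));
        System.out.println(maxWithIterable(temperatures));
    }
}
```

Both calls print 32; the only difference is how the loop over the values is written.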

Basic Programs of Hadoop MapReduce


Driver Code
How It Works
The Driver Code is the main program of a MapReduce job.
It is responsible for:
• Creating the Job
• Setting configuration details
• Connecting Mapper, Reducer, Combiner, and Partitioner
• Providing input and output paths
• Submitting the job to Hadoop
Without the Driver class, the MapReduce program cannot run.

A Job object is used to define and control a MapReduce job. It contains all the details about how the
job should run.
When we run a job on a Hadoop cluster, the code is packaged into a JAR file. Instead of manually
giving the JAR file name, we use the setJarByClass() method in the Job class. Hadoop will
automatically find the JAR file that contains that class.
After creating the Job object, we must give:
• Input Path
• Output Path
The input path is given using the static method addInputPath() from the FileInputFormat class.
The input can be:
• A single file
• A directory (all files inside it will be taken as input)
• A file pattern
The output path is given using the static method setOutputPath() from the FileOutputFormat class.
The output path must be a directory. The reducer writes the final output files into this directory.
The directory must not already exist when the job starts; otherwise Hadoop refuses to run the job,
which protects earlier results from being overwritten.
Next, we must specify:
• Mapper class → using setMapperClass()
• Reducer class → using setReducerClass()
We also specify the output data types:
• setOutputKeyClass() → sets the output key type
• setOutputValueClass() → sets the output value type
These two methods set the reducer output types, and the mapper output types default to the same
types. If the mapper output types are different, then we also use:
• setMapOutputKeyClass()
• setMapOutputValueClass()
The input type is controlled by the input format. Since we did not specify it explicitly, Hadoop uses
the default TextInputFormat.
After setting all these details, we run the job using:
waitForCompletion()
This method:
• Submits the job
• Waits until it finishes
• Returns true if successful
• Returns false if failed
The driver converts this return value into the exit code of the program:
• Exit code 0 → Success
• Exit code 1 → Failure
Driver Program: MaxTemperature
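import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature
{
    public static void main(String[] args) throws Exception
    {
        // Expect exactly two arguments: the input path and the output path.
        if (args.length != 2)
        {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }
        // Note: new Job() is deprecated in newer Hadoop releases in favour of Job.getInstance().
        Job job = new Job();
        job.setJarByClass(MaxTemperature.class);
        job.setJobName("Max temperature");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Submit the job, wait for it to finish, and map success/failure to exit code 0/1.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}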

Mapper Code:
How it works
The Mapper is the first phase in the MapReduce process.
Its main job is:
• Read input data
• Convert it into key-value pairs
• Send intermediate output to the next phase
Each mapper processes one input split.

The Mapper class is a generic class. It has four type parameters:
1. Input Key Type
2. Input Value Type
3. Output Key Type
4. Output Value Type
In this example:
• Input Key → Long integer offset (LongWritable)
• Input Value → One line of text (Text)
• Output Key → Year (Text)
• Output Value → Air temperature (IntWritable)
The input key represents the byte position of the line in the file.
The input value is a single line from the dataset.
The mapper reads each line and extracts:
• Year
• Air temperature
• Quality code
If the temperature is valid and not missing, it writes:
(year, temperature)
Mapper Program: MaxTemperatureMapper
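import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable>
{
    // Value used in the dataset to mark a missing temperature reading.
    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException
    {
        String line = value.toString();
        // The year occupies characters 15 to 18 of the fixed-width record.
        String year = line.substring(15, 19);
        int airTemperature;
        // Skip a leading plus sign before parsing; a minus sign is parsed as part of the number.
        if (line.charAt(87) == '+')
        {
            airTemperature = Integer.parseInt(line.substring(88, 92));
        }
        else
        {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        // Emit (year, temperature) only for present readings with an accepted quality code.
        if (airTemperature != MISSING && quality.matches("[01459]"))
        {
            context.write(new Text(year),
                new IntWritable(airTemperature));
        }
    }
}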
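The field extraction performed by the mapper can be tried without a cluster. The sketch below repeats the same character positions in plain Java; the class NcdcParseSketch, the helper syntheticRecord, and the zero-filled test record are all assumptions made for illustration and are not real NCDC data.

```java
// Plain-Java version of the mapper's parsing logic, exercised on a synthetic record.
final class NcdcParseSketch {

    static final int MISSING = 9999;

    // Returns "year<TAB>temperature", or null when the reading is missing or has a
    // quality code outside the accepted set.
    static String parse(String line) {
        String year = line.substring(15, 19);
        int airTemperature;
        if (line.charAt(87) == '+') {
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            return year + "\t" + airTemperature;
        }
        return null;
    }

    // Builds a 93-character record of filler zeros with the three fields placed at
    // the positions the mapper reads (hypothetical test data only).
    static String syntheticRecord(String year, String signedTemperature, char quality) {
        StringBuilder record = new StringBuilder("0".repeat(93));
        record.replace(15, 19, year);              // 4-character year
        record.replace(87, 92, signedTemperature); // sign plus 4 digits
        record.setCharAt(92, quality);             // 1-character quality code
        return record.toString();
    }

    public static void main(String[] args) {
        System.out.println(parse(syntheticRecord("1950", "+0025", '1')));
        System.out.println(parse(syntheticRecord("1950", "-0011", '1')));
        System.out.println(parse(syntheticRecord("1950", "+9999", '1')));
    }
}
```

The first two records are accepted and the third is rejected (printed as null) because 9999 marks a missing temperature.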

Reducer Code:
How it works:
The Reducer is the final phase in the MapReduce process.
Its main job is:
• Receive grouped data from the Mapper
• Process all values for a single key
• Produce final output

Again, the Reducer class uses four type parameters. These specify:
1. Input Key Type
2. Input Value Type
3. Output Key Type
4. Output Value Type
The input types of the reducer must be the same as the output types of the mapper.
In our example:
• Mapper Output Key → Text (Year)
• Mapper Output Value → IntWritable (Temperature)
So the Reducer Input Types are:
• Text
• IntWritable
In this case, the reducer output types are also:
• Text → Year
• IntWritable → Maximum temperature
The reducer receives:
• A year (key)
• A list of temperatures for that year (values)
It finds the maximum temperature by:
• Iterating through all temperatures
• Comparing each value with the highest value found so far
• Storing the maximum
• Writing the final result
Reducer Program: MaxTemperatureReducer
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable>
{
    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
            Context context)
        throws IOException, InterruptedException
    {
        // Start below any possible reading so the first value always replaces it.
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values)
        {
            maxValue = Math.max(maxValue, value.get());
        }
        // Emit one (year, maximum temperature) pair per key.
        context.write(key, new IntWritable(maxValue));
    }
}
Record Reader:
How RecordReader Works
1. Input file is divided into InputSplits.
2. Each split is assigned to a Mapper.
3. The RecordReader reads the split.
4. It converts raw data into key/value pairs.
5. These key/value pairs are passed to the map() method.

A RecordReader is responsible for creating key/value pairs that are given to the Map task for
processing.
Before the Mapper works, the input data must be converted into key/value pairs.
This conversion is done by the RecordReader.
Each InputFormat must provide its own RecordReader implementation.
That means every InputFormat decides how the input data should be read and how the key/value
pairs should be created.
For example:
The default TextInputFormat provides a LineRecordReader.
• The key → Byte offset of the start of the line within the file (LongWritable)
• The value → One line of text from the input file (Text)
So for every line in the input file, the RecordReader creates:
Key → Position of the line in the file
Value → The actual line content
Example
Suppose the input file contains:
Hello Hadoop
Big Data
MapReduce
The LineRecordReader will generate:
(0, "Hello Hadoop")
(13, "Big Data")
(22, "MapReduce")
Here:
• 0 → Starting byte position of first line
• 13 → Starting byte position of second line
• 22 → Starting byte position of third line
These pairs are then sent to the Mapper.
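The offsets in this example can be reproduced with a few lines of plain Java. The sketch below is only a stand-in for what a line-oriented record reader does: the class and method names are hypothetical, and it assumes UTF-8 text with a single newline character ending each line.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Computes (byte offset, line) pairs the way a line-oriented record reader would.
final class LineOffsetSketch {

    static List<String> readRecords(String fileContent) {
        List<String> records = new ArrayList<>();
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            records.add("(" + offset + ", \"" + line + "\")");
            // The next line starts after this line's bytes plus one newline byte.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        readRecords("Hello Hadoop\nBig Data\nMapReduce\n").forEach(System.out::println);
    }
}
```

"Hello Hadoop" is 12 bytes plus a newline, so the second line starts at offset 13; "Big Data" adds 9 more bytes, so the third line starts at offset 22.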

Combiner Code:
How it works:
A Combiner is an optional component in MapReduce.
It works:
• After the Mapper
• Before the Reducer
Its main purpose is:
• To reduce the amount of data sent from Mapper to Reducer
• To improve performance
• To reduce network traffic
The Combiner acts like a mini-reducer.
Important:
• Hadoop does not guarantee how many times the combiner will run.
• It may run zero, one, or multiple times.
• Therefore, the combiner logic must not change the final result.
• The reducer output must be correct even if the combiner is not executed.

Many MapReduce jobs are limited by the network bandwidth available in the cluster.
So, it is important to reduce the amount of data transferred between the map tasks and reduce tasks.
Hadoop allows the user to specify a Combiner function.
• The combiner runs after the mapper
• It works on the mapper output
• Its output becomes the input to the reducer
The combiner is used as an optimization technique. Because Hadoop may run it zero, one, or many
times, the final reducer output must be the same whether or not the combiner runs.
The combiner does not replace the reducer.
It only reduces the amount of data that is shuffled between mapper and reducer.
In this example, since we are calculating maximum temperature, the reducer logic can also be used
as a combiner.
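Why maximum is safe to combine, while some other operations are not, can be checked with plain arithmetic. The sketch below uses made-up temperature values split across two imaginary mappers; the class name and the numbers are assumptions for illustration only.

```java
import java.util.List;

// Shows that max can be pre-aggregated per mapper, while mean cannot.
final class CombinerSafetySketch {

    static int max(List<Integer> values) {
        int result = Integer.MIN_VALUE;
        for (int v : values) {
            result = Math.max(result, v);
        }
        return result;
    }

    static double mean(List<Integer> values) {
        double sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    public static void main(String[] args) {
        List<Integer> mapper1 = List.of(0, 20, 10); // readings seen by the first mapper
        List<Integer> mapper2 = List.of(25, 15);    // readings seen by the second mapper
        List<Integer> all = List.of(0, 20, 10, 25, 15);

        // Max of the per-mapper maxima equals the max of all readings.
        int combinedMax = max(List.of(max(mapper1), max(mapper2)));
        System.out.println(combinedMax + " vs " + max(all));

        // Mean of the per-mapper means differs from the mean of all readings.
        double meanOfMeans = (mean(mapper1) + mean(mapper2)) / 2;
        System.out.println(meanOfMeans + " vs " + mean(all));
    }
}
```

The first line prints 25 vs 25, so applying the max logic early changes nothing. The second prints 15.0 vs 14.0, which shows that a mean reducer could not be reused as a combiner without changing the result.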
Program: MaxTemperatureWithCombiner (Driver Code)
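import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureWithCombiner
{
    public static void main(String[] args) throws Exception
    {
        if (args.length != 2)
        {
            System.err.println("Usage: MaxTemperatureWithCombiner <input path> "
                + "<output path>");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(MaxTemperatureWithCombiner.class);
        job.setJobName("Max temperature");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MaxTemperatureMapper.class);
        // The reducer class doubles as the combiner because max can be applied in stages.
        job.setCombinerClass(MaxTemperatureReducer.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}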

Partitioner Code
The Partitioning phase happens:
• After the Map phase
• Before the Reduce phase
The number of partitions is equal to the number of reducers.
The partitioner decides which reducer will process which key.
The data is divided across reducers according to the partitioning function.

Difference Between Partitioner and Combiner

• Partitioner
o Divides data based on the number of reducers.
o Ensures that all values of the same key go to the same reducer.
o Controls which reducer processes the data.
• Combiner
o Works like a mini reducer.
o Runs on mapper output.
o Reduces data before sending it to reducer.
o Used for optimization.
The combiner does not decide where data goes.
The partitioner does not perform aggregation.

Default Partitioner

Hadoop uses the hash partitioner (HashPartitioner) by default.
• It applies a hash function on the key.
• Then it decides the reducer number.
Formula used internally:
partition = (key.hashCode() & Integer.MAX_VALUE) % numberOfReducers
So:
• Same key → Always same reducer
• Different keys → May go to different reducers
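The formula can be tried out in plain Java. The sketch below applies it to ordinary String keys; the class and method names are hypothetical, and because a String's hashCode() is not the same as the hash of a Hadoop Text key, the actual partition numbers in a real job may differ.

```java
// Applies the hash-partitioning formula to plain Java keys (illustration only).
final class HashPartitionSketch {

    static int partitionFor(Object key, int numberOfReducers) {
        // The bitwise AND clears the sign bit, so a negative hashCode()
        // can never produce a negative partition number.
        return (key.hashCode() & Integer.MAX_VALUE) % numberOfReducers;
    }

    public static void main(String[] args) {
        int reducers = 3;
        for (String year : new String[] {"1949", "1950", "2020", "1950"}) {
            System.out.println(year + " -> reducer " + partitionFor(year, reducers));
        }
    }
}
```

Whatever numbers are printed, both occurrences of 1950 are assigned to the same reducer, and every result lies between 0 and numberOfReducers - 1.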

Why is a Custom Partitioner Needed?

Sometimes, default hash partitioning is not enough.
We may want to:
• Partition data based on some part of the key
• Partition based on value
• Send specific keys to specific reducers
In such cases, we create a Custom Partitioner.
Example: Custom Partitioner Code
Suppose we want:
• Years less than 1950 → Reducer 0
• Years 1950 and above → Reducer 1
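import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class YearPartitioner
    extends Partitioner<Text, IntWritable>
{
    @Override
    public int getPartition(Text key, IntWritable value,
            int numReduceTasks)
    {
        // The key is the year emitted by the mapper, stored as text.
        int year = Integer.parseInt(key.toString());

        // With no reduce tasks there is only one possible partition.
        if (numReduceTasks == 0)
        {
            return 0;
        }

        if (year < 1950)
        {
            return 0;
        }
        else
        {
            // The modulo keeps the result valid when only one reducer is configured.
            return 1 % numReduceTasks;
        }
    }
}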
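The partitioning rule itself can be checked without Hadoop. The sketch below copies the decision logic into a plain static method that takes the year as a String; the class name YearPartitionRuleSketch is hypothetical and the method is not part of any Hadoop API.

```java
// Plain-Java copy of the year-based partitioning rule, for quick experiments.
final class YearPartitionRuleSketch {

    static int getPartition(String yearKey, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;
        }
        int year = Integer.parseInt(yearKey);
        // Years before 1950 go to partition 0; all others go to partition 1,
        // folded back to 0 when only a single reducer exists.
        return (year < 1950) ? 0 : 1 % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(getPartition("1949", 2));
        System.out.println(getPartition("1950", 2));
        System.out.println(getPartition("2020", 1));
    }
}
```

With two reducers, 1949 maps to partition 0 and 1950 to partition 1; with a single reducer everything maps to partition 0. In a real job the partitioner would typically be registered in the driver with job.setPartitionerClass(YearPartitioner.class), together with job.setNumReduceTasks(2) to request two reducers.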
