MapReduce Data Flow
The data flow in MapReduce describes how data moves and transforms from its raw state on the disk
(HDFS) to the final processed output. This process involves distinct phases: Input Splitting, Mapping,
Shuffling/Sorting, and Reducing. The flow varies slightly depending on whether you use a Single
Reducer or Multiple Reducers.
1. General Data Flow Steps
1. Input Splits: The input file in HDFS is divided into logical blocks (Input Splits).
2. Mapping: Each split is processed by a Mapper, which outputs intermediate <Key,
Value> pairs.
3. Shuffling & Sorting: The framework automatically sorts these pairs by Key and groups all
values belonging to the same Key together (e.g., <"Apple", [1, 1, 1]>).
4. Partitioning: This step decides which Reducer will receive which Key. (Crucial for multiple
reducers).
5. Reducing: The Reducer processes the grouped data and writes the final output to HDFS.
A. Data Flow with Single Reducer
In this scenario, the job is configured to use exactly one Reducer ([Link](1)).
Flow:
1. All Mappers process their data chunks in parallel.
2. All intermediate data from every Mapper is sent to the same, single Reducer.
3. The Reducer processes every single key-value pair generated by the entire job.
Output:
You get exactly one output file in HDFS (typically named part-r-00000).
The output is globally sorted (because one Reducer sorted everything).
Drawback:
This creates a massive performance bottleneck. The single Reducer becomes
overloaded if the data volume is large, as it must process 100% of the data alone. It
defeats the purpose of distributed computing for the reduction phase.
B. Data Flow with Multiple Reducers
In this scenario, the job is configured to use multiple Reducers (e.g., [Link](3)).
Flow:
1. Mappers process data in parallel.
2. Partitioning: A Partitioner function runs on the Mapper output. It uses a hash of the
Key to decide which Reducer gets which data. (e.g., Keys starting with A-I go to
Reducer 1, J-R to Reducer 2, etc.).
3. Shuffling: Data is physically transferred over the network so that Reducer 1 only gets
its assigned keys, Reducer 2 gets its keys, and so on.
4. Parallel Reducing: All Reducers run simultaneously, each processing a subset of the
data.
Output:
You get multiple output files in HDFS (part-r-00000, part-r-00001, part-r-00002).
Each file is sorted internally, but there is no global sort order across all files (unless
you manually merge them).
Benefit:
This is true parallel procesMapReduce Data Flow
The data flow in MapReduce describes how data moves and transforms from its raw
state on the disk (HDFS) to the final processed output. This process involves distinct
phases: Input Splitting, Mapping, Shuffling/Sorting, and Reducing. The flow varies
slightly depending on whether you use a Single Reducer or Multiple Reducers.
1. General Data Flow Steps
Input Splits: The input file in HDFS is divided into logical blocks (Input Splits).
Mapping: Each split is processed by a Mapper, which outputs intermediate <Key,
Value> pairs.
Shuffling & Sorting: The framework automatically sorts these pairs by Key and
groups all values belonging to the same Key together (e.g., <"Apple", [1, 1, 1]>).
Partitioning: This step decides which Reducer will receive which Key. (Crucial for
multiple reducers).
Reducing: The Reducer processes the grouped data and writes the final output to
HDFS.
A. Data Flow with Single Reducer
In this scenario, the job is configured to use exactly one
Reducer ([Link](1)).
Flow:
All Mappers process their data chunks in parallel.
All intermediate data from every Mapper is sent to the same, single Reducer.
The Reducer processes every single key-value pair generated by the entire job.
Output:
You get exactly one output file in HDFS (typically named part-r-00000).
The output is globally sorted (because one Reducer sorted everything).
Drawback:
This creates a massive performance bottleneck. The single Reducer becomes
overloaded if the data volume is large, as it must process 100% of the data alone. It
defeats the purpose of distributed computing for the reduction phase.
B. Data Flow with Multiple Reducers
In this scenario, the job is configured to use multiple
Reducers (e.g., [Link](3)).
Flow:
Mappers process data in parallel.
Partitioning: A PMapReduce Data Flow
The data flow in MapReduce describes how data moves and transforms from its
raw state on the disk (HDFS) to the final processed output. This process involves
distinct phases: Input Splitting, Mapping, Shuffling/Sorting, and Reducing. The flow
varies slightly depending on whether you use a Single Reducer or Multiple
Reducers.
1. General Data Flow Steps
Input Splits: The input file in HDFS is divided into logical blocks (Input Splits).
Mapping: Each split is processed by a Mapper, which outputs intermediate <Key,
Value> pairs.
Shuffling & Sorting: The framework automatically sorts these pairs by Key and
groups all values belonging to the same Key together (e.g., <"Apple", [1, 1, 1]>).
Partitioning: This step decides which Reducer will receive which Key. (Crucial for
multiple reducers).
Reducing: The Reducer processes the grouped data and writes the final output to
HDFS.
A. Data Flow with Single Reducer
In this scenario, the job is configured to use exactly one
Reducer ([Link](1)).
Flow:
All Mappers process their data chunks in parallel.
All intermediate data from every Mapper is sent to the same, single Reducer.
The Reducer processes every single key-value pair generated by the entire job.
Output:
You get exactly one output file in HDFS (typically named part-r-00000).
The output is globally sorted (because one Reducer sorted everything).
Drawback:
This creates a massive performance bottleneck. The single Reducer becomes
overloaded if the data volume is large, as it must process 100% of the data alone. It
defeats the purpose of distributed computing for the reduction phase.
B. Data Flow with Multiple Reducers
In this scenario, the job is configured to use multiple
Reducers (e.g., [Link](3)).
Flow:
Mappers process data in parallel.
Partitioning: A Partitioner function runs on the Mapper output. It uses a hash of
the Key to decide which Reducer gets which data. (e.g., Keys starting with A-I go to
Reducer 1, J-R to Reducer 2, etc.).
Shuffling: Data is physically transferred over the network so that Reducer 1 only
gets its assigned keys, Reducer 2 gets its keys, and so on.
Parallel Reducing: All Reducers run simultaneously, each processing a subset of the
data.
Output:
You get multiple output files in HDFS (part-r-00000, part-r-00001, part-r-00002).
Each file is sorted internally, but there is no global sort order across all files (unless
you manually merge them).
Benefit:
This is true parallel processing. It is much faster and scalable because the workload
is shared across multiple machines.
artitioner function runs on the Mapper output. It uses a hash of the Key to decide
which Reducer gets which data. (e.g., Keys starting with A-I go to Reducer 1, J-R to
Reducer 2, etc.).
Shuffling: Data is physically transferred over the network so that Reducer 1 only gets
its assigned keys, Reducer 2 gets its keys, and so on.
Parallel Reducing: All Reducers run simultaneously, each processing a subset of the
data.
Output:
You get multiple output files in HDFS (part-r-00000, part-r-00001, part-r-00002).
Each file is sorted internally, but there is no global sort order across all files (unless
you manually merge them).
Benefit:
This is true parallel processing. It is much faster and scalable because the workload is
shared across multiple machines.
sing. It is much faster and scalable because the workload is shared across multiple
machines.