0% found this document useful (0 votes)
5 views5 pages

Understanding MapReduce Data Flow

it telling about flow of data

Uploaded by

g90078332
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views5 pages

Understanding MapReduce Data Flow

it telling about flow of data

Uploaded by

g90078332
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

MapReduce Data Flow

The data flow in MapReduce describes how data moves and transforms from its raw state on the disk
(HDFS) to the final processed output. This process involves distinct phases: Input Splitting, Mapping,
Shuffling/Sorting, and Reducing. The flow varies slightly depending on whether you use a Single
Reducer or Multiple Reducers.

1. General Data Flow Steps

1. Input Splits: The input file in HDFS is divided into logical blocks (Input Splits).

2. Mapping: Each split is processed by a Mapper, which outputs intermediate <Key,


Value> pairs.

3. Shuffling & Sorting: The framework automatically sorts these pairs by Key and groups all
values belonging to the same Key together (e.g., <"Apple", [1, 1, 1]>).

4. Partitioning: This step decides which Reducer will receive which Key. (Crucial for multiple
reducers).

5. Reducing: The Reducer processes the grouped data and writes the final output to HDFS.

A. Data Flow with Single Reducer

In this scenario, the job is configured to use exactly one Reducer ([Link](1)).

 Flow:

1. All Mappers process their data chunks in parallel.

2. All intermediate data from every Mapper is sent to the same, single Reducer.

3. The Reducer processes every single key-value pair generated by the entire job.

 Output:

 You get exactly one output file in HDFS (typically named part-r-00000).

 The output is globally sorted (because one Reducer sorted everything).

 Drawback:

 This creates a massive performance bottleneck. The single Reducer becomes


overloaded if the data volume is large, as it must process 100% of the data alone. It
defeats the purpose of distributed computing for the reduction phase.

B. Data Flow with Multiple Reducers

In this scenario, the job is configured to use multiple Reducers (e.g., [Link](3)).

 Flow:

1. Mappers process data in parallel.


2. Partitioning: A Partitioner function runs on the Mapper output. It uses a hash of the
Key to decide which Reducer gets which data. (e.g., Keys starting with A-I go to
Reducer 1, J-R to Reducer 2, etc.).

3. Shuffling: Data is physically transferred over the network so that Reducer 1 only gets
its assigned keys, Reducer 2 gets its keys, and so on.

4. Parallel Reducing: All Reducers run simultaneously, each processing a subset of the
data.

 Output:

 You get multiple output files in HDFS (part-r-00000, part-r-00001, part-r-00002).

 Each file is sorted internally, but there is no global sort order across all files (unless
you manually merge them).

 Benefit:

 This is true parallel procesMapReduce Data Flow

 The data flow in MapReduce describes how data moves and transforms from its raw
state on the disk (HDFS) to the final processed output. This process involves distinct
phases: Input Splitting, Mapping, Shuffling/Sorting, and Reducing. The flow varies
slightly depending on whether you use a Single Reducer or Multiple Reducers.

 1. General Data Flow Steps

 Input Splits: The input file in HDFS is divided into logical blocks (Input Splits).

 Mapping: Each split is processed by a Mapper, which outputs intermediate <Key,


Value> pairs.

 Shuffling & Sorting: The framework automatically sorts these pairs by Key and
groups all values belonging to the same Key together (e.g., <"Apple", [1, 1, 1]>).

 Partitioning: This step decides which Reducer will receive which Key. (Crucial for
multiple reducers).

 Reducing: The Reducer processes the grouped data and writes the final output to
HDFS.

 A. Data Flow with Single Reducer

 In this scenario, the job is configured to use exactly one


Reducer ([Link](1)).

 Flow:

 All Mappers process their data chunks in parallel.

 All intermediate data from every Mapper is sent to the same, single Reducer.

 The Reducer processes every single key-value pair generated by the entire job.
 Output:

 You get exactly one output file in HDFS (typically named part-r-00000).

 The output is globally sorted (because one Reducer sorted everything).

 Drawback:

 This creates a massive performance bottleneck. The single Reducer becomes


overloaded if the data volume is large, as it must process 100% of the data alone. It
defeats the purpose of distributed computing for the reduction phase.

 B. Data Flow with Multiple Reducers

 In this scenario, the job is configured to use multiple


Reducers (e.g., [Link](3)).

 Flow:

 Mappers process data in parallel.

 Partitioning: A PMapReduce Data Flow

 The data flow in MapReduce describes how data moves and transforms from its
raw state on the disk (HDFS) to the final processed output. This process involves
distinct phases: Input Splitting, Mapping, Shuffling/Sorting, and Reducing. The flow
varies slightly depending on whether you use a Single Reducer or Multiple
Reducers.

 1. General Data Flow Steps

 Input Splits: The input file in HDFS is divided into logical blocks (Input Splits).

 Mapping: Each split is processed by a Mapper, which outputs intermediate <Key,


Value> pairs.

 Shuffling & Sorting: The framework automatically sorts these pairs by Key and
groups all values belonging to the same Key together (e.g., <"Apple", [1, 1, 1]>).

 Partitioning: This step decides which Reducer will receive which Key. (Crucial for
multiple reducers).

 Reducing: The Reducer processes the grouped data and writes the final output to
HDFS.

 A. Data Flow with Single Reducer

 In this scenario, the job is configured to use exactly one


Reducer ([Link](1)).

 Flow:

 All Mappers process their data chunks in parallel.

 All intermediate data from every Mapper is sent to the same, single Reducer.
 The Reducer processes every single key-value pair generated by the entire job.

 Output:

 You get exactly one output file in HDFS (typically named part-r-00000).

 The output is globally sorted (because one Reducer sorted everything).

 Drawback:

 This creates a massive performance bottleneck. The single Reducer becomes


overloaded if the data volume is large, as it must process 100% of the data alone. It
defeats the purpose of distributed computing for the reduction phase.

 B. Data Flow with Multiple Reducers

 In this scenario, the job is configured to use multiple


Reducers (e.g., [Link](3)).

 Flow:

 Mappers process data in parallel.

 Partitioning: A Partitioner function runs on the Mapper output. It uses a hash of


the Key to decide which Reducer gets which data. (e.g., Keys starting with A-I go to
Reducer 1, J-R to Reducer 2, etc.).

 Shuffling: Data is physically transferred over the network so that Reducer 1 only
gets its assigned keys, Reducer 2 gets its keys, and so on.

 Parallel Reducing: All Reducers run simultaneously, each processing a subset of the
data.

 Output:

 You get multiple output files in HDFS (part-r-00000, part-r-00001, part-r-00002).

 Each file is sorted internally, but there is no global sort order across all files (unless
you manually merge them).

 Benefit:

 This is true parallel processing. It is much faster and scalable because the workload
is shared across multiple machines.

 artitioner function runs on the Mapper output. It uses a hash of the Key to decide
which Reducer gets which data. (e.g., Keys starting with A-I go to Reducer 1, J-R to
Reducer 2, etc.).

 Shuffling: Data is physically transferred over the network so that Reducer 1 only gets
its assigned keys, Reducer 2 gets its keys, and so on.

 Parallel Reducing: All Reducers run simultaneously, each processing a subset of the
data.

 Output:

 You get multiple output files in HDFS (part-r-00000, part-r-00001, part-r-00002).


 Each file is sorted internally, but there is no global sort order across all files (unless
you manually merge them).

 Benefit:

 This is true parallel processing. It is much faster and scalable because the workload is
shared across multiple machines.

 sing. It is much faster and scalable because the workload is shared across multiple
machines.

You might also like