Understanding RecordReader in Hadoop

RecordReader in Hadoop reads raw data from HDFS and generates meaningful key-value pairs that are understandable by mappers. It looks for line starts and ends in the input splits and reads data, retrieving any unfinished lines from other splits. This allows mappers to receive complete records even if they are split across blocks. It handles different input formats like text or databases by reading lines or records respectively and passing them to mappers.

Uploaded by

Saikrishna Tipparapu

Q1. What is the purpose of RecordReader in Hadoop?

Ans:-

Hadoop processes data with a Mapper and a Reducer (classic MapReduce).

Both the mapper and the reducer functions take <key, value> pairs as input and produce <key, value> pairs as output.

HDFS stores data as a plain dump of bytes. It does not care where a line or a record ends; it simply divides the data into blocks and saves them.

So before a mapper can run, the raw data has to be read and turned into records. This is where the record reader comes in. Let us take the text input format as the example for now.

Launched on an input split, the record reader looks for the start of the first line in the split and reads lines until the end of the split. If the last line does not end inside that split, it reads the remaining part of the line from the next split, which may be a remote read.

It turns the raw bytes into meaningful key-value pairs that mappers understand and passes them to the mapper.

The same applies when the input format is a database (DB): the record reader reads records instead of lines.

For example, suppose the input file of a word count program contains the line

I am learning Hadoop.

and, when the file was divided into blocks, the line was split after "learning":

"I am learning" in one block and "Hadoop." in the next.

The record reader then forms the complete line "I am learning Hadoop." and submits it to the mapper.
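The line reassembly described above can be sketched in plain Python. This is a simplified model, not Hadoop code: the function name `read_split` and the byte offsets are assumptions made for illustration. It follows the convention used by Hadoop's line-oriented reader, where a reader whose split does not start at offset 0 skips its first (possibly partial) line, because the reader of the previous split reads past its own end to finish that line.

```python
def read_split(data, start, length):
    """Return (offset, line) records for one split of `data` (bytes).

    Hypothetical helper: models how a line-oriented record reader
    turns a byte range into complete lines.
    """
    end = start + length
    pos = start
    if start != 0:
        # The first line here was already completed by the previous
        # split's reader, so skip to just after the next newline.
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1
    records = []
    # A line that starts at or before `end` belongs to this split,
    # even if it finishes in the next one.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        records.append((pos, data[pos:line_end].decode("utf-8")))
        pos = line_end + 1
    return records


data = b"I am learning Hadoop.\nHDFS stores blocks.\n"
# Cut the file right after "I am learning" (byte 13).
first = read_split(data, 0, 13)
second = read_split(data, 13, len(data) - 13)
```

Here the first split yields the whole line "I am learning Hadoop." although only "I am learning" lies inside it, and the second split starts with the next full line, so no record is lost or read twice.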

Q2. What happens if the number of reducers is 0?

Setting the number of reducers to 0 skips the reduce step entirely, so the mapper output becomes the final output of the job.

In this case the outputs of the map tasks go directly to the FileSystem, into the output path set by FileOutputFormat.setOutputPath(Job, Path). The framework does not sort the map outputs before writing them out to the FileSystem.
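A small Python model can make the difference visible. It is only a sketch: `run_job`, `word_mapper` and `sum_reducer` are hypothetical names, and the in-memory lists stand in for files on the FileSystem. With zero reducers the map output is returned as emitted; otherwise it is grouped and sorted by key before the reducer sees it.

```python
from collections import defaultdict


def run_job(records, mapper, reducer=None, num_reducers=1):
    """Toy single-process model of a MapReduce job (illustration only)."""
    map_output = [pair for record in records for pair in mapper(record)]
    if num_reducers == 0:
        # Map-only job: no shuffle, no sort, map output is the result.
        return map_output
    groups = defaultdict(list)
    for key, value in map_output:
        groups[key].append(value)
    # Keys reach the reducer in sorted order.
    return [reducer(key, groups[key]) for key in sorted(groups)]


def word_mapper(line):
    return [(word, 1) for word in line.split()]


def sum_reducer(word, counts):
    return (word, sum(counts))


lines = ["pig hadoop", "hadoop"]
map_only = run_job(lines, word_mapper, num_reducers=0)
reduced = run_job(lines, word_mapper, sum_reducer, num_reducers=1)
```

In this model `map_only` keeps the emission order and repeats "hadoop", while `reduced` is sorted by key and aggregated.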

Q. Explain joins in Hadoop?

MapReduce Join

Two large datasets can be joined with a MapReduce join. However, this involves writing a lot of code to perform the actual join operation.

Joining two datasets begins by comparing their sizes. If one dataset is much smaller than the other, the smaller dataset is distributed to every datanode in the cluster. Once it is distributed, either the mapper or the reducer uses the smaller dataset to look up matching records from the large dataset and then combines those records to form the output records.
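The lookup-based join in the paragraph above can be sketched as follows. This is an in-memory illustration in Python, not Hadoop code; `replicated_join` and the sample records are assumptions. The small dataset is loaded into a dictionary, which plays the role of the copy distributed to every node, and each record of the large dataset is matched against it.

```python
def replicated_join(large_records, small_records):
    """Join (key, value) records by looking up keys in the small dataset."""
    lookup = {}
    for key, value in small_records:
        lookup.setdefault(key, []).append(value)
    joined = []
    for key, value in large_records:
        # Records without a match in the small dataset are dropped
        # (inner join semantics).
        for match in lookup.get(key, []):
            joined.append((key, value, match))
    return joined


cities = [(1, "Bangalore"), (2, "Noida")]            # small dataset
orders = [(1, "a"), (2, "b"), (1, "c"), (3, "d")]    # large dataset
result = replicated_join(orders, cities)
```

Because each large record is handled independently, this lookup needs no shuffle, which is why it suits the case where one side is small enough to fit in memory.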

Depending on where the actual join is performed, the join is classified as:

1. Map-Side Join - When the join is performed by the mapper, it is called a map-side join. Here the join is performed before the data is actually consumed by the map function. The input to each map must be partitioned and sorted: both datasets must have the same number of partitions, and each must be sorted by the join key.

2. Reduce-Side Join - When the join is performed by the reducer, it is called a reduce-side join. This join does not require the datasets to be in a structured form (or partitioned).

Here, the map side emits the join key and the corresponding tuples of both tables. As a result, all tuples with the same join key reach the same reducer, which then joins the records with that key.

The overall process flow is depicted in the diagram below.
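As a rough illustration of the reduce-side join, the Python sketch below tags each record with its source table on the map side, groups the tagged records by join key, and combines them per key on the reduce side. All function names and the "left"/"right" tags are hypothetical, and the grouping step only imitates what the shuffle does in a real cluster.

```python
from collections import defaultdict


def tag_mapper(table_name, records):
    """Map side: emit (join_key, (source_table, value))."""
    return [(key, (table_name, value)) for key, value in records]


def shuffle(pairs):
    """Stand-in for the shuffle: group all tagged values by join key."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups


def join_reducer(key, tagged_values):
    """Reduce side: pair every left value with every right value."""
    left = [value for tag, value in tagged_values if tag == "left"]
    right = [value for tag, value in tagged_values if tag == "right"]
    return [(key, l, r) for l in left for r in right]


def reduce_side_join(left_records, right_records):
    pairs = tag_mapper("left", left_records) + tag_mapper("right", right_records)
    groups = shuffle(pairs)
    output = []
    for key in sorted(groups):
        output.extend(join_reducer(key, groups[key]))
    return output


users = [(1, "asha"), (2, "ravi")]
orders = [(2, "book"), (1, "pen"), (2, "lamp"), (3, "cup")]
joined = reduce_side_join(users, orders)
```

Neither input has to be sorted or partitioned beforehand, which matches the point above; the cost is that every record travels through the shuffle.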

Q5. Elaborate some problems which can only be solved by MapReduce and cannot be solved by PIG?

Let us take a scenario where we want to count the population of two cities. I have a data set and a sensor list of different cities, and I want to count the population of two cities, say Bangalore and Noida, in one MapReduce job. To do that, the key for Bangalore must be treated like the key for Noida, so that the population data of both cities reaches one reducer. In other words, I have to instruct the MapReduce program: whenever you find a city named Bangalore or a city named Noida, use an alias that is common to both, so that both cities get a common key and are passed to the same reducer. For this, we have to write a custom partitioner.

In MapReduce, when the city is the key, the framework treats every different city as a different key. Hence we need a customized partitioner. MapReduce lets you write your own partitioner and specify that if the city is Bangalore or Noida, the same hash code is returned. However, we cannot create a custom partitioner in Pig. As Pig is not a framework, we cannot direct the execution engine to customize the partitioner. In such scenarios, MapReduce works better than Pig.
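The routing rule can be sketched in Python as below. This only models the idea; it is not Hadoop's Partitioner API. The alias table `CITY_ALIASES`, the function `city_partition` and the use of CRC32 as a stable hash are all assumptions made for the example.

```python
import zlib

# Hypothetical alias table: both cities share one routing key.
CITY_ALIASES = {"bangalore": "bangalore_noida", "noida": "bangalore_noida"}


def city_partition(city, num_reducers):
    """Return the reducer index for a city key."""
    routing_key = CITY_ALIASES.get(city.lower(), city.lower())
    # CRC32 is used because it gives the same value on every run,
    # unlike Python's built-in hash() for strings.
    return zlib.crc32(routing_key.encode("utf-8")) % num_reducers


same_reducer = city_partition("Bangalore", 4) == city_partition("Noida", 4)
```

Note that a partitioner only decides which reducer receives a key. If the two cities must also be summed as a single group, the mapper would additionally have to emit the common alias as the key.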

Q6. When are MR jobs more useful than Pig?

Pig is a high-level platform that makes many Hadoop data analysis tasks easier to execute. The language used on this platform is Pig Latin. A program written in Pig Latin is like a query written in SQL: it needs an execution engine to run it. So when a program is written in Pig Latin, the Pig compiler converts it into MapReduce jobs. Here, MapReduce acts as the execution engine.

Q7. How is the splitting of file invoked in Hadoop framework?

It is invoked by the Hadoop framework, which runs the getSplits() method of the InputFormat class (such as FileInputFormat) defined by the user.
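What such a split computation produces can be illustrated with a simplified Python model (in Hadoop's InputFormat API the corresponding method is getSplits()). The function `compute_splits` is hypothetical. It assumes a fixed split size and the 10% slop that FileInputFormat allows for the last split, and it ignores block locations, minimum and maximum split sizes, and unsplittable files.

```python
SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than split_size


def compute_splits(file_length, split_size):
    """Return (start, length) byte ranges covering a file of file_length bytes."""
    splits = []
    remaining = file_length
    while remaining / split_size > SPLIT_SLOP:
        splits.append((file_length - remaining, split_size))
        remaining -= split_size
    if remaining > 0:
        # Whatever is left becomes the final split.
        splits.append((file_length - remaining, remaining))
    return splits


three_splits = compute_splits(300, 128)
one_split = compute_splits(130, 128)
```

Each (start, length) pair is what a record reader is later launched on; the splits are byte ranges only and say nothing about where lines begin or end.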

Q8. Explain what is distributed Cache in MapReduce Framework?

Distributed Cache is an important feature provided by the MapReduce framework. When you want to share some files across all nodes in a Hadoop cluster, DistributedCache is used. The files could be executable jar files or simple properties files.

