Understanding RecordReader in Hadoop

RecordReader in Hadoop reads raw data from HDFS and generates meaningful key-value pairs that are understandable by mappers. It looks for line starts and ends in the input splits and reads data, retrieving any unfinished lines from other splits. This allows mappers to receive complete records even if they are split across blocks. It handles different input formats like text or databases by reading lines or records respectively and passing them to mappers.

Q1. What is the purpose of RecordReader in Hadoop?

Ans:-

When processing data in Hadoop with classic MapReduce, we use a Mapper and a Reducer.

Both the mapper and the reducer functions take <key, value> pairs as input and produce
<key, value> pairs as output.

Data stored in HDFS is just a raw dump. Hadoop does not care where a line or a record ends; it
simply divides the data into blocks and saves them.

So before a mapper can run, the raw data has to be read and turned into records. This is where
the record reader comes into the picture. Let's consider the text input format for now.

Launched on an input split, the record reader looks for the start of a line in the split and
reads data through the split. If it cannot find the end of the last line within that input split,
it reads the remaining part of the line, remotely if necessary, from the next split.

It generates meaningful key-value data that mappers can understand and passes it to the mapper.

The same applies when the input format is DB: the record reader reads records instead of lines.


For example, in a word count program, suppose my file has data like

******************************I am learning Hadoop. **************************

and while dividing it into blocks, the data got split after "learning", like

************************I am learning in one block and Hadoop. ************* in the next.

The record reader then forms one complete line, *** I am learning Hadoop., and submits it to the
mapper.
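
The boundary handling described above can be sketched in plain Python. This is only a
simulation under stated assumptions, not Hadoop's actual reader: the function name read_split,
the in-memory byte string standing in for an HDFS file, and the sample second line are all
hypothetical. The rule it illustrates is that a reader skips a partial first line (the previous
split owns it) and finishes its last line even when that means reading past its own split.

```python
def read_split(data: bytes, start: int, length: int) -> list[tuple[int, str]]:
    """Return (byte offset, line) records owned by the split [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # A split that does not begin the file may start mid-line. Skip up to and
        # including the first newline; the previous split's reader owns that line.
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1
    records = []
    # Keep reading while a line *starts* at or before the split end, even if
    # finishing it means reading bytes that belong to the next split.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        records.append((pos, data[pos:line_end].decode("utf-8")))
        pos = line_end + 1
    return records


data = b"I am learning Hadoop.\nPig is easy.\n"
cut = data.index(b" Hadoop")  # the block boundary falls right after "learning"
first = read_split(data, 0, cut)
second = read_split(data, cut, len(data) - cut)
print(first)   # the first split still yields the whole first line
print(second)  # the second split starts at the next complete line
```

Every line is produced exactly once: the reader for the first split completes the line that
crosses the boundary, and the reader for the second split discards the fragment it starts in.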

Q2. What happens if the number of reducers is 0?

Setting the number of reducers to 0 means the reduce step is skipped and the mapper output
becomes the final output.

In this case the outputs of the map tasks go directly to the FileSystem, into the output path
set by setOutputPath(Path). The framework does not sort the map outputs before writing them out
to the FileSystem.
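
The difference can be illustrated with a small plain-Python stand-in for a job. This is a
sketch, not Hadoop code: run_job, word_mapper and sum_reducer are hypothetical names, and
passing reducer=None is used here to model a job configured with zero reduce tasks.

```python
from itertools import groupby


def run_job(lines, mapper, reducer=None):
    """Tiny stand-in for a job; reducer=None models zero reduce tasks."""
    map_output = [pair for line in lines for pair in mapper(line)]
    if reducer is None:
        # Map-only job: pairs are written out as emitted, with no sort or shuffle.
        return map_output
    # With a reduce phase, the pairs are sorted and grouped by key first.
    map_output.sort(key=lambda kv: kv[0])
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(map_output, key=lambda kv: kv[0])
    ]


def word_mapper(line):
    return [(word, 1) for word in line.split()]


def sum_reducer(key, values):
    return (key, sum(values))


lines = ["pig hadoop", "hadoop hive"]
map_only = run_job(lines, word_mapper)
with_reduce = run_job(lines, word_mapper, sum_reducer)
print(map_only)
print(with_reduce)
```

In the map-only run the pairs keep the order in which the mapper emitted them and duplicate
keys are not merged; only the run with a reducer sees sorted, grouped keys.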

Q3. Explain joins in Hadoop?


MapReduce Join

Two large datasets can be joined using a MapReduce join. However, this process involves
writing a lot of code to perform the actual join operation.

Joining two datasets begins by comparing the size of each dataset. If one dataset is small
compared to the other, the smaller dataset is distributed to every data node in the cluster.
Once it is distributed, either the mapper or the reducer uses the smaller dataset to look up
matching records from the large dataset and then combines those records to form output
records.
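
The lookup step against the smaller dataset can be sketched as follows. This is a plain-Python
simulation under assumptions: replicated_join and the sample cities/people rows are
hypothetical, and an in-memory dictionary stands in for the copy of the small dataset that
each node would receive.

```python
def replicated_join(small, large):
    """Join records of a large dataset against a small one held fully in memory."""
    # Build the lookup table once from the distributed copy of the small dataset.
    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    joined = []
    for key, value in large:
        # Inner join: records of the large dataset without a match are dropped.
        for match in lookup.get(key, []):
            joined.append((key, match, value))
    return joined


cities = [(1, "Bangalore"), (2, "Noida")]                         # small dataset
people = [(1, "Asha"), (2, "Ravi"), (1, "Meera"), (3, "Kiran")]   # large dataset
result = replicated_join(cities, people)
print(result)
```

Because every task holds the whole small dataset, each record of the large dataset can be
matched locally without moving the large dataset across the network.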

Depending on where the actual join is performed, the join is classified as:

1. Map-Side Join - When the join is performed by the mapper, it is called a map-side
join. In this type, the join is performed before the data is actually consumed by the map
function. The input to each map must be a partition in sorted order. Also, both datasets
must have an equal number of partitions, and they must be sorted by the join key.

2. Reduce-Side Join - When the join is performed by the reducer, it is called a
reduce-side join. In this join the datasets need not be in a structured form (or
partitioned).

Here, the map side emits the join key and the corresponding tuples from both tables. As a
result, all tuples with the same join key land in the same reducer, which then joins the
records with that key.
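
A reduce-side join of this kind can be sketched in plain Python. The names tag_mapper and
reduce_side_join, the "L"/"R" source tags and the sample rows are assumptions made for
illustration; a dictionary of lists stands in for the shuffle that groups tuples by join key.

```python
from collections import defaultdict


def tag_mapper(table_name, rows):
    # Map side: emit (join key, (source tag, tuple)) for every row of a table.
    return [(key, (table_name, value)) for key, value in rows]


def reduce_side_join(left_rows, right_rows):
    # Simulated shuffle: all tagged tuples with the same join key end up together.
    shuffled = defaultdict(list)
    for key, tagged in tag_mapper("L", left_rows) + tag_mapper("R", right_rows):
        shuffled[key].append(tagged)
    joined = []
    for key in sorted(shuffled):
        # Reduce side: separate the tuples by source table, then pair them up.
        lefts = [value for tag, value in shuffled[key] if tag == "L"]
        rights = [value for tag, value in shuffled[key] if tag == "R"]
        joined.extend((key, left, right) for left in lefts for right in rights)
    return joined


cities = [(1, "Bangalore"), (2, "Noida")]
people = [(1, "Asha"), (2, "Ravi"), (1, "Meera"), (3, "Kiran")]
joined_rows = reduce_side_join(cities, people)
print(joined_rows)
```

Neither input has to be sorted or partitioned beforehand; the grouping by join key does that
work, at the cost of sending both datasets through the shuffle.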

The overall process flow is depicted in the diagram below.


Q5. Elaborate on some problems which can only be solved by MapReduce and cannot be solved by Pig?

Let us take a scenario where we want to count the population in two cities. I have a data set
and a census list of different cities, and I want to count the population of two cities using
one MapReduce job. Let us assume that one is Bangalore and the other is Noida. I need the key
for Bangalore to be treated like the key for Noida, so that the population data of these two
cities reaches one reducer. The idea is to instruct the MapReduce program that whenever it
finds a city named Bangalore or a city named Noida, it should use an alias that is common to
both cities, so that both are handled as a common key and passed to the same reducer. For this,
we have to write a custom partitioner.

In MapReduce, when you key the data by city, every distinct city is treated as a different key
by the framework. Hence we need a customized partitioner. MapReduce lets you write your own
partitioner and state that if city = bangalore or noida, the same hash code, and therefore the
same partition, is returned. Pig, however, is a higher-level layer whose scripts are compiled
into MapReduce jobs for us, so we have far less direct control over how the execution engine
partitions data. In such scenarios, MapReduce works better than Pig.
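
The partitioning rule described above can be sketched in plain Python. This is not Hadoop's
Partitioner API: get_partition, the ALIASES table and the alias name bangalore_noida are
hypothetical, and zlib.crc32 is used only as a stable stand-in for a key's hash code.

```python
import zlib

# Hypothetical alias table: both cities are mapped to one common name.
ALIASES = {"bangalore": "bangalore_noida", "noida": "bangalore_noida"}


def get_partition(city: str, num_reducers: int) -> int:
    """Pick the reducer index for a city key; aliased cities share one index."""
    name = city.strip().lower()
    name = ALIASES.get(name, name)
    # A stable hash of the (possibly aliased) name decides the partition.
    return zlib.crc32(name.encode("utf-8")) % num_reducers


for city in ["Bangalore", "Noida", "Pune"]:
    print(city, get_partition(city, 4))
```

Bangalore and Noida always receive the same partition number, so their records reach the same
reducer, while any other city is partitioned by its own name.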

Q6. When are MR jobs more useful than Pig?

Pig is a high-level platform that makes many Hadoop data analysis tasks easier to express. The
language used on this platform is Pig Latin. A program written in Pig Latin is like a query
written in SQL: it needs an execution engine to run it. So when a program is written in Pig
Latin, the Pig compiler converts it into MapReduce jobs. Here, MapReduce acts as the execution
engine.

Q7. How is the splitting of a file invoked in the Hadoop framework?

It is invoked by the Hadoop framework, which calls the getSplits() method of the InputFormat
class (such as FileInputFormat) configured by the user.
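
What such a method returns can be illustrated with a simplified plain-Python sketch. The
function get_splits below is a hypothetical stand-in, not Hadoop's implementation: it only
cuts a file length into fixed-size (offset, length) ranges and ignores the additional rules a
real input format applies, such as files that cannot be split.

```python
def get_splits(file_length: int, split_size: int) -> list[tuple[int, int]]:
    """Cut a file of file_length bytes into (offset, length) input splits."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    offset = 0
    while offset < file_length:
        # The last split may be shorter than split_size.
        length = min(split_size, file_length - offset)
        splits.append((offset, length))
        offset += length
    return splits


print(get_splits(300, 128))
```

Each (offset, length) pair describes the byte range handed to one map task, and the record
reader for that task then turns the bytes in the range into records.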

Q8. Explain what Distributed Cache is in the MapReduce framework?

Distributed Cache is an important feature provided by the MapReduce framework. When you
want to share some files across all nodes in a Hadoop cluster, DistributedCache is used.
The files could be executable JAR files or simple properties files.
