DEPARTMENT OF DATA SCIENCE AND ANALYTICS
24BDA6C20-MAP REDUCE PROGRAMMING LAB MANUAL
LIST OF EXPERIMENTS
1. Exercises to implement Stock count Map reduce program
2 Exercises for implementing sorting technique using Mapreduce
3 Exercises to implement file management tasks using Hadoop
4 Exercises for implementing two different map reduce programs using joins
5 Create a student database in MongoDB with the fields: (SRN, Sname, Degree, Sem, CGPA)and use,
Create collection, insert data, find, find one, sort, limit, skip, distinct, projection (CRUD operations)
6 Create an employee database inMongoDB and use update modifiers
7 Create an employee database in MongoDB and use Create collection, insert data, find, find one,
update, upsert, multi (CRUD operations)
8 Exercise for implementing general commands in HBase
9 Exercise for Table Management Commands in HBase
10 Exercises for Data Manipulation Commands in Hbase
REFERENCES:
Boris Lublinsky Kevin T. Smith Alexey Yakubovich ,ProfessionalHadoop® Solutions,
Wiley, ISBN: 9788126551071, 2015.
HADOOP INTRODUCTION:
Hadoop is an open-source framework that allows to store and process big data in a distributed
environment across clusters of computers using simple programming models. It is designed to
scale up from single servers to thousands of machines, each offering localcomputation and
storage.
Hadoop Architecture:
The Apache Hadoop framework includes following four modules:
Hadoop Common: Contains Java libraries and utilities needed by other Hadoop modules.
These libraries give file system and OS level abstraction and comprise of the essential Java
files and scripts that are required to start Hadoop.
Hadoop Distributed File System (HDFS): A distributed file-system that provides high-
throughput access to application data on the community machines thus providing very high
aggregate bandwidth across the cluster.
Hadoop YARN: A resource-management framework responsible for job scheduling and
cluster resource management.
Hadoop MapReduce: This is a YARN- based programming model for parallel processing of
large data sets.
Hadoop Ecosystem:
Hadoop has gained its popularity due to its ability of storing, analyzing and accessing large
amount of data, quickly and cost effectively through clusters of commodity hardware. It won‘t
be wrong if we say that Apache Hadoop is actually a collection of several components and not
just a single product.
With Hadoop Ecosystem there are several commercial along with an open source products which
are broadly used to make Hadoop laymen accessible and more usable.
MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process big
amounts of data in-parallel on large clusters of commodity hardware in a reliable, fault- tolerant
manner. In terms of programming, there are two functions which are most common in
MapReduce.
The Map Task: Master computer or node takes input and convert it into divide it into smaller
parts and distribute it on other worker nodes. All worker nodes solve their own small problem
and give answer to the master node.
The Reduce Task: Master node combines all answers coming from worker node andforms it
in some form of output which is answer of our big distributed problem.
Generally both the input and the output are reserved in a file-system. The framework is
responsible for scheduling tasks, monitoring them and even re-executes the failed tasks.
Hadoop Distributed File System (HDFS)
HDFS is a distributed file-system that provides high throughput access to data. When data is
pushed to HDFS, it automatically splits up into multiple blocks and stores/replicates the data
thus ensuring high availability and fault tolerance.
Note: A file consists of many blocks (large blocks of 64MB and above).
Here are the main components of HDFS:
Name Node: It acts as the master of the system. It maintains the name system
i.e.,directories and files and manages the blocks which are present on the Data Nodes.
Data Nodes: They are the slaves which are deployed on each machine and provide
theactual storage. They are responsible for serving read and write requests for the
clients.
Secondary Name Node: It is responsible for performing periodic checkpoints. In the
event of Name Node failure, you can restart the Name Node using the checkpoint.
Hive
Hive is part of the Hadoop ecosystem and provides an SQL like interface to Hadoop. It is a
data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries,
and the analysis of large datasets stored in Hadoop compatible file systems.
HBase (Hadoop DataBase)
HBase is a distributed, column oriented database and uses HDFS for the underlying storage.
As said earlier, HDFS works on write once and read many times pattern, but this isn‘t a case
always. We may require real time read/write random access for huge dataset; this is where
HBase comes into the picture. HBase is built on top of HDFS and distributed on column-
oriented database.
INSTALLATION – STYLE -1:
i) Perform setting up and Installing Hadoop in its three operating modes:
• Standalone
• Pseudo Distributed
• Fully Distributed
DESCRIPTION:
Hadoop is written in Java, so you will need to have Java installed on your machine, version 6
or later. Sun's JDK is the one most widely used with Hadoop, although others have been
reported to work.
Hadoop runs on Unix and on Windows. Linux is the only supported production platform, but
other flavors of Unix (including Mac OS X) can be used to run Hadoop for development.
Windows is only supported as a development platform, and additionally requires Cygwin to
run. During the Cygwin installation process, you should include the openssh package if you
plan to run Hadoop in pseudo-distributed mode
ALGORITHM
STEPS INVOLVED IN INSTALLING HADOOP IN STANDALONE MODE:-
1. Command for installing ssh is “sudo apt-get install ssh”.
2. Command for key generation is ssh-keygen –t rsa –P “ ”.
3. Store the key into [Link] by using the command cat $HOME/.ssh/id_rsa.pub >>
$HOME/.ssh/authorized_keys
4. Extract the java by using the command tar xvfz [Link].
5. Extract the eclipse by using the command tar xvfz eclipse-jee-mars-R-linux-
[Link]
6. Extract the hadoop by using the command tar xvfz [Link]
7. Move the java to /usr/lib/jvm/ and eclipse to /opt/ paths. Configure the java path in
the [Link] file
8. Export java path and hadoop path in ./bashrc
9. Check the installation successful or not by checking the java version and hadoop
version
10. Check the hadoop instance in standalone mode working correctly or not by using an
implicit hadoop jar file named as word count.
11. If the word count (EXAMPLE PROGRAM) is displayed correctly in part-r-00000 file
it means that standalone mode is installed successfully.
ALGORITHM
STEPS INVOLVED IN INSTALLING HADOOP IN PSEUDO DISTRIBUTED
MODE:-
1. In order install pseudo distributed mode we need to configure the hadoop configuration
files resides in the directory /home/lendi/hadoop-2.7.1/etc/hadoop.
2. First configure the [Link] file by changing the java path.
3. Configure the [Link] which contains a property tag, it contains name and value.
Name as [Link] and value as hdfs://localhost:9000
4. Configure [Link].
5. Configure [Link].
6. Configure [Link] before configure the copy [Link] to mapred-
[Link].
7. Now format the name node by using command hdfs namenode –format.
8. Type the command [Link],[Link] means that starts the daemons like
NameNode,DataNode,SecondaryNameNode ,ResourceManager,NodeManager.
9. Run JPS which views all daemons. Create a directory in the hadoop by using command
hdfs dfs –mkdr /csedir and enter some data into [Link] using command nano [Link] and
copy from local directory to hadoop using command hdfs dfs – copyFromLocal [Link]
/csedir/and run sample jar file wordcount to check whether pseudo distributed mode is
working or not.
10. Display the contents of file by using command hdfs dfs –cat /newdir/part-r-00000.
FULLY DISTRIBUTED MODE INSTALLATION: ALGORITHM
1. Stop all single node clusters
$[Link]
2. Decide one as NameNode (Master) and remaining as DataNodes(Slaves).
3. Copy public key to all three hosts to get a password less SSH access
$ssh-copy-id –I $HOME/.ssh/id_rsa.pub lendi@l5sys24
4. Configure all Configuration files, to name Master and Slave Nodes.
$cd $HADOOP_HOME/etc/hadoop
$nano [Link]
$ nano [Link]
5. Add hostnames to file slaves and save it.
$ nano slaves
6. Configure $ nano [Link]
7. Do in Master Node
$ hdfs namenode –format
$ [Link]
$[Link]
8. Format NameNode
9. Daemons Starting in Master and Slave Nodes
10. END
INPUT
ubuntu @localhost> jps
OUTPUT:
Data node, name nodem Secondary name node, NodeManager, Resource Manager
INSTALLATION STYLE 2:
Installation Steps –
Following are steps to Install Apache Hadoop on Ubuntu 14.04
Step1. Install Java (OpenJDK) - Since hadoop is based on java, make sure you have
java jdkinstalled on the system. Please check the version of java (It should be 1.7 or
$ java –version
above it)
If it returns "The program java can be found in the following packages", If Java
isn't beeninstalled yet, so execute the following command:
$sudo apt-get install default-jdk
Step2: Configure Apache Hadoop
1. Open bashrc in gedit mode
1. Open
$sudo geditbashrc file in geditor
~/.bashrc
2. Set java environment variable
export JAVA_HOME=/usr/jdk1.7.0_45/
3. Set Hadoop environment variable
export HADOOP_HOME=/usr/Hadoop 2.6/
4. Apply environment variables
$source ~/.bashrc
Step3: Install eclipse
Step4: Copy Hadoop plug-ins such as
• [Link]
• [Link]
• [Link] from release folder of hadoop2x-eclipse-plugin-master to
eclipse plugins
Step5: In eclipse, start new MapReduce project
File->new->other->MapReduce project
Step 6: Copy Hadoop packages such as [Link] [Link]
in src file ofMapReduce project
Step 7: Create Mapper, Reducer, and driver
Inside a project->src->File->new->other->Mapper/Reducer/Driver
Step 8: Copy Log file [Link] from src file of hadoop in src file of MapReduce project
Hadoop is powerful because it is extensible and it is easy to integrate with any component. Its
popularity is due in part to its ability to store, analyze and access large amounts of data, quickly
and cost effectively across clusters of commodity hardware.
Apache Hadoop is not actually a single product but instead a collection of several
components. When all these componentsare merged, it makes the Hadoop very user friendly.
Adding Files and Directories to HDFS
Before you can run Hadoop programs on data stored in HDFS, you‘ll need to put the data into
HDFS first. Let‘s create a directory and put a file in it. HDFS has a default working directory of
/user/$USER, where $USER is your login user name. This directory isn‘t automatically created for
you, though, so let‘s create it with the mkdir command. For the purpose of illustration, we use
chuck. You should substitute your user name in the example commands.
hadoop fs -mkdir /user/chuck
hadoop fs -put [Link]
hadoop fs -put [Link] /user/chuck
Retrieving Files from HDFS
The Hadoop command get copies files from HDFS back to the local filesystem. To retrieve
[Link], we can run the following command:
hadoop fs -cat [Link]
Deleting Files from HDFS
hadoop fs -rm [Link]
• Command for creating a directory in hdfs is “hdfs dfs –mkdir /lendicse”.
• Adding directory is done through the command “hdfs dfs –put lendi_english /”.
Copying Data from NFS to HDFS
Copying from directory command is “hdfs dfs –copyFromLocal
/home/lendi/Desktop/shakes/glossary /lendicse/”
• View the file by using the command “hdfs dfs –cat /lendi_english/glossary”
• Command for listing of items in Hadoop is “hdfs dfs –ls hdfs://localhost:9000/”.
• Command for Deleting files is “hdfs dfs –rm r /kartheek”.
EXPECTED OUTPUT:
SAMPLE EXERCISE-
AIM: To Develop a MapReduce program to calculate the frequency of a given word in
agiven file
Map Function – It takes a set of data and converts it into another set of data, where
individual
elements are broken down into tuples (Key-Value pair).
Example – (Map function in Word Count)
Input
Set of data
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN,BUS, buS, caR, CAR, car, BUS,
TRAIN
Output
Convert into another set of data
(Key,Value)
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1),(BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)
Reduce Function – Takes the output from Map as an input and combines those data tuples
into a smaller set of tuples.
Example – (Reduce function in Word Count)
Input Set of Tuples
(output of Map function)
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1),(BUS,1),
(buS,1),(caR,1),(CAR,1), (car,1), (BUS,1), (TRAIN,1)
Output Converts into smaller set of tuples
(BUS,7), (CAR,7), (TRAIN,4)
Work Flow of Program
Workflow of MapReduce consists of 5 steps
1. Splitting – The splitting parameter can be anything, e.g. splitting by space,
comma, semicolon, or even by a new line (‘\n’).
2. Mapping – as explained above
3. Intermediate splitting – the entire process in parallel on different clusters. In order
to group them in “Reduce Phase” the similar KEY data should be on same cluster.
4. Reduce – it is nothing but mostly group by phase
5. Combining – The last phase where all the data (individual result set from each
cluster) is combine together to form a Result.
Now Let’s See the Word Count Program in Java
Make sure that Hadoop is installed on your system with java idk Steps to follow
Step 1. Open Eclipse> File > New > Java Project > (Name it – MRProgramsDemo) >
Finish
Step 2. Right Click > New > Package ( Name it - PackageDemo) > Finish
Step 3. Right Click on Package > New > Class (Name it - WordCount)
Step 4. Add Following Reference Libraries –
Right Click on Project > Build Path> Add External Archivals
• /usr/lib/hadoop-0.20/[Link]
• Usr/lib/hadoop-0.20/lib/[Link]
SOURCE CODE:
package PackageDemo;
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
public class WordCount {
public static void main(String [] args) throws Exception
{
Configuration c=new Configuration();
String[] files=new GenericOptionsParser(c,args).getRemainingArgs();
Path input=new Path(files[0]);
Path output=new Path(files[1]);
Job j=new Job(c,"wordcount");
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link](j, input);
[Link](j, output);
[Link]([Link](true)?0:1);
}
public static class MapForWordCount extends Mapper<LongWritable, Text, Text,
IntWritable>{
public void map(LongWritable key, Text value, Context con) throws IOException,
InterruptedException
{
String line = [Link]();
String[] words=[Link](",");
for(String word: words )
{
Text outputKey = new Text([Link]().trim());
IntWritable outputValue = new IntWritable(1);
[Link](outputKey, outputValue);
}
}
}
public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text,
IntWritable>
{
public void reduce(Text word, Iterable<IntWritable> values, Context con) throws
IOException,
InterruptedException
{
int sum = 0;
for(IntWritable value : values)
{
sum += [Link]();
}
[Link](word, new IntWritable(sum));
}
}
}
Make Jar File
Right Click on Project> Export> Select export destination as Jar File > next> Finish
To Move this into Hadoop directly, open the terminal and enter the following
commands:
[training@localhost ~]$ hadoop fs -put wordcountFile wordCountFile
Run Jar file
(Hadoop jar [Link] [Link] PathToInputTextFile
PathToOutputDirectry)
[training@localhost ~]$ Hadoop jar [Link]
[Link] wordCountFile MRDir1
Result: Open Result
[training@localhost ~]$ hadoop fs -ls MRDir1
Found 3 items
-rw-r--r-- 1 training supergroup
0 2016-02-23 03:36 /user/training/MRDir1/_SUCCESS
drwxr-xr-x - training supergroup
0 2016-02-23 03:36 /user/training/MRDir1/_logs
-rw-r--r-- 1 training supergroup
20 2016-02-23 03:36 /user/training/MRDir1/part-r-00000
[training@localhost ~]$ hadoop fs -cat MRDir1/part-r-00000
BUS 7
CAR 4
TRAIN 6