
Installing Hadoop: Setup Guide

The document describes the steps to set up Hadoop in pseudo-distributed and fully distributed modes. For pseudo-distributed mode, it involves downloading Hadoop, extracting files, configuring environment variables, creating folders for namenode and datanode, editing configuration files, formatting namenode, and launching processes. For fully distributed mode, it involves extracting files, creating symlinks, modifying .bashrc, configuring hadoop-env.sh, running test jobs, and reconfiguring XML files for distributed operation.

Uploaded by

Mrunal Bhilare

Experiment no: 1.

Perform setting up and Installing Hadoop in its two operating modes


a) Pseudo Distributed
b) Fully Distributed

a) Pseudo Distributed

Step 1: Download Binary Package :


Download the latest Hadoop binary package from the following site.
[Link]

For reference, the file is saved to the following folder.
C:\BigData

Step 2: Unzip the binary package


Open Git Bash, change directory (cd) to the folder where you saved the
binary package, and then unzip it as follows.
$ cd /c/BigData
$ tar -xvzf [Link]
In my case, the Hadoop binary is extracted to C:\BigData\hadoop-3.1.2.
Next, go to this GitHub repo and download the bin folder as a zip.
Extract the zip and copy all the files present under the bin folder to
C:\BigData\hadoop-3.1.2\bin. Replace the existing files too.
Step 3: Create folders for datanode and namenode :
● Go to C:/BigData/hadoop-3.1.2 and create a folder 'data'.
Inside the 'data' folder, create two folders, 'datanode'
and 'namenode'. Your files on HDFS will reside under the
datanode folder.

● Set Hadoop Environment Variables


● Hadoop requires the following environment variables to be set.
HADOOP_HOME="C:\BigData\hadoop-3.1.2"
HADOOP_BIN="C:\BigData\hadoop-3.1.2\bin"
JAVA_HOME="<Root of your JDK installation>"
● To set these variables, navigate to My Computer or This PC.
Right-click -> Properties -> Advanced System settings ->
Environment variables.
● Click New to create a new environment variable.

● If you don't have Java 1.8 installed, you'll have to download
and install it first. If the JAVA_HOME environment variable is
already set, check whether the path has any spaces in it
(e.g. C:\Program Files\Java\… ). Spaces in the JAVA_HOME path
will cause problems. There is a trick to get around it: replace
'Program Files' with 'Progra~1' in the variable value. Ensure
that the version of Java is 1.8 and that JAVA_HOME points to
JDK 1.8.
Step 4: To make Short Name of Java Home path

● Set Hadoop Environment Variables


● Edit PATH Environment Variable
● Click on New and add %JAVA_HOME%, %HADOOP_HOME%,
%HADOOP_BIN%, and %HADOOP_HOME%\sbin to your PATH one by
one.
● Now we have set the environment variables, we need to validate
them. Open a new Windows Command prompt and run an echo
command on each variable to confirm they are assigned the
desired values.

echo %HADOOP_HOME%
echo %HADOOP_BIN%
echo %PATH%
● If the variables are not initialized yet, it is likely because
you are testing them in an old session. Make sure you have
opened a new command prompt to test them.
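
The same check can also be done from Java, which is handy because Hadoop itself runs on the JVM. The following is only a minimal sketch: the class HadoopEnvCheck and its method findProblems are hypothetical helpers written for this guide, not part of Hadoop. It reports any of the three variables named above that are unset, and flags a JAVA_HOME value that contains spaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; it is not part of Hadoop.
class HadoopEnvCheck {

    // Variable names taken from the steps above.
    static final String[] REQUIRED = {"HADOOP_HOME", "HADOOP_BIN", "JAVA_HOME"};

    // Returns one message per problem found in the given environment.
    static List<String> findProblems(Map<String, String> env) {
        List<String> problems = new ArrayList<String>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (value == null || value.trim().isEmpty()) {
                problems.add(name + " is not set");
            } else if (name.equals("JAVA_HOME") && value.contains(" ")) {
                problems.add(name + " contains spaces: " + value);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // System.getenv() only sees the environment of the session that
        // started this program, so run it from a newly opened prompt.
        List<String> problems = findProblems(System.getenv());
        if (problems.isEmpty()) {
            System.out.println("All required variables look fine.");
        }
        for (String problem : problems) {
            System.out.println(problem);
        }
    }
}
```

Passing the environment in as a Map keeps the check easy to try with made-up values before relying on the real session.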
Step 5: Configure Hadoop
Once environment variables are set up, we need to configure Hadoop by
editing the following configuration files.
[Link]
[Link]
[Link]
[Link]
[Link]
[Link]

First, let’s configure the Hadoop environment file. Open


C:\BigData\hadoop-3.1.2\etc\hadoop\[Link] and add below
content at the bottom
set HADOOP_PREFIX=%HADOOP_HOME%
set HADOOP_CONF_DIR=%HADOOP_PREFIX%\etc\hadoop
set YARN_CONF_DIR=%HADOOP_CONF_DIR%
set PATH=%PATH%;%HADOOP_PREFIX%\bin
Step 6: Edit [Link]
After editing [Link], you need to set the replication factor and the
location of the namenode and datanodes. Open
C:\BigData\hadoop-3.1.2\etc\hadoop\[Link] and add the content below
within the <configuration> </configuration> tags.
<configuration>
<property>
<name>[Link]</name>
<value>1</value>
</property>
<property>
<name>[Link]</name>
<value>C:\BigData\hadoop-3.1.2\data\namenode</value>
</property>
<property>
<name>[Link]</name>
<value>C:\BigData\hadoop-3.1.2\data\datanode</value>
</property>
</configuration>
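
Every Hadoop configuration file in this guide has the same shape: a <configuration> root holding <property> elements, each with a <name> and a <value>. A minimal sketch of reading that shape with the XML parser that ships with the JDK is given here, so that a hand-edited file can be checked for typos. The class ConfigSketch is hypothetical, and the property names in the sample (example.replication, example.data.dir) are placeholders chosen for illustration, not real Hadoop keys.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; Hadoop has its own Configuration class.
class ConfigSketch {

    // Reads every <property> element and returns name -> value in file order.
    static Map<String, String> readProperties(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> result = new LinkedHashMap<String, String>();
            NodeList properties = doc.getElementsByTagName("property");
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                String name = property.getElementsByTagName("name")
                        .item(0).getTextContent().trim();
                String value = property.getElementsByTagName("value")
                        .item(0).getTextContent().trim();
                result.put(name, value);
            }
            return result;
        } catch (Exception e) {
            // Malformed XML ends up here.
            throw new IllegalArgumentException("Could not parse configuration", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder property names; substitute the ones from your own file.
        String xml = "<configuration>"
                + "<property><name>example.replication</name><value>1</value></property>"
                + "<property><name>example.data.dir</name>"
                + "<value>C:\\BigData\\hadoop-3.1.2\\data\\datanode</value></property>"
                + "</configuration>";
        for (Map.Entry<String, String> entry : readProperties(xml).entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

To check a real file, its contents could be read into a String first and passed to readProperties.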
Step 7: Edit [Link]
Now, configure Hadoop Core's settings. Open
C:\BigData\hadoop-3.1.2\etc\hadoop\[Link] and add the content below
within the <configuration> </configuration> tags.

<configuration>
<property>
<name>[Link]</name>
<value>hdfs://[Link]:19000</value>
</property>
</configuration>

Step 8: YARN configurations


Edit the file [Link].
Make sure the following entries exist.
<configuration> <property>
<name>[Link]-services</name>
<value>mapreduce_shuffle</value> </property>
<property>
<name>[Link]
a ss</name>
<value>[Link]</value
> </property>
</configuration>

Step 9: Edit [Link]


Finally, let's configure properties for the MapReduce framework.
Open C:\BigData\hadoop-3.1.2\etc\hadoop\[Link] and add the content
below within the <configuration> </configuration> tags. If you don't
see [Link], then open the mapred [Link] file and rename it to
[Link].
<configuration>
<property>
<name>[Link]</name>
<value>%USERNAME%</value>
</property>
<property>
<name>[Link]</name>
<value>yarn</value>
</property>
<property>
<name>[Link]</name>
<value>/user/%USERNAME%/staging</value>
</property>
<property>
<name>[Link]</name>
<value>local</value>
</property>
</configuration>

Check whether the C:\BigData\hadoop-3.1.2\etc\hadoop\slaves file is present. If it
is not, create one, add localhost to it, and save it.
Step 10: Format Name Node :
To format the Name Node, open a new Windows Command Prompt
and run the command below. It may give you a few warnings;
ignore them.
● hadoop namenode -format

Format Hadoop Name Node

Step 11: Launch Hadoop :


Open a new Windows Command Prompt, making sure to run it as an
Administrator to avoid permission errors. Once it is open, execute the
start [Link] command. Since we have added %HADOOP_HOME%\sbin to the
PATH variable, you can run this command from any folder. If you haven't
done so, go to the %HADOOP_HOME%\sbin folder and run the command.

Four new cmd terminal windows will open, one for each of the 4 daemon
processes, as follows.
● namenode
● datanode
● node manager
● resource manager
Don’t close these windows, minimize them. Closing the windows will
terminate the daemons. You can run them in the background if you don’t
like to see these windows.

Step 12: Hadoop Web UI


Finally, let's check how the Hadoop daemons are doing. You can also
use the Web UI for a wide range of administrative and monitoring
purposes. Open your browser and get started.
Step 13: Resource Manager
Open localhost:8088 to open Resource Manager
Step 14: Node Manager
Open localhost:8042 to open Node Manager

Step 15: Name Node :


Open localhost:9870 to check out the health of Name Node
Step 16: Data Node :
Open localhost:9864 to check out Data Node
b) Fully Distributed

Installing and configuration of Hadoop in Standalone Mode

Setup

The following are the steps to install Hadoop 2.4.1 in pseudo distributed mode.

Step 1 − Extract all downloaded files:

The following command is used to extract files on command prompt:

Command: cd Downloads

Step 2 − Create soft links (shortcuts).

The following command is used to create shortcuts:

Command: ln -s ./Downloads/hadoop-2.7.2/ ./hadoop

Step 3 − Configure .bashrc

The following command opens .bashrc so that the PATH variable can be modified in the bash shell.

Command: vi ./.bashrc

The following lines export the variables and extend the path:

export HADOOP_HOME=/home/luck/hadoop

export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Step 4 − Configure Hadoop in Stand-alone mode:

The following command is used to edit Hadoop's [Link] file.

Command: vi ./hadoop/etc/hadoop/[Link]

export JAVA_HOME=/home/luck/jdk

Step 5 − Exit and re-open the command prompt

Step 6 − Run a Hadoop job on Standalone cluster

To test Hadoop, run the hadoop command. The usage message must be displayed.

Step 7 − Go to the directory you have downloaded the compressed Hadoop file and
unzip using terminal

Command: $ tar -xzvf [Link]

Step 8 − Go to the Hadoop distribution directory.

Command: $ cd /usr/local/hadoop
Final output after installation and configuration:

By default, Hadoop is configured in standalone mode and runs in a non-distributed
mode on a single physical system. If the setup is installed and configured properly, then the
following result is displayed on the command prompt:

Hadoop 2.4.1
Subversion [Link] -r
1529768 Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4

The above result shows that the Hadoop standalone mode setup is working. The
following XML files must be reconfigured in order to develop Hadoop in Java:

[Link]

[Link]

[Link]

[Link]:

The [Link] file contains information regarding memory allocated for the file
system, the port number used for Hadoop instance, size of Read/Write buffers, and
memory limit for storing the data.

Open [Link] with the following command and add the properties listed below
in between the <configuration> and </configuration> tags in this file.

Command: ~$ sudo gedit $HADOOP_HOME/etc/hadoop/[Link]

<property>
<name>[Link]</name>
<value>/app/hadoop/tmp</value>
<description>Parent directory for other temporary directories.</description>
</property>
<property>
<name>[Link]</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system.</description>
</property>

[Link]:

This [Link] file is used to specify the MapReduce framework currently in use.

Open the [Link] file with the following command and add the following properties
in between the <configuration> and </configuration> tags in this file.

Command: ~$ sudo gedit $HADOOP_HOME/etc/hadoop/[Link]

<property>
<name>[Link]</name>
<value>localhost:54311</value>
<description>MapReduce job tracker runs at this host and port.</description>
</property>

[Link]:

The [Link] file contains information regarding the namenode path, the datanode
paths on the local file systems, the replication value, etc. In other words, it names the
place where you want to store the Hadoop infrastructure.

Open the [Link] file with the following command and add the properties listed below
in between the <configuration> and </configuration> tags.

Command: ~$ sudo gedit $HADOOP_HOME/etc/hadoop/[Link]

<property>
<name>[Link]</name>
<value>1</value>
<description>Default block replication.</description>
</property>
<property>
<name>[Link]</name>
<value>/home/hduser_/hdfs</value>
</property>

Conclusion: Performed setting up and installing Hadoop in its two
operating modes: a) Pseudo Distributed b) Fully Distributed
Nutan College of Engineering and Research
Department of Computer Science.

Name:- Kiran Raghunath Patil.

Year:- 4th Yr. Div:- B.


Roll no. :- 29.
Subject:- Big Data Analytics.

Experiment no: 2.

Implement the following file management tasks in Hadoop:


a)Adding files and directories
b)Retrieving files
c)Deleting files

a) Adding files and directories


Below mentioned steps are followed to insert the required file in the Hadoop file
system.

Step1: Create an input directory

$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input


Step2: Use the put command to transfer and store the data file from the local system
in HDFS, using the following command in the terminal.

$ $HADOOP_HOME/bin/hadoop fs -put /home/[Link] /user/input


Step3: Verify the file using ls command.

$ $HADOOP_HOME/bin/hadoop fs -ls /user/input

b) Retrieving files

For instance, suppose you have a file in HDFS called intellipaat. Retrieve the required
file from the Hadoop file system as follows:

Step1: View the data from HDFS using the cat command.

$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/intellipaat


Step2: Get the file from HDFS to the local file system using the get command as shown
below

$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/

c) Deleting files

To delete a file in HDFS you would generally use the “rm” command. Here is an
example:

hdfs dfs -rm hdfs://nn1/file

You can use the “-r” or “-R” flags to delete recursively. This is needed for deleting
directories. Everything within the directory will be removed. Here is another
example:

hdfs dfs -rm -R hdfs://nn1/user/hadoop/emptydir

You can explicitly specify more than one path to delete with a single command like
this:

hdfs dfs -rm hdfs://nn1/file hdfs://nn1/user/hadoop/emptydir

If you want to delete files permanently without sending them to the trash you can use
the “-skipTrash” option. This is great for situations where you are over quota. Keep in
mind the trash may not even be enabled in the first place, so you should be aware of
that. Here is an example:

hdfs dfs -rm -skipTrash hdfs://nn1/file

If you want you can specify that the command should ask for confirmation before
deleting when the number of files that will be deleted is over a certain threshold. That
threshold is defined in this variable: [Link]. You can
enable asking for confirmation with this flag: “-safely”. Here is an example

hdfs dfs -rm -safely hdfs://nn1/file


Conclusion: Implemented the following file management tasks in
Hadoop: a) Adding files and directories b) Retrieving files
c) Deleting files

Experiment no: 3.

To understand the overall programming architecture using MapReduce API. ->
We will take a close look at the classes and their methods that are involved in the
operations of MapReduce programming. We will primarily keep our focus on the
following −
● JobContext Interface
● Job Class
● Mapper Class
● Reducer Class

JobContext Interface
The JobContext interface is the super interface for all the classes that define
different jobs in MapReduce. It gives you a read-only view of the job that is provided
to the tasks while they are running.
The following are the sub-interfaces of the JobContext interface.
[Link]. Subinterface Description

1. MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT>

Defines the context that is given to the Mapper.

2. ReduceContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT>

Defines the context that is passed to the Reducer.

Job class is the main class that implements the JobContext interface.
Job Class
The Job class is the most important class in the MapReduce API. It allows the user
to configure the job, submit it, control its execution, and query the state. The set
methods only work until the job is submitted; after that they throw an
IllegalStateException.
Normally, the user creates the application, describes the various facets of the job,
and then submits the job and monitors its progress.
Here is an example of how to submit a job −
// Create a new Job
Job job = new Job(new Configuration());
[Link]([Link]);

// Specify various job-specific parameters


[Link]("myjob");
[Link](new Path("in"));
[Link](new Path("out"));

[Link]([Link]);
[Link]([Link]);

// Submit the job, then poll for progress until the job is complete
[Link](true);
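
The rule that set methods stop working once the job has been submitted can be pictured with a small stand-in class. This is only a sketch of the idea: SketchJob and its two states are hypothetical and written for this guide; it is not Hadoop's Job class and does not use the Hadoop API.

```java
// Hypothetical stand-in for illustration only; it is NOT Hadoop's Job class.
class SketchJob {

    private enum State { DEFINE, RUNNING }

    private State state = State.DEFINE;
    private String jobName = "";

    // Setters are only legal while the job is still being defined.
    void setJobName(String name) {
        ensureDefineState();
        this.jobName = name;
    }

    String getJobName() {
        return jobName;
    }

    // Submitting moves the job out of the DEFINE state.
    void submit() {
        ensureDefineState();
        state = State.RUNNING;
    }

    private void ensureDefineState() {
        if (state != State.DEFINE) {
            throw new IllegalStateException(
                    "Job in state " + state + " instead of DEFINE");
        }
    }

    public static void main(String[] args) {
        SketchJob job = new SketchJob();
        job.setJobName("myjob");       // allowed: not submitted yet
        job.submit();
        try {
            job.setJobName("renamed"); // rejected: already submitted
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
        System.out.println("Name is still " + job.getJobName());
    }
}
```

The practical consequence is that all configuration calls belong before the submit step in a driver program.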
Constructors
The following is the constructor summary of the Job class.
[Link] Constructor Summary

1 Job()

2 Job(Configuration conf)

3 Job(Configuration conf, String jobName)

Methods
Some of the important methods of Job class are as follows −
[Link] Method Description

1 getJobName()

User-specified job name.

2 getJobState()

Returns the current state of the Job.

3 isComplete()

Checks if the job is finished or not.

4 setInputFormatClass()

Sets the InputFormat for the job.


5 setJobName(String name)

Sets the user-specified job name.

6 setOutputFormatClass()

Sets the Output Format for the job.

7 setMapperClass(Class)

Sets the Mapper for the job.

8 setReducerClass(Class)

Sets the Reducer for the job.

9 setPartitionerClass(Class)

Sets the Partitioner for the job.

10 setCombinerClass(Class)

Sets the Combiner for the job.

Mapper Class
The Mapper class defines the Map job. It maps input key-value pairs to a set of
intermediate key-value pairs. Maps are the individual tasks that transform the input
records into intermediate records. The transformed intermediate records need not
be of the same type as the input records. A given input pair may map to zero or
many output pairs.

Method
map is the most prominent method of the Mapper class. The syntax is defined
below −
map(KEYIN key, VALUEIN value,
[Link] context)
This method is called once for each key-value pair in the input
split.
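
To make the "zero or many output pairs" point concrete, here is a minimal sketch in plain Java. It does not use the Hadoop API: MapStepSketch and its Pair class are hypothetical stand-ins, with long and String playing the role of the input key and value types and Pair playing the role of an intermediate (word, count) record.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Plain-Java illustration of a map step; not the Hadoop Mapper class.
class MapStepSketch {

    // One intermediate (word, 1) record.
    static final class Pair {
        final String key;
        final int value;

        Pair(String key, int value) {
            this.key = key;
            this.value = value;
        }

        @Override
        public String toString() {
            return "(" + key + ", " + value + ")";
        }
    }

    // Called once per input record: the key is the line's offset in the
    // file, the value is the line text. The output type differs from the
    // input type, and a line can yield any number of pairs.
    static List<Pair> map(long offset, String line) {
        List<Pair> out = new ArrayList<Pair>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new Pair(tokenizer.nextToken(), 1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map(0L, "to be or not to be")); // six pairs
        System.out.println(map(19L, ""));                  // no pairs
    }
}
```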

Reducer Class
The Reducer class defines the Reduce job in MapReduce. It reduces a set of
intermediate values that share a key to a smaller set of values. Reducer
implementations can access the Configuration for a job via the
[Link]() method. A Reducer has three primary phases −
Shuffle, Sort, and Reduce.
● Shuffle − The Reducer copies the sorted output from each Mapper using
HTTP across the network.
● Sort − The framework merge-sorts the Reducer inputs by keys (since different
Mappers may have output the same key). The shuffle and sort phases occur
simultaneously, i.e., while outputs are being fetched, they are merged.
● Reduce − In this phase the reduce(Object, Iterable, Context) method is called
for each <key, (collection of values)> in the sorted inputs.

Method
reduce is the most prominent method of the Reducer class. The syntax is defined
below −
reduce(KEYIN key, Iterable<VALUEIN> values,
[Link] context)
This method is called once for each key on the collection of key-value pairs.
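
The three phases can be imitated in a few lines of plain Java. This is a single-process sketch under simplifying assumptions, not how Hadoop implements them: there is no network copy, a TreeMap stands in for the merge-sort by key, and the class ReducePhasesSketch with its helper methods is hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process illustration of shuffle, sort and reduce; not the Hadoop API.
class ReducePhasesSketch {

    // Shuffle + sort: collect the pairs from every mapper and group the
    // values by key. The TreeMap keeps the keys in sorted order.
    static Map<String, List<Integer>> shuffleAndSort(
            List<List<Map.Entry<String, Integer>>> mapperOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (List<Map.Entry<String, Integer>> output : mapperOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                List<Integer> values = grouped.get(pair.getKey());
                if (values == null) {
                    values = new ArrayList<Integer>();
                    grouped.put(pair.getKey(), values);
                }
                values.add(pair.getValue());
            }
        }
        return grouped;
    }

    // Reduce: called once per key with all values that share that key.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    static Map<String, Integer> run(
            List<List<Map.Entry<String, Integer>>> mapperOutputs) {
        Map<String, Integer> result = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> group
                : shuffleAndSort(mapperOutputs).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    static Map.Entry<String, Integer> pair(String key, int value) {
        return new SimpleEntry<String, Integer>(key, value);
    }

    public static void main(String[] args) {
        // Two mappers emitted the same key "be".
        List<Map.Entry<String, Integer>> mapper1 = Arrays.asList(pair("to", 1), pair("be", 1));
        List<Map.Entry<String, Integer>> mapper2 = Arrays.asList(pair("be", 1), pair("or", 1));
        System.out.println(run(Arrays.asList(mapper1, mapper2))); // {be=2, or=1, to=1}
    }
}
```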

Conclusion: Studied the overall programming architecture
using the MapReduce API.

Experiment no: 4.
Store the basic information about students such as roll no, name, date of
birth and address of student using various collection types such as List, Set
and Map. ->
import [Link].*;

class ArrayListExample {
    public static void main(String args[]) {
        ArrayList<String> al = new ArrayList<String>(); // creating array list
        [Link]("Jack, 1, 04-01-1999, chicago"); // adding elements
        [Link]("Tyler, 2, 24-09-1998, new york");
        Iterator<String> itr = [Link]();
        while ([Link]()) {
            [Link]([Link]());
        }
    }
}

class LinkedlistExample {
    public static void main(String args[]) {
        LinkedList<String> al = new LinkedList<String>(); // creating linked list
        [Link]("Rachit, 5, 04-01-2000, delhi"); // adding elements
        [Link]("Rahul, 6, 22-05-1999, mumbai");
        [Link]("Rajat, 7, 14-08-2001, delhi");
        Iterator<String> itr = [Link]();
        while ([Link]()) {
            [Link]([Link]());
        }
    }
}

class QueueExample {
    public static void main(String args[]) {
        PriorityQueue<String> queue = new PriorityQueue<String>(); // creating priority queue
        [Link]("Amit, 22, 25-09-2001, rajasthan"); // adding elements
        [Link]("Rachit, 5, 04-01-2000, delhi");
        [Link]("Rahul, 6, 22-05-1999, mumbai");
        [Link]("head:" + [Link]());
        [Link]("head:" + [Link]());
        [Link]("iterating the queue elements:");
        Iterator<String> itr = [Link]();
        while ([Link]()) {
            [Link]([Link]());
        }
        [Link]();
        [Link]();
        [Link]("after removing two elements:");
        Iterator<String> itr2 = [Link]();
        while ([Link]()) {
            [Link]([Link]());
        }
    }
}

class HashsetExample {
    public static void main(String args[]) {
        HashSet<String> al = new HashSet<String>(); // creating hashSet
        [Link]("Rachit, 5, 04-01-2000, delhi"); // adding elements
        [Link]("Amit, 22, 25-09-2001, rajasthan");
        [Link]("jack, 1, 04-01-1999, chicago");
        Iterator<String> itr = [Link]();
        while ([Link]()) {
            [Link]([Link]());
        }
    }
}

class LinkedHashsetExample {
    public static void main(String args[]) {
        LinkedHashSet<String> al = new LinkedHashSet<String>(); // creating linkedhashset
        [Link]("Mariana, 2, 24-02-2000, las vegas"); // adding elements
        [Link]("Rick, 18, 04-04-1999, chicago");
        [Link]("Sam, 15, 14-01-1999, san francisco");
        Iterator<String> itr = [Link]();
        while ([Link]()) {
            [Link]([Link]());
        }
    }
}

The above code prints the records that were added using the add() method,
i.e.:

Jack, 1, 04-01-1999, chicago


Tyler, 2, 24-09-1998, new york

Rachit, 5, 04-01-2000, delhi


Rahul, 6, 22-05-1999, mumbai
Rajat, 7, 14-08-2001, delhi

head:Amit, 22, 25-09-2001, rajasthan


head:Amit, 22, 25-09-2001, rajasthan
iterating the queue elements:
Amit, 22, 25-09-2001, rajasthan
Rachit, 5, 04-01-2000, delhi
Rahul, 6, 22-05-1999, mumbai
after removing two elements:
Rahul, 6, 22-05-1999, mumbai

Amit, 22, 25-09-2001, rajasthan


Rachit, 5, 04-01-2000, delhi
jack, 1, 04-01-1999, chicago

Mariana,2, 24-02-2000, las vegas


Rick,18,04-04-1999, chicago
Sam, 15, 14-01-1999, san francisco
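
The experiment title also asks for a Map. A variant that keeps the four fields separate and covers List, Set, and Map together is sketched below. It is a minimal illustration, assuming a small Student class that is hypothetical and written for this example; treating two students as equal when their roll numbers match is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical record class holding the four fields named in the experiment.
class Student {
    final int rollNo;
    final String name;
    final String dateOfBirth;
    final String address;

    Student(int rollNo, String name, String dateOfBirth, String address) {
        this.rollNo = rollNo;
        this.name = name;
        this.dateOfBirth = dateOfBirth;
        this.address = address;
    }

    // Assumption: the roll number identifies a student, so a Set can
    // detect duplicates through equals() and hashCode().
    @Override
    public boolean equals(Object other) {
        return other instanceof Student && ((Student) other).rollNo == rollNo;
    }

    @Override
    public int hashCode() {
        return rollNo;
    }

    @Override
    public String toString() {
        return name + ", " + rollNo + ", " + dateOfBirth + ", " + address;
    }
}

class StudentCollections {
    public static void main(String[] args) {
        Student jack = new Student(1, "Jack", "04-01-1999", "chicago");
        Student tyler = new Student(2, "Tyler", "24-09-1998", "new york");

        // List: keeps insertion order and allows duplicates.
        List<Student> list = new ArrayList<Student>();
        list.add(jack);
        list.add(tyler);
        list.add(jack);

        // Set: the repeated entry for Jack is dropped.
        Set<Student> set = new LinkedHashSet<Student>(list);

        // Map: look a student up by roll number.
        Map<Integer, Student> byRollNo = new TreeMap<Integer, Student>();
        for (Student student : set) {
            byRollNo.put(student.rollNo, student);
        }

        System.out.println("List size: " + list.size()); // List size: 3
        System.out.println("Set size: " + set.size());   // Set size: 2
        System.out.println("Roll no 2: " + byRollNo.get(2));
    }
}
```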

Conclusion: Wrote a program to store the basic information about
students such as roll no, name, date of birth and address of student
using various collection types such as List, Set and Map.

Experiment no: 5.

Run a basic Word Count MapReduce program to understand the MapReduce
paradigm.
a) Find the number of occurrences of each word appearing in the input files
b) Perform a MapReduce job for word search count (look for specific
keywords in a file).
->
a) Find the number of occurrences of each word appearing in the input files

package [Link];

import [Link];
import [Link].*;

import [Link];
import [Link].*;
import [Link].*;
import [Link].*;
import
[Link]
ma t;
import
[Link]
ma t;
import
[Link]
or mat;
import
[Link]
or mat;
public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emits (word, 1) for every token in the input line
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = [Link]();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while ([Link]()) {
                [Link]([Link]());
                [Link](word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        // Sums the counts collected for one word
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += [Link]();
            }
            [Link](key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");

        [Link]([Link]);
        [Link]([Link]);

        [Link]([Link]);
        [Link]([Link]);

        [Link]([Link]);
        [Link]([Link]);

        [Link](job, new Path(args[0]));
        [Link](job, new Path(args[1]));

        [Link](true);
    }
}

To run the example, the command syntax is:

bin/hadoop jar hadoop-*-[Link] wordcount [-m <#maps>] [-r <#reducers>] <in-dir> <out-dir>

All of the files in the input directory (called in-dir in the command line above) are read
and the counts of words in the input are written to the output directory (called out-dir
above). It is assumed that both inputs and outputs are stored in [Link] your input is
not already in HDFS, but is rather in a local file system somewhere, you need to
copy the data into HDFS using a command like this:

bin/hadoop dfs -mkdir <hdfs-dir> //not required in hadoop 0.17.2 and later
bin/hadoop dfs -copyFromLocal <local-dir> <hdfs-dir>

b) Performing a MapReduce job for word search count (look for specific
keywords in a file).

import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
public class WCReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    // Reduce function
    public void reduce(Text key, Iterator<IntWritable> value,
            OutputCollector<Text, IntWritable> output,
            Reporter rep) throws IOException {

        int count = 0;

        // Counting the frequency of each word
        while ([Link]()) {
            IntWritable i = [Link]();
            count += [Link]();
        }

        [Link](key, new IntWritable(count));
    }
}
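
The listing above shows only the reduce side. What turns a word count into a word search count is the map side, which emits a pair only when the word is one of the keywords being searched for. The sketch below shows that idea in plain Java without the Hadoop API. The class KeywordCountSketch is hypothetical, and lower-casing words before comparing them is an assumption made for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java illustration of a keyword-filtered word count; not the Hadoop API.
class KeywordCountSketch {

    static Map<String, Integer> count(Iterable<String> lines, Set<String> keywords) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                // Assumption: matching ignores case.
                String word = tokenizer.nextToken().toLowerCase(Locale.ROOT);
                // Map step: words that are not keywords emit nothing.
                if (!keywords.contains(word)) {
                    continue;
                }
                // Reduce step: add up the occurrences per keyword.
                Integer current = counts.get(word);
                counts.put(word, current == null ? 1 : current + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<String>(Arrays.asList("hadoop", "hdfs"));
        Iterable<String> lines = Arrays.asList(
                "Hadoop stores data in HDFS",
                "hadoop runs MapReduce jobs");
        System.out.println(count(lines, keywords)); // {hadoop=2, hdfs=1}
    }
}
```

In a real job the keyword set would typically be handed to the mapper through the job configuration rather than hard-coded.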

Conclusion: Ran a basic Word Count MapReduce program to understand the
MapReduce paradigm. a) Found the number of occurrences of each word
appearing in the input files b) Performed a MapReduce job for word
search count (look for specific keywords in a file).

Experiment no: 6.

Install and run HBase, then use HBase DDL and DML commands.
->
This section describes how HBase is installed and initially configured. Java and
Hadoop are required to proceed with HBase, so you have to download and install
Java and Hadoop on your system.

Pre-Installation Setup
Before installing Hadoop into Linux environment, we need to set up Linux using ssh
(Secure Shell). Follow the steps given below for setting up the Linux environment.

Creating a User
First of all, it is recommended to create a separate user for Hadoop to isolate the
Hadoop file system from the Unix file system. Follow the steps given below to
create a user.
● Switch to the root account using the command "su".
● Create a user from the root account using the command "useradd username".
● Now you can switch to an existing user account using the command "su
username".
Open the Linux terminal and type the following commands to create a user.
$ su
password:
# useradd hadoop
# passwd hadoop
New passwd:
Retype new passwd

SSH Setup and Key Generation


SSH setup is required to perform different operations on the cluster such as start,
stop, and distributed daemon shell operations. To authenticate different users of
Hadoop, it is required to provide public/private key pair for a Hadoop user and
share it with different users.
The following commands are used to generate a key pair using SSH. Copy
the public key from id_rsa.pub to authorized_keys, and give the owner read and
write permissions on the authorized_keys file.
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys

Verify ssh
ssh localhost

Installing Java
Java is the main prerequisite for Hadoop and HBase. First of all, you should verify
that Java exists on your system using "java -version". The syntax of the java
-version command is given below.
$ java -version

If everything works fine, it will give you the following output.


java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If java is not installed in your system, then follow the steps given below for installing
java.

Step 1
Download java (JDK <latest version> - [Link]) by visiting the following link
Oracle Java.
Then [Link] will be downloaded into your system.

Step 2
Generally you will find the downloaded java file in Downloads folder. Verify it and
extract the [Link] file using the following commands.
$ cd Downloads/
$ ls
[Link]
$ tar zxf [Link]
$ ls
jdk1.7.0_71 [Link]

Step 3
To make java available to all the users, you have to move it to the location
“/usr/local/”. Open root and type the following commands.
$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit

Step 4
For setting up PATH and JAVA_HOME variables, add the following commands to
~/.bashrc file.
export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin

Now apply all the changes into the current running system.
$ source ~/.bashrc
Step 5
Use the following commands to configure java alternatives:
# alternatives --install /usr/bin/java java /usr/local/java/bin/java 2

# alternatives --install /usr/bin/javac javac /usr/local/java/bin/javac 2

# alternatives --install /usr/bin/jar jar /usr/local/java/bin/jar 2

# alternatives --set java /usr/local/java/bin/java

# alternatives --set javac /usr/local/java/bin/javac

# alternatives --set jar /usr/local/java/bin/jar

Now verify the java -version command from the terminal as explained above.

Downloading Hadoop

After installing Java, you have to install Hadoop. First of all, verify whether
Hadoop is already present using the "hadoop version" command as shown below.
hadoop version

If everything works fine, it will give you the following output.


Hadoop 2.6.0
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using
/home/hadoop/hadoop/share/hadoop/common/[Link]

If your system is unable to locate Hadoop, then download Hadoop in your system.
Follow the commands given below to do so.
Download and extract hadoop-2.6.0 from Apache Software Foundation using the
following commands.
$ su
password:
# cd /usr/local
# wget
[Link]
2.6.0/[Link]
# tar xzf [Link]
# mv hadoop-2.6.0/* hadoop/
# exit

Installing Hadoop
Install Hadoop in any of the required modes. Here, we are demonstrating HBase
functionality in pseudo distributed mode, so install Hadoop in pseudo
distributed mode.
The following steps are used for installing Hadoop 2.4.1.

Step 1 - Setting up Hadoop


You can set Hadoop environment variables by appending the following commands
to ~/.bashrc file.
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME
Now apply all the changes into the current running system.
$ source ~/.bashrc

Step 2 - Hadoop Configuration


You can find all the Hadoop configuration files in the location
“$HADOOP_HOME/etc/hadoop”. You need to make changes in those configuration
files according to your Hadoop infrastructure.
$ cd $HADOOP_HOME/etc/hadoop

In order to develop Hadoop programs in java, you have to reset the java
environment variable in [Link] file by replacing JAVA_HOME value with the
location of java in your system.
export JAVA_HOME=/usr/local/jdk1.7.0_71

You will have to edit the following files to configure Hadoop.


[Link]
The [Link] file contains information such as the port number used for
Hadoop instance, memory allocated for file system, memory limit for storing data,
and the size of Read/Write buffers.
Open [Link] and add the following properties in between the <configuration>
and </configuration> tags.
<configuration>
<property>
<name>[Link]</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
[Link]
The [Link] file contains information such as the replication factor and the
namenode and datanode paths on your local file system, that is, where you want to
store the Hadoop infrastructure.
Let us assume the following data.
[Link] (data replication value) = 1

(In the path given below, hadoop is the user name, and
hadoopinfra/hdfs/namenode is the directory created by the HDFS
file system.)
namenode path = /home/hadoop/hadoopinfra/hdfs/namenode

(hadoopinfra/hdfs/datanode is the directory created by the HDFS
file system.)
datanode path = /home/hadoop/hadoopinfra/hdfs/datanode

Open this file and add the following properties in between the <configuration> and
</configuration> tags.
<configuration>
<property>
<name>[Link]</name>
<value>1</value>
</property>

<property>
<name>[Link]</name>
<value>[Link]</value>
</property>

<property>
<name>[Link]</name>
<value>[Link]</value>
</property>
</configuration>
Note: In the above file, all the property values are user-defined and you can make
changes according to your Hadoop infrastructure.
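Because these values are user-defined, it can help to read the file back and check it before formatting the namenode. The sketch below parses a <configuration> document and flags obviously wrong values. The function names are hypothetical, and the property names dfs.replication, dfs.namenode.name.dir, and dfs.datanode.data.dir are assumptions (the usual Hadoop 2.x names), since the names are not legible in the listing above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def read_properties(xml_text):
    """Parse a Hadoop <configuration> document into a name -> value dict."""
    root = ET.fromstring(xml_text)
    return {
        p.findtext("name", "").strip(): p.findtext("value", "").strip()
        for p in root.findall("property")
    }


def check_hdfs_site(props):
    """Return a list of human-readable problems found in the properties."""
    problems = []
    replication = props.get("dfs.replication", "")
    if not replication.isdigit() or int(replication) < 1:
        problems.append("dfs.replication must be a positive integer")
    for key in ("dfs.namenode.name.dir", "dfs.datanode.data.dir"):
        # Accept either a plain absolute path or a file:// URI.
        path = urlparse(props.get(key, "")).path
        if not path.startswith("/"):
            problems.append(f"{key} should be an absolute path or file:// URI")
    return problems
```

An empty list from check_hdfs_site means none of these basic checks failed; it does not prove that the directories exist or are writable.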
[Link]
This file is used to configure YARN in Hadoop. Open the [Link] file and add
the following property in between the <configuration> and </configuration> tags in
this file.
<configuration>
<property>
<name>[Link]-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
[Link]
This file is used to specify which MapReduce framework we are using. By default,
Hadoop contains a template of [Link]. First, copy the file from [Link] to
[Link] using the following command.
$ cp [Link] [Link]

Open [Link] file and add the following properties in between the
<configuration> and </configuration> tags.
<configuration>
<property>
<name>[Link]</name>
<value>yarn</value>
</property>
</configuration>

Verifying Hadoop Installation


The following steps are used to verify the Hadoop installation.

Step 1 - Name Node Setup


Set up the namenode using the command “hdfs namenode -format” as follows.
$ cd ~
$ hdfs namenode -format

The expected result is as follows.


10/24/14 [Link] INFO [Link]: STARTUP_MSG:
/***********************************************************
* STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/[Link]
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 [Link] INFO [Link]: Storage directory
/home/hadoop/hadoopinfra/hdfs/namenode has been
successfully formatted.
10/24/14 [Link] INFO [Link]:
Going to
retain 1 images with txid >= 0
10/24/14 [Link] INFO [Link]: Exiting with status 0
10/24/14 [Link] INFO [Link]: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/[Link]
************************************************************/
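When the format command is run from a script, the expected result above can be checked automatically. The following is a minimal sketch that looks only for the two phrases visible in the output shown above; the function name format_succeeded is hypothetical.

```python
def format_succeeded(log_text):
    """Return True if the log reports a formatted storage directory and exit status 0."""
    # Collapse whitespace so phrases wrapped across lines still match.
    flat = " ".join(log_text.split())
    return "successfully formatted" in flat and "Exiting with status 0" in flat
```

The captured text of "hdfs namenode -format" would be passed in as log_text; any other exit status or a missing "successfully formatted" message makes the check fail.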

Step 2 - Verifying Hadoop dfs


The following command is used to start dfs. Executing this command will start your
Hadoop file system.
$ [Link]

The expected output is as follows.


10/24/14 [Link]
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-2.4.1/logs/[Link]
localhost: starting datanode, logging to /home/hadoop/hadoop-2.4.1/logs/[Link]
Starting secondary namenodes [[Link]]

Step 3 - Verifying Yarn Script


The following command is used to start the yarn script. Executing this command will
start your yarn daemons.
$ [Link]

The expected output is as follows.


starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.4.1/logs/[Link]
localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.4.1/logs/[Link]

Step 4 - Accessing Hadoop on Browser


The default port number to access Hadoop is 50070. Use the following url to get
Hadoop services on your browser.
[Link]

Step 5 - Verify all Applications of Cluster


The default port number to access all the applications of cluster is 8088. Use the
following url to visit this service.
[Link]
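Before opening the browser, it is possible to check whether the two web interfaces are listening at all. The sketch below attempts a plain TCP connection to the ports named in Steps 4 and 5 (50070 and 8088); the function name port_is_open is hypothetical, and a successful connection only shows that something is listening, not that the page is healthy.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in (("Hadoop web UI", 50070), ("Cluster applications UI", 8088)):
        state = "reachable" if port_is_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```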
Installing HBase
We can install HBase in any of the three modes: Standalone mode, Pseudo
Distributed mode, and Fully Distributed mode.

Installing HBase in Standalone Mode


Download the latest stable version of HBase from [Link]
[Link]/apache/hbase/stable/ using the “wget” command, and extract it using the tar
“zxvf” command. See the following commands.
$cd /usr/local/
$wget
[Link]
- [Link]
$tar -zxvf [Link]

Switch to super user mode and move the HBase files into the HBase folder under
/usr/local as shown below.
$su
$password: enter your password here
mv hbase-0.99.1/* HBase/

Configuring HBase in Standalone Mode


Before proceeding with HBase, you have to edit the following files and configure
HBase.

[Link]
Set the Java home for HBase: open the [Link] file from the conf folder and change
the existing JAVA_HOME environment variable to the path of your current Java
installation, as shown below.
cd /usr/local/HBase/conf
gedit [Link]

This will open the [Link] file of HBase. Now replace the existing JAVA_HOME
value with your current value as shown below.
export JAVA_HOME=/usr/lib/jvm/java-1.7.0
[Link]
This is the main configuration file of HBase. To set the data directory to an
appropriate location, open the HBase home folder in /usr/local/HBase. Inside the conf
folder you will find several files; open the [Link] file as shown below.
#cd /usr/local/HBase/
#cd conf
# gedit [Link]
Inside the [Link] file, you will find the <configuration> and </configuration>
tags. Within them, set the HBase directory under the property key with the name
“[Link]” as shown below.
<configuration>
<!-- Set the path where you want HBase to store its files. -->
<property>
<name>[Link]</name>
<value>file:/home/hadoop/HBase/HFiles</value>
</property>

<!-- Set the path where you want HBase to store its built-in ZooKeeper files. -->
<property>
<name>[Link]</name>
<value>/home/hadoop/zookeeper</value>
</property>
</configuration>
With this, the HBase installation and configuration is complete. We
can start HBase by using the [Link] script provided in the bin folder of HBase.
For that, open the HBase home folder and run the HBase start script as shown below.
$cd /usr/local/HBase/bin
$./[Link]

If everything goes well, the HBase start script prints a message saying that HBase
has started.
starting master, logging to
/usr/local/HBase/bin/../logs/[Link].out

Installing HBase in Pseudo-Distributed Mode


Let us now check how HBase is installed in pseudo-distributed mode.

Configuring HBase

Before proceeding with HBase, configure Hadoop and HDFS on your local system
or on a remote system and make sure they are running. Stop HBase if it is
running.
[Link]
Edit [Link] file to add the following properties.
<property>
<name>[Link]</name>
<value>true</value>
</property>
This property tells HBase in which mode it should run. In the same file, change
[Link] from the local file system to the address of your HDFS instance, using the
hdfs:// URI syntax. The host and port must match the NameNode address configured
for Hadoop, which in this setup is localhost at port 9000.
<property>
<name>[Link]</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
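HBase can only reach HDFS if the address in its root directory points at the same NameNode that Hadoop was configured with (hdfs://localhost:9000 in the Hadoop configuration earlier in this guide). A minimal sketch of that comparison follows; the function name same_namenode is hypothetical.

```python
from urllib.parse import urlparse


def same_namenode(default_fs, hbase_rootdir):
    """Return True if both URIs use the hdfs scheme and the same host and port."""
    fs, root = urlparse(default_fs), urlparse(hbase_rootdir)
    return (fs.scheme == root.scheme == "hdfs"
            and fs.hostname == root.hostname
            and fs.port == root.port)


if __name__ == "__main__":
    print(same_namenode("hdfs://localhost:9000", "hdfs://localhost:9000/hbase"))
```

A False result usually means a mismatched port or a root directory that still uses the local file: scheme from standalone mode.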

Starting HBase
After configuration is over, browse to HBase home folder and start HBase using the
following command.
$cd /usr/local/HBase
$bin/[Link]

Note: Before starting HBase, make sure Hadoop is running.

Checking the HBase Directory in HDFS


HBase creates its directory in HDFS. To see the created directory, go to the
Hadoop home folder and type the following command.
$ ./bin/hadoop fs -ls /hbase

If everything goes well, it will give you the following output.


Found 7 items
drwxr-xr-x - hbase users 0 2014-06-25 18:58 /hbase/.tmp
drwxr-xr-x - hbase users 0 2014-06-25 21:49 /hbase/WALs
drwxr-xr-x - hbase users 0 2014-06-25 18:48 /hbase/corrupt
drwxr-xr-x - hbase users 0 2014-06-25 18:58 /hbase/data
-rw-r--r-- 3 hbase users 42 2014-06-25 18:41 /hbase/[Link]
-rw-r--r-- 3 hbase users 7 2014-06-25 18:41 /hbase/[Link]
drwxr-xr-x - hbase users 0 2014-06-25 21:49 /hbase/oldWALs
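A listing like the one above is easy to check from a script. The following sketch splits each line into its eight columns (permissions, replication, owner, group, size, date, time, path) and skips the "Found N items" header. The names Entry and parse_ls are hypothetical, and the column layout is assumed to be exactly the one shown above.

```python
from collections import namedtuple

Entry = namedtuple("Entry", "is_dir replication owner group size modified path")


def parse_ls(output):
    """Parse 'hadoop fs -ls' style lines, skipping the 'Found N items' header."""
    entries = []
    for line in output.splitlines():
        parts = line.split(None, 7)
        # Real entries have eight columns and start with 'd' or '-'.
        if len(parts) != 8 or parts[0][0] not in "d-":
            continue
        perms, repl, owner, group, size, date, time, path = parts
        entries.append(Entry(perms.startswith("d"), repl, owner, group,
                             int(size), f"{date} {time}", path))
    return entries
```

For example, [e.path for e in parse_ls(text) if e.is_dir] yields only the directories, which is enough to confirm that /hbase/data and /hbase/WALs were created.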

Starting and Stopping a Master


Using “[Link]” you can start up to 10 servers. Open the home
folder of the HBase master and execute the following command to start it.
$ ./bin/[Link] 2 4

To kill a backup master, you need its process ID, which is stored in a file named
“/tmp/[Link]”. You can kill the backup master using the
following command.
$ cat /tmp/[Link] |xargs kill -9

Starting and Stopping RegionServers


You can run multiple region servers from a single system using the following
command.
$ ./bin/[Link] start 2 3
To stop a region server, use the following command.
$ ./bin/[Link] stop 3

Starting HBaseShell
After installing HBase successfully, you can start the HBase shell. The steps
below are to be followed to start the HBase shell. Open the terminal and log in
as super user.

Start Hadoop File System


Browse through Hadoop home sbin folder and start Hadoop file system as shown
below.
$cd $HADOOP_HOME/sbin
$[Link]

Start HBase
Browse through the HBase root directory bin folder and start HBase.
$cd /usr/local/HBase
$./bin/[Link]

Start HBase Master Server


This will be the same directory. Start it as shown below. (The number signifies a
specific server.)
$./bin/[Link] start 2

Start Region
Start the region server as shown below.
$./bin/[Link] start 3

Start HBase Shell


You can start HBase shell using the following command.
$cd bin
$./hbase shell

This will give you the HBase Shell Prompt as shown below.
2014-12-09 [Link],526 INFO [main] [Link]: [Link] is deprecated. Instead, use [Link]
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.98.8-hadoop2, r6cfc8d064754251365e070a10a82eb169956d5fe, Fri Nov 14 [Link] PST 2014

hbase(main):001:0>

HBase Web Interface


To access the web interface of HBase, type the following url in the browser.
[Link]

This interface lists your currently running Region servers, backup masters and
HBase tables.

HBase Region servers and Backup Masters

HBase Tables
Setting Java Environment
We can also communicate with HBase using Java libraries, but before accessing
HBase using Java API you need to set classpath for those libraries.

Setting the Classpath


Before proceeding with programming, set the classpath to HBase libraries in
.bashrc file. Open .bashrc in any of the editors as shown below.
$ gedit ~/.bashrc

Set the classpath for the HBase libraries (the lib folder in HBase) in it as shown below.
export CLASSPATH=$CLASSPATH:/home/hadoop/hbase/lib/*

This is to prevent the “class not found” exception while accessing HBase using the
Java API.

Conclusion: Installed and ran HBase, then used HBase DDL and DML commands.
Nutan College of Engineering and Research
Department of Computer Science.

Name:- Kiran Raghunath Patil.

Year:- 4th Yr. Div:- B.


Roll no. :- 29.
Subject:- Big Data Analytics.

Experiment no: 7.

Install, Deploy & configure Apache Spark Cluster. Run apache spark
applications using Scala.
->
Follow the steps given below to easily install Apache Spark cluster and
run apache spark applications using Scala.

i. Recommended Platform

● OS – Linux is supported as a development and deployment platform. You can use
Ubuntu 14.04 / 16.04 or later (you can also use other Linux flavors like CentOS,
Redhat, etc.). Windows is supported as a dev platform.

● Spark – Apache Spark 2.x

For Apache Spark installation on a multi-node cluster, we will need multiple nodes;
you can either use Amazon AWS or set up a virtual platform using VMware Player.

ii. Install Spark on Master

a. Prerequisites

Add Entries in hosts file

Edit hosts file

sudo nano /etc/hosts

Now add entries of master and slaves

MASTER-IP master
SLAVE01-IP slave01
SLAVE02-IP slave02

(NOTE: In place of MASTER-IP, SLAVE01-IP, SLAVE02-IP put the value of the
corresponding IP)
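The hosts entries above can be produced from a list of nodes, which also catches a placeholder such as MASTER-IP that was never replaced. This is a minimal sketch; the function name hosts_lines and the addresses in the usage example are hypothetical.

```python
import ipaddress


def hosts_lines(nodes):
    """Render (ip, hostname) pairs as /etc/hosts lines, validating each IP."""
    lines = []
    for ip, name in nodes:
        ipaddress.ip_address(ip)  # raises ValueError for a malformed address
        lines.append(f"{ip} {name}")
    return lines


if __name__ == "__main__":
    # Example addresses only; use the real IPs of your machines.
    cluster = [("192.168.1.10", "master"),
               ("192.168.1.11", "slave01"),
               ("192.168.1.12", "slave02")]
    print("\n".join(hosts_lines(cluster)))
```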
Install Java 7 (Recommended Oracle Java)

sudo apt-get install python-software-properties
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer

Install Scala

sudo apt-get install scala

Configure SSH

Install Open SSH Server-Client

sudo apt-get install openssh-server openssh-client

Generate Key Pairs

ssh-keygen -t rsa -P ""

Configure passwordless SSH

Copy the content of .ssh/id_rsa.pub (of master) to .ssh/authorized_keys (of all
the slaves as well as master)

Check by SSH to all the Slaves

ssh slave01
ssh slave02
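The copy step for passwordless SSH amounts to making sure the master's public key appears once in each node's authorized_keys file. The sketch below shows that logic on plain strings, so it can be tried without touching any real files; the function name add_public_key is hypothetical.

```python
def add_public_key(authorized_keys_text, public_key):
    """Return authorized_keys content with public_key present exactly once."""
    key = public_key.strip()
    # Keep existing non-empty lines, one key per line.
    lines = [line.strip() for line in authorized_keys_text.splitlines() if line.strip()]
    if key not in lines:
        lines.append(key)
    return "\n".join(lines) + "\n"
```

Because the function is idempotent, running it again with the same key leaves the content unchanged, so repeating the setup does not pile up duplicate entries.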

b. Install Spark

Download Spark
You can download the latest version of spark from
[Link]

Untar Tarball

tar xzf [Link]

(Note: All the scripts, jars, and configuration files are available in
newly created directory “spark-2.0.0-bin-hadoop2.6”)

Setup Configuration
Edit .bashrc

Now edit the .bashrc file located in the user’s home directory and add the
following environment variables:

export JAVA_HOME=<path-of-Java-installation>
(eg: /usr/lib/jvm/java-7-oracle/)

export SPARK_HOME=<path-to-the-root-of-your-spark-installation>
(eg: /home/dataflair/spark-2.0.0-bin-hadoop2.6/)

export PATH=$PATH:$SPARK_HOME/bin

(Note: After the above step, restart the Terminal/Putty so that all the
environment variables come into effect)

Edit [Link]

Now edit the configuration file [Link] (in $SPARK_HOME/conf/) and set the
following parameters:

Note: Create a copy of the template of [Link] and rename it:

cp [Link] [Link]

export JAVA_HOME=<path-of-Java-installation>
(eg: /usr/lib/jvm/java-7-oracle/)

export SPARK_WORKER_CORES=8

Add Slaves
Create the configuration file slaves (in $SPARK_HOME/conf/) and add the
following entries:

slave01
slave02

“Apache Spark has been installed successfully on Master, now deploy
Spark on all the Slaves”

iii. Install Spark On Slaves

a. Setup Prerequisites on all the slaves

Run following steps on all the slaves (or worker nodes):

● “1.1. Add Entries in hosts file”


● “1.2. Install Java 7”

● “1.3. Install Scala”

b. Copy setups from master to all the slaves

Create tarball of configured setup

tar czf [Link] spark-2.0.0-bin-hadoop2.6

NOTE: Run this command on Master

Copy the configured tarball on all the slaves

scp [Link] slave01:~

NOTE: Run this command on Master

scp [Link] slave02:~

NOTE: Run this command on Master

c. Un-tar configured spark setup on all the slaves

tar xzf [Link]

NOTE: Run this command on all the slaves


“Congratulations Apache Spark has been installed on all the Slaves.
Now Start the daemons on the Cluster”

iv. Start Spark Cluster

a. Start Spark Services

sbin/[Link]

Note: Run this command on Master

b. Check whether services have been started

Check daemons on Master

jps
Master

Check daemons on Slaves

jps
Worker
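The jps check above can be automated once the output has been captured. jps prints one "<pid> <name>" pair per line, so a minimal sketch only needs to collect the names; the function name running_daemons is hypothetical.

```python
def running_daemons(jps_output):
    """Map daemon names to process IDs from 'jps' output lines ('<pid> <name>')."""
    daemons = {}
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit():
            daemons.setdefault(parts[1], []).append(int(parts[0]))
    return daemons
```

On the master, "Master" should be a key of the result; on each slave, "Worker" should be.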

v. Spark Web UI

a. Spark Master UI

Browse the Spark UI to know about worker nodes, running applications, and
cluster resources.

[Link]

b. Spark application UI

[Link]

vi. Stop the Cluster

a. Stop Spark Services

Once all the applications have finished, you can stop the spark
services (master and slaves daemons) running on the cluster

sbin/[Link]

Note: Run this command on Master

After Apache Spark installation, I recommend learning Spark RDD, DataFrame,
and Dataset. You can proceed further with Spark shell commands to play with Spark.

Experiment no: 8.

Basic CRUD operations in MongoDB


->

CRUD operations create, read, update, and delete documents.

Create Operations
Create or insert operations add new documents to a collection. If the collection
does not currently exist, insert operations will create the collection.

MongoDB provides the following methods to insert documents into a collection:

● [Link]() New in version 3.2
● [Link]() New in version 3.2

In MongoDB, insert operations target a single collection. All write operations
in MongoDB are atomic on the level of a single document.

For examples, see Insert Documents.

Read Operations
Read operations retrieve documents from a collection; i.e. query a collection for
documents. MongoDB provides the following methods to read documents from
a collection:

● [Link]()

You can specify query filters or criteria that identify the documents to return.
For examples, see:

● Query Documents
● Query on Embedded/Nested Documents
● Query an Array
● Query an Array of Embedded Documents

Update Operations
Update operations modify existing documents in a collection. MongoDB provides
the following methods to update documents of a collection:

● [Link]() New in version 3.2
● [Link]() New in version 3.2
● [Link]() New in version 3.2

In MongoDB, update operations target a single collection. All write operations
in MongoDB are atomic on the level of a single document.

You can specify criteria, or filters, that identify the documents to update. These
filters use the same syntax as read operations.

For examples, see Update Documents.

Delete Operations
Delete operations remove documents from a collection. MongoDB provides
the following methods to delete documents of a collection:

● [Link]() New in version 3.2
● [Link]() New in version 3.2

In MongoDB, delete operations target a single collection. All write operations
in MongoDB are atomic on the level of a single document.

You can specify criteria, or filters, that identify the documents to remove. These
filters use the same syntax as read operations. For examples, see Delete
Documents.
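To make the four kinds of operation concrete without a running server, the sketch below models a collection as a Python list of dictionaries. It is a toy stand-in, not a MongoDB client: the class and method names are hypothetical, filters support equality only, and _id is a simple counter rather than an ObjectId.

```python
class Collection:
    """A toy in-memory stand-in for a collection; not a MongoDB client."""

    def __init__(self):
        self._docs = []
        self._next_id = 1

    def _matches(self, doc, query):
        # Equality-only matching: every key in the filter must equal the document's value.
        return all(doc.get(k) == v for k, v in query.items())

    def insert_one(self, doc):
        """Create: store a copy of doc with a generated _id and return that _id."""
        stored = dict(doc, _id=self._next_id)
        self._next_id += 1
        self._docs.append(stored)
        return stored["_id"]

    def find(self, query=None):
        """Read: return copies of all documents matching the filter."""
        return [dict(d) for d in self._docs if self._matches(d, query or {})]

    def update_one(self, query, changes):
        """Update: modify the first matching document; return how many changed."""
        for d in self._docs:
            if self._matches(d, query):
                d.update(changes)
                return 1
        return 0

    def delete_one(self, query):
        """Delete: remove the first matching document; return how many were removed."""
        for i, d in enumerate(self._docs):
            if self._matches(d, query):
                del self._docs[i]
                return 1
        return 0
```

Note how update_one and delete_one reuse the same filter logic as find, mirroring the statement above that update and delete filters use the same syntax as read operations.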

Bulk Write
MongoDB provides the ability to perform write operations in bulk.

Experiment no: 9.

Retrieve various types of documents from students collection.


->
To retrieve documents from the students collection, the syntax is as follows:
[Link]();
The above syntax will return all the documents from a collection in MongoDB. To
understand the above syntax, let us create a collection with documents. The queries
to create the documents are as follows:
> [Link]({"StudentId":"STUD-101","StudentName":"David","StudentAge":24});
{
"acknowledged" : true,
"insertedId" : ObjectId("5c6bf5cf68174aae23f5ef4e")
}
> [Link]({"StudentId":"STUD-102","StudentName":"Carol","StudentAge":22});
{
"acknowledged" : true,
"insertedId" : ObjectId("5c6bf5e968174aae23f5ef4f")
}
> [Link]({"StudentId":"STUD-103","StudentName":"Maxwell","StudentAge":25});
{
"acknowledged" : true,
"insertedId" : ObjectId("5c6bf5f768174aae23f5ef50")
}
> [Link]({"StudentId":"STUD-104","StudentName":"Bob","StudentAge":23});
{
"acknowledged" : true,
"insertedId" : ObjectId("5c6bf60868174aae23f5ef51")
}
> [Link]({"StudentId":"STUD-105","StudentName":"Sam","StudentAge":27});
{
"acknowledged" : true,
"insertedId" : ObjectId("5c6bf61b68174aae23f5ef52")
}
Now you can use the above syntax in order to retrieve all the documents from a
collection with the help of find() method. The query is as follows:
> [Link]();
The following is the output:
{ "_id" : ObjectId("5c6bf5cf68174aae23f5ef4e"), "StudentId" : "STUD-101", "StudentName" : "David", "StudentAge" : 24 }
{ "_id" : ObjectId("5c6bf5e968174aae23f5ef4f"), "StudentId" : "STUD-102", "StudentName" : "Carol", "StudentAge" : 22 }
{ "_id" : ObjectId("5c6bf5f768174aae23f5ef50"), "StudentId" : "STUD-103", "StudentName" : "Maxwell", "StudentAge" : 25 }
{ "_id" : ObjectId("5c6bf60868174aae23f5ef51"), "StudentId" : "STUD-104", "StudentName" : "Bob", "StudentAge" : 23 }
{ "_id" : ObjectId("5c6bf61b68174aae23f5ef52"), "StudentId" : "STUD-105", "StudentName" : "Sam", "StudentAge" : 27 }
For a properly formatted output, use pretty() with find(). The query is as follows:
> [Link]().pretty();
The following is the output:
{
"_id" : ObjectId("5c6bf5cf68174aae23f5ef4e"),
"StudentId" : "STUD-101",
"StudentName" : "David",
"StudentAge" : 24
}
{
"_id" : ObjectId("5c6bf5e968174aae23f5ef4f"),
"StudentId" : "STUD-102",
"StudentName" : "Carol",
"StudentAge" : 22
}
{
"_id" : ObjectId("5c6bf5f768174aae23f5ef50"),
"StudentId" : "STUD-103",
"StudentName" : "Maxwell",
"StudentAge" : 25
}
{
"_id" : ObjectId("5c6bf60868174aae23f5ef51"),
"StudentId" : "STUD-104",
"StudentName" : "Bob",
"StudentAge" : 23
}
{
"_id" : ObjectId("5c6bf61b68174aae23f5ef52"),
"StudentId" : "STUD-105",
"StudentName" : "Sam",
"StudentAge" : 27
}
If you want to retrieve a single document on the basis of some condition, then you
can use the following query. Here, we are retrieving the document with
StudentName as “Maxwell”:
> [Link]({"StudentName":"Maxwell"}).pretty();
The following is the output:
{
"_id" : ObjectId("5c6bf5f768174aae23f5ef50"),
"StudentId" : "STUD-103",
"StudentName" : "Maxwell",
"StudentAge" : 25
}
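The two queries above (all documents, and documents matching a condition) can be mimicked in plain Python on the same five student records, which makes the filtering rule explicit. This is only an illustrative sketch: find and pretty here are ordinary Python functions with hypothetical names, not the MongoDB shell methods, and _id values are omitted.

```python
import json

STUDENTS = [
    {"StudentId": "STUD-101", "StudentName": "David", "StudentAge": 24},
    {"StudentId": "STUD-102", "StudentName": "Carol", "StudentAge": 22},
    {"StudentId": "STUD-103", "StudentName": "Maxwell", "StudentAge": 25},
    {"StudentId": "STUD-104", "StudentName": "Bob", "StudentAge": 23},
    {"StudentId": "STUD-105", "StudentName": "Sam", "StudentAge": 27},
]


def find(docs, query=None):
    """Return documents whose fields equal every key/value pair in query."""
    query = query or {}
    return [d for d in docs if all(d.get(k) == v for k, v in query.items())]


def pretty(docs):
    """Format documents one field per line, similar in spirit to pretty()."""
    return "\n".join(json.dumps(d, indent=4) for d in docs)


if __name__ == "__main__":
    print(pretty(find(STUDENTS, {"StudentName": "Maxwell"})))
```

An empty filter matches every document, which corresponds to calling find() with no arguments.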
