Big Data Analytics Hadoop Installation Guide

Big Data Analytics File

NETAJI SUBHAS UNIVERSITY OF TECHNOLOGY

EAST CAMPUS
Geeta Colony, New Delhi- 110031

BIG DATA ANALYTICS


Course Code – CBCPC11

PRACTICAL FILE

Submitted By: Aarav Jain
Submitted To: Shajal Afaq

Roll No: 2022UCB6063


INDEX
SNO EXPERIMENT DATE SUBMISSION SIGN
Experiment – 1
AIM: Installation of VMware to set up the Hadoop environment and its ecosystem.

OUTPUT:
Experiment – 2
AIM: To set up Hadoop in three operating modes: (a) standalone, (b) pseudo-distributed,
(c) fully distributed

DESCRIPTION:
Hadoop is written in Java, so you will need Java installed on your machine, version 6 or
later. Sun's Java Development Kit is the one most widely used with Hadoop, although others
have been reported to work.

Hadoop runs on Unix and Windows. Linux is the only supported production platform, but
other flavours of Unix (including Mac OS X) can be used to run Hadoop for development.
Windows is only supported as a development platform, and additionally requires Cygwin.
During the Cygwin installation you should include the OpenSSH package if you plan to run
Hadoop in pseudo-distributed mode.
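
The prerequisites above can be checked from a shell before anything is installed. The following is only a sketch: java_major is a hypothetical helper name, and it assumes version strings of the form printed by java -version (for example 1.8.0_60 under the older numbering, or 11.0.2 under the newer one).

```shell
# Extract the major Java version from a version string.
# "1.8.0_60" gives 8 (legacy 1.x numbering), "11.0.2" gives 11.
java_major() {
  v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;   # drop the legacy "1." prefix
  esac
  echo "${v%%[!0-9]*}"    # keep only the leading digits
}

# Report whether the tools this experiment relies on are on PATH.
for tool in java ssh; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: missing"
  fi
done

# If Java is present, print its major version (java -version writes to stderr).
if command -v java >/dev/null 2>&1; then
  raw="$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)"
  echo "java major version: $(java_major "$raw")"
fi
```

A major version of 6 or higher satisfies the requirement stated above.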

ALGORITHM:
a) STEPS INVOLVED IN INSTALLING HADOOP IN STANDALONE MODE

1. Command for installing SSH: sudo apt-get install ssh
2. Command for key generation: ssh-keygen -t rsa -P ""
3. Store the key in authorized_keys using the command:
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
4. Extract Java using the command: tar -xzf jdk-8u60-linux-i586.tar.gz
5. Extract Eclipse using the command: tar [Link]
6. Extract Hadoop using the command: tar -xvzf hadoop-2.7.1.tar.gz
7. Move Java to /usr/lib/jvm and Eclipse to /opt, then configure the Java path.
8. Export the Java path and the Hadoop path in ~/.bashrc.
9. Check whether the installation succeeded by checking the Java version and the Hadoop
version.
10. Check whether Hadoop works correctly in standalone mode by running the word count
example from the bundled Hadoop JAR file.
11. If the word count is displayed correctly in the part-r-00000 output file, standalone
mode is installed successfully.
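
Step 8 (exporting the Java and Hadoop paths) can be sketched as follows. Both install locations are assumptions and should be adjusted to wherever the archives were actually unpacked; append_hadoop_env is a hypothetical helper. The demo writes to a scratch file rather than the real ~/.bashrc.

```shell
# Append Java and Hadoop environment variables to a profile file.
# ASSUMPTION: both install paths below are illustrative, not taken from the steps.
append_hadoop_env() {
  profile="$1"
  cat >> "$profile" <<'EOF'
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_60
export HADOOP_HOME="$HOME/hadoop-2.7.1"
export PATH="$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
EOF
}

# Demo against a scratch file so the real profile is left untouched.
scratch="$(mktemp)"
append_hadoop_env "$scratch"
cat "$scratch"
```

Once the same lines are in ~/.bashrc and the file has been sourced, java -version and hadoop version should both print a version, which is the check described in step 9.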
b) STEPS INVOLVED IN INSTALLING HADOOP IN PSEUDO-DISTRIBUTED MODE:

1. To install Hadoop in pseudo-distributed mode we need to edit the Hadoop configuration
files, which reside in the directory /home/systemname/hadoop-2.7.1/etc/hadoop.
2. First configure hadoop-env.sh by setting the Java path.
3. Configure core-site.xml, which contains property tags, each holding a name and a
value. Set the name to [Link] and the value to hdfs://localhost:9000. Then configure
yarn-site.xml.
4. Configure mapred-site.xml. Before configuring it, copy mapred-site.xml.template to
mapred-site.xml.
5. Format the NameNode using the command hdfs namenode -format. Then run start-dfs.sh
and start-yarn.sh to start the daemons such as the NameNode and DataNode, and run jps,
which lists all running daemons.
6. Create a directory using the command hdfs dfs -mkdir /csedir, enter some data into
[Link], copy it from the local directory to Hadoop using the command
hdfs dfs -copyFromLocal <local-file> /csedir/, and run the sample JAR [Link] to check
whether pseudo-distributed mode is working.
7. Display the content of the output file using the command
hdfs dfs -cat /newdirectory/part-r-00000.
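
The name/value property described in step 3 can be illustrated with a minimal core-site.xml. This is a sketch under stated assumptions: the property name is not legible in the steps above, so fs.defaultFS (the standard key in Hadoop 2.x) is assumed here, and write_core_site is a hypothetical helper. The file is written into a scratch directory so that no real installation is overwritten.

```shell
# Write a minimal core-site.xml into the given configuration directory.
write_core_site() {
  conf_dir="$1"
  cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- ASSUMPTION: property name; only the value is given in step 3 -->
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
}

# Demo: use a temporary directory instead of the real etc/hadoop directory.
conf_dir="$(mktemp -d)"
write_core_site "$conf_dir"
cat "$conf_dir/core-site.xml"
```

In a real setup the target directory would be the Hadoop configuration directory from step 1, after which the NameNode is formatted and the daemons are started as in step 5.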

c) FULLY DISTRIBUTED MODE INSTALLATION:

ALGORITHM:

1. Stop the single-node cluster:
$ stop-all.sh

2. Decide on one node as the NameNode [master] and the remaining nodes as DataNodes
[slaves]. Copy the public key to all 3 hosts to get passwordless SSH access:
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub systemname@systemno

3. Configure all configuration files to name the master and slave nodes:
$ cd $HADOOP_HOME/etc/hadoop
$ nano core-site.xml
$ nano hdfs-site.xml

4. Add the host names to the slaves file and save it:
$ nano slaves

5. Configure yarn-site.xml:
$ nano yarn-site.xml

6. On the master node, run:
$ hdfs namenode -format
$ start-dfs.sh
$ start-yarn.sh

7. [Link] namenode

8. The daemons start on the master and slave nodes.

9. End
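
Steps 2 and 4 repeat the same action for every worker host, so they lend themselves to a small loop. The sketch below only prints the commands by default (dry run). The host names slave1 and slave2, the login name hduser, and the run helper are all hypothetical, introduced purely for illustration.

```shell
# Print a command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# ASSUMPTION: hypothetical worker host names and login user.
WORKERS="slave1 slave2"
LOGIN_USER="hduser"

# Step 2: copy the master's public key to every worker.
for host in $WORKERS; do
  run ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" "$LOGIN_USER@$host"
done

# Step 4: the slaves file lists one worker host name per line.
slaves_file="$(mktemp)"
printf '%s\n' $WORKERS > "$slaves_file"
cat "$slaves_file"
```

Setting DRY_RUN=0 would execute the ssh-copy-id calls for real; the generated file would then be copied to the slaves file in the Hadoop configuration directory.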

INPUT FORMAT:
ubuntu@localhost> jps

OUTPUT FORMAT:

DataNode, NameNode, SecondaryNameNode, NodeManager, ResourceManager.
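
Checking the jps output against the expected daemon list can be automated. The following is a sketch: missing_daemons is a hypothetical helper, and the sample input uses made-up process IDs rather than output captured from a real node.

```shell
# Daemons expected on a single-node setup, as listed in the output format above.
EXPECTED="NameNode DataNode SecondaryNameNode ResourceManager NodeManager"

# Read jps-style output ("<pid> <name>" per line) on stdin and
# print every expected daemon that does not appear in it.
missing_daemons() {
  seen="$(awk '{print $2}')"
  for d in $EXPECTED; do
    # -x matches whole lines, so NameNode does not match SecondaryNameNode
    printf '%s\n' "$seen" | grep -qx "$d" || echo "$d"
  done
}

# Sample input with invented PIDs; on a live node use: jps | missing_daemons
sample='4821 NameNode
4950 DataNode
5377 NodeManager
5502 Jps'
printf '%s\n' "$sample" | missing_daemons
```

For the sample input the function reports SecondaryNameNode and ResourceManager as missing; empty output means every expected daemon is running.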


Experiment – 3
AIM: Implement the following file management tasks in Hadoop: (a) adding files and
directories, (b) retrieving files, (c) deleting files

DESCRIPTION:
HDFS is a scalable distributed file system designed to scale to petabytes of data while
running on top of the underlying file system of the operating system. HDFS keeps track of
where the data resides in a network by associating the name of its rack (or network switch)
with the data set. This allows Hadoop to efficiently schedule tasks on the nodes that contain
the data, or on the nodes nearest to it, optimizing bandwidth utilisation. Hadoop provides a
set of command-line utilities that work similarly to the Linux file commands and serve as
the primary interface with HDFS.
We are going to have a look at HDFS by interacting with it from the command line. We
will take a look at the most common file management tasks in Hadoop, which include:
a) adding files and directories to HDFS
b) retrieving files from HDFS to the local file system
c) deleting files from HDFS

ALGORITHM:
1) Adding files and directories to HDFS:

Before you can run Hadoop programs on data stored in HDFS, you will need to put the
data into HDFS first. Let's create a directory and put a file in it. HDFS has a default working
directory of /user/$USER, where $USER is your user name.
This directory is not automatically created for you, though, so let's create it with the mkdir
command. For the purpose of illustration we use chuck. You should substitute your
user name in the example commands.
hadoop fs -mkdir /user/chuck
hadoop fs -put [Link]
hadoop fs -put [Link] /user/chuck

2) Retrieving files from HDFS:

The Hadoop command get copies files from HDFS back to the local file system. To
retrieve [Link], we can run the following command:
hadoop fs -get [Link]
To display the contents of the file instead, use:
hadoop fs -cat [Link]

3) Deleting files from HDFS:

hadoop fs -rm [Link]


Command for creating a directory in HDFS:
hdfs dfs -mkdir /lendicse

Command for adding a directory to HDFS:
hdfs dfs -put lendi_english

4) Copying from the local file system to HDFS:

hdfs dfs -copyFromLocal /home/lendi/desktop/shakes/glossary /lendicse/

View the file with the command:
hdfs dfs -cat /lendi_english/glossary

Command for listing items in Hadoop:
hdfs dfs -ls hdfs://localhost:9000/

Command for deleting files:
hdfs dfs -rmr /
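
The add, retrieve and delete operations above can be strung together into one round trip. The sketch below prints the commands by default rather than executing them, so it can be read or run without a cluster. The file name example.txt, the local copy name, and the run helper are hypothetical; /user/chuck follows the illustration used earlier.

```shell
# Print a command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# ASSUMPTION: hypothetical file name; the directory follows the chuck example.
FILE="example.txt"
HDFS_DIR="/user/chuck"

run hadoop fs -mkdir -p "$HDFS_DIR"                     # create the target directory
run hadoop fs -put "$FILE" "$HDFS_DIR"                  # add: local file system to HDFS
run hadoop fs -ls "$HDFS_DIR"                           # confirm the upload
run hadoop fs -get "$HDFS_DIR/$FILE" "./copy-of-$FILE"  # retrieve: HDFS to local
run hadoop fs -rm "$HDFS_DIR/$FILE"                     # delete the file from HDFS
```

With DRY_RUN=0 and a running cluster the same script performs the operations for real; the -p flag on mkdir avoids an error if the directory already exists.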

INPUT: Any data in structured, semi-structured or unstructured format.

EXPECTED OUTPUT:
