
Install Hadoop on Google Colab

The document outlines the installation and configuration of Hadoop in a pseudo-distributed mode on Google Colab, detailing the steps to set up Java, download Hadoop, configure XML files, and start Hadoop daemons. It also includes instructions for basic file management tasks in HDFS and implementing a Word Count program using MapReduce. The results of each exercise indicate successful execution and verification of outputs.

Uploaded by

bhavanipriy73

PANIMALAR ENGINEERING COLLEGE

Department of CSE
Reg no: 211422104360

EX NO : 6

DATE:

Install and configure Hadoop in its operating modes: Pseudo distributed


AIM
To install and configure Hadoop on Google Colab as a single-node, pseudo-distributed setup.
ALGORITHM
Step1: Install Java JDK 8 and verify the version.
Step2: Download Hadoop 3.3.6 and extract it to /usr/local/hadoop.
Step3: Set environment variables (JAVA_HOME, HADOOP_HOME, and PATH) for Hadoop and Java.
Step4: Configure the Hadoop configuration files:
hadoop-env.sh → set JAVA_HOME.
core-site.xml → set fs.defaultFS for HDFS.
hdfs-site.xml → configure NameNode and DataNode directories.
mapred-site.xml → set MapReduce framework to YARN.
yarn-site.xml → set NodeManager aux services.
Step5: Prepare local directories for HDFS storage (namenode and datanode).
Step6: Format HDFS NameNode to initialize the filesystem.
Step7: Start Hadoop daemons:
HDFS → NameNode and DataNode
YARN → ResourceManager and NodeManager
Step8: Verify running processes using jps.
Step9: Set Hadoop users to root in hadoop-env.sh for permission consistency.
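The XML files named in Step 4 all share the same <configuration>/<property> layout, so they can be rendered from a plain dictionary. The sketch below is one possible helper, not part of the original exercise: the function name render_hadoop_config is hypothetical, and the yarn-site.xml property shown (yarn.nodemanager.aux-services set to mapreduce_shuffle) is the value commonly used for the NodeManager aux service, assumed here because the listing in this record does not show it.

```python
from xml.sax.saxutils import escape


def render_hadoop_config(properties):
    """Render a dict of property names and values as Hadoop configuration XML.

    Hypothetical helper: it only builds the text, it does not write any file.
    """
    lines = ["<configuration>"]
    for name, value in properties.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(str(value))}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines) + "\n"


# Assumed value for the NodeManager aux service (Step 4, yarn-site.xml).
yarn_site = render_hadoop_config(
    {"yarn.nodemanager.aux-services": "mapreduce_shuffle"}
)
print(yarn_site)
```

The returned string could then be written to a file under the Hadoop configuration directory in the same way the other XML files are written in the program.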
PROGRAM
# Install Java JDK 8
!apt-get install openjdk-8-jdk -y
!update-alternatives --config java
!java -version

!echo $JAVA_HOME
# Download and set up Hadoop 3.3.6
!wget [Link]
!tar -xvzf hadoop-3.3.6.tar.gz
!mv hadoop-3.3.6 /usr/local/hadoop
!sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64|' /usr/local/hadoop/etc/hadoop/hadoop-env.sh
# Set environment variables
import os
HADOOP_HOME = "/usr/local/hadoop"
JAVA_HOME = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["HADOOP_HOME"] = HADOOP_HOME
os.environ["JAVA_HOME"] = JAVA_HOME
# Prepend the Hadoop and Java binaries to PATH
os.environ["PATH"] = (
    f"{HADOOP_HOME}/bin:{HADOOP_HOME}/sbin:{JAVA_HOME}/bin:"
    + os.environ["PATH"]
)
print("HADOOP_HOME =", os.environ["HADOOP_HOME"])
print("JAVA_HOME =", os.environ["JAVA_HOME"])
import textwrap, pathlib
# Ensure Hadoop config directory exists
pathlib.Path(f"{HADOOP_HOME}/etc/hadoop").mkdir(parents=True, exist_ok=True)
# Configure Hadoop XML files
with open(f"{HADOOP_HOME}/etc/hadoop/hadoop-env.sh", "a") as f:
    f.write(f"\nexport JAVA_HOME={JAVA_HOME}\n")
# core-site.xml: default filesystem URI for HDFS
core_site = """
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
"""
with open(f"{HADOOP_HOME}/etc/hadoop/core-site.xml", "w") as f:
    f.write(textwrap.dedent(core_site).strip() + "\n")
# hdfs-site.xml: replication factor and local storage directories
hdfs_site = """
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/content/hadoop_data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/content/hadoop_data/datanode</value>
  </property>
</configuration>
"""
with open(f"{HADOOP_HOME}/etc/hadoop/hdfs-site.xml", "w") as f:
    f.write(textwrap.dedent(hdfs_site).strip() + "\n")
# mapred-site.xml: run MapReduce jobs on YARN
mapred_site = """
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
"""
with open(f"{HADOOP_HOME}/etc/hadoop/mapred-site.xml", "w") as f:
    f.write(textwrap.dedent(mapred_site).strip() + "\n")
print("Hadoop configuration files written.")
# Prepare directories
!rm -rf /content/hadoop_data
!mkdir -p /content/hadoop_data/namenode /content/hadoop_data/datanode
# Format HDFS
!$HADOOP_HOME/bin/hdfs namenode -format -force
# Set Hadoop users to root
!echo "export HDFS_NAMENODE_USER=root" >> /usr/local/hadoop/etc/hadoop/hadoop-env.sh
!echo "export HDFS_DATANODE_USER=root" >> /usr/local/hadoop/etc/hadoop/hadoop-env.sh
!echo "export HDFS_SECONDARYNAMENODE_USER=root" >> /usr/local/hadoop/etc/hadoop/hadoop-env.sh
!echo "export YARN_RESOURCEMANAGER_USER=root" >> /usr/local/hadoop/etc/hadoop/hadoop-env.sh
!echo "export YARN_NODEMANAGER_USER=root" >> /usr/local/hadoop/etc/hadoop/hadoop-env.sh
# Start the HDFS and YARN daemons (Step 7)
!$HADOOP_HOME/bin/hdfs --daemon start namenode
!$HADOOP_HOME/bin/hdfs --daemon start datanode
!$HADOOP_HOME/bin/yarn --daemon start resourcemanager
!$HADOOP_HOME/bin/yarn --daemon start nodemanager
# Verify running processes
!jps
OUTPUT
namenode is running as process 3848. Stop it first and ensure /tmp/hadoop-root-namenode.pid file is empty before retry.
datanode is running as process 3901. Stop it first and ensure /tmp/hadoop-root-datanode.pid file is empty before retry.
resourcemanager is running as process 3973. Stop it first and ensure /tmp/hadoop-root-resourcemanager.pid file is empty before retry.
nodemanager is running as process 4031. Stop it first and ensure /tmp/hadoop-root-nodemanager.pid file is empty before retry.


3973 ResourceManager
3848 NameNode
29996 Jps
3901 DataNode
4031 NodeManager

RESULT:
Thus the exercise has been executed successfully and the outputs were verified.

EX NO : 7
DATE:
Implement file management tasks in Hadoop: Adding files and directories, retrieving files,
and deleting files.
Aim
To implement basic file management tasks in Hadoop Distributed File System (HDFS), including
creating directories, uploading files, retrieving files, and deleting files or directories using Hadoop
commands.
Procedure
• Start Hadoop services.
• Create a directory in HDFS.
• Upload a file into the directory.
• Check the uploaded file.
• Read the file contents.
• Delete the file.
• Delete the directory.
Commands
hdfs dfs -mkdir /project
hdfs dfs -put [Link] /project/
hdfs dfs -ls /project
hdfs dfs -cat /project/[Link]
hdfs dfs -rm /project/[Link]
hdfs dfs -rm -r /project
Output
Found 1 items
-rw-r--r-- 1 hadoop hadoop 2450 2025-09-30 12:45 /project/[Link]
id,name,age
1,Alice,23
2,Bob,30

3,Charlie,28
Deleted /project/[Link]
Deleted /project
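
The command sequence above follows a fixed lifecycle (create, upload, list, read, delete), which can be generated from a directory and a file name. The sketch below is one way to build those argument lists in Python. The function name hdfs_file_lifecycle and the file name data.csv are hypothetical placeholders, and the sketch only prints the commands rather than running them.

```python
import shlex


def hdfs_file_lifecycle(directory, local_file):
    """Return the hdfs dfs commands for the create/upload/read/delete cycle.

    Hypothetical helper: each command is an argument list suitable for
    subprocess.run(argv, check=True) on a machine where Hadoop is running.
    """
    name = local_file.rsplit("/", 1)[-1]   # file name without local directories
    target = f"{directory}/{name}"         # path of the file inside HDFS
    return [
        ["hdfs", "dfs", "-mkdir", directory],
        ["hdfs", "dfs", "-put", local_file, f"{directory}/"],
        ["hdfs", "dfs", "-ls", directory],
        ["hdfs", "dfs", "-cat", target],
        ["hdfs", "dfs", "-rm", target],
        ["hdfs", "dfs", "-rm", "-r", directory],
    ]


# data.csv is an assumed file name used only for illustration.
for argv in hdfs_file_lifecycle("/project", "data.csv"):
    print(shlex.join(argv))
```

Keeping the commands as argument lists avoids shell quoting problems if a path contains spaces.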

Result
The basic file management tasks in Hadoop Distributed File System (HDFS) were successfully
implemented.

EX NO : 8
DATE:

Implement word count / frequency programs using MapReduce.


AIM
To create an input folder in HDFS, store a sample text file, and implement a simple WordCount
program using the MapReduce concept (Mapper and Reducer phases).
ALGORITHM
1. Open Eclipse > File > New > Java Project, name it MRProgramsDemo, and click Finish.
2. Create three Java classes in the project: WCDriver (containing the main function), WCMapper, and WCReducer.
3. Include two reference libraries: right-click the project, select Build Path, then click Configure Build Path.
4. Click the Add External JARs option on the right-hand side and add the files listed below. You can find these files in /usr/lib/.
1. /usr/lib/hadoop-0.20-mapreduce/[Link]
2. /usr/lib/hadoop/[Link]
5. Paste the Mapper code into the WCMapper Java class file.
6. Paste the Reducer code into the WCReducer Java class file.
7. Paste the Driver code into the WCDriver Java class file.
8. Build a JAR file: right-click the project, click Export, select JAR file as the export destination, name the JAR file ([Link]), click Next, and finally click Finish. Then copy this file into the workspace directory of Cloudera.
9. Open the terminal on CDH and change to the workspace directory with the "cd workspace/" command. Then create a text file ([Link]) to use as input.
10. Run this command to copy the input file into HDFS.
hadoop fs -put [Link] [Link]
11. Run the JAR file using the command hadoop jar [Link] WCDriver [Link] WCOutput.
12. View the result in the WCOutput directory, or print it from the terminal with the following command.
hadoop fs -cat WCOutput/part-00000


Program
Mapper code
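import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WCMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    // Map function: emits (word, 1) for every word in the input line
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter rep)
            throws IOException {
        String line = value.toString();
        // Splitting the line on spaces
        for (String word : line.split(" ")) {
            // Skip empty tokens produced by consecutive spaces
            if (word.length() > 0) {
                output.collect(new Text(word), new IntWritable(1));
            }
        }
    }
}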

Reducer code

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WCReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    // Reduce function: sums the counts emitted for one word
    public void reduce(Text key, Iterator<IntWritable> value,
                       OutputCollector<Text, IntWritable> output, Reporter rep)
            throws IOException {
        int count = 0;
        // Counting the frequency of each word
        while (value.hasNext()) {
            IntWritable i = value.next();
            count += i.get();
        }
        output.collect(key, new IntWritable(count));
    }
}
Driver Code
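import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WCDriver extends Configured implements Tool {
    public int run(String args[]) throws IOException {
        // Expect the input path and the output path as arguments
        if (args.length < 2) {
            System.out.println("Please give valid inputs");
            return -1;
        }
        JobConf conf = new JobConf(WCDriver.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setMapperClass(WCMapper.class);
        conf.setReducerClass(WCReducer.class);
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        JobClient.runJob(conf);
        return 0;
    }

    // Main Method
    public static void main(String args[]) throws Exception {
        int exitCode = ToolRunner.run(new WCDriver(), args);
        System.exit(exitCode);
    }
}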

OUTPUT
=== MAPPER OUTPUT ===
[('hello', 1), ('hadoop', 1), ('hello', 1), ('colab', 1)]
=== REDUCER OUTPUT ===
hello 2
hadoop 1
colab 1
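
The mapper and reducer phases shown in the output above can also be traced in plain Python without a cluster. The sketch below mirrors the Java logic (split on spaces, emit a count of 1 per word, sum per key). The input line "hello hadoop hello colab" is an assumption inferred from the mapper output, and the function names are illustrative only.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every non-empty token, like WCMapper."""
    return [(word, 1) for word in line.split(" ") if word]


def shuffle_phase(pairs):
    """Group the emitted counts by word, as the framework does between phases."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Sum the counts for each word, like WCReducer."""
    return {word: sum(counts) for word, counts in grouped.items()}


pairs = map_phase("hello hadoop hello colab")  # assumed sample input
counts = reduce_phase(shuffle_phase(pairs))

print("=== MAPPER OUTPUT ===")
print(pairs)
print("=== REDUCER OUTPUT ===")
for word, total in counts.items():
    print(word, total)
```

The explicit shuffle step is the part Hadoop performs automatically: it is what guarantees that the reducer sees all counts for one word together.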

Result
Thus the program to implement word count / frequency using MapReduce was executed successfully.

EX NO : 9
DATE:

Implement an MR program that processes a weather dataset

Aim
To implement an MR program that processes a weather dataset.
Procedure
1. Open Eclipse, select File -> New -> Java Project, name it MyProject, select "Use an execution environment", choose JavaSE-1.8, then click Next -> Finish.
2. In this project, create a Java class named MyMaxMin, then click Finish.
3. Copy the source code below into the MyMaxMin Java class.
4. Add the external JARs for the packages that the code imports.
5. To add these external JARs to MyProject, right-click MyProject, select Build Path -> Configure Build Path, select Add External JARs, add the JARs from their download location, then click Apply and Close.
6. Export the project as a JAR file. Right-click MyProject, choose Export, go to Java -> JAR file, click Next, choose your export destination, and click Next. Choose MyMaxMin as the Main Class by clicking Browse, then click Finish -> OK.
7. Start the Hadoop daemons.
start-dfs.sh
start-yarn.sh
8. Move your dataset to HDFS.
hdfs dfs -put /file_path /destination
9. Check that the file was copied to HDFS.
hdfs dfs -ls /
10. Run your JAR file with the command below; the output is written to the MyOutput directory.
hadoop jar /jar_file_location /dataset_location_in_HDFS /output-file_name
11. Open localhost:9870/ (the NameNode web UI in Hadoop 3.x; Hadoop 2.x uses localhost:50070/), select Browse the file system under Utilities, and download part-r-00000 from the /MyOutput directory.
12. See the result in the downloaded file.

Program
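import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;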
public class MyMaxMin {
    public static class MaxTemperatureMapper extends
            Mapper<LongWritable, Text, Text, Text> {
        public static final int MISSING = 9999;

        @Override
        public void map(LongWritable arg0, Text Value, Context context)
                throws IOException, InterruptedException {
            // Convert the single row (record) to a String and store it in
            // the variable line
            String line = Value.toString();
            // Check for the empty line
            if (!(line.length() == 0)) {
                // from character 6 to 14 we have the date in our dataset
                String date = line.substring(6, 14);
                // similarly we have taken the maximum temperature from 39 to 45 characters
                float temp_Max = Float.parseFloat(line.substring(39, 45).trim());
                // similarly we have taken the minimum temperature from 47 to 53 characters
                float temp_Min = Float.parseFloat(line.substring(47, 53).trim());
                // if maximum temperature is greater than 30, it is a hot day
                if (temp_Max > 30.0) {
                    // Hot day
                    context.write(new Text("The Day is Hot Day :" + date),
                            new Text(String.valueOf(temp_Max)));
                }
                // if the minimum temperature is less than 15, it is a cold day
                if (temp_Min < 15) {
                    // Cold day
                    context.write(new Text("The Day is Cold Day :" + date),
                            new Text(String.valueOf(temp_Min)));
                }
            }
        }
    }
    // Reducer
    /* MaxTemperatureReducer class is static and extends the Reducer abstract class
       with four Hadoop generic types: Text, Text, Text, Text. */
    public static class MaxTemperatureReducer extends Reducer<Text, Text, Text, Text> {
        /**
         * @method reduce
         * This method takes the input as key and list of values pair from the mapper,
         * it does aggregation based on keys and produces the final context.
         */
        @Override
        public void reduce(Text Key, Iterable<Text> Values, Context context)
                throws IOException, InterruptedException {
            // the new MapReduce API passes the values as an Iterable;
            // take the first value for the key as the temperature string
            String temperature = Values.iterator().next().toString();
            context.write(Key, new Text(temperature));
        }
    }
    /**
     * @method main
     * This method is used for setting all the configuration properties.
     * It acts as a driver for map-reduce code.
     */
    public static void main(String[] args) throws Exception {
        // reads the default configuration of the
        // cluster from the configuration XML files
        Configuration conf = new Configuration();
        // Initializing the job with the default configuration of the cluster
        Job job = Job.getInstance(conf, "weather example");
        // Assigning the driver class name
        job.setJarByClass(MyMaxMin.class);
        // Key type coming out of mapper
        job.setMapOutputKeyClass(Text.class);
        // value type coming out of mapper
        job.setMapOutputValueClass(Text.class);
        // Defining the mapper class name
        job.setMapperClass(MaxTemperatureMapper.class);
        // Defining the reducer class name
        job.setReducerClass(MaxTemperatureReducer.class);
        // Defining input Format class which is responsible to parse the dataset
        // into a key value pair
        job.setInputFormatClass(TextInputFormat.class);
        // Defining output Format class which is responsible for writing the
        // key value pairs to the output files
        job.setOutputFormatClass(TextOutputFormat.class);
        // setting the second argument as a path in a path variable
        Path OutputPath = new Path(args[1]);
        // Configuring the input path from the filesystem into the job
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Configuring the output path from the filesystem into the job
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // deleting the output path recursively from HDFS so that we don't have
        // to delete it explicitly before each run
        OutputPath.getFileSystem(conf).delete(OutputPath, true);
        // exit with status 0 if the job succeeds and 1 otherwise
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
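
The mapper's fixed-width parsing and thresholds can be checked on a single record before running the job. The Python sketch below applies the same character offsets (date at 6 to 14, maximum temperature at 39 to 45, minimum at 47 to 53) and the same limits (above 30 is hot, below 15 is cold). The helper make_record and its sample values are hypothetical and exist only to build a line with fields at those offsets; they do not reproduce the real dataset layout.

```python
HOT_THRESHOLD = 30.0   # same limit as the Java mapper
COLD_THRESHOLD = 15.0  # same limit as the Java mapper


def classify_record(line):
    """Return the (key, temperature) pairs the mapper would emit for one line."""
    if not line:
        return []
    date = line[6:14]
    temp_max = float(line[39:45].strip())
    temp_min = float(line[47:53].strip())
    results = []
    if temp_max > HOT_THRESHOLD:
        results.append(("The Day is Hot Day :" + date, temp_max))
    if temp_min < COLD_THRESHOLD:
        results.append(("The Day is Cold Day :" + date, temp_min))
    return results


def make_record(date, temp_max, temp_min):
    """Build a hypothetical 53-character line with fields at the mapper's offsets."""
    chars = [" "] * 53
    chars[6:14] = date                   # 8-character date, e.g. YYYYMMDD
    chars[39:45] = f"{temp_max:6.1f}"    # 6-character maximum temperature
    chars[47:53] = f"{temp_min:6.1f}"    # 6-character minimum temperature
    return "".join(chars)


record = make_record("20200101", 35.2, 10.0)  # assumed sample values
for key, temperature in classify_record(record):
    print(key, temperature)
```

A record can produce two pairs when the same day is both hot and cold by these limits, which matches the two independent if statements in the Java mapper.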

Output

Result
Thus the MR program that processes a weather dataset was implemented and executed.

EX NO : 10
DATE:

Create a retail database with the following tables: Product, Customer, Manufacturer,
Shipping and Time using MongoDB and perform data replication using sharding techniques

Aim
To create a retail database using MongoDB and perform data replication using sharding
techniques.
Procedure
• Install MongoDB: First, make sure you have MongoDB installed on your system.
• Initialize a MongoDB Cluster: Set up a MongoDB cluster with sharding enabled. This involves configuring multiple MongoDB instances to act as shards, as well as setting up configuration servers and mongos instances.
• Create Database and Collections: Define the database schema and create collections for Product, Customer, Manufacturer, Shipping, and Time.
• Enable Sharding: Enable sharding for the database and the collections you want to shard.
• Insert Data: Populate the collections with sample data.
• Monitor and Manage Shards: Monitor the cluster and manage shards as needed.
Program
// Switch to the retail database (run from a mongos shell)
use retail_db
// Create collections
db.createCollection("Product")
db.createCollection("Customer")
db.createCollection("Manufacturer")
db.createCollection("Shipping")
db.createCollection("Time")
// Enable Sharding
// Enable sharding for the database
sh.enableSharding("retail_db")
// Shard the collections on a hashed _id key
sh.shardCollection("retail_db.Product", { "_id": "hashed" })
sh.shardCollection("retail_db.Customer", { "_id": "hashed" })
sh.shardCollection("retail_db.Manufacturer", { "_id": "hashed" })
sh.shardCollection("retail_db.Shipping", { "_id": "hashed" })
sh.shardCollection("retail_db.Time", { "_id": "hashed" })
// Insert sample data into collections
db.Product.insertMany([
  { _id: 1, name: "Product1", price: 100 },
  { _id: 2, name: "Product2", price: 200 }])
db.Customer.insertMany([
  { _id: 1, name: "Customer1", email: "customer1@[Link]" },
  { _id: 2, name: "Customer2", email: "customer2@[Link]" }])
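
The effect of a hashed shard key can be illustrated with a small model in Python. The sketch below hashes each _id and takes the result modulo the number of shards. This is a deliberate simplification and an assumption for illustration: MongoDB itself assigns ranges of hashed values (chunks) to shards rather than taking a modulo, and the function names and the choice of MD5 here are not MongoDB's API.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Map a document _id to a shard index by hashing it (simplified model)."""
    digest = hashlib.md5(str(doc_id).encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned 64-bit integer.
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(docs, num_shards):
    """Place every document on the shard chosen by its hashed _id."""
    shards = {index: [] for index in range(num_shards)}
    for doc in docs:
        shards[shard_for(doc["_id"], num_shards)].append(doc)
    return shards


# Hypothetical sample documents in the style of the Product collection.
products = [{"_id": i, "name": f"Product{i}", "price": 100 * i} for i in range(1, 9)]
placement = distribute(products, num_shards=2)
for index, docs in placement.items():
    print("shard", index, [doc["_id"] for doc in docs])
```

Because the placement depends only on the hash of _id, consecutive identifiers tend to spread across shards instead of piling up on one, which is the reason a hashed key is chosen for monotonically increasing identifiers.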

Result

Thus the retail database was created using MongoDB and data replication using
sharding techniques was performed.

EX NO : 11
DATE:

Install HIVE and implement the above retail schema definition and perform
CRUD operations.
Aim
To install Hive, implement the above retail schema definition, and perform
CRUD operations.
Procedure
• Install Apache Hive: Download and install Apache Hive on your system. You can follow the official documentation for installation instructions.
• Start Hadoop Services: Ensure that Hadoop services such as HDFS and YARN are running, as Hive relies on them.
• Create Hive Tables: Define the schema for the retail database and create tables in Hive corresponding to the Product, Customer, Manufacturer, Shipping, and Time collections.
• Perform CRUD Operations: Use HiveQL (Hive's SQL-like query language) to perform CRUD operations on the tables.
Program
-- Create Hive tables
-- Create Product table
CREATE TABLE Product (
  id INT,
  name STRING,
  price DOUBLE
);
-- Create Customer table
CREATE TABLE Customer (
  id INT,
  name STRING,
  email STRING
);
-- Create Manufacturer table
CREATE TABLE Manufacturer (
  id INT,
  name STRING,
  location STRING
);
-- Create Shipping table
CREATE TABLE Shipping (
  id INT,
  product_id INT,
  customer_id INT,
  shipping_date STRING
);
-- Create Time table
-- Backticks quote identifiers that clash with HiveQL keywords
CREATE TABLE `Time` (
  id INT,
  `timestamp` STRING,
  year INT,
  month INT,
  day INT
);
-- Perform CRUD operations
-- Insert data into Product table
INSERT INTO Product VALUES
  (1, 'Product1', 100),
  (2, 'Product2', 200);
-- Insert data into Customer table
INSERT INTO Customer VALUES
  (1, 'Customer1', 'customer1@[Link]'),
  (2, 'Customer2', 'customer2@[Link]');
-- Insert data into other tables similarly
-- Select data from Product table
SELECT * FROM Product;
-- UPDATE and DELETE work only on transactional (ACID) tables in Hive
-- Update data in Customer table
UPDATE Customer SET email = 'newemail@[Link]' WHERE id = 1;
-- Delete data from Manufacturer table
DELETE FROM Manufacturer WHERE id = 1;
Output:
Table 'Product' created successfully
--Insert data into Product table
INSERT INTO Product VALUES
(1, 'Product1', 100),
(2, 'Product2', 200);
2 rows inserted
SELECT * FROM Product;
id | name | price
---------------------
1 | Product1| 100.0
2 | Product2| 200.0
UPDATE Customer SET email = 'newemail@[Link]' WHERE id = 1;
1 row updated
DELETE FROM Manufacturer WHERE id = 1;
0 rows deleted (assuming no matching record with id = 1 exists)
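
The CRUD flow above can be rehearsed without a Hive installation. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive, so it demonstrates the sequence of statements and the row counts rather than HiveQL itself. The column types are SQLite's, and the example.com addresses are placeholder values, not the ones from the listing.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create: three of the retail tables, with SQLite column types.
cur.execute("CREATE TABLE Product (id INTEGER, name TEXT, price REAL)")
cur.execute("CREATE TABLE Customer (id INTEGER, name TEXT, email TEXT)")
cur.execute("CREATE TABLE Manufacturer (id INTEGER, name TEXT, location TEXT)")

# Insert sample rows (placeholder e-mail addresses).
cur.executemany(
    "INSERT INTO Product VALUES (?, ?, ?)",
    [(1, "Product1", 100), (2, "Product2", 200)],
)
cur.executemany(
    "INSERT INTO Customer VALUES (?, ?, ?)",
    [(1, "Customer1", "customer1@example.com"),
     (2, "Customer2", "customer2@example.com")],
)

# Read: fetch every product row.
products = cur.execute("SELECT * FROM Product").fetchall()

# Update: change one customer's e-mail and record how many rows changed.
cur.execute("UPDATE Customer SET email = ? WHERE id = ?",
            ("newemail@example.com", 1))
updated = cur.rowcount

# Delete: Manufacturer is empty, so no row matches id = 1.
cur.execute("DELETE FROM Manufacturer WHERE id = 1")
deleted = cur.rowcount

conn.commit()
print(products, updated, deleted)
```

As in the output above, the update affects one row and the delete affects none, because no Manufacturer row was inserted.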

Result

Thus the retail schema definition was implemented in Hive and the CRUD operations were performed.
