Experiment: Downloading and Installing Thrift, Generating HBase Thrift
Binding, and Interacting with HBase
Aim:
To download and install the Apache Thrift framework, generate HBase Thrift
bindings, and interact with HBase using Thrift for remote client communication.
Theory:
Apache Thrift is a software framework, originally developed at Facebook and
now maintained by the Apache Software Foundation, for scalable cross-language
services development. It lets programs written in different programming
languages (such as Java, Python, and C++) communicate efficiently.
In the context of HBase, Thrift provides a service interface for non-Java clients
to access HBase data through Thrift APIs.
The HBase Thrift server acts as a bridge between HBase and clients by
exposing an HBase API over Thrift RPC (Remote Procedure Call).
Procedure:
Step 1: Prerequisites
Ensure that:
Hadoop and HBase are already installed and running.
Java and SSH are properly configured.
Start HBase if it is not already running:
start-hbase.sh
Step 2: Download and Install Apache Thrift
1. Download the Thrift source package (version 0.18.1 is used here):
   wget [Link]
2. Extract the downloaded file:
   tar -xvzf thrift-0.18.1.tar.gz
   cd thrift-0.18.1
3. Install required dependencies:
   sudo apt-get update
   sudo apt-get install automake bison flex g++ libtool make pkg-config libboost-all-dev -y
4. Build and install Thrift:
   ./configure
   make
   sudo make install
5. Verify installation:
   thrift -version
Sample Output:
Thrift version 0.18.1
Step 3: Locate the HBase Thrift Interface File
HBase provides an interface definition file named Hbase.thrift, located
in:
$HBASE_HOME/src/main/resources/org/apache/hadoop/hbase/thrift/
Step 4: Generate Thrift Bindings
You can generate Thrift client stubs in different programming languages (e.g.,
Python, Java, C++).
For example, to generate Python bindings:
thrift --gen py $HBASE_HOME/src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift
For Java bindings:
thrift --gen java $HBASE_HOME/src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift
This command creates a folder named gen-py or gen-java containing
automatically generated client code.
Step 5: Start the HBase Thrift Server
To start the Thrift service, run:
hbase-daemon.sh start thrift
Or alternatively:
hbase thrift start
This will start the HBase Thrift server on the default port 9090, enabling clients
to communicate with HBase through Thrift.
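Before writing a client, it can help to confirm that something is actually listening on the Thrift port. The following is a minimal sketch using only the Python standard library; the function name thrift_port_open is hypothetical, and the host and port defaults simply assume the local setup and default port 9090 described above. It checks TCP reachability only, not that the listener is really the HBase Thrift server.

```python
import socket


def thrift_port_open(host="localhost", port=9090, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example use:
#   print("Thrift server reachable:", thrift_port_open())
```

If the check returns False, the Thrift server log is the first place to look.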
Step 6: Interact with HBase via Thrift
You can now write a small Python (or Java) client using the generated bindings
to connect and perform operations like:
Creating a table
Inserting data
Scanning or retrieving records
Example in Python (after generating bindings):
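# Run with the generated gen-py directory on the Python path
# (for example: PYTHONPATH=gen-py python3 hbase_client.py).
from thrift import Thrift
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from hbase import Hbase

# Connect to the HBase Thrift server on the default port.
transport = TSocket.TSocket('localhost', 9090)
transport = TTransport.TBufferedTransport(transport)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = Hbase.Client(protocol)

transport.open()
print(client.getTableNames())  # list the tables known to HBase
transport.close()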
This retrieves and prints all table names in HBase.
Sample Output:
$ thrift -version
Thrift version 0.18.1
$ hbase-daemon.sh start thrift
starting thrift, logging to /usr/local/hbase/logs/[Link]
$ python3 hbase_client.py
['students', 'employees']
Result:
Apache Thrift was successfully installed and configured. HBase Thrift bindings
were generated, and interaction with HBase was achieved using a Thrift client.
This experiment demonstrates how Thrift enables cross-language
communication with HBase through a remote procedure interface.
Experiment: HBase Shell Commands to (I) create a table named student with two
column families, (II) insert data into the table, (III) get data from the
table, and (IV) delete data from the table.
Aim:
To perform fundamental HBase operations — creating a table, inserting data,
retrieving data, and deleting data using HBase shell commands.
Theory:
HBase is a distributed, scalable, NoSQL database built on top of Hadoop HDFS.
It stores data in a column-oriented format, and each table consists of rows and
column families.
Each column family can contain multiple columns, and data is stored as key-
value pairs.
HBase shell allows us to perform data definition (DDL) and data manipulation
(DML) tasks using simple commands.
Procedure:
Step 1: Start Hadoop and HBase Services
Before executing HBase commands, ensure that the Hadoop and HBase
daemons are running.
start-dfs.sh
start-hbase.sh
Then open the HBase shell:
hbase shell
You’ll see the HBase shell prompt:
hbase(main):001:0>
Step 2: (I) Create a Table
To create a table named student with two column families, say personal and
academic, use:
create 'student', 'personal', 'academic'
Here,
o student → table name
o personal → first column family (for storing name, age, etc.)
o academic → second column family (for marks, grade, etc.)
You can verify the table creation by listing all tables:
list
Step 3: (II) Insert Data into the Table
Use the put command to insert data into specific column families.
Example:
put 'student', '1', 'personal:name', 'John'
put 'student', '1', 'personal:age', '21'
put 'student', '1', 'academic:dept', 'CSE'
put 'student', '1', 'academic:marks', '89'
put 'student', '2', 'personal:name', 'Anita'
put 'student', '2', 'personal:age', '22'
put 'student', '2', 'academic:dept', 'ECE'
put 'student', '2', 'academic:marks', '93'
Explanation:
'student' → table name
'1' or '2' → row key
'personal:name' → column family and column qualifier
'John', '89', etc. → actual data value
Step 4: (III) Get Data from the Table
Use the get command to retrieve the contents of a specific row.
Example:
get 'student', '1'
This displays all data stored under row key 1.
To retrieve all records in the table:
scan 'student'
Step 5: (IV) Delete Data from the Table
You can delete data in several ways:
To delete a specific column:
delete 'student', '1', 'academic:marks'
To delete an entire row:
deleteall 'student', '1'
To delete the entire table (after disabling it first):
disable 'student'
drop 'student'
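The put, get, delete, and deleteall commands above can be pictured with a toy in-memory model. This is only an illustrative sketch, not how HBase is implemented: the ToyTable class is hypothetical, and it ignores timestamps, cell versions, and persistence. It does show the addressing scheme, where a row key maps to cells named family:qualifier.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for one HBase table (illustrative only)."""

    def __init__(self, *families):
        self.families = set(families)
        # row key -> {"family:qualifier": value}
        self.rows = defaultdict(dict)

    def put(self, row, column, value):
        # Column families are fixed at table creation; qualifiers are free-form.
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row][column] = value

    def get(self, row):
        # Return a copy of every cell stored under the row key.
        return dict(self.rows.get(row, {}))

    def delete(self, row, column):
        # Remove a single cell, like: delete 'student', '1', 'academic:marks'
        self.rows.get(row, {}).pop(column, None)

    def deleteall(self, row):
        # Remove the whole row, like: deleteall 'student', '1'
        self.rows.pop(row, None)


student = ToyTable("personal", "academic")
student.put("1", "personal:name", "John")
student.put("1", "academic:marks", "89")
student.delete("1", "academic:marks")
print(student.get("1"))  # {'personal:name': 'John'}
```

Writing to a family that was not declared at creation time fails in the model, which mirrors the fact that the shell's create command fixes the column families up front.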
Step 6: Exit from the HBase Shell
Once done, exit the shell using:
exit
Sample Output:
hbase(main):001:0> create 'student', 'personal', 'academic'
0 row(s) in 1.3450 seconds
=> Hbase::Table - student
hbase(main):002:0> put 'student', '1', 'personal:name', 'John'
0 row(s) in 0.0220 seconds
hbase(main):003:0> scan 'student'
ROW COLUMN+CELL
1 column=academic:dept, timestamp=..., value=CSE
1 column=academic:marks, timestamp=..., value=89
1 column=personal:age, timestamp=..., value=21
1 column=personal:name, timestamp=..., value=John
2 column=academic:dept, timestamp=..., value=ECE
2 column=academic:marks, timestamp=..., value=93
2 column=personal:age, timestamp=..., value=22
2 column=personal:name, timestamp=..., value=Anita
2 row(s) in 0.0150 seconds
hbase(main):004:0> delete 'student', '1', 'academic:marks'
0 row(s) in 0.0190 seconds
hbase(main):005:0> get 'student', '1'
COLUMN CELL
academic:dept timestamp=..., value=CSE
personal:age timestamp=..., value=21
personal:name timestamp=..., value=John
1 row(s) in 0.0110 seconds
Result:
The HBase table student was successfully created with two column families —
personal and academic.
Data was inserted, retrieved, and deleted using appropriate HBase shell
commands, demonstrating basic HBase CRUD (Create, Read, Update, Delete)
operations.
Write Hive DDL commands.
Experiment: Implementation of Hive DDL Commands
Aim:
To understand and execute Hive DDL (Data Definition Language) commands
for creating, modifying, and managing Hive databases and tables.
Theory:
In Hive, DDL commands are used to define the structure of databases, tables,
and partitions.
They do not manipulate data directly but describe how and where the data is
stored in Hive’s warehouse directory (usually in HDFS).
Hive DDL includes the following categories of commands:
Database operations — creating, showing, or dropping databases.
Table operations — creating, altering, describing, truncating, and
dropping tables.
Partition and view management.
These commands are written in Hive Query Language (HQL), similar to SQL but
designed for Hadoop’s distributed data model.
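To make the "where the data is stored" part concrete, the sketch below computes the directory a managed table gets by default. It assumes the stock warehouse location /user/hive/warehouse (the default value of hive.metastore.warehouse.dir) and no custom LOCATION clause; the function name managed_table_path is hypothetical.

```python
import posixpath

# Default value of hive.metastore.warehouse.dir (assumed unchanged here).
DEFAULT_WAREHOUSE = "/user/hive/warehouse"


def managed_table_path(database, table, warehouse=DEFAULT_WAREHOUSE):
    """Return the default HDFS directory of a managed Hive table."""
    # Tables in the built-in "default" database sit directly in the warehouse.
    if database.lower() == "default":
        return posixpath.join(warehouse, table.lower())
    # Any other database gets its own "<name>.db" directory.
    return posixpath.join(warehouse, database.lower() + ".db", table.lower())


print(managed_table_path("college", "students"))
# /user/hive/warehouse/college.db/students
```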
Common Hive DDL Commands:
1. CREATE DATABASE
Creates a new database (namespace) to organize tables.
CREATE DATABASE college;
To create a database only if it doesn’t already exist:
CREATE DATABASE IF NOT EXISTS college;
To specify a custom directory in HDFS:
CREATE DATABASE college
LOCATION '/user/hive/warehouse/college.db';
2. SHOW DATABASES
Displays all databases present in Hive.
SHOW DATABASES;
3. USE DATABASE
Switches to a specific database context.
USE college;
4. DROP DATABASE
Deletes an existing database.
DROP DATABASE college;
To delete a database along with its tables:
DROP DATABASE college CASCADE;
5. CREATE TABLE
Creates a new table in Hive.
Example:
CREATE TABLE students (
sid INT,
sname STRING,
sage INT,
sdepartment STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Explanation:
ROW FORMAT DELIMITED → data is stored as text, separated by a
delimiter.
FIELDS TERMINATED BY ',' → columns in a row are separated by commas.
STORED AS TEXTFILE → table data is stored as plain text in HDFS.
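The three clauses above mean that Hive applies the schema when it reads each text line. The sketch below imitates that for the students columns; it is a simplified illustration rather than Hive's actual SerDe, and both the function name parse_line and the sample row are made up for the example. Fields that cannot be converted to the declared type, or that are missing, come back as None, in the same way Hive returns NULL.

```python
# Column names and Python types mirroring the students table definition.
SCHEMA = [("sid", int), ("sname", str), ("sage", int), ("sdepartment", str)]


def parse_line(line, delimiter=","):
    """Split one delimited text line into a dict keyed by column name."""
    fields = line.rstrip("\n").split(delimiter)
    row = {}
    # zip stops at the shorter input, so extra trailing fields are ignored.
    for (name, cast), raw in zip(SCHEMA, fields):
        try:
            row[name] = cast(raw)
        except ValueError:
            row[name] = None  # value does not fit the column type
    # Columns with no corresponding field are left empty.
    for name, _cast in SCHEMA[len(fields):]:
        row[name] = None
    return row


print(parse_line("1,Asha,20,CSE"))
# {'sid': 1, 'sname': 'Asha', 'sage': 20, 'sdepartment': 'CSE'}
```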
6. SHOW TABLES
Lists all tables in the current database.
SHOW TABLES;
7. DESCRIBE TABLE
Displays the structure (schema) of a table.
DESCRIBE students;
For a detailed description (with partition info):
DESCRIBE FORMATTED students;
8. ALTER TABLE
Used to rename, add, or replace columns, and change table properties.
Examples:
Rename table:
ALTER TABLE students RENAME TO learners;
Add a new column:
ALTER TABLE learners ADD COLUMNS (gender STRING);
Replace all columns:
ALTER TABLE learners REPLACE COLUMNS (sid INT, sname STRING, sage INT);
9. TRUNCATE TABLE
Deletes all rows from a table but keeps its structure.
TRUNCATE TABLE learners;
10. DROP TABLE
Removes a table from Hive. For a managed table the data is deleted as well;
for an external table only the metadata is removed.
DROP TABLE learners;
Procedure:
1. Start Hadoop and Hive services:
   start-dfs.sh
   start-yarn.sh
   hive
2. Create and switch to a new database:
   CREATE DATABASE college;
   USE college;
3. Create a table:
   CREATE TABLE students (
     sid INT,
     sname STRING,
     sage INT,
     sdepartment STRING
   )
   ROW FORMAT DELIMITED
   FIELDS TERMINATED BY ','
   STORED AS TEXTFILE;
4. Describe the table to verify structure:
   DESCRIBE students;
5. Alter table (optional):
   ALTER TABLE students ADD COLUMNS (gender STRING);
6. Show all tables:
   SHOW TABLES;
7. Truncate or drop table (optional cleanup):
   DROP TABLE students;
Sample Output:
hive> CREATE DATABASE college;
OK
Time taken: 0.349 seconds
hive> USE college;
OK
Time taken: 0.021 seconds
hive> CREATE TABLE students (
> sid INT,
> sname STRING,
> sage INT,
> sdepartment STRING
>)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE;
OK
Time taken: 0.452 seconds
hive> SHOW TABLES;
OK
students
hive> DESCRIBE students;
OK
sid int
sname string
sage int
sdepartment string
Result:
Hive DDL commands were successfully executed to create, modify, describe,
and drop databases and tables.
This experiment demonstrated how Hive defines and manages the schema of
datasets stored in Hadoop’s distributed environment.
Write Hive DML commands.
Experiment: Implementation of Hive DML Commands
Aim:
To understand and perform Data Manipulation Language (DML) operations in
Hive — including loading, inserting, selecting, updating, and deleting data from
Hive tables.
Theory:
Hive DML (Data Manipulation Language) commands are used to manipulate
or access data stored inside Hive tables.
They allow users to add, modify, retrieve, and remove records.
Hive stores its data in the Hadoop Distributed File System (HDFS), so DML
operations are executed as MapReduce, Tez, or Spark jobs under the hood.
The major Hive DML operations include:
1. LOAD DATA – Importing data files into Hive tables.
2. INSERT – Adding records directly through HQL queries.
3. SELECT – Retrieving data from tables.
4. UPDATE – Modifying existing records (transactional tables only, Hive 0.14
and later).
5. DELETE – Removing records that match a condition (transactional tables
only).
Procedure:
Step 1: Start Hadoop and Hive
Start the necessary services and open Hive shell:
start-dfs.sh
start-yarn.sh
hive
Step 2: Create a Database (Optional)
To organize your work:
CREATE DATABASE company;
USE company;
Step 3: Create a Table
Create a sample table named employees:
CREATE TABLE employees (
eid INT,
ename STRING,
eage INT,
edept STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Step 4: (I) LOAD DATA Command
The LOAD DATA command loads external data (from local or HDFS) into a Hive
table.
1. Create a text file named [Link] in your local system (e.g.
   /home/hadoop/[Link]).
   Example content:
   101,John,28,HR
   102,Anita,30,IT
   103,Ravi,26,Finance
   104,Sneha,29,Marketing
2. Load the data into the Hive table:
   LOAD DATA LOCAL INPATH '/home/hadoop/[Link]' INTO TABLE employees;
   o LOCAL → takes data from the local file system.
   o If you omit LOCAL, it expects the file in HDFS.
Step 5: (II) INSERT Command
The INSERT command is used to add new rows directly through queries.
Example:
INSERT INTO TABLE employees VALUES (105, 'Karan', 27, 'Sales');
You can also insert data from another table:
INSERT INTO TABLE employees SELECT * FROM another_table;
To overwrite existing data:
INSERT OVERWRITE TABLE employees VALUES (201, 'Meena', 25, 'HR');
Step 6: (III) SELECT Command
The SELECT command retrieves data from the table.
Examples:
To display all records:
SELECT * FROM employees;
To select specific columns:
SELECT eid, ename FROM employees;
To filter data:
SELECT * FROM employees WHERE edept = 'IT';
To perform aggregation:
SELECT edept, COUNT(*) AS total FROM employees GROUP BY edept;
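As a rough picture of what the filter and aggregation queries compute, here is a small Python sketch run over the four sample rows from the LOAD DATA step. It only mirrors the query semantics on an in-memory list; the helper names filter_dept and count_by_dept are hypothetical, and nothing here talks to Hive.

```python
from collections import Counter

# Sample rows (eid, ename, eage, edept) from the LOAD DATA step.
employees = [
    (101, "John", 28, "HR"),
    (102, "Anita", 30, "IT"),
    (103, "Ravi", 26, "Finance"),
    (104, "Sneha", 29, "Marketing"),
]


def filter_dept(rows, dept):
    """Like: SELECT * FROM employees WHERE edept = '<dept>';"""
    return [row for row in rows if row[3] == dept]


def count_by_dept(rows):
    """Like: SELECT edept, COUNT(*) AS total FROM employees GROUP BY edept;"""
    return Counter(edept for _eid, _ename, _eage, edept in rows)


print(filter_dept(employees, "IT"))  # [(102, 'Anita', 30, 'IT')]
for dept, total in sorted(count_by_dept(employees).items()):
    print(dept, total)
```

With the sample data every department appears once, so each group has a count of 1.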
Step 7: (IV) UPDATE Command
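Used to modify existing records in a table.
(Requires a transactional table, i.e., one stored as ORC and created with
TBLPROPERTIES ('transactional'='true'), with ACID support enabled in the Hive
configuration. The TEXTFILE table from Step 3 is not transactional, so it must
be recreated this way before UPDATE will work.)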
Example:
UPDATE employees SET eage = 29 WHERE ename = 'Ravi';
Step 8: (V) DELETE Command
Deletes records matching a condition (like UPDATE, this requires a
transactional table).
Example:
DELETE FROM employees WHERE eid = 104;
To delete all records:
TRUNCATE TABLE employees;
Step 9: Verify the Changes
View the updated data:
SELECT * FROM employees;
Sample Output:
hive> LOAD DATA LOCAL INPATH '/home/hadoop/[Link]' INTO TABLE employees;
Loading data to table company.employees
Table company.employees stats: [numFiles=1, totalSize=120]
OK
Time taken: 0.623 seconds
hive> SELECT * FROM employees;
OK
101 John 28 HR
102 Anita 30 IT
103 Ravi 26 Finance
104 Sneha 29 Marketing
Time taken: 0.288 seconds
hive> INSERT INTO TABLE employees VALUES (105, 'Karan', 27, 'Sales');
OK
Time taken: 0.121 seconds
hive> DELETE FROM employees WHERE eid = 104;
OK
Time taken: 0.159 seconds
hive> SELECT * FROM employees;
OK
101 John 28 HR
102 Anita 30 IT
103 Ravi 26 Finance
105 Karan 27 Sales
Time taken: 0.227 seconds
Result:
The Hive DML commands were successfully executed.
Data was loaded into the Hive table, and records were inserted, queried,
updated, and deleted as required.
This experiment demonstrates how Hive performs data manipulation on large-
scale datasets stored in HDFS using SQL-like syntax.
Experiment: Multiplication of Two Numbers using Hadoop MapReduce
Aim:
To write and execute a Java program that multiplies two numbers using the
Hadoop MapReduce framework.
Objective:
The objective of this experiment is to understand the workflow of Hadoop
MapReduce by implementing a simple arithmetic operation — multiplication
— to demonstrate how data is processed in the Map and Reduce phases.
Theory:
Hadoop MapReduce is a programming model used for processing large data
sets with a distributed algorithm on a cluster.
It divides the task into two phases:
1. Mapper Phase:
o Takes input as key-value pairs.
o Processes the input data and generates intermediate key-value
pairs.
2. Reducer Phase:
o Takes the intermediate data from the Mapper.
o Aggregates or computes the final result.
For this simple multiplication program:
Mapper reads each input line, parses the two numbers, and emits their product
under a fixed key.
Reducer multiplies the values received for that key and writes the final
result.
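The data flow can be tried out without a cluster. The following Python sketch simulates the map, shuffle, and reduce stages in memory for the multiplication job; it is a teaching aid under simplified assumptions (one process, no HDFS), and the function names map_phase, shuffle, and reduce_phase are hypothetical rather than part of any Hadoop API.

```python
from collections import defaultdict
from functools import reduce
from operator import mul


def map_phase(lines):
    """Mapper: emit ("Result", a * b) for every line holding two integers."""
    for line in lines:
        a, b = (int(token) for token in line.split())
        yield "Result", a * b


def shuffle(pairs):
    """Shuffle: group all intermediate values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reducer: multiply together every value that shares a key."""
    return {key: reduce(mul, values, 1) for key, values in grouped.items()}


print(reduce_phase(shuffle(map_phase(["4 5"]))))  # {'Result': 20}
```

Because every line is emitted under the same key, an input with several lines ends up as the product of all the per-line products.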
Software Requirements:
Operating System: Ubuntu / Windows Subsystem for Linux
Hadoop version: 3.x or above
Java JDK 8 or above
Hadoop configured and running in pseudo-distributed mode
Procedure:
Step 1: Start Hadoop services
start-dfs.sh
start-yarn.sh
Step 2: Create an input directory in HDFS
hdfs dfs -mkdir /input
Step 3: Create a text file [Link] with two numbers to multiply
nano [Link]
Example content:
4 5
Upload to HDFS:
hdfs dfs -put [Link] /input
Step 4: Write the MapReduce Java program
Save the following as Multiply.java
Program:
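import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Multiply {

    // Mapper: parses two integers from each input line and emits their product.
    public static class MultiplyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static Text keyOut = new Text("Result");

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split on any run of whitespace so extra spaces do not break parsing.
            String[] nums = value.toString().trim().split("\\s+");
            int num1 = Integer.parseInt(nums[0]);
            int num2 = Integer.parseInt(nums[1]);
            int product = num1 * num2;
            context.write(keyOut, new IntWritable(product));
        }
    }

    // Reducer: multiplies all values received for the key.
    public static class MultiplyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int result = 1;
            for (IntWritable val : values) {
                result *= val.get();
            }
            context.write(key, new IntWritable(result));
        }
    }

    // Driver: args[0] is the HDFS input path, args[1] the output path.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Multiplication");
        job.setJarByClass(Multiply.class);
        job.setMapperClass(MultiplyMapper.class);
        job.setReducerClass(MultiplyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}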
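Step 5: Compile and create a jar file
Compiling through the hadoop command needs the JDK compiler on Hadoop's
classpath (on JDK 8: export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar).
hadoop com.sun.tools.javac.Main Multiply.java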
jar cf [Link] Multiply*.class
Step 6: Run the MapReduce job
hadoop jar [Link] Multiply /input /output
Step 7: View the output
hdfs dfs -cat /output/part-r-00000
Sample Output:
Result 20
Result:
The MapReduce program successfully multiplied two numbers (4 × 5) and
produced the output 20 using the Hadoop MapReduce framework.
Conclusion:
This experiment demonstrates how a simple arithmetic operation can be
executed in a distributed manner using Hadoop MapReduce. The same logic
can be extended to perform complex computations over large-scale datasets.
Experiment: Addition of Two Numbers using Hadoop MapReduce
Aim:
To write and execute a Java program that adds two numbers using the Hadoop
MapReduce framework.
Objective:
To understand how the Mapper and Reducer phases in Hadoop MapReduce
process simple numerical input and compute the sum of two numbers.
Theory:
Mapper: Reads input lines, splits them into two numbers, and computes
their sum.
Reducer: Aggregates the intermediate results and outputs the final total.
This simple operation demonstrates Hadoop’s distributed computation model.
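A quick way to see what the reducer's aggregation adds is to feed the job more than one line. The sketch below is a single-process Python imitation of the addition job, not Hadoop code; the names add_mapper, add_reducer, and run_job are hypothetical, and sorting the pairs stands in for the shuffle step.

```python
from itertools import groupby


def add_mapper(line):
    """Mapper: turn a line such as "10 20" into the pair ("Sum", 30)."""
    a, b = map(int, line.split())
    return ("Sum", a + b)


def add_reducer(key, values):
    """Reducer: total all per-line sums that share a key."""
    return (key, sum(values))


def run_job(lines):
    # Sorting by key plays the role of the shuffle; blank lines are skipped.
    pairs = sorted(add_mapper(line) for line in lines if line.strip())
    return [
        add_reducer(key, [value for _key, value in group])
        for key, group in groupby(pairs, key=lambda pair: pair[0])
    ]


print(run_job(["10 20"]))         # [('Sum', 30)]
print(run_job(["10 20", "1 2"]))  # [('Sum', 33)]
```

With a single input line the reducer simply passes the mapper's sum through; with several lines it returns the grand total.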
Procedure:
Step 1: Create input file
nano [Link]
Example content:
10 20
Step 2: Upload file to HDFS
hdfs dfs -mkdir /input
hdfs dfs -put [Link] /input
Step 3: Write Java Program (Addition.java)
Program:
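import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Addition {

    // Mapper: parses two integers from each input line and emits their sum.
    public static class AddMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static Text keyOut = new Text("Sum");

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split on any run of whitespace so extra spaces do not break parsing.
            String[] nums = value.toString().trim().split("\\s+");
            int num1 = Integer.parseInt(nums[0]);
            int num2 = Integer.parseInt(nums[1]);
            int sum = num1 + num2;
            context.write(keyOut, new IntWritable(sum));
        }
    }

    // Reducer: adds up all values received for the key.
    public static class AddReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable val : values) {
                total += val.get();
            }
            context.write(key, new IntWritable(total));
        }
    }

    // Driver: args[0] is the HDFS input path, args[1] the output path.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Addition");
        job.setJarByClass(Addition.class);
        job.setMapperClass(AddMapper.class);
        job.setReducerClass(AddReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}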
Step 4: Compile and Create Jar File
hadoop com.sun.tools.javac.Main Addition.java
jar cf [Link] Addition*.class
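Step 5: Run the Program
The output directory must not already exist (remove an earlier one with
hdfs dfs -rm -r /output), and /input should contain only the addition input
file, since every line found there is added to the total.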
hadoop jar [Link] Addition /input /output
Step 6: View Output
hdfs dfs -cat /output/part-r-00000
Sample Output:
Sum 30
Result:
The Hadoop MapReduce program successfully computed the sum of two
numbers (10 + 20 = 30) using the Mapper and Reducer phases.