Introduction to Apache Hive Overview
Introduction to Apache Hive Overview
[Link] Page 1
Big Data Analytics-BCS714D-Module 4
[Link] Page 2
Big Data Analytics-BCS714D-Module 4
1. It is similar to SQL.
2. HQL is easy to code.
3. Hive supports rich data types such as structs, lists and maps.
4. Hive supports SQL filters, group-by and order-by clauses.
5. Custom Types, Custom Functions can be defined.
Hourly Log Data can be stored directly into HDFS and then data cleansing is
performed on the log file. Finally, Hive table(s) can be created to query the log
file.
4.1.4 Hive Data Units
1. Databases: The namespace for tables.
2. Tables: Set of records that have similar schema.
3. Partitions: Logical separations of data based on classification of given
information as per specific attributes. Once hive has partitioned the data based
[Link] Page 3
Big Data Analytics-BCS714D-Module 4
on a specified key, it starts to assemble the records into specific folders as and
when the records are inserted.
4. Buckets (or Clusters): Similar to partitions but uses hash function to
segregate data and determines the cluster or bucket into which the record
should be placed.
• Partitioning tables changes how Hive structures the data storage.
• Hive will create subdirectories reflecting the partitioning structure.
• Although partitioning helps in enhancing performance and is
recommended, having too many partitions may prove detrimental for few
queries.
• Bucketing is another technique of managing large datasets. If we
partition the dataset based on customer_ID, we would end up with far
too many partitions. Instead, if we bucket the customer table and use
customer_id as the bucketing column, the value of this column will be
hashed by a user-defined number into buckets.
When to Use Partitioning/Bucketing?
Bucketing works well when the field has high cardinality (cardinality is the
number of values a column or field can have) and data is evenly distributed
among buckets. Partitioning works best when the cardinality of the partitioning
field is not too high. Partitioning can be done on multiple fields with an order
(Year/Month/ Day) whereas bucketing can be done on only one field.
Figure shows how these data units are arranged in a Hive Cluster.
[Link] Page 4
Big Data Analytics-BCS714D-Module 4
[Link] Page 5
Big Data Analytics-BCS714D-Module 4
[Link] Page 6
Big Data Analytics-BCS714D-Module 4
the Hive metastore service runs in the main Hive Server process, but the
metastore database runs in a separate process, and can be on a separate host.
Figure below shows a Local Metastore.
3. Remote Metastore: In this, the Hive driver and the metastore interface run
on different JVMs (which can run on different machines as well) as in Figure
below. This way the database can be fire-walled from the Hive user and also
database credentials are completely isolated from the users of Hive.
[Link] Page 7
Big Data Analytics-BCS714D-Module 4
[Link] Page 8
Big Data Analytics-BCS714D-Module 4
[Link] Page 9
Big Data Analytics-BCS714D-Module 4
Instead of only partitioning the table horizontally like the row-oriented DBMS
(row-store), RCFile partitions this table first horizontally and then vertically to
serialize the data. Based on the user-specified value, first the table is
partitioned into multiple row groups horizontally. Depicted in Table-2, Table-1
is partitioned into two row groups by considering three rows as the size of each
row group.
Next, in every row group RCFile partitions the data vertically like column-store.
So the table will be serialized as shown in Table-3.
Hive query language provides basic SQL like operations. Below are few of the
tasks which HQL can do easily.
1. Create and manage tables and partitions.
2. Support various Relational, Arithmetic, and Logical Operators.
3. Evaluate functions.
[Link] Page 10
Big Data Analytics-BCS714D-Module 4
[Link] Page 11
Big Data Analytics-BCS714D-Module 4
[Link] Page 12
Big Data Analytics-BCS714D-Module 4
SHOW DATABASES LIKE 'Stud???'; Lists all databases whose names start
with "Stud" followed by exactly 3 characters.
Objective:
To view details about an existing database in Hive.
Command (Act):
DESCRIBE DATABASE STUDENTS;
This command displays the database name, comment (description), and the
directory location in HDFS where the database is stored.
Outcome (Sample Output):
students STUDENT Details
hdfs://[Link]/user/hive/warehouse/students.
db root USER
Time taken: 0.03 seconds, Fetched: 1 row(s)
students: The name of the database.
STUDENT Details: The comment provided during creation.
hdfs://.../[Link]: The HDFS directory path where the database is
located.
root USER: Owner information.
This command is useful for confirming database metadata and verifying
storage paths.
Objective:
To describe the extended properties of a database.
Command (Act):
DESCRIBE DATABASE EXTENDED STUDENTS;
This command displays additional information such as DB properties (defined
using DBPROPERTIES during creation), along with standard database
metadata.
Outcome (Sample Output):
[Link] Page 14
Big Data Analytics-BCS714D-Module 4
Verification:
You can verify the updated properties using the command:
DESCRIBE DATABASE EXTENDED STUDENTS;
This will now include:
{creator=JOHN, edited-by=JAMES}
Objective:
To make a database the current working database.
Command (Act):
USE STUDENTS;
Outcome (Example):
hive> USE STUDENTS;
OK
Time taken: 0.02 seconds
There is no direct command to display the current active database in Hive.
However, to make the command prompt display the current database name as
a suffix, use the following setting:
set [Link]=true;
This will help you keep track of which database you're working in during a Hive
session.
Objective:
To drop a database.
Command (Act):
DROP DATABASE STUDENTS;
Hive stores databases in the warehouse directory (e.g., /user/hive/warehouse).
Deleting a database only works if no tables exist inside it — this is the default
RESTRICT mode.
To drop a database along with all its tables:
DROP DATABASE STUDENTS CASCADE;
The CASCADE option deletes the database along with all contained tables.
RESTRICT (default) prevents deletion if any tables are present.
[Link] Page 16
Big Data Analytics-BCS714D-Module 4
Complete Syntax:
DROP DATABASE [IF EXISTS] database_name [RESTRICT | CASCADE];
4.5.5 Tables
Hive provides two kinds of table:
1. Internal or Managed Table
2. External Table
[Link] Page 17
Big Data Analytics-BCS714D-Module 4
[Link] Page 19
Big Data Analytics-BCS714D-Module 4
Input:
1001,John,Smith:Jones,Mark1!45:Mark2!46:Mark3!43
1002,Jack,Smith:Jones,Mark1!46:Mark2!47:Mark3!42
Act:
CREATE TABLE STUDENT_INFO(rollno INT,name String, sub
ARRAY<STRING>,marks MAP<STRING,INT>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ':'
MAP KEYS TERMINATED BY '!';
LOAD DATA LOCAL INPATH '/root/hivedemos/[Link]' INTO
STUDENT_INFO;
Outcome:
[Link] Page 20
Big Data Analytics-BCS714D-Module 4
Act:
SELECT * from STUDENT_INFO;
SELECT NAME,SUB FROM STUDENT_INFO;
// To retrieve value of Mark1
SELECT NAME, MARKS['Mark1'] from STUDENT_INFO;-
// To retrieve subordinate (array) value
SELECT NAME,SUB[0] FROM STUDENT_INFO;
Outcome:
4.5.6 Partitions
In Hive, without partitioning, queries scan the entire dataset, leading to high
I/O and slow performance. Partitioning divides data into subdirectories (like by
year or region), so Hive reads only relevant parts. This reduces I/O, speeds up
queries, and improves MapReduce job efficiency.
Partition is of two types:
1. STATIC PARTITION: It is upon the user to mention the partition (the
segregation unit) where the data from the file is to be loaded.
[Link] Page 21
Big Data Analytics-BCS714D-Module 4
Hive creates the folder for the value specified in the partition.
Objective: To add one more static partition based on "gpa" column using
the "alter" statement.
Act:
ALTER TABLE STATIC_PART_STUDENT ADD PARTITION (gpa=3.5);
INSERT OVERWRITE TABLE STATIC_PART_STUDENT PARTITION (gpa
=4.0) SELECT rollno,name from EXT_STUDENT where gpa=4.0;
Outcome:
[Link] Page 23
Big Data Analytics-BCS714D-Module 4
Act:
CREATE TABLE IF NOT EXISTS DYNAMIC_PART_STUDENT(rollno
INT,name STRING) PARTITIONED BY (gpa FLOAT) ROW FORMAT
DELIMITED FIELDS TERMINATED BY '\t';
Outcome:
4.5.7 Bucketing
Bucketing is similar to partition. However, there is a subtle difference between
partition and bucketing. In a partition, you need to create partition for each
unique value of the column. This may lead to situations where you may end up
[Link] Page 24
Big Data Analytics-BCS714D-Module 4
4.5.8 Views
In Hive, view support is available only in version starting from 0.6. Views are
purely logical object.
Objective: To create a view table named “STUDENT_VIEW”.
Act:
CREATE VIEW STUDENT_VIEW AS SELECT rollno, name FROM
EXT_STUDENT;
[Link] Page 25
Big Data Analytics-BCS714D-Module 4
Outcome:
4.5.9 Sub-Query
In Hive, sub-queries are supported only in the FROM clause (Hive 0.12). You
need to specify name for sub- query because every table in a FROM clause has
a name. The columns in the sub-query select list should have unique names.
The columns in the subquery select list are available to the outer query just
like columns of a table.
[Link] Page 26
Big Data Analytics-BCS714D-Module 4
4.5.10 Joins
Joins in Hive is similar to the SQL Join.
Objective: To create JOIN between Student and Department tables where
we use RollNo from both the tables as the join key.
Act:
CREATE TABLE IF NOT EXISTS STUDENT(rollno INT,name STRING,gpa
FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/root/hivedemos/[Link]' OVERWRITE
INTO TABLE STUDENT;
CREATE TABLE IF NOT EXISTS DEPARTMENT(rollno INT,deptno int,name
STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/root/hivedemos/[Link]'
OVERWRITE INTO TABLE DEPARTMENT;
SELECT [Link], [Link], [Link], [Link] FROM STUDENT a JOIN
DEPARTMENT ь ON [Link] = [Link];
4.5.11 Aggregation
[Link] Page 27
Big Data Analytics-BCS714D-Module 4
Act:
CREATE TABLE STUDENT_RC( rollno int, name string,gpa float ) STORED
AS RCFILE; INSERT OVERWRITE table STUDENT_RC SELECT * FROM
STUDENT; SELECT SUM(gpa) FROM STUDENT_RC;
Outcome:
return [Link]();
}
}
Note: Convert this Java Program into Jaṛ.
ADD JAR /root/hivedemos/[Link];
CREATE TEMPORARY FUNCTION touppercase AS
'[Link]';
SELECT TOUPPERCASE(name) FROM STUDENT;
INTRODUCTION TO PIG
4.9 What is PIG?
[Link] Page 30
Big Data Analytics-BCS714D-Module 4
Pig runs on Hadoop. Pig uses both Hadoop Distributed File System and
MapReduce Programming. By default, Pig reads input files from HDFS. Pig
stores the intermediate data (data produced by MapReduce jobs) and the
output in HDFS. However, Pig can also read input from and place output to
other sources. Pig supports the following:
1. HDFS commands.
2. UNIX shell commands.
3. Relational operators.
4. Positional parameters.
5. Common mathematical functions.
6. Custom functions.
7. Complex data structures.
Pig is widely used for "ETL" (Extract, Transform, and Load). Pig can extract
data from different sources such as ERP, Accounting, Flat Files, etc. Pig then
makes use of various operators to perform transformation on the data and
subsequently loads it into the data warehouse. Refer Figure below.
[Link] Page 32
Big Data Analytics-BCS714D-Module 4
[Link] Page 33
Big Data Analytics-BCS714D-Module 4
2. It should begin with a letter and should be followed only by letters, numbers,
and underscores.
The below table shows the list of valid and invalid identifiers.
[Link] Page 34
Big Data Analytics-BCS714D-Module 4
Table below describes simple data types supported in Pig. In Pig, fields of
unspecified types are considered as an array of bytes which is known as
bytearray.
Null: In Pig Latin, NULL denotes a value that is unknown or is non-existent.
[Link] Page 35
Big Data Analytics-BCS714D-Module 4
1. Interactive Mode
2. Batch Mode
1. Interactive Mode:
Command to start Pig shell:
pig
• Pig runs in Grunt shell, an interactive environment for executing Pig
Latin scripts.
• Warnings show deprecated configuration parameters (e.g.,
[Link], [Link]) and suggest alternatives like
[Link], [Link].
We need to create "Pig Script" to run pig in batch mode. Write Pig Latin
statements in a file and save it with .pig extension.
[Link] Page 36
Big Data Analytics-BCS714D-Module 4
1. Local Mode.
2. MapReduce Mode.
Local Mode
To run pig in local mode, we need to have your files in the local file system.
Syntax:
MapReduce Mode
Syntax:
pig filename
Objective: Learn how to perform HDFS operations directly from the Pig Grunt
shell.
Command Example:
Explanation:
[Link] Page 37
Big Data Analytics-BCS714D-Module 4
Purpose:
The FILTER operator is used to select tuples (records) from a relation based on
specified conditions.
Objective:
Select students whose GPA is greater than 4.0.
Input:
Relation Student with schema:
(rollno:int, name:chararray, gpa:float)
Commands (Act):
A = LOAD '/pigdemo/[Link]' AS (rollno:int, name:chararray,
gpa:float);
B = FILTER A BY gpa > 4.0;
DUMP B;
Output:
(1003, Smith, 4.5) (1004, Scott, 4.2)
FILTER is used to apply conditions and reduce the dataset.
It helps in extracting meaningful subsets of data for further processing.
2. FOREACH Operator in Pig
Purpose:
The FOREACH operator is used for data transformation based on columns of a
relation.
Objective:
Display the names of all students in uppercase.
Input:
Relation Student with schema:
(rollno:int, name:chararray, gpa:float)
Commands (Act):
A = LOAD '/pigdemo/[Link]' AS (rollno:int, name:chararray,
gpa:float);
B = FOREACH A GENERATE UPPER(name);
[Link] Page 38
Big Data Analytics-BCS714D-Module 4
DUMP B;
Output:
(JOHN)
(JACK)
(SMITH)
(SCOTT)
(JOSHI)
FOREACH ... GENERATE is used to apply transformations to each row.
Functions like UPPER() can be used to modify specific fields.
3. GROUP Operator in Pig
Purpose:
The GROUP operator is used to group data based on a column.
Objective:
Group tuples of students based on their GPA.
Input:
Relation Student with schema:
(rollno:int, name:chararray, gpa:float)
Commands (Act):
A = LOAD '/pigdemo/[Link]' AS (rollno:int, name:chararray,gpa:float);
B = GROUP A BY gpa;
DUMP B;
Output:
(3.0,{(1001,John,3.0),(1001,John,3.0)})
(3.5,{(1005,Joshi,3.5),(1005,Joshi,3.5)})
(4.0,{(1008,James,4.0),(1002,Jack,4.0)})
(4.2,{(1007,David,4.2),(1004,Scott,4.2)})
(4.5,{(1006,Alex,4.5),(1003,Smith,4.5)})
GROUP A BY <column> creates groups where all tuples with the same value in
the specified column are grouped together.
Each group is represented as (group_key, bag_of_tuples).
[Link] Page 39
Big Data Analytics-BCS714D-Module 4
[Link] Page 41
Big Data Analytics-BCS714D-Module 4
[Link] Page 42
Big Data Analytics-BCS714D-Module 4
[Link] Page 44
Big Data Analytics-BCS714D-Module 4
9. SPLIT
It is used to partition a relation into two or more relations.
Objective: To partition a relation based on the GPAs acquired by the
students.
• GPA = 4.0, place it into relation X.
• GPA is < 4.0, place it into relation Y.
Input:
Student (rollno:int,name:chararray,gpa:float)
Act:
A = load '/pigdemo/[Link]' as (rollno:int, name:chararray, gpa:float);
SPLIT A INTO X IF gpa==4.0, Y IF gpa<=4.0;
DUMP X;
[Link] Page 45
Big Data Analytics-BCS714D-Module 4
10. SAMPLE
It is used to select random sample of data based on the specified sample size.
Objective: To depict the use of SAMPLE.
Input:
Student (rollno:int,name:chararray,gpa:float)
Act:
A = load '/pigdemo/[Link]' as (rollno:int, name:chararray, gpa:float);
B = SAMPLE A 0.01;
DUMP B;
[Link] Page 46
Big Data Analytics-BCS714D-Module 4
4.20.2 MAX
MAX is used to compute the maximum of numeric values in a single column
bag.
Objective: To calculate the maximum marks for each student.
Input: Student (studname:chararray,marks:int)
Act:
A= load '/pigdemo/[Link]' USING PigStorage (",") as
(studname:chararray, marks:int);
B = GROUP A BY studname;
C = FOREACH B GENERATE [Link], MAX([Link]);
DUMP C;
4.20.3 COUNT
COUNT is used to count the number of elements in a bag.
Objective: To count the number of tuples in a bag.
Input:
Student (studname:chararray,marks:int)
Act:
A= load '/pigdemo/[Link]' USING PigStorage (",") as
(studname:chararray, marks:int);
B = GROUP A BY studname;
C = FOREACH B GENERATE A. studname,COUNT(A);
DUMP C;
[Link] Page 47
Big Data Analytics-BCS714D-Module 4
DUMP B
Note: Convert above java class into jar to include this function into
your code.
Input:
Student (rollno:int,name:chararray,gpa:float)
Act:
register /root/pigdemos/[Link];
A = load '/pigdemo/[Link]' as (rollno:int, name:chararray, gpa:float);
B = FOREACH A GENERATE [Link](name);
DUMP B;
[Link] Page 50
Big Data Analytics-BCS714D-Module 4
*****END*****
[Link] Page 51
The SPLIT operator in Pig partitions a relation into multiple sub-relations based on specified conditions, enabling tailored data processing. A practical application is separating datasets into categories, such as GPA scores, to facilitate distinct analyses or visualizations for each subset, optimizing data workflows .
The JOIN operation in Pig, unlike traditional SQL which performs equi-joins by default, executes an inner join, merging records based on a key value only when there is a match across datasets. This requires careful data alignment and large-scale processing considerations, as mismatched joins can lead to data loss that is significant in a fragmented environment like Hadoop .
The DESC DATABASE EXTENDED command in Hive provides more detailed information than DESCRIBE DATABASE. It includes additional details such as database properties defined via DBPROPERTIES during creation, alongside standard metadata, offering a complete overview of the database for thorough management and auditing purposes .
Hive's DDL commands allow comprehensive management of databases and tables, facilitating creation, alteration, and deletion of database objects. This enables administrators to efficiently organize data sets, optimize query performance via views and indexes, and maintain a clean schema structure, crucial for scalable data warehousing operations .
To start a Hive Shell, navigate to the Hive installation path, open a terminal, and type `hive`. The terminal initializes Hive, possibly displaying log messages related to SLF4J bindings. Understanding SLF4J binding messages is important because they indicate that Hive is loading its logging configurations and selecting an appropriate logging framework, which is crucial for debugging and monitoring .
In Hive, internal tables, managed by Hive, store data in the warehouse and drop both the table and its data when deleted. External tables manage data outside Hive; dropping an external table only deletes metadata while data remains intact. This distinction affects how data is preserved upon table deletion .
RCFile improves the efficiency of aggregation operations by storing data in a column-oriented manner. This storage method ensures that the aggregation operation over a large dataset is not an expensive operation because it primarily accesses the necessary columns, minimizing the I/O operations necessary for computation .
Creating external tables in Hive is significant because it specifies data locations outside the Hive warehouse, allowing data to remain post-table operations. This is crucial for managing datasets that must persist independent of metadata changes, facilitating data integration with other Hadoop ecosystem tools or systems .
Hive benefits data warehousing tasks in Hadoop by providing a SQL-like interface, easing complexity for users familiar with SQL to perform complex queries. It supports scalability, customization with partitioning, integration with Hadoop storage, and provides robust functionalities like transaction handling and data serialization, making it effective for big data analytics .
The Pig DISTINCT operator is useful in data processing as it removes duplicate tuples from a relation, operating on entire records rather than individual fields. This reduction of data redundancy is crucial before groupings or joins to streamline data and optimize subsequent analyses or transformations .