APACHE HIVE
By: Dr. Nandini Kumari
2
Apache Hive is an open-source software utility built on
top of Hadoop to provide easy data storage and analysis
in style of SQL.
It is a distributed, fault-tolerant data warehouse system
HIVE that enables analytics at a massive scale and facilitates
reading, writing, and managing petabytes of data
residing in distributed storage using SQL.
Hive is built on top of Apache Hadoop and supports
storage on S3, adls, gfs etc though HDFS.
WHAT IS HIVE?
3
4
Hive Metastore(HMS) • The Hive Metastore (HMS) is a central
repository of metadata for Hive tables and
partitions in a relational database, and
provides clients (including Hive, Impala and
Spark) access to this information using the
metastore service API.
• It has become a building block for data lakes
that utilize the diverse world of open-source
software, such as Apache Spark and Presto.
In fact, a whole ecosystem of tools, open-
source and otherwise, are built around the
Hive Metastore.
5
Features of HIVE •Hive supports any client application written in
Java, PHP, Python, C++ or Ruby by exposing
its Thrift server. (You can use these client – side
languages embedded with SQL for accessing a
database such as DB2, etc.).
•As the metadata information of Hive is stored in
an RDBMS, it significantly reduces the time to
perform semantic checks during query execution.
•Useful for people who aren’t from a
programming background as it eliminates the
need to write complex MapReduce program.
•Extensible and scalable to cope up with the
growing volume and variety of data, without
affecting performance of the system.
•It is as an efficient ETL (Extract, Transform, Load)
tool.
6
Apache Hive – Applications
7
Apache Hive Use Case – Facebook:
$sudo gedit [Link]
Hive Architecture: 9
Hive Architecture can be categorized into the following
components:
•Hive Clients: Hive supports application written in
many languages like Java, C++, Python etc. using JDBC,
Thrift and ODBC drivers. Hence one can always write
hive client application written in a language of their
choice.
•Hive Services: Apache Hive provides various services
like CLI, Web Interface etc. to perform queries.
•Processing framework and Resource
Management: Internally, Hive uses Hadoop
MapReduce framework as execution engine to execute
the queries.
•Distributed Storage: As Hive is installed on top of
Hadoop, it uses the underlying HDFS for the distributed
storage.
10
11
Hive Components: Shell:-
Provides User interface provide an interface
between user and hive. It enables user to submit
queries and other operations to the system.
Metastore –
All the structured data or information of the
different tables and partition in the warehouse
containing attributes and attributes level
information are stored in the metastore.
Execution Engine –
Execution of the execution plan made by the
compiler is performed in the execution engine.
Compiler –
Queries are parses, semantic analysis on the
different query blocks and query expression is
done by the compiler.
Drivers:-
Queries of the user after the interface are
received by the driver within the Hive. Concept of
session handles is implemented by driver
12
Hive Components:
The steps include:
[Link] the Query from UI
[Link] a plan from the driver tasks DAG
stages
[Link] metadata request from the meta store
[Link] metadata from the compiler
[Link] the plan back to the driver
[Link] plan in the execution engine
[Link] results for the appropriate user
query
[Link] results bi-directionally
[Link] engine processing in HDFS with
the map-reduce and fetch results from the
data nodes created by the job tracker. it acts
as a connector between Hive and Hadoop.
Hive Metastore:
13
Metastore service
Metastore database
Hive driver and metastore interface run on different JVM
• Primitive Data Types:
HIVE DATA TYPES 14
Numeric Data types - Data types like integral, float, decimal
String Data type - Data types like char, string
Date/ Time Data type - Data types like timestamp, date, interval
Boolean and binary
• Complex Data Types:
Arrays - A collection of the same entities.
The syntax is: array<data_type>
Maps - A collection of key-value pairs
syntax is map<primitive_type, data_type>
Structs - A collection of complex data with comments.
Syntax: struct<col_name : data_type [COMMENT col_comment],…..>
Units - A collection of heterogeneous data types. Syntax: uniontype<data_type, data_type,..>
Table 1: Primitive, Sting, Date/time datatypes
HIVE DATA TYPES :
• Hive Defines various primitive,
complex, string, date/time,
collection data types and file
formats.
• For handling & Storing different
data formats.
• Table 1: Gives primitive, string ,
date/time and complex Hive data
types & its description.
15
16
Hive Data Model:
Data in Hive can be categorized into three types on the granular level:
Tables:
Tables in Hive are the same as the tables present in a Relational Database. You can perform filter,
project, join and union operations on them. There are two types of tables in Hive:
1. Managed Table:
Command:
CREATE TABLE <table_name> (column1 data_type, column2 data_type);
LOAD DATA INPATH <HDFS_file_location> INTO table managed_table;
2. External Table:
Command:
CREATE EXTERNAL TABLE <table_name> (column1 data_type, column2 data_type) LOCATION
‘<table_hive_location>’;
LOAD DATA INPATH ‘<HDFS_file_location>’ INTO TABLE <table_name>;
For external table, Hive is not responsible for managing the data.
HIVE DATA MODEL: 17
1. Database – namespace for tables
[Link] – set of records that have similar schema. Tables in Hive are created the same way it is done in RDBMS
[Link] - Here, tables are organized into partitions for grouping similar types of data based on the partition key
[Link] - Data present in partitions can be further divided into buckets for efficient querying (uses hash function to
segregate data and to determine cluster/bucket to which the record should be placed)
Partitions:
Command:
CREATE TABLE table_name (column1 data_type, column2 data_type) PARTITIONED BY (partition1 data_type, partition2
data_type,….);
Hive organizes tables into partitions for grouping similar type of data together based on a column or partition key. Each Table
can have one or more partition keys to identify a particular partition. This allows us to have a faster query on slices of the
data.
Buckets:
Commands:
CREATE TABLE table_name PARTITIONED BY (partition1 data_type, partition2 data_type,….) CLUSTERED BY
(column_name1, column_name2, …) SORTED BY (column_name [ASC|DESC], …)] INTO num_buckets BUCKETS;
18
19
20
DIFFERENT MODES OF HIVE 21
• Next, let us move on to understand the modes Hive operates in. Hive operates in two modes depending
on the number and size of data nodes. They are:
1. Local Mode - Used when Hadoop has one data node, and the amount of data is small. Here, the
processing will be very fast on smaller datasets, which are present in local machines.
2. Mapreduce Mode - Used when the data in Hadoop is spread across multiple data nodes. Processing
large datasets can be more efficient using this mode.
22
FACEBOOK – USE CASE
• Objective – To store & process large amount of data.
• In 2007 , FB used “Hadoop” – To store and process the generated data. ETL (Via Python)
• But In Hadoop – Programming became difficult, as Hadoop default based is map reduce (java based), to
execute small query larger codes need to be written (Structured Data type).
• As a solution, Hive was developed by FB.
• FB Stats: RDBMS was not suitable
• Hadoop usage Started
• 500 B/Day
• 70k Queries /Day
• 300 Million Photos /Day
HIVE QL:
23
• HiveQL: Querying the large dataset which reside in the HDFS environment.
• HiveQL script commands enable data definition, data manipulation and query processing.
• Supports a large base of SQL users who are acquainted with SQL to “Extract information from
data warehouse”
• SQL tasks: DDL statements
• Create/drop/alter database
• Create/drop/truncate table
• Alter Table/partition/Column
• Create/drop/Alter View
• Create/drop/index
• Show
• Describe
• DML statements
• Loading files into table
• Inserting data into hive table from queries
Hive - Create Database 24
• In Hive, the database is considered as a catalog or namespace of tables. So, we can maintain
multiple tables within a database where a unique name is assigned to each table. Hive also
provides a default database with a name default.
• check the list of existing databases, follow the below command: -
hive> show databases;
• Let's create a new database by using the following command: -
hive> create database demo;
Hive - Create Database
25
• Let's check the existence of a newly created database.
hive> show databases;
• Each database must contain a unique name. If we create two databases with the same name, the
following error generates: -
• If we want to suppress the warning generated by Hive on creating the database with the same name,
follow the below command: -
hive> create a database if not exists demo;
Hive - Drop Database
26
•check the list of existing databases by using the following command: -
hive> show databases;
• drop the database by using the following command.
hive> drop database demo;
Let's check whether the database is dropped or not.
hive> show databases;
• If we try to drop the database that doesn't exist, the following error generates:
27
• if we want to suppress the warning generated by Hive on creating the database with the same name, follow the below
command:-
hive> drop database if exists demo;
• In Hive, it is not allowed to drop the database that contains the tables directly. In such a case, we can drop
the database either by dropping tables first or use Cascade keyword with the command.
hive> drop database if exists demo cascade;
Hive - Create Table
28
• In Hive, we can create a table by using the conventions similar to the SQL. It supports a wide range of flexibility where the
data files for tables are stored. It provides two types of table: -
• create an internal/managed table by using the following command:-
hive> create table [Link] (Id int, Name string , Salary float) row format delimited fields terminated by ',' ;
• The metadata of the created table by using the following command:-
Hive - Create Table
29
• Let's see the result when we try to
create the existing table again.
• In such a case, the exception occurs. If
we want to ignore this type of exception,
we can use if not exists command while
creating the table.
• While creating a table, we can add the
comments to the columns and can also define
the table properties.
30
hive> describe new_employee;
• Hive allows creating a new table by using the schema of an existing table.
Hive - Load Data
Once the internal table has been created, the next step is to load the data into it. So, in Hive, we can easily load data31
from any file to the database.
• Let's load the data of the file into the database by using the following command: -
load data local inpat '/home/codegyani/hive/emp_details1' into table [Link];
• Here, emp_details is the file
name that contains the data.
• If we want to add more data into the
current database, execute the same
query again by just updating the new
file name.
External Table
32
To create an external table, follow the below steps: -
• Let's create a directory on HDFS by using the following command: -
hdfs dfs -mkdir /HiveDirectory
• Now, store the file on the created directory.
hdfs dfs -put hive/emp_details /HiveDirectory
• Let's create an external
table using the following
command: -
• Now, we can use the
following command
to retrieve the data: -
• In Hive, if we try to load unmatched data (i.e., one or more column data doesn't match the data type of 33
specified table columns), it will not throw any exception. However, it stores the Null value at the position of
unmatched tuple.
• Let's add one more file to the
current table. This file contains
the unmatched data.
• load the data into the table.
34
• Null values at the position of
unmatched data.
Hive - Drop Table
Hive facilitates us to drop a table by using the SQL drop table command. Let's follow the below steps to drop the
table from the database.
• check the list of existing
databases by using the following
command: -
• select the database from
which we want to delete the
table
• check the list of existing tables
in the corresponding database. 35
• drop the table by using the
following command:
• check whether the table is
dropped or not.
• As we can see, the table
new_employee is not present in
the list. Hence, the table is
dropped successfully.
Hive - Alter Table 36
In Hive, we can perform modifications in the existing table like changing the table name, column name,
comments, and table properties. It provides SQL like commands to alter the table.
• change the name of the
table by using the
following command
• In Hive, we can add one or more columns in an existing table by
using the following signature:
Alter table employee_data add columns (age int);
• change the name of the column by using the following command: -
Alter table employee_data change name first_name string;
• alter table employee_data replace columns( id string, first_name string, age int);
Partitioning in Hive 37
The partitioning in Hive can be executed in two ways -
• Static partitioning
• Dynamic partitioning
• In static or manual partitioning, it is required to pass the values of partitioned columns manually
while loading the data into the table. Hence, the data file doesn't contain the partitioned columns.
Example of Static Partitioning
First, select the database in which we want to create a table.
hive> use test;
• Create the table and provide
the partitioned columns by
using the following command
• Let's retrieve the 38
information
associated with the
table.
• Load the data into the table and
pass the values(key) of partition
columns with it by using the
following command: -
• Here, we are partitioning
the students of an
institute based on
courses.
• Load the data of another file
into the same table and pass
the values of partition
columns with it by using the
following command: -
• In the following screenshot, we 39
can see that the table student is
divided into two categories.
• Let's retrieve the entire data of
the table by using the
following command
• try to retrieve the data based on 40
partitioned columns by using the
following command: -
• In this case, we are not examining
the entire data. Hence, this
approach improves query
response time.
Dynamic Partitioning
In dynamic partitioning, the values of partitioned columns exist within the table. So, it is not required to pass
the values of partitioned columns manually.
• First, select the database in which we want to create a table.
• Enable the dynamic partition by using the following
commands: -
hive> set [Link]=true;
hive> set [Link]=nonstrict;
41
•Create a dummy table to store the data.
• Now, load the data into
dummy table.
• Create a partition table
by using the following
command
• Now, insert the data of hive> insert into student_part
dummy table into the partition(course)
partition table. select id, name, age, institute, course
from stud_demo;
42
• In the following
screenshot, we can see
that the table
student_part is divided
into two categories.
• Let's retrieve the
entire data of the
table by using the
following command
43
• Now, try to retrieve the data based
on partitioned columns by using
the following command:
• Let's also retrieve the data of
another partitioned dataset by
using the following command
44
STATIC V/S DYNAMIC PARTITIONING
Bucketing in Hive
45
• The bucketing in Hive is a data organizing technique. It is similar to partitioning in Hive with an
added functionality that it divides large datasets into more manageable parts known as buckets.
• First, select the database
in which we want to
create a table.
• Create a dummy table
to store the data.
• Now, load the data into the
table.
Bucketing in Hive
• Enable the bucketing by 46
using the following hive> set [Link] = true;
command
• Create a bucketing table
by using the following
command:
• Now, insert the data of
dummy table into the
bucketed table.
hive> insert overwrite table
emp_bucket select * from
emp_demo;
• Here, we can see that
the data is divided into
three buckets.
•Let's retrieve the data of bucket 0. 47
According to hash function :
6%3=0
3%3=0
So, these columns stored in bucket 0.
• Let's retrieve the data of bucket 1.
According to hash function :
7%3=1
4%3=1
1%3=1
So, these columns stored in bucket 1.
• Let's retrieve the data of bucket 2.
According to hash function :
8%3=2
5%3=2
2%3=2
So, these columns stored in bucket 2.
Arithmetic Operators in Hive Relational Operators in Hive 48
hive> select id, name, salary + 50 from employee; hive> select * from employee where salary >= 25000;
hive> select id, name, salary - 50 from employee; hive> select * from employee where salary < 25000;
hive> select id, name, (salary * 10) /100 from employee;
49
50
Limitation of Hive:
• Hive Does not provide update, alter and
deletion of records in the database.
• Not developed for “Unstructured Data”
• Not designed for “Real Time Queries”.
• Performs the partition always from the last
column.
• OLTP: No. Because Hive does not provide
insert and update at the row level, it is
not suitable for the OLTP system.
51
HIVE V/S RDBMS
[Link] Characteristics HIVE RDBMS
1 Record Leve Queries No Update & Delete Insert, Update & Delete
2 Transaction Support No Yes
3 Latency Minutes or more In fraction of seconds
4 Data Size Petabytes Terabytes
5 Data Per Query Petabytes Gigabytes
6 Query Language HiveQL SQL
7 Support JDB/ODBC Limited Full