0% found this document useful (0 votes)
18 views50 pages

Apache Hive Overview and Features

Apache Hive is an open-source data warehouse system built on Hadoop that allows for SQL-like data storage and analysis. It features a central metadata repository called the Hive Metastore, supports various client applications, and is designed for scalability and extensibility. Hive operates in both local and MapReduce modes, enabling efficient processing of large datasets across distributed storage systems.

Uploaded by

ritztheratz08
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
18 views50 pages

Apache Hive Overview and Features

Apache Hive is an open-source data warehouse system built on Hadoop that allows for SQL-like data storage and analysis. It features a central metadata repository called the Hive Metastore, supports various client applications, and is designed for scalability and extensibility. Hive operates in both local and MapReduce modes, enabling efficient processing of large datasets across distributed storage systems.

Uploaded by

ritztheratz08
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

APACHE HIVE

By: Dr. Nandini Kumari


2

Apache Hive is an open-source software utility built on


top of Hadoop to provide easy data storage and analysis
in style of SQL.

It is a distributed, fault-tolerant data warehouse system


HIVE that enables analytics at a massive scale and facilitates
reading, writing, and managing petabytes of data
residing in distributed storage using SQL.

Hive is built on top of Apache Hadoop and supports


storage on S3, adls, gfs etc though HDFS.
WHAT IS HIVE?
3
4

Hive Metastore(HMS) • The Hive Metastore (HMS) is a central


repository of metadata for Hive tables and
partitions in a relational database, and
provides clients (including Hive, Impala and
Spark) access to this information using the
metastore service API.

• It has become a building block for data lakes


that utilize the diverse world of open-source
software, such as Apache Spark and Presto.
In fact, a whole ecosystem of tools, open-
source and otherwise, are built around the
Hive Metastore.
5

Features of HIVE •Hive supports any client application written in


Java, PHP, Python, C++ or Ruby by exposing
its Thrift server. (You can use these client – side
languages embedded with SQL for accessing a
database such as DB2, etc.).

•As the metadata information of Hive is stored in


an RDBMS, it significantly reduces the time to
perform semantic checks during query execution.

•Useful for people who aren’t from a


programming background as it eliminates the
need to write complex MapReduce program.

•Extensible and scalable to cope up with the


growing volume and variety of data, without
affecting performance of the system.

•It is as an efficient ETL (Extract, Transform, Load)


tool.
6

Apache Hive – Applications


7

Apache Hive Use Case – Facebook:


$sudo gedit [Link]
Hive Architecture: 9

Hive Architecture can be categorized into the following


components:

•Hive Clients: Hive supports application written in


many languages like Java, C++, Python etc. using JDBC,
Thrift and ODBC drivers. Hence one can always write
hive client application written in a language of their
choice.

•Hive Services: Apache Hive provides various services


like CLI, Web Interface etc. to perform queries.

•Processing framework and Resource


Management: Internally, Hive uses Hadoop
MapReduce framework as execution engine to execute
the queries.

•Distributed Storage: As Hive is installed on top of


Hadoop, it uses the underlying HDFS for the distributed
storage.
10
11

Hive Components: Shell:-


Provides User interface provide an interface
between user and hive. It enables user to submit
queries and other operations to the system.
Metastore –
All the structured data or information of the
different tables and partition in the warehouse
containing attributes and attributes level
information are stored in the metastore.
Execution Engine –
Execution of the execution plan made by the
compiler is performed in the execution engine.
Compiler –
Queries are parses, semantic analysis on the
different query blocks and query expression is
done by the compiler.
Drivers:-
Queries of the user after the interface are
received by the driver within the Hive. Concept of
session handles is implemented by driver
12

Hive Components:
The steps include:
[Link] the Query from UI
[Link] a plan from the driver tasks DAG
stages
[Link] metadata request from the meta store
[Link] metadata from the compiler
[Link] the plan back to the driver
[Link] plan in the execution engine
[Link] results for the appropriate user
query
[Link] results bi-directionally
[Link] engine processing in HDFS with
the map-reduce and fetch results from the
data nodes created by the job tracker. it acts
as a connector between Hive and Hadoop.
Hive Metastore:
13

Metastore service

Metastore database

Hive driver and metastore interface run on different JVM


• Primitive Data Types:
HIVE DATA TYPES 14

Numeric Data types - Data types like integral, float, decimal


String Data type - Data types like char, string
Date/ Time Data type - Data types like timestamp, date, interval
Boolean and binary

• Complex Data Types:

Arrays - A collection of the same entities.

The syntax is: array<data_type>

Maps - A collection of key-value pairs

syntax is map<primitive_type, data_type>

Structs - A collection of complex data with comments.

Syntax: struct<col_name : data_type [COMMENT col_comment],…..>

Units - A collection of heterogeneous data types. Syntax: uniontype<data_type, data_type,..>


Table 1: Primitive, Sting, Date/time datatypes

HIVE DATA TYPES :

• Hive Defines various primitive,


complex, string, date/time,
collection data types and file
formats.
• For handling & Storing different
data formats.
• Table 1: Gives primitive, string ,
date/time and complex Hive data
types & its description.

15
16

Hive Data Model:


Data in Hive can be categorized into three types on the granular level:

Tables:
Tables in Hive are the same as the tables present in a Relational Database. You can perform filter,
project, join and union operations on them. There are two types of tables in Hive:

1. Managed Table:
Command:
CREATE TABLE <table_name> (column1 data_type, column2 data_type);
LOAD DATA INPATH <HDFS_file_location> INTO table managed_table;

2. External Table:
Command:
CREATE EXTERNAL TABLE <table_name> (column1 data_type, column2 data_type) LOCATION
‘<table_hive_location>’;
LOAD DATA INPATH ‘<HDFS_file_location>’ INTO TABLE <table_name>;
For external table, Hive is not responsible for managing the data.
HIVE DATA MODEL: 17

1. Database – namespace for tables


[Link] – set of records that have similar schema. Tables in Hive are created the same way it is done in RDBMS
[Link] - Here, tables are organized into partitions for grouping similar types of data based on the partition key
[Link] - Data present in partitions can be further divided into buckets for efficient querying (uses hash function to
segregate data and to determine cluster/bucket to which the record should be placed)

Partitions:
Command:
CREATE TABLE table_name (column1 data_type, column2 data_type) PARTITIONED BY (partition1 data_type, partition2
data_type,….);
Hive organizes tables into partitions for grouping similar type of data together based on a column or partition key. Each Table
can have one or more partition keys to identify a particular partition. This allows us to have a faster query on slices of the
data.

Buckets:
Commands:
CREATE TABLE table_name PARTITIONED BY (partition1 data_type, partition2 data_type,….) CLUSTERED BY
(column_name1, column_name2, …) SORTED BY (column_name [ASC|DESC], …)] INTO num_buckets BUCKETS;
18
19
20
DIFFERENT MODES OF HIVE 21

• Next, let us move on to understand the modes Hive operates in. Hive operates in two modes depending
on the number and size of data nodes. They are:

1. Local Mode - Used when Hadoop has one data node, and the amount of data is small. Here, the
processing will be very fast on smaller datasets, which are present in local machines.
2. Mapreduce Mode - Used when the data in Hadoop is spread across multiple data nodes. Processing
large datasets can be more efficient using this mode.
22

FACEBOOK – USE CASE


• Objective – To store & process large amount of data.
• In 2007 , FB used “Hadoop” – To store and process the generated data. ETL (Via Python)
• But In Hadoop – Programming became difficult, as Hadoop default based is map reduce (java based), to
execute small query larger codes need to be written (Structured Data type).
• As a solution, Hive was developed by FB.

• FB Stats: RDBMS was not suitable


• Hadoop usage Started
• 500 B/Day
• 70k Queries /Day
• 300 Million Photos /Day
HIVE QL:
23

• HiveQL: Querying the large dataset which reside in the HDFS environment.
• HiveQL script commands enable data definition, data manipulation and query processing.
• Supports a large base of SQL users who are acquainted with SQL to “Extract information from
data warehouse”
• SQL tasks: DDL statements
• Create/drop/alter database
• Create/drop/truncate table
• Alter Table/partition/Column
• Create/drop/Alter View
• Create/drop/index
• Show
• Describe
• DML statements
• Loading files into table
• Inserting data into hive table from queries
Hive - Create Database 24
• In Hive, the database is considered as a catalog or namespace of tables. So, we can maintain
multiple tables within a database where a unique name is assigned to each table. Hive also
provides a default database with a name default.
• check the list of existing databases, follow the below command: -
hive> show databases;

• Let's create a new database by using the following command: -

hive> create database demo;


Hive - Create Database
25

• Let's check the existence of a newly created database.


hive> show databases;

• Each database must contain a unique name. If we create two databases with the same name, the
following error generates: -

• If we want to suppress the warning generated by Hive on creating the database with the same name,
follow the below command: -
hive> create a database if not exists demo;
Hive - Drop Database
26
•check the list of existing databases by using the following command: -
hive> show databases;

• drop the database by using the following command.


hive> drop database demo;

Let's check whether the database is dropped or not.


hive> show databases;
• If we try to drop the database that doesn't exist, the following error generates:
27

• if we want to suppress the warning generated by Hive on creating the database with the same name, follow the below
command:-
hive> drop database if exists demo;

• In Hive, it is not allowed to drop the database that contains the tables directly. In such a case, we can drop
the database either by dropping tables first or use Cascade keyword with the command.

hive> drop database if exists demo cascade;


Hive - Create Table
28

• In Hive, we can create a table by using the conventions similar to the SQL. It supports a wide range of flexibility where the
data files for tables are stored. It provides two types of table: -
• create an internal/managed table by using the following command:-
hive> create table [Link] (Id int, Name string , Salary float) row format delimited fields terminated by ',' ;

• The metadata of the created table by using the following command:-


Hive - Create Table
29

• Let's see the result when we try to


create the existing table again.

• In such a case, the exception occurs. If


we want to ignore this type of exception,
we can use if not exists command while
creating the table.

• While creating a table, we can add the


comments to the columns and can also define
the table properties.
30
hive> describe new_employee;

• Hive allows creating a new table by using the schema of an existing table.
Hive - Load Data
Once the internal table has been created, the next step is to load the data into it. So, in Hive, we can easily load data31
from any file to the database.
• Let's load the data of the file into the database by using the following command: -
load data local inpat '/home/codegyani/hive/emp_details1' into table [Link];

• Here, emp_details is the file


name that contains the data.

• If we want to add more data into the


current database, execute the same
query again by just updating the new
file name.
External Table
32
To create an external table, follow the below steps: -
• Let's create a directory on HDFS by using the following command: -
hdfs dfs -mkdir /HiveDirectory
• Now, store the file on the created directory.
hdfs dfs -put hive/emp_details /HiveDirectory

• Let's create an external


table using the following
command: -

• Now, we can use the


following command
to retrieve the data: -
• In Hive, if we try to load unmatched data (i.e., one or more column data doesn't match the data type of 33

specified table columns), it will not throw any exception. However, it stores the Null value at the position of
unmatched tuple.

• Let's add one more file to the


current table. This file contains
the unmatched data.

• load the data into the table.


34

• Null values at the position of


unmatched data.

Hive - Drop Table


Hive facilitates us to drop a table by using the SQL drop table command. Let's follow the below steps to drop the
table from the database.

• check the list of existing


databases by using the following
command: -

• select the database from


which we want to delete the
table
• check the list of existing tables
in the corresponding database. 35

• drop the table by using the


following command:

• check whether the table is


dropped or not.

• As we can see, the table


new_employee is not present in
the list. Hence, the table is
dropped successfully.
Hive - Alter Table 36

In Hive, we can perform modifications in the existing table like changing the table name, column name,
comments, and table properties. It provides SQL like commands to alter the table.

• change the name of the


table by using the
following command

• In Hive, we can add one or more columns in an existing table by


using the following signature:
Alter table employee_data add columns (age int);

• change the name of the column by using the following command: -

Alter table employee_data change name first_name string;

• alter table employee_data replace columns( id string, first_name string, age int);
Partitioning in Hive 37

The partitioning in Hive can be executed in two ways -

• Static partitioning
• Dynamic partitioning

• In static or manual partitioning, it is required to pass the values of partitioned columns manually
while loading the data into the table. Hence, the data file doesn't contain the partitioned columns.

Example of Static Partitioning

First, select the database in which we want to create a table.

hive> use test;

• Create the table and provide


the partitioned columns by
using the following command
• Let's retrieve the 38

information
associated with the
table.

• Load the data into the table and


pass the values(key) of partition
columns with it by using the
following command: -
• Here, we are partitioning
the students of an
institute based on
courses.

• Load the data of another file


into the same table and pass
the values of partition
columns with it by using the
following command: -
• In the following screenshot, we 39
can see that the table student is
divided into two categories.

• Let's retrieve the entire data of


the table by using the
following command
• try to retrieve the data based on 40
partitioned columns by using the
following command: -

• In this case, we are not examining


the entire data. Hence, this
approach improves query
response time.

Dynamic Partitioning
In dynamic partitioning, the values of partitioned columns exist within the table. So, it is not required to pass
the values of partitioned columns manually.

• First, select the database in which we want to create a table.

• Enable the dynamic partition by using the following


commands: -

hive> set [Link]=true;


hive> set [Link]=nonstrict;
41
•Create a dummy table to store the data.

• Now, load the data into


dummy table.

• Create a partition table


by using the following
command

• Now, insert the data of hive> insert into student_part


dummy table into the partition(course)
partition table. select id, name, age, institute, course
from stud_demo;
42
• In the following
screenshot, we can see
that the table
student_part is divided
into two categories.

• Let's retrieve the


entire data of the
table by using the
following command
43
• Now, try to retrieve the data based
on partitioned columns by using
the following command:

• Let's also retrieve the data of


another partitioned dataset by
using the following command
44

STATIC V/S DYNAMIC PARTITIONING


Bucketing in Hive
45
• The bucketing in Hive is a data organizing technique. It is similar to partitioning in Hive with an
added functionality that it divides large datasets into more manageable parts known as buckets.

• First, select the database


in which we want to
create a table.

• Create a dummy table


to store the data.

• Now, load the data into the


table.
Bucketing in Hive
• Enable the bucketing by 46

using the following hive> set [Link] = true;


command

• Create a bucketing table


by using the following
command:

• Now, insert the data of


dummy table into the
bucketed table.

hive> insert overwrite table


emp_bucket select * from
emp_demo;

• Here, we can see that


the data is divided into
three buckets.
•Let's retrieve the data of bucket 0. 47

According to hash function :


6%3=0
3%3=0
So, these columns stored in bucket 0.

• Let's retrieve the data of bucket 1.


According to hash function :
7%3=1
4%3=1
1%3=1
So, these columns stored in bucket 1.

• Let's retrieve the data of bucket 2.

According to hash function :


8%3=2
5%3=2
2%3=2
So, these columns stored in bucket 2.
Arithmetic Operators in Hive Relational Operators in Hive 48

hive> select id, name, salary + 50 from employee; hive> select * from employee where salary >= 25000;
hive> select id, name, salary - 50 from employee; hive> select * from employee where salary < 25000;

hive> select id, name, (salary * 10) /100 from employee;


49
50

Limitation of Hive:
• Hive Does not provide update, alter and
deletion of records in the database.
• Not developed for “Unstructured Data”
• Not designed for “Real Time Queries”.
• Performs the partition always from the last
column.
• OLTP: No. Because Hive does not provide
insert and update at the row level, it is
not suitable for the OLTP system.
51

HIVE V/S RDBMS


[Link] Characteristics HIVE RDBMS
1 Record Leve Queries No Update & Delete Insert, Update & Delete
2 Transaction Support No Yes
3 Latency Minutes or more In fraction of seconds
4 Data Size Petabytes Terabytes
5 Data Per Query Petabytes Gigabytes
6 Query Language HiveQL SQL
7 Support JDB/ODBC Limited Full

You might also like