Introduction to Hive
Big data and Hadoop
• The term ‘Big Data’ is used for
collections of large datasets that
include huge volume, high velocity, and
a variety of data that is increasing day
by day.
• Using traditional data management
systems, it is diffi cult to process Big
Data. Therefore, the Apache Software
Foundation introduced a framework
called Hadoop to solve Big Data
management and processing challenges.
Hadoop
• Hadoop is an open-source framework to store
and process Big Data in a distributed
environment. It contains two modules, one is
MapReduce and another is Hadoop Distributed
File System (HDFS).
– MapReduce: It is a parallel programming
model for processing large amounts of
structured, semi- structured, and
unstructured data on large clusters of
commodity hardware.
– HDFS: Hadoop Dist ribut ed File Syst em is a
part of Hadoop framework, used to store and
process the datasets. It provides a fault-
tolerant file system to run on commodity
Hadoop Tools
• The Hadoop ecosystem contains diff erent
sub- projects (tools) such as Sqoop, Pig,
and Hive that are used to help Hadoop
modules.
– Sqoop: It is used to import and
export data to and f rom HDFS and
RDBMS.
– Pig: It is a procedural language
platform used to develop a script for
MapReduce operations.
– Hive: It is a platform used to
develop SQL type scripts to do
Ways t o execut e MapReduce
• The traditional approach using Java
MapReduce program for structured,
semi-structured, and unstructured data.
• The scripting approach for MapReduce
to process structured and semi
structured data using Pig.
• The Hive Query Language (HiveQL or
HQL) for MapReduce to process
structured data using Hive.
Challenges that Data Analysts faced
Data Explosion
- TBs of data generated everyday
Solution – HDFS to store data and Hadoop
Map-Reduce framework to parallelize
processing of Data
What is the catch?
-Hadoop Map Reduce is Java intensive
-Thinking in Map Reduce paradigm can get
tricky
… Enter Hive!
Hive Key Principles
Data Warehouse :DWs are
central repositories of
integrated data from one
or more disparate sources
- A system used for
reporting and data
analysis.
ETL – process of extracting
data from source and
bringing it into data
warehouse. Extract
Transform and Load.
What is hive?
• Hive is a dat a warehouse inf rastruct ure
t ool t o process structured data in
Hadoop.
• It resides on top of Hadoop to
summarize Big Data, and makes
querying and analyzing easy.
• Initially Hive was developed by Facebook,
later the Apache Software Foundation
took it up and developed it f urt her as an
open source under t he name Apache Hive.
• It is used by diff erent companies. For
example, Amazon uses it in Amazon
HiveQL to MapReduce
Hive Framework
Data Analyst
SELECT COUNT(1) FROM Sales;
rowcount, N
rowcount,1 rowcount,1
Sales: Hive table
MR JOB Instance
Hive Data Model
Data in Hive organized into :
Tables
Partitions
Buckets
Hive Data Model Contd.
Tables
- Analogous to relational tables
-Each table has a corresponding directory
in HDFS
-Data serialized and stored as files within
that directory
- Hive has default serialization built in which
supports compression and lazy deserialization
- Users can specify custom serialization –
deserialization schemes (SerDe’s)
Hive Data Model Contd.
Partitions
-Each table can be broken into partitions
-Partitions determine distribution of data within
subdirectories
Example -
CREATE_TABLE Sales (sale_id INT, amount
FLOAT)
PARTITIONED BY (country STRING, year INT,
month INT)
So each partition will be split out into different
folders like
Sales/country=US/year=2012/month=12
Hierarchy of Hive Partitions
/hivebase/Sales
/country=US
/country=CANADA
/year=2012 /year=2012
/year=2015
/year=2014
/month=12
/month=11 /month=11
File File File
Hive Data Model Contd.
Buckets
-Data in each partition divided into buckets
-Based on a hash function of the column
-H(column) mod NumBuckets = bucket
number
-Each bucket is stored as a file in partition
directory
Hive is not-
• A relational database
• A design for OnLine Transaction
Processing (OLTP)
• A language for real-time queries and
row-level updates
Features of Hive
• It stores schema in a database and
processed data into HDFS.
• It is designed for OLAP.
• It provides SQL t ype language for
querying called HiveQL or HQL.
• It is familiar, fast, scalable, and
extensible.
Hive Architecture
Hive Architecture
• User Interface
– Hive is a data warehouse infrastructure
software that can create interaction
between user and HDFS. The user
interf aces t hat Hive support s are Hive
Web UI, Hive command line, and Hive
HD Insight (In Windows server).
• Meta Store
– Hive chooses respective database servers to
store the schema or Metadata of
tables, databases, columns in a table,
their data types, and HDFS mapping.
Hive Architecture
• HiveQL Process Engine
– HiveQL is similar to SQL for querying on schema info on
the Metastore. It is one of the replacements of
traditional approach for MapReduce program. Instead
of writing MapReduce program in Java, we can write a
query for MapReduce job and process it.
• Execution Engine
– The conjunction part of HiveQL process Engine and
MapReduce is Hive Execution Engine. Execution engine
executes the query and generates results as same as
MapReduce results. It uses the flavor of MapReduce.
• HDFS or HBASE
– Hadoop distributed file system or HBASE are the data
storage techniques to store data into file system.
Working of Hive
Execute Job
Execute Plan
7.1 execute metadata
operations
Get Plan
5. Send Plan
Execution of Hive
1. Execute Query
The Hive int erface such as Command Line or Web UI
sends query to Driver (any database driver such as
JDBC, ODBC, etc.) to execute.
2. Get Plan
The driver takes the help of query compiler that
parses the query to check the syntax and query plan or
the requirement of query.
3. Get Metadata
The compiler sends metadata request to
Metastore (any database).
4. Send Metadata
Metastore sends metadata as a response to the
compiler.
Execution of Hive
5 Send Plan
The compiler checks the requirement and resends the
plan to the driver. Up to here, the parsing and
compiling of a query is complete.
6 Execute Plan
The driver sends the execute plan to the execution
engine.
7 Execute Job
Internally, the process of execution job is a
MapReduce job. The execution engine sends the job
to JobTracker, which is in Name node and it assigns this
job to TaskTracker, which is in Data node. Here, the
query executes MapReduce job.
Execution of Hive
7.1 Metadata Ops
Meanwhile in execution, the execution
engine can execute metadata operations
with Metastore.
8 Fetch Result
The execution engine receives the results
from Data nodes.
9 Send Results
The execution engine sends those resultant
values to the driver.
10 Send Results
The driver sends the results to Hive Interfaces.
HiveQL
DDL :
CREATE DATABASE
CREATE TABLE
ALTER TABLE
SHOW TABLE
DESCRIBE
DML:
LOAD TABLE
INSERT
QUERY:
SELECT
GROUP BY
JOIN
MULTI TABLE INSERT
Hive SerDe
SELECT Query
Hive built in Serde: Record
Avro, ORC, Regex etc Reader
Can use Custom Hive Table
Deserialize
SerDe’s (e.g. for
unstructured data
like audio/video
data, semistructured Hive Row Object
XML data) End User
Object Inspector Map
Fields
Data Hierarchy
Hive is organised hierarchically into:
Databases: namespaces that separate tables and other objects
Tables: homogeneous units of data with the same schema
Analogous to tables in an RDBMS
Partitions: determine how the data is stored
Allow efficient access to subsets of the data
Buckets/clusters
For subsampling within a partition
Join optimization
HiveQL
HiveQL / HQL provides the basic SQL-like
operations:
Select columns using SELECT
Filter rows using WHERE
JOIN between tables
Evaluate aggregates using GROUP BY
Store query results into another table
Download results to a local directory (i.e., export from HDFS)
Manage tables and queries with CREATE, DROP, and ALTER
Primitive Data Types
Type Comments
TINYINT, SMALLINT, INT, BIGINT 1, 2, 4 and 8-byte integers
BOOLEAN TRUE/FALSE
FLOAT, DOUBLE Single and double precision real numbers
STRING Character string
TIMESTAMP Unix-epoch offset or datetime string
DECIMAL Arbitrary-precision decimal
BINARY Opaque; ignore these bytes
Complex Data Types
Type Comments
STRUCT A collection of elements
If S is of type STRUCT {a INT, b INT}:
S.a returns element a
MAP Key-value tuple
If M is a map from 'group' to GID:
M['group'] returns value of GID
ARRAY Indexed list
If A is an array of elements ['a','b','c']:
A[0] returns 'a'
Hive Warehouse
Hive tables are stored in the Hive
“warehouse”
Default HDFS location: /user/hive/warehouse
Tables are stored as sub-directories in the
warehouse directory
Partitions are subdirectories of tables
External tables are supported in Hive
The actual data is stored in flat files
Table types
Hive deals with two types of table structures
like Internal and External tables depending
on the loading and design of schema in Hive.
Create Table Syntax
CREATE TABLE table_name
(col1 data_type,
col2 data_type,
col3 data_type,
col4 datatype )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS format_type;
Loading And Inserting Data: Summary
Use this For this purpose
LOAD Load data from a file or directory
INSERT Load data from a query
• One partition at a time
• Use multiple INSERTs to insert into
multiple partitions in the one query
CREATE TABLE AS (CTAS) Insert data while creating a table
Add/modify external file Load new data into external table
Sample Select Clauses
Select from a single table
SELECT *
FROM sales
WHERE amount > 10 AND
region = "US";
Select from a partitioned table
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
AND
page_views.date <= '2008-03-31'
Relational Operators
ALL and DISTINCT
Specify whether duplicate rows should be returned
ALL is the default (all matching rows are returned)
DISTINCT removes duplicate rows from the result set
WHERE
Filters by expression
Does not support IN, EXISTS or sub-queries in the WHERE clause
LIMIT
Indicates the number of rows to be returned
Relational Operators
GROUP BY
Group data by column values
Select statement can only include columns included in the
GROUP BY clause
ORDER BY / SORT BY
ORDER BY performs total ordering
Slow, poor performance
SORT BY performs partial ordering
Sorts output from each reducer
Simple Table
CREATE TABLE page_view
(viewTime INT,
userid BIGINT,
page_url STRING,
referrer_url STRING,
ip STRING COMMENT 'IP Address of the User' )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
External Table
CREATE EXTERNAL TABLE page_view_stg
(viewTime INT,
userid BIGINT,
page_url STRING,
referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/staging/page_view';
More About Tables
CREATE TABLE
LOAD: file moved into Hive’s data warehouse directory
DROP: both metadata and data deleted
CREATE EXTERNAL TABLE
LOAD: no files moved
DROP: only metadata deleted
Use this when sharing with other Hadoop applications, or when you want to use
multiple schemas on the same data
Partitioning
Can make some queries faster
Divide data based on partition column
Use PARTITION BY clause when creating table
Use PARTITION clause when loading data
SHOW PARTITIONS will show a table’s
partitions
Bucketing
Can speed up queries that involve sampling
the data
Sampling works without bucketing, but Hive has to scan the entire dataset
Use CLUSTERED BY when creating table
For sorted buckets, add SORTED BY
To query a sample of your data, use
TABLESAMPLE
Browsing Tables And Partitions
Command Comments
SHOW TABLES; Show all the tables in the database
SHOW TABLES 'page.*'; Show tables matching the
specification ( uses regex syntax )
SHOW PARTITIONS page_view; Show the partitions of the page_view
table
DESCRIBE page_view; List columns of the table
DESCRIBE EXTENDED page_view; More information on columns (useful
only for debugging )
DESCRIBE page_view List information about a partition
PARTITION (ds='2008-10-31');
Loading Data
Use LOAD DATA to load data from a file or
directory
Will read from HDFS unless LOCAL keyword is specified
Will append data unless OVERWRITE specified
PARTITION required if destination table is partitioned
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-
8_us.txt'
OVERWRITE INTO TABLE page_view
PARTITION (date='2008-06-08', country='US')
Inserting Data During Table Creation
Use AS SELECT in the CREATE TABLE
statement to populate a table as it is created
CREATE TABLE page_view AS
SELECT [Link], [Link], pvs.page_url,
pvs.referrer_url
FROM page_view_stg pvs
WHERE [Link] = 'US';
Loading And Inserting Data: Summary
Use this For this purpose
LOAD Load data from a file or directory
INSERT Load data from a query
• One partition at a time
• Use multiple INSERTs to insert into
multiple partitions in the one query
CREATE TABLE AS (CTAS) Insert data while creating a table
Add/modify external file Load new data into external table
syntax of the SELECT query
SELECT [ALL | DISTINCT] select_expr,
select_expr, ... FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list |
[DISTRIBUTE BY col_list]
[SORT BY col_list]]
[LIMIT number];
hive> SELECT * FROM employee WHERE
salary>30000;
Thank you
This presentation is created using LibreOffice Impress [Link], can be used freely as per GNU General Public
License
Web Resources Blogs [Link]
[Link] h [Link]
ttp://[Link] [Link]
m [Link]
tushar@[Link]