Understanding Hive Map Types

Uploaded by

shritis2004

Introduction to Hive

Big data and Hadoop

• The term ‘Big Data’ is used for collections of large datasets that include huge volume, high velocity, and a variety of data that is increasing day by day.
• Using traditional data management systems, it is difficult to process Big Data. Therefore, the Apache Software Foundation introduced a framework called Hadoop to solve Big Data management and processing challenges.

Hadoop

• Hadoop is an open-source framework for storing and processing Big Data in a distributed environment. Its two core modules are MapReduce and the Hadoop Distributed File System (HDFS).
– MapReduce: a parallel programming model for processing large amounts of structured, semi-structured, and unstructured data on large clusters of commodity hardware.
– HDFS: the part of the Hadoop framework that stores the datasets. It provides a fault-tolerant file system that runs on commodity hardware.

Hadoop Tools

• The Hadoop ecosystem contains different sub-projects (tools), such as Sqoop, Pig, and Hive, that support the Hadoop modules.
– Sqoop: imports and exports data between HDFS and an RDBMS.
– Pig: a procedural language platform for developing scripts for MapReduce operations.
– Hive: a platform for developing SQL-type scripts to do MapReduce operations.

Ways to execute MapReduce

• The traditional approach: a Java MapReduce program for structured, semi-structured, and unstructured data.
• The scripting approach: Pig, to process structured and semi-structured data with MapReduce.
• The Hive Query Language (HiveQL or HQL): Hive, to process structured data with MapReduce.

Challenges that Data Analysts faced

Data explosion
- TBs of data generated every day
Solution: HDFS to store the data and the Hadoop MapReduce framework to parallelize its processing
What is the catch?
- Hadoop MapReduce is Java intensive
- Thinking in the MapReduce paradigm can get tricky
… Enter Hive!

Hive Key Principles

Data warehouse (DW): a central repository of integrated data from one or more disparate sources; a system used for reporting and data analysis.
ETL (Extract, Transform, Load): the process of extracting data from a source, transforming it, and loading it into the data warehouse.

What is Hive?

• Hive is a data warehouse infrastructure tool for processing structured data in Hadoop.
• It resides on top of Hadoop to summarize Big Data, and it makes querying and analysis easy.
• Hive was initially developed by Facebook; the Apache Software Foundation later took it up and developed it further as open source under the name Apache Hive.
• It is used by various companies, Amazon among them.
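
HiveQL to MapReduce

[Figure: Hive framework. A data analyst submits SELECT COUNT(1) FROM Sales; and Hive runs it as a MapReduce job instance over the Sales Hive table. The map tasks emit (rowcount, 1) pairs, which are combined into (rowcount, N).]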
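
A minimal Python sketch of how a query such as SELECT COUNT(1) FROM Sales can be evaluated in MapReduce style. This is an illustration of the idea only, not the plan Hive actually generates; the sample rows, the split into two input chunks, and the function names are all hypothetical.

```python
from collections import defaultdict

def map_rows(rows):
    """Mapper: emit one ("rowcount", 1) pair per input row."""
    for _ in rows:
        yield ("rowcount", 1)

def reduce_counts(pairs):
    """Reducer: sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical Sales rows, split into two chunks as if read by two mappers.
split_1 = [(1, 9.99), (2, 24.50)]
split_2 = [(3, 5.00)]

# Shuffle step: gather the output of every mapper before reducing.
mapped = list(map_rows(split_1)) + list(map_rows(split_2))
result = reduce_counts(mapped)
print(result)  # {'rowcount': 3}
```

Each mapper only counts the rows of its own chunk, which is what lets the work run in parallel; the reducer then adds the partial counts together.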

Hive Data Model

Data in Hive is organized into:
- Tables
- Partitions
- Buckets

Hive Data Model Contd.

Tables
- Analogous to relational tables
- Each table has a corresponding directory in HDFS
- Data is serialized and stored as files within that directory
- Hive has built-in default serialization, which supports compression and lazy deserialization
- Users can specify custom serialization-deserialization schemes (SerDes)

Hive Data Model Contd.

Partitions
- Each table can be broken into partitions
- Partitions determine how data is distributed across subdirectories
Example:
CREATE TABLE Sales (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING, year INT, month INT);
Each partition is stored in its own folder, for example
Sales/country=US/year=2012/month=12
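
The folder naming can be sketched in a few lines of Python. This only mirrors the column=value layout shown in the example; the helper name partition_path is hypothetical and is not part of Hive.

```python
def partition_path(table_dir, partition_values):
    """Build the directory for one partition.

    partition_values is an ordered list of (column, value) pairs, in the
    same order as the PARTITIONED BY clause.
    """
    parts = [f"{column}={value}" for column, value in partition_values]
    return "/".join([table_dir] + parts)

path = partition_path("Sales", [("country", "US"), ("year", 2012), ("month", 12)])
print(path)  # Sales/country=US/year=2012/month=12
```

The order of the pairs matters: each partition column adds one directory level, so country sits above year, and year above month.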
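
Hierarchy of Hive Partitions

[Figure: directory tree rooted at /hivebase/Sales. The first level holds /country=US and /country=CANADA, the second level holds /year=... directories (2012, 2014, 2015), the third level holds /month=... directories (11, 12), and the leaves are data files.]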

Hive Data Model Contd.

Buckets
- Data in each partition is divided into buckets
- Buckets are based on a hash function of a column: H(column) mod NumBuckets = bucket number
- Each bucket is stored as a file in the partition directory
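
The bucket formula can be tried out with a small Python sketch. The hash below imitates Java's String.hashCode for strings and uses the integer itself for integers, and the result is masked to be non-negative before the modulo; all of that is an assumption made for illustration, since the hash Hive actually applies depends on the column type and the Hive version.

```python
def java_string_hash(text):
    """Java-style String.hashCode, kept in signed 32-bit range (illustrative H)."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Convert the unsigned 32-bit value to a signed one.
    return h - 0x100000000 if h >= 0x80000000 else h

def bucket_number(value, num_buckets):
    """H(column) mod NumBuckets, with the hash forced non-negative first."""
    h = value if isinstance(value, int) else java_string_hash(str(value))
    return (h & 0x7FFFFFFF) % num_buckets

# Integer keys hash to themselves in this sketch.
print(bucket_number(10, 4))    # 2
print(bucket_number(7, 4))     # 3
print(bucket_number("US", 4))  # 2
```

Because the bucket depends only on the column value, rows with the same value always land in the same bucket file, which is what makes bucket-based sampling and joins possible.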

Hive is not:

• A relational database
• Designed for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates

Features of Hive

• It stores the schema in a database and the processed data in HDFS.
• It is designed for OLAP.
• It provides an SQL-type query language called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.

Hive Architecture

• User Interface
– Hive is data warehouse infrastructure software that mediates between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server).
• Metastore
– Hive uses a database server to store the schema, or metadata, of tables, databases, and the columns in a table, along with their data types and HDFS mapping.

Hive Architecture

• HiveQL Process Engine
– HiveQL is an SQL-like language for querying against the schema information in the Metastore. It replaces the traditional approach of writing a MapReduce program: instead of writing the program in Java, we write a query, and Hive processes it as a MapReduce job.
• Execution Engine
– The execution engine connects the HiveQL process engine to MapReduce. It executes the query and generates the same results that a MapReduce job would.
• HDFS or HBase
– The Hadoop Distributed File System or HBase is the storage layer that holds the data.

Working of Hive

[Figure: Hive query workflow. Surviving step labels: Get Plan, 5. Send Plan, Execute Plan, Execute Job, 7.1 execute metadata operations.]

Execution of Hive

1. Execute Query
The Hive interface, such as the command line or Web UI, sends the query to the driver (any database driver such as JDBC or ODBC) to execute.
2. Get Plan
The driver asks the query compiler to parse the query, which checks the syntax and works out the query plan, that is, what the query requires.
3. Get Metadata
The compiler sends a metadata request to the Metastore (any database).
4. Send Metadata
The Metastore sends the metadata back to the compiler.
5. Send Plan
The compiler checks the requirements and sends the plan back to the driver. At this point, parsing and compiling of the query are complete.
6. Execute Plan
The driver sends the execution plan to the execution engine.
7. Execute Job
Internally, the job is executed as a MapReduce job. The execution engine sends the job to the JobTracker, which is in the Name node, and the JobTracker assigns it to a TaskTracker, which is in a Data node. There the query runs as a MapReduce job.
7.1 Metadata Ops
During execution, the execution engine can also perform metadata operations against the Metastore.
8. Fetch Result
The execution engine receives the results from the Data nodes.
9. Send Results
The execution engine sends the results to the driver.
10. Send Results
The driver sends the results to the Hive interfaces.

HiveQL

DDL:
CREATE DATABASE
CREATE TABLE
ALTER TABLE
SHOW TABLES
DESCRIBE

DML:
LOAD DATA
INSERT

Query:
SELECT
GROUP BY
JOIN
MULTI-TABLE INSERT

Hive SerDe

- Hive has built-in SerDes: Avro, ORC, Regex, etc.
- Custom SerDes can be used (e.g. for unstructured data such as audio/video data, or semi-structured XML data)

[Figure: SELECT query read path. Diagram labels: Hive Table, Data, Record Reader, Deserialize, Hive Row Object, Object Inspector, Map Fields, End User.]

Data Hierarchy

Hive is organised hierarchically into:
- Databases: namespaces that separate tables and other objects
- Tables: homogeneous units of data with the same schema
  - Analogous to tables in an RDBMS
- Partitions: determine how the data is stored
  - Allow efficient access to subsets of the data
- Buckets/clusters
  - For subsampling within a partition
  - Join optimization

HiveQL

HiveQL / HQL provides the basic SQL-like operations:
- Select columns using SELECT
- Filter rows using WHERE
- JOIN between tables
- Evaluate aggregates using GROUP BY
- Store query results into another table
- Download results to a local directory (i.e., export from HDFS)
- Manage tables and queries with CREATE, DROP, and ALTER

Primitive Data Types

Type                             Comments
TINYINT, SMALLINT, INT, BIGINT   1-, 2-, 4-, and 8-byte integers
BOOLEAN                          TRUE/FALSE
FLOAT, DOUBLE                    Single- and double-precision real numbers
STRING                           Character string
TIMESTAMP                        Unix-epoch offset or datetime string
DECIMAL                          Arbitrary-precision decimal
BINARY                           Opaque; ignore these bytes

Complex Data Types

Type     Comments
STRUCT   A collection of named fields
         If S is of type STRUCT {a INT, b INT}: S.a returns element a
MAP      A collection of key-value pairs
         If M is a map from 'group' to GID: M['group'] returns the value of GID
ARRAY    Indexed list
         If A is an array of elements ['a','b','c']: A[0] returns 'a'
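
As a rough analogy only, the three access patterns map onto familiar Python structures. The values used here (field values 1 and 2, a GID of 1001) are hypothetical, and Python containers are not Hive types; the point is just the shape of each lookup.

```python
from collections import namedtuple

# STRUCT {a INT, b INT}: fields are accessed by name with dot notation.
S = namedtuple("S", ["a", "b"])(a=1, b=2)
print(S.a)         # 1

# MAP: values are looked up by key.
M = {"group": 1001}  # hypothetical GID
print(M["group"])  # 1001

# ARRAY: elements are looked up by zero-based index.
A = ["a", "b", "c"]
print(A[0])        # a
```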

Hive Warehouse

Hive tables are stored in the Hive “warehouse”
- Default HDFS location: /user/hive/warehouse
- Tables are stored as subdirectories of the warehouse directory
- Partitions are subdirectories of tables
External tables are supported in Hive
The actual data is stored in flat files

Table types

Hive works with two types of table structure, internal and external. Which one to use depends on how the data is loaded and how the schema is designed.

Create Table Syntax

CREATE TABLE table_name
  (col1 data_type,
   col2 data_type,
   col3 data_type,
   col4 data_type)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS format_type;

Loading And Inserting Data: Summary

Use this                   For this purpose
LOAD                       Load data from a file or directory
INSERT                     Load data from a query
                           • One partition at a time
                           • Use multiple INSERTs to insert into multiple partitions in one query
CREATE TABLE AS (CTAS)     Insert data while creating a table
Add/modify external file   Load new data into an external table

Sample Select Clauses

Select from a single table:
SELECT *
FROM sales
WHERE amount > 10
  AND region = "US";

Select from a partitioned table:
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31';

Relational Operators

ALL and DISTINCT
- Specify whether duplicate rows should be returned
- ALL is the default (all matching rows are returned)
- DISTINCT removes duplicate rows from the result set
WHERE
- Filters rows by an expression
- Hive versions before 0.13 do not support IN or EXISTS sub-queries in the WHERE clause
LIMIT
- Sets the maximum number of rows to be returned

Relational Operators

GROUP BY
- Groups data by column values
- The SELECT list can include only columns named in the GROUP BY clause, plus aggregate functions
ORDER BY / SORT BY
- ORDER BY performs a total ordering
  - All rows pass through a single reducer, so it is slow on large outputs
- SORT BY performs a partial ordering
  - It sorts the output of each reducer separately
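
The difference between the two orderings can be seen in a small Python sketch. The round-robin assignment of rows to reducers is a hypothetical stand-in for however rows really reach each reducer; only the contrast between one sorted list and several independently sorted lists is the point.

```python
def sort_by(rows, num_reducers):
    """SORT BY: each reducer sorts only the rows it receives."""
    # Hypothetical distribution: deal rows to reducers round-robin.
    reducers = [rows[i::num_reducers] for i in range(num_reducers)]
    return [sorted(chunk) for chunk in reducers]

def order_by(rows):
    """ORDER BY: a single reducer sees every row, giving a total order."""
    return sorted(rows)

amounts = [50, 10, 40, 30, 20, 60]

per_reducer = sort_by(amounts, num_reducers=2)
print(per_reducer)        # [[20, 40, 50], [10, 30, 60]]
print(order_by(amounts))  # [10, 20, 30, 40, 50, 60]
```

Each reducer's output is sorted, but concatenating the two outputs does not give a globally sorted result. That is the price of letting the sort run in parallel.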

Simple Table

CREATE TABLE page_view
  (viewTime INT,
   userid BIGINT,
   page_url STRING,
   referrer_url STRING,
   ip STRING COMMENT 'IP Address of the User')
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
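
To show how a line of a tab-delimited text file lines up with the page_view columns, here is a hand-written Python stand-in for what a delimited-text SerDe does conceptually. It is not Hive code: the sample line is made up, and this sketch raises an error on a malformed line, which is a simplification chosen for the example.

```python
# Column names and Python stand-ins for the Hive types in page_view.
SCHEMA = [
    ("viewTime", int),      # INT
    ("userid", int),        # BIGINT
    ("page_url", str),      # STRING
    ("referrer_url", str),  # STRING
    ("ip", str),            # STRING
]

def parse_row(line, delimiter="\t"):
    """Split one text line on the field delimiter and cast each field."""
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) != len(SCHEMA):
        raise ValueError(f"expected {len(SCHEMA)} fields, got {len(fields)}")
    return {name: cast(raw) for (name, cast), raw in zip(SCHEMA, fields)}

# Hypothetical line as it might appear in a file backing page_view.
sample = "1214000000\t42\t/home\t/search\t10.0.0.1\n"
row = parse_row(sample)
print(row["userid"], row["page_url"])  # 42 /home
```

The schema is applied when the file is read, not when it is written, which is why the same file can be described by more than one table definition.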

External Table

CREATE EXTERNAL TABLE page_view_stg
  (viewTime INT,
   userid BIGINT,
   page_url STRING,
   referrer_url STRING,
   ip STRING COMMENT 'IP Address of the User')
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/staging/page_view';

More About Tables

CREATE TABLE
- LOAD: the file is moved into Hive’s data warehouse directory
- DROP: both metadata and data are deleted
CREATE EXTERNAL TABLE
- LOAD: no files are moved
- DROP: only the metadata is deleted
- Use this when sharing data with other Hadoop applications, or when you want to use multiple schemas on the same data
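
The DROP behaviour of the two table types can be modelled with a toy Python sketch. A plain dict stands in for the Metastore and temporary local directories stand in for HDFS locations; both are assumptions made for illustration, and drop_table is a hypothetical helper, not a Hive API.

```python
import os
import shutil
import tempfile

def drop_table(metastore, name):
    """Remove the table's metadata; delete its files only if it is managed."""
    table = metastore.pop(name)
    if not table["external"]:
        shutil.rmtree(table["location"])

# Two hypothetical data directories standing in for HDFS paths.
managed_dir = tempfile.mkdtemp()
external_dir = tempfile.mkdtemp()

# A dict standing in for the Metastore.
metastore = {
    "page_view": {"location": managed_dir, "external": False},
    "page_view_stg": {"location": external_dir, "external": True},
}

drop_table(metastore, "page_view")
drop_table(metastore, "page_view_stg")

managed_still_there = os.path.exists(managed_dir)
external_still_there = os.path.exists(external_dir)
print(managed_still_there)   # False: data deleted along with the table
print(external_still_there)  # True: data left in place
print(metastore)             # {}

shutil.rmtree(external_dir)  # clean up the demo directory
```

After both drops the metadata is gone for both tables, but only the external table's files survive, so other applications can keep using them.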

Partitioning

Can make some queries faster
- Divides data based on the partition column
- Use the PARTITIONED BY clause when creating the table
- Use the PARTITION clause when loading data
- SHOW PARTITIONS shows a table’s partitions
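
Why partitioning can make queries faster: a filter on partition columns can be answered by choosing directories, before any file is read. The Python sketch below illustrates that idea with made-up directory names; parse_partition and prune are hypothetical helpers, not how Hive implements pruning internally.

```python
def parse_partition(path):
    """Turn 'Sales/country=US/year=2012' into {'country': 'US', 'year': '2012'}."""
    return dict(part.split("=", 1) for part in path.split("/") if "=" in part)

def prune(partition_dirs, **wanted):
    """Keep only the directories whose partition values match the filter."""
    return [
        path for path in partition_dirs
        if all(parse_partition(path).get(col) == val for col, val in wanted.items())
    ]

# Hypothetical partition directories of a Sales table.
dirs = [
    "Sales/country=US/year=2012/month=11",
    "Sales/country=US/year=2012/month=12",
    "Sales/country=CANADA/year=2014/month=11",
]

# Equivalent in spirit to: WHERE country = 'US' AND month = 12
selected = prune(dirs, country="US", month="12")
print(selected)  # ['Sales/country=US/year=2012/month=12']
```

Only one of the three directories has to be scanned for this filter; a query that does not filter on a partition column gets no such benefit.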

Bucketing

Can speed up queries that involve sampling the data
- Sampling works without bucketing, but Hive then has to scan the entire dataset
- Use CLUSTERED BY when creating the table
- For sorted buckets, add SORTED BY
- To query a sample of your data, use TABLESAMPLE

Browsing Tables And Partitions

Command                                           Comments
SHOW TABLES;                                      Show all the tables in the database
SHOW TABLES 'page.*';                             Show tables matching the specification (uses regex syntax)
SHOW PARTITIONS page_view;                        Show the partitions of the page_view table
DESCRIBE page_view;                               List columns of the table
DESCRIBE EXTENDED page_view;                      More information on columns (useful only for debugging)
DESCRIBE page_view PARTITION (ds='2008-10-31');   List information about a partition

Loading Data

Use LOAD DATA to load data from a file or directory
- Reads from HDFS unless the LOCAL keyword is specified
- Appends data unless OVERWRITE is specified
- PARTITION is required if the destination table is partitioned

LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-8_us.txt'
OVERWRITE INTO TABLE page_view
PARTITION (date='2008-06-08', country='US');

Inserting Data During Table Creation

Use AS SELECT in the CREATE TABLE statement to populate a table as it is created

CREATE TABLE page_view AS
SELECT [Link], [Link], pvs.page_url,
pvs.referrer_url
FROM page_view_stg pvs
WHERE [Link] = 'US';

SELECT Syntax

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];

Example:
hive> SELECT * FROM employee WHERE salary>30000;

Thank you

This presentation was created using LibreOffice Impress [Link] and can be used freely under the GNU General Public License.

