0% found this document useful (0 votes)
10 views75 pages

Hive: Big Data SQL Query Tool

Uploaded by

23mca005
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
10 views75 pages

Hive: Big Data SQL Query Tool

Uploaded by

23mca005
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd

HIVE

INTRODUCTION

• Hive is a tool which was developed to help people work with Big Data
without actually needing to spend time learning MapReduce.
• Hive is a “data warehouse software” that enables you to query and
manipulate data using a SQL-like language known as HiveQL.
• It was developed at Facebook so that people who had experience in SQL
would be able to work on querying datasets without actually learning new
paradigms like MapReduce or new programming languages.
Developers face problem in writing MapReduce logic
● How to port existing
○ relational databases
○ SQL infrastructure with Hadoop?
● End users are familiar with SQL queries than MapReduce
and Pig
● Hive’s SQL-like query language makes data churning easy
KEY FEATURES OF HIVE

The main features of Hive are:


● An SQL-like interface to write queries on large datasets.
● Hive can be used to process all variants of data i.e. Structured, Semi-
structured and Unstructured.
● A variety of built-in functions for working with dates, strings, etc.
● Easy ETL (extraction, transformation, and loading) of data.
HIVE ARCHITECTURE
ARCHITECTURE OF HIVE

• User Interface - Hive is a data warehouse infrastructure software that can


create interaction between user and HDFS. The user interfaces that Hive
supports are Hive Web UI, Hive command line, and Hive HD Insight
• Meta Store -Hive chooses respective database servers to store the schema
or Metadata of tables, databases, columns in a table, their data types and
HDFS mapping.
• HiveQL Process Engine- HiveQL is similar to SQL for querying on
schema info on the Megastore. It is one of the replacements of traditional
approach for MapReduce program. Instead of writing MapReduce
program in Java, we can write a query for MapReduce job and process it.
ARCHITECTURE OF HIVE

• Execution Engine - The conjunction part of HiveQL process Engine and


MapReduce is Hive Execution Engine. Execution engine processes the
query and generates results as same as MapReduce results. It uses the
flavour of MapReduce.
• HDFS or HBASE - Hadoop distributed file system or HBASE are the data
storage techniques to store data into the file system
WORKING OF HIVE
WORKING

• Execute Query- The Hive interface such as Command Line or Web UI


sends query Driver to execute.
• Get Plan- The driver takes the help of query complier that parses the
query to check the syntax and query plan or the requirement of query.
• Get Metadata- The compiler sends metadata request to Megastore
• Send Metadata- Megastore sends metadata as a response to the compiler.
• Send Plan- The compiler checks the requirement and resends the plan to
the driver. Up to here, the parsing and compiling of a query is complete.
• Execute Plan- the driver sends the execute plan to the execution engine.
• Execute Job- Internally, the process of execution job is a MapReduce job.
The execution engine sends the job to JobTracker, which is in Name node
and it assigns this job to TaskTracker, which is in Data node. Here, the
query executes MapReduce job.
• Metadata Ops- Meanwhile in execution, the execution engine can execute
metadata operations with Megastore.
• Fetch Result- The execution engine receives the results from Data nodes.
• Send Results- The execution engine sends those resultant values to the
driver.
• Send Results- The driver sends the results to Hive Interfaces.
HIVE DATA TYPES AND DATA MODELS
• Hive stores and queries data using its data models. The purpose of using
data models is to make querying convenient and fast.
There are four main components in Hive data models, which are similar to
how an RDBMS stores data:
1. Databases
2. Tables
3. Partitions
4. Buckets
TABLES IN HIVE

Let’s start with tables. Hive has two types of tables:


● Managed (or internal) table
● External table
Note: 'Managed table' and the 'internal table' are synonymous terms.
You should use external tables when:
● You want to use the data outside of Hive as well. For example, when
another existing program is running on the same cluster.
● You want the data to remain stored on the HDFS even after dropping
tables because Hive does not delete the data stored outside (of the Hive
database).
● You do not want Hive to control the storage of your data
(location/directories of storage/etc.).
You use managed tables when:
● The data is temporary. So, when the Hive table is dropped, the data stored
in the internal table is deleted along with the metadata
● You want Hive to manage the life cycle of the data completely, i.e. both
store and process it
HIVE DATA TYPES

All the data types in hive are classified into four types
• Column Types
• Literals
• Null Values
• Complex Types
• Integer Types - Integer type data can be specified using integral data
types, INT. When the data range exceeds the range of INT, you need to use
BIGINT and if the data range is smaller than the INT, you use
SMALLINT. TINYINT is smaller than SMALLINT.
• String Types - String type data types can be specified using single quotes
(' ') or double quotes (" "). It contains two data types: VARCHAR and
CHAR. Hive follows C-types escape characters.
• Timestamp - It supports traditional UNIX timestamp with optional
nanosecond precision. It supports [Link] format “YYYY-
MM-DD HH:MM:[Link]” and format “yyyy-mmdd
hh:mm:[Link]”.
• Dates - DATE values are described in year/month/day format in the form
{{YYYY-MM-DD}}.
• Decimals -The DECIMAL type in Hive is as same as Big Decimal format
of Java. It is used for representing immutable arbitrary precision.
• Union Types - Union is a collection of heterogeneous data [Link] can
create an instance using create union.
LITERALS

• Floating Point Types - Floating point types are nothing but numbers with
decimal points. Generally, this type of data is composed of DOUBLE data
type.
• Decimal Type - Decimal type data is nothing but floating point value with
higher range than DOUBLE data type. The range of decimal type is
approximately -10-308 to 10308.
COMPLEX TYPES

• Arrays - Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
• Maps - Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
• Structs - Structs in Hive is similar to using complex data with comment.
Syntax: STRUCT<col_name : data_type [ COMMENT col_comment, … ]>
HIVE SCHEMAS

• Hive is schema-on-read
• Schema is only enforced when the data is read (at query time)
• Allows greater flexibility: same data can be read using multiple
schemas
• Contrast with an RDBMS, which is schema-on-write
• Schema is enforced when the data is loaded
• Speeds up queries at the expense of load times
HIVE DDL

Create Database Statement


A database in Hive is a namespace or a collection of tables.
hive> CREATE SCHEMA userdb;
hive> SHOW DATABASES;
Drop database
hive> DROP DATABASE IF EXISTS userdb;
CREATING HIVE TABLES

Create a table called T with two columns, the first being an integer and the
other a string.
hive> CREATE TABLE T(a INT, b STRING);
CREATE TABLE SYNTAX

CREATE TABLE table_name


(col1 data_type,
col2 data_type,
col3 data_type,
col4 datatype )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS format_type;
SIMPLE TABLE

CREATE TABLE page_view


(viewTime INT,
userid BIGINT,
page_url STRING,
referrer_url STRING,
ip STRING COMMENT 'IP Address of the User' )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
MORE COMPLEX TABLE
CREATE TABLE employees (
(name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING,
city:STRING,
state:STRING,
zip:INT>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
EXTERNAL TABLE
CREATE EXTERNAL TABLE page_view_stg
(viewTime INT,
userid BIGINT,
page_url STRING,
referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/staging/page_view';
ALTERING AND DROPPING TABLES

hive> ALTER TABLE T RENAME TO K;


hive> ALTER TABLE K ADD COLUMNS (col INT);
hive> ALTER TABLE HIVE_TABLE ADD COLUMNS (col1 INT COMMENT 'a comment'
);
DML

To understand the Hive DML commands, let's see the employee and employee_department table first.
DML COMMANDS

LOAD DATA
hive> LOAD DATA LOCAL INPATH './usr/Desktop/
[Link]' OVERWRITE INTO TABLE Employee;
SELECTS and FILTERS
hive> SELECT E.EMP_ID FROM Employee E WHERE [Link]='US';
GROUP BY
hive> SELECT E.EMP_ID FROM Employee E GROUP BY [Link];
SORT BY

Hive sort by and order by commands are used to fetch data in sorted order. The
main differences between sort by and order by commands are given below.
hive> SELECT E.EMP_ID FROM Employee E SORT BY [Link];
May use multiple reducers for final output.
Only guarantees ordering of rows within a reducer.
May give partially ordered result.
ORDER BY

hive> SELECT E.EMP_ID FROM Employee E order BY [Link];


Uses single reducer to guarantee total order in output.
LIMIT can be used to minimize sort time.
JOINS
INNER & LEFT OUTER

Inner joins
Select * from employee join employeedepartment ON ([Link]
pid=[Link])
Left outer joins
Select [Link], empName, department from employee e Left outer joi
n employeedepartment ed on([Link]=[Link]);
RIGHT OUTER & FULL OUTER JOIN

Right outer joins


Select [Link], empName, department from employee e Right outer join employee
department ed on([Link]=[Link]);
Full outer joins
Select [Link], empName, department from employee e FULL outer join employee
department ed on([Link]=[Link]);
MORE HIVE QUERIES
AlexRon,123,Mathematics#Physics#Chemistry#MolecularBiology,semse
terfee|18000.50#labfee|1000#examfee|2500,011#1023550456
Alexa,111,Mathematics#Physics#English#ComputerScience,semseterfe
e|17000.50#labfee|2000#examfee|3500,012#1045550451
Derek,143,Mathematics#Physics#Geography#MolecularBiology,semset
erfee|20000.50#labfee|1500#examfee|2200,014#34523550456
Dona,115,Chemistry#Physics#English#ComputerScience,semseterfee|
17400.40#labfee|1000#examfee|1500,011#10455545251

(let the name of the above data file is [Link])


1. To create the table from the data use the below query
CREATE TABLE IF NOT EXISTS students (name STRING ,id INT ,subjects
ARRAY < STRING >,feeDetails MAP < STRING , FLOAT >,phoneNumber
STRUCT <areacode: INT , number : INT > )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '#'
MAP KEYS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
2. Load the data into the table (if stored on HDFS) ,
remember this will move the file to
load data inpath 'add path to your file here' overwrite
into table students;
Note: If you are using the file located on your local directory
you will need to use the below query.
load data local inpath 'add path to your file here'
overwrite into table students;
3. To verify if the data has been loaded correctly use the
below query
Select * FROM students;
EXPLODE FUNCTION
explode(ARRAY<T> a)
Explodes an array to multiple rows. Returns a row-set with a single column (col),
one row for each element from the array.
explode(MAP<Tkey,Tvalue> m)
Explodes a map to multiple rows. Returns a row-set with a two columns (key,value)
, one row for each key-value pair from the input map.
posexplode(ARRAY<T> a)
Explodes an array to multiple rows with additional positional column of int type
(position of items in the original array, starting with 0). Returns a row-set with two
columns (pos,val), one row for each element from the array.
inline(ARRAY<STRUCT<f1:T1,...,fn:Tn>> a)
Explodes an array of structs to multiple rows. Returns a row-set with N columns (N
= number of top level elements in the struct), one row per struct from the array.
USING ADVANCED FUNCTIONS
1. Explode()
Select explode(feedetails) FROM students;
Select explode(subjects) FROM students;
Select explode(feedetails) FROM students WHERE name = "Alexa" ;
2. Upper()
Select upper ( name ) from students;
3. Regex_Replace()
Select regexp_replace( concat ( upper ( name ), id ), ' ' , '' ) as username
from students;
NESTED QUERY EXAMPLE

1. Query to determine the set of students having lab fees greater than 1000
Select name , feeDetails[ 'labfee' ] from students where feeDetails[ 'labfee' ] > 1000
;
2. Assume that the labfees are getting hiked by 20% and you would like to see the projected
values in case it is greater than 1500.
Select name , feeDetails[ 'labfee' ]* 1.2 as projected_fee from students where
feeDetails[ 'labfee' ]* 1.2 > 1500 ;

FROM
( SELECT name , feeDetails[ 'labfee' ]* 1.2 as projected_fee from
students) newfee
SELECT [Link] where newfee.projected_fee > 1500 ;
PARTITIONING AND BUCKETING

1. Partitioning : Partitions are used to make queries faster by dividing the tables into
smaller parts using partition key columns.
( For example, if in a table with details of employees in a company(name, salary, department,
designation), you can create a partition for each department. Hive will store each partition
separately and will scan only the partition that is needed, thereby making the query faster.)

2. Bucketing : Like partitioning, bucketing also divides the data into smaller and more
manageable parts. The key difference between partitioning and bucketing is bucketing uses
the hash of a column to create multiple 'buckets', whereas partitions are more
straightforward - they are simply segmented directories according to the value of a column
(year, month etc.).
PARTITIONING AND CLUSTERING

For creating partitions, "PARTITIONED BY" clause is used.


Similarly for creating buckets, "CLUSTERED BY“ clause is
used.
Guidelines for when to use partitioning and
bucketing:
● You typically perform partitioning on columns (or groups of
columns) such that it produces a manageable number of
partitions (directories), such as year, month, state, country
etc.
● However, if you partition on say customer ID, the number
of partitions will be enormous and may actually reduce
performance; thus, you can choose to create buckets on
customer ID
STATIC PARTITIONING EXAMPLE

1. To create table use the below query


CREATE TABLE IF NOT EXISTS customer_partitioned(customer_fname
varchar ( 64),customer_lname varchar ( 64 ),customer_addr
string ,city varchar ( 64 ))PARTITIONED BY (country VARCHAR
( 64 ),state VARCHAR ( 64 ))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
STATIC PARTITIONING

2. To load data into the table use the below command


LOAD DATA LOCAL INPATH 'yourfilepath'
INTO TABLE customer_partitioned
PARTITION (country= 'usa' , state= 'az’ );
DYNAMIC PARTITIONING EXAMPLE

1. To create table use the below query


CREATE TABLE customer(customer_fname
VARCHAR(64),customer_lname
VARCHAR(64),customer_addr STRING,city VARCHAR(64),state
VARCHAR(64),country VARCHAR(64)) ROW FORMAT DELIMITED
FIELDS
TERMINATED BY ‘,’ STORED AS TEXTFILE;
2. To load data into the table use the below command
LOAD DATA LOCAL INPATH 'yourdatapath' INTO TABLE customer ;
3. By default, dynamic partitioning is disabled in Hive to prevent
accidental partition creation. To enable, we use:
set [Link] = true ;
set [Link]=nonstrict;
4. Create a partitioned table using the below query
CREATE TABLE partitioned_customer(customer_fname
varchar ( 64 ),customer_lname varchar ( 64 ),customer_addr
string ,city
varchar ( 64 ))PARTITIONED BY (country VARCHAR ( 64 ),state
VARCHAR ( 64 ));
5. Load data into the partitioned table using the previously created
normal table
Insert into table partitioned_customer partition (country,state)
select
customer_fname,customer_lname,customer_addr,city,country,sta
te from
customer;
6. To show all the partitions
show partitions partitioned_customer;
7. Simple select query using partition
Select * from partitioned_customer where state= 'az' and country= 'usa' ;
8. If you want to drop any partitions, it can be done as:
ALTER Table partitioned_customer DROP IF EXISTS PARTITION (country= 'usa' ,
state= 'nw' ,);
9. If you need to change the file to be referred by a partition you can use the below query
ALTER TABLE customer_partitioned PARTITION (country= 'usa' ,state= 'az' ) SET
LOCATION 'newpath' ;
ONE MORE EXAMPLE OF PARTITIONING

•Creation of table all states with 3 column names such as state, district, and enrollment
•Loading data into table all states
•Creation of partition table with state as partition key
•In this step Setting partition mode as non-strict( This mode will activate dynamic partition mode)
•Loading data into partition tablestate_part
•Actual processing and formation of partition tables based on state as partition key
•There is going to 38 partition outputs in HDFS storage with the file name as state name
BUCKETING

Buckets in hive is used in segregating of hive table-data into


multiple files or directories. it is used for efficient querying.
• The data i.e. present in that partitions can be divided further into
Buckets
• The division is performed based on Hash of particular columns that
we selected in the table.
• Buckets use some form of Hashing algorithm at back end to read
each record and place it into buckets
• In Hive, we have to enable buckets by using the
[Link]=true;
BUCKETING EXAMPLE

1. To create a normal table use the below query


CREATE TABLE IF NOT EXISTS products_input(productid
int ,transactiondate date ,amount int , store string ) ROW
FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS
TEXTFILE;
2. To load data into the table use the below command
load data inpath 'yourdatapath' into table products_input;
3. To enable bucketing use:
set [Link]=nonstrict;
set [Link]= true ;
set [Link]= true ;
4. Create a bucketed table using the below query
CREATE TABLE IF NOT EXISTS products_bucket(productid
int ,amount int ) PARTITIONED BY ( store String ,transactiondate date
) CLUSTERED
BY (productid) into 4 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
5. Load data into the bucketed table using the previously created normal
table
insert into table products_bucket partition
( store ,transactiondate) select
productid,amount, store ,transactiondate from products_input;
6. Simple select query using buckets
select * from products_bucket where store = "walmart" and
transactiondate= "2018-01-01" and productid = 123 ;
7. Select query for comparison
select * from products_input where store = "walmart" and
transactiondate= "2018-01-01" and productid = 123 ;
1. CREATE SAMPLEBUCKET WITH COLUMN NAMES AS
FIRST_NAME, JOB_ID, DEPARTMENT, SALARY AND COUNTRY .
2. LOADING DATA INTO TABLE SAMPLE
BUCKET
3. DISPLAYING 4 BUCKETS THAT CREATED IN
STEP 1
UDFS(USER DEFINED FUNCTIONS) IN HIVE

• User Defined Functions, also known as UDF, allow you to create


custom functions to process records or groups of records.
• To write a simple UDF:
1. Extend the [Link] class
2. Write an "evaluate" method that has a signature equivalent to the
signature of your UDF in HiveQL.
SIMPLE UDF EXAMPLE
package udfs;
import [Link];

class SimpleUDFExample extends UDF

{ public Text evaluate(Text input)


{ if (input==null) return null;
return new Text("Hello " + [Link]()); } }

ve> ADD JAR /root/hivedemo/[Link];


ve> CREATE TEMPORARY FUNCTION helloworld as ‘[Link]’;
ive> select helloworld(name) from tablename limit 1000;
/** A simple UDF to convert Celcius to Fahrenheit
*/
package p;
import [Link];

public class ConvertToCelcius extends UDF


{ public double evaluate(double value)
{ return (value - 32) / 1.8; } }

hive> add jar [Link]


hive> create temporary function fahrenheit_to_celcius using
“[Link]“; hive> SELECT
fahrenheit_to_celcius(temp_fahrenheit) from temperature_data;
WORKING WITH SEMI STRUCTURED DATA USING HIVE
(XML, JSON)

• Step 1) Creation of Table "xmlsample_guru" with str column with string


data type.
• Step 2) Using XPath () method we will be able to fetch the data stored
inside XML tags.
• Step 3) In this step, we will fetch and display the Raw XML of table
"xmlsample_guru."
JSON FORMAT

• Twitter and websites data is stored in JSON format.


• Whenever we try to fetch data from online servers it
will return JSON files.
• Using Hive as data store we can able to load JSON
data into Hive tables by creating schemas.
• Step 1) In this step, we are going to create JSON table name
"json_guru". Once created loading and displaying contents of the actual
schema.
• Step 2) Using get_json_object() Method we can able to fetch the Data
values stored in JSON hierarchy

You might also like