0% found this document useful (0 votes)

26 views23 pages

Overview of Hive Architecture and Features

The document provides an overview of Hive including its architecture, data types, file formats, commands and how it interacts with Hadoop. Hive is a data warehouse tool that allows SQL-like queries on structured data stored in Hadoop. It uses a metastore to store metadata and execution engine to process queries as MapReduce jobs.

Uploaded by

Shantanu Sharma

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

26 views23 pages

Overview of Hive Architecture and Features

Uploaded by

Shantanu Sharma

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Introduction to HIVE: Hive Architecture, Hive Data types, Hive Collection Types, Hive

File Formats, Hive Query Language, Hive Partitions, Bucketing, Views, RCFile
Implementation, Hive User Defined Function, SerDe, UDF
__________________________________________________________________________
The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive that
are used to help Hadoop modules.
 Sqoop: It is used to import and export data to and from between HDFS and RDBMS.
 Pig: It is a procedural language platform used to develop a script for MapReduce
operations.
 Hive: It is a platform used to develop SQL type scripts to do MapReduce operations.
Note: There are various ways to execute MapReduce operations:

 The traditional approach using Java MapReduce program for structured, semi-structured,
and unstructured data.
 The scripting approach for MapReduce to process structured and semi structured data using
Pig.
 The Hive Query Language (HiveQL or HQL) for MapReduce to process structured data
using Hive.

What is Hive

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook, later the Apache Software Foundation took it up and
developed it further as an open source under the name Apache Hive. It is used by different
companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Hive is not

 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates

Features of Hive

 It stores schema in a database and processed data into HDFS.

 It is designed for OLAP.
 It provides SQL type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.

1
Architecture of Hive

The following component diagram depicts the architecture of Hive:

This component diagram contains different units. The following table describes each unit:

Unit Name Operation

User Interface Hive is a data warehouse infrastructure software that can

create interaction between user and HDFS. The user interfaces that
Hive supports are Hive Web UI, Hive command line, and Hive HD
Insight (In Windows server).

Meta Store Hive chooses respective database servers to store the

schema or Metadata of tables, databases, columns in a table, their
data types, and HDFS mapping.

HiveQL Process Engine HiveQL is similar to SQL for querying on schema info on
the Metastore. It is one of the replacements of traditional approach
for MapReduce program. Instead of writing MapReduce program
in Java, we can write a query for MapReduce job and process it.

Execution Engine The conjunction part of HiveQL process Engine and

MapReduce is Hive Execution Engine. Execution engine
processes the query and generates results as same as MapReduce
results. It uses the flavor of MapReduce.

HDFS or HBASE Hadoop distributed file system or HBASE are the

datastorage techniques to store data into file system.

2
Working of Hive

The following diagram depicts the workflow between Hive and Hadoop.

The following table defines how Hive interacts with Hadoop framework:

Step Operation
No.

1 Execute Query

The Hive interface such as Command Line or Web UI sends queryto Driver (any database
driver such as JDBC, ODBC, etc.) to execute.

2 Get Plan

The driver takes the help of query compiler that parses the query to check the syntax and
query plan or the requirement of query.

3 Get Metadata

The compiler sends metadata request to Metastore (any database).

4 Send Metadata

Metastore sends metadata as a response to the compiler.

3
5 Send Plan

The compiler checks the requirement and resends the plan to the driver. Up to here, the
parsing and compiling of a query is complete.

6 Execute Plan

The driver sends the execute plan to the execution engine.

7 Execute Job

Internally, the process of execution job is a MapReduce job. The execution engine sends
the job to JobTracker, which is in Name node and it assigns this job to TaskTracker, which
is in Data node. Here, the query executes MapReduce job.

7.1 Metadata Ops

Meanwhile in execution, the execution engine can execute metadata operations with
Metastore.

8 Fetch Result
The execution engine receives the results from Data nodes.
9 Send Results
The execution engine sends those resultant values to the driver.
10 Send Results
The driver sends the results to Hive Interfaces.

File Formats in Hive

 File Format specifies how records are encoded in files

 Record Format implies how a stream of bytes for a given record are encoded
 The default file format is TEXTFILE – each record is a line in the file
 Hive uses different control characters as delimeters in textfiles
 ᶺA ( octal 001) , ᶺB(octal 002), ᶺC(octal 003), \n
 The term field is used when overriding the default delimiter
 FIELDS TERMINATED BY ‘\001’
 Supports text files – csv, tsv
 TextFile can contain JSON or XML documents.

Commonly used File Formats –

1. TextFile format
 Suitable for sharing data with other tools
 Can be viewed/edited manually
2. SequenceFile
 Flat files that stores binary key ,value pair
 SequenceFile offers a Reader ,Writer, and Sorter classes for reading ,writing, and sorting
respectively
 Supports – Uncompressed, Record compressed ( only value is compressed) and Block
compressed ( both key,value compressed) formats
4
3. RCFile
 RCFile stores columns of a table in a record columnar way
4. ORC
5. AVRO

Hive Commands

Hive supports Data definition Language(DDL), Data Manipulation Language(DML) and User
defined functions.

Hive DDL Commands

create database

drop database

create table

drop table

alter table

create index

create view

Hive DML Commands

Select

Where

Group By

Order By

Load Data

Join:

o Inner Join
o Left Outer Join
o Right Outer Join
o Full Outer Join

5
Hive DDL

Commands Create

Database Statement

A database in Hive is a namespace or a collection of tables.

1. hive> CREATE SCHEMA userdb;
2. hive> SHOW DATABASES;

Drop database

1. ive> DROP DATABASE IF EXISTS userdb;

Creating Hive Tables

Create a table called Sonoo with two columns, the first being an integer and the other a string.

1. hive> CREATE TABLE Sonoo(foo INT, bar STRING);

Create a table called HIVE_TABLE with two columns and a partition column called ds. The
partition column is a virtual column. It is not part of the data itself but is derived from the partition
that a particular dataset is loaded [Link] default, tables are assumed to be of text input format and
the delimiters are assumed to be ^A(ctrl-a).

1. hive> CREATE TABLE HIVE_TABLE (foo INT, bar STRING) PARTITIONED BY (ds STRI
NG);

Browse the table

1. hive> Show tables;

Altering and Dropping Tables

1. hive> ALTER TABLE Sonoo RENAME TO Kafka;

2. hive> ALTER TABLE Kafka ADD COLUMNS (col INT);
3. hive> ALTER TABLE HIVE_TABLE ADD COLUMNS (col1 INT COMMENT 'a comment');
4. hive> ALTER TABLE HIVE_TABLE REPLACE COLUMNS (col2 INT, weight STRING, baz
INT COMMENT 'baz replaces new_col1');
Hive DML Commands

To understand the Hive DML commands, let's see the employee and employee_department table
first.

6
LOAD DATA

1. hive> LOAD DATA LOCAL INPATH './usr/Desktop/[Link]' OVERWRITE INTO TABLE Employe
e;

SELECTS and FILTERS

1. hive> SELECT E.EMP_ID FROM Employee E WHERE [Link]='US';

GROUP BY

1. hive> hive> SELECT E.EMP_ID FROM Employee E GROUP BY [Link];

Adding a Partition

We can add partitions to a table by altering the table. Let us assume we have a table
called employee with fields such as Id, Name, Salary, Designation, Dept, and yoj.
Syntax:
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec
[LOCATION 'location1'] partition_spec [LOCATION 'location2'] ...;

partition_spec:
: (p_column = p_col_value, p_column = p_col_value, ...)
The following query is used to add a partition to the employee table.
hive> ALTER TABLE employee
> ADD PARTITION (year=’2012’)
> location '/2012/part2012';

Renaming a Partition

The syntax of this command is as follows.

7
ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION
partition_spec;
The following query is used to rename a partition:
hive> ALTER TABLE employee PARTITION (year=’1203’)
> RENAME TO PARTITION (Yoj=’1203’);

Dropping a Partition

The following syntax is used to drop a partition:

ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec, PARTITION

partition_spec,...;

The following query is used to drop a partition:

hive> ALTER TABLE employee DROP [IF EXISTS]
> PARTITION (year=’1203’);

Hive Query Language

The Hive Query Language (HiveQL) is a query language for Hive to process and analyze
structured data in a Metastore. This chapter explains how to use the SELECT statement with
WHERE clause.
SELECT statement is used to retrieve the data from a table. WHERE clause works similar to a
condition. It filters the data using the condition and gives you a finite result. The built-in operators
and functions generate an expression, which fulfils the condition.

Syntax

Given below is the syntax of the SELECT query:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...

FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];

Example

Let us take an example for SELECT…WHERE clause. Assume we have the employee table as
given below, with fields named Id, Name, Salary, Designation, and Dept. Generate a query to
retrieve the employee details who earn a salary of more than Rs 30000.

8
+ + + + + +
| ID | Name | Salary | Designation | Dept |
+ + + + + +
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+ + + + + +
The following query retrieves the employee details using the above scenario:

hive> SELECT * FROM employee WHERE salary>30000;

On successful execution of the query, you get to see the following response:
+ + + + + +
| ID | Name | Salary | Designation | Dept |
+ + + + + +
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
+ + + + + +

JDBC Program

The JDBC program to apply where clause for the given example is as follows.
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];

public class HiveQLWhere {

private static String driverName = "[Link]";

public static void main(String[] args) throws SQLException {

// Register driver and create driver instance

[Link](driverName);

// get connection
Connection con = [Link]("jdbc:hive://localhost:10000/userdb", "",
"");

// create statement
Statement stmt = [Link]();

// execute statement

9
Resultset res = [Link]("SELECT * FROM employee WHERE salary>30000;");

[Link]("Result:");
[Link](" ID \t Name \t Salary \t Designation \t Dept ");

while ([Link]()) {
[Link]([Link](1) + " " + [Link](2) + " " + [Link](3) + " " +
[Link](4) + " " + [Link](5));
}
[Link]();
}
}
Save the program in a file named [Link]. Use the following commands to compile
and execute this program.
$ javac [Link]
$ java HiveQLWhere

Output:

ID Name Salary Designation Dept

1201 Gopal 45000 Technical manager TP
1202 Manisha 45000 Proofreader PR
1203 Masthanvali 40000 Technical writer TP
1204 Krian 40000 Hr Admin HR

The ORDER BY clause is used to retrieve the details based on one column and sort the result set
by ascending or descending order.

Syntax

Given below is the syntax of the ORDER BY clause:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...

FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]]
[LIMIT number];

10
Example

Let us take an example for SELECT...ORDER BY clause. Assume employee table as given
below, with the fields named Id, Name, Salary, Designation, and Dept. Generate a query to
retrieve the employee details in order by using Department name.
+ + + + + +
| ID | Name | Salary | Designation | Dept |
+ + + + + +
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+ + + + + +
The following query retrieves the employee details using the above scenario:

hive> SELECT Id, Name, Dept FROM employee ORDER BY DEPT;

On successful execution of the query, you get to see the following response:
+ + + + + +
| ID | Name | Salary | Designation | Dept |
+ + + + + +
|1205 | Kranthi | 30000 | Op Admin | Admin |
|1204 | Krian | 40000 | Hr Admin | HR |
|1202 | Manisha | 45000 | Proofreader | PR |
|1201 | Gopal | 45000 | Technical manager | TP |
|1203 | Masthanvali | 40000 | Technical writer | TP |
+ + + + + +

JDBC Program

Here is the JDBC program to apply Order By clause for the given example.
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];

public class HiveQLOrderBy {

private static String driverName = "[Link]";

public static void main(String[] args) throws SQLException {

// Register driver and create driver instance

[Link](driverName);

// get connection

11
Connection con = [Link]("jdbc:hive://localhost:10000/userdb", "",
"");

// create statement
Statement stmt = [Link]();

// execute statement
Resultset res = [Link]("SELECT * FROM employee ORDER BY DEPT;");
[Link](" ID \t Name \t Salary \t Designation \t Dept ");

while ([Link]()) {
[Link]([Link](1) + " " + [Link](2) + " " + [Link](3) + " " +
[Link](4) + " " + [Link](5));
}

[Link]();
}
}
Save the program in a file named [Link]. Use the following commands to compile
and execute this program.
$ javac [Link]
$ java HiveQLOrderBy

Output:

ID Name Salary Designation Dept

1205 Kranthi 30000 Op Admin Admin
1204 Krian 40000 Hr Admin HR
1202 Manisha 45000 Proofreader PR
1201 Gopal 45000 Technical manager TP
1203 Masthanvali 40000 Technical writer TP
1204 Krian 40000 Hr Admin HR

The GROUP BY clause is used to group all the records in a result set using a particular collection
column. It is used to query a group of records.

Syntax

The syntax of GROUP BY clause is as follows:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]]

12
[LIMIT number];

Example

Let us take an example of SELECT…GROUP BY clause. Assume employee table as given

below, with Id, Name, Salary, Designation, and Dept fields. Generate a query to retrieve the
number of employees in each department.
+ + + + + +
| ID | Name | Salary | Designation | Dept |
+ + + + + +
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 45000 | Proofreader | PR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+ + + + + +
The following query retrieves the employee details using the above scenario.

hive> SELECT Dept,count(*) FROM employee GROUP BY DEPT;

On successful execution of the query, you get to see the following response:
+ + +
| Dept | Count(*) |
+ + +
|Admin | 1 |
|PR | 2 |
|TP | 3 |
+ + +

JDBC Program

Given below is the JDBC program to apply the Group By clause for the given example.
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];

public class HiveQLGroupBy {

private static String driverName = "[Link]";

public static void main(String[] args) throws SQLException {

// Register driver and create driver instance

[Link](driverName);

// get connection

13
Connection con = DriverManager.
getConnection("jdbc:hive://localhost:10000/userdb", "", "");

// create statement
Statement stmt = [Link]();

// execute statement
Resultset res = [Link](“SELECT Dept,count(*) ” + “FROM employee GROUP
BY DEPT; ”);
[Link](" Dept \t count(*)");

while ([Link]()) {
[Link]([Link](1) + " " + [Link](2));
}
[Link]();
}
}
Save the program in a file named [Link]. Use the following commands to compile
and execute this program.

$ javac [Link]
$ java HiveQLGroupBy

Output:

Dept Count(*)
Admin 1
PR 2
TP 3
JOIN is a clause that is used for combining specific fields from two tables by using values
common to each one. It is used to combine records from two or more tables in the database.

Syntax

join_table:

table_reference JOIN table_factor [join_condition]

| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference
join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition
| table_reference CROSS JOIN table_reference [join_condition]

Example

We will use the following two tables in this chapter. Consider the following table named
CUSTOMERS..

14
+ + + + + +
| ID | NAME | AGE | ADDRESS | SALARY |
+ + + + + +
| 1 | Ramesh | 32 | Ahmedabad | 2000.00 |
| 2 | Khilan | 25 | Delhi | 1500.00 |
| 3 | kaushik | 23 | Kota | 2000.00 |
| 4 | Chaitali | 25 | Mumbai | 6500.00 |
| 5 | Hardik | 27 | Bhopal | 8500.00 |
| 6 | Komal | 22 | MP | 4500.00 |
| 7 | Muffy | 24 | Indore | 10000.00 |
+ + + + + +
Consider another table ORDERS as follows:
+ + + + +
|OID | DATE | CUSTOMER_ID | AMOUNT |
+ + + + +
| 102 | 2009-10-08 00:00:00 | 3 | 3000 |
| 100 | 2009-10-08 00:00:00 | 3 | 1500 |
| 101 | 2009-11-20 00:00:00 | 2 | 1560 |
| 103 | 2008-05-20 00:00:00 | 4 | 2060 |
+ + + + +
There are different types of joins given as follows:

 JOIN
 LEFT OUTER JOIN
 RIGHT OUTER JOIN
 FULL OUTER JOIN

JOIN

JOIN clause is used to combine and retrieve the records from multiple tables. JOIN is same as
OUTER JOIN in SQL. A JOIN condition is to be raised using the primary keys and foreign keys
of the tables.
The following query executes JOIN on the CUSTOMER and ORDER tables, and retrieves the
records:
hive> SELECT [Link], [Link], [Link], [Link]
FROM CUSTOMERS c JOIN ORDERS o
ON ([Link] = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+ + + + +
| ID | NAME | AGE | AMOUNT |
+ + + + +
| 3 | kaushik | 23 | 3000 |
| 3 | kaushik | 23 | 1500 |
| 2 | Khilan | 25 | 1560 |

15
| 4 | Chaitali | 25 | 2060 |
+ + + + +

LEFT OUTER JOIN

The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there are no
matches in the right table. This means, if the ON clause matches 0 (zero) records in the right table,
the JOIN still returns a row in the result, but with NULL in each column from the right table.
A LEFT JOIN returns all the values from the left table, plus the matched values from the right
table, or NULL in case of no matching JOIN predicate.
The following query demonstrates LEFT OUTER JOIN between CUSTOMER and ORDER
tables:
hive> SELECT [Link], [Link], [Link], [Link]
FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o
ON ([Link] = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+ + + + +
| ID | NAME | AMOUNT | DATE |
+ + + + +
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
+ + + + +

RIGHT OUTER JOIN

The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if there are no
matches in the left table. If the ON clause matches 0 (zero) records in the left table, the JOIN still
returns a row in the result, but with NULL in each column from the left table.
A RIGHT JOIN returns all the values from the right table, plus the matched values from the left
table, or NULL in case of no matching join predicate.
The following query demonstrates RIGHT OUTER JOIN between the CUSTOMER and ORDER
tables.
notranslate"> hive> SELECT [Link], [Link], [Link], [Link] FROM CUSTOMERS c
RIGHT OUTER JOIN ORDERS o ON ([Link] = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+ + + + +

16
| ID | NAME | AMOUNT | DATE |
+ + + + +
|3 | kaushik | 3000 | 2009-10-08 00:00:00 |
|3 | kaushik | 1500 | 2009-10-08 00:00:00 |
|2 | Khilan | 1560 | 2009-11-20 00:00:00 |
|4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
+ + + + +

FULL OUTER JOIN

The HiveQL FULL OUTER JOIN combines the records of both the left and the right outer tables
that fulfil the JOIN condition. The joined table contains either all the records from both the tables,
or fills in NULL values for missing matches on either side.
The following query demonstrates FULL OUTER JOIN between CUSTOMER and ORDER
tables:
hive> SELECT [Link], [Link], [Link], [Link]
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON ([Link] = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+ + + + +
| ID | NAME | AMOUNT | DATE |
+ + + + +
|1 | Ramesh | NULL | NULL |
|2 | Khilan | 1560 | 2009-11-20 00:00:00 |
|3 | kaushik | 3000 | 2009-10-08 00:00:00 |
|3 | kaushik | 1500 | 2009-10-08 00:00:00 |
|4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
|5 | Hardik | NULL | NULL |
|6 | Komal | NULL | NULL |
|7 | Muffy | NULL | NULL |
|3 | kaushik | 3000 | 2009-10-08 00:00:00 |
|3 | kaushik | 1500 | 2009-10-08 00:00:00 |
|2 | Khilan | 1560 | 2009-11-20 00:00:00 |
|4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
+ + + + +

Bucketing #

• Bucketing concept is based on (hashing function on the bucketed column) mod (by total number
of buckets). The hash_function depends on the type of the bucketing column.

• Records with the same bucketed column will always be stored in the same bucket.

• We use CLUSTERED BY clause to divide the table into buckets.

17
• Physically, each bucket is just a file in the table directory, and Bucket numbering is 1-based.

• Bucketing can be done along with Partitioning on Hive tables and even without partitioning.

• Bucketed tables will create almost equally distributed data file parts, unless there is skew in data.

• Bucketing is enabled by setting [Link]= true;

Advantages

• Bucketed tables offer efficient sampling than by non-bucketed tables. With sampling, we can try
out queries on a fraction of data for testing and debugging purpose when the original data sets are
very huge.

• Asthe data files are equal sized parts, map-side joins will be faster on bucketed tables than non-
bucketed tables.

• Bucketingconcept also provides the flexibility to keep the records in each bucket to be sorted by
one or more columns. This makes map-side joins even more efficient, since the join of each bucket
becomes an efficient merge-sort.

Bucketing Vs Partitioning

• Partitioning
helps in elimination of data, if used in WHERE clause, where as bucketing helps in
organizing data in each partition into multiple files, so that the same set of data is always written
in same bucket.

• Bucketing helps a lot in joining of columns.

• Hive
Bucket is nothing but another technique of decomposing data or decreasing the data into
more manageable parts or equal parts.

Sampling

• TABLESAMPLE() gives more disordered and random records from a table as compared to
LIMIT. •We can sample using the rand() function, which returns a random number.

SELECT * from users TABLESAMPLE(BUCKET 3 OUT OF 10 ON rand()) s;

• Here
rand() refers to any random column. •The denominator in the bucket clause represents the
number of buckets into which data will be hashed. •The numerator is the bucket number selected.

SELECT * from users TABLESAMPLE(BUCKET 2 OUT OF 4 ON name) s;

• If the
columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED
BY clause, TABLESAMPLE queries only scan the required hash partitions of the table.

SELECT * FROM buck_users TABLESAMPLE(BUCKET 1 OUT OF 2 ON id) s LIMIT 1;

18
Joins and Types #

Reduce-Side Join

• If datasets are large, reduce side join takes place.

Map-Side Join

• Incase one of the dataset is small, map side join takes place. •In map side join, a local job runs to
create hash-table from content of HDFS file and sends it to every node.

SET [Link] =true;

Bucket Map Join

• The data must be bucketed on the keys used in the ON clause and the number of buckets for one
table must be a multiple of the number of buckets for the other table. •When these conditions are
met, Hive can join individual buckets between tables in the map phase, because it does not have
to fetch the entire content of one table to match against each bucket in the other table. •set
[Link] =true; •SET [Link] =true;

SMBM Join

• Sort-Merge-Bucket (SMB) joins can be converted to SMB map joins as well.

• SMB joins are used wherever the tables are sorted and bucketed.

• The join boils down to just merging the already sorted tables, allowing this operation to be faster
than an ordinary map-join.

• set [Link] =false;

• set [Link] =true;

• set [Link] = true;

LEFT SEMI JOIN

•A left semi-join returns records from the lefthand table if records are found in the righthand table
that satisfy the ON predicates.

• It’s a special, optimized case of the more general inner join.

• Most SQL dialects support an IN … EXISTS construct to do the same thing.

• SELECT and WHERE clauses can’t reference columns from the righthand table.

• Right semi-joins are not supported in Hive.

19
• The reason semi-joins are more efficient than the more general inner join is as follows:

• Fora given record in the lefthand table, Hive can stop looking for matching records in the
righthand table as soon as any match is found.

• At that point, the selected columns from the lefthand table record can be projected

•A file format is a way in which information is stored or encoded in a computer file.

• In Hive it refers to how records are stored inside the file.

• InputFormat reads key-value pairs from files.

• As we are dealing with structured data, each record has to be its own structure.

• How records are encoded in a file defines a file format.

• These file formats mainly vary between data encoding, compression rate, usage of space and disk
I/O.

• Hive does not verify whether the data that you are loading matches the schema for the table or
not. •However, it verifies if the file format matches the table definition or not.

SerDe in Hive #

• The SerDe interface allows you to instruct Hive as to how a record should be processed.

• A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De).

• TheDeserializer interface takes a string or binary representation of a record, and translates it into
a Java object that Hive can manipulate.

• The Serializer,
however, will take a Java object that Hive has been working with, and turn it into
something that Hive can write to HDFS or another supported system.

• Commonly, Deserializers are used at query time to execute SELECT statements, and Serializers
are used when writing data, such as through an INSERT-SELECT statement.

CSVSerDe

• Use ROW FORMAT SERDE ‘[Link]’

• Define following in SERDEPROPERTIES

( " separatorChar " = < value_of_separator

, " quoteChar " = < value_of_quote_character ,

" escapeChar “ = < value_of_escape_character

20
)

JSONSerDe

• Include [Link] •Use ROW FORMAT SERDE ’

[Link] ’

RegexSerDe

• It is used in case of pattern matching. •Use ROW FORMAT SERDE

'[Link]‘

• In SERDEPROPERTIES, define input pattern and output fields.

For Example

• [Link] = ‘(.)/(.)@(.*)’ •[Link]’ = ’ 1 s 2 s 3 s’;

USE PARTITIONING AND BUCKETING

• Partitioning
a table stores data in sub-directories categorized by table location, which allows Hive
to exclude unnecessary data from queries without reading all the data every time a new query is
made.

• Hive
does support Dynamic Partitioning (DP) where column values are only known at
EXECUTION TIME. To enable Dynamic Partitioning :

SET [Link] =true;

• Another situation we want to protect against dynamic partition insert is that the user may
accidentally specify all partitions to be dynamic partitions without specifying one static partition,
while the original intention is to just overwrite the sub-partitions of one root partition.

SET [Link] =strict;

To enable bucketing:

SET [Link] =true;

Optimizations in Hive #

• Use Denormalisation , Filtering and Projection as early as possible to reduce data before join.

• Joinis a costly affair and requires extra map-reduce phase to accomplish query job. With De-
normalisation, the data is present in the same table so there is no need for any joins, hence the
selects are very fast.

21
• As join requires data to be shuffled across nodes, use filtering and projection as early as possible
to reduce data before join.

TUNE CONFIGURATIONS

• To increase number of mapper, reduce split size :

SET [Link] =1000000; (~1 MB)

• Compress map/reduce output

SET [Link] =true;

• Parallel execution

• Applies toMapReduce jobs that can run in parallel, for example jobs processing different source
tables before a join.

SET [Link] =true;

USE ORCFILE

• Hivesupports ORCfile , a new table storage format that sports fantastic speed improvements
through techniques like predicate push-down, compression and more.

• Using
ORCFile for every HIVE table is extremely beneficial to get fast response times for your
HIVE queries.

USE TEZ

• With Hadoop2 and Tez , the cost of job submission and scheduling is minimized.

• AlsoTez does not restrict the job to be only Map followed by Reduce; this implies that all the
query execution can be done in a single job without having to cross job boundaries.

• Let’s look at an example. Consider a click-stream event table:

CREATE TABLE clicks (

timestamp date,
sessionID string,
url string,
source_ip string
)
STORED as ORC
tblproperties (“ [Link] ” = “SNAPPY”);

• Each record represents a click event, and we would like to find the latest URL for each sessionID

• One might consider the following approach:

22
SELECT [Link], [Link] FROM clicks inner join (select sessionID, max(timestamp)
as max_ts from clicks group by sessionID) latest ON [Link] = [Link] and
[Link] = latest.max_ts;

• Inthe above query, we build a sub-query to collect the timestamp of the latest event in each
session, and then use an inner join to filter out the rest.

• While the query is a reasonable solution —from a functional point of view— it turns out there’s
a better way to re-write this query as follows:

SELECT ranked_clicks.sessionID , ranked_clicks.url FROM (SELECT sessionID , url , RANK()

over (partition by sessionID,order by timestamp desc ) as rank FROM clicks) ranked_clicks
WHERE ranked_clicks.rank =1;

• Here,we use Hive’s OLAP functionality (OVER and RANK) to achieve the same thing, but
without a Join.

• Clearly,
removing an unnecessary join will almost always result in better performance, and when
using big data this is more important than ever.

MAKING MULTIPLE PASS OVER SAME DATA

• Hive has a special syntax for producing multiple aggregations from a single pass through a source
of data, rather than rescanning it for each aggregation.

• This change can save considerable processing time for large input data sets.

• For example, each of the following two queries creates a table from the same source table, history:

INSERT OVERWRITE TABLE sales

SELECT * FROM history WHERE action=‘purchased’;

INSERT OVERWRITE TABLE credits

SELECT * FROM history WHERE action=‘returned’;

Optimizations in Hive

• This syntax is correct, but inefficient.

• The following rewrite achieves the same thing, but using a single pass through the source history
table:

FROM history

INSERT OVERWRITE sales SELECT * WHERE action=‘purchased’

INSERT OVERWRITE credits SELECT * WHERE action=‘returned’;

Overview of Apache Hive Features
No ratings yet
Overview of Apache Hive Features
32 pages
Introduction to Hive and Pig in Hadoop
No ratings yet
Introduction to Hive and Pig in Hadoop
44 pages
Introduction to Hive and Pig in Hadoop
No ratings yet
Introduction to Hive and Pig in Hadoop
64 pages
Introduction to Hive and Pig in Big Data
No ratings yet
Introduction to Hive and Pig in Big Data
44 pages
Understanding Hive Map Types
No ratings yet
Understanding Hive Map Types
49 pages
Hive ODBC Integration in Big Data
No ratings yet
Hive ODBC Integration in Big Data
30 pages
Introduction to Hive for Big Data
No ratings yet
Introduction to Hive for Big Data
76 pages
Hive Overview: Features, Limitations, and Workflow
No ratings yet
Hive Overview: Features, Limitations, and Workflow
39 pages
Overview of Apache Hive and Its Features
No ratings yet
Overview of Apache Hive and Its Features
44 pages
Understanding Hive in Big Data
No ratings yet
Understanding Hive in Big Data
30 pages
Overview of Apache Hive Features
No ratings yet
Overview of Apache Hive Features
30 pages
Chapter+9+ HIVE
No ratings yet
Chapter+9+ HIVE
50 pages
Introduction to Apache Hive Overview
No ratings yet
Introduction to Apache Hive Overview
28 pages
Introduction to Apache Hive Framework
No ratings yet
Introduction to Apache Hive Framework
26 pages
Hive: Big Data Processing Overview
No ratings yet
Hive: Big Data Processing Overview
43 pages
Apache Hive for Data Analysts
No ratings yet
Apache Hive for Data Analysts
8 pages
Introduction to Apache Hive and Big Data
No ratings yet
Introduction to Apache Hive and Big Data
59 pages
Introduction to Hive in Big Data Analytics
No ratings yet
Introduction to Hive in Big Data Analytics
22 pages
Introduction to Hive and Pig Basics
No ratings yet
Introduction to Hive and Pig Basics
34 pages
Hive Database Commands for Big Data Analysis
No ratings yet
Hive Database Commands for Big Data Analysis
8 pages
Overview of Apache Hive Architecture
No ratings yet
Overview of Apache Hive Architecture
36 pages
Understanding Apache Hive and Big Data
No ratings yet
Understanding Apache Hive and Big Data
29 pages
Overview of Apache Hive Data Formats
100% (1)
Overview of Apache Hive Data Formats
47 pages
Apache Hive Overview and Installation Guide
No ratings yet
Apache Hive Overview and Installation Guide
19 pages
Apache Hive: Data Warehouse Tool Overview
No ratings yet
Apache Hive: Data Warehouse Tool Overview
10 pages
Overview of Hive and Its Evolution
No ratings yet
Overview of Hive and Its Evolution
9 pages
Hive Execution Engine Overview
No ratings yet
Hive Execution Engine Overview
18 pages
Hive in Big Data: Overview and Usage
100% (1)
Hive in Big Data: Overview and Usage
24 pages
Hive: Big Data SQL Query Tool
No ratings yet
Hive: Big Data SQL Query Tool
75 pages
Big Data Systems: Hive & PIG Overview
No ratings yet
Big Data Systems: Hive & PIG Overview
73 pages
Introduction to Hive in Big Data
No ratings yet
Introduction to Hive in Big Data
69 pages
Understanding Apache Hive Overview
No ratings yet
Understanding Apache Hive Overview
51 pages
Understanding Apache Hive: Features & Architecture
No ratings yet
Understanding Apache Hive: Features & Architecture
52 pages
Big Data Processing with Pig and Hive
No ratings yet
Big Data Processing with Pig and Hive
23 pages
Bigdata Unit 4
No ratings yet
Bigdata Unit 4
13 pages
Apache Hive and HBase Overview Guide
No ratings yet
Apache Hive and HBase Overview Guide
57 pages
Understanding HIVE for Big Data Analytics
No ratings yet
Understanding HIVE for Big Data Analytics
20 pages
Hive Overview and Architecture in BDA
No ratings yet
Hive Overview and Architecture in BDA
23 pages
Overview of Apache Hive Architecture
No ratings yet
Overview of Apache Hive Architecture
27 pages
Introduction to Apache Hive Overview
No ratings yet
Introduction to Apache Hive Overview
39 pages
Overview of Apache Hive in Big Data
No ratings yet
Overview of Apache Hive in Big Data
65 pages
Introduction to Hive in Big Data
No ratings yet
Introduction to Hive in Big Data
17 pages
Big Data Unit 4
No ratings yet
Big Data Unit 4
12 pages
Hive
No ratings yet
Hive
45 pages
Apache Hive: Tools and Features Overview
No ratings yet
Apache Hive: Tools and Features Overview
34 pages
Unit 4
No ratings yet
Unit 4
18 pages
Hive Architecture and Query Process Explained
No ratings yet
Hive Architecture and Query Process Explained
5 pages
Understanding Apache Hive: Features & Use Cases
No ratings yet
Understanding Apache Hive: Features & Use Cases
11 pages
Understanding Apache Hive Architecture
No ratings yet
Understanding Apache Hive Architecture
35 pages
Hive and Pig: Overview and Architecture
No ratings yet
Hive and Pig: Overview and Architecture
33 pages
Understanding Apache Hive in Big Data
No ratings yet
Understanding Apache Hive in Big Data
19 pages
Overview of Hive and Its Architecture
No ratings yet
Overview of Hive and Its Architecture
12 pages
Introduction to Apache Hive Basics
No ratings yet
Introduction to Apache Hive Basics
17 pages
Understanding Apache Hive Architecture
No ratings yet
Understanding Apache Hive Architecture
25 pages
Hive: SQL-Based Data Warehousing in Hadoop
No ratings yet
Hive: SQL-Based Data Warehousing in Hadoop
52 pages
Apache Hive: Data Warehouse & HiveQL Guide
No ratings yet
Apache Hive: Data Warehouse & HiveQL Guide
45 pages
Understanding Apache Hive for Hadoop
No ratings yet
Understanding Apache Hive for Hadoop
18 pages
Hadoop Ecosystem: Hive, Pig, Spark Overview
No ratings yet
Hadoop Ecosystem: Hive, Pig, Spark Overview
27 pages
Configuring Hive Metadata in RDBMS
No ratings yet
Configuring Hive Metadata in RDBMS
22 pages
Apache Spark Interview Questions Guide
No ratings yet
Apache Spark Interview Questions Guide
59 pages
Data Science Interview Prep Guide
60% (5)
Data Science Interview Prep Guide
63 pages
1000+ SQL Interview Questions & Answers
No ratings yet
1000+ SQL Interview Questions & Answers
1,261 pages
SQL PDF
100% (13)
SQL PDF
221 pages
Data Analyst Interview Questions
71% (7)
Data Analyst Interview Questions
28 pages
Data Lakehouse Solutions for Team Alignment
100% (8)
Data Lakehouse Solutions for Team Alignment
64 pages
Snowflake Scenario-Based Interview Questions
100% (2)
Snowflake Scenario-Based Interview Questions
20 pages
SQL Interview Questions & Answers Guide
88% (43)
SQL Interview Questions & Answers Guide
48 pages
SQL Scenario-Based Interview Questions
100% (3)
SQL Scenario-Based Interview Questions
23 pages
Understanding Apache Spark and RDDs
100% (3)
Understanding Apache Spark and RDDs
197 pages
SQL - With Practice Exercises, Learn SQL Fast (PDFDrive) PDF
100% (6)
SQL - With Practice Exercises, Learn SQL Fast (PDFDrive) PDF
167 pages
Databricks Data Engineer Study Guide
80% (5)
Databricks Data Engineer Study Guide
157 pages
Databricks Data Engineer Associate Exam Prep
100% (1)
Databricks Data Engineer Associate Exam Prep
23 pages
Snowflake Data Warehousing Essentials
100% (1)
Snowflake Data Warehousing Essentials
40 pages
Data Engineer Interview Preparation Guide
100% (1)
Data Engineer Interview Preparation Guide
13 pages
Snowflake Tutorial and Key Concepts Guide
91% (11)
Snowflake Tutorial and Key Concepts Guide
67 pages
Comprehensive SQL Notes for Beginners
80% (55)
Comprehensive SQL Notes for Beginners
18 pages
SQL Basics and Advanced Techniques
90% (10)
SQL Basics and Advanced Techniques
101 pages
SIMPLE SQL Begginers Guide To Master SQL and Boost Career
90% (10)
SIMPLE SQL Begginers Guide To Master SQL and Boost Career
425 pages
PySpark Interview Questions Guide
100% (5)
PySpark Interview Questions Guide
126 pages
SQL Commands Cheat Sheet
82% (11)
SQL Commands Cheat Sheet
1 page
Azure Databricks Course Overview
80% (5)
Azure Databricks Course Overview
169 pages
Databricks SQL & PySpark Cheat Sheet
100% (4)
Databricks SQL & PySpark Cheat Sheet
11 pages
SQL Interview Questions and Answers Guide
60% (10)
SQL Interview Questions and Answers Guide
24 pages
SQL Query Techniques and Examples
100% (6)
SQL Query Techniques and Examples
194 pages
Databricks Data Engineer Exam Overview
100% (2)
Databricks Data Engineer Exam Overview
103 pages
Databricks Clusters: Types & Management Guide
100% (1)
Databricks Clusters: Types & Management Guide
29 pages
Databricks Tutorial Overview and Setup
100% (1)
Databricks Tutorial Overview and Setup
131 pages
Databricks Certified Associate Data Engineer Exam
No ratings yet
Databricks Certified Associate Data Engineer Exam
142 pages
Data Engineering With Databricks
100% (2)
Data Engineering With Databricks
63 pages
Tower of Babylon Problem on SPOJ
No ratings yet
Tower of Babylon Problem on SPOJ
2 pages
Unity Feedback System Analysis
100% (1)
Unity Feedback System Analysis
29 pages
LTE Initial Attach Procedure Overview
No ratings yet
LTE Initial Attach Procedure Overview
2 pages
European Roulette Game Rules Guide
No ratings yet
European Roulette Game Rules Guide
10 pages
Toshiba MQ04ABF100 Driver Status
No ratings yet
Toshiba MQ04ABF100 Driver Status
6 pages
Senior Oracle Hyperion Consultant Profile
No ratings yet
Senior Oracle Hyperion Consultant Profile
6 pages
Top SSD NVMe Hosting Services
No ratings yet
Top SSD NVMe Hosting Services
5 pages
C# Class Design Exercises Guide
No ratings yet
C# Class Design Exercises Guide
8 pages
Domain 0 - Introduction
No ratings yet
Domain 0 - Introduction
12 pages
Icon DP Sys - User Manual
No ratings yet
Icon DP Sys - User Manual
338 pages
Cyber Security and Cryptography Module
No ratings yet
Cyber Security and Cryptography Module
13 pages
Technical Documentation in Software Development
100% (1)
Technical Documentation in Software Development
4 pages
Ahmed Al Shaer: Site Engineer Profile
No ratings yet
Ahmed Al Shaer: Site Engineer Profile
1 page
Microprocessors and Microsystems
No ratings yet
Microprocessors and Microsystems
33 pages
Understanding Device Driver Theory
No ratings yet
Understanding Device Driver Theory
53 pages
Dispenser Operator Guid
No ratings yet
Dispenser Operator Guid
24 pages
Autopsy Answers
No ratings yet
Autopsy Answers
11 pages
USB Peripheral Support in NetBSD
No ratings yet
USB Peripheral Support in NetBSD
7 pages
Bca Syllabus
No ratings yet
Bca Syllabus
40 pages
Moxa Ia240 Series Datasheet v1.0
No ratings yet
Moxa Ia240 Series Datasheet v1.0
5 pages
998 20312057 Product Selection Guide EcoStruxureBuilding
No ratings yet
998 20312057 Product Selection Guide EcoStruxureBuilding
80 pages
Roche cobas 8100 Automation Overview
No ratings yet
Roche cobas 8100 Automation Overview
7 pages
Become a Certified AdWords Expert
No ratings yet
Become a Certified AdWords Expert
3 pages
Avinash Doce
No ratings yet
Avinash Doce
2 pages
Decimal, Binary, Hexadecimal Conversions
No ratings yet
Decimal, Binary, Hexadecimal Conversions
18 pages
IFS/Fixed Assets: IIF FS S A AP PP PL Liic CA AT Tiio ON NS S
No ratings yet
IFS/Fixed Assets: IIF FS S A AP PP PL Liic CA AT Tiio ON NS S
2 pages
CCFP4 0-RDBMSAssignments
No ratings yet
CCFP4 0-RDBMSAssignments
64 pages
PowerShell Commands Cheat Sheet
No ratings yet
PowerShell Commands Cheat Sheet
9 pages
Optimize SAP Spooler Performance Guide
No ratings yet
Optimize SAP Spooler Performance Guide
6 pages
Software Quality and Testing Syllabus
No ratings yet
Software Quality and Testing Syllabus
8 pages