UNIT IV
APACHE HIVE:
Introduction, Architecture and components - Data types and data models - HIVE
partitioning and bucketing - HIVE tables
Prerequisites: a basic knowledge of Core Java, SQL database concepts, the Hadoop file system, and
any flavor of the Linux operating system.
Hive - Introduction
The term ‘Big Data’ is used for collections of large datasets that include huge volume, high
velocity, and a variety of data that is increasing day by day. Using traditional data management
systems, it is difficult to process Big Data. Therefore, the Apache Software Foundation
introduced a framework called Hadoop to solve Big Data management and processing
challenges.
What is Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Hive was initially developed by Facebook. Later, the Apache Software Foundation took it up and
developed it further as open-source software under the name Apache Hive. It is used by many
companies; for example, Amazon uses it in Amazon Elastic MapReduce.
Features of Hive
The major features of Hive for big data processing include:
Open Source: Hive is open-source software, making it freely accessible to users
worldwide for big data use cases and implementations.
SQL-like Interface: Hive provides a SQL-like query language called HiveQL, allowing
users to interact with data using familiar SQL syntax and commands.
Data Warehousing: Hive is designed for data warehousing tasks, enabling effective
analysis of datasets stored in the Hadoop Distributed File System (HDFS).
Scalability: Hive is highly scalable because it can work on datasets distributed across
the many nodes of a Hadoop cluster.
Integration with Hadoop Ecosystem: Hive integrates with components of the Hadoop
ecosystem, such as HDFS and MapReduce, making it a helpful tool for big data
processing.
Data Management: Hive maintains metadata such as tables and columns in a metastore,
making it easier for the user to manage data.
Architecture and components of Hive
The following component diagram depicts the architecture of Hive:
This architecture enables users to analyze big data using SQL-like queries without writing
complex programs. The layered design supports scalability, fault tolerance, and efficient
processing of large datasets. The diagram shows how a Hive query flows from the user to
Hadoop for execution. The architecture is described below in six parts, from the clients at the
top to HDFS at the bottom.
1. Hive Clients
This is the top layer of the architecture. It represents the different ways users can connect to Hive.
Thrift Server
Allows non-Java applications to communicate with Hive using RPC (Remote Procedure
Call).
JDBC Driver
Enables Java applications and BI tools to connect to Hive.
ODBC Driver
Used by Windows-based applications and reporting tools to access Hive.
All clients send HiveQL queries to Hive services.
2. Hive Services
This layer acts as a bridge between users and Hive processing.
Hive Web UI
A browser-based interface for submitting Hive queries.
Hive Server
Accepts queries from multiple clients and manages authentication and sessions.
CLI (Command Line Interface)
Allows users to interact with Hive using commands.
All these services forward the query to the Hive Driver.
3. Hive Driver
The Hive Driver is the core controller of Hive query execution.
Functions of Hive Driver:
Receives HiveQL queries from Hive services
Sends the query for parsing and optimization
Interacts with the Metastore to get metadata
Converts queries into execution plans
Submits jobs to MapReduce
Monitors execution and returns results to the client
4. Metastore
The Metastore stores metadata information such as:
Database and table names
Column names and data types
Partition and bucket details
Location of data in HDFS
The Hive Driver uses this metadata to understand the structure of the data before executing queries.
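To make the role of the Metastore concrete, the following minimal Python sketch models the metadata record of one table. It is only an illustration: the class TableMetadata, its field names, the students table and the HDFS path are hypothetical, and the real Metastore keeps this information in a relational database rather than in Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """Hypothetical stand-in for one Metastore entry (not Hive's real schema)."""
    database: str
    table: str
    columns: dict            # column name -> Hive type name
    partition_keys: list = field(default_factory=list)
    num_buckets: int = 0     # 0 means the table is not bucketed
    location: str = ""       # where the table data lives in HDFS


# Illustrative entry; every name and the path are made up.
students = TableMetadata(
    database="college",
    table="students",
    columns={"id": "INT", "name": "STRING"},
    partition_keys=["dept"],
    num_buckets=2,
    location="/user/hive/warehouse/college.db/students",
)


def describe(meta: TableMetadata) -> str:
    """Summarise what the driver needs to know before planning a query."""
    cols = ", ".join(f"{name} {typ}" for name, typ in meta.columns.items())
    return f"{meta.database}.{meta.table}({cols}) at {meta.location}"


print(describe(students))
```

The record holds only a description of the data and a pointer to its location; the rows themselves stay in HDFS.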
5. MapReduce Layer
HiveQL queries are converted into MapReduce jobs.
MapReduce performs parallel processing on large datasets.
It processes data stored in HDFS.
Modern Hive can also use Tez or Spark, but this diagram shows MapReduce.
6. HDFS (Hadoop Distributed File System)
This is the storage layer at the bottom of the architecture.
Stores actual Hive table data
Stores intermediate and final results
Provides fault tolerance and scalability
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
Query Execution Flow
1. User submits a query using Hive Client
2. Query goes to Hive Services
3. Hive Driver processes the query
4. Metadata is fetched from Metastore
5. Query is converted into MapReduce job
6. Data is read from HDFS
7. Results are returned to the user
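The query execution flow can be sketched as a toy pipeline in Python. This is a simplified illustration, not Hive's implementation: the function names, the in-memory METASTORE and HDFS dictionaries, and the one-statement "parser" are all hypothetical stand-ins for the real components.

```python
# Toy stand-ins for the Metastore and HDFS (hypothetical data).
METASTORE = {"students": {"columns": ["id", "name", "dept"],
                          "location": "/warehouse/students"}}
HDFS = {"/warehouse/students": [(1, "Asha", "EEE"), (2, "Ravi", "CSE")]}


def parse(query: str) -> dict:
    """Steps 1-3: accept a query of the form 'SELECT <col> FROM <table>'."""
    words = query.split()
    if len(words) != 4 or words[0].upper() != "SELECT" or words[2].upper() != "FROM":
        raise ValueError("this sketch only understands SELECT <col> FROM <table>")
    return {"column": words[1], "table": words[3]}


def plan(parsed: dict) -> dict:
    """Steps 4-5: look up metadata and build a (very small) execution plan."""
    meta = METASTORE[parsed["table"]]
    return {"path": meta["location"],
            "index": meta["columns"].index(parsed["column"])}


def execute(job: dict) -> list:
    """Steps 6-7: read rows from storage and return the requested column."""
    return [row[job["index"]] for row in HDFS[job["path"]]]


def run(query: str) -> list:
    """Chain the stages in the same order as the numbered flow."""
    return execute(plan(parse(query)))


print(run("SELECT name FROM students"))
```

In real Hive the plan stage produces one or more MapReduce (or Tez or Spark) jobs rather than a single lookup, but the order of the stages is the same.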
Hive - Data Types
Datatypes are classified into two types:
Primitive Data Types
Collection (Complex) Data Types
1. Primitive Data Types
Primitive data types are the basic built-in types, each holding a single value. The
important primitive data types are listed below:
Type        Size (bytes)             Example
TINYINT     1                        20
SMALLINT    2                        20
INT         4                        20
BIGINT      8                        20
BOOLEAN     true/false               FALSE
DOUBLE      8                        10.2222
FLOAT       4                        10.2222
STRING      Sequence of characters   ABCD
TIMESTAMP   Integer/float/string     2/3/2012 12:34:56:1234567
DATE        Integer/float/string     2/3/2019
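The byte sizes in the table determine the range each integer type can hold. As a small Python sketch, assuming signed two's-complement storage, a type of n bytes covers the range from -2^(8n-1) to 2^(8n-1) - 1. The helper name signed_range is only illustrative.

```python
# Size in bytes of each Hive integer type, taken from the table above.
INT_SIZES = {"TINYINT": 1, "SMALLINT": 2, "INT": 4, "BIGINT": 8}


def signed_range(num_bytes: int) -> tuple:
    """Return (min, max) for a signed two's-complement integer of this size."""
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


for name, size in INT_SIZES.items():
    low, high = signed_range(size)
    print(f"{name:<9} {size} byte(s): {low} to {high}")
```

For example, a TINYINT column can hold values from -128 to 127, so the sample value 20 fits in every integer type listed.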
Hive data types are implemented using Java.
1. Character arrays are not supported in Hive.
Hive mainly works with string data types instead of fixed character arrays.
2. Hive uses delimiters to separate fields.
This delimiter-based storage improves read and write performance when working with
Hadoop.
3. Column length need not be specified.
While creating Hive tables, specifying the length of each column is not mandatory.
4. String literals in Hive.
String values can be written using:
Single quotes ' '
Double quotes " "
5. VARCHAR data type.
Introduced in Hive 0.12.0.
Supports a length from 1 to 65,535 characters.
Extra characters are truncated if the value exceeds the defined length.
Length is measured in characters, not bytes.
6. Integer literals.
Integer literals are treated as INT by default.
If the value exceeds the INT range, it is interpreted as BIGINT.
The suffixes Y, S, and L mark a literal as TINYINT, SMALLINT, or BIGINT respectively.
7. DECIMAL data type.
Stores numeric values exactly.
Provides better precision than DOUBLE.
Suitable for financial and other calculations that must be exact.
8. DOUBLE data type.
Stores approximate values.
Less accurate compared to DECIMAL.
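Two of the points above, VARCHAR truncation and the difference between DECIMAL and DOUBLE, can be illustrated outside Hive with a short Python sketch. Python's decimal.Decimal and float are used here only as analogies for DECIMAL and DOUBLE, and varchar is a hypothetical helper, not a Hive function.

```python
from decimal import Decimal


def varchar(value: str, max_len: int) -> str:
    """Mimic VARCHAR(max_len): keep at most max_len characters (not bytes)."""
    return value[:max_len]


# Truncation counts characters, so a non-ASCII character still counts as one.
short_ascii = varchar("Hadoop", 3)
short_accent = varchar("café au lait", 4)
print(short_ascii, len(short_ascii))
print(short_accent, len(short_accent))

# A binary floating-point value (analogous to DOUBLE) is approximate.
double_sum_is_exact = (0.1 + 0.2 == 0.3)
print(double_sum_is_exact)

# An exact decimal value (analogous to DECIMAL) keeps the sum exact.
decimal_sum_is_exact = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))
print(decimal_sum_is_exact)
```

The floating-point comparison fails because 0.1 and 0.2 have no exact binary representation, which is the reason DECIMAL is preferred for financial values.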
2. Collection / Complex Data Types
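Hive has four collection data types, which are also called complex data types:
ARRAY: an ordered sequence of elements of the same type
MAP: a set of key-value pairs
STRUCT: a record with named fields, each with its own type
UNIONTYPE: a value that holds exactly one of several declared types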
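As a rough analogy, the four complex types map onto familiar Python structures. The sketch below is illustrative only: the record, its field names and its values are made up, and Python's typing is much looser than Hive's.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Address:
    """Plays the role of STRUCT<city:STRING, pin:INT>: named, typed fields."""
    city: str
    pin: int


@dataclass
class Student:
    subjects: List[str]          # like ARRAY<STRING>: ordered, same element type
    marks: Dict[str, int]        # like MAP<STRING,INT>: key-value pairs
    address: Address             # like STRUCT<...>
    contact: Union[int, str]     # like UNIONTYPE<INT,STRING>: one type at a time


s = Student(
    subjects=["Maths", "Physics"],
    marks={"Maths": 90, "Physics": 85},
    address=Address(city="Chennai", pin=600001),
    contact="student@example.com",
)

print(s.subjects[0])        # array element by index
print(s.marks["Physics"])   # map value by key
print(s.address.city)       # struct field by name
```

HiveQL uses similar access forms: an index for ARRAY elements, a key for MAP values, and dot notation for STRUCT fields.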
Hive Data Model
Apache Hive is an open-source data warehouse system built on top of Hadoop. It is used for
querying and analyzing large datasets stored in Hadoop file systems such as HDFS. Hive
supports structured and semi-structured data and allows users to write queries using SQL-like
language (HiveQL).
In Hive, data is logically organized to make storage and querying efficient. The Hive
Data Model categorizes data into three levels:
1. Table
2. Partition
3. Bucket
1. Table
A Hive table is similar to a table in a relational database. It logically stores data, while its
metadata (schema, column names, data types, location) is stored in the Metastore.
Hive supports operations such as SELECT with filtering (WHERE), JOIN, and UNION on tables.
Types of Hive Tables
a) Managed Table
By default, Hive creates a Managed Table.
When data is loaded, Hive moves the data into its warehouse directory.
When the table is dropped, both data and metadata are deleted.
Data is permanently deleted.
b) External Table
Hive does not manage the data.
Data location is specified outside the warehouse directory.
When the table is dropped, only metadata is deleted, not the data.
Useful when data is shared with other tools.
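The difference in drop behaviour can be mimicked with ordinary directories. The following Python sketch is an analogy only: a temporary local directory stands in for HDFS, a dictionary stands in for the Metastore, and create_table and drop_table are hypothetical helpers, not Hive APIs.

```python
import os
import shutil
import tempfile

metastore = {}  # table name -> {"location": path, "external": bool}


def create_table(name: str, location: str, external: bool) -> None:
    """Register the table and make sure its data directory exists."""
    os.makedirs(location, exist_ok=True)
    metastore[name] = {"location": location, "external": external}


def drop_table(name: str) -> None:
    """Always remove the metadata; remove the data only for managed tables."""
    entry = metastore.pop(name)
    if not entry["external"]:
        shutil.rmtree(entry["location"])


root = tempfile.mkdtemp()
managed_dir = os.path.join(root, "warehouse", "sales")
external_dir = os.path.join(root, "shared", "logs")

create_table("sales", managed_dir, external=False)
create_table("logs", external_dir, external=True)

drop_table("sales")
drop_table("logs")

managed_exists = os.path.exists(managed_dir)    # managed data is gone
external_exists = os.path.exists(external_dir)  # external data is still there
print(managed_exists, external_exists)

shutil.rmtree(root)  # clean up the temporary directory
```

After both drops the stand-in metastore is empty, but only the managed table's directory has been removed.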
2. Partition
Partitioning divides a table into smaller parts based on a column called the partition key.
Each partition stores data of a specific category.
Physically, a partition is a sub-directory inside the table directory.
Partitioning improves query performance by scanning only required data.
For example, if a student table is partitioned by department, queries for EEE students scan
only the EEE partition, not the whole table.
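The directory layout and the pruning effect can be sketched in Python. This is an illustration under assumptions: the students table, the dept partition key, the warehouse path and the helper names are hypothetical; only the key=value naming of partition sub-directories follows Hive's convention.

```python
WAREHOUSE = "/user/hive/warehouse/students"  # hypothetical table directory

# Rows grouped by the value of the partition key "dept".
partitions = {
    "EEE": [(1, "Asha"), (4, "Kiran")],
    "CSE": [(2, "Ravi")],
    "ECE": [(3, "Meena")],
}


def partition_dir(dept: str) -> str:
    """Each partition is a sub-directory named <key>=<value>."""
    return f"{WAREHOUSE}/dept={dept}"


def scan(dept: str) -> tuple:
    """Read only the partition that matches the filter (partition pruning)."""
    rows = partitions.get(dept, [])
    return partition_dir(dept), rows


path, rows = scan("EEE")
total_rows = sum(len(r) for r in partitions.values())
print(path)
print(f"scanned {len(rows)} of {total_rows} rows")
```

Because the filter is on the partition key, the other department directories are never opened.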
3. Buckets
Bucketing further divides tables or partitions into a fixed number of files based on a hash
function of a column.
Each bucket is stored as a file.
Helps in efficient joins, sampling, and querying.
For example, with 2 buckets, each partition will have 2 bucket files.
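Bucket assignment can be sketched as a hash taken modulo the number of buckets. The Python code below is a simplified illustration: it assumes an integer bucketing column whose hash is the value itself, whereas the hash function Hive actually applies depends on the column type and the Hive version. The names bucket_for and NUM_BUCKETS are hypothetical.

```python
from collections import defaultdict

NUM_BUCKETS = 2  # fixed when the table is created


def bucket_for(student_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Assumed rule: bucket = hash(column value) mod number of buckets.

    For simplicity the hash of an integer is taken to be the integer itself.
    """
    return student_id % num_buckets


buckets = defaultdict(list)
for student_id in [101, 102, 103, 104, 105]:
    buckets[bucket_for(student_id)].append(student_id)

for number in sorted(buckets):
    print(f"bucket file {number}: {buckets[number]}")
```

Since the same value always lands in the same bucket, two tables bucketed on the same column with the same bucket count can be joined bucket by bucket, and a single bucket can serve as a sample.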
HIVE Tables
In Apache Hive, a table is a logical structure used to store and organize data in Hadoop.
Hive tables are similar to tables in a relational database, but the actual data is stored in HDFS,
while the metadata (schema, column names, data types, location) is stored in the Hive
Metastore.
Hive tables allow users to perform operations such as SELECT with filtering (WHERE), JOIN,
GROUP BY, and UNION using HiveQL.
Hive tables provide a structured way to store and analyze big data in Hadoop.
Managed tables give full control to Hive, while external tables offer flexibility and data safety.
Choosing the correct table type improves data management and performance.
Types of HIVE Tables
Hive mainly supports two types of tables:
1. Managed (Internal) Table
2. External Table
1. Managed (Internal) Table
This is the default table type in Hive.
Hive manages both data and metadata.
When data is loaded, Hive moves the data into the Hive warehouse directory.
When the table is dropped, both data and metadata are deleted.
Example
Data is permanently removed from HDFS.
2. External Table
Hive does not manage the data, only the metadata.
Data location is specified explicitly during table creation.
When the table is dropped, only the metadata is deleted; the data remains in place.
Example