
Big Data Unit 4

Apache Hive is a data warehouse infrastructure tool built on Hadoop for processing structured data, providing a SQL-like interface for querying and analyzing large datasets. It features an architecture with multiple components including Hive Clients, Hive Services, Hive Driver, Metastore, and MapReduce, allowing for efficient data management and analysis. Hive supports various data types and models, including managed and external tables, partitioning, and bucketing, enhancing performance and scalability in big data applications.

Uploaded by

shanmukhsai005
Copyright
© All Rights Reserved

UNIT IV

APACHE HIVE:
Introduction, Architecture and components - Data types and data models - HIVE
partitioning and bucketing - HIVE tables

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Prerequisites: a basic knowledge of Core Java, SQL database concepts, the Hadoop file system, and
any flavor of the Linux operating system.
Hive - Introduction

The term ‘Big Data’ is used for collections of large datasets that include huge volume, high
velocity, and a variety of data that is increasing day by day. Using traditional data management
systems, it is difficult to process Big Data. Therefore, the Apache Software Foundation
introduced a framework called Hadoop to solve Big Data management and processing
challenges.

What is Hive

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes querying and analyzing easy.

Hive was initially developed by Facebook. The Apache Software Foundation later took it up and
developed it further as open-source software under the name Apache Hive. It is used by many
companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Features of Hive
The major features of Hive for big data processing include:
 Open Source: Hive is open-source software, making it easily accessible to users
worldwide for big data use cases and implementations.

 SQL-like Interface: Hive provides a SQL-like query language called HiveQL, allowing
users to interact with data using familiar SQL syntax and commands.
 Data Warehousing: Hive is designed for data warehousing tasks, enabling effective
analysis of datasets stored in the Hadoop Distributed File System (HDFS).
 Scalability: Hive is highly scalable because it can work on datasets spread across the
many machines of a cluster.

 Integration with Hadoop Ecosystem: Hive integrates with the components of the
Hadoop ecosystem, such as HDFS and MapReduce, making it a helpful tool for big
data processing.

 Data Management: Hive maintains metadata such as tables, columns, and partitions in a
metastore, making it easier for the user to manage data.

Architecture and components of Hive

The following component diagram depicts the architecture of Hive:

This Hive architecture enables users to analyze big data using SQL-like queries without writing
complex programs. The layered architecture supports scalability, fault tolerance, and efficient
processing of large datasets. The diagram shows how a Hive query flows from the user to
Hadoop for execution. The layers and components are described below, from the clients at the
top to HDFS storage at the bottom.

1. Hive Clients
This is the top layer of the architecture. It represents the different ways users can connect to Hive.
 Thrift Server
Allows non-Java applications to communicate with Hive using RPC (Remote Procedure
Call).
 JDBC Driver
Enables Java applications and BI tools to connect to Hive.
 ODBC Driver
Used by Windows-based applications and reporting tools to access Hive.
These clients send HiveQL queries to Hive services.

2. Hive Services
This layer acts as a bridge between users and Hive processing.
 Hive Web UI
A browser-based interface for submitting Hive queries.
 Hive Server
Accepts queries from multiple clients and manages authentication and sessions.
 CLI (Command Line Interface)
Allows users to interact with Hive using commands.
All these services forward the query to the Hive Driver.

3. Hive Driver
The Hive Driver is the core controller of Hive query execution.
Functions of Hive Driver:
 Receives HiveQL queries from Hive services
 Sends the query for parsing and optimization
 Interacts with the Metastore to get metadata
 Converts queries into execution plans
 Submits jobs to MapReduce
 Monitors execution and returns results to the client

4. Metastore
The Metastore stores metadata information such as:
 Database and table names
 Column names and data types
 Partition and bucket details
 Location of data in HDFS
The Hive Driver uses this metadata to understand the structure of the data before executing queries.

5. MapReduce Layer
 HiveQL queries are converted into MapReduce jobs.
 MapReduce performs parallel processing on large datasets.
 It processes data stored in HDFS.
Modern Hive can also use Tez or Spark, but this diagram shows MapReduce.

6. HDFS (Hadoop Distributed File System)


This is the storage layer at the bottom of the architecture.
 Stores actual Hive table data
 Stores intermediate and final results
 Provides fault tolerance and scalability

Working of Hive

The following diagram depicts the workflow between Hive and Hadoop.

Query Execution Flow


1. User submits a query using Hive Client
2. Query goes to Hive Services
3. Hive Driver processes the query
4. Metadata is fetched from Metastore
5. Query is converted into MapReduce job
6. Data is read from HDFS
7. Results are returned to the user
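
The flow above can be sketched as a toy model in Python. This is an illustration only, not Hive's actual code: the `METASTORE` and `HDFS` dictionaries, the `student` table, its rows, and the `run_query` function are all hypothetical stand-ins, and the "job" is an in-memory loop rather than a real MapReduce job.

```python
# Toy model of the Hive query flow. All names and data here are hypothetical.

# Metastore stand-in: table name -> schema and data location (metadata only).
METASTORE = {
    "student": {
        "columns": ["id", "name", "dept"],
        "location": "/user/hive/warehouse/student",
    }
}

# HDFS stand-in: location -> delimited text lines (the actual table data).
HDFS = {
    "/user/hive/warehouse/student": ["1,Asha,EEE", "2,Ravi,CSE"],
}


def run_query(table):
    """Trace a 'SELECT * FROM <table>' style query through the layers."""
    steps = ["client -> services: query submitted"]
    steps.append("driver: query received")
    meta = METASTORE[table]  # metadata lookup before execution
    steps.append("metastore: schema and location fetched")
    steps.append("driver: execution plan created, job submitted")
    rows = [
        dict(zip(meta["columns"], line.split(",")))
        for line in HDFS[meta["location"]]  # data is read from storage
    ]
    steps.append("hdfs: data read")
    steps.append("client: results returned")
    return steps, rows


steps, rows = run_query("student")
for step in steps:
    print(step)
print(rows)
```

The point of the sketch is the separation of concerns: the metastore holds only the schema and location, while the rows themselves live in storage and are read only at execution time.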
Hive - Data Types

Hive data types are classified into two groups:


 Primitive Data Types
 Collection (Complex) Data Types
1. Primitive Data Types

Primitive data types are the basic built-in types; each holds a single value. The important
primitive data types are listed below:

Type        Size (bytes)              Example

TINYINT     1                         20
SMALLINT    2                         20
INT         4                         20
BIGINT      8                         20
BOOLEAN     true/false                FALSE
DOUBLE      8                         10.2222
FLOAT       4                         10.2222
STRING      Sequence of characters    ABCD
TIMESTAMP   Integer/float/string      2/3/2012 12:34:56:1234567
DATE        Integer/float/string      2/3/2019

Hive data types are implemented in Java. Points to note:


1. Character arrays are not supported in Hive.
Hive mainly works with string data types instead of fixed character arrays.

2. Hive uses delimiters to separate fields.


This delimiter-based storage improves read and write performance when working with
Hadoop.

3. Column length need not be specified.


While creating Hive tables, specifying the length of each column is not mandatory.

4. String literals in Hive.
String values can be written using:
 Single quotes ' '
 Double quotes " "

5. VARCHAR data type.


 Introduced in Hive 0.12.0.
 Supports length from 1 to 65,535 characters.
 Extra characters are truncated if the value exceeds the defined length.
 Length is measured in characters, not bytes.

6. Integer literals.
 Integer literals are treated as INT by default.
 If the value exceeds the INT range, the literal is interpreted as BIGINT.
 The suffixes Y, S, and L mark a literal as TINYINT, SMALLINT, or BIGINT respectively
(for example 100Y, 100S, 100L).

7. DECIMAL data type.


 Stores numeric values exactly.
 Provides better precision than DOUBLE.
 Suitable for financial and accurate calculations.
8. DOUBLE data type.
 Stores approximate values.
 Less accurate compared to DECIMAL.
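
The VARCHAR truncation rule and the DECIMAL versus DOUBLE distinction can be illustrated with a small Python sketch. It only mimics the behaviour described in these notes and does not call Hive: `to_varchar` is a hypothetical helper, Python's `float` stands in for DOUBLE (both are 64-bit binary floating point), and `decimal.Decimal` stands in for DECIMAL.

```python
from decimal import Decimal


def to_varchar(value, max_len):
    """Mimic VARCHAR(max_len): keep at most max_len characters (hypothetical helper)."""
    return value[:max_len]


# Extra characters beyond the declared length are dropped.
print(to_varchar("warehouse", 4))       # ware

# Length is counted in characters, so a non-ASCII character counts as one.
print(to_varchar("\u00e9cole", 3))      # éco

# DOUBLE-like binary floating point is approximate.
double_sum = 0.1 + 0.2
print(double_sum == 0.3)                # False

# DECIMAL-like arithmetic keeps the decimal digits exactly.
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(decimal_sum == Decimal("0.3"))    # True
```

This is why DECIMAL is the usual choice for money: sums of values such as 0.1 and 0.2 stay exact, whereas the binary approximation used by DOUBLE does not.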

2. Collection / Complex Data Types


Hive has four collection data types, also called complex data types:
 ARRAY
 MAP
 STRUCT
 UNIONTYPE
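
As a rough analogy, the four complex types map onto familiar Python structures: ARRAY is an ordered list of elements of one type, MAP is a set of key-value pairs, STRUCT groups named fields, and UNIONTYPE holds a value that is exactly one of several declared types. The sketch below is only an analogy, and the `Employee` and `Address` records and their values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Address:                    # like STRUCT<city:STRING, pin:INT>
    city: str
    pin: int


@dataclass
class Employee:
    name: str                     # STRING (primitive)
    skills: List[str]             # like ARRAY<STRING>
    marks: Dict[str, int]         # like MAP<STRING, INT>
    address: Address              # like STRUCT<...>
    contact: Union[int, str]      # like UNIONTYPE<INT, STRING>


emp = Employee(
    name="Asha",
    skills=["hive", "sql"],
    marks={"maths": 90},
    address=Address(city="Chennai", pin=600001),
    contact="asha@example.com",
)

print(emp.skills[0])        # array element by index
print(emp.marks["maths"])   # map value by key
print(emp.address.city)     # struct field by name
```

HiveQL uses similar access syntax: `skills[0]` for an array element, `marks['maths']` for a map value, and `address.city` for a struct field.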
Hive Data Model

Apache Hive is an open-source data warehouse system built on top of Hadoop. It is used for
querying and analyzing large datasets stored in Hadoop file systems such as HDFS. Hive
supports structured and semi-structured data and allows users to write queries using SQL-like
language (HiveQL).
In Hive, data is logically organized to make storage and querying efficient. The Hive
Data Model categorizes data into three levels:
1. Table
2. Partition
3. Bucket

1. Table
A Hive table is similar to a table in a relational database. It logically stores data, while its
metadata (schema, column names, data types, location) is stored in the Metastore.
Hive supports operations such as SELECT, filtering with WHERE, JOIN, and UNION on tables.

Types of Hive Tables


a) Managed Table
 By default, Hive creates a Managed Table.
 When data is loaded, Hive moves the data into its warehouse directory.
 When the table is dropped, both data and metadata are deleted.
Data is permanently deleted.
b) External Table
 Hive does not manage the data.
 Data location is specified outside the warehouse directory.
 When the table is dropped, only metadata is deleted, not the data.

Useful when data is shared with other tools.
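
The difference in drop behaviour can be sketched with a toy catalog in Python. This is a simplified model under stated assumptions, not Hive itself: `ToyHive`, the table names, and the temporary local directories standing in for HDFS locations are all hypothetical.

```python
import os
import shutil
import tempfile


class ToyHive:
    """Toy catalog: table name -> (data directory, is_external)."""

    def __init__(self):
        self.metastore = {}

    def create_table(self, name, data_dir, external=False):
        os.makedirs(data_dir, exist_ok=True)
        self.metastore[name] = (data_dir, external)

    def drop_table(self, name):
        # Metadata is removed for both table types.
        data_dir, external = self.metastore.pop(name)
        if not external:
            # Managed table: the data directory is removed as well.
            shutil.rmtree(data_dir)


root = tempfile.mkdtemp()
hive = ToyHive()
managed_dir = os.path.join(root, "warehouse", "sales")
external_dir = os.path.join(root, "shared", "logs")
hive.create_table("sales", managed_dir)
hive.create_table("logs", external_dir, external=True)

hive.drop_table("sales")
hive.drop_table("logs")

print(os.path.exists(managed_dir))    # False: managed data is gone
print(os.path.exists(external_dir))   # True: external data survives
```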

2. Partition

Partitioning divides a table into smaller parts based on a column called the partition key.

 Each partition stores data of a specific category.


 Physically, a partition is a sub-directory inside the table directory.
 Partitioning improves query performance by scanning only required data.
For example, if a student table is partitioned by department, queries for EEE students scan
only the EEE partition, not the whole table.
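
A minimal Python sketch of this idea follows. It assumes a hypothetical student table partitioned by a `dept` column and models partitions as entries in a dictionary keyed by directory path. The `key=value` sub-directory naming follows Hive's convention, while the table directory, rows, and function names are made up for illustration.

```python
from collections import defaultdict

TABLE_DIR = "/user/hive/warehouse/student"   # hypothetical table directory

rows = [
    {"id": 1, "name": "Asha", "dept": "EEE"},
    {"id": 2, "name": "Ravi", "dept": "CSE"},
    {"id": 3, "name": "Meena", "dept": "EEE"},
]


def partition_dir(dept):
    """Partition sub-directory in the form <table dir>/<key>=<value>."""
    return f"{TABLE_DIR}/dept={dept}"


# "Write" each row into the sub-directory for its partition key value.
partitions = defaultdict(list)
for row in rows:
    partitions[partition_dir(row["dept"])].append(row)


def query_by_dept(dept):
    """Partition pruning: read only the one matching sub-directory."""
    return partitions.get(partition_dir(dept), [])


print(sorted(partitions))
print([r["name"] for r in query_by_dept("EEE")])
```

A query that filters on the partition key touches one sub-directory. A query that filters on any other column would still have to read every partition.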

3. Buckets

Bucketing further divides tables or partitions into a fixed number of files based on a hash
function of a column.

 Each bucket is stored as a file.


 Helps in efficient joins, sampling, and querying.

For example, if a table is bucketed into 2 buckets, each partition will have 2 bucket files.
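
The bucket assignment can be sketched in Python as a hash of the bucketing column taken modulo the number of buckets. This is a simplified model: it assumes an integer column whose hash is the integer itself, whereas the real hash function depends on the column type and the Hive version. The `bucket_for` function and the student IDs are hypothetical.

```python
NUM_BUCKETS = 2


def bucket_for(student_id, num_buckets=NUM_BUCKETS):
    """Bucket index = (non-negative hash of the column value) mod num_buckets.

    Assumption: the hash of an integer value is the integer itself.
    """
    return (student_id & 0x7FFFFFFF) % num_buckets


# Each bucket corresponds to one file inside the table or partition directory.
buckets = {i: [] for i in range(NUM_BUCKETS)}
for student_id in [101, 102, 103, 104, 105]:
    buckets[bucket_for(student_id)].append(student_id)

print(buckets)   # {0: [102, 104], 1: [101, 103, 105]}
```

Because the same value always lands in the same bucket, two tables bucketed on the same column with the same bucket count can be joined bucket by bucket, and a sample can be taken by reading a single bucket file.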

HIVE Tables
In Apache Hive, a table is a logical structure used to store and organize data in Hadoop.
Hive tables are similar to tables in a relational database, but the actual data is stored in HDFS,
while the metadata (schema, column names, data types, location) is stored in the Hive
Metastore.

Hive tables allow users to perform operations such as SELECT, filtering with WHERE, JOIN,
GROUP BY, and UNION using HiveQL.

Hive tables provide a structured way to store and analyze big data in Hadoop.
Managed tables give full control to Hive, while external tables offer flexibility and data safety.
Choosing the correct table type improves data management and performance.

Types of HIVE Tables

Hive mainly supports two types of tables:

1. Managed (Internal) Table


2. External Table

1. Managed (Internal) Table

 This is the default table type in Hive.


 Hive manages both data and metadata.
 When data is loaded, Hive moves the data into the Hive warehouse directory.
 When the table is dropped, both data and metadata are deleted.

Example: after DROP TABLE on a managed table, the data is permanently removed from HDFS.

2. External Table
 Hive does not manage the data, only the metadata.
 Data location is specified explicitly during table creation.
 When the table is dropped, only metadata is deleted, data remains safe.

Example
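
As an illustration, the Python helper below builds the HiveQL statements for both table types. The statements use standard HiveQL clauses, but the helper `create_table_ddl`, the table names, the columns, and the path `/data/student_ext` are hypothetical, and the sketch only prints the statements rather than running them against Hive.

```python
def create_table_ddl(name, columns, location=None):
    """Build a CREATE TABLE statement; pass location to make it an external table."""
    cols = ", ".join(f"{col} {typ}" for col, typ in columns)
    keyword = "CREATE EXTERNAL TABLE" if location else "CREATE TABLE"
    ddl = (
        f"{keyword} {name} ({cols})\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE"
    )
    if location:
        # An external table points at data that lives outside the warehouse directory.
        ddl += f"\nLOCATION '{location}'"
    return ddl + ";"


columns = [("id", "INT"), ("name", "STRING"), ("dept", "STRING")]

managed_ddl = create_table_ddl("student", columns)
external_ddl = create_table_ddl("student_ext", columns, location="/data/student_ext")

print(managed_ddl)
print()
print(external_ddl)
```

Running `DROP TABLE student_ext;` would then remove only the metadata, leaving the files under the external location in place, while `DROP TABLE student;` would remove the managed table's data as well.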
