Big Data Unit 4

Apache Hive is a data warehouse infrastructure tool built on Hadoop for processing structured data, providing a SQL-like interface for querying and analyzing large datasets. It features an architecture with multiple components including Hive Clients, Hive Services, Hive Driver, Metastore, and MapReduce, allowing for efficient data management and analysis. Hive supports various data types and models, including managed and external tables, partitioning, and bucketing, enhancing performance and scalability in big data applications.

Uploaded by

shanmukhsai005


UNIT IV

APACHE HIVE:
Introduction, Architecture and components - Data types and data models - HIVE
partitioning and bucketing - HIVE tables

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Prerequisites: a basic knowledge of Core Java, SQL database concepts, the Hadoop file system,
and any flavor of the Linux operating system.
Hive - Introduction

The term ‘Big Data’ is used for collections of large datasets that include huge volume, high
velocity, and a variety of data that is increasing day by day. Using traditional data management
systems, it is difficult to process Big Data. Therefore, the Apache Software Foundation
introduced a framework called Hadoop to solve Big Data management and processing
challenges.

What is Hive

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes querying and analyzing easy.

Hive was initially developed by Facebook. The Apache Software Foundation later took it up and
developed it further as open-source software under the name Apache Hive. It is used by many
companies; for example, Amazon uses it in Amazon Elastic MapReduce.

Features of Hive
The major features of Hive for big data processing include:
• Open Source: Hive is open-source software, making it easily accessible to users
worldwide for big data use cases and implementations.
• SQL-like Interface: Hive provides a SQL-like query language called HiveQL,
allowing users to interact with data using SQL-like syntax and commands.
• Data Warehousing: Hive is designed for data warehousing tasks, enabling effective
analysis of datasets stored in the Hadoop Distributed File System (HDFS).
• Scalability: Hive is highly scalable because it can work on datasets spread across
the many nodes of a Hadoop cluster.
• Integration with Hadoop Ecosystem: Hive integrates with the components of the
Hadoop ecosystem, such as HDFS and MapReduce, making it a helpful tool for big
data processing.
• Data Management: Hive maintains metadata about tables, columns, and more in a
Metastore, making it easier for the user to manage data.

Architecture and components of Hive

The following component diagram depicts the architecture of Hive:

This architecture enables users to analyze big data using SQL-like queries without writing
complex programs. The layered design supports scalability, fault tolerance, and efficient
processing of large datasets. The diagram shows how a Hive query flows from the user to
Hadoop for execution; the components are described below from top to bottom.

1. Hive Clients
This is the top layer of the architecture. It represents the different ways users can connect to Hive.
• Thrift Server
Allows non-Java applications to communicate with Hive using RPC (Remote Procedure
Call).
• JDBC Driver
Enables Java applications and BI tools to connect to Hive.
• ODBC Driver
Used by applications and reporting tools that support ODBC (commonly Windows-based) to
access Hive.
These clients send HiveQL queries to Hive services.

2. Hive Services
This layer acts as a bridge between users and Hive processing.
• Hive Web UI
A browser-based interface for submitting Hive queries.
• Hive Server
Accepts queries from multiple clients and manages authentication and sessions.
• CLI (Command Line Interface)
Allows users to interact with Hive from the command line.
All these services forward the query to the Hive Driver.

3. Hive Driver
The Hive Driver is the core controller of Hive query execution.
Functions of Hive Driver:
• Receives HiveQL queries from Hive services
• Sends the query for parsing and optimization
• Interacts with the Metastore to get metadata
• Converts queries into execution plans
• Submits jobs to MapReduce
• Monitors execution and returns results to the client

4. Metastore
The Metastore stores metadata information such as:
• Database and table names
• Column names and data types
• Partition and bucket details
• Location of data in HDFS
The Hive Driver uses this metadata to understand the data structure before executing queries.
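
As a rough illustration of the kind of record the Metastore keeps, the sketch below models one table entry in Python. The class, its field names, and the sample values (database, table, path) are hypothetical and do not reflect the Metastore's real schema.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """Toy stand-in for one Metastore entry (not Hive's real schema)."""
    database: str
    table: str
    columns: dict                 # column name -> Hive data type
    partition_keys: list = field(default_factory=list)
    num_buckets: int = 0          # 0 means the table is not bucketed
    location: str = ""            # directory holding the table data in HDFS


# Hypothetical entry for a student table partitioned by department.
student = TableMetadata(
    database="college",
    table="student",
    columns={"id": "INT", "name": "STRING"},
    partition_keys=["dept"],
    num_buckets=2,
    location="/user/hive/warehouse/college.db/student",
)


def describe(meta):
    """Return the column list a driver would need before planning a query."""
    return [f"{name} {dtype}" for name, dtype in meta.columns.items()]


print(describe(student))
```

The point of the sketch is only that table structure and data location live apart from the data itself, so they can be looked up before any file is read.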

5. MapReduce Layer
• HiveQL queries are converted into MapReduce jobs.
• MapReduce performs parallel processing on large datasets.
• It processes data stored in HDFS.
Newer versions of Hive can also use Tez or Spark as the execution engine; the diagram shows
MapReduce.

6. HDFS (Hadoop Distributed File System)

This is the storage layer at the bottom of the architecture.
• Stores actual Hive table data
• Stores intermediate and final results
• Provides fault tolerance and scalability

Working of Hive

The following diagram depicts the workflow between Hive and Hadoop.

Query Execution Flow

1. User submits a query using Hive Client
2. Query goes to Hive Services
3. Hive Driver processes the query
4. Metadata is fetched from Metastore
5. Query is converted into MapReduce job
6. Data is read from HDFS
7. Results are returned to the user
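
The flow above can be traced with a small simulation. This is a toy model under stated assumptions: the dictionaries standing in for the Metastore and HDFS, the table contents, and the function name run_query are all hypothetical, and a plain filter function stands in for the generated MapReduce job.

```python
# Toy Metastore: table name -> columns and storage location (hypothetical values).
METASTORE = {
    "student": {"columns": ["id", "name", "dept"], "location": "/warehouse/student"},
}

# Toy HDFS: location -> rows stored there.
HDFS = {
    "/warehouse/student": [
        (1, "Asha", "EEE"),
        (2, "Ravi", "CSE"),
        (3, "Kiran", "EEE"),
    ],
}


def run_query(table, column, value):
    """Mimic steps 3 to 7 for: SELECT * FROM table WHERE column = value."""
    trace = []

    # Steps 3-4: the driver looks up the table's metadata in the Metastore.
    meta = METASTORE[table]
    col_index = meta["columns"].index(column)
    trace.append("metadata fetched")

    # Step 5: build an execution plan; a filter function stands in for the job.
    def plan(row):
        return row[col_index] == value
    trace.append("plan built")

    # Step 6: read the data from storage and apply the plan.
    rows = [row for row in HDFS[meta["location"]] if plan(row)]
    trace.append("data read")

    # Step 7: return the results to the caller.
    return rows, trace


rows, trace = run_query("student", "dept", "EEE")
print(rows)
print(trace)
```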

Hive - Data Types

Data types are classified into two categories:

• Primitive Data Types
• Collection (Complex) Data Types
1. Primitive Data Types

Primitive data types are the basic built-in types, each holding a single value. The
important primitive data types are listed below:

Type        Size (bytes) / representation   Example

TINYINT     1                               20
SMALLINT    2                               20
INT         4                               20
BIGINT      8                               20
BOOLEAN     Boolean true/false              FALSE
DOUBLE      8                               10.2222
FLOAT       4                               10.2222
STRING      Sequence of characters          ABCD
TIMESTAMP   Integer/float/string            2/3/2012 12:34:56:1234567
DATE        Integer/float/string            2/3/2019

Hive Data Types are Implemented in Java

1. Character arrays are not supported in Hive.
Hive mainly works with string data types instead of fixed character arrays.

2. Hive uses delimiters to separate fields.
Storing fields separated by delimiters, rather than at fixed widths, keeps reads and writes
simple and efficient when working with Hadoop.

3. Column length need not be specified.
When creating Hive tables, specifying the length of each column is not mandatory.

4. String literals in Hive.
String values can be written using:
• Single quotes ' '
• Double quotes " "

5. VARCHAR data type.
• Introduced in newer versions of Hive (0.12.0 onward).
• Supports lengths from 1 to 65,535 characters.
• Extra characters are truncated if the value exceeds the defined length.
• Length is measured in characters, not bytes.

6. Integer literals.
• Integer literals are treated as INT by default.
• If the value exceeds the INT range, it is interpreted as BIGINT.
• The suffixes Y, S, and L mark a literal as TINYINT, SMALLINT, or BIGINT
(for example, 100Y, 100S, 100L).

7. DECIMAL data type.
• Stores numeric values exactly.
• Provides better precision than DOUBLE.
• Suitable for financial and other calculations that must be exact.

8. DOUBLE data type.
• Stores approximate values.
• Less accurate compared to DECIMAL.
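
The difference between exact and approximate numeric types, and VARCHAR truncation, can be seen in any language. The sketch below uses Python's float and decimal.Decimal as stand-ins for DOUBLE and DECIMAL; the helper name varchar is hypothetical and only mimics the truncation rule described above.

```python
from decimal import Decimal

# Binary floating point (the representation behind DOUBLE) cannot store 0.1 exactly.
double_sum = 0.1 + 0.2
print(double_sum == 0.3)                # False

# Decimal arithmetic (analogous to DECIMAL) keeps the digits exact.
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(decimal_sum == Decimal("0.3"))    # True


def varchar(value, max_length):
    """Mimic VARCHAR(max_length): characters beyond the limit are cut off."""
    return value[:max_length]


print(varchar("warehouse", 4))          # ware
```

This is why monetary amounts are usually declared as DECIMAL rather than DOUBLE: small representation errors in floating point add up across millions of rows.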

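2. Collection / Complex Data Types

Hive has four collection data types, also called complex data types:
• ARRAY: an ordered list of elements of the same type
• MAP: a set of key-value pairs
• STRUCT: a record of named fields, each with its own type
• UNIONTYPE: a value that holds exactly one of several declared types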
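
For readers who know a general-purpose language, the four complex types map loosely onto familiar structures. The Python analogs below are only an analogy, and all names and values are made up for illustration.

```python
from typing import NamedTuple, Union


class Address(NamedTuple):
    """STRUCT analog: a fixed set of named fields, each with its own type."""
    city: str
    pin: int


# ARRAY analog: an ordered list of values of one type.
subjects = ["Maths", "Physics", "Circuits"]

# MAP analog: key-value pairs.
marks = {"Maths": 90, "Physics": 85}

# STRUCT analog: fields are accessed by name.
address = Address(city="Chennai", pin=600001)

# UNIONTYPE analog: a value that holds exactly one of several declared types.
contact: Union[int, str] = "office"

print(subjects[0], marks["Physics"], address.city, type(contact).__name__)
```

HiveQL uses similar access syntax: an index for arrays (subjects[0]), a key for maps (marks['Physics']), and dot notation for struct fields (address.city).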
Hive Data Model

Apache Hive is an open-source data warehouse system built on top of Hadoop. It is used for
querying and analyzing large datasets stored in Hadoop file systems such as HDFS. Hive
supports structured and semi-structured data and allows users to write queries using SQL-like
language (HiveQL).
In Hive, data is logically organized to make storage and querying efficient. The Hive
Data Model categorizes data into three levels:
1. Table
2. Partition
3. Bucket

1. Table
A Hive table is similar to a table in a relational database. It logically organizes data, while its
metadata (schema, column names, data types, location) is stored in the Metastore.
Hive supports operations such as SELECT, filtering with WHERE, JOIN, and UNION on tables.

Types of Hive Tables

a) Managed Table
• By default, Hive creates a managed table.
• When data is loaded, Hive moves the data into its warehouse directory.
• When the table is dropped, both data and metadata are deleted; the data is
permanently lost.
b) External Table
• Hive manages only the metadata, not the data.
• The data location is specified explicitly, typically outside the warehouse directory.
• When the table is dropped, only metadata is deleted, not the data.
• Useful when data is shared with other tools.
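
The different drop behavior of the two table types can be modeled in a few lines. This is a toy model, not Hive's implementation: the class ToyHive, its methods, and the table names and paths are all hypothetical.

```python
class ToyHive:
    """Toy model of DROP TABLE semantics; not Hive's implementation."""

    def __init__(self):
        self.metastore = {}   # table name -> {"external": bool, "location": str}
        self.hdfs = {}        # location -> list of rows

    def create_table(self, name, location, rows, external=False):
        self.metastore[name] = {"external": external, "location": location}
        self.hdfs[location] = rows

    def drop_table(self, name):
        meta = self.metastore.pop(name)       # metadata is always removed
        if not meta["external"]:
            del self.hdfs[meta["location"]]   # managed table: the data goes too


hive = ToyHive()
hive.create_table("sales_managed", "/warehouse/sales_managed", [("a", 1)])
hive.create_table("sales_external", "/data/shared/sales", [("b", 2)], external=True)

hive.drop_table("sales_managed")
hive.drop_table("sales_external")

print(sorted(hive.hdfs))   # only the external table's data remains
```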

2. Partition

Partitioning divides a table into smaller parts based on the values of a column called the
partition key.

• Each partition stores data of a specific category.
• Physically, a partition is a sub-directory inside the table directory.
• Partitioning improves query performance by scanning only the required data.
For example, if a student table is partitioned by department, a query for EEE students scans
only the EEE partition, not the whole table.
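
A minimal sketch of the idea, assuming a student table partitioned by a dept column. The table directory, rows, and helper names are hypothetical; the key=value form of the sub-directory name follows Hive's convention.

```python
from collections import defaultdict

TABLE_DIR = "/warehouse/student"   # hypothetical table directory


def partition_path(dept):
    """Hive names partition sub-directories as <key>=<value>."""
    return f"{TABLE_DIR}/dept={dept}"


# Load rows into per-partition "directories".
rows = [(1, "Asha", "EEE"), (2, "Ravi", "CSE"), (3, "Kiran", "EEE"), (4, "Meena", "ECE")]
partitions = defaultdict(list)
for student_id, name, dept in rows:
    partitions[partition_path(dept)].append((student_id, name))


def students_in(dept):
    """Partition pruning: read only the directory for the requested department."""
    return partitions.get(partition_path(dept), [])


print(sorted(partitions))
print(students_in("EEE"))
```

Note that the partition column is not stored inside the data files; its value is carried by the directory name, which is what lets a query skip whole directories.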

3. Buckets

Bucketing further divides tables or partitions into a fixed number of files based on a hash
function of a column.

• Each bucket is stored as a file.
• Helps in efficient joins, sampling, and querying.

For example, if a table is declared with 2 buckets, each partition will have 2 bucket files.
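
The assignment of rows to buckets can be sketched as a hash of the bucketing column taken modulo the number of buckets. The sketch assumes an integer id column and uses the value itself as its hash; the rows and function name are hypothetical, and Hive's actual hash function depends on the column type.

```python
NUM_BUCKETS = 2


def bucket_for(student_id, num_buckets=NUM_BUCKETS):
    """Assumed hash: for an integer key, use the value itself, then take the modulus."""
    return student_id % num_buckets


# One partition's rows, split into a fixed number of bucket files.
rows = [(1, "Asha"), (2, "Ravi"), (3, "Kiran"), (4, "Meena")]
buckets = {b: [] for b in range(NUM_BUCKETS)}
for student_id, name in rows:
    buckets[bucket_for(student_id)].append((student_id, name))

print(buckets)
```

Because the same id always lands in the same bucket, two tables bucketed on the same column with the same bucket count can be joined bucket by bucket, and a single bucket can serve as a sample of the data.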

HIVE Tables
In Apache Hive, a table is a logical structure used to store and organize data in Hadoop.
Hive tables are similar to tables in a relational database, but the actual data is stored in HDFS,
while the metadata (schema, column names, data types, location) is stored in the Hive
Metastore.

Hive tables allow users to perform operations such as SELECT, filtering with WHERE, JOIN,
GROUP BY, and UNION using HiveQL.

Hive tables provide a structured way to store and analyze big data in Hadoop.
Managed tables give full control to Hive, while external tables offer flexibility and data safety.
Choosing the correct table type improves data management and performance.

Types of HIVE Tables

Hive mainly supports two types of tables:

1. Managed (Internal) Table

2. External Table

1. Managed (Internal) Table

• This is the default table type in Hive.
• Hive manages both data and metadata.
• When data is loaded, Hive moves the data into the Hive warehouse directory.
• When the table is dropped, both data and metadata are deleted.

Example: when a managed table is dropped, its data is permanently removed from HDFS.

2. External Table
• Hive does not manage the data, only the metadata.
• The data location is specified explicitly during table creation.
• When the table is dropped, only metadata is deleted; the data remains safe.

Example: when an external table is dropped, the files at its location remain in place.
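
To make the two table types concrete, the sketch below builds the corresponding HiveQL CREATE TABLE statements as Python strings. The helper create_table_ddl and the table names, columns, and path are hypothetical; only the HiveQL clauses themselves (EXTERNAL, PARTITIONED BY, CLUSTERED BY ... INTO n BUCKETS, LOCATION) are standard syntax.

```python
def create_table_ddl(name, columns, external=False, location=None,
                     partitioned_by=None, buckets=None):
    """Build a HiveQL CREATE TABLE statement as a string (illustrative only)."""
    cols = ", ".join(f"{col} {dtype}" for col, dtype in columns)
    keyword = "CREATE EXTERNAL TABLE" if external else "CREATE TABLE"
    parts = [f"{keyword} {name} ({cols})"]
    if partitioned_by:
        pcols = ", ".join(f"{col} {dtype}" for col, dtype in partitioned_by)
        parts.append(f"PARTITIONED BY ({pcols})")
    if buckets:
        bucket_col, count = buckets
        parts.append(f"CLUSTERED BY ({bucket_col}) INTO {count} BUCKETS")
    if location:
        parts.append(f"LOCATION '{location}'")
    return "\n".join(parts) + ";"


# Managed table, partitioned by department and bucketed on id.
managed = create_table_ddl(
    "student",
    [("id", "INT"), ("name", "STRING")],
    partitioned_by=[("dept", "STRING")],
    buckets=("id", 2),
)

# External table whose files live at an explicitly given location.
external = create_table_ddl(
    "student_ext",
    [("id", "INT"), ("name", "STRING")],
    external=True,
    location="/data/shared/student",
)

print(managed)
print(external)
```

Running DROP TABLE on the first table would remove its files from the warehouse directory, while dropping the second would leave the files under its LOCATION untouched.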
