Introduction to Hive
Hive is a data warehousing and analytics tool built on top of Hadoop.
It enables users to query and analyze large datasets stored in HDFS using an SQL-like
language called HiveQL.
Key Features of Hive
• Designed for data analysts
• Uses HiveQL (SQL-like syntax)
• Converts queries into MapReduce / Tez / Spark jobs
• Supports batch processing
• Supports schema on read
• Handles large structured and semi-structured data
Why Hive?
• Faster development than writing MapReduce code by hand
• Easy integration with BI tools (via JDBC/ODBC)
• Suitable for:
o Log analytics
o Data summarization & aggregation
o Offline batch data processing
Hive Architecture
The architecture of Hive includes the following components:
User Interface
• CLI (Command Line Interface)
• Hive Web UI
• JDBC / ODBC connections
Used to submit HiveQL queries.
Driver
• Manages the entire query lifecycle
• Handles:
o Parsing
o Query compilation
o Optimization
o Execution plan creation
• Coordinates execution with Execution Engine
Compiler
• Converts HiveQL queries into:
o Logical execution plan → DAG
o Physical plan → MapReduce/Tez/Spark jobs
MetaStore
• Stores metadata about:
o Tables, schema, columns
o Data location in HDFS
o Partition information
• Usually backed by MySQL/PostgreSQL
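As a rough illustration of the kind of information the MetaStore holds, the Python sketch below models one table's metadata with plain data classes. The class names, the sales table, and the warehouse path are hypothetical and chosen only for illustration; the real MetaStore keeps this information in relational tables in its backing database. The key=value directory naming for partitions follows Hive's usual layout.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    hive_type: str  # e.g. "INT", "STRING"


@dataclass
class TableMeta:
    """Hypothetical, simplified stand-in for one MetaStore table entry."""
    name: str
    columns: list
    location: str                      # base directory of the table in HDFS
    partition_keys: list = field(default_factory=list)
    partitions: dict = field(default_factory=dict)  # partition values -> directory

    def add_partition(self, **spec):
        # Hive conventionally stores each partition under key=value subdirectories.
        subdir = "/".join(f"{k}={spec[k]}" for k in self.partition_keys)
        path = f"{self.location}/{subdir}"
        self.partitions[tuple(spec[k] for k in self.partition_keys)] = path
        return path


sales = TableMeta(
    name="sales",
    columns=[Column("id", "INT"), Column("price", "DOUBLE")],
    location="/warehouse/sales",       # assumed path, for illustration only
    partition_keys=["dt"],
)
path = sales.add_partition(dt="2024-01-01")
print(path)  # /warehouse/sales/dt=2024-01-01
```

Because the partition value is encoded in the directory name, a query that filters on dt only needs to read the matching directories.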
Execution Engine
• Executes tasks in the cluster via:
o MapReduce (default earlier)
o Tez/Spark (faster engines)
• Coordinates with YARN for resources
HDFS
• Storage layer for actual table data
Hive vs RDBMS (In Brief)
• Hive is built for read-heavy big data analytics
• It is not suitable for OLTP workloads (frequent, low-latency row-level updates)
Hive Data Types
Hive supports primitive and complex data types.
Primitive Data Types
Type              Example       Description
INT               20            Integer value
BIGINT            922337        Large integer (64-bit)
FLOAT, DOUBLE     3.14          Floating-point numbers
STRING            "Hello"       Text data
BOOLEAN           TRUE/FALSE    Logical value
TIMESTAMP, DATE   2024-01-01    Date and time values
BINARY            bytes         Raw binary data
Complex Data Types
Type        Example                    Description
ARRAY       ["A","B"]                  Ordered elements of the same type
MAP         {'id':123}                 Key-value pairs
STRUCT      {name:"Ashwin", age:21}    Record with multiple named fields
UNIONTYPE   Tagged union               Value that can be one of several defined types
Additional Hive Data Concepts
• NULL represents missing values
• Supports type casting and conversion
• Table data types help structure data stored in HDFS
Hive Query Language (HQL)
Hive Query Language (HQL) is a SQL-like language used in Apache Hive to manage and
analyze large datasets stored in Hadoop HDFS. Since many data analysts know SQL,
HQL makes it easy to work with distributed data without writing Java MapReduce
programs.
Characteristics of HQL
1. SQL-Like Syntax – Similar to traditional RDBMS SQL, easy learning curve.
2. Schema on Read – Structure applied at query time, supports semi-structured
data.
3. Batch Processing – Queries are converted to MapReduce/Tez/Spark jobs internally.
4. High Scalability – Designed to run on distributed storage like HDFS.
5. Supports UDFs – User-defined functions for custom processing.
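To illustrate the schema-on-read idea from the list above, here is a minimal Python sketch. The raw text is stored unvalidated, and a schema is applied only when the data is read. The sample data and the column names are invented for illustration; turning unreadable fields into None mirrors the way Hive returns NULL for values that do not fit the declared type.

```python
import csv
import io

# Raw file contents as they might sit in HDFS; nothing was validated on write.
RAW = "1,Alice,30\n2,Bob,not_a_number\n3,Carol\n"

# Hypothetical schema, applied only when the data is read.
SCHEMA = [("id", int), ("name", str), ("age", int)]


def read_with_schema(text, schema):
    """Yield one dict per line, casting fields at read time.

    Values that are missing or cannot be cast become None, similar to how
    Hive returns NULL for fields that do not fit the declared type.
    """
    for fields in csv.reader(io.StringIO(text)):
        row = {}
        for i, (name, cast) in enumerate(schema):
            try:
                row[name] = cast(fields[i])
            except (IndexError, ValueError):
                row[name] = None
        yield row


rows = list(read_with_schema(RAW, SCHEMA))
for r in rows:
    print(r)
```

The same stored bytes could be read again later with a different schema, which is what distinguishes this approach from the schema-on-write model of a traditional RDBMS.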
Common HQL Operations
Category         Example
DDL              CREATE TABLE, ALTER TABLE, DROP TABLE, SHOW TABLES
DML              LOAD DATA, INSERT, UPDATE, DELETE (UPDATE and DELETE require transactional/ACID tables)
Query/Analysis   SELECT, WHERE, ORDER BY, GROUP BY, JOIN, LIMIT
Workflow of HQL Query Execution
1. User submits HQL query.
2. Query is compiled and optimized.
3. Execution plan is created.
4. MapReduce/Spark/Tez jobs run on Hadoop.
5. Results are returned to client.
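To make step 4 more concrete, the Python sketch below shows one way an aggregate such as SELECT page, COUNT(*) FROM logs GROUP BY page could be evaluated as a map phase, a shuffle, and a reduce phase. The logs table and its columns are hypothetical, and the plans Hive actually generates are more elaborate (combiners, multiple stages, engine-specific operators), so treat this only as a conceptual model.

```python
from collections import defaultdict

# Hypothetical input rows for: SELECT page, COUNT(*) FROM logs GROUP BY page
logs = [
    {"user": "u1", "page": "/home"},
    {"user": "u2", "page": "/cart"},
    {"user": "u1", "page": "/home"},
    {"user": "u3", "page": "/home"},
]


def map_phase(rows):
    # Emit (group key, 1) for every input row.
    for row in rows:
        yield row["page"], 1


def shuffle(pairs):
    # Bring all values for the same key together, as the framework would.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    # Aggregate the values of each key; here COUNT(*) is a sum of ones.
    return {key: sum(values) for key, values in grouped.items()}


counts = reduce_phase(shuffle(map_phase(logs)))
print(counts)
```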
Use Cases
• Data Warehousing
• Log and Clickstream analysis
• Business Intelligence reporting
Introduction to Pig
Apache Pig is a high-level data processing platform developed by Yahoo for Hadoop. It
uses a scripting language called Pig Latin to process large datasets easily.
Why Pig?
• Reduces complexity compared to writing Java MapReduce
• Suitable for ETL (Extract-Transform-Load) pipelines
• Handles structured, semi-structured, and unstructured data
Pig vs Hive
Hive                           Pig
SQL-like language (HQL)        Scripting language (Pig Latin)
Best for analytics/reporting   Best for ETL & data transformation
Schema-on-read                 Flexible schema
Runs mostly batch queries      Supports complex data flows
Anatomy of Pig
The internal structure and working of Pig include the following components:
Pig Latin Script
A sequence of data operations like load → transform → group → store
Example:
data = LOAD '[Link]' USING PigStorage(',') AS (id:int, name:chararray);
grp = GROUP data BY id;
STORE grp INTO 'output';
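The Python sketch below models the shape of what GROUP produces in the script above: one tuple per distinct key, holding the key and a bag of the original tuples. The sample rows are invented, and the helper group_by is a hypothetical stand-in for Pig's operator, not part of any Pig API.

```python
from collections import defaultdict

# Invented sample rows with the schema (id:int, name:chararray).
data = [(1, "asha"), (2, "ravi"), (1, "meena")]


def group_by(relation, key_index):
    """Mimic Pig's GROUP: return (group, bag) tuples, one per distinct key."""
    bags = defaultdict(list)
    for tup in relation:
        bags[tup[key_index]].append(tup)
    # Sorted only to make the printed output stable; Pig bags and relations
    # carry no ordering guarantee.
    return [(key, bags[key]) for key in sorted(bags)]


grp = group_by(data, key_index=0)
for group, bag in grp:
    print(group, bag)
```

Note that the grouped tuples keep all of their original fields, including the key, which is why later steps usually follow GROUP with a FOREACH that aggregates or flattens the bag.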
Pig Execution Modes
• Local Mode → Runs in a single JVM (for testing)
• MapReduce Mode → Default mode, executes using Hadoop MapReduce
• Tez/Spark Mode → Faster execution on Tez or Spark, where the installed Pig release supports them
Pig Execution Framework
Steps followed internally:
1. Parse the script
2. Logical Plan is created (sequence of operations)
3. Optimization (removal of redundant processing)
4. Physical Plan generation
5. Convert to MapReduce jobs
6. Execute on Hadoop cluster
Pig Components
Component Description
Pig Latin Language to write scripts
Parser Verifies syntax
Optimizer Improves performance
Execution Engine Converts script into Hadoop jobs
Data Model in Pig
• Atom (single value)
• Tuple (ordered set of fields)
• Bag (collection of tuples)
• Map (key-value pairs)
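A rough way to picture these four building blocks is to map them onto Python values, as in the sketch below. The correspondence is only an analogy chosen for illustration (for example, a Python list is ordered while a Pig bag is not), and the sample values are invented.

```python
# Atom: a single value.
atom = "John"

# Tuple: an ordered set of fields.
tup = (1, "John", 5000)

# Bag: a collection of tuples (a list here; Pig bags are unordered and may
# contain duplicates).
bag = [(1, "A"), (2, "B")]

# Map: key-value pairs, with string keys as in Pig.
mapping = {"name": "John", "age": 25}

# The types nest: a tuple may hold a bag and a map as fields.
record = (101, bag, mapping)
print(record[1][0])        # first tuple in the nested bag
print(record[2]["name"])   # lookup in the nested map
```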
Pig on Hadoop
Apache Pig is designed to work efficiently on top of the Hadoop ecosystem. It uses
Hadoop’s distributed storage and processing framework to manage and analyze
large datasets.
How Pig Works on Hadoop
1. Pig scripts written in Pig Latin are submitted for execution.
2. Pig converts the script into a series of MapReduce jobs automatically.
3. These jobs run on Hadoop, using:
o HDFS for data storage
o MapReduce / Tez / Spark for execution
Benefits
• No need to write complex Java MapReduce programs
• Better performance through automatic optimization
• Handles large-scale structured/semi-structured data
Use Case for Pig
Apache Pig is primarily used in data preparation and transformation tasks in Big
Data pipelines.
Major Use Cases
Use Case                             Explanation
ETL (Extract-Transform-Load)         Clean, filter, and transform raw data before loading it into a warehouse
Log Processing                       Analyze web logs, clickstreams, and social media data
Data Integration                     Combine data from multiple sources
Data Pipeline for Machine Learning   Pre-process raw datasets
Handling Unstructured Data           XML, JSON, text analytics
Industry Examples
• Yahoo! using Pig for analyzing user click data
• E-commerce recommendation systems
• Fraud detection data pipelines
ETL Processing in Pig
Pig is widely used as an ETL tool for preparing data for BI and analytics.
ETL Phases in Pig
Phase Pig Feature Used
Extract LOAD data from HDFS/Local/NoSQL
Transform FILTER, GROUP, JOIN, FLATTEN, user-defined functions
Load STORE results back to HDFS/Hive/HBase
Example ETL Script
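-- Extract: read comma-separated sales records with an explicit schema
raw_data = LOAD '/user/data/[Link]' USING PigStorage(',')
    AS (id:int, product:chararray, price:int);
-- Transform: keep only records priced above 100
filtered = FILTER raw_data BY price > 100;
-- Load: write the filtered records back to HDFS as comma-separated text
STORE filtered INTO '/user/output/high_price_sales' USING PigStorage(',');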
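For readers more familiar with general-purpose code, the same extract, filter, and store flow can be sketched in plain Python. This is a single-machine analogy only, with invented sample data and hypothetical function names; Pig would run the equivalent steps in parallel across the cluster.

```python
import csv
import io

# Invented sample input standing in for the comma-separated sales file.
RAW = "1,laptop,900\n2,mouse,25\n3,monitor,150\n"


def extract(text):
    # LOAD ... USING PigStorage(',') AS (id:int, product:chararray, price:int)
    for id_, product, price in csv.reader(io.StringIO(text)):
        yield int(id_), product, int(price)


def transform(rows):
    # FILTER raw_data BY price > 100
    return (row for row in rows if row[2] > 100)


def load(rows):
    # STORE ... USING PigStorage(','): write comma-separated lines.
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


result = load(transform(extract(RAW)))
print(result)
```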
Data Types in Pig
Pig supports various data types suitable for semi-structured and complex data.
Simple (Scalar) Types
• int – integer
• long – large integer
• float – single precision number
• double – double precision number
• chararray – string
• bytearray – raw bytes
• boolean – true/false
Complex Types
Type Description Example
Tuple Ordered set of fields (1, 'John', 5000)
Bag Collection of tuples {(1,'A'), (2,'B')}
Map Key-value pairs ['name'#'John','age'#25]
These allow Pig to handle nested and flexible data models.
Running Pig
Pig is most commonly run in one of two execution modes:
Local Mode
• Runs on a single JVM
• Used for testing small data
• Command:
pig -x local
MapReduce Mode (Default)
• Runs on a full Hadoop cluster
• Suitable for large datasets
• Command:
pig
Ways to Run Pig Scripts
Method Description
Grunt Shell Interactive execution of Pig commands
Script File Write commands in .pig file → execute using: pig [Link]
Embedded Pig Pig Latin inside Java programs
Quick Summary Table
Topic Key Concept
Pig on Hadoop Converts Pig Latin → MapReduce jobs on HDFS
Use Case Best for ETL, log processing, data cleansing
ETL Processing LOAD → TRANSFORM → STORE
Pig Data Types Scalar + Complex (Tuple, Bag, Map)
Running Pig Local mode & MapReduce mode, grunt shell, scripts
Execution Model of Pig (Extended)
The Pig execution model describes the internal process of how a Pig Latin script gets
executed in a Hadoop environment. Pig acts like a data flow engine and converts high-
level operations into distributed processing tasks.
Stages of Execution
1. Parsing
o Syntax of the Pig Latin script is checked.
o Schema validation is performed.
o A Logical Plan (operator-based flow) is created.
2. Logical Optimization
o Logical plan is optimized for efficiency.
o Common optimizations:
▪ Push down FILTER and LIMIT operators near LOAD
▪ Combine multiple FOREACH into a single operation
o No Hadoop interaction yet, only compiler-level improvements.
3. Physical Plan Generation
o Logical plan is converted into a Physical Execution Plan
o Breaks the tasks into map and reduce phases
4. Compilation into MapReduce Jobs
o Physical plan → One or multiple MapReduce jobs
o Each transformation is translated to a job or set of jobs
5. Execution
o Jobs are submitted to Hadoop YARN
o Data is processed in parallel using:
▪ Map Tasks — scanning and filtering
▪ Reduce Tasks — grouping and aggregation
6. Result Storage
o Output stored back in HDFS / local system / or displayed via DUMP
Conclusion: Pig provides an easy abstraction over MapReduce and enables large-
scale data processing without complex Java coding.
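The logical optimization stage can be illustrated with a toy example. The Python sketch below represents a plan as a list of steps and moves a FILTER ahead of an ORDER, so that fewer tuples need to be sorted. The plan format, the function names, and the single rewrite rule are all invented for illustration and are far simpler than Pig's actual optimizer.

```python
# A toy logical plan: a list of (operator, argument) steps applied in order.
# The operator names mirror Pig Latin; the plan format itself is invented.
plan = [
    ("LOAD", [(1, 250), (2, 40), (3, 120), (4, 15)]),  # (id, price)
    ("ORDER", 1),                                      # sort by field 1
    ("FILTER", lambda t: t[1] > 100),                  # keep price > 100
]


def push_filters_up(steps):
    """Move each FILTER ahead of a directly preceding ORDER.

    Filtering before sorting gives the same result but sorts fewer tuples.
    """
    steps = list(steps)
    changed = True
    while changed:
        changed = False
        for i in range(1, len(steps)):
            if steps[i][0] == "FILTER" and steps[i - 1][0] == "ORDER":
                steps[i - 1], steps[i] = steps[i], steps[i - 1]
                changed = True
    return steps


def run(steps):
    """Execute the plan; also report how many tuples the sort had to handle."""
    rows, sorted_count = [], 0
    for op, arg in steps:
        if op == "LOAD":
            rows = list(arg)
        elif op == "FILTER":
            rows = [t for t in rows if arg(t)]
        elif op == "ORDER":
            sorted_count = len(rows)
            rows = sorted(rows, key=lambda t: t[arg])
    return rows, sorted_count


before = run(plan)
after = run(push_filters_up(plan))
print(before)
print(after)
```

Both plans return the same rows, but the rewritten plan sorts two tuples instead of four. On a cluster, the same idea reduces the amount of data shuffled between map and reduce tasks.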
Operators in Pig (Extended)
Pig operators perform data manipulation tasks in the data flow pipeline.
Common Categories
Category           Examples                      Purpose
Loading/Storage    LOAD, STORE, DUMP             Ingest data and save or display output
Transformation     FOREACH, FILTER, FLATTEN      Modify structure or filter data
Grouping/Joining   GROUP, COGROUP, JOIN, CROSS   Combine or group datasets
Ordering           ORDER BY, DISTINCT, LIMIT     Sorting & data reduction
Set Operators      UNION, SPLIT                  Merge or divide relations
Important Note: Pig operators work on relations (data tables) and are designed for
pipelined execution.
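The difference between COGROUP and JOIN is easier to see on a small example. The Python sketch below mimics both on two invented relations: COGROUP keeps one tuple per key with a separate bag for each input, while an inner JOIN produces one flat tuple per matching pair. The function names are hypothetical stand-ins, not Pig APIs.

```python
from collections import defaultdict

# Invented sample relations: users(id, name) and orders(user_id, item).
users = [(1, "asha"), (2, "ravi")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]


def cogroup(left, right):
    """Mimic COGROUP left BY $0, right BY $0: (key, left_bag, right_bag)."""
    left_bags, right_bags = defaultdict(list), defaultdict(list)
    for t in left:
        left_bags[t[0]].append(t)
    for t in right:
        right_bags[t[0]].append(t)
    keys = sorted(set(left_bags) | set(right_bags))  # sorted for stable output
    return [(k, left_bags.get(k, []), right_bags.get(k, [])) for k in keys]


def inner_join(left, right):
    """Mimic JOIN left BY $0, right BY $0: one flat tuple per matching pair."""
    return [lt + rt
            for _, lbag, rbag in cogroup(left, right)
            for lt in lbag
            for rt in rbag]


grouped = cogroup(users, orders)
joined = inner_join(users, orders)
print(grouped)
print(joined)
```

Keys present in only one input still appear in the COGROUP result with an empty bag on the other side, whereas the inner JOIN drops them.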
Functions in Pig (Extended)
Functions are used to operate on fields, tuples, bags, or maps.
Built-in Functions
• String Functions:
LOWER(), UPPER(), TRIM(), SUBSTRING(), CONCAT()
• Mathematical Functions:
ABS(), ROUND(), SQRT(), RANDOM()
• Bag/Aggregate Functions:
SIZE(), TOKENIZE(), COUNT(), MAX(), MIN()
• Conditional Functions:
DECODE(), (condition ? value1 : value2)
UDF (User Defined Function)
Used when built-in functions are insufficient.
• Supports multiple languages:
o Java (primary), Python, JavaScript, Ruby
• Registered using:
REGISTER [Link];
Example
REGISTER [Link];
data = LOAD 'input' AS (line:chararray);
result = FOREACH data GENERATE myUDF(line);
DUMP result;
UDFs improve flexibility and allow customized business logic.
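Conceptually, a UDF is a function that Pig calls once per input record inside FOREACH ... GENERATE. The Python sketch below models that behaviour outside of Pig: my_udf is a hypothetical function body, and foreach_generate is an invented helper that stands in for the FOREACH operator. It does not use Pig's own UDF registration mechanism.

```python
def my_udf(line):
    """Hypothetical UDF body: normalize one line of text."""
    if line is None:
        return None  # pass nulls through instead of failing the whole job
    return line.strip().upper()


def foreach_generate(relation, func):
    # Mimics: result = FOREACH data GENERATE myUDF(line);
    return [(func(line),) for (line,) in relation]


data = [("  hello pig ",), (None,), ("etl",)]
result = foreach_generate(data, my_udf)
print(result)
```

Handling null input explicitly is a common design choice for UDFs, since a single malformed record would otherwise abort a long-running job.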
Data Types in Pig (Extended)
Pig supports both scalar (simple) and complex data types.
Scalar Data Types
Type Example Use
int 25 Small integer
long 9823479234 Large integer
float/double 10.59 Decimal number
chararray "Hello" Text data
bytearray Binary blob Raw data
boolean TRUE Logical
Complex Data Types
Type    Description            Example              Use
Tuple   Ordered fields         (101, 'John')        Row of data (record)
Bag     Collection of tuples   {(1,'A'),(2,'B')}    Unordered dataset
Map     Key-value structure    ['age'#25]           JSON-like data
These allow Pig to process semi-structured data like logs, XML, and JSON.