Introduction to Hive and Pig Overview

Hive is a data warehousing and analytics tool built on Hadoop, allowing users to query large datasets with an SQL-like language called HiveQL. It features batch processing, schema on read, and easy integration with BI tools, making it suitable for log analytics and data summarization. Apache Pig, on the other hand, is a high-level data processing platform using Pig Latin for ETL tasks, designed to simplify data transformation and handle structured and unstructured data.

Uploaded by

udhruv444

Introduction to Hive

Hive is a data warehousing and analytics tool built on top of Hadoop. It enables users to query and analyze large datasets stored in HDFS using an SQL-like language called HiveQL.

Key Features of Hive

• Designed for data analysts

• Uses HiveQL (SQL-like syntax)

• Converts queries into MapReduce / Tez / Spark jobs

• Supports batch processing

• Supports schema on read

• Handles large structured and semi-structured data

Why Hive?

• Faster development than writing MapReduce code by hand

• Easy integration with BI tools (via JDBC/ODBC)

• Suitable for:

o Log analytics

o Data summarization & aggregation

o Offline batch data processing

Hive Architecture

The architecture of Hive includes the following components:

User Interface

• CLI (Command Line Interface)

• Hive Web UI

• JDBC / ODBC connections

Used to submit HiveQL queries.

Driver

• Manages the entire query lifecycle

• Handles:

o Parsing

o Query compilation

o Optimization

o Execution plan creation

• Coordinates execution with Execution Engine

Compiler

• Converts HiveQL queries into:

o Logical execution plan → DAG

o Physical plan → MapReduce/Tez/Spark jobs

MetaStore

• Stores metadata about:

o Tables, schema, columns

o Data location in HDFS

o Partition information

• Usually backed by MySQL/PostgreSQL
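
To make the MetaStore's role concrete, here is a minimal Python sketch of the kind of record it keeps for each table. It is a toy in-memory model, not Hive's actual metastore API; the table name, columns, and HDFS paths are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TableMeta:
    """Toy stand-in for one metastore entry (not Hive's real schema)."""
    name: str
    columns: dict          # column name -> Hive type name
    location: str          # base directory of the table in HDFS
    partitions: dict = field(default_factory=dict)  # partition value -> directory


# Hypothetical table partitioned by date.
logs = TableMeta(
    name="web_logs",
    columns={"ip": "STRING", "status": "INT"},
    location="/warehouse/web_logs",
)
for day in ("2024-01-01", "2024-01-02", "2024-01-03"):
    logs.partitions[day] = f"{logs.location}/dt={day}"


def directories_to_scan(table, wanted_days):
    """Return only the partition directories a query needs to read."""
    return [path for day, path in sorted(table.partitions.items()) if day in wanted_days]


paths = directories_to_scan(logs, {"2024-01-02"})
print(paths)
```

Because partition locations live in metadata, a query that filters on the partition column can skip the other directories without reading them.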

Execution Engine

• Executes tasks in the cluster via:

o MapReduce (the original default engine)

o Tez/Spark (faster engines)

• Coordinates with YARN for resources

HDFS

• Storage layer for actual table data


Hive vs RDBMS (Short Point)

• Hive is built for read-heavy big data analytics

• It is not suitable for OLTP workloads (frequent, real-time row-level updates)

Hive Data Types

Hive supports primitive and complex data types.

Primitive Data Types

Type | Example | Description
INT | 20 | Integer value
BIGINT | 922337 | Large integers
FLOAT, DOUBLE | 3.14 | Floating-point numbers
STRING | "Hello" | Text data
BOOLEAN | TRUE/FALSE | Logical
TIMESTAMP, DATE | 2024-01-01 | Date and time values
BINARY | bytes | Binary raw data

Complex Data Types

Type | Example | Description
ARRAY | ["A","B"] | Ordered elements of same type
MAP | {'id':123} | Key-value pairs
STRUCT | {name:"Ashwin", age:21} | Record with multiple fields
UNIONTYPE | Tagged union | Can be one of multiple defined types
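
As a rough analogy, the complex types in the table above map onto familiar Python structures. The sketch below is illustrative only; the column names (tags, attrs, person, value) are hypothetical, and the comments show the corresponding HiveQL access syntax.

```python
from typing import NamedTuple


class Person(NamedTuple):
    """Plays the role of STRUCT<name:STRING, age:INT>."""
    name: str
    age: int


row = {
    "tags": ["A", "B"],                 # ARRAY<STRING>: ordered, same element type
    "attrs": {"id": 123},               # MAP<STRING,INT>: key-value pairs
    "person": Person("Ashwin", 21),     # STRUCT: named fields
    "value": (0, 42),                   # tagged union: (tag, value); the tag picks the type
}

first_tag = row["tags"][0]        # HiveQL: tags[0]
id_value = row["attrs"]["id"]     # HiveQL: attrs['id']
person_name = row["person"].name  # HiveQL: person.name
print(first_tag, id_value, person_name)
```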

Additional Hive Data Concepts

• NULL represents missing values

• Supports type casting and conversion

• Table data types help structure data stored in HDFS

Hive Query Language (HQL)

Hive Query Language (HQL) is a SQL-like language used in Apache Hive to manage and
analyze large datasets stored in Hadoop HDFS. Since many data analysts know SQL,
HQL makes it easy to work with distributed data without writing Java MapReduce
programs.

Characteristics of HQL

1. SQL-Like Syntax – Similar to traditional RDBMS SQL, easy learning curve.

2. Schema on Read – Structure applied at query time, supports semi-structured data.

3. Batch Processing – Queries are converted to MapReduce/Tez/Spark jobs internally.

4. High Scalability – Designed to run on distributed storage like HDFS.

5. Supports UDFs – User-defined functions for custom processing.
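
Schema on read can be illustrated with a small Python sketch: the stored data is plain text, and types are applied only when rows are read. This mimics the idea rather than Hive's implementation; the sample data and schema are made up. A value that does not fit its declared type becomes None here, comparable to the NULL that Hive typically returns in that situation.

```python
import csv
import io

# Raw file contents as they might sit in storage: plain text, no types attached.
RAW = "1,alice,30\n2,bob,not_a_number\n3,carol,41\n"

# Hypothetical schema, applied only when the data is read.
SCHEMA = [("id", int), ("name", str), ("age", int)]


def read_with_schema(text, schema):
    """Parse each line at read time; values that do not fit the type become None."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        row = {}
        for (column, cast), value in zip(schema, fields):
            try:
                row[column] = cast(value)
            except ValueError:
                row[column] = None   # comparable to NULL in a query result
        rows.append(row)
    return rows


rows = read_with_schema(RAW, SCHEMA)
print(rows[1])
```

Nothing is rejected when the data is written; a malformed field only shows up as a missing value when a query reads it.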

Common HQL Operations

Category | Example
DDL | CREATE TABLE, ALTER, DROP, SHOW TABLES
DML | LOAD DATA, INSERT, UPDATE, DELETE (UPDATE and DELETE require ACID transactional tables)
Query/Analysis | SELECT, WHERE, ORDER BY, GROUP BY, JOIN, LIMIT
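
The three categories can be tried out with Python's built-in sqlite3 module. SQLite is only a stand-in here, not Hive, and the table and data are hypothetical; the statements use the SQL subset that the two share, so the query would look much the same in HQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define a table (hypothetical name and columns).
conn.execute("CREATE TABLE sales (id INTEGER, product TEXT, price INTEGER)")

# DML: add rows. In Hive, bulk data usually arrives via LOAD DATA instead.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "pen", 20), (2, "laptop", 900), (3, "pen", 25), (4, "phone", 600)],
)

# Query/analysis: filter, aggregate, sort, and limit.
query = """
    SELECT product, COUNT(*) AS orders, SUM(price) AS revenue
    FROM sales
    WHERE price > 10
    GROUP BY product
    ORDER BY revenue DESC
    LIMIT 2
"""
top = conn.execute(query).fetchall()
print(top)
conn.close()
```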

Workflow of HQL Query Execution

1. User submits HQL query.

2. Query is compiled and optimized.

3. Execution plan is created.

4. MapReduce/Spark/Tez jobs run on Hadoop.

5. Results are returned to client.

Use Cases

• Data Warehousing

• Log and Clickstream analysis

• Business Intelligence reporting

Introduction to Pig

Apache Pig is a high-level data processing platform developed by Yahoo for Hadoop. It
uses a scripting language called Pig Latin to process large datasets easily.

Why Pig?

• Reduces complexity compared to writing Java MapReduce

• Suitable for ETL (Extract-Transform-Load) pipelines

• Handles structured, semi-structured, and unstructured data

Pig vs Hive

Hive | Pig
SQL-like HQL | Scripting language (Pig Latin)
Best for analytics/reporting | Best for ETL & data transformation
Schema-on-read (tables declared in the MetaStore) | Flexible schema (optional, can be declared at LOAD time)
Runs mostly batch queries | Supports complex data flows

Anatomy of Pig

The internal structure and working of Pig include the following components:

Pig Latin Script

A sequence of data operations like load → transform → group → store


Example:

data = LOAD '[Link]' USING PigStorage(',') AS (id:int, name:chararray);

grp = GROUP data BY id;

STORE grp INTO 'output';
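
To show what the GROUP step in this script produces, here is a small Python sketch of the same idea. It is not Pig; the sample rows are made up. Each output tuple pairs a key with a bag of all input rows that share it.

```python
from collections import defaultdict

# Rows as the LOAD step might produce them: (id, name) tuples. Sample values are made up.
data = [(1, "asha"), (2, "ravi"), (1, "meera")]


def group_by(relation, key_index):
    """Mimic GROUP ... BY: one output tuple per key, holding a bag of the matching rows."""
    bags = defaultdict(list)
    for row in relation:
        bags[row[key_index]].append(row)
    # Each result is (group, bag); the order of tuples inside a bag is not guaranteed in Pig.
    return [(key, bag) for key, bag in sorted(bags.items())]


grp = group_by(data, 0)
for group, bag in grp:
    print(group, bag)
```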

Pig Execution Modes

• Local Mode → Runs on single JVM (for testing)

• MapReduce Mode → Default mode, executes using Hadoop MapReduce

• Tez/Spark Mode → Faster execution (optional plugins)

Pig Execution Framework

Steps followed internally:

1. Parse the script

2. Logical Plan is created (sequence of operations)

3. Optimization (removal of redundant processing)

4. Physical Plan generation

5. Convert to MapReduce jobs

6. Execute on Hadoop cluster

Pig Components

Component | Description
Pig Latin | Language to write scripts
Parser | Verifies syntax
Optimizer | Improves performance
Execution Engine | Converts script into Hadoop jobs

Data Model in Pig

• Atom (single value)

• Tuple (ordered set of fields)

• Bag (collection of tuples)

• Map (key-value pairs)

Pig on Hadoop

Apache Pig is designed to work efficiently on top of the Hadoop ecosystem. It uses
Hadoop’s distributed storage and processing framework to manage and analyze
large datasets.

How Pig Works on Hadoop

1. Pig scripts written in Pig Latin are submitted for execution.

2. Pig converts the script into a series of MapReduce jobs automatically.

3. These jobs run on Hadoop, using:

o HDFS for data storage

o MapReduce / Tez / Spark for execution

Benefits

• No need to write complex Java MapReduce programs

• Better performance through automatic optimization

• Handles large-scale structured/semi-structured data

Use Case for Pig

Apache Pig is primarily used in data preparation and transformation tasks in Big
Data pipelines.

Major Use Cases


Use Case | Explanation
ETL (Extract-Transform-Load) | Clean, filter, transform raw data before loading to warehouse
Log Processing | Analyze web logs, clickstreams, social media data
Data Integration | Combining data from multiple sources
Data Pipeline for Machine Learning | Pre-process raw datasets
Handling Unstructured Data | XML, JSON, text analytics

Industry Examples

• Yahoo! has used Pig to analyze user click data

• E-commerce recommendation systems

• Fraud detection data pipelines

ETL Processing in Pig

Pig is widely used as an ETL tool for preparing data for BI and analytics.

ETL Phases in Pig

Phase | Pig Feature Used
Extract | LOAD data from HDFS/Local/NoSQL
Transform | FILTER, GROUP, JOIN, FLATTEN, user-defined functions
Load | STORE results back to HDFS/Hive/HBase

Example ETL Script

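-- Extract: read comma-separated records and declare a schema
raw_data = LOAD '/user/data/[Link]' USING PigStorage(',')
    AS (id:int, product:chararray, price:int);

-- Transform: keep only rows priced above 100
filtered = FILTER raw_data BY price > 100;

-- Load: write the result back to HDFS as comma-separated text
STORE filtered INTO '/user/output/high_price_sales' USING PigStorage(',');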
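
For readers who want to trace the three phases on a small input, the same extract, filter, store flow can be sketched in plain Python. This is an analogy under stated assumptions, not how Pig executes the script: the input is an in-memory string with made-up rows instead of a file in HDFS.

```python
import csv
import io

# Stand-in for the input file; the real script reads from HDFS.
SOURCE = "1,pen,20\n2,laptop,900\n3,phone,600\n"


def extract(text):
    """LOAD: parse comma-separated lines into typed (id, product, price) tuples."""
    return [(int(i), product, int(price)) for i, product, price in csv.reader(io.StringIO(text))]


def transform(rows):
    """FILTER: keep rows whose price is greater than 100."""
    return [row for row in rows if row[2] > 100]


def load(rows):
    """STORE: write the surviving rows back out as comma-separated text."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


result = load(transform(extract(SOURCE)))
print(result)
```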


Data Types in Pig

Pig supports various data types suitable for semi-structured and complex data.

Simple (Scalar) Types

• int – integer

• long – large integer

• float – single precision number

• double – double precision number

• chararray – string

• bytearray – raw bytes

• boolean – true/false

Complex Types

Type | Description | Example
Tuple | Ordered set of fields | (1, 'John', 5000)
Bag | Collection of tuples | {(1,'A'), (2,'B')}
Map | Key-value pairs | ['name'#'John','age'#25]

These allow Pig to handle nested and flexible data models.
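
The nesting these types allow can be pictured with ordinary Python values. The sketch below is only an analogy for the Pig data model, and all sample values are made up.

```python
# Atom: a single value.
age = 25

# Tuple: an ordered set of fields.
record = (1, "John", 5000)

# Bag: a collection of tuples; a list is used here because a bag may hold duplicates.
bag = [(1, "A"), (2, "B"), (1, "A")]

# Map: key-value pairs with string keys.
profile = {"name": "John", "age": 25}

# Nesting: a tuple whose second field is a bag, as produced by grouping.
grouped = ("electronics", [(2, "laptop", 900), (3, "phone", 600)])

total = sum(price for _, _, price in grouped[1])
print(record[1], len(bag), profile["name"], total)
```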

Running Pig

Pig is most commonly run in one of two execution modes:

Local Mode

• Runs on a single JVM

• Used for testing small data

• Command:

pig -x local

MapReduce Mode (Default)

• Runs on a full Hadoop cluster

• Suitable for large datasets

• Command:

pig

This is equivalent to pig -x mapreduce.

Ways to Run Pig Scripts

Method | Description
Grunt Shell | Interactive execution of Pig commands
Script File | Write commands in .pig file → execute using: pig [Link]
Embedded Pig | Pig Latin inside Java programs

Quick Summary Table

Topic | Key Concept
Pig on Hadoop | Converts Pig Latin → MapReduce jobs on HDFS
Use Case | Best for ETL, log processing, data cleansing
ETL Processing | LOAD → TRANSFORM → STORE
Pig Data Types | Scalar + Complex (Tuple, Bag, Map)
Running Pig | Local mode & MapReduce mode, grunt shell, scripts

Execution Model of Pig (Extended)

The Pig execution model describes the internal process of how a Pig Latin script gets executed in a Hadoop environment. Pig acts like a data flow engine and converts high-level operations into distributed processing tasks.

Stages of Execution

1. Parsing

o Syntax of the Pig Latin script is checked.

o Schema validation is performed.

o A Logical Plan (operator-based flow) is created.

2. Logical Optimization

o Logical plan is optimized for efficiency.

o Common optimizations:

▪ Push down FILTER and LIMIT operators near LOAD


▪ Combine multiple FOREACH into a single operation

o No Hadoop interaction yet, only compiler-level improvements.

3. Physical Plan Generation

o Logical plan is converted into a Physical Execution Plan

o Breaks the tasks into map and reduce phases

4. Compilation into MapReduce Jobs

o Physical plan → One or multiple MapReduce jobs

o Each transformation is translated to a job or set of jobs

5. Execution

o Jobs are submitted to Hadoop YARN

o Data is processed in parallel using:

▪ Map Tasks — scanning and filtering

▪ Reduce Tasks — grouping and aggregation

6. Result Storage

o Output stored back in HDFS / local system / or displayed via DUMP
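
The split of work between map and reduce tasks in the execution stage can be sketched in a few lines of Python. This is a single-process toy, not Hadoop; the records are made up, and it corresponds roughly to a script that filters rows, groups them by product, and sums the prices.

```python
from collections import defaultdict

# Made-up input records: (product, price).
records = [("pen", 20), ("laptop", 900), ("pen", 150), ("phone", 600), ("laptop", 300)]


def map_phase(rows):
    """Map task: scan rows, drop those failing the filter, emit (key, value) pairs."""
    for product, price in rows:
        if price > 100:
            yield product, price


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce task: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


totals = reduce_phase(shuffle(map_phase(records)))
print(totals)
```

On a real cluster many map tasks run in parallel over different blocks of the input, and the shuffle moves each key's values to the reducer responsible for it.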

Conclusion: Pig provides an easy abstraction over MapReduce and enables large-scale data processing without complex Java coding.

Operators in Pig (Extended)

Pig operators perform data manipulation tasks in the data flow pipeline.

Common Categories

Category | Examples | Purpose
Loading/Storage | LOAD, STORE, DUMP | Ingest and save output
Transformation | FOREACH, FILTER, FLATTEN (used inside FOREACH) | Modify structure or filter data
Grouping/Joining | GROUP, COGROUP, JOIN, CROSS | Combine or group datasets
Ordering | ORDER BY, DISTINCT, LIMIT | Sorting & data reduction
Set Operators | UNION, SPLIT | Merge or divide relations


Important Note: Pig operators work on relations (data tables) and are designed for
pipelined execution.
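
Pipelined execution over relations can be imitated with Python generators, where each operator consumes tuples from the previous one. The function names below (pig_filter, pig_foreach, pig_distinct, pig_limit) are hypothetical helpers written for this sketch, not part of Pig, and the sample data is made up.

```python
from itertools import islice

# A relation modeled as an iterable of tuples (id, product, price); values are made up.
sales = [(1, "pen", 20), (2, "laptop", 900), (3, "phone", 600), (4, "laptop", 900)]


def pig_filter(relation, predicate):
    """FILTER: pass through only the tuples that satisfy the predicate."""
    return (row for row in relation if predicate(row))


def pig_foreach(relation, generate):
    """FOREACH ... GENERATE: build a new tuple from each input tuple."""
    return (generate(row) for row in relation)


def pig_distinct(relation):
    """DISTINCT: drop duplicate tuples (this sketch keeps first occurrences)."""
    seen = set()
    for row in relation:
        if row not in seen:
            seen.add(row)
            yield row


def pig_limit(relation, n):
    """LIMIT: keep at most n tuples."""
    return islice(relation, n)


expensive = pig_filter(sales, lambda row: row[2] > 100)
projected = pig_foreach(expensive, lambda row: (row[1], row[2]))
result = list(pig_limit(pig_distinct(projected), 2))
print(result)
```

No operator materializes its full output; tuples flow through the chain one at a time until the final list is built.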

Functions in Pig (Extended)

Functions are used to operate on fields, tuples, bags, or maps.

Built-in Functions

• String Functions: LOWER(), UPPER(), TRIM(), SUBSTRING(), CONCAT()

• Mathematical Functions: ABS(), ROUND(), SQRT(), RANDOM()

• Bag/Collection Functions: SIZE(), TOKENIZE(), COUNT(), MAX(), MIN()

• Conditional Expressions: DECODE(), and the bincond operator (condition ? value1 : value2)

UDF (User Defined Function)

Used when built-in functions are insufficient.

• Supports multiple languages:

o Java (primary), Python, JavaScript, Ruby

• Registered using:

REGISTER [Link];

Example

REGISTER [Link];

data = LOAD 'input' AS (line:chararray);

result = FOREACH data GENERATE myUDF(line);

DUMP result;

UDFs improve flexibility and allow customized business logic.
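
Since Python is among the supported UDF languages, a UDF can be as small as the sketch below. The function name clean_line, its output field name, and the file name udfs.py are hypothetical. The sketch assumes Pig's pig_util module is importable when the file runs under Pig, and falls back to a local decorator so the function can also be run outside Pig. In a Pig script, such a file is typically registered with a statement like REGISTER 'udfs.py' USING jython AS myfuncs; and then called as myfuncs.clean_line(line).

```python
try:
    # Assumption: Pig's pig_util module is importable when this file runs under Pig.
    from pig_util import outputSchema
except ImportError:
    # Fallback so the function can be run and tested outside Pig.
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("cleaned:chararray")
def clean_line(line):
    """Trim whitespace and lower-case a line; pass NULL (None) through unchanged."""
    if line is None:
        return None
    return line.strip().lower()


print(clean_line("  Hello Pig  "))
```

The decorator tells Pig the name and type of the returned field; the function body itself is ordinary Python.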

Data Types in Pig (Extended)


Pig supports both scalar (simple) and complex data types.

Scalar Data Types

Type | Example | Use
int | 25 | 32-bit integer
long | 9823479234 | 64-bit integer
float/double | 10.59 | Floating-point number
chararray | "Hello" | Text data
bytearray | Binary blob | Raw data
boolean | TRUE | Logical

Complex Data Types

Type | Description | Example | Use
Tuple | Ordered fields | (101, 'John') | Row of data (record)
Bag | Collection of tuples | {(1,'A'),(2,'B')} | Unordered dataset
Map | Key-value structure | ['age'#25] | JSON-like data

These allow Pig to process semi-structured data like logs, XML, and JSON.
