Introduction to Hive and Pig Overview

Hive is a data warehousing and analytics tool built on Hadoop, allowing users to query large datasets with an SQL-like language called HiveQL. It features batch processing, schema on read, and easy integration with BI tools, making it suitable for log analytics and data summarization. Apache Pig, on the other hand, is a high-level data processing platform using Pig Latin for ETL tasks, designed to simplify data transformation and handle structured and unstructured data.

Uploaded by

udhruv444

Introduction to Hive

Hive is a data warehousing and analytics tool built on top of Hadoop.

It enables users to query and analyze large datasets stored in HDFS using an SQL-like
language called HiveQL.

Key Features of Hive

• Designed for data analysts

• Uses HiveQL (SQL-like syntax)

• Converts queries into MapReduce / Tez / Spark jobs

• Supports batch processing

• Supports schema on read

• Handles large structured and semi-structured data
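Schema on read can be illustrated outside Hive with a small plain-Python sketch. The sample data, column names and the `read_with_schema` helper are assumptions made for illustration; Hive applies a table's schema through its own reader layer, which this does not reproduce. The point is only that the stored text carries no types, and values that do not fit the schema surface as NULL (here `None`) at read time.

```python
import csv
import io

# Raw data as it might sit in a file: no schema is stored with it.
RAW = "1,alice,30\n2,bob,not_a_number\n3,carol\n"

# A schema is just (column name, converter) pairs chosen by the reader.
SCHEMA = [("id", int), ("name", str), ("age", int)]


def read_with_schema(text, schema):
    """Apply the schema while reading; bad or missing fields become None."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        row = {}
        for i, (name, convert) in enumerate(schema):
            try:
                row[name] = convert(fields[i])
            except (IndexError, ValueError):
                row[name] = None  # plays the role of NULL
        rows.append(row)
    return rows


rows = read_with_schema(RAW, SCHEMA)
```

Changing `SCHEMA` and reading again reinterprets the same stored text without rewriting it, which is the practical benefit of schema on read.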

Why Hive?

• Faster development than writing MapReduce code by hand

• Easy integration with BI tools (via JDBC/ODBC)

• Suitable for:

o Log analytics

o Data summarization & aggregation

o Offline batch data processing

Hive Architecture

The architecture of Hive includes the following components:

User Interface

• CLI (Command Line Interface)

• Hive Web UI

• JDBC / ODBC connections

Used to submit HiveQL queries.

Driver
• Manages the entire query lifecycle

• Handles:

o Parsing

o Query compilation

o Optimization

o Execution plan creation

• Coordinates execution with Execution Engine

Compiler

• Converts HiveQL queries into:

o Logical execution plan → DAG

o Physical plan → MapReduce/Tez/Spark jobs

MetaStore

• Stores metadata about:

o Tables, schema, columns

o Data location in HDFS

o Partition information

• Usually backed by MySQL/PostgreSQL
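The kind of information the MetaStore holds can be pictured with a toy in-memory catalog. This is a sketch under stated assumptions: the `TableMeta` class, the helper functions, the `logs` table and the `/warehouse/logs` path are all hypothetical, and the real MetaStore is a service backed by a relational database. The `key=value` directory naming does follow Hive's partition layout.

```python
from dataclasses import dataclass, field


@dataclass
class TableMeta:
    columns: dict    # column name -> type name
    location: str    # base directory of the table data
    partition_keys: list = field(default_factory=list)
    partitions: dict = field(default_factory=dict)  # partition values -> directory


metastore = {}


def register_table(name, meta):
    metastore[name] = meta


def add_partition(table, values):
    meta = metastore[table]
    # Partitions are laid out as key=value subdirectories of the table location.
    sub = "/".join(f"{k}={v}" for k, v in zip(meta.partition_keys, values))
    meta.partitions[tuple(values)] = f"{meta.location}/{sub}"


def dirs_to_scan(table, **wanted):
    """Return only the partition directories matching the requested values."""
    meta = metastore[table]
    out = []
    for values, path in meta.partitions.items():
        row = dict(zip(meta.partition_keys, values))
        if all(row[k] == v for k, v in wanted.items()):
            out.append(path)
    return sorted(out)


register_table(
    "logs",
    TableMeta(
        columns={"url": "STRING", "status": "INT"},
        location="/warehouse/logs",
        partition_keys=["dt"],
    ),
)
add_partition("logs", ["2024-01-01"])
add_partition("logs", ["2024-01-02"])
scan = dirs_to_scan("logs", dt="2024-01-01")
```

Because partition locations are recorded as metadata, a query restricted to one date only needs to read one directory instead of the whole table.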

Execution Engine

• Executes tasks in the cluster via:

o MapReduce (the default engine in earlier Hive versions)

o Tez or Spark (generally faster engines)

• Coordinates with YARN for resources

HDFS

• Storage layer for actual table data

Hive vs RDBMS (Short Point)

• Hive is built for read-heavy big data analytics

• It is not suitable for OLTP workloads (frequent, low-latency row-level updates)

Hive Data Types

Hive supports primitive and complex data types.

Primitive Data Types

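| Type | Example | Description |
| --- | --- | --- |
| INT | 20 | Integer value |
| BIGINT | 922337 | Large (64-bit) integers |
| FLOAT, DOUBLE | 3.14 | Floating-point numbers |
| STRING | "Hello" | Text data |
| BOOLEAN | TRUE/FALSE | Logical value |
| TIMESTAMP, DATE | 2024-01-01 | Date and time values |
| BINARY | bytes | Raw binary data |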

Complex Data Types

| Type | Example | Description |
| --- | --- | --- |
| ARRAY | ["A","B"] | Ordered elements of the same type |
| MAP | {'id':123} | Key-value pairs |
| STRUCT | {name:"Ashwin", age:21} | Record with multiple named fields |
| UNIONTYPE | Tagged union | Holds a value of exactly one of several declared types |
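For readers who think in a general-purpose language, the first three complex types map naturally onto familiar Python structures. This is only an analogy for intuition; the `Person` class and variable names are made up for the example.

```python
from typing import NamedTuple


class Person(NamedTuple):      # STRUCT: a fixed set of named fields
    name: str
    age: int


tags = ["A", "B"]              # ARRAY: ordered, one element type
attrs = {"id": 123}            # MAP: key-value pairs
person = Person("Ashwin", 21)  # STRUCT value

# HiveQL reaches into these values as tags[0], attrs['id'] and person.name.
first_tag = tags[0]
id_value = attrs["id"]
name = person.name
```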

Additional Hive Data Concepts

• NULL represents missing values

• Supports type casting and conversion

• Column data types declared on a table give structure to the data stored in HDFS

Hive Query Language (HQL)

Hive Query Language (HQL) is a SQL-like language used in Apache Hive to manage and
analyze large datasets stored in Hadoop HDFS. Since many data analysts know SQL,
HQL makes it easy to work with distributed data without writing Java MapReduce
programs.

Characteristics of HQL

1. SQL-Like Syntax – Similar to traditional RDBMS SQL, easy learning curve.

2. Schema on Read – Structure is applied at query time, which supports semi-structured data.

3. Batch Processing – Queries are converted to MapReduce/Tez/Spark jobs internally.

4. High Scalability – Designed to run on distributed storage like HDFS.

5. Supports UDFs – User-defined functions for custom processing.

Common HQL Operations

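| Category | Example |
| --- | --- |
| DDL | CREATE TABLE, ALTER TABLE, DROP TABLE, SHOW TABLES |
| DML | LOAD DATA, INSERT, UPDATE, DELETE (UPDATE and DELETE require transactional/ACID tables) |
| Query/Analysis | SELECT, WHERE, ORDER BY, GROUP BY, JOIN, LIMIT |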

Workflow of HQL Query Execution

1. User submits HQL query.

2. Query is compiled and optimized.

3. Execution plan is created.

4. MapReduce/Spark/Tez jobs run on Hadoop.

5. Results are returned to client.
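To make step 4 more concrete, here is a plain-Python sketch of how a query such as `SELECT status, COUNT(*) FROM logs GROUP BY status` could be carried out as a map phase, a shuffle and a reduce phase. The `logs` table, its rows and the function names are hypothetical, and the jobs Hive actually generates are considerably more involved.

```python
from collections import defaultdict

# Hypothetical rows of a table logs(status, url).
rows = [("200", "/a"), ("404", "/b"), ("200", "/c"), ("500", "/d"), ("200", "/e")]


def map_phase(rows):
    # Emit (group key, 1) for every input row.
    for status, _url in rows:
        yield status, 1


def shuffle(pairs):
    # Bring all values for the same key together, as happens between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    # Sum the ones for each key to obtain COUNT(*).
    return {key: sum(values) for key, values in grouped.items()}


counts = reduce_phase(shuffle(map_phase(rows)))
```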

Use Cases

• Data Warehousing

• Log and Clickstream analysis

• Business Intelligence reporting

Introduction to Pig

Apache Pig is a high-level data processing platform developed by Yahoo for Hadoop. It
uses a scripting language called Pig Latin to process large datasets easily.

Why Pig?

• Reduces complexity compared to writing Java MapReduce

• Suitable for ETL (Extract-Transform-Load) pipelines

• Handles structured, semi-structured, and unstructured data

Pig vs Hive

| Hive | Pig |
| --- | --- |
| SQL-like HQL | Scripting language (Pig Latin) |
| Best for analytics/reporting | Best for ETL & data transformation |
| Schema-on-read (tables are declared in the MetaStore) | Flexible schema (declaring one is optional) |
| Runs mostly batch queries | Supports complex data flows |

Anatomy of Pig

The internal structure and working of Pig include the following components:

Pig Latin Script

A sequence of data operations like load → transform → group → store

Example:
data = LOAD '[Link]' USING PigStorage(',') AS (id:int, name:chararray);

grp = GROUP data BY id;

STORE grp INTO 'output';
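The shape of what GROUP produces is easy to miss: each output record is a pair of the group key and a bag holding the complete original tuples. The following plain-Python sketch mimics that shape. The sample tuples and the `group_by` helper are illustrative assumptions, and the result is sorted here only to make it deterministic; Pig does not guarantee an order.

```python
from collections import defaultdict

# Stand-in for the loaded relation: tuples of (id, name).
data = [(1, "ann"), (2, "bob"), (1, "cat")]


def group_by(relation, key_index):
    """Return (group, bag) pairs, where each bag holds the full original tuples."""
    bags = defaultdict(list)
    for tup in relation:
        bags[tup[key_index]].append(tup)
    return sorted(bags.items())


grp = group_by(data, 0)
```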

Pig Execution Modes

• Local Mode → Runs on single JVM (for testing)

• MapReduce Mode → Default mode, executes using Hadoop MapReduce

• Tez/Spark Mode → Faster execution (optional plugins)

Pig Execution Framework

Steps followed internally:

1. Parse the script

2. Logical Plan is created (sequence of operations)

3. Optimization (removal of redundant processing)

4. Physical Plan generation

5. Convert to MapReduce jobs

6. Execute on Hadoop cluster

Pig Components

| Component | Description |
| --- | --- |
| Pig Latin | Language to write scripts |
| Parser | Verifies syntax |
| Optimizer | Improves performance |
| Execution Engine | Converts script into Hadoop jobs |

Data Model in Pig

• Atom (single value)

• Tuple (ordered set of fields)

• Bag (collection of tuples)

• Map (key-value pairs)

Pig on Hadoop

Apache Pig is designed to work efficiently on top of the Hadoop ecosystem. It uses
Hadoop’s distributed storage and processing framework to manage and analyze
large datasets.

How Pig Works on Hadoop

1. Pig scripts written in Pig Latin are submitted for execution.

2. Pig converts the script into a series of MapReduce jobs automatically.

3. These jobs run on Hadoop, using:

o HDFS for data storage

o MapReduce / Tez / Spark for execution

Benefits

• No need to write complex Java MapReduce programs

• Better performance through automatic optimization

• Handles large-scale structured/semi-structured data

Use Case for Pig

Apache Pig is primarily used in data preparation and transformation tasks in Big
Data pipelines.

Major Use Cases

| Use Case | Explanation |
| --- | --- |
| ETL (Extract-Transform-Load) | Clean, filter, transform raw data before loading to warehouse |
| Log Processing | Analyze web logs, clickstreams, social media data |
| Data Integration | Combining data from multiple sources |
| Data Pipeline for Machine Learning | Pre-process raw datasets |
| Handling Unstructured Data | XML, JSON, text analytics |

Industry Examples

• Yahoo! using Pig for analyzing user click data

• E-commerce recommendation systems

• Fraud detection data pipelines

ETL Processing in Pig

Pig is widely used as an ETL tool for preparing data for BI and analytics.

ETL Phases in Pig

| Phase | Pig Feature Used |
| --- | --- |
| Extract | LOAD data from HDFS/Local/NoSQL |
| Transform | FILTER, GROUP, JOIN, FLATTEN, user-defined functions |
| Load | STORE results back to HDFS/Hive/HBase |

Example ETL Script

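-- Extract: read comma-separated records and declare a schema
raw_data = LOAD '/user/data/[Link]' USING PigStorage(',')
    AS (id:int, product:chararray, price:int);

-- Transform: keep only rows with a price above 100
filtered = FILTER raw_data BY price > 100;

-- Load: write the result back as comma-separated text
STORE filtered INTO '/user/output/high_price_sales' USING PigStorage(',');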
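For comparison, the same extract, filter and store steps can be written in plain Python over a small in-memory sample. This is a single-machine sketch: the sample rows and function names are invented for illustration, and it says nothing about how Pig distributes the work across a cluster.

```python
import csv
import io

# Inline sample standing in for the input file: id, product, price.
SAMPLE = "1,pen,20\n2,phone,500\n3,bag,150\n"


def extract(text):
    # Parse each line and apply the (id:int, product:chararray, price:int) schema.
    for id_, product, price in csv.reader(io.StringIO(text)):
        yield int(id_), product, int(price)


def transform(records, min_price=100):
    # Same condition as FILTER raw_data BY price > 100.
    return [r for r in records if r[2] > min_price]


def load(records):
    # Serialize back to comma-separated text, as PigStorage(',') would on STORE.
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(records)
    return out.getvalue()


result = load(transform(extract(SAMPLE)))
```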

Data Types in Pig

Pig supports various data types suitable for semi-structured and complex data.

Simple (Scalar) Types

• int – integer

• long – large integer

• float – single precision number

• double – double precision number

• chararray – string

• bytearray – raw bytes

• boolean – true/false

Complex Types

| Type | Description | Example |
| --- | --- | --- |
| Tuple | Ordered set of fields | (1, 'John', 5000) |
| Bag | Collection of tuples | {(1,'A'), (2,'B')} |
| Map | Key-value pairs | ['name'#'John','age'#25] |

These allow Pig to handle nested and flexible data models.
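Nesting is where these types become useful. The sketch below models a tuple, a bag and a map with Python's built-in types, then shows a record that contains a bag, which is the shape GROUP produces. The values and variable names are illustrative only, and a Python list is ordered whereas a Pig bag is not.

```python
record = (1, "John", 5000)            # tuple: ordered fields
bag = [(1, "A"), (2, "B")]            # bag: collection of tuples
props = {"name": "John", "age": 25}   # map: chararray keys to values

# A nested record: a group key together with a bag of full tuples.
nested = ("sales", [(1, "John", 5000), (2, "Mary", 6000)])

# Aggregating over the inner bag, similar in spirit to SUM over a grouped field.
total = sum(t[2] for t in nested[1])

# Flattening the bag yields one output row per inner tuple, repeating the key.
flattened = [(nested[0],) + t for t in nested[1]]
```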

Running Pig

Pig is most commonly run in one of two execution modes (Tez and Spark modes, mentioned earlier, are optional alternatives):

Local Mode

• Runs on a single JVM

• Used for testing small data

• Command:

pig -x local

MapReduce Mode (Default)

• Runs on a full Hadoop cluster

• Suitable for large datasets

• Command:

pig

Ways to Run Pig Scripts

| Method | Description |
| --- | --- |
| Grunt Shell | Interactive execution of Pig commands |
| Script File | Write commands in a .pig file, then execute using: pig [Link] |
| Embedded Pig | Pig Latin inside Java programs |
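When Pig scripts are launched from another program, it helps to build the command line in one place. The helper below is a hypothetical convenience wrapper, not part of Pig: it only assembles the `pig -x <mode> [script]` invocation described above and does not run it. The script name `etl.pig` is a placeholder.

```python
import shlex

VALID_MODES = ("local", "mapreduce")


def pig_command(script=None, mode="mapreduce"):
    """Build the argv list for launching Pig in the given execution mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    argv = ["pig", "-x", mode]
    if script is not None:
        argv.append(script)  # without a script, Pig opens the Grunt shell
    return argv


cmd = pig_command("etl.pig", mode="local")
printable = shlex.join(cmd)
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where the `pig` launcher is on the PATH.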

Quick Summary Table

| Topic | Key Concept |
| --- | --- |
| Pig on Hadoop | Converts Pig Latin → MapReduce jobs on HDFS |
| Use Case | Best for ETL, log processing, data cleansing |
| ETL Processing | LOAD → TRANSFORM → STORE |
| Pig Data Types | Scalar + Complex (Tuple, Bag, Map) |
| Running Pig | Local mode & MapReduce mode, grunt shell, scripts |

Execution Model of Pig (Extended)

The Pig execution model describes the internal process of how a Pig Latin script gets
executed in a Hadoop environment. Pig acts like a data flow engine and converts high-
level operations into distributed processing tasks.

Stages of Execution

1. Parsing

o Syntax of the Pig Latin script is checked.

o Schema validation is performed.

o A Logical Plan (operator-based flow) is created.

2. Logical Optimization

o Logical plan is optimized for efficiency.

o Common optimizations:

▪ Push down FILTER and LIMIT operators near LOAD

▪ Combine multiple FOREACH into a single operation

o No Hadoop interaction yet, only compiler-level improvements.

3. Physical Plan Generation

o Logical plan is converted into a Physical Execution Plan

o Breaks the tasks into map and reduce phases

4. Compilation into MapReduce Jobs

o Physical plan → One or multiple MapReduce jobs

o Each transformation is translated to a job or set of jobs

5. Execution

o Jobs are submitted to Hadoop YARN

o Data is processed in parallel using:

▪ Map Tasks — scanning and filtering

▪ Reduce Tasks — grouping and aggregation

6. Result Storage

o Output stored back in HDFS / local system / or displayed via DUMP
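The logical optimization stage can be illustrated with a toy plan rewriter. In this sketch a plan is a list of `(operator, argument)` steps, and a single rule moves a FILTER ahead of an ORDER so that fewer rows are sorted. The plan representation, the `run` interpreter and `push_filters_up` are all assumptions for illustration and do not reflect Pig's internal classes. Note that the rule deliberately does not move a FILTER across a LIMIT, because that would change the result.

```python
def run(plan, rows):
    """Execute a tiny linear plan of (operator, argument) steps over dict rows."""
    for op, arg in plan:
        if op == "FILTER":
            rows = [r for r in rows if arg(r)]
        elif op == "ORDER":
            rows = sorted(rows, key=lambda r: r[arg])
        elif op == "LIMIT":
            rows = rows[:arg]
    return rows


def push_filters_up(plan):
    """Move each FILTER ahead of a directly preceding ORDER; never cross a LIMIT."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(1, len(plan)):
            if plan[i][0] == "FILTER" and plan[i - 1][0] == "ORDER":
                plan[i - 1], plan[i] = plan[i], plan[i - 1]
                changed = True
    return plan


def expensive(row):
    return row["price"] > 100


rows = [{"id": 1, "price": 300}, {"id": 2, "price": 50}, {"id": 3, "price": 120}]
plan = [("ORDER", "price"), ("FILTER", expensive)]
optimized = push_filters_up(plan)
```

Both plans return the same rows; the optimized one simply sorts two rows instead of three, which is the kind of saving that matters at cluster scale.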

Conclusion: Pig provides an easy abstraction over MapReduce and enables large-
scale data processing without complex Java coding.

Operators in Pig (Extended)

Pig operators perform data manipulation tasks in the data flow pipeline.

Common Categories

| Category | Examples | Purpose |
| --- | --- | --- |
| Loading/Storage | LOAD, STORE, DUMP | Ingest and save output |
| Transformation | FOREACH ... GENERATE, FILTER, FLATTEN | Modify structure or filter data |
| Grouping/Joining | GROUP, COGROUP, JOIN, CROSS | Combine or group datasets |
| Ordering | ORDER BY, DISTINCT, LIMIT | Sorting & data reduction |
| Set Operators | UNION, SPLIT | Merge or divide relations |

Important Note: Pig operators work on relations (data tables) and are designed for
pipelined execution.
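COGROUP and JOIN are easy to confuse, so here is a plain-Python sketch of their relationship. COGROUP collects the matching tuples of each relation into separate bags per key, and an inner JOIN corresponds to flattening both bags, so keys with an empty bag on either side drop out. The sample relations and function names are made up for the example.

```python
from collections import defaultdict

users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, "pen"), (1, "ink"), (3, "cup"), (4, "hat")]


def cogroup(left, right):
    """Return (key, left_bag, right_bag) for every key seen in either relation."""
    bags = defaultdict(lambda: ([], []))
    for t in left:
        bags[t[0]][0].append(t)
    for t in right:
        bags[t[0]][1].append(t)
    return [(k, l, r) for k, (l, r) in sorted(bags.items())]


def inner_join(left, right):
    """Flatten both bags of each group; an empty bag contributes no rows."""
    return [lt + rt for _k, l, r in cogroup(left, right) for lt in l for rt in r]


grouped = cogroup(users, orders)
joined = inner_join(users, orders)
```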

Functions in Pig (Extended)

Functions are used to operate on fields, tuples, bags, or maps.

Built-in Functions

• String Functions:
LOWER(), UPPER(), TRIM(), SUBSTRING(), CONCAT()

• Mathematical Functions:
ABS(), ROUND(), SQRT(), RANDOM()

• Aggregate and Bag Functions:
COUNT(), SUM(), AVG(), MAX(), MIN(), SIZE(), TOKENIZE()

• Conditional Expressions:
(condition ? value1 : value2), CASE ... WHEN ... THEN ... END, and DECODE() from the Piggybank library

UDF (User Defined Function)

Used when built-in functions are insufficient.

• Supports multiple languages:

o Java (primary), Python, JavaScript, Ruby

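• Registered using the REGISTER statement

Example

-- 'myudfs.jar' and myUDF are placeholder names for a user-supplied jar and function
REGISTER 'myudfs.jar';

data = LOAD 'input' AS (line:chararray);

result = FOREACH data GENERATE myUDF(line);

DUMP result;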

UDFs improve flexibility and allow customized business logic.
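Since Python is one of the supported UDF languages, a minimal Python UDF might look like the sketch below. The function name `normalize` and its schema string are hypothetical. The `pig_util` import is how the Pig documentation describes obtaining the `outputSchema` decorator for `streaming_python` UDFs; treat that detail as an assumption to verify against your Pig version. The fallback stub exists only so the file can be imported and tested outside Pig.

```python
try:
    # Expected to be available when Pig runs the file as a Python UDF.
    from pig_util import outputSchema
except ImportError:
    # Fallback stub (an assumption for local testing, not part of Pig).
    def outputSchema(schema):
        def decorator(func):
            func.output_schema = schema
            return func
        return decorator


@outputSchema("clean:chararray")
def normalize(line):
    """Trim and lowercase a line; pass NULL (None) through unchanged."""
    if line is None:
        return None
    return line.strip().lower()
```

Saved as, say, `udfs.py` (a placeholder name), such a file would typically be registered with `REGISTER 'udfs.py' USING jython AS myfuncs;` and then called as `myfuncs.normalize(line)` inside a FOREACH ... GENERATE.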

Data Types in Pig (Extended)

Pig supports both scalar (simple) and complex data types.

Scalar Data Types

| Type | Example | Use |
| --- | --- | --- |
| int | 25 | 32-bit integer |
| long | 9823479234 | 64-bit integer |
| float/double | 10.59 | Floating-point number |
| chararray | "Hello" | Text data |
| bytearray | Binary blob | Raw data |
| boolean | TRUE | Logical value |

Complex Data Types

| Type | Description | Example | Use |
| --- | --- | --- | --- |
| Tuple | Ordered fields | (101, 'John') | Row of data (record) |
| Bag | Collection of tuples | {(1,'A'),(2,'B')} | Unordered dataset |
| Map | Key-value structure | ['age'#25] | JSON-like data |

These allow Pig to process semi-structured data like logs, XML, and JSON.
