Hadoop Ecosystem: Hive & Pig Overview

The document provides an analysis of the Hadoop ecosystem, focusing on Apache Hive and Apache Pig. Hive is a distributed data warehouse system designed for large-scale analytics using SQL, while Pig is a platform that simplifies data manipulation through a high-level language called Pig Latin. Both tools are built on Hadoop but have distinct features and use cases, with Hive being more suited for data warehousing and Pig for data flow analysis.

Uploaded by

Ishan Sharma

Hadoop Ecosystem-Analysis

Apache Hive (SQL Query)

Pig (Scripting)

Big Data Analytics 1

What is Hive?
• Apache Hive is a distributed, fault-tolerant data warehouse system that
enables analytics at a massive scale.
• The Hive Metastore (HMS) provides a central repository of metadata that can
easily be analyzed to make informed, data-driven decisions, which makes
it a critical component of many data lake architectures.
• Hive is built on top of Apache Hadoop and supports storage on S3, ADLS, GS,
etc. through HDFS.
• Hive allows users to read, write, and manage petabytes of data using SQL.
• Hive is not designed for online transaction processing. It is best used for
traditional data warehousing tasks.
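
As a rough illustration of the warehouse-style SQL described above, the sketch below runs a scan-and-aggregate query. It uses Python's built-in `sqlite3` purely as a local stand-in: Hive would run a comparable statement as a distributed batch job. The table `page_views` and its rows are invented for the example.

```python
import sqlite3

# sqlite3 is only a local stand-in; it is not Hive and is not distributed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (country TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("IN", 120), ("US", 80), ("IN", 30), ("DE", 45)],  # made-up sample rows
)

# A typical analytical query: scan every row, aggregate, then sort.
QUERY = """
    SELECT country, SUM(views) AS total_views
    FROM page_views
    GROUP BY country
    ORDER BY total_views DESC
"""
totals = conn.execute(QUERY).fetchall()
print(totals)
```

Queries of this shape (full scans with grouping) are the batch workload Hive targets, as opposed to the single-row reads and writes of online transaction processing.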


Features of Hive

• Data warehouse
• Declarative language
• Variety of file formats
• Tabular format
• User-defined functions
• Open source
• Ad hoc querying
• Hadoop based
• Faster response time
• Query PB of data
• Supports ETL
• Fault tolerance
• HQL
• Multiple user support
• Easier than Java
• OLAP


Hive-Server 2 (HS2)

HS2 supports multi-client concurrency and authentication. It is
designed to provide better support for open API clients like
JDBC and ODBC.
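
A JDBC client reaches HS2 through a connection URL. The helper below is a minimal sketch that only builds such a URL in the common `jdbc:hive2://host:port/database` form. The function name `hs2_jdbc_url` and the host name are hypothetical, and 10000 is assumed as the port because it is the usual HiveServer2 default.

```python
def hs2_jdbc_url(host: str, port: int = 10000, database: str = "default") -> str:
    """Build a HiveServer2 JDBC URL of the form jdbc:hive2://host:port/db."""
    if not host:
        raise ValueError("host must not be empty")
    return f"jdbc:hive2://{host}:{port}/{database}"


# Hypothetical host name, used only for illustration.
url = hs2_jdbc_url("hive.example.internal")
print(url)
```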


Hive Metastore Server (HMS)
The Hive Metastore (HMS) is a central repository of metadata for Hive
tables and partitions, stored in a relational database. It gives clients
(including Hive, Impala and Spark) access to this information through
the metastore service API.
It has become a building block for data lakes that utilize the diverse
world of open-source software, such as Apache Spark and Presto.
In fact, a whole ecosystem of tools, open-source and otherwise, is
built around the Hive Metastore, some of which this diagram
illustrates.
[Figure: ecosystem of tools built around the Hive Metastore Server (HMS)]
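
The idea of one shared metadata catalog can be sketched with a toy in-memory model. Everything here (`TableMeta`, `ToyMetastore`, the `sales` table and its location) is hypothetical and far simpler than the real metastore service API. It only shows that several engines can resolve the same table definition from one place.

```python
from dataclasses import dataclass, field


@dataclass
class TableMeta:
    """Hypothetical, simplified record of what a metastore tracks per table."""
    columns: dict[str, str]                      # column name -> type
    location: str                                # where the data files live
    partitions: list[str] = field(default_factory=list)


class ToyMetastore:
    """In-memory stand-in for a shared catalog (not the real HMS API)."""

    def __init__(self) -> None:
        self._tables: dict[str, TableMeta] = {}

    def create_table(self, name: str, meta: TableMeta) -> None:
        self._tables[name] = meta

    def add_partition(self, name: str, spec: str) -> None:
        self._tables[name].partitions.append(spec)

    def get_table(self, name: str) -> TableMeta:
        return self._tables[name]


catalog = ToyMetastore()
catalog.create_table(
    "sales",
    TableMeta({"id": "bigint", "amount": "double"}, "hdfs:///warehouse/sales"),
)
catalog.add_partition("sales", "dt=2024-01-01")

# Each engine looks up the same entry instead of keeping its own copy.
for engine in ("hive", "spark", "impala"):
    meta = catalog.get_table("sales")
    print(engine, meta.location, meta.partitions)
```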

Hive Architecture

Chandramouli, Asha, Rene, Doreen and Jasmine

• Hive clients: Thrift, JDBC and ODBC applications, plus the Web UI and the Hive CLI.
• Hive services: Hive Server 2, Beeline, the Hive Driver (compiler, parser, planner, optimizer), the execution engine and the Metastore.
• Processing: MapReduce, Tez or Spark, on YARN.
• Storage: the Metastore database and Hive storage (HCatalog) on HDFS or HBase, alongside other file systems.


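Workflow in Hive

Components: on the Hive side, the Interface, Driver, Compiler, Meta Store and
Execution Engine; on the Hadoop side, MapReduce (Job Tracker, Task Tracker)
and HDFS (Name Node, Data Node).

1. Query(): the Interface passes the query to the Driver.
2. Plan(): the Driver asks the Compiler for a plan.
3. Metadata(): the Compiler requests metadata from the Meta Store.
4. Metadata(): the Meta Store returns the metadata to the Compiler.
5. Send Plan(): the Compiler returns the plan to the Driver.
6. Execute Plan(): the Driver hands the plan to the Execution Engine.
7. Submit Job(): the Execution Engine submits the job to the Job Tracker, which runs it on the Task Trackers.
8. Send Result(): Hadoop returns the result to the Execution Engine.
9. Send Result(): the Execution Engine returns the result to the Driver.
10. Send Result(): the Driver returns the result to the Interface.
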
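
The ten numbered hand-offs in the workflow diagram can be traced with a small simulation. This is only a sketch: the function `trace_hive_query` is hypothetical, no real Hive component is called, and the sender and receiver of each step are an assumption based on the usual reading of the diagram.

```python
def trace_hive_query(query: str) -> list[str]:
    """Return the ten hand-offs for one query, in order (simulation only)."""
    # (sender, receiver, label); the pairing is an assumed reading of the diagram.
    hops = [
        ("Interface", "Driver", f"Query() [{query}]"),
        ("Driver", "Compiler", "Plan()"),
        ("Compiler", "Meta Store", "Metadata()"),
        ("Meta Store", "Compiler", "Metadata()"),
        ("Compiler", "Driver", "Send Plan()"),
        ("Driver", "Execution Engine", "Execute Plan()"),
        ("Execution Engine", "Job Tracker", "Submit Job()"),
        ("Job Tracker", "Execution Engine", "Send Result()"),
        ("Execution Engine", "Driver", "Send Result()"),
        ("Driver", "Interface", "Send Result()"),
    ]
    return [
        f"{n}. {src} -> {dst}: {label}"
        for n, (src, dst, label) in enumerate(hops, start=1)
    ]


trace = trace_hive_query("SELECT 1")
for line in trace:
    print(line)
```

The trace makes one property of the design visible: the plan is compiled before any Hadoop job is submitted, and results travel back along the same chain in reverse.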
Disadvantages of Hive

• Limited real-time processing: Hive is designed for batch processing, so
it may not be the best tool for real-time data processing.
• Slow performance: Hive can be slower than traditional relational
databases because it is built on top of Hadoop, which is optimized for
batch processing rather than interactive querying.
• Steep learning curve: While Hive uses a SQL-like language, it still requires
users to have knowledge of Hadoop and distributed computing, which can
make it difficult for beginners to use.
• Limited flexibility: Hive is not as flexible as other data warehousing tools
because it is designed to work specifically with Hadoop, which can limit its
usability in other environments.

What is Apache Pig?
• Apache Pig is an abstraction over MapReduce. It is a tool/platform used
to analyze large data sets by representing them as data flows.
• Pig is generally used with Hadoop; we can perform all the data manipulation
operations in Hadoop using Apache Pig.
• To write data analysis programs, Pig provides a high-level language known as Pig
Latin. This language provides various operators, and programmers can also
develop their own functions for reading, writing, and processing data.
• To analyze data using Apache Pig, programmers write scripts in the Pig
Latin language. All these scripts are internally converted to Map and Reduce
tasks. Apache Pig has a component known as the Pig Engine that accepts Pig
Latin scripts as input and converts them into MapReduce jobs.
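
To make the conversion into Map and Reduce tasks concrete, the sketch below writes out by hand the map, shuffle and reduce phases for a simple "count events per user" data flow. It is plain Python and not output of the Pig Engine. The sample records and function names are invented.

```python
from collections import defaultdict

# Made-up input records: (user, action).
records = [("alice", "click"), ("bob", "view"), ("alice", "view"), ("alice", "click")]

# In Pig Latin the same flow would be roughly two statements:
#   grouped = GROUP events BY user;
#   counts  = FOREACH grouped GENERATE group, COUNT(events);


def map_phase(rows):
    """Emit one (key, 1) pair per input record."""
    for user, _action in rows:
        yield user, 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


counts = reduce_phase(shuffle(map_phase(records)))
print(counts)
```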


• Pig has the following key properties:
• Ease of programming. It is trivial to achieve parallel execution of
simple, "embarrassingly parallel" data analysis tasks. Complex tasks
comprised of multiple interrelated data transformations are
explicitly encoded as data flow sequences, making them easy to
write, understand, and maintain.
• Optimization opportunities. The way in which tasks are encoded
permits the system to optimize their execution automatically,
allowing the user to focus on semantics rather than efficiency.
• Extensibility. Users can create their own functions to do special-
purpose processing.
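
The properties above (tasks encoded as data flow sequences, plus user-written functions) can be imitated in a few lines of Python. This is an analogy rather than Pig itself: `transform`, `filter_by`, `run_flow` and `normalize_city` are hypothetical names, and the rows are sample data.

```python
from typing import Callable, Iterable

Row = dict
Step = Callable[[Iterable[Row]], Iterable[Row]]


def transform(fn: Callable[[Row], Row]) -> Step:
    """Build a step that applies fn to every row."""
    return lambda rows: (fn(row) for row in rows)


def filter_by(pred: Callable[[Row], bool]) -> Step:
    """Build a step that keeps only rows matching pred."""
    return lambda rows: (row for row in rows if pred(row))


def run_flow(rows: Iterable[Row], steps: list[Step]) -> list[Row]:
    """Run the steps in order; each step consumes the previous step's output."""
    for step in steps:
        rows = step(rows)
    return list(rows)


def normalize_city(row: Row) -> Row:
    """A user-defined function doing special-purpose cleanup."""
    return {**row, "city": row["city"].strip().title()}


rows = [
    {"city": " pune ", "temp": 31},
    {"city": "DELHI", "temp": 18},
    {"city": "goa", "temp": 29},
]
flow = [transform(normalize_city), filter_by(lambda r: r["temp"] > 20)]
result = run_flow(rows, flow)
print(result)
```

Because the flow is an explicit list of steps, a system can inspect and reorder it before running it, which is the kind of optimization opportunity the second property refers to.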
Why Do We Need Apache Pig?
• Using Pig Latin, programmers can perform MapReduce tasks easily without
having to write complex Java code.
• Apache Pig uses a multi-query approach, which reduces the length of the
code. For example, an operation that would take 200 lines of code (LoC)
in Java can be done in as few as 10 LoC in Apache Pig. Overall,
development time is said to drop by a factor of almost 16.
• Pig Latin is a SQL-like language, so Apache Pig is easy to learn if you
are familiar with SQL.
• Apache Pig provides many built-in operators to support data operations
like joins, filters, ordering, etc. In addition, it also provides nested data
types like tuples, bags, and maps that are missing from MapReduce.
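
The nested data types mentioned in the last bullet have close Python analogues, which the sketch below uses to show how grouping yields a bag nested inside each tuple. The mapping (tuple to `tuple`, bag to `list` of tuples, map to `dict`) is an approximation chosen for illustration, and the enrollment data is invented.

```python
from itertools import groupby
from operator import itemgetter

# Tuple: an ordered set of fields.
student = ("ann", 21)

# Bag: a collection of tuples; duplicates are allowed.
enrollments = [("ann", "bda"), ("bob", "ml"), ("ann", "ml")]

# Map: a set of key/value pairs.
profile = {"name": "ann", "city": "Pune"}

# Grouping produces one tuple per key whose second field is itself a bag.
by_student = itemgetter(0)
grouped = [
    (key, list(bag))
    for key, bag in groupby(sorted(enrollments, key=by_student), key=by_student)
]
print(grouped)
```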


Pig Architecture

1. Pig Latin scripts, submitted through the Grunt shell or the Pig server
2. Parser
3. Optimizer
4. Compiler
5. Execution engine
6. MapReduce
7. HDFS

Interpreting a Pig Script

1. Pig Latin programs
2. Query parser
3. Semantic checking
4. Logical optimizer
5. Logical plan
6. Logical to physical translator
7. Physical plan
8. MapReduce plan
9. Hadoop execution

Apache Pig Vs MapReduce


Apache Pig Vs SQL


Apache Pig Vs Hive

