Hadoop Ecosystem: Hive & Pig Overview

The document provides an analysis of the Hadoop ecosystem, focusing on Apache Hive and Apache Pig. Hive is a distributed data warehouse system designed for large-scale analytics using SQL, while Pig is a platform that simplifies data manipulation through a high-level language called Pig Latin. Both tools are built on Hadoop but have distinct features and use cases, with Hive being more suited for data warehousing and Pig for data flow analysis.

Uploaded by

Ishan Sharma

Hadoop Ecosystem-Analysis

Apache Hive (SQL Query)


Pig (Scripting)

Big Data Analytics 1


What is Hive?
• Apache Hive is a distributed, fault-tolerant data warehouse system that
enables analytics at a massive scale.
• The Hive Metastore (HMS) provides a central repository of metadata that can
easily be analyzed to make informed, data-driven decisions, and it is
therefore a critical component of many data lake architectures.
• Hive is built on top of Apache Hadoop and supports storage on S3, ADLS, GS,
etc. through HDFS.
• Hive allows users to read, write, and manage petabytes of data using SQL.
• Hive is not designed for online transaction processing. It is best used for
traditional data warehousing tasks.
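
To make the idea of SQL running on Hadoop more concrete, here is a minimal Python sketch (not Hive itself) of how a grouped count such as `SELECT dept, COUNT(*) FROM employees GROUP BY dept` can be evaluated as a map phase, a shuffle, and a reduce phase. The `employees` rows and all function names are hypothetical, and a real Hive execution plan is considerably more involved.

```python
from collections import defaultdict

# Hypothetical rows of an "employees" table: (name, dept).
employees = [
    ("ann", "sales"),
    ("bob", "engineering"),
    ("cho", "sales"),
    ("dev", "engineering"),
    ("eli", "sales"),
]


def map_phase(rows):
    """Emit one (dept, 1) pair per input row."""
    for _name, dept in rows:
        yield dept, 1


def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key to get the row count per department."""
    return {key: sum(values) for key, values in grouped.items()}


counts = reduce_phase(shuffle(map_phase(employees)))
print(counts)
```

The user only writes the declarative query; splitting the work into phases like these is left to the system.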



Features of Hive

• Data warehouse
• Declarative language
• Variety of file formats
• Tabular format
• User-defined functions
• Open source
• Ad hoc querying
• Hadoop based
• Faster response time
• Query PB of data
• Supports HQL
• Supports ETL
• Fault tolerance
• Multiple user support
• Easier than Java
• OLAP



Hive-Server 2 (HS2)

HS2 supports multi-client concurrency and authentication. It is
designed to provide better support for open API clients like
JDBC and ODBC.



Hive Metastore Server (HMS)
The Hive Metastore (HMS) is a central repository of metadata for Hive
tables and partitions in a relational database, and provides clients
(including Hive, Impala and Spark) access to this information using
the metastore service API.
It has become a building block for data lakes that utilize the diverse
world of open-source software, such as Apache Spark and Presto.
In fact, a whole ecosystem of tools, open-source and otherwise, is
built around the Hive Metastore, some of which this diagram
illustrates.
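
As a rough illustration of what "a central repository of metadata for tables and partitions" means, the following Python sketch models a toy in-memory catalog. The class names, fields, table name, and the `hdfs:///warehouse/...` location are all assumptions made for the example; the real metastore keeps this information in a relational database and serves it through its own service API.

```python
from dataclasses import dataclass, field


@dataclass
class TableMeta:
    """Metadata a metastore might keep for one table (illustrative fields only)."""
    name: str
    columns: dict      # column name -> type name
    location: str      # where the data files live
    partitions: list = field(default_factory=list)


class ToyMetastore:
    """In-memory stand-in for a metadata repository shared by several engines."""

    def __init__(self):
        self._tables = {}

    def create_table(self, meta):
        self._tables[meta.name] = meta

    def add_partition(self, table, partition):
        self._tables[table].partitions.append(partition)

    def get_table(self, table):
        return self._tables[table]


store = ToyMetastore()
store.create_table(TableMeta(
    name="page_views",
    columns={"user_id": "bigint", "url": "string"},
    location="hdfs:///warehouse/page_views",  # hypothetical path
))
store.add_partition("page_views", "dt=2024-01-01")

# Any client that talks to the catalog sees the same table definition.
meta = store.get_table("page_views")
print(meta.columns, meta.partitions)
```

Because the definitions live in one place, different engines can agree on a table's schema and partitions without each keeping a private copy.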
[Figure: tools built around the Hive Metastore]



Hive Architecture

[Figure: Hive architecture, from top to bottom]
• Hive clients: Thrift, JDBC and ODBC applications
• User interfaces: Web UI, Hive CLI, Beeline
• Hive services: Hive Server 2; Hive Driver (parser, compiler, planner,
optimizer); execution engine; Metastore
• Processing engines: MapReduce, Tez, Spark
• Resource management: YARN
• Storage: Metastore database; Hive storage (HCatalog) on HDFS or HBase;
file systems

Chandramouli, Asha, Rene, Doreen and Jasmine



Workflow in Hive

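[Figure: query workflow between Hive (interface, driver, compiler,
metastore, execution engine) and Hadoop (MapReduce with Job Tracker and
Task Tracker; HDFS with Name Node and Data Nodes)]

1. The interface sends the query to the driver.
2. The driver asks the compiler for a plan.
3. The compiler requests metadata from the metastore.
4. The metastore returns the metadata to the compiler.
5. Send Plan(): the compiler sends the plan back to the driver.
6. Execute Plan(): the driver passes the plan to the execution engine.
7. Submit Job(): the execution engine submits the job to Hadoop MapReduce.
8. Send Result(): Hadoop returns the result to the execution engine.
9. Send Result(): the execution engine returns the result to the driver.
10. Send Result(): the driver returns the result to the interface.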
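
The request path in this workflow can be sketched in a few lines of Python. This is a toy model under stated assumptions: the class and method names are hypothetical, the "parser" only looks for the word after `FROM`, and plain dictionaries stand in for the metastore database and for MapReduce over HDFS.

```python
class Metastore:
    """Holds table metadata; a dict stands in for the metastore database."""

    def __init__(self, schemas):
        self.schemas = schemas

    def get_metadata(self, table):
        return self.schemas[table]


class Compiler:
    """Turns a query into a plan, asking the metastore for metadata on the way."""

    def __init__(self, metastore):
        self.metastore = metastore

    def get_plan(self, query):
        # Toy parsing: take the word after FROM as the table name.
        table = query.split("FROM")[1].split()[0]
        columns = self.metastore.get_metadata(table)
        return {"table": table, "columns": columns}


class ExecutionEngine:
    """Runs the plan; a dict of row lists stands in for MapReduce over HDFS."""

    def __init__(self, storage):
        self.storage = storage

    def execute_plan(self, plan):
        rows = self.storage[plan["table"]]
        return [dict(zip(plan["columns"], row)) for row in rows]


class Driver:
    """Receives the query from the interface and coordinates the other parts."""

    def __init__(self, compiler, engine):
        self.compiler = compiler
        self.engine = engine

    def execute_query(self, query):
        plan = self.compiler.get_plan(query)    # the plan comes back from the compiler
        return self.engine.execute_plan(plan)   # results flow back to the caller


metastore = Metastore({"logs": ["level", "message"]})
engine = ExecutionEngine({"logs": [("INFO", "started"), ("ERROR", "disk full")]})
driver = Driver(Compiler(metastore), engine)

result = driver.execute_query("SELECT * FROM logs")
print(result)
```

The point of the sketch is the separation of roles: the driver never touches metadata or data directly, it only passes the query, the plan, and the results between the other components.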
Disadvantages

• Limited real-time processing: Hive is designed for batch processing, which
means it may not be the best tool for real-time data processing.
• Slow performance: Hive can be slower than traditional relational
databases because it is built on top of Hadoop, which is optimized for
batch processing rather than interactive querying.
• Steep learning curve: While Hive uses a SQL-like language, it still requires
users to have knowledge of Hadoop and distributed computing, which can
make it difficult for beginners to use.
• Limited flexibility: Hive is not as flexible as other data warehousing tools
because it is designed to work specifically with Hadoop, which can limit its
usability in other environments.


What is Apache Pig?
• Apache Pig is an abstraction over MapReduce. It is a tool/platform used to
analyze large data sets by representing them as data flows.
• Pig is generally used with Hadoop; all the data manipulation operations in
Hadoop can be performed using Apache Pig.
• To write data analysis programs, Pig provides a high-level language known as Pig
Latin. The language provides various operators, and programmers can also
develop their own functions for reading, writing, and processing data.
• To analyze data using Apache Pig, programmers write scripts in Pig Latin.
These scripts are internally converted to Map and Reduce tasks: a component
known as the Pig Engine accepts Pig Latin scripts as input and converts them
into MapReduce jobs.
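
A Pig Latin script reads as a sequence of named relations, each derived from the previous one. The Python sketch below mimics that data-flow style, with plain functions standing in for operators such as LOAD, FILTER, GROUP, and FOREACH. It illustrates the idea only and is not Pig's API; the sample log lines and function names are made up.

```python
from collections import defaultdict

# Hypothetical input lines: "<date> <level> <component>".
raw_lines = [
    "2024-01-01 ERROR disk",
    "2024-01-01 INFO start",
    "2024-01-02 ERROR network",
    "2024-01-02 ERROR disk",
]


def load(lines):
    """LOAD: parse each line into a (date, level, component) tuple."""
    return [tuple(line.split()) for line in lines]


def filter_rows(rows, predicate):
    """FILTER: keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def group_by(rows, key):
    """GROUP: collect rows into one bag (list) per key value."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return dict(groups)


def foreach_count(groups):
    """FOREACH with a count: one (key, count) tuple per group, sorted by key."""
    return sorted((key, len(bag)) for key, bag in groups.items())


# Each step names an intermediate relation, as a Pig Latin script would.
records = load(raw_lines)
errors = filter_rows(records, lambda r: r[1] == "ERROR")
by_component = group_by(errors, lambda r: r[2])
error_counts = foreach_count(by_component)
print(error_counts)
```

Writing the analysis as a chain of small transformations is what lets an engine translate the whole flow into Map and Reduce tasks.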



• Pig has the following key properties:
• Ease of programming. It is trivial to achieve parallel execution of
simple, "embarrassingly parallel" data analysis tasks. Complex tasks
comprised of multiple interrelated data transformations are
explicitly encoded as data flow sequences, making them easy to
write, understand, and maintain.
• Optimization opportunities. The way in which tasks are encoded
permits the system to optimize their execution automatically,
allowing the user to focus on semantics rather than efficiency.
• Extensibility. Users can create their own functions to do
special-purpose processing.
Why Do We Need Apache Pig?
• Using Pig Latin, programmers can perform MapReduce tasks easily without
having to write complex code in Java.
• Apache Pig uses a multi-query approach, thereby reducing the length of
the code. For example, an operation that would require 200 lines of code
(LoC) in Java can often be done in as few as 10 LoC in Apache Pig. By this
estimate, Apache Pig reduces development time by a factor of almost 16.
• Pig Latin is a SQL-like language, and it is easy to learn Apache Pig when
you are familiar with SQL.
• Apache Pig provides many built-in operators to support data operations
like joins, filters, ordering, etc. In addition, it also provides nested data
types like tuples, bags, and maps that are missing from MapReduce.
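
The nested data types mentioned above map naturally onto familiar containers. The sketch below uses Python stand-ins, which are an assumption of this example and not Pig's own representation: a tuple for a Pig tuple (an ordered set of fields), a list of tuples for a bag (a collection of tuples that may contain duplicates), and a dict for a map (key/value pairs). All values are made up.

```python
# A tuple: an ordered set of fields.
student = ("alice", 21)

# A bag: a collection of tuples; duplicates are allowed, so a list is used.
scores = [("math", 90), ("physics", 85), ("math", 90)]

# A map: a set of key/value pairs.
contact = {"city": "Pune", "country": "IN"}

# Types can nest: this record is a tuple holding a name, a bag, and a map.
record = (student[0], scores, contact)

# Work on the nested bag directly, without flattening the record first.
total = sum(score for _subject, score in record[1])
subjects = sorted({subject for subject, _score in record[1]})
print(total, subjects, record[2]["city"])
```

Keeping related values nested inside one record is what makes operations such as grouping convenient: a group is simply a key paired with a bag of the matching tuples.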



Pig Architecture

[Figure: Pig architecture, from top to bottom]
• Pig Latin scripts, submitted through the Grunt shell or Pig Server
• Parser
• Optimizer
• Compiler
• Execution engine
• MapReduce
• HDFS

Interpreting a Pig Script

[Figure: stages in interpreting a Pig script]
1. Pig Latin programs
2. Query parser
3. Semantic checking
4. Logical optimizer
5. Logical plan
6. Logical-to-physical translator
7. Physical plan
8. MapReduce plan
9. Hadoop execution

Apache Pig Vs MapReduce



Apache Pig Vs SQL



Apache Pig Vs Hive
