Hadoop Ecosystem-Analysis
Apache Hive (SQL Query)
Pig (Scripting)
Big Data Analytics 1
What is Hive?
• Apache Hive is a distributed, fault-tolerant data warehouse system that
enables analytics at a massive scale.
• The Hive Metastore (HMS) provides a central repository of metadata that can
easily be analyzed to make informed, data-driven decisions; it is therefore
a critical component of many data lake architectures.
• Hive is built on top of Apache Hadoop and supports storage on S3, ADLS, GCS,
etc. through HDFS.
• Hive allows users to read, write, and manage petabytes of data using SQL.
• Hive is not designed for online transaction processing. It is best used for
traditional data warehousing tasks.
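The kind of SQL analytics described above can be illustrated with a small stand-in. The sketch below uses Python's built-in sqlite3 module, not Hive itself; the table name page_views and its rows are made up for illustration. A comparable aggregate statement in Hive would be run as distributed jobs over much larger data.

```python
import sqlite3

# Stand-in only: SQLite runs on one machine, whereas Hive would execute a
# similar statement as distributed jobs over data in HDFS or object storage.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, invented for this example.
conn.execute("CREATE TABLE page_views (country TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("IN", 120), ("US", 80), ("IN", 30), ("DE", 50)],
)

# A typical warehouse-style aggregate: total views per country.
rows = conn.execute(
    "SELECT country, SUM(views) AS total_views "
    "FROM page_views GROUP BY country ORDER BY total_views DESC"
).fetchall()
conn.close()

print(rows)
```

The query is a batch-style aggregate over a whole table, which is the workload Hive targets, rather than a single-row transactional update.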
Features of Hive
• Data warehouse
• Declarative language
• Variety of file formats
• Tabular format
• User-defined functions
• Open source
• Ad hoc querying
• Hadoop based
• Faster response time
• Queries petabytes (PB) of data
• Supports ETL
• Fault tolerance
• HQL
• Multiple user support
• Easier than Java
• OLAP
HiveServer2 (HS2)
HS2 supports multi-client concurrency and authentication. It is
designed to provide better support for open API clients like
JDBC and ODBC.
Hive Metastore Server (HMS)
The Hive Metastore (HMS) is a central repository of metadata for Hive
tables and partitions in a relational database, and provides clients
(including Hive, Impala and Spark) access to this information using
the metastore service API.
It has become a building block for data lakes that utilize the diverse
world of open-source software, such as Apache Spark and Presto.
In fact, a whole ecosystem of tools, open-source and otherwise, is
built around the Hive Metastore, some of which this diagram
illustrates.
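The idea of one metadata repository shared by several engines can be sketched in a few lines of Python. This is a toy model, not the metastore service API: the class ToyMetastore, its methods, the table name and the storage path are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TableMeta:
    columns: dict          # column name -> type name
    location: str          # where the table's data files live
    partitions: list = field(default_factory=list)


class ToyMetastore:
    """Hypothetical in-memory catalog of table metadata."""

    def __init__(self):
        self._tables = {}

    def create_table(self, name, columns, location):
        self._tables[name] = TableMeta(columns, location)

    def add_partition(self, name, spec):
        self._tables[name].partitions.append(spec)

    def get_table(self, name):
        return self._tables[name]


metastore = ToyMetastore()
metastore.create_table(
    "sales", {"id": "int", "amount": "double"}, "hdfs:///warehouse/sales"
)
metastore.add_partition("sales", "dt=2024-01-01")

# Two different "engines" resolve the same table through the one catalog.
for engine in ("hive", "spark"):
    meta = metastore.get_table("sales")
    print(engine, meta.location, meta.partitions)
```

Because every client reads the same entry, a partition registered once becomes visible to all of them, which is what makes the metastore useful as a shared building block.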
[Figure: tools built around the Hive Metastore]
Hive Architecture
Chandramouli, Asha, Rene, Doreen and Jasmine
• Hive clients: Thrift, JDBC and ODBC applications
• User interfaces: Web UI, Hive CLI, Beeline
• Hive services: HiveServer2 and the Hive Driver (parser, compiler, planner,
optimizer)
• Execution engine: MapReduce, Tez or Spark, running on YARN
• Metastore: backed by the metastore database
• Hive storage (HCatalog): HDFS or HBase, and other file systems
Workflow in Hive
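Components: interface, driver, compiler, metastore and execution engine (Hive);
job tracker, task trackers, name node and data nodes (Hadoop MapReduce and HDFS).
1. Query(): the interface passes the query to the driver.
2. Plan(): the driver asks the compiler for a plan.
3. Metadata(): the compiler requests metadata from the metastore.
4. Metadata(): the metastore returns the metadata to the compiler.
5. Send Plan(): the compiler returns the plan to the driver.
6. Execute Plan(): the driver passes the plan to the execution engine.
7. Submit Job(): the execution engine submits the job to Hadoop (job tracker,
task trackers).
8. Send Result(): Hadoop returns the result to the execution engine.
9. Send Result(): the execution engine returns the result to the driver.
10. Send Result(): the driver returns the result to the interface.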
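The ten-step workflow on this slide can be traced with a toy Python function. This is a sketch under assumptions, not Hive code: the function run_query, the dictionary-based metastore and storage, and the wording of each step are illustrative, with steps 1 to 4 read from the surviving Query(), Plan() and Metadata() labels in the diagram.

```python
def run_query(query, metastore, storage):
    """Trace a toy 'sum one column' query through the ten workflow steps."""
    trace = ["1. interface -> driver: query"]
    trace.append("2. driver -> compiler: get plan")
    trace.append("3. compiler -> metastore: get metadata")
    columns = metastore[query["table"]]
    trace.append("4. metastore -> compiler: send metadata")

    # The compiler needs the metadata to validate the query.
    if query["column"] not in columns:
        raise ValueError(f"unknown column: {query['column']}")

    # The "plan" here is just the index of the column to sum.
    plan = {"table": query["table"], "index": columns.index(query["column"])}
    trace.append("5. compiler -> driver: send plan")
    trace.append("6. driver -> execution engine: execute plan")
    trace.append("7. execution engine -> Hadoop: submit job")

    # Stand-in for the distributed job: a local sum over the rows.
    result = sum(row[plan["index"]] for row in storage[plan["table"]])
    trace.append("8. Hadoop -> execution engine: send result")
    trace.append("9. execution engine -> driver: send result")
    trace.append("10. driver -> interface: send result")
    return result, trace


# Hypothetical metadata and data, invented for this example.
metastore = {"sales": ["id", "amount"]}
storage = {"sales": [(1, 10.0), (2, 5.5), (3, 4.5)]}

result, trace = run_query({"table": "sales", "column": "amount"}, metastore, storage)
print(result)
print("\n".join(trace))
```

The point of the sketch is the ordering: metadata is consulted before any plan exists, and the job only reaches Hadoop after the driver has a compiled plan.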
Disadvantages
• Limited real-time processing: Hive is designed for batch processing, which
means it may not be the best tool for real-time data processing.
• Slow performance: Hive can be slower than traditional relational
databases because it is built on top of Hadoop, which is optimized for
batch processing rather than interactive querying.
• Steep learning curve: While Hive uses a SQL-like language, it still requires
users to have knowledge of Hadoop and distributed computing, which can
make it difficult for beginners to use.
• Limited flexibility: Hive is not as flexible as other data warehousing tools
because it is designed to work specifically with Hadoop, which can limit its
usability in other environments.
What is Apache Pig?
• Apache Pig is an abstraction over MapReduce. It is a tool/platform used to
analyze large data sets by representing them as data flows.
• Pig is generally used with Hadoop; all data manipulation operations in
Hadoop can be performed using Apache Pig.
• To write data analysis programs, Pig provides a high-level language known as
Pig Latin. The language provides various operators, and programmers can also
develop their own functions for reading, writing, and processing data.
• To analyze data using Apache Pig, programmers write scripts in Pig Latin.
These scripts are internally converted to Map and Reduce tasks: a component
known as the Pig Engine accepts Pig Latin scripts as input and converts them
into MapReduce jobs.
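To make the data-flow idea concrete, the sketch below mirrors a short Pig Latin script step by step in plain Python. It runs locally and does not involve Pig or MapReduce; the file name, field names and sample records are made up, and the Pig Latin in the comment is only an illustration of the style.

```python
from collections import defaultdict

# Roughly the data flow a Pig Latin script might express (illustrative only):
#   logs    = LOAD 'logs.csv' USING PigStorage(',') AS (user:chararray, bytes:int);
#   big     = FILTER logs BY bytes > 100;
#   grouped = GROUP big BY user;
#   totals  = FOREACH grouped GENERATE group, SUM(big.bytes);

# Hypothetical input lines standing in for the contents of 'logs.csv'.
raw_lines = ["ann,250", "bob,90", "ann,300", "cat,120"]

# LOAD: parse each line into a (user, bytes) tuple.
logs = [(user, int(nbytes)) for user, nbytes in (line.split(",") for line in raw_lines)]

# FILTER: keep only records above the threshold.
big = [record for record in logs if record[1] > 100]

# GROUP: collect the byte counts of each user.
grouped = defaultdict(list)
for user, nbytes in big:
    grouped[user].append(nbytes)

# FOREACH ... GENERATE: one output tuple per group, sorted for stable output.
totals = sorted((user, sum(values)) for user, values in grouped.items())

print(totals)
```

Each named relation feeds the next one, which is the sense in which a Pig script describes a flow of data rather than a single query.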
• Pig has the following key properties:
• Ease of programming. It is trivial to achieve parallel execution of
simple, "embarrassingly parallel" data analysis tasks. Complex tasks
comprised of multiple interrelated data transformations are
explicitly encoded as data flow sequences, making them easy to
write, understand, and maintain.
• Optimization opportunities. The way in which tasks are encoded
permits the system to optimize their execution automatically,
allowing the user to focus on semantics rather than efficiency.
• Extensibility. Users can create their own functions to do special-
purpose processing.
Why Do We Need Apache Pig?
• Using Pig Latin, programmers can perform MapReduce tasks easily without
having to write complex code in Java.
• Apache Pig uses a multi-query approach, thereby reducing the length of the
code. For example, an operation that would require 200 lines of code (LoC)
in Java can be done in as few as 10 LoC in Apache Pig. Ultimately, Apache
Pig reduces development time by almost 16 times.
• Pig Latin is a SQL-like language, and it is easy to learn Apache Pig when you
are familiar with SQL.
• Apache Pig provides many built-in operators to support data operations
like joins, filters, ordering, etc. In addition, it provides nested data
types like tuples, bags, and maps that are missing from MapReduce.
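The nested data types named above can be pictured with ordinary Python containers. This is an analogy only, assuming a tuple is an ordered set of fields, a bag is a collection of tuples, and a map is a set of key-value pairs; the sample values are invented.

```python
# tuple: an ordered set of fields
record = ("ann", 31)

# bag: a collection of tuples, duplicates allowed (modelled as a list)
bag = [("ann", 31), ("bob", 27), ("ann", 31)]

# map: key-value pairs
profile = {"city": "Pune", "lang": "en"}

# Nesting: a tuple whose fields are a name, a bag of events, and a map
nested = ("ann", [("click", 3), ("view", 7)], {"city": "Pune"})

# Aggregating over the inner bag, as a grouped relation would allow
total_events = sum(count for _, count in nested[1])

print(total_events)
```

Grouping in Pig naturally produces this shape: one tuple per key, carrying a bag of the records that share that key.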
Big Data Analytics 14
Pig Architecture
• Pig Latin scripts, submitted through the Grunt shell or the Pig Server
• Parser
• Optimizer
• Compiler
• Execution engine
• MapReduce
• HDFS
Big Data Analytics 15
Interpreting a Pig Script
• Pig Latin programs
• Query parser
• Semantic checking
• Logical plan
• Logical optimizer
• Logical to physical translator
• Physical plan
• MapReduce plan
• Hadoop execution
Big Data Analytics 16
Apache Pig Vs MapReduce
Big Data Analytics 17
Apache Pig Vs SQL
Big Data Analytics 18
Apache Pig Vs Hive