0% found this document useful (0 votes)
4 views77 pages

Chapter9 HIVE

Hive is a data warehousing tool built on Hadoop, designed for querying structured data using an SQL-like interface. It was created by Facebook to manage large volumes of data and utilizes HDFS for storage and MapReduce for execution. Hive supports batch processing, rich data types, and has evolved through various versions, introducing features like transactions and a cost-based optimizer.

Uploaded by

udayagirisriya1
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views77 pages

Chapter9 HIVE

Hive is a data warehousing tool built on Hadoop, designed for querying structured data using an SQL-like interface. It was created by Facebook to manage large volumes of data and utilizes HDFS for storage and MapReduce for execution. Hive supports batch processing, rich data types, and has evolved through various versions, introducing features like transactions and a cost-based optimizer.

Uploaded by

udayagirisriya1
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

What is Hive?

Big Da
Copyright
What is Hive?

Hive is a Data Warehousing tool. Hive is used to query structured data built
on top of Hadoop. Facebook created Hive component to manage their ever-
growing volumes of data. Hive makes use of the following:

1. HDFS for Storage

2. MapReduce for execution

3. Stores metadata in an RDBMS.

Big Da
Copyright
Why Hive?

Hive provides an SQL-like interface to query the data stored in HDFS, making it easier
to analyze these logs without writing complex MapReduce jobs.
Hive suitable for

[Link] suitable for Data Warehousing (DWH) applications.


2. processes batch jobs on huge data that is immutable.
3. EG: Web logs, Application logs.
Hive history

2007 : Born at FACEBOOK to analyze incoming data

2008 : Hive became Apache Hadoop sub-project.


Hive versions:

Hive 0.10

1. Batch
2. Read only data
3. HiveQL
4. MR

Hive 0.13
1. Interactive
2. Read-only data
3. Substantial SQL- Expanded SQL support with added functionalities
like windowing, analytics functions, and joins.
4. MR

Hive 0.14
1. Transactions with ACID semantics
2. Cost based optimizer
3. SQL temporary tables.
Hive Features

Big Da
Copyright
Features of Hive

1. It is similar to SQL.

2. HQL is easy to code.

3. Hive supports rich data types such as structs, lists, and maps.

4. Hive supports SQL filters, group-by and order-by clauses.

5. Custom Types, Custom functions can be defined.

Big Da
Copyright
Advanced Features not supported by Hive

• Hive does not support recursive queries

• Hive lacks stored procedure support.

• Hive does not support triggers.


When to Use Hive Over SQL

•Big Data: Hive is better suited for querying and analyzing large,
distributed datasets.
•Schema-on-Read: Hive can process semi-structured or unstructured data
without predefined schema.
•Batch Processing: Hive is ideal for batch jobs and data aggregation.

When to Use SQL Over Hive

•OLTP Systems: For real-time and transactional applications.


•Small Datasets: SQL databases are faster and more efficient for smaller
datasets.
•Complex Querying: SQL provides a richer set of features for advanced
querying.
Hive integration and work flow

Work flow of log analysis file.

Hourly log

Hadoop HDFS

Log compression

Hive table 1 Hive table 2


Example querying

Example

SELECT city, COUNT(*)


FROM customer_data
GROUP BY city;

Generated MapReduce Job:

•Map Task: The Mapper processes the customer_data file and outputs (city, 1) for
each record.
•Reduce Task: The Reducer groups the data by city and calculates the COUNT for
each city.

Difference from Manual MapReduce Code


•In Hive, you don't need to manually write Mapper and Reducer code.
•The Hive query engine automatically generates the necessary MapReduce jobs.
Hive Data Units

Big Da
Copyright
Hive Data Units
• Databases: The namespace for tables

• Tables : Set of records that have similar schema

• Partitions: Logical separation of data based on classification of given


information as per specific attributes. Once hive has partitioned the data
based on specified key, it starts to assemble the records into specific
folders as when the records are inserted.

• Use case: Storing data for faster query performance. For example, a table
storing sales data can be partitioned by year or region. A query for sales in
the year = 2023 will only scan the 2023 folder, improving efficiency.

• Buckets (Clusters) : Similar to partition but uses hash function to segregate


the data and determines the cluster or bucket into which the record should
be placed.

Use case: Performing operations like sampling or optimizing queries that


join two tables on the bucketed column.
Example:
A table customers partitioned by region could be bucketed by
Big Da customer_id, ensuring that all records for the same customer_id fall into the
Copyrightsame bucket.
Semblance of hive structure with database

Database: Several tables, table constitute with rows and columns.

Hive:
Tables : Folder
Partition tables: sub directory
Bucketed tables: stored as files.
Hive Architecture

Big Da
Copyright
Hive Architecture

Hive

Command-Line Hive Web Hive Server


Interface Interface (Thrift)

Driver (Query
Metastore
Compiler, Executor)

JobTracker TaskTracker

HDFS
Hadoop

Big Da
Copyright
HIVE components

[Link]-Line Interface (CLI):


•Provides a terminal-based interface to interact with Hive.
•Users can write and execute HiveQL queries via the CLI.

[Link] Web Interface:


•A browser-based interface for interacting with Hive. Eg: Apache Hue,
•Enables users to submit queries and manage Hive operations through a GUI.

[Link] (Query Compiler, Executor):


•Acts as the controller in Hive's architecture.
•Query Compiler: Parses and compiles HiveQL queries into execution plans.
•Executor: Executes the compiled queries and interacts with the Hadoop framework
for job execution.

[Link]:
•A repository that stores metadata about the Hive tables, such as schema
information, table locations in HDFS, and partition details.
•The MetaStore is essential for query planning and execution.

[Link] Server (Thrift):


•Provides a service interface to Hive.
•Applications can connect to Hive using JDBC, ODBC, or other Thrift-based clients.
HIVE components

[Link]:
•Manages the distribution of tasks across nodes in the Hadoop cluster.
•Monitors the execution of MapReduce jobs initiated by Hive.

[Link]:
•Runs on individual cluster nodes.
•Executes tasks assigned by the JobTracker.

[Link] Distributed File System (HDFS):


•The primary storage layer for Hive.
•Stores raw data as well as intermediate and final outputs of queries.
Hive Meta store

Meta store : Hive table definitions and mappings to the data are stored in Meta store.

Meta store consists of the following:


Metastore services: Offers interface to the hive
Databases: Stores data definitions, mappings to the data and others.

Meta data includes ,


ID s of DBS
IDs of Tables
IDs of Indexes
The time of creation of tables
Input format
Output format
Types of Meta Sore

Embedded Meta store:


Used for unit test
Default meta store
only one process is allowed to connect at a time
Here both DB and Meta store services embedded with main Hive server process.

Characteristics:
•Purpose: Primarily used for unit testing or single-user scenarios.
•Architecture:
• Both the Hive metastore service and the database (e.g., Derby) are embedded
in the same process as the Hive server.
• A single Java Virtual Machine (JVM) handles everything (Hive server, metastore
service, and database).
•Limitations:
• Only one connection is allowed at a time, as the embedded database (e.g.,
Derby) does not support concurrent access.
• Not suitable for production environments.
Use Case:
•Best for testing or development where only one user or process needs to access Hive.
Local Metastore
• Meta data can be stored in any RDBMS components like MySQL.
• Allows multiple connections at a time
• Here Meta store service runs in Hive Server but DB runs in separate process.

•Architecture:
• The Hive metastore service runs as part of the HiveServer process.
• The metadata is stored in an external RDBMS like MySQL, PostgreSQL, or Oracle,
which runs in a separate process.

•Limitations:
•The metastore service is not separated from HiveServer, so any failures in the HiveServer
process can affect metastore operations.
•Less isolation and security compared to the remote metastore.

Use Case:
•Suitable for small to medium-scale production environments where Hive and the
metastore can reside in the same process and multiple users need access.
Remote Metastore

• Hive driver and Meta store interface runs on different JVM.


• This way the DB can be firewalled from the hive users and also DB credentials are
completely isolated from hive users.

Advantages:
• Allows scaling of the metastore service independently.
• Improves security by isolating the database behind a firewall.
• Suitable for multi-tenant or enterprise-level environments.

Limitations:
• Slightly higher latency due to network communication between Hive
components and the metastore service.

Use Case:
•Ideal for large-scale production environments, multi-tenant systems, or scenarios
where database credentials and Hive processes need strict isolation.
Hive Data Types

Big Da
Copyright
Hive Data Types

Numeric Data Type


TINYINT 1 - byte signed integer
SMALLINT 2 -byte signed integer
INT 4 - byte signed integer
BIGINT 8 - byte signed integer
FLOAT 4 - byte single-precision floating-point
DOUBLE 8 - byte double-precision floating-point number

String Types
STRING
VARCHAR Only available starting with Hive 0.12.0
CHAR Only available starting with Hive 0.13.0
Strings can be expressed in either single quotes (‘) or double quotes (“)

Miscellaneous Types
BOOLEAN
BINARY Only available starting with Hive

Big Da
Copyright
Hive Data Types

Collection Data Types

STRUCT Similar to ‘C’ struct. Fields are accessed using dot notation.
E.g.: struct('John', 'Doe')

MAP A collection of key - value pairs. Fields are accessed using [] notation.
E.g.: map('first', 'John', 'last', 'Doe')

ARRAY Ordered sequence of same types. Fields are accessed using array index.
E.g.: array('John', 'Doe')

Big Da
Copyright
Hive File Format

Big Da
Copyright
Hive File Format
• Text File
The default file format is text file.

• Sequential File
Sequential files are flat files that store binary key-value pairs. Commonly used
in Hadoop.
Supports compression (e.g., Gzip, Snappy).
Provides faster read/write performance than TextFile.

• RCFile (Record Columnar File)


RCFile stores the data in Column Oriented Manner which ensures that
Aggregation operation is not an expensive operation.

ORC (Optimized Row Columnar)


A highly efficient columnar storage format introduced by Hive for big data
workloads.
• Columnar storage with support for advanced compression and indexing.
• Built-in support for predicate pushdown (filters are applied before reading
data).

Big Da
Copyright
Hive Query Language

Big Da
Copyright
Hive Query Language (HQL)

1. Create and manage tables and partitions.


2. Support various Relational, Arithmetic, and Logical Operators.
3. Evaluate functions.
4. Download the contents of a table to a local directory or result of queries to HDFS
directory.

Big Da
Copyright
DDL and DML statements

Big Da
Copyright
Database

To create a database named “STUDENTS” with comments and database properties.

CREATE DATABASE IF NOT EXISTS STUDENTS


COMMENT 'STUDENT Details'
WITH DBPROPERTIES ('creator' = 'JOHN');

Big Da
Copyright
Database

To describe a database.

DESCRIBE DATABASE STUDENTS;

Shows only DB Name, comment, DB directory

DESCRIBE DATABASE EXTENDED STUDENTS;

Shows DB properties also.

Big Da
Copyright
Database

To display, a list of all databases

SHOW DATABASES;

To alter database properties

ALTER DATABASE STUDENTS SET DBPROPERTIES(‘Edited-by’=‘XX’);

To make the databases as current working database

USE STUDENTS;
Database

To drop database.

DROP DATABASE STUDENTS;

All databases stored in data warehouse directory in hive.

Big Da
Copyright
Tables
Hive provides two kinds of tables:

◼ Managed Table
1. Hive stores the managed tables under the ware house folder under Hive
2. The complete life cycle of table and data is managed by Hive
[Link] the internal table is dropped it drops the data as well as meta data.

◼ External Table or self managed table


1. When the table is dropped , it retains the data in the underlying location
2. External Keyword is used.
3. Location needs to be specified to store the dataset in that particular
location.

Big Da
Copyright
Feature Managed Table External Table

Specified by the user using


Data Location Stored in Hive's warehouse directory
LOCATION.

Data is deleted when the table is Data remains intact when the table
Data Deletion
dropped. is dropped.

When Hive manages the data When data is shared with other
Use Case
lifecycle. systems or resides externally.

Control Over
Hive has full control over the data. User maintains control of the data.
Data

More flexible, can query external


Flexibility Less flexible, tied to Hive.
datasets.
Tables

To create managed table named ‘STUDENT’.

CREATE TABLE IF NOT EXISTS STUDENT(rollno INT,name STRING,gpa FLOAT)


ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

Big Da
Copyright
Tables

To create external table named ‘EXT_STUDENT’.

CREATE EXTERNAL TABLE IF NOT EXISTS EXT_STUDENT(rollno INT,name


STRING,gpa FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' LOCATION ‘/STUDENT_INFO';

Big Da
Copyright
Tables

To load data into the table from file named [Link].

LOAD DATA LOCAL INPATH ‘/root/hivedemos/[Link]' OVERWRITE INTO


TABLE EXT_STUDENT;

Local keyword is used to load the table in local file systems.


To store in HDFS remove local keyword.

Big Da
Copyright
QueryingTables

To retrieve the student details from “EXT_STUDENT” table.

SELECT * from EXT_STUDENT;

Big Da
Copyright
Collection Data types in Tables

Map( Key- Value)

Input :
1001, John,c++:Java,Mark1!45:Mark2!67:Mark3!75
1002, Jack,c++:cloud,Mark1!55:Mark2!69:Mark3!85

CREATE TABLE STUDENT_INFO(rollno INT,name String,sub ARRAY<STRING>,marks


MAP<STRING,INT>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’
COLLECTION ITEMS TERMINATED BY ‘:’
MAP KEYS TERMINATED BY ‘!’;

To load data

LOAD DATA LOCAL INPATH ‘/root/hivedemos/[Link]’ INTO TABLE STUDENT_INFO


The table STUDENT_INFO is created to store student information with the following schema:

•rollno: An integer field to store the student's roll number.


•name: A string field to store the student's name.
•sub: An array of strings to store a list of subjects.
•marks: A map with string keys (subject names) and integer values (marks).

The ROW FORMAT specifies the delimiter for fields and collection items:

•Fields are separated by a comma (,).


•Items in collections (arrays or maps) are separated by a colon (:).
•Keys in maps are separated from their values by an exclamation mark (!).

For this schema, the data in the CSV file must adhere to the specified format.
1,John,Math:Science,Math!85:Science!90
2,Jane,History:Geography,History!80:Geography!75
3,Mike,Physics:Chemistry,Physics!92:Chemistry!88
Querying collection Data types

SELCET * FROM STUDENT_INFO;

1001 John [c++:Java] {“Mark1”:45,”Mark2”:67,”Mark3”:75}


1002 Jack [c++:cloud] {“Mark1”:55,”Mark2”:69,”Mark3”:85}

SELECT NAME,SUB FROM STUDENT-INFO;


John [c++:Java]
Jack [c++:cloud]

SELECT NAME,MARKS[‘MARKS1’] FROM STUDENT-INFO;


JOHN 45
JACK 55

SELECT NAME,SUB[0] FROM STUDENT-INFO;


John c++
Jack c++
Partitions

Big Da
Copyright
Partitions

Partitions split the larger dataset into more meaningful chunks.


Hive provides two kinds of partitions: Static Partition and Dynamic Partition.

Static partition: Comprise columns whose values are known at compile time.
Dynamic partition: Dynamic partition have columns whose values are known only at
Execution time.

Feature Static Partition Dynamic Partition


At compile time (before At execution time (while
When values are known
execution). processing).

Partition value specified Manually in the query. Extracted from data dynamically.

When values are


Use case When values are data-driven.
fixed/predictable.

Faster due to less Slightly slower due to runtime


Performance
Big Da computation. evaluation.
Copyright
Static and Dynamic Partition Combined

INSERT INTO TABLE sales_dynamic PARTITION (region='North', year)


SELECT id, amount, year FROM staging_sales WHERE region='North’;

Here, the region column is statically set to 'North', while the year column
is dynamically determined at runtime.
Static partitions

• To create static partition based on “gpa” column.

CREATE TABLE IF NOT EXISTS STATIC_PART_STUDENT (rollno INT, name STRING)


PARTITIONED BY (gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

• Load data and querying data into partition table from table.

INSERT OVERWRITE TABLE STATIC_PART_STUDENT PARTITION (gpa =4.0)


SELECT rollno, name from EXT_STUDENT where gpa=4.0;

To add one more static partition based on gpa

ALTER TABLE STATIC-PART –STUDENT ADD PARTITION(GPA=3.5);


Partitions

• To create dynamic partition on column date.

CREATE TABLE IF NOT EXISTS DYNAMIC_PART_STUDENT(rollno INT, name STRING)


PARTITIONED BY (gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

• To load data into a dynamic partition table from table.

SET [Link] = true;


SET [Link] = nonstrict;

Note: The dynamic partition strict mode requires at least one static partition column. To
turn this off,
set [Link]=nonstrict

INSERT OVERWRITE TABLE DYNAMIC_PART_STUDENT PARTITION (gpa) SELECT


rollno,name,gpa from EXT_STUDENT;

Big Da
Copyright
Partition Example

Input File
id, name, dept, yoj
1, gopal, TP, 2012
2, kiran, HR, 2012
3, kaleel,SC, 2013
4, Prasanth, SC, 2013

After partition
/tab1/employeedata/2012/file2
1, gopal, TP, 2012
2, kiran, HR, 2012

/tab1/employeedata/2013/file3
3, kaleel,SC, 2013
4, Prasanth, SC, 2013
Renaming a Partition

ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION


partition_spec;

Eg:
ALTER TABLE employee PARTITION (year=’1203’) RENAME TO PARTITION
(Yoj=’1203’);
Dropping a Partition

ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec,


PARTITION partition_spec,...;

Eg:
ALTER TABLE employee DROP [IF EXISTS] PARTITION
(year=’1203’);
Bucketing

Big Da
Copyright
Bucketing

Bucketing is similar to partition.

Difference between partitioning and Bucketing:


Partition need to create partition for each unique value of the column. This leads
to create thousands of partitions .
This can be avoided in Bucketing ,can limit the number of buckets to create.

A Bucket is a file whereas a partition is directory.


Buckets

• To create a bucketed table having 3 buckets.

CREATE TABLE IF NOT EXISTS STUDENT_BUCKET (rollno INT, name STRING, grade
FLOAT)
CLUSTERED BY (grade) into 3 buckets;

• Load data to bucketed table.

FROM STUDENT INSERT OVERWRITE TABLE STUDENT_BUCKET


SELECT rollno,name,grade;

• To display the content of first bucket.

SELECT DISTINCT GRADE FROM STUDENT_BUCKET


TABLESAMPLE(BUCKET 1 OUT OF 3 ON GRADE);

Big Da
Copyright
Views

View support is available only in version starting from 0.6.

To create a view table named “STUDENT_VIEW”

CREATE VIEW STUDENT_VIEW AS SELECT rollno, name FROM EXT_STUDENT;

Querying the view

SELECT * FROM STUDENT_VIEW LIMIT 4;

To drop the view

DROP VIEW STUDENT_VIEW;


Hive - views are different from SQL views

Hive Views are Read-Only

•Views in Hive are essentially stored queries, and the data they return is
dynamically fetched from the underlying tables.

•You cannot perform INSERT, UPDATE, or DELETE operations directly on


views in Hive because Hive does not support these operations on views.

•Views in Hive are read-only and are primarily used for simplifying
complex queries.
Sub Querying
Write a sub query to count occurrence of similar words in the file

-- Create a table to store lines from a text file


CREATE TABLE docs(line STRING);

-- Load data into the `docs` table from a local file


LOAD DATA LOCAL INPATH '/root/hivedemos/[Link]' OVERWRITE INTO TABLE
docs;

-- Create a table to store word counts


CREATE TABLE word_count AS
SELECT
word,
COUNT(1) AS count
FROM (
SELECT EXPLODE(SPLIT(line, ' ')) AS word
FROM docs
)w
GROUP BY word
ORDER BY word;

-- Retrieve the word counts


SELECT * FROM word_count;

◼ w - temporary table created by the subquery.


◼ EXPLODE(...): This takes the array produced by SPLIT and converts each
element of the array into a separate row.
Joins
Joins in Hive is similar to SQL joins

To create JOIN between Student and Department tables where we use RollNo
from both the tables as the join key.

CREATE TABLE IF NOT EXISTS STUDENT(rollno INT, name STRING, gpa FLOAT) ROW
FORMAT DELIMITED FIELDS TERMINATED BY ‘\t’;

LOAD DATA LOCAL INPATH ‘/root/hivedemos/[Link]’ OVERWRITWE INTO TABLE


STUDENT;

CREATE TABLE IF NOT EXISTS DEPARTMENT(rollno INT, deptno INT ,name STRING) ROW
FORMAT DELIMITED FIELDS TERMINATED BY ‘\t’;

LOAD DATA LOCAL INPATH ‘/root/hivedemos/[Link]’ OVERWRITWE INTO


TABLE DEPARTMENT;

1. SELECT [Link],[Link],[Link],[Link] FROM STUDENT a JOIN


DEPARTMENT b ON [Link]=[Link]
Aggregations

Big Da
Copyright
Aggregations
Hive supports aggregation functions like avg, count, etc.

To write the average and count aggregation function.

SELECT avg(gpa) FROM STUDENT;

SELECT count(*) FROM STUDENT;

Big Da
Copyright
Group by and Having

Big Da
Copyright
Group by and Having

To write group by and having function.

SELECT rollno, name,gpa


FROM STUDENT
GROUP BY rollno,name,gpa
HAVING gpa > 4.0;

Big Da
Copyright
SerDer

Big Da
Copyright
SerDer

• SerDer stands for Serializer/Deserializer.

• Contains the logic to convert unstructured data into records.

• SerDes in Hive allow Hive to work with diverse data formats and file types,
making it flexible for handling structured and semi-structured data.

• Implemented using Java.

• Serializers are used at the time of writing.

• Deserializers are used at query time (SELECT Statement).

Big Da
Copyright
Supporting formats

Format Use Case

Text/CSV Simple tabular data for lightweight use cases.

ORC/Parquet Analytical queries on large datasets.


JSON/XML Semi-structured or nested data.
Avro Schema evolution with compatibility.

SequenceFile Hadoop-specific binary data for key-value storage.

Custom Formats Proprietary formats requiring specific processing.


SERDE

To manipulate the XML data


<employee>
<empid>1001</empid>
<name>John </name>
<designation> TL</designation>
</employee>
<employee>
<empid>1002</empid>
<name>Jack </name>
<designation> PM</designation>
</employee>
SERDE

CREATE TABLE xmlsample(xmldata string);

LOAD DATA LOCAL INPATH ‘/ROOT/HIVEDEMOS/[Link]’ INTO TABLE xmlsample;

CREATE TABLE xpath_table AS SELECT xpath_int(xmldata,’employee/empid’),


Xpath_string(xmldata,’employee/name’),
Xpath_string(xmldata,’employee/designation’) FROM xmlsample;

SELECT * FROM xpath_table;

Hive provides xpath_* functions (like xpath_string, xpath_int) to parse


and extract data from XML.
RCFile Implementation

CREATE TABLE STUDENT_RC(rollno int,name string,gpa float) STORED AS RCFILE;

INSERT OVERWRITE table STUDENT-RC SELECT *FROM STUDENT;

SELECT SUM(gpa) FROM STUDENT_RC;


RCFile (Record Columnar File) Implementation in Hive

• RCFile stands for Record Columnar File, a file format introduced in Hive to
optimize the storage and processing of large datasets.

• It is a hybrid format that combines the row-oriented and columnar-oriented


storage approaches, enabling efficient querying and compression of data.

• RCFile is commonly used in distributed systems like Hadoop for storing


structured and semi-structured data.

Use Cases of RCFile

[Link] Warehousing:
Efficient for analytical queries where specific columns are queried
frequently.

[Link] Workflows:
Storing intermediate data during Extract, Transform, and Load processes.

[Link] Analytics:
Useful for processing and querying log files with structured or semi-
structured data.
Use Cases of RCFile

Key Features of RCFile Implementation


[Link] Storage Model:
1. Rows are divided into row groups.
2. Within each row group, the data is stored column by column.
3. This combination allows efficient read operations for column-specific
queries.
[Link] Compression Ratio:
1. RCFile uses columnar storage within row groups, which makes it easier to
apply compression algorithms and achieve better storage efficiency.
[Link] Query Performance:
1. Since the data is stored column-wise within each row group, Hive queries
that need only a subset of columns can avoid reading the entire dataset.
Disadvantages of RCFile
[Link] Write Operations:
1. Writing data into RCFile tables can be slower compared to plain text or
row-based formats.
[Link] for ORC and Parquet:
1. RCFile has largely been replaced by ORC (Optimized Row Columnar) and
Parquet, which offer better performance and features.
USER DEFINED FUNCTIONS (UDF)

UDFs in Hive allow users to extend Hive's functionality by writing custom functions in
Java, Python, or other supported languages.

Usage of UDFs in Hive

• Transform data: Modify or process column values dynamically during queries.

• Custom logic: Implement business-specific or domain-specific computations that are


not natively supported in Hive.

• Reusability: UDFs can be reused across multiple Hive queries.

Permanent Function: To make the function permanent, you need to register it in the
Hive metastore by using the CREATE FUNCTION command.

Version Compatibility: Ensure the UDF is compatible with your Hive and Hadoop
versions.
USER DEFINED FUNCTIONS (UDF)

Write a hive function to convert the values of a field to upper case

Package [Link];
import [Link]; import
[Link]; @Description(
name=“simple UDF example”)

public final class MyLowerCase extends UDF{ public


String evalutae(final String word){ return
[Link]();
}
}

•Save the Java file as [Link].


•Compile it using any Java compiler, and package the compiled .class file
into a JAR file:
UDF

Convert this java program into Jar:

Upload the JAR file to your Hadoop Distributed File System (HDFS) or local file system.

ADD JAR /root/hivedemos/[Link]

CREATE TEMPORARY FUNCTION touppercase AS ‘[Link]’;

SELECT TOUPPERCASE(name) FROM STUDENT;


Thank you

Big Da
Copyright

You might also like