0% found this document useful (0 votes)
13 views45 pages

Overview of Apache Hive Data Warehouse

Apache Hive is an open-source data warehouse system built on Hadoop for querying and analyzing large datasets using a SQL-like language called HiveQL. It is designed for structured and semi-structured data, offering features like scalability, indexing, and user-defined functions, but it is not suitable for real-time data processing or online transaction processing. Developed by Facebook, Hive is widely used across various industries, including Amazon and IBM, and supports multiple client applications through JDBC, ODBC, and Thrift.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
13 views45 pages

Overview of Apache Hive Data Warehouse

Apache Hive is an open-source data warehouse system built on Hadoop for querying and analyzing large datasets using a SQL-like language called HiveQL. It is designed for structured and semi-structured data, offering features like scalability, indexing, and user-defined functions, but it is not suitable for real-time data processing or online transaction processing. Developed by Facebook, Hive is widely used across various industries, including Amazon and IBM, and supports multiple client applications through JDBC, ODBC, and Thrift.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Apache Hive

What is Apache Hive?


Apache Hive is an open source data warehouse system built on top of Hadoop. Especially, we
use it for querying and analyzing large datasets stored in Hadoop files. Moreover, by using Hive
we can process structured and semi-structured data in Hadoop. In other words, it is a data
warehouse infrastructure which facilitates querying and managing large datasets which reside
in the distributed storage system. Basically, it offers a way to query the data using a SQL-like
query language called HiveQL(Hive Query Language). In addition, a compiler translates HiveQL
statements into MapReduce jobs, internally. Further which are submitted to Hadoop
framework for execution.

Hive is a data warehouse system which is used to analyze structured data. It is built on the top
of Hadoop. It was developed by Facebook.

Hive provides the functionality of reading, writing, and managing large datasets residing in
distributed storage. It runs SQL like queries called HQL (Hive query language) which gets
internally converted to MapReduce jobs.

Using Hive, we can skip the requirement of the traditional approach of writing complex
MapReduce programs. Hive supports Data Definition Language (DDL), Data Manipulation
Language (DML), and User Defined Functions (UDF).

Hive is not
Sometimes, few misconceptions occur about Hive. So, let’s clarify that:
 We can say it is not a relational database
 Also, not a design for OnLine Transaction Processing (OLTP)
 Even not a language for real-time queries and row-level updates

Why Hive?

As we know it is mainly used for data querying, analysis, and summarization. Moreover, it helps
to improve the developer productivity. However, that comes at the cost of increasing latency
and decreasing efficiency. In other words, Hive is a variant of SQL and a very good one indeed.
Although, when compared to SQL systems implemented in databases, Hive stands tall. Hive has
many User Defined Functions that makes it easy to contribute to the UDFs. Also, we can
connect Hive queries to various Hadoop packages. Such as RHive, RHipe, and even Apache
Mahout. However, when working for complex analytical processing and data formats that are
challenging, it greatly helps the developer community.
To be more specific, ‘Data warehouse’ means a system we use for reporting and data analysis.
Basically, it refers to inspecting, cleaning, transforming, and modeling data with the goal of
discovering useful information and suggesting conclusions. Moreover, in the different business,
science, and social science domains data analysis has multiple aspects and approaches,
encompassing diverse techniques under a variety of names.
In addition, it allows users to simultaneously access the data and increases the response time. It
means the time a system or functional unit takes to react to a given input. Also, it has a much
faster response time than most other types of queries on the same type of huge datasets.
Moreover, without any drop in performance, it is highly flexible as more commodities can easily
be added in response to more adding of the cluster of data.

Hive – History

Basically, at Facebook, Data Infrastructure Team developed Hive. Especially, to address the
requirements at Facebook, they use Hive technology. Internally, it is very popular with all the
users on Facebook. To be very specific, for a wide variety of applications it is being used to run
thousands of jobs on the cluster with hundreds of users. In addition, Hive-Hadoop cluster stores
more than 2PB of raw data at Facebook. Moreover, on a daily basis, it loads 15 TB of data
regularly. Also, very important to know that it is being used and developed by a number of
companies. Such as Amazon, IBM, Yahoo, Netflix, Financial Industry Regulatory Authority
(FINRA) etc.

Features of Hive

These are the following features of Hive:

o Hive is fast and scalable.


o It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce or
Spark jobs.
o It is capable of analyzing large datasets stored in HDFS.
o It allows different storage types such as plain text, RCFile, and HBase.
o It uses indexing to accelerate queries.
o It can operate on compressed data stored in the Hadoop ecosystem.
o It supports user-defined functions (UDFs) where user can provide its functionality.

Limitations of Hive

o Hive is not capable of handling real-time data.


o It is not designed for online transaction processing.
o Hive queries contain high latency.
Differences between Hive and Pig

Hive Pig
Hive is commonly used by Data Analysts. Pig is commonly used by programmers.
It follows SQL-like queries. It follows the data-flow language.
It can handle structured data. It can handle semi-structured data.
It works on server-side of HDFS cluster. It works on client-side of HDFS cluster.
Hive is slower than Pig. Pig is comparatively faster than Hive.

Hive Architecture:
Hive Architecture:
Hive Architecture

The above figure shows the architecture of Apache Hive and its major components. The major
components of Apache Hive are:

1. Hive Client
2. Hive Services
3. Processing and Resource Management
4. Distributed Storage
Hive Client
Hive supports applications written in any language like Python, Java, C++, Ruby, etc. using JDBC,
ODBC, and Thrift drivers, for performing queries on the Hive. Hence, one can easily write a hive
client application in any language of its own choice.

Hive clients are categorized into three types:

1. Thrift Clients
The Hive server is based on Apache Thrift so that it can serve the request from a thrift client.

2. JDBC client
Hive allows for the Java applications to connect to it using the JDBC driver. JDBC driver uses
Thrift to communicate with the Hive Server.

3. ODBC client
Hive ODBC driver allows applications based on the ODBC protocol to connect to Hive. Similar to
the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive Server.

Hive Service
To perform all queries, Hive provides various services like the Hive server2, Beeline, etc. The
various services offered by Hive are:

1. Beeline
The Beeline is a command shell supported by HiveServer2, where the user can submit its
queries and command to the system. It is a JDBC client that is based on SQLLINE CLI (pure Java-
console based utility for connecting with relational database and executing SQL queries).
2. Hive Server 2
HiveServer2 is the successor of HiveServer1. HiveServer2 enables clients to execute queries
against the Hive. It allows multiple clients to submit requests to Hive and retrieve the final
results. It is basically designed to provide the best support for open API clients like JDBC and
ODBC.

Note: Hive server1, also called a Thrift server, is built on Apache Thrift protocol to handle the
cross-platform communication with Hive. It allows different client applications to submit
requests to Hive and retrieve the final results. It does not handle concurrent requests from
more than one client due to which it was replaced by HiveServer2.
3. Hive Driver
The Hive driver receives the HiveQL statements submitted by the user through the command
shell. It creates the session handles for the query and sends the query to the compiler.
4. Hive Compiler
Hive compiler parses the query. It performs semantic analysis and type-checking on the
different query blocks and query expressions by using the metadata stored in metastore and
generates an execution plan.

The execution plan created by the compiler is the DAG(Directed Acyclic Graph), where each
stage is a map/reduce job, operation on HDFS, a metadata operation.
5. Optimizer
Optimizer performs the transformation operations on the execution plan and splits the task to
improve efficiency and scalability.

6. Execution Engine
Execution engine, after the compilation and optimization steps, executes the execution plan
created by the compiler in order of their dependencies using Hadoop.

7. Metastore
Metastore is a central repository that stores the metadata information about the structure of
tables and partitions, including column and column type information.

It also stores information of serializer and deserializer, required for the read/write operation,
and HDFS files where data is stored. This metastore is generally a relational database.

Metastore provides a Thrift interface for querying and manipulating Hive metadata.

We can configure metastore in any of the two modes:

 Remote: In remote mode, metastore is a Thrift service and is useful for non-Java
applications.
 Embedded: In embedded mode, the client can directly interact with the metastore using
JDBC.
8. HCatalog
HCatalog is the table and storage management layer for Hadoop. It enables users with different
data processing tools such as Pig, MapReduce, etc. to easily read and write data on the grid.

It is built on the top of Hive metastore and exposes the tabular data of Hive metastore to other
data processing tools.

9. WebHCat
WebHCat is the REST API for HCatalog. It is an HTTP interface to perform Hive metadata
operations. It provides a service to the user for running Hadoop MapReduce (or YARN), Pig,
Hive jobs.

Processing Framework and Resource Management


Hive internally uses a MapReduce framework as a engine for executing the queries.
MapReduce is a software framework for writing those applications that process a massive
amount of data in parallel on the large clusters of commodity hardware. MapReduce job works
by splitting data into chunks, which are processed by map-reduce tasks.

Hive Client

Hive allows writing applications in various languages, including Java, Python, and C++. It
supports different types of clients such as:-

o Thrift Server - It is a cross-language service provider platform that serves the request
from all those programming languages that supports Thrift.
o JDBC Driver - It is used to establish a connection between hive and Java applications.
The JDBC Driver is present in the class [Link].
o ODBC Driver - It allows the applications that support the ODBC protocol to connect to
Hive.

Hive Services

The following are the services provided by Hive:-

o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive
queries and commands.
o Hive Web User Interface - The Hive Web UI is just an alternative of Hive CLI. It provides a
web-based GUI for executing Hive queries and commands.
o Hive MetaStore - It is a central repository that stores all the structure information of
various tables and partitions in the warehouse. It also includes metadata of column and
its type information, the serializers and deserializers which is used to read and write
data and the corresponding HDFS files where the data is stored.
o Hive Server - It is referred to as Apache Thrift Server. It accepts the request from
different clients and provides it to Hive Driver.
o Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and
JDBC/ODBC driver. It transfers the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and perform semantic
analysis on the different query blocks and expressions. It converts HiveQL statements
into MapReduce jobs.
o Hive Execution Engine - Optimizer generates the logical plan in the form of DAG of map-
reduce tasks and HDFS tasks. In the end, the execution engine executes the incoming
tasks in the order of their dependencies.

How Does Hive Works?


Hive – the following diagram depicts the workflow between Hive and Hadoop.

Apache Hive Tutorial – Working of Hive

The following table defines how Hive interacts with Hadoop framework.

Step-1 Execute Query


At very first, the Hive interface ( Command Line or Web UI) sends the query to Driver (any
database driver such as JDBC, ODBC, etc.) to execute.

Step-2 Get Plan


Afterwards, the driver takes the help of query compiler which parses the query to check the
syntax and query plan or the requirement of the query.

Step-3 Get Metadata


Further, the compiler sends metadata request to Metastore (any database).

Step-4 Send Metadata


After that Metastore sends metadata as a response to the compiler.

Step-5 Send Plan


Then the compiler checks the requirement and resends the plan to the driver. However, the
parsing and compiling of a query are complete, Up to here.
Step-6 Execute Plan
Further, the driver sends the execution plan to the execution engine.

Step-7 Execute Job


Then, the process of execution job is a MapReduce job, internally. Also, the execution engine
sends the job to JobTracker, which is in name node and it assigns this job to TaskTracker, which
is in data node. Moreover, the query executes MapReduce job, here.

 Metadata Ops
During the execution, the execution engine can execute metadata operations with Metastore.

Step-8 Fetch Result


While execution is over, the execution engine receives the results from Data nodes.

Step-9 Send Results


After fetching results, execution engine sends those resultant values to the driver.

Step-10 Send Results


At last the driver sends the results to Hive interfaces.

HIVE Data Types

Hive data types are categorized in numeric types, string types, misc types, and complex types. A
list of Hive data types is given below.

Integer Types

Type Size Range


TINYINT 1-byte signed integer -128 to 127
SMALLINT 2-byte signed integer 32,768 to 32,767
INT 4-byte signed integer 2,147,483,648 to 2,147,483,647
BIGINT 8-byte signed integer -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

Decimal Type

Type Size Range


FLOAT 4-byte Single precision floating point number
DOUBLE 8-byte Double precision floating point number

Date/Time Types

TIMESTAMP
o It supports traditional UNIX timestamp with optional nanosecond precision.
o As Integer numeric type, it is interpreted as UNIX timestamp in seconds.
o As Floating point numeric type, it is interpreted as UNIX timestamp in seconds with
decimal precision.
o As string, it follows [Link] format "YYYY-MM-DD HH:MM:[Link]" (9
decimal place precision)

DATES

The Date value is used to specify a particular year, month and day, in the form YYYY--MM--DD.
However, it didn't provide the time of the day. The range of Date type lies between 0000--01--
01 to 9999--12--31.

String Types

STRING

The string is a sequence of characters. It values can be enclosed within single quotes (') or
double quotes (").

Varchar

The varchar is a variable length type whose range lies between 1 and 65535, which specifies
that the maximum number of characters allowed in the character string.

CHAR

The char is a fixed-length type whose maximum length is fixed at 255.

Complex Type

Type Size Range


Struct It is similar to C struct or an object where fields are accessed struct('James','Roy')
using the "dot" notation.
Map It contains the key-value tuples where the fields are accessed map('first','James','last','Roy')
using array notation.
Array It is a collection of similar type of values that indexable using array('James','Roy')
zero-based integers.
Hive DDL Commands – Types of DDL Hive Commands
DDL statements are used to build and modify the tables and other objects in the database.

DDL Command Function


CREATE It is used to create a table or Database
SHOW It is used to show Database, Table, Properties, etc
ALTER It is used to make changes to the existing table
DESCRIBE It describes the table columns
TRUNCATE Used to permanently truncate and delete the rows of table
DELETE Deletes the table data, but, can be restored
Go to Hive shell by giving the command hive and enter the command ‘create database<data
base name>’ to create the new database in the Hive.

Hive - Create Database

In Hive, the database is considered as a catalog or namespace of tables. So, we can maintain
multiple tables within a database where a unique name is assigned to each table. Hive also
provides a default database with a name default.

o Initially, we check the default database provided by Hive. So, to check the list of existing
databases, follow the below command: -

1. hive> show databases;

Here, we can see the existence of a default database provided by Hive.

o Let's create a new database by using the following command: -


1. hive> create database demo;

So, a new database is created.

o Let's check the existence of a newly created database.

1. hive> show databases;

o Each database must contain a unique name. If we create two databases with the same
name, the following error generates: -

o If we want to suppress the warning generated by Hive on creating the database with the
same name, follow the below command: -

1. hive> create a database if not exists demo;


o Hive also allows assigning properties with the database in the form of key-value pair.

1. hive>create the database demo


2. >WITH DBPROPERTIES ('creator' = 'Gaurav Chawla', 'date' = '2019-06-03');

o Let's retrieve the information associated with the database.

1. hive> describe database extended demo;

Hive - Drop Database

In this section, we will see various ways to drop the existing database.

o Let's check the list of existing databases by using the following command: -

1. hive> show databases;


o Now, drop the database by using the following command.

1. hive> drop database demo;

o Let's check whether the database is dropped or not.

1. hive> show databases;

As we can see, the database demo is not present in the list. Hence, the database is dropped
successfully.

o If we try to drop the database that doesn't exist, the following error generates:

o However, if we want to suppress the warning generated by Hive on creating the


database with the same name, follow the below command:-

1. hive> drop database if exists demo;


o In Hive, it is not allowed to drop the database that contains the tables directly. In such a
case, we can drop the database either by dropping tables first or use Cascade keyword
with the command.
o Let's see the cascade command used to drop the database:-

[Link]> drop database if exists demo cascade;

This command automatically drops the tables present in the database first.

2. SHOW DATABASE in Hive

The SHOW DATABASES statement lists all the databases present in the Hive.
Syntax:

SHOW (DATABASES|SCHEMAS);
3. DESCRIBE DATABASE in Hive
The DESCRIBE DATABASE statement in Hive shows the name of Database in Hive, its comment
(if set), and its location on the file system.
The EXTENDED can be used to get the database properties.
Syntax:

DESCRIBE DATABASE/SCHEMA [EXTENDED] db_name;

DDL DESCRIBE DATABASE Example:


4. USE DATABASE in Hive
The USE statement in Hive is used to select the specific database for a session on which all
subsequent HiveQL statements would be executed.
Syntax:

USE database_name;

DDL USE DATABASE Example:

6. ALTER DATABASE in Hive


The ALTER DATABASE statement in Hive is used to change the metadata associated with the
database in Hive.
Syntax for changing Database Properties:
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES
(property_name=property_value, ...);

DDL ALTER DATABASE properties Example:


In this example, we are setting the database properties of the ‘dataflair’ database after its
creation by using the ALTER command.

Syntax for changing Database owner:

ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;

DDL ALTER DATABASE owner Example:


In this example, we are changing the owner role of the ‘dataflair’ database using the ALTER
statement.

Syntax for changing Database Location:

ALTER (DATABASE|SCHEMA) database_name SET LOCATION hdfs_path;

Note: The ALTER DATABASE … SET LOCATION statement does not move the database current
directory contents to the newly specified location. This statement does not change the
locations associated with any tables or partitions under the specified database. Instead, it
changes the default parent-directory, where new tables will be added for this database.
No other metadata associated with the database can be changed.
DDL Commands on Tables in Hive

1. CREATE TABLE
The CREATE TABLE statement in Hive is used to create a table with the given name. If a table or
view already exists with the same name, then the error is thrown. We can use IF NOT EXISTS to
skip the error.
Syntax:

CREATE TABLE [IF NOT EXISTS] [db_name.] table_name [(col_name data_type [COMMENT
col_comment], ... [COMMENT col_comment])] [COMMENT table_comment] [ROW FORMAT
row_format] [STORED AS file_format] [LOCATION hdfs_path];

DDL CREATE TABLE Example:


In this table, we are creating a table ‘Employee’ in the ‘dataflair’ database.

ROW FORMAT DELIMITED means we are telling the Hive that when it finds a new line character,
that means a new record.
FIELDS TERMINATED BY ‘,’ tells Hive what delimiter we are using in our files to separate each
column.
STORED AS TEXTFILE is to tell Hive what type of file to expect.
Don’t know about different Data Types supported by hive? Read Hive Data Types article.
2. SHOW TABLES in Hive
The SHOW TABLES statement in Hive lists all the base tables and views in the current database.
Syntax:

SHOW TABLES [IN database_name];

DDL SHOW TABLES Example:


3. DESCRIBE TABLE in Hive
The DESCRIBE statement in Hive shows the lists of columns for the specified table.
Syntax:

DESCRIBE [EXTENDED|FORMATTED] [db_name.] table_name[.col_name ( [.field_name])];

DDL DESCRIBE TABLE Example:


4. DROP TABLE in Hive
The DROP TABLE statement in Hive deletes the data for a particular table and remove all
metadata associated with it from Hive metastore.
If PURGE is not specified then the data is actually moved to the .Trash/current directory.
If PURGE is specified, then data is lost completely.
Syntax:

DROP TABLE [IF EXISTS] table_name [PURGE];

DDL DROP TABLE Example:


In the below example, we are deleting the ‘employee’ table.
5. ALTER TABLE in Hive
The ALTER TABLE statement in Hive enables you to change the structure of an existing table.
Using the ALTER TABLE statement we can rename the table, add columns to the table, change
the table properties, etc.
Syntax to Rename a table:

ALTER TABLE table_name RENAME TO new_table_name;

DDL ALTER TABLE name Example:


In this example, we are trying to rename the ‘Employee’ table to ‘Com_Emp’ using the ALTER
statement.

Syntax to Add columns to a table:

ALTER TABLE table_name ADD COLUMNS (column1, column2) ;

DDL ALTER TABLE columns Example:


In this example, we are adding two columns ‘Emp_DOB’ and ‘Emp_Contact’ in the ‘Comp_Emp’
table using the ALTER command.
Syntax to set table properties:

ALTER TABLE table_name SET TBLPROPERTIES (‘property_key’=’property_new_value’);

DDL ALTER TABLE properties Example:


In this example, we are setting the table properties after table creation by using ALTER
command.

6. TRUNCATE TABLE
TRUNCATE TABLE statement in Hive removes all the rows from the table or partition.
Syntax:

TRUNCATE TABLE table_name;

DDL TRUNCATE TABLE Example:

Hive - Create Table

In Hive, we can create a table by using the conventions similar to the SQL. It supports a wide
range of flexibility where the data files for tables are stored. It provides two types of table: -

o Internal table
o External table

Internal Table

The internal tables are also called managed tables as the lifecycle of their data is controlled by
the Hive. By default, these tables are stored in a subdirectory under the directory defined by
[Link] (i.e. /user/hive/warehouse). The internal tables are not flexible
enough to share with other tools like Pig. If we try to drop the internal table, Hive deletes both
table schema and data.

o Let's create an internal table by using the following command:-

1. hive> create table [Link] (Id int, Name string , Salary float)
2. row format delimited
3. fields terminated by ',' ;

Here, the command also includes the information that the data is separated by ','.
o Let's see the metadata of the created table by using the following command:-

1. hive> describe [Link]

o Let's see the result when we try to create the existing table again.

In such a case, the exception occurs. If we want to ignore this type of exception, we can use if
not exists command while creating the table.

1. hive> create table if not exists [Link] (Id int, Name string , Salary float)
2. row format delimited
3. fields terminated by ',' ;

o While creating a table, we can add the comments to the columns and can also define
the table properties.

1. hive> create table demo.new_employee (Id int comment 'Employee Id', Name string comment '
Employee Name', Salary float comment 'Employee Salary')
2. comment 'Table Description'
3. TBLProperties ('creator'='Gaurav Chawla', 'created_at' = '2019-06-06 [Link]');

o Let's see the metadata of the created table by using the following command: -

1. hive> describe new_employee;

o Hive allows creating a new table by using the schema of an existing table.

1. hive> create table if not exists demo.copy_employee


2. like [Link];
Here, we can say that the new table is a copy of an existing table.

External Table

The external table allows us to create and access a table and a data externally.
The external keyword is used to specify the external table, whereas the location keyword is
used to determine the location of loaded data.

As the table is external, the data is not present in the Hive directory. Therefore, if we try to
drop the table, the metadata of the table will be deleted, but the data still exists.

To create an external table, follow the below steps: -

o Let's create a directory on HDFS by using the following command: -

1. hdfs dfs -mkdir /HiveDirectory


o Now, store the file on the created directory.

1. hdfs dfs -put hive/emp_details /HiveDirectory


o Let's create an external table using the following command: -

1. hive> create external table emplist (Id int, Name string , Salary float)
2. row format delimited
3. fields terminated by ','
4. location '/HiveDirectory';

o Now, we can use the following command to retrieve the data: -

1. select * from emplist;


Hive DML commands
Hive DML (Data Manipulation Language) commands are used to insert, update, retrieve, and
delete data from the Hive table once the table and database schema has been defined using
Hive DDL commands.

The various Hive DML commands are:

1. LOAD
2. SELECT
3. INSERT
4. DELETE
5. UPDATE
6. EXPORT
7. IMPORT

Hive - Load Data

Once the internal table has been created, the next step is to load the data into it. So, in Hive,
we can easily load data from any file to the database.

o Let's load the data of the file into the database by using the following command: -

1. load data local inpath '/home/codegyani/hive/emp_details' into table [Link];

Here, emp_details is the file name that contains the data.

o Now, we can use the following command to retrieve the data from the database.

1. select * from [Link];


o If we want to add more data into the current database, execute the same query again
by just updating the new file name.

1. load data local inpath '/home/codegyani/hive/emp_details1' into table [Link];

o Let's check the data of an updated table: -

o In Hive, if we try to load unmatched data (i.e., one or more column data doesn't
match the data type of specified table columns), it will not throw any exception.
However, it stores the Null value at the position of unmatched tuple.
o Let's add one more file to the current table. This file contains the unmatched data.
Here, the third column contains the data of string type, and the table allows the float type
data. So, this condition arises in an unmatched data situation.

o Now, load the data into the table.

1. load data local inpath '/home/codegyani/hive/emp_details2' into table [Link];

Here, data loaded successfully.

o Let's fetch the records of the table.

1. select * from [Link]


Here, we can see the Null values at the position of unmatched data.

1. LOAD Command
The LOAD statement in Hive is used to move data files into the locations corresponding to Hive
tables.
 If a LOCAL keyword is specified, then the LOAD command will look for the file path in the
local filesystem.
 If the LOCAL keyword is not specified, then the Hive will need the absolute URI of the file.
 In case the keyword OVERWRITE is specified, then the contents of the target table/partition
will be deleted and replaced by the files referred by filepath.
 If the OVERWRITE keyword is not specified, then the files referred by filepath will be
appended to the table.

Syntax:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION
(partcol1=val1, partcol2=val2 ...)];

Example:
Here we are trying to load data from the ‘dab’ file in the local filesystem to the ‘emp_data’
table.
2. SELECT COMMAND
The SELECT statement in Hive is similar to the SELECT statement in SQL used for retrieving data
from the database.

Syntax:

SELECT col1,col2 FROM tablename;

Example:
3. INSERT Command
The INSERT command in Hive loads the data into a Hive table. We can do insert to both the
Hive table or partition.
a. INSERT INTO
The INSERT INTO statement appends the data into existing data in the table or partition.
INSERT INTO statement works from Hive version 0.8.
Syntax:

INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)]


select_statement1 FROM from_statement;

Example:
Here in this example, we are trying to insert the data of ‘emp_data’ table created above into
the table ‘example’.
INSERT statement to load data into table “example”.

b. INSERT OVERWRITE
The INSERT OVERWRITE table overwrites the existing data in the table or partition.
Syntax:

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, ..) [IF NOT EXISTS]]
select_statement FROM from_statement;

Example:
Here we are overwriting the existing data of the table ‘example’ with the data of table ‘dummy’
using INSERT OVERWRITE statement.

By using the SELECT statement we can verify whether the existing data of the table ‘example’ is
overwritten by the data of table ‘dummy’ or not.
c. INSERT .. VALUES
INSERT ..VALUES statement in Hive inserts data into the table directly from SQL. It is available
from Hive 0.14.
Syntax:

INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES


values_row [, values_row ...];

Example:
Inserting data into the ‘student’ table using INSERT ..VALUES statement.
4. DELETE command
The DELETE statement in Hive deletes the table data. If the WHERE clause is specified, then it
deletes the rows that satisfy the condition in where clause.

The DELETE statement can only be used on the hive tables that support ACID.

Syntax:

DELETE FROM tablename [WHERE expression];

Example:
In the below example, we are deleting the data of the student from table ‘student’ whose
roll_no is 105.
By using the SELECT statement we can verify whether the data of the student from table
‘student’ whose roll_no is 105 is deleted or not.

5. UPDATE Command
The UPDATE statement in Hive deletes the table data. If the WHERE clause is specified, then it
updates the column of the rows that satisfy the condition in WHERE clause.

Partitioning and Bucketing columns cannot be updated.


Syntax:

UPDATE tablename SET column = value [, column = value ...] [WHERE expression];

Example:
In this example, we are updating the branch of the student whose roll_no is 103 in the ‘student’
table using an UPDATE statement.
By using the SELECT statement we can verify whether the branch of the student whose roll_no
is 103 in the ‘student’ table is updated or not.

6. EXPORT Command
The Hive EXPORT statement exports the table or partition data along with the metadata to the
specified output location in the HDFS.
Metadata is exported in a _metadata file, and data is exported in a subdirectory ‘data.’
Syntax:

EXPORT TABLE tablename [PARTITION (part_column="value"[, ...])]

TO 'export_target_path' [ FOR replication('eventid') ];

Example:
Here in this example, we are exporting the student table to the HDFS directory
“export_from_hive”.
The table successfully exported. You can check for the _metadata file and data sub-directory
using ls command.

7. IMPORT Command
The Hive IMPORT command imports the data from a specified location to a new table or
already existing table.

Syntax:

IMPORT [[EXTERNAL] TABLE new_or_original_tablename [PARTITION (part_column="value"[,


...])]]

FROM 'source_path' [LOCATION 'import_target_path'];

Example:
Here in this example, we are importing the data exported in the above example into a new Hive
table ‘imported_table’.
Verifying whether the data is imported or not using hive SELECT statement.

You might also like