UNIT-II
Anatomy of Hadoop map-reduce (Input Splits, map phase,
shuffle, sort, combiner, reduce phase)(theory)
Hive: Introduction to Hive, data types and file formats
HiveQL data definition(Creating Databases and Tables)
HiveQL for Data loading
HiveQL data manipulation
Logical joins
Window functions
Optimization
Table partitioning
Bucketing
Indexing
Join Strategies.
Anatomy of Hadoop map-reduce (Input Splits, map phase,
shuffle, sort, combiner, reduce phase)(theory)
• Map Reduce is a programming model for data processing. Hadoop can
run Map Reduce programs written in Java, Ruby and Python.
• Map Reduce programs are inherently parallel, thus very large scale data
analysis can be done fastly.
• In Map Reduce programming, Jobs(applications) are split into a set of
map tasks and reduce tasks.
• Input data set splits into independent chunks. Map tasks process these
independent chunks completely in a parallel manner.
• Reduce task-provides reduced output by combining the output of various
mapers. There are two daemons associated with Map-Reduce
Programming: Job Tracker and Task Tracer.
Job Tracker: Job Tracker is a master daemon responsible for executing over
Map Reduce job.
• It provides connectivity between Hadoop and application.
• Whenever code submitted to a cluster, Job Tracker creates the execution
plan by deciding which task to assign to which node.
• It also monitors all the running tasks. When task fails it automatically re-
schedules the task to a different node after a predefined number of retires.
• There will be one job Tracker process running on a single Hadoop cluster.
Job Tracker processes run on their own Java Virtual machine process.
Fig. Job Tracker and Task Tracker interaction
Task Tracker: This is responsible for executing individual tasks that
is assigned by the Job Tracker.
• Task Tracker continuously sends heart beat message to job tracker.
When a job tracker fails to receive a heartbeat message from a Task
Tracker, the Job Tracker assumes that the Task Tracker has failed
and resubmits the task to another available node in the cluster.
Map Reduce Framework
Phases:
Map: Converts input into key-value pairs.
Reduce: Combines output of mappers and produces a reduced result
set.
Daemons:
Job Tracker: Master, Schedules Task
Task Tracker: Slave, Execute task
MapReduce Flow Chart
• Now let us see How Hadoop MapReduce works
by understanding the end to end Hadoop
MapReduce job execution flow with components
in detail:
4.1. Input Files
• The data for a MapReduce task is stored in input
files, and input files typically lives in HDFS. The
format of these files is arbitrary, while line-based
log files and binary format can also be used.
4.2. InputFormat
• Now, InputFormat defines how these input files
are split and read. It selects the files or other
objects that are used for input. InputFormat
creates InputSplit.
4.3. InputSplits
• It is created by InputFormat, logically represent
the data which will be processed by an
individual Mapper One map task is created for
each split; thus the number of map tasks will be
equal to the number of InputSplits. The split is
divided into records and each record will be
processed by the mapper.
4.4. RecordReader
• It communicates with the InputSplit in Hadoop
MapReduce and converts the data into key-value
pairs suitable for reading by the mapper. By
default, it uses TextInputFormat for converting
data into a key-value pair. RecordReader
communicates with the InputSplit until the file
reading is not completed. It assigns byte offset
4.5. Mapper
• It processes each input record (from
RecordReader) and generates new key-value
pair, and this key-value pair generated by
Mapper is completely different from the input
pair. The output of Mapper is also known as
intermediate output which is written to the
local disk. The output of the Mapper is not
stored on HDFS as this is temporary data and
writing on HDFS will create unnecessary
copies (also HDFS is a high latency system).
Mappers output is passed to the combiner for
further process
4.6. Combiner
• The combiner is also known as ‘Mini-reducer’.
Hadoop MapReduce Combiner performs local
aggregation on the mappers’ output, which helps
to minimize the data transfer between mapper
and reducer (we will see reducer below). Once
the combiner functionality is executed, the
output is then passed to the partitioner for
further work.
4.7. Partitioner
• Hadoop MapReduce, Partitioner comes into the
picture if we are working on more than one
reducer (for one reducer partitioner is not used).
• Partitioner takes the output from combiners and
performs partitioning. Partitioning of output
takes place on the basis of the key and then
sorted. By hash function, key (or a subset of the
key) is used to derive the partition.
• According to the key value in MapReduce, each
combiner output is partitioned, and a record
having the same key value goes into the same
partition, and then each partition is sent to a
reducer. Partitioning allows even distribution of
the map output over the reducer.
4.8. Shuffling and Sorting
• Now, the output is Shuffled to the reduce
node (which is a normal slave node but
reduce phase will run here hence called as
reducer node). The shuffling is the
physical movement of the data which is
done over the network. Once all the
mappers are finished and their output is
shuffled on the reducer nodes, then this
intermediate output is merged and sorted,
which is then provided as input to reduce
phase.
4.9. Reducer
• It takes the set of intermediate key-value pairs
produced by the mappers as the input and then
runs a reducer function on each of them to
generate the output. The output of the reducer
is the final output, which is stored in HDFS.
4.10. RecordWriter
• It writes these output key-value pair from the
Reducer phase to the output files.
4.11. OutputFormat
• The way these output key-value pairs are written in
output files by RecordWriter is determined by the
OutputFormat. OutputFormat instances provided by
the Hadoop are used to write files in HDFS or on the
local disk. Thus the final output of reducer is written
on HDFS by OutputFormat instances.
• Hence, in this manner, a Hadoop MapReduce works
over the cluster.
5. Conclusion
• In conclusion, we can say that data flow in
MapReduce is the combination of different
processing phases of such as Input Files,
InputFormat in Hadoop, InputSplits, RecordReader,
Mapper, Combiner, Partitioner, Shuffling and Sorting,
Reducer, RecordWriter, and OutputFormat. Hence all
these components play an important role in the
Map Reduce working:
• Map Reduce divides a data analysis task into two parts – Map and
Reduce. In the example given below: there two mappers and one reduce.
• Each mapper works on the partial data set that is stored on that node and
the reducer combines the output from the mappers to produce the reduced
result set.
• Steps:
• 1. First, the input dataset is split into multiple pieces of data.
• 2. Next, the framework creates a master and several slave processes and
executes the worker processes remotely.
• 3. Several map tasks work simultaneously and read pieces of data that
were assigned to each map task.
• 4. Map worker uses partitioner function to divide the data into regions.
• 5. When the map slaves complete their work, the master instructs the
reduce slaves to begin their work.
• 6. When all the reduce slaves complete their work, the master transfers
the control to the user program.
Fig. Map Reduce Programming Architecture
• Map Reduce programming using Java requires three classes:
• 1. Driver Class: This class specifies Job configuration details.
• 2. Mapper Class: this class overrides the Map Function based on
the problem statement.
• 3. Reducer Class: This class overrides the Reduce function based
on the problem statement.
Map task takes care of loading, parsing, transforming and filtering.
The responsibility of reduce task is grouping and aggregating data
that is produced by map tasks to generate final output.
Each map task is broken down into the following phases:
1. Record Reader
2. Mapper
3. Combiner
[Link].
The output produced by the map task is known as intermediate <keys,
value> pairs. These intermediate <keys, value> pairs are sent to
reducer. The reduce tasks are broken down into the following
phases:
1. Shuffle 2. Sort
3. Reducer 4. Output format.
Input Splits :
• Input Split in Hadoop Map Reduce is the logical representation of
data. It describes a unit of work that contains a single map task in a
Map Reduce program.
• Hadoop Input Split represents the data which is processed by an
individual Mapper.
• The split is divided into records. Hence, the mapper process each
record (which is a key-value pair).
• Map Reduce Input Split length is measured in bytes and every
Input Split has storage locations (hostname strings).
• Map Reduce system use storage locations to place map tasks as close
to split’s data as possible. Map tasks are processed in the order of the
size of the splits so that the largest one gets processed first (greedy
approximation algorithm) and this is done to minimize the job runtime
(Learn Map Reduce job optimization techniques)
• The important thing to notice is that Input split does not contain the input
• data; it is just a reference to the data.
• As a user, we don’t need to deal with Input Split directly, because
they are created by an Input Format(Input Format creates the
Input split and divide into records).
• File Input Format by default breaks a file into 128MB chunks
(same as blocks in HDFS).
• Input Split in Hadoop is user defined. User can control split size
according to the size of data in Map Reduce program.
• The number of map tasks is equal to the number of Input Splits.
• The client (running the job) can calculate the splits for a job by
calling ‘getSplit()’, and then sent to the application master, which
uses their storage locations to schedule map tasks that will process
them on the cluster. Then, map task passes the split to
the createRecordReader() method on Input Format to get Record
Reader for the split and Record Reader generate record (key-value
pair), which it passes to the map function.
MAP PHASE: Mapper Phases:
• Mapper maps the input <keys, value> pairs into a set of intermediate
<keys, value> pairs.
Each map task is broken into following phases:
1. Record Reader: converts byte oriented view of input in to Record
oriented view and presents it to the Mapper tasks. It presents the tasks
with keys and values.
i) Input Format: It reads the given input file and splits using the method
getsplits().
ii) Then it defines Record Reader using createRecordReader() which is
responsible for generating <keys, value> pairs.
2. Mapper: Map function works on the <keys, value> pairs produced by
RecordReader and generates intermediate (key, value) pairs.
Methods:
• protected void cleanup(Context context): called once attend of task.
• protected void map(KEYIN key, VALUEIN value, Context context):
called once for each key-value pair in input split.
• void run(Context context): user can override this method for complete
control over execution of Mapper.
• protected void setup(Context context): called once at beginning of task to
perform required activities to initiate map() method.
3. Combiner: It takes intermediate <keys, value> pairs provided by mapper
and applies user specific aggregate function to only one mapper. It is also
known as local Reducer.
• We can optionally specify a combiner using [Link]
(ReducerClass) to perform local aggregation on intermediate outputs.
Map Reduce with Combiner class :
4. Partitioner: Take intermediate <keys, value> pairs produced by the
mapper, splits them into partitions the data using a user-defined
condition.
• The default behavior is to hash the key to determine the
[Link] can control by using the method:
• int getPartition(KEY key, VALUE value, int numPartitions )
Reducer Phases:
1. Shuffle & Sort: Downloads the grouped key-value pairs onto the
local machine, where the Reducer is running.
• The individual <keys, value> pairs are sorted by key into a larger
data list.
• The data list groups the equivalent keys together so that their values
can be iterated easily in the Reducer task.
2. Reducer: The Reducer takes the grouped key-value paired data as
input and runs a Reducer function on each one of them.
• Here, the data can be aggregated, filtered, and combined in a
number of ways, and it requires a wide range of processing.
• Once the execution is over, it gives zero or more key-value pairs to
the final step.
Methods:
• - protected void cleanup(Context context): called once at tend of
task.
• - protected void reduce(KEYIN key, VALUEIN value, Context
context): called once for each key-value pair.
• - void run(Context context): user can override this method for
complete control over execution of Reducer.
• - protected void setup(Context context): called once at beginning of
task to perform required activities to initiate reduce() method.
3. Output format:
In the output phase, we have an output formatter that translates the
final key-value pairs from the Reducer function and writes them
onto a file using a record writer.
4. Compression: In Map Reduce programming we can compress
the output file. Compression provides two benefits as follows:
Reduces the space to store files.
Speeds up data transfer across the network.
• We can specify compression format in the Driver program as
below:
• [Link](“[Link]”,true);
• [Link](“[Link]”,[Link]
,[Link]);
• Here, codec is the implementation of a compression and
decompression algorithm, GzipCodec is the compression algorithm
for gzip.
Hive - Introduction
The Hadoop ecosystem contains different sub-projects
(tools) such as Sqoop, Pig, and Hive that are used to help
Hadoop modules.
• Sqoop: It is used to import and export data to and
from between HDFS and RDBMS.
• Pig: It is a procedural language platform used to
develop a script for MapReduce operations.
• Hive: It is a platform used to develop SQL type
scripts to do MapReduce operations.
Hive - Introduction
There are various ways to execute
MapReduce operations:
1. The traditional approach using Java
MapReduce program for structured, semi-
structured, and unstructured data.
2. The scripting approach for MapReduce to
process structured and semi structured
data using Pig.
3. The Hive Query Language (HiveQL or
HQL) for MapReduce to process structured
data using Hive.
What is Hive
• Hive is a data warehouse infrastructure tool
to process structured data in Hadoop. It
resides on top of Hadoop to summarize Big
Data, and makes querying and analyzing
easy.
• Initially Hive was developed by Facebook,
later the Apache Software Foundation took it
up and developed it further as an open source
under the name Apache Hive. It is used by
different companies. For example, Amazon
uses it in Amazon Elastic MapReduce.
What is Hive
Hive is not
• A relational database
• A design for OnLine Transaction Processing
(OLTP)
• A language for real-time queries and row-
level updates
Features of Hive
• It stores schema in a database and
processed data into HDFS.
• It is designed for OLAP.
• It provides SQL type language for querying
called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
Architecture of Hive
Architecture of Hive
Working of Hive
Working of Hive
[Link] Query The Hive interface such as Command Line
or Web UI sends query to Driver (any database driver such
as JDBC, ODBC, etc.) to execute.
[Link] Plan The driver takes the help of query compiler that
parses the query to check the syntax and query plan or the
requirement of query.
[Link] Metadata The compiler sends metadata request to
Metastore (any database).
[Link] Metadata Metastore sends metadata as a response
to the compiler.
[Link] Plan The compiler checks the requirement and
resends the plan to the driver. Up to here, the parsing and
compiling of a query is complete.
Working of Hive
6. Execute Plan The driver sends the execute plan to the
execution engine.
[Link] Job Internally, the process of execution job is a
MapReduce job. The execution engine sends the job to
JobTracker, which is in Name node and it assigns this job to
TaskTracker, which is in Data node. Here, the query executes
MapReduce job.
7.1 Metadata Ops Meanwhile in execution, the execution
engine can execute metadata operations with Metastore.
8. Fetch Result The execution engine receives the results
from Data nodes.
9. Send Results The execution engine sends those
resultant values to the driver.
10. Send Results The driver sends the results to Hive Interfaces.
Hive - Data Types
All the data types in Hive are classified into
four types, given as follows:
I. Column Types
II. Literals
III. Null Values
IV. Complex Types
Hive - Data Types-- Column
Types
Column type are used as column data types
of Hive. They are as follows:
A. Integral Types:
1. Integer type data can be specified using
integral data types, INT.
2. When the data range exceeds the range
of INT, you need to use BIGINT and if the
data range is smaller than the INT, you
use SMALLINT.
3. TINYINT is smaller than SMALLINT.
Hive - Data Types--
Column Types
B. String Types:
String type data types can be specified using
single quotes (' ') or double quotes (" "). It
contains two data types: VARCHAR and
CHAR.
Data Type Length
VARCHAR 1 to 65355
CHAR 255
Hive - Data Types--
Column Types
C. Timestamp:
• It supports traditional UNIX timestamp with
optional nanosecond precision.
• It supports [Link] format
“YYYY-MM-DD HH:MM:[Link]” and
format “yyyy-mm-dd hh:mm:[Link]”.
D. Dates:
• DATE values are described in
year/month/day format in the form {{YYYY-
MM-DD}}.
Hive - Data Types--
Column Types
E. Decimals:
• The DECIMAL type in Hive is as same as
Big Decimal format of Java.
• It is used for representing immutable
arbitrary precision. The syntax and
example is as follows:
DECIMAL(precision, scale)
decimal(10,0)
Hive - Data Types--
Column Types
E. Union Types:
Union is a collection of heterogeneous data types. You can
create an instance using create union. The syntax and
example is as follows:
UNIONTYPE<int, double, array<string>,
struct<a:int,b:string>>
1. {0:1}
2. {1:2.0}
3. {2:["three","four"]}
4. {3:{"a":5,"b":"five"}}
5. {2:["six","seven"]}
6. {3:{"a":8,"b":"eight"}}
7. {0:9}
8. {1:10.0}
Hive - Data Types--
Literals
The following literals are used in Hive:
A. Floating Point Types:
Floating point types are nothing but numbers
with decimal points. Generally, this type of data
is composed of DOUBLE data type.
B. Decimal Type:
Decimal type data is nothing but floating point
value with higher range than DOUBLE data
type. The range of decimal type is
approximately -10-308 to 10308.
Hive - Data Types– Null
Values
Missing values are represented by the special
value NULL.
Hive - Data Types–
Complex Types
1. Arrays: Arrays in Hive are used the same way
they are used in Java.
Syntax: ARRAY<data_type>
2. Maps: Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
3. Structs: Structs in Hive is similar to using
complex data with comment.
Syntax: STRUCT<col_name : data_type [COMMENT
col_comment], ...>
Hive DDL Operations
Hive DDL stands for (Data Definition Language) which is
used to define or change the structure of Databases,
Tables, indexes, and so on. The most commonly used DDL
are CREATE, DROP, ALTER, SHOW, and so on.
The following is the list of DDL statements that are
supported in Apache Hive.
1. CREATE
2. DROP
3. TRUNCATE
4. ALTER
5. SHOW
6. DESCRIBE
7. USE
Hive DDL Operations
Hive DDL Operations
Hive - Create Database
Hive is a database technology that can define
databases and tables to analyze structured data. The
theme for structured data analysis is to store the data
in a tabular manner, and pass queries to analyze it.
Note: Hive contains a default database named default.
Create Database Statement:
Create Database is a statement used to create a database in
Hive. A database in Hive is a namespace or a collection of
tables. The syntax for this statement is as follows:
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database
name>
Here, IF NOT EXISTS is an optional clause, which notifies the
user that a database with the same name already exists. We
can use SCHEMA in place of DATABASE in this command.
Hive DDL Operations
Hive - Create Database
The following query is executed to create a
database named userdb:
• hive> CREATE DATABASE [IF NOT EXISTS] userdb;
or
• hive> CREATE SCHEMA userdb;
The following query is used to verify a databases
list:
• hive> SHOW DATABASES;
default
userdb
Hive DDL Operations
Hive - Create Table
The conventions of creating a table in HIVE is
quite similar to creating a table using SQL.
Example
Let us assume you need to create a table named employee using CREATE
TABLE statement. The following table lists the fields and their data types in
employee table:
Hive DDL Operations
Hive - Create Table
The following query creates a table
named employee using the above data.
Hive DDL Operations
This section explains how to alter the attributes of a
table such as changing its table name, changing
column names, adding columns, and deleting or
replacing columns.
1. Alter Table Statement: It is used to alter a table
in Hive.
Syntax
The statement takes any of the following syntaxes
based on what attributes we wish to modify in a
table.
[Link] TABLE name RENAME TO new_name
[Link] TABLE name ADD COLUMNS (col_spec[,
col_spec ...])
[Link] TABLE name DROP [COLUMN]
Hive DDL Operations
2. Rename To-- Statement:
The following query renames the table
from employee to emp.
hive> ALTER TABLE employee RENAME
TO emp;
3. Change Statement:
The following table contains the fields
of employee table and it shows the fields to be
changed (in bold).
Hive DDL Operations
The following queries rename the column name and column data
type using the above data:
hive> ALTER TABLE employee CHANGE name ename String;
hive> ALTER TABLE employee CHANGE salary salary
Double;
4. Add Columns Statement:
The following query adds a column named dept to the employee
table.
hive> ALTER TABLE employee ADD COLUMNS ( dept STRING
COMMENT 'Department name');
5. Replace Statement:
The following query deletes all the columns from
the employee table and replaces it
with emp and name columns:
hive> ALTER TABLE employee REPLACE COLUMNS ( eid INT
empid Int, ename STRING name String);
HQL – Data Manipulation
This section describes how to drop a table in Hive. When you
drop a table from Hive Metastore, it removes the table/column
data and their metadata. It can be a normal table (stored in
Metastore) or an external table (stored in local file system);
Hive treats both in the same manner, irrespective of their
types.
Drop Table Statement:
The syntax is as follows:
DROP TABLE [IF EXISTS] table_name;
The following query drops a table named employee:
hive> DROP TABLE IF EXISTS employee;
On successful execution of the query, you get to see the
following response:
– OK
– Time taken: 5.3 seconds
– Hive>
HiveQL-Load Data
Statement
1. Generally, after creating a table in SQL, we can insert
data using the Insert statement. But in Hive, we can insert
data using the LOAD DATA statement.
2. While inserting data into Hive, it is better to use LOAD
DATA to store bulk records.
3. There are two ways to load data:
• one is from local file system and
• second is from Hadoop file system.
HiveQL-Load Data
Statement
Syntax
The syntax for load data is as follows:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE]
INTO TABLE tablename [PARTITION (partcol1=val1,
partcol2=val2 ...)]
1. LOCAL is identifier to specify the local path. It is
optional.
2. OVERWRITE is optional to overwrite the data in the
table.
3. PARTITION is optional.
HiveQL-Load Data
Statement
Example:
We will insert the following data into the table. It is a text file
named [Link] in /home/user directory.
The following query loads the given text into the table.
hive> LOAD DATA LOCAL INPATH '/home/user/[Link]'
OVERWRITE INTO TABLE employee;
On successful download, you get to see the following response:
• OK
• Time taken: 15.905 seconds
• Hive>
Hive DML Operations
Apache Hive DML stands for (Data Manipulation Language)
which is used to insert, update, delete, and fetch data from
Hive tables. Using DML commands we can load files into
Apache Hive tables, write data into the file system from Hive
queries, perform merge operation on the table, and so on.
The following list of DML statements is supported by Apache
Hive.
1. LOAD
2. SELECT
3. INSERT
4. DELETE
5. UPDATE
6. EXPORT
7. IMPORT
Hive DML Operations- 1. Load
Command
The load command is used to move data files into Hive tables. Load
operations are pure copy/move operations.
1. During the LOAD operation, if a LOCAL keyword is mentioned, then the
LOAD command will check for the file path in the local filesystem.
2. During the LOAD operation, if the LOCAL keyword is not mentioned,
then the Hive will need the absolute URI of the file such as
hdfs://namenode:9000/user/hive/project/data1.
3. During LOAD operation, if the OVERWRITE keyword is mentioned, then
the contents of the target table/partition will be deleted and replaced
by the files referred by the file path.
4. During LOAD operation, if the OVERWRITE keyword is not mentioned,
then the files referred to by the file path will be appended to the table.
Load Table Syntax:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TA
BLE tablename [PARTITION (partcol1=val1, partcol2=val2 ..
.)]
Hive DML Operations- 1. Load
Command
Load Table Statement:
LOAD DATA LOCAL INPATH '/home/cloudduggu/
hive/examples/files/ml-00k/
[Link]' OVERWRITE INTO TABLE cloudduggudb.u
serdata;
Hive DML Operations- 2. Select
Command
The Select statement project the records from the table.
Select Command Syntax:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT [offset,] rows]
Select Command Statement:
SELECT * FROM [Link] WHERE userid=389
;
Hive DML Operations- [Link]
into Command
The Insert into command appends data from one table
to another table.
Insert Into Syntax:
INSERT INTO TABLE tablename1 [PARTITION (par
tcol1=val1, partcol2=val2 ...)] select_statement
1 FROM from_statement;
Insert Into Statement:
INSERT INTO cloudduggudb.employee_bkp SELE
CT * FROM cloudduggudb.employee_detail;
Hive DML Operations- [Link]
Overwrite Command
The Insert overwrites perform the overwriting of the existing
content of the table.
In this example we will use both tables which we used in the
INSERT INTO section and overwrite the content of
"cloudduggudb.employee_bkp" with
"cloudduggudb.employee_detail".
Insert Overwrite Syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (pa
rtcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select
_statement1 FROM from_statement;
Insert Overwrite Statement:
INSERT OVERWRITE TABLE cloudduggudb.employee_b
kp SELECT a.* FROM cloudduggudb.employee_detail a;
Hive DML Operations- 5. Insert Values
Command
By using the Insert Values command we can manually
insert records in the existing table.
We will use the “cloudduggudb.employee_bkp” table
and insert 2 records in that.
Insert Values Syntax:
INSERT INTO TABLE tablename [PARTITION (part
col1[=val1], partcol2[=val2] ...)] VALUES values_
row [, values_row ...];
Insert Values Statement:
INSERT INTO cloudduggudb.employee_bkp VALU
ES(1207,'Mahesh',70000,'Manager'),
(1208,'Raj',70000,'Executive');
Hive DML Operations- 6. Delete
Command
1. The Delete command is used to delete data from the table. If
we supply where clause then it will delete that particular
record only.
2. To perform Delete/Update operations in Apache Hive we need
to follow the below points while creating a table otherwise
delete/update statements with fail with error 10297.
Note: In Apache Hive, We can perform a DELETE statement
on those tables which follow the ACID property.
Before performing the create, delete, update table we should
enable the ACID property using the below parameters on Hive
prompt.
hive>set [Link]=true;
hive>set [Link]=true;
hive>set [Link]=nonstrict;
hive>set [Link]=true;
Hive DML Operations- 6. Delete
Command
File format should be ORC which can be defined with
TBLPROPERTIES(‘transactional’=’true’).
The table should be created with CLUSTERED BY followed by Buckets.
Now we will enable ACID property and create a table. After table
creation, we will insert data.
Create Table Statement:
CREATE TABLE acidexample (key int, value int)
PARTITIONED BY (load_date date)
CLUSTERED BY(key) INTO 3 BUCKETS
STORED AS ORC TBLPROPERTIES ('transactional'='true');
Insert Table Statement:
INSERT INTO acidexample partition (load_date='2016-03-
01') VALUES (1, 1);
INSERT INTO acidexample partition (load_date='2016-03-
02') VALUES (2, 2);
INSERT INTO acidexample partition (load_date='2016-03-
03') VALUES (3, 3);
Hive DML Operations- 6. Delete
Command
Delete Command Syntax:
DELETE FROM tablename [WHERE expressio
n];
Delete Command with where Statement:
DELETE FROM acidexample WHERE key = 1;
Delete Statement:
DELETE FROM acidexample;
Hive DML Operations- 7. Update
Command
The Update command updates the existing records if where
clause is supplied otherwise it will delete table data. We
can’t perform update command on Partitioning and
Bucketing columns.
Note: In Apache Hive, We can perform an UPDATE
statement on those tables which follow the ACID property.
Update Command Syntax:
UPDATE tablename SET column = value [, column = v
alue ...] [WHERE expression];
Update Statement:
UPDATE acidexample SET value=2 where key=3;
Hive DML Operations- 8. Export
Command
• The Apache Hive EXPORT command is used in case we need
to exports the table's metadata to some other location.
• To perform this activity we have created a directory
“hive_export_location“in HDFS under /data/
hive_export_location and exporting table “acidexample”.
Export Command Syntax:
EXPORT TABLE tablename [PARTITION (part_column="
value"[, ...])]
TO 'export_target_path' [ FOR replication('eventid') ]
Export Statement:
export table acidexample to '/data/hive_export_location';
Hive DML Operations- 9. Import
Command
1. Apache Hive IMPORT command imports the data from a
specific location into Hive tables.
2. To perform this activity we will copy table “acidexample” data
from HDFS location “data/ hive_export_location” into the
“cloudduggudb” database.
Import Command Syntax:
IMPORT [[EXTERNAL] TABLE new_or_original_tablenam
e [PARTITION (part_column="value"[, ...])]]
FROM 'source_path’
[LOCATION 'import_target_path']
Import Statement:
import table acidexample from '/data/
hive_export_location';
Hive Joins
• Apache Hive JOINs are used to combine columns
from one (self-join) or more tables by using values
common to each. Using join we can fetch
corresponding records from two or more tables. It is
almost similar to SQL joins.
• Apache Hive provides four types of joins which are
mentioned below.
1. Inner Join
2. Left Outer Join
3. Right Outer Join
4. Full Outer Join
Hive Joins
The following graph is the representation of Apache
Hive Joins using table A and table B.
Hive Joins
Hive Join Syntax
Hive Joins
The following graph is the representation of Apache
Hive Joins using table A and table B.
THE END