0% found this document useful (0 votes)
6 views64 pages

Introduction to Apache Hive Overview

Hive is a data warehouse infrastructure tool built on Hadoop for processing structured data, initially developed by Facebook and now maintained by the Apache Software Foundation. It provides a SQL-like querying language called HiveQL, supports various data types, and operates in local or MapReduce modes depending on the data size. Key components include the User Interface, Meta Store, and Execution Engine, which facilitate interaction with HDFS and the execution of queries.

Uploaded by

muthuraj810vmp
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views64 pages

Introduction to Apache Hive Overview

Hive is a data warehouse infrastructure tool built on Hadoop for processing structured data, initially developed by Facebook and now maintained by the Apache Software Foundation. It provides a SQL-like querying language called HiveQL, supports various data types, and operates in local or MapReduce modes depending on the data size. Key components include the User Interface, Meta Store, and Execution Engine, which facilitate interaction with HDFS and the execution of queries.

Uploaded by

muthuraj810vmp
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

4.

1INTRODUCTION TO HIVE

⮚ Hive is a data warehouse infrastructure tool to process structured data in Hadoop.


⮚ It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing
easy.

Initially Hive was developed by Facebook, later the Apache Software Foundation took it up and
developed it further as an open source under the name Apache Hive. It is used by different
companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Features of Hive

● It stores schema in a database and processed data into HDFS.


● It is designed for OLAP.
● It provides SQL type language for querying called HiveQL or HQL.
● It is familiar, fast, scalable, and extensible.

4.1.2 HIVE ARCHITECTURE

● The following component diagram depicts the architecture of Hive:

Page 1 of 64
User Interface

Hive is a data warehouse infrastructure software that can create interaction between user and
HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive
HD Insight (In Windows server).

Meta Store

The component that stores all the structure information of the various tables and partitions in the
warehouse including column and column type information, the serializers and deserializers
necessary to read and write data and the corresponding HDFS files where the data is stored.

HiveQL Process Engine

HiveQL is similar to SQL for querying on schema info on the Metastore. It is one of the
replacements of traditional approach for MapReduce program. Instead of writing MapReduce
program in Java, we can write a query for MapReduce job and process it.

Execution Engine

The conjunction part of HiveQL process Engine and MapReduce is Hive Execution Engine.
Execution engine processes the query and generates results as same as MapReduce results. It
uses the flavor of MapReduce.

HDFS or HBASE

Page 2 of 64
Hadoop distributed file system or HBASE are the data storage techniques to store data into file
system.

Working of Hive

The following diagram depicts the workflow between Hive and Hadoop.

[Link] Query

The Hive interface such as Command Line or Web UI sends query to Driver (any
database driver such as JDBC, ODBC, etc.) to execute.

[Link] Plan

The driver takes the help of query compiler that parses the query to check the syntax and query
plan or the requirement of query.

[Link] Metadata

The compiler sends metadata request to Metastore (any database).

Page 3 of 64
[Link] Metadata

Metastore sends metadata as a response to the compiler.

[Link] Plan

The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing
and compiling of a query is complete.

[Link] Plan

The driver sends the execute plan to the execution engine.

[Link] Job

Internally, the process of execution job is a MapReduce job. The execution engine sends the job
to JobTracker, which is in Name node and it assigns this job to TaskTracker, which is in Data
node. Here, the query executes MapReduce job.

7.1..Metadata Ops

Meanwhile in execution, the execution engine can execute metadata operations with Metastore.

[Link] Result

The execution engine receives the results from Data nodes.

[Link] Results

The execution engine sends those resultant values to the driver.

[Link] Results

The driver sends the results to Hive Interfaces.

Hive Query Language (HQL)

HQL is very similar to traditional database. It stores data in tables, where each table consists of
columns and each column consists of specific number of rows.

Page 4 of 64
Each column has its own data type. Hive supports primitive as well as complex data types.
Primitive types like Integer, Bigint, Smallint, Tinyint, Float, Double Boolean, String, and Binary
are supported. Complex types include Associative array: map ,Structs: struct , and Lists: list .

Data Definition statements (DDL) like create table, alter table, drop table are supported. All
these DDL statements can be used on Database, tables, partitions, views, functions, Index, etc.
Data Manipulation statements (DML) like load, insert, select and explain are supported. Load is
used for taking data from HDFS and moving it into Hive. Insert is used for moving data from
one Hive table to another. Select is used for querying data.

Different modes of Hive


Hive can operate in two modes depending on the size of data nodes in Hadoop.
These modes are,

● Local mode
● Map reduce mode

When to use Local mode:

● If the Hadoop installed under pseudo mode with having one data node we use Hive in this
mode
● If the data size is smaller in term of limited to single local machine, we can use this mode
● Processing will be very fast on smaller data sets present in the local machine

When to use Map reduce mode:

● If Hadoop is having multiple data nodes and data is distributed across different node we
use Hive in this mode
● It will perform on large amount of data sets and query going to execute in parallel way
● Processing of large data sets with better performance can be achieved through this mode

In Hive, we can set this property to mention which mode Hive can work? By default, it works on
Map Reduce mode and for local mode you can have the following setting.
Hive to work in local mode set
SET [Link]=local;
From the Hive version 0.7 it supports a mode to run map reduce jobs in local mode
automatically.

Page 5 of 64
4.1.3Data Types in Hive

⮚ Hive data types are categorized into two types.


⮚ They are the primitive and complex data types.

Primitive Data Types

primitive Data Types also divide into 4 types there are mentioned below

● Numeric Data Type


● String Data Type
● Date/Time Data Type
● Miscellaneous Data Type

Numeric Data Type


Numerical Data types are mainly divided into 2 types

● Integral Data Types


● Floating Data Types

Integral Data Types

● TINYINT (This TINYINT is equal to Java’s BYTE data type)


● SMALLINT (This SMALLINT is equal to Java’s SHORT data type)
● INT (This INT is equal to Java’s INT data type)
● BIGINT (This BIGINTis equal to Java’s LONG data type

Floating Data Types

● FLOAT (This FLOAT is equal to Java’s FLOAT data type )


● DOUBLE (This DOUBLE is equal to Java’s DOUBLE data type)
● DECIMAL (This DECIMAL is equal to SQL’s DECIMAL data type)

DECIMAL(4,3) represents total of 4 digits, out of which 3 are decimal digits. Below is the chart
for all numeric types with their ranges and examples.

Page 6 of 64
By default in Hive integral values are taken as a INT data type unless the value of integral values
range are cross the range of INT values (Please check the above table for values).Suppose if we
need to use a low integral value like 100 it to be treated as TINYINT or SMALLINT or BIGINT
then we need to suffix the value with Y, S or L respectively.
Examples: 100Y – TINYINT, 100S – SMALLINT, 100L – BIGINT

String Data Types

In Hive String Data Types are Mainly divided into 3 types there are mentioned below

● STRING
● VARCHAR
● CHAR

String Data Types

In Hive String Data Types are Mainly divided into 3 types there are mentioned below

● STRING
● VARCHAR

Page 7 of 64
● CHAR

This String Data types are supported in Hive from only 0.14 and above versions only

What is Difference between CHAR vs VARCHAR

● CHAR is fixed length


● VARCHAR is is of variable length but we need to specify the max length of the field
(example : name VARCHAR(64)). If the values are less than the max length specified
then remaining space will be freed out.
● Maximum value of CHAR is 255 but Maximum value of VARCHAR is 65355
● In VARCHAR we can optimized the Space/Storage by using released the unused bytes
but in CHAR we can optimize the unused Space CHAR unused bytes will not be
released but filled with spaces.
● If a string value being assigned to a VARCHAR value exceeds the length specified, then
the string is silently truncated.

DATE/TIME Data Types


Date/Time Data types are mainly Divide into 2 types

Page 8 of 64
● DATE
● TIMESTAMP

This hive date/time data types in UNIX time stamp format for date/time related fields in hive.
DATE
Represented format is YYYY-MM-DD
Example : DATE ‘2014-12-07’
Date ranges allowed are 0000-01-01 to 9999-12-31
TIMESTAMP
TIMESTAMP use the format yyyy-mm-ddhh:mm:ss[.f...].

Cast Type Result

cast(date as date) Same date value

cast(date as string) Date is formatted as a string in the form ‘YYYY-MM-DD’.

cast(date as Midnight of the year/month/day of the date value is returned


timestamp) as timestamp.

If the string is in the form ‘YYYY-MM-DD’, then a date value


cast(string as date) corresponding to that is returned.

If the string value does not match this format, then NULL is returned.

cast(timestamp as
The year/month/day of the timestamp is returned as a date value.
date)

Miscellaneous Types

Hive Supports 2 more primitive Data types

● BOOLEAN
● BINARY

Page 9 of 64
Hive BOOLEAN is similar to Java’s BOOLEAN types,it can stores true or false values only

BINARY is an array of Bytes and similar to VARBINARY in many RDBMSs. BINARY


columns are stored within the record, not separately like BLOBs

Implicit Conversion Between Primitive Data Types

TINYINT—>SMALLINT–>INT–>BIGINT–>FLOAT–>DOUBLE

TINYINT can be converted to any other numeric data type but BIGINT can only be converted to
FLOAT or DOUBLE but the reverse

Boolean & Binary data types will not be converted to any other data type implicitly.

Explicit Conversion Between Primitive Data Types

Explicit type conversion can be done using the cast operator only.

Example: CAST(‘500’ AS INT) will convert the string ‘500’ to the integer value 500. But If
cast is used incorrectly as in CAST(‘Hello’ AS INT) , then cast operation will fail and returns
NULL .

COMPLEX DATA TYPES

The Above Primitive Data Types are almost same in all other Databases,Here hive also
supported the hive complex data types which are not available in many relational database
management systems.

Hive Presently supported 4 complex data types

● ARRAY
● MAP
● STRUCT
● UNIONTYPE

Page 10 of 64
ARRAY Data Type

Same as Array in java,An Ordered sequences of similar type elements that are index
using zero-based integers

Example array(‘John’, ‘Doe’)

The second value is accused by using array[1]

MAP Data Type

Collection of key-value pairs. Fields are accessed using array notation of keys (e.g., [‘key’]).

Example map(‘first’, ‘John’,’last’, ‘Doe’)

if a column name is of type MAP with key→value pairs ‘first’→’John’ and ‘last’→’Doe’, then
the last name can be referenced using name[‘last’].

STRUCT Data Type

● It is similar to STRUCT in C language. It is a record type which encapsulates a set of


named fields that can be any primitive data type. Elements in STRUCT type are accessed
using the DOT (.) notation.

Example – For a column c of type STRUCT {a INT; b INT} the a field is accessed by the
expression c.a

UNIONTYPE

UNIONTYPE is collection of Heterogeneous data [Link] is similar to Unions in C. At any point


of time, an Union Type can hold any one (exactly one) data type from its specified data types

Example

Page 11 of 64
UNIONTYPE<int,double, array<string>,struct<a:int,b:string>>

4.1.4 HIVE COMMANDS

DATABASE:

⮚ Hive is a database technology that can define databases and tables to analyze structured
data.
⮚ The theme for structured data analysis is to store the data in a tabular manner, and pass
queries to analyze it.
⮚ Hive contains a default database named default.

[Link] Database Statement:

Create Database is a statement used to create a database in Hive. A database in Hive is a


namespace or a collection of tables.

The syntax for this statement is as follows:

CREATE DATABASE|SCHEMA [IF NOT EXISTS]<database name>

Here, IF NOT EXISTS is an optional clause, which notifies the user that a database with the
same name already exists. We can use SCHEMA in place of DATABASE in this command. The
following query is executed to create a database named userdb:

hive> CREATE DATABASE [IF NOT EXISTS]userdb;

or

hive> CREATE SCHEMA userdb;

The following query is used to verify a databases list:

hive> SHOW DATABASES;


default

Page 12 of 64
userdb

Example: -Create database "guru99"

From the above screen shot, we are doing two things

● Creating database "guru99" in Hive


● Displaying existing databases using "show" command
● In the same screen, Database "guru99" name is displayed at the end when we execute the
show command. Which means Database "guru99" is successfully created.

[Link] Database Statement :

Drop Database is a statement that drops all the tables and deletes the database. Its syntax is as
follows:

DROP DATABASE StatementDROP(DATABASE|SCHEMA)[IF EXISTS]database_name


[RESTRICT|CASCADE];

The following queries are used to drop a database. Let us assume that the database name is
userdb.

hive> DROP DATABASE IF EXISTS userdb;

The following query drops the database using CASCADE. It means dropping respective tables
before dropping the database.

hive> DROP DATABASE IF EXISTS userdb CASCADE;

The following query drops the database using SCHEMA.

hive> DROP SCHEMA userdb;

Page 13 of 64
[Link]:

Create, Drop and altering of tables in Hive

Table Operations such as Creation, Altering, and Dropping tables in Hive can be observed in
this section.

In the above screenshot, we are creating a table with columns and altering the table name.

1. Creating table guru_sample with two column names such as "empid" and "empname"
2. Displaying tables present in guru99 database
3. Guru_sample displaying under tables
4. Altering table "guru_sample" as "guru_sampleNew"
5. Again when you execute "show" command, it will display the new name
Guru_sampleNew

Dropping table guru_sampleNew:

[Link] Table Statement

Page 14 of 64
It is used to alter a table in Hive.
Syntax
The statement takes any of the following syntaxes based on what attributes we wish to modify in
a table.
ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_namenew_namenew_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
Rename To… Statement
The following query renames the table from employee to emp.
hive> ALTER TABLE employee RENAME TO emp;
[Link] Table Statement
When you drop a table from Hive Metastore, it removes the table/column data and their
metadata. It can be a normal table (stored in Metastore) or an external table (stored in local file
system); Hive treats both in the same manner, irrespective of their types.

The syntax is as follows:


DROP TABLE [IF EXISTS]table_name;
The following query drops a table named employee:
hive> DROP TABLE IF EXISTS employee;
On successful execution of the query, you get to see the following response:
OK
Time taken:5.3 seconds
hive>

Table types and its Usage:

Hive deals with two types of table structures like Internal and External tables depending on the
loading and design of schema in Hive.,

Managed Table in Hive

⮚ Hive managed tables are also called as hive internal tables.

Page 15 of 64
⮚ Internal table file security is controlled solely via HIVE.

⮚ Deleting the table deletes the metadata & data from masternode and HDFS
respectively
⮚ Security needs to be managed within HIVE, probably at the schema level (depends on
organisation to organisation). HDFS security is out of scope in this case.
⮚ Internal Table is tightly coupled in [Link] this type of table, first we have to create
table and load the data.
⮚ We can call this one as data on schema.
⮚ By dropping this table, both data and schema will be removed.
⮚ The stored location of this table will be at /user/hive/warehouse.

When to Choose Internal Table:

⮚ If the processing data available in local file system


⮚ If we want Hive to manage the complete lifecycle of data including the deletion

Sample code Snippet for Internal Table

1. To create the internal table

Hive>CREATE TABLE guruhive_internaltable (id INT,Name STRING);

Row format delimited

Fields terminated by '\t';

2. Load the data into internal table

Hive>LOAD DATA INPATH '/user/guru99hive/[Link]' INTO table


guruhive_internaltable;

3. Display the content of the table

Hive>select * from guruhive_internaltable;

4. To drop the internal table

Hive>DROP TABLE guruhive_internaltable;


Page 16 of 64
If you dropped the guruhive_internaltable, including its metadata and its data will be deleted
from Hive.
From the following screenshot, we can observe the output

External table:

⮚ External table stores files on the HDFS server but tables are not linked to the source file
completely.
⮚ If you delete an external table the file still remains on the HDFS server.
⮚ The file and the table link is there but read only.
⮚ External table files are accessible to anyone who has access to HDFS file structure and
therefore security needs to be managed at the HDFS file/folder level.
⮚ Meta data is maintained on master node and deleting an external table from HIVE, only
deletes the metadata not the data/file
⮚ External Table is loosely coupled in nature. Data will be available in [Link] table is
going to create on HDFS data.

Page 17 of 64
⮚ In other way, we can say like its creating schema on data.
⮚ At the time of dropping the table it drops only schema, the data will be still available in
HDFS as before.
⮚ External tables provide an option to create multiple schemas for the data stored in HDFS
instead of deleting the data every time whenever schema updates

When to Choose External Table:

⮚ If processing data available in HDFS


⮚ Useful when the files are being used outside of Hive

Sample code Snippet for External Table

1. Create External table

Hive>CREATE EXTERNAL TABLE guruhive_external(id INT,Name STRING)

Row format delimited

Fields terminated by '\t'

LOCATION '/user/guru99hive/guruhive_external;

2. If we are not specifying the location at the time of table creation, we can load the data
manually

Hive>LOAD DATA INPATH '/user/guru99hive/[Link]' INTO TABLE


guruhive_external;

3. Display the content of the table

Hive>select * from guruhive_external;

4. To drop the internal table

Hive>DROP TABLE guruhive_external;

From the following screen shot, we can observe the output

Page 18 of 64
Difference between Internal Vs External tables

Feature Internal External

Schema Data on Schema Schema on Data

Storage Location /usr/hive/warehouse HDFS location

Data availability Within local file system Within HDFS

SELECT statement :

The Hive Query Language (HiveQL) is a query language for Hive to process and analyze
structured data in a Metastore. This chapter explains how to use the SELECT statement with
WHERE clause.

Page 19 of 64
SELECT statement is used to retrieve the data from a table. WHERE clause works similar to a
condition. It filters the data using the condition and gives you a finite result. The built-in
operators and functions generate an expression, which fulfils the condition.

Syntax

Given below is the syntax of the SELECT query:

SELECT [ALL | DISTINCT]select_expr,select_expr,...


FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list|[DISTRIBUTE BY col_list][SORT BY
col_list]]
[LIMIT number];

Example

Let us take an example for SELECT…WHERE clause. Assume we have the employee table as
given below, with fields named Id, Name, Salary, Designation, and Dept. Generate a query to
retrieve the employee details who earn a salary of more than Rs 30000.

+------+--------------+-------------+-------------------+-------
-+
| ID | Name | Salary | Designation | Dept
|
+------+--------------+-------------+-------------------+-------
-+
|1201 |Gopal | 45000 | Technical manager | TP
|
|1202 |Manisha | 45000 | Proofreader | PR
|
|1203 |Masthanvali | 40000 | Technical writer | TP
|

Page 20 of 64
|1204 |Krian | 40000 | Hr Admin | HR
|
|1205 |Kranthi | 30000 | Op Admin | Admin
|
+------+--------------+-------------+-------------------+-------
-+
The following query retrieves the employee details using the above scenario:
hive> SELECT * FROM employee WHERE salary>30000;
On successful execution of the query, you get to see the following response:
+------+--------------+-------------+-------------------+-------
-+
| ID | Name | Salary | Designation | Dept
|
+------+--------------+-------------+-------------------+-------
-+
|1201 |Gopal | 45000 | Technical manager | TP
|
|1202 |Manisha | 45000 | Proofreader | PR
|
|1203 |Masthanvali | 40000 | Technical writer | TP
|
|1204 |Krian | 40000 | Hr Admin | HR
|
+------+--------------+-------------+-------------------+-------
-+

hive> SELECT Id, Name, Dept FROM employee ORDER BY DEPT;


On successful execution of the query, you get to see the following response:
+------+--------------+-------------+-------------------+-------
-+
| ID | Name | Salary | Designation | Dept
|
+------+--------------+-------------+-------------------+-------
-+
|1205 |Kranthi | 30000 | Op Admin | Admin
|
|1204 |Krian | 40000 | Hr Admin | HR
|
|1202 |Manisha | 45000 | Proofreader | PR
|
|1201 |Gopal | 45000 | Technical manager | TP
|

Page 21 of 64
|1203 |Masthanvali | 40000 | Technical writer | TP
|
+------+--------------+-------------+-------------------+-------
-+

hive> SELECT Dept,count(*) FROM employee GROUP BY DEPT;


On successful execution of the query, you get to see the following response:
+------+--------------+
| Dept | Count(*) |
+------+--------------+
|Admin | 1 |
|PR | 2 |
|TP | 3 |
+------+--------------+

4.1.5 SQL Partition

The main work of SQL Partition is dividing the large amount of data into number of slices by
based on table single column only. SQL Partition is very hard comparing to Hive Partition
because SQL server only support one column for [Link] can be very harder to apply the
query,indexes take up more space,and SQL Server’s partitioning requires Enterprise Edition on
top of that expensive SAN you’re adding to cope with the extra space.

Hive Partition

The main work of Hive partition is also same as SQL Partition but the main difference between
SQL partition and hive partition is SQL partition is only supported for single column in table but
in Hive partition it supported for Multiple columns in a table .In Hive we can apply Hive
Partition concept on Managed tables and External [Link] we not crated dynamic partition for
hive, Hive also creates an automatic partition scheme when the table is created

Example Hive Partition


create table cityreport(cityidstring,creport string, ctover
string)
partitioned by (city string)
row format delimited

Page 22 of 64
fields terminated by ‘|’
stored as textfile;

In the Above table showing how to create a simple Hive partition [Link] the above table
columns are cityid,creport,ctover and table is partitioned by [Link] above table is partitioned by
only one column city .
Load Data into Hive Partition Table
hive> load data local inpath ‘/home/mahesh/hive-related/[Link]’ into table
cityreport partition (city = ‘hyderabad’);
hive> load data local inpath ‘/home/mahesh/hive-related/[Link]’ into table
cityreport partition (city = ‘banglore’);
hive> load data local inpath ‘/home/mahesh/hive-related/[Link]’ into table cityreport
partition (city = ‘chennai’);
Now the total table data is partitioned by city,so when the query is applied on table it can easily
access the required row by the help partitions
hive> select * from cityreport where cityid = ‘1003_hyd’ and city=’hyderabad';
The above query directly searched in city=’hyderabad’ partiton and not search in other city
partition so the data look up time is decreased and the details displayed very quickly because of
Hive Partition

Multiple column supported Hive Partition Tables

CREATETABLE customer (
id INT,
name STRING,
address1 STRING,
address2 STRING,
city STRING,
state STRING,
zip STRING
)
PARTITION BY (
REGION STRING,
country STRING
);

Page 23 of 64
This above approach is great once you’re using Hive in production but it can be tedious to
initially load a large data warehouse when you can only write to one partition at a time. There is
a better way.

AUTOMATIC PARTITIONING
Hive lets us change settings for a session using the SET command. Changing these settings
permanently would require opening a text file and restarting the Hive cluster – it’s not a difficult
change, but it’s outside of our scope.
SET [Link] = true;

SET [Link] = nonstrict;

A few final important notes on partitioning:

i) Not to over partition your data in [Link] your partitions are relativity small(not so matched)
then data scanning and retrieving very difficult comparing to sample [Link] your table all
partitions columns should be close relation to each other columns in a table .

ii) If you want to use hive partition in your table you should enable below query in your
[Link] property.

[Link]=strict

we can also enable hive partition without changing [Link],just type the above command
in your hive terminal but it only works for that session of time only,If you close the terminal the
settings also cleared so better to set in [Link] file.

4.1.6 Hive Buckets

⮚ The table divide into number of partitions is called Hive Partition, The Hive

Partition can be further subdivided into Clusters or Buckets


⮚ Hive Buckets is nothing but another technique of decomposing data or decreasing the
data into more manageable parts or equal [Link] example we have table with columns
like date,employee_name,employee_id,salary,leavesetc .

Page 24 of 64
⮚ In this table just use date column as the top-level partition and the employee_id as the
second-level partition leads to too many small partitions.
⮚ So here employee table is partition by date and bucketing by employee_id.
⮚ The value of this column will be hashed by a user-defined number into buckets. Records
with the same employee_id will always be stored in the same bucket.

In Partition there is so many chance to create thousands of tiny partitions but coming into Hive
Buckets we can’t create number of Hive Buckets the reason is we should declare the number of
buckets for a table in the time of table creation. In Hive Partition each partition will be created
as directory but in Hive Buckets each bucket will be created as [Link] Partition we
used PARTITIONED BY but in Hive Buckets we used CLUSTERED BY

Advantages With Hive Buckets


The number of buckets is fixed so it does not fluctuate with data
Enables more efficient queries
makes sampling more efficient
Hash(column) MOD(number of buckets) –evenly distributed
Used for Query optimization Techniques

Step 1) Creating Bucket as shown below.

From the above screen shot

● We are creating sample_bucket with column names such as first_name, job_id,


department, salary and country
Page 25 of 64
● We are creating 4 buckets overhere.
● Once the data get loaded it automatically, place the data into 4 buckets

Step 2) Loading Data into table sample bucket

Assuming that"Employees table" already created in Hive system. In this step, we will see the
loading of Data from employees table into table sample bucket.

Before we start moving employees data into buckets, make sure that it consist of column names
such as first_name, job_id, department, salary and country.

Here we are loading data into sample bucket from employees table.

Step 3)Displaying 4 buckets that created in Step 1

From the above screenshot, we can see that the data from the employees table is transferred into
4 buckets created in step 1.

Check below screen and you will realize three files names as 000000_0, 000000_1, 000000_2,
and000000_3, are created these are our data files.

Page 26 of 64
Word Count In Hive

Assume we have data in our table like below

This is a Hadoop Post


and Hadoop is a big data technology

[Link] sentence into words


the data we have is in sentences,first we have to convert that it into words applying space as
[Link] have to use split function of hive.

split (sentence ,' ')

[Link] column into rows


Now we have array of strings like this
[This,is,a,hadoop,Post]
but we have to convert it into multiple rows like below

This
is
a
hadoop
Post

I mean we have to convert every line of data into multiple rows ,for this we have function called
explode in hive and this is also called table generating function.

SELECT explode(split(sentence, ' ')) AS word FROM texttable

and create above output as intermediate table.

(SELECT explode(split(sentence, ' ')) AS word FROM texttable)tempTable

after second step you should get output like below

a
a
and

Page 27 of 64
Big
data
Hadoop
Hadoop
is
is
Post
technology
This
[Link] group by
after second step , it is straight forward ,we have to apply group by to count word occurrences.

selectword,count(1) as count from


(SELECT explode(split(sentence, ' ')) AS word FROM texttable)tempTable
group by word

[Link]

NoSQL is not a relational database.

It provides more flexibility since all records are not restricted by the same column names and
types defined across the entire table

It is designed for distributed data stores where very large scale of data storing needs (for example
Google or Facebook which collects terabits of data every day for their users). These type of data
storing may not require fixed schema, avoid join operations and typically scale horizontally.

Following 2 tables are simple example of a Customer table and an Order table wherein the
Customer's table is linked to the Order's table via a relationship.

NoSQL Database Types

Following are the NoSQL database types :

● Document Databases : In this type, key is paired with a complex data structure called as
Document. Example : MongoDB

Page 28 of 64
● Graph stores : This type of database is ususally used to store networked data. Where in
we can relate data based on some existing data.
● Key-Value stores : These are the simplest NoSQL databases. In this each is stored with a
key to identify it. In some Key-value databases, we can even save the typr of the data
saved along, like in Redis.
● Wide-column stores : Used to store large data sets(store columns of data together).
Example : Cassandra(Used in Facebook), HBase etc.

NoSQL, the tables can probably look like the ones as shown below

Order Table:

Page 29 of 64
1. The first thing you will notice straightaway is that you don't have columns with special
column names defined, but instead each field has a key-value pair.
2. You will notice that in the customer's table that the first 3 keys are the same for all 3
rows, but the fourth key (City and Status) is different for the first 2 rows and not
applicable for the third row.
3. Likewise, in the Orders tables, the 2nd and 3rd row have no values defined for the 4
column (shipment date).

This is what makes NoSQL so special and unique and also very flexible. In our dynamic and
ever changing technology world, business owners now demand for a faster turnaround time to
software solutions.

By using flexible databases such as NoSQL databases, we can inculcate a faster turnaround time,
because we have more flexibility and less constraints in the way data can be defined.

Just imagine the amount of time spent in adding or editing columns to existing tables in a
relational database compared to the amount of effort required in adding the same in a NoSQL
database.

4.3 Cassandra

What is Cassandra
Apache Cassandra is highly scalable, high performance, distributed NoSQL database.
Cassandra is designed to handle huge amount of data across many commodity servers,
providing high availability without a single point of failure.

Important Points of Cassandra


● Cassandra is a column-oriented database.
● Cassandra is scalable, consistent, and fault-tolerant.
● Cassandra's distribution design is based on Amazon's Dynamo and its data model on
Google's Bigtable.
Page 30 of 64
● Cassandra is created at Facebook. It is totally different from relational database
management systems.
● Cassandra follows a Dynamo-style replication model with no single point of failure,
but adds a more powerful "column family" data model.
● Cassandra is being used by some of the biggest companies like Facebook, Twitter,
Cisco, Rackspace, ebay, Twitter, Netflix, and more.

The data model of Cassandra is significantly different from what we normally see in an
RDBMS. This chapter provides an overview of how Cassandra stores its data.

Cluster
Cassandra database is distributed over several machines that operate together. The
outermost container is known as the Cluster. For failure handling, every node contains a
replica, and in case of a failure, the replica takes charge. Cassandra arranges the nodes in a
cluster, in a ring format, and assigns data to them.

Keyspace

Keyspace is the outermost container for data in Cassandra. The basic attributes of a
Keyspace in Cassandra are −

● Replication factor − It is the number of machines in the cluster that will receive
copies of the same data.
● Replica placement strategy − It is nothing but the strategy to place replicas in the
ring.
● There are three types of strategies such as:

1) Simple strategy (rack-aware strategy)

2) old network topology strategy (rack-aware strategy)

3) network topology strategy (datacenter-shared strategy)

● Column families − Keyspace is a container for a list of one or more column families.
A column family, in turn, is a container of a collection of rows. Each row contains
ordered columns. Column families represent the structure of your data. Each
keyspace has at least one and often many column families.

The syntax of creating a Keyspace is as follows −

Page 31 of 64
CREATE KEYSPACE Keyspace name

WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};

The following illustration shows a schematic view of a Keyspace.

Data Modeling Goals


You should have following goals while modeling data in Cassandra:

● Spread Data Evenly Around the Cluster: To spread equal amount of data on each
node of Cassandra cluster, you have to choose integers as a primary key. Data is
spread to different nodes based on partition keys that are the first part of the
primary key.
● Minimize number of partitions read while querying data: Partition is used to
bind a group of records with the same partition key. When the read query is issued,
it collects data from different nodes from different partitions.

In the case of many partitions, all these partitions need to be visited for collecting the query
data. It does not mean that partitions should not be created. If your data is very large, you

Page 32 of 64
can't keep that huge amount of data on the single partition. The single partition will be
slowed down. So you must have a balanced number of partitions.

Creating a Keyspace using Cqlsh


A keyspace in Cassandra is a namespace that defines data replication on nodes. A cluster
contains one keyspace per node. Given below is the syntax for creating a keyspace using the
statement CREATE KEYSPACE.

Syntax

CREATE KEYSPACE <identifier> WITH <properties>

CREATE KEYSPACE cse WITH replication = {'class':'SimpleStrategy',


'replication_factor' : 1};

Altering a KeySpace
ALTER KEYSPACE can be used to alter properties such as the number of replicas and the
durable_writes of a KeySpace. Given below is the syntax of this command.

Syntax
ALTER KEYSPACE <identifier> WITH <properties>

Example
ALTER KEYSPACE cse WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 2};

Verification

You can verify whether the table is created or not using the command Describe. If you use
this command over keyspaces, it will display all the keyspaces created as shown below.

cqlsh> DESCRIBE keyspaces;

Page 33 of 64
cse system system_traces

Here you can observe the newly created KeySpace cse.

Using a Keyspace
You can use a created KeySpace using the keyword USE. Its syntax is as follows −

Syntax:USE <identifier>

Example

In the following example, we are using the KeySpace cse.

cqlsh> USE cse;


cqlsh:cse>

Creating a Table
You can create a table using the command CREATE TABLE. Given below is the syntax for
creating a table.

Syntax
CREATE (TABLE | COLUMNFAMILY) <tablename>
('<column-definition>' , '<column-definition>')
(WITH <option> AND <option>)

Example:

CREATE TABLE emp( emp_id int PRIMARY KEY, emp_name text, emp_city text,
emp_sal varint, emp_phone varint );

Verification

Page 34 of 64
The select statement will give you the schema. Verify the table using the select statement as
shown below.

cqlsh:cse> select * from emp;

emp_id | emp_city | emp_name | emp_phone | emp_sal


--------+----------+----------+-----------+---------

(0 rows)

Here you can observe the table created with the given columns. Since we have deleted the
keyspace tutorialspoint, you will not find it in the keyspaces list.

Creating Data in a Table


You can insert data into the columns of a row in a table using the command INSERT. Given
below is the syntax for creating data in a table.

INSERT INTO <tablename>


(<column1 name>, <column2 name>....)
VALUES (<value1>, <value2>....)
USING <option>

Example

Let us assume there is a table called emp with columns (emp_id, emp_name, emp_city,
emp_phone, emp_sal) and you have to insert the following data into the emp table.

INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal) VALUES(1,'ram',


'Hyderabad', 9848022338, 50000);

INSERT INTO emp (emp_id, emp_name, emp_city,


emp_phone, emp_sal) VALUES(2,'robin', 'Hyderabad', 9848022339, 40000);

INSERT INTO emp (emp_id, emp_name, emp_city,


emp_phone, emp_sal) VALUES(3,'rahman', 'Chennai', 9848022330, 45000);

Page 35 of 64
select * from emp;

Updating Data in a Table


UPDATE is the command used to update data in a table. The following keywords are used
while updating data in a table −

● Where − This clause is used to select the row to be updated.


● Set − Set the value using this keyword.
● Must − Includes all the columns composing the primary key.

While updating rows, if a given row is unavailable, then UPDATE creates a fresh row. Given
below is the syntax of UPDATE command −

UPDATE <tablename>
SET <column name> = <new value>
<column name> = <value>....
WHERE <condition>

Example
Assume there is a table named emp. This table stores the details of employees of a certain
company, and it has the following details −

UPDATE emp SET emp_city='Delhi',emp_sal=50000 WHERE emp_id=2;

Page 36 of 64
Deleting Datafrom a Table
You can delete data from a table using the command DELETE. Its syntax is as follows −

DELETE FROM <identifier> WHERE <condition>;

Example
DELETE emp_sal FROM emp WHERE emp_id=3

Truncating a Table
You can truncate a table using thTRUNCATE <tablename>e TRUNCATE command. When
you truncate a table, all the rows of the table are deleted permanently. Given below is the
syntax of this command.

Syntax

Example
Page 37 of 64
TRUNCATE emp;

4.4 MongoDB

⮚ MongoDB is a document database. Each database contains collections which in turn


contains documents. Each document can be different with varying number of fields. The
size and content of each document can be different from each other.
⮚ The document structure is more in line with how developers construct their classes and
objects in their respective programming languages. Developers will often say that their
classes are not rows and columns but have a clear structure with key-value pairs.
⮚ As seen in the introduction with NoSQL databases, the rows (or documents as called in
MongoDB) doesn't need to have a schema defined beforehand. Instead, the fields can be
created on the fly.
⮚ The data model available within MongoDB allows you to represent hierarchical
relationships, to store arrays, and other more complex structures more easily.
⮚ Scalability – The MongoDB environments are very scalable. Companies across the world
have defined clusters with some of them running 100+ nodes with around millions of
documents within the database

The below example shows how a document can be modeled in MongoDB.

1. The _id field is added by MongoDB to uniquely identify the document in the collection.
2. What you can note is that the Order Data ( OrderID , Product and Quantity ) which in
RDBMS will normally be stored in a separate table, while in MongoDB it is actually

Page 38 of 64
stored as an embedded document in the collection itself. This is one of the key
differences of how data is modelled in MongoDB.

Common Terms in MongoDB

Below are the a few of the common terms used in MongoDB

1. _id – This is a field required in every MongoDB document. The _id field represents a
unique value in the MongoDB document. The _id field is like the document's primary
key. If you create a new document without an _id field, MongoDB will automatically
create the field. So for example, if we see the example of the above customer table,
Mongo DB will add a 24 digit unique identifier to each document in the collection.

2. Collection – This is a grouping of MongoDB documents. A collection is the equivalent


of a table which is created in any other RDMS such as Oracle or MS SQL. A collection

Page 39 of 64
exists within a single database. As seen from the introduction collections don't enforce
any sort of structure.
3. Cursor – This is a pointer to the result set of a query. Clients can iterate through a cursor
to retrieve results.
4. Database – This is a container for collections like in RDMS wherein it is a container for
tables. Each database gets its own set of files on the file system. A MongoDB server can
store multiple databases.
5. Document - A record in a MongoDB collection is basically called a document. The
document in turn will consist of field name and values.
6. Field - A name-value pair in a document. A document has zero or more fields. Fields are
analogous to columns in relational databases.

The following diagram shows an example of Fields with Key value pairs. So in the
example below CustomerID and 11 is one of the key value pair's defined in the
document.

7. JSON – This is known as JavaScript Object Notation. This is a human-readable, plain


text format for expressing structured data. JSON is currently supported in many
programming languages.

Page 40 of 64
Just a quick note on the key difference between the _id field and a normal collection field. The
_id field is used to uniquely identify the documents in a collection and is automatically added by
MongoDB when the collection is created.

Advantages of NoSQL Databases

Dynamic Schemas

You must be wondering what does dynamic schema means? In Relational Databases like Oracle,
MySQL we define table structures, right? For example, if we want to save records of Student
Data, then we will have to create a table named Student, add columns to it, like student_id,
student_nameetc, this is called defined schema, where in we define the structure before saving
any data.

If in future we plan to add some more related data in our Student table, then we will have to add
a new column to our table. Which is easy, if we have less data in our tables, but what if we have
millions of records. Migration to the updated schema would be a hectic job. NoSQL databases
solve this problem, as in a NoSQL database, schema definition is not required.

Sharding

In Sharding, large databases are partitioned into small, faster and easily manageable databases.

The (classic) Relational Databases follow a vertical architecture where in a single server holds
the data, as all the data is related. Relational Databases does not provide Sharding feature by
default, to achieve this a lot of efforts has to be put in, because transactional
integrity(Inserting/Updating data in transactions), Multiple table JOINS etc cannot be easily
achieved in distributed architecture in case of Relational Databases.

NoSQL Databases have the Sharding feature as default. No additional efforts required. They
automatically spread the data across servers, fetch the data in the fastest time from the server
which is free, while maintaining the integrity of data.

Replication

Auto data replication is also supported in NoSQL databases by default. Hence, if one DB server
goes down, data is restored using its copy created on another server in network.

Page 41 of 64
Integrated Caching

Many NoSQL databases have support for Integrated Caching, where in the frequently demanded
data is stored in cache to make the queries faster.

4.4.1 MongoDB Crud Operations

Create, Retrieve, Update and Delete operations in MongoDB

We will discuss how to create a database in MongoDB, how to create a table in MongoDb, how
to insert data in a MongoDB Collection and how to update and delete data from
MongoDBtable(Collection)

[Link] Shell from Command Line

Start MongoDB shell with 'mongod' command, you will see following screen as shown below.

Page 42 of 64
[Link] connection in MongoDB

Open an another instance of command prompt and connect MongoDB by supplying 'mongo'
command, you will see following screen ready to work in Mongo.

Page 43 of 64
[Link] all Databases in MongoDB

To see all exixtingdatabase in MongoDB, supply 'show dbs' command, you will see following
output containing all exiting database:

Page 44 of 64
[Link] a Database in MongoDB

In MongoDB it is not required to create a database before using it, just supply 'use
databaseName' and you will be switched to specified database even if the database don't exists,
mongo will create the database on the fly at the time you use it.

[Link] a Table in MongoDB

Same as creating a database, we dont need to create a table before using it, just switch to the
databse in which you want to create a new table and supply
'[Link]({column1:"value",column2:"value",...})' the very first time you insert
the data in a table Mongo will create the table on the fly if it does not exists already.

[Link] Data in a Table(Collection) in MongoDB

To insert data in a table use command


'[Link]({column1:"value",column2:"value",...})' or
'[Link]({column1:"value",column2:"value",...})', as shown below :

Page 45 of 64
[Link] or Retrieve data from a Table(Collection) in MongoDB

To select all rows from a table supply command '[Link]()', you will get an output as
shown below :

[Link] data or row in a Table(Collection) in MongoDB

To update data or a row in a table supply command


'[Link]({identifierColumn:value},{$set:{columnToChange:"newValue"}})',
you will get an output as shown below :

Page 46 of 64
[Link] data or a row in Table(Collection) in MongoDB

To delete data or a row in a table supply command


'[Link]({identifierColumn:value})', you will get an output as shown below :

MongoDB Create & Insert Database


In MongoDB, the first basic step is to have a database and collection in place. The database is
used to store all of the collections, and the collection in turn is used to store all of the documents.
The documents in turn will contain the relevant Field Name and Field values.
The snapshot below shows a basic example of how a document would look like.

The Field Names of the document are "Employeeid" and "EmployeeName" and the Field values
are "1" and "Smith' respectively. A bunch of documents would then make up a collection in
MongoDB.
In this article, you will learn -
● Creating a database
● Creating a collection

Page 47 of 64
Creating a database using “use” command
Creating a database in MongoDB is as simple as issuing the "using" command. The following
example shows how this can be done.

Code Explanation:

1. The "use" command is used to create a database in MongoDB. If the database does not
exist a new one will be created.

If the command is executed successfully, the following Output will be shown:


Output:

MongoDB will automatically switch to the database once created.

Creating a collection using insert()


The easiest way to create a collection is to insert a record (which is nothing but a document
consisting of Field names and Values) into a collection. If the collection does not exist a new one
will be created.
The following example shows how this can be done.

Page 48 of 64
[Link]
(
{
"Employeeid" : 1,
"EmployeeName" : "Martin"
}
)

Code Explanation:
As seen above, by using the "insert" command the collection will be created.
Adding documents using insert() command
MongoDB provides the insert () command to insert documents into a collection. The following
example shows how this can be done.
Step 1) Write the "insert" command
Step 2) Within the "insert" command, add the required Field Name and Field Value for the
document which needs to be created.

Code Explanation:

1. The first part of the command is the "insert statement" which is the statement used to
insert a document into the collection.
2. The second part of the statement is to add the Field name and the Field value, in other
words, what is the document in the collection going to contain.

Page 49 of 64
If the command is executed successfully, the following Output will be shown
Output:

The output shows that the operation performed was an insert operation and that one record was
inserted into the collection.

Add MongoDB Array using insert()


he "insert" command can also be used to insert multiple documents into a collection at one time.
The below code example can be used to insert multiple documents at a time.
The following example shows how this can be done,
Step 1) Create a Javascript variable called myEmployee to hold the array of documents
Step 2) Add the required documents with the Field Name and values to the variable
Step 3) Use the insert command to insert the array of documents into the collection
varmyEmployee=
[

Page 50 of 64
{
"Employeeid" : 1,
"EmployeeName" : "Smith"
},
{
"Employeeid: : 2,
"EmployeeName" : "Mohan"
},
{
"Employeeid: : 3,
"EmployeeName" : "Joe"
},

];

[Link](myEmployee);
If the command is executed successfully, the following Output will be shown

The output shows that those 3 documents were added to the collection.

Page 51 of 64
Printing in JSON format
JSON is a format called JavaScript Object Notation, and is just a way to store information in
an organized, easy-to-read manner. In our further examples, we are going to use the JSON print
functionality to see the output in a better format.
Let's look at an example of printing in JSON format
[Link]().forEach(printjson)
Code Explanation:

1. The first change is to append the function called for Each() to the find() function. What
this does is that it makes sure to explicitly go through each document in the collection.
In this way, you have more control of what you can do with each of the documents in the
collection.
2. The second change is to put the printjson command to the forEach statement. This will
cause each document in the collection to be displayed in JSON format.

If the command is executed successfully, the following Output will be shown


Output:

The output clearly shows that all of the documents are printed in JSON style.

MongodbObjectId()
By default when inserting documents in the collection, if you don't add a field name with the _id
in the field name, then MongoDB will automatically add an Object id field.
Why do we need the ObjectId field? Well, MongoDB uses this as the primary key for the
collection so that each document can be uniquely identified in the collection.
Page 52 of 64
When you query the documents in a collection, you can see the ObjectId for each document in
the collection.
If you want to ensure that MongoDB does not create the _id Field when the collection is created
and if you want to specify your own id as the _id of the collection, then you need to explicitly
define this while creating the collection.
When explicitly creating an id field, it needs to be created with _id in its name.
Let's look at an example on how we can achieve this.
[Link]({_id:10, "EmployeeName" : "Smith"})
Code Explanation:

1. We are assuming that we are creating the first document in the collection and hence in
the above statement while creating the collection, we explicitly define the field _id and
define a value for it.

If the command is executed successfully and now use the find command to display the
documents in the collection, the following Output will be shown
Output:

Page 53 of 64
The output clearly shows that the _id field we defined while creating the collection is now used
as the primary key for the collection.
Performing Queries
The method of fetching or getting data from a MongoDB database is carried out by using
queries. While performing a query operation, one can also use criteria’s or conditions which can
be used to retrieve specific data from the database.
MongoDB provides a function called [Link] () which is used for retrieval of
documents from a MongoDB database.
During the course of this tutorial, you will see how this function is used in various ways to
achieve the purpose of document retrieval.

MongoDB Query Document using find()

he basic query operations cover the simple operations such as getting all of the documents in a
MongoDB collection. Let’s look at an example of how we can accomplish this.

All of our code will be run in the MongoDBjavascript command shell. Consider that we have a
collection named ‘Employee’ in our MongoDB database and we execute the below command.

Page 54 of 64
Code Explanation:

1. Employee is the collection name in the MongoDB database


2. The find command is an in-built function which is used to retrieve the documents in the
collection.

If the command is executed successfully, the following Output will be shown

Output:

The output shows all the documents which are present in the collection.

We can also add criteria to our queries so that we can fetch documents based on certain
conditions.

Let's look at a couple of examples of how we can accomplish this.

Page 55 of 64
[Link]({EmployeeName : "Smith"}).forEach(printjson);

1. Here we want to find for an Employee whose name is "Smith" in the collection , hence
we enter the filter criteria as EmployeeName : "Smith"

If the command is executed successfully, the following Output will be shown

Output:

The output shows that only the document which contains "Smith" as the Employee Name is
returned.

Now, let's take a look at another code example which makes use of the greater than search
criteria. When this criteria is included, it actually searches those documents where the value of
the field is greater than the specified value.

[Link]({Employeeid : {$gt:2}}).forEach(printjson);

Code Explanation:

1. Here we want to find for all Employee's whose id is greater than 2. The $gt is called a
query selection operator, and what is just means is to use the greater than expression.

If the command is executed successfully, the following Output will be shown

Output:

Page 56 of 64
All of the documents wherein the Employee id is greater than 2 is returned.

When the [Link] () function is used to search for documents in the collection, the
result returns a pointer to the collection of documents returned which is called a cursor.
By default, the cursor will be iterated automatically when the result of the query is returned. But
one can also explicitly go through the items returned in the cursor one by one. If you see the
below example, if we have 3 documents in our collection, the cursor will point to the first
document and then iterate through all of the documents of the collection.

The following example shows how this can be done.

Page 57 of 64
varmyEmployee = [Link]( { Employeeid : { $gt:2 }});

while([Link]())

print(tojson([Link]()));

}
Code Explanation:

1. First we take the result set of the query which finds the Employee's whose id is greater
than 2 and assign it to the JavaScript variable 'myEmployee'
2. Next we use the while loop to iterate through all of the documents which are returned as
part of the query.
3. Finally for each document, we print the details of that document in JSON readable
format.

If the command is executed successfully, the following Output will be shown

Output:

Page 58 of 64
MongoDB Query Modifications using limit(), sort()

Mongo DB provides query modifiers such as the 'limit' and 'Orders' clause to provide more
flexibility when executing queries. We will take a look at the following query modifiers

1. Limits

This modifier is used to limit the number of documents which are returned in the result set for a
query. The following example shows how this can be done.

[Link]().limit(2).forEach(printjson);

Code Explanation:

1. The above code takes the find function which returns all of the documents in the
collection but then uses the limit clause to limit the number of documents being returned
to just 2.

Output:

If the command is executed successfully, the following Output will be shown

The output clearly shows that since there is a limit modifier, so at most just 2 records are
returned as part of the result set based on the ObjectId in ascending order.

2. Orders

One can specify the order of documents to be returned based on ascending or descending order
of any key in the collection. The following example shows how this can be done.

[Link]().sort({Employeeis:-1}).forEach(printjson)

Page 59 of 64
Code Explanation:

1. The above code takes the sort function which returns all of the documents in the
collection but then uses the modifier to change the order in which the records are
returned. Here the -1 indicates that we want to return the documents based on the
descending order of Employee id.

If the command is executed successfully, the following Output will be shown

Output:

The output clearly shows the documents being returned in descending order of the Employeeid.

MongoDBCount() & remove() function

The concept of aggregation is to carry out a computation on the results which are returned in a
query. For example, suppose you wanted to know what is the count of documents in a collection
as per the query fired, then MongoDB provides the count() function.

Let's look at an example of this.

[Link]()

Code Explanation:

1. The above code executes the count function.

If the command is executed successfully, the following Output will be shown

Page 60 of 64
Output:

The output clearly shows that 4 documents are there in the collection.

Performing Modifications

The other two classes of operations in MongoDB are the update and remove statements.

The update operations allow one to modify existing data, and the remove operations allow the
deletion of data from a collection.

Deleting Documents
In MongoDB, the [Link] () method is used to remove documents from a
collection. Either all of the documents can be removed from a collection or only those which
matches a specific condition.
If you just issue the remove command, all of the documents will be removed from the collection.
The following code example demonstrate how to remove a specific document from the
collection.
[Link]({Employeeid:22})
Code Explanation:

1. The above code use the remove function and specifies the criteria which in this case is to
remove the documents which have the Employee id as 22.

If the command is executed successfully, the following Output will be shown


Output:

Page 61 of 64
The output will show that 1 document was modified.

MongoDBUpdate() Document
MongoDB provides the update() command to update the documents of a collection. To update
only the documents you want to update, you can add a criteria to the update statement so that
only selected documents are updated.
The basic parameters in the command is a condition for which document needs to be updated,
and the next is the modification which needs to be performed.
The following example shows how this can be done.
Step 1) Issue the update command
Step 2) Choose the condition which you want to use to decide which document needs to be
updated. In our example, we want to update the document which has the Employee id 22.
Step 3) Use the set command to modify the Field Name
Step 4) Choose which Field Name you want to modify and enter the new value accordingly.
[Link](
{"Employeeid" : 1},
{$set: { "EmployeeName" : "NewMartin"}});
If the command is executed successfully, the following Output will be shown
Output:

Page 62 of 64
The output clearly shows that one record matched the condition and hence the relevant field
value was modified.
Updating Multiple Values
To ensure that multiple documents are updated at the same time in MongoDB you need to use
the multi option because otherwise by default only one document is modified at a time.
The following example shows how this can be done.
In this example, we are going to first find the document which has the Employee id as "1" and
change the Employee name from "Martin" to "NewMartin"
Step 1) Issue the update command
Step 2) Choose the condition which you want to use to decide which document needs to be
updated. In our example, we want the document which has the Employee id of "1" to be updated.
Step 3) Choose which Field Name's you want to modify and enter their new value accordingly.

[Link]
(
{
Employeeid : 1
},
{
$set :
{
"EmployeeName" : "NewMartin"
"Employeeid" : 22
}
}

Page 63 of 64
)

If the command is executed successfully and if you run the "find" command to search for the
document with Employee id as 22 you will see the following Output will be shown
Output:

The output clearly shows that one record matched the condition and hence the relevant field
value was modified.

Page 64 of 64

You might also like