Introduction to Apache Hive Overview
Introduction to Apache Hive Overview
1INTRODUCTION TO HIVE
Initially Hive was developed by Facebook, later the Apache Software Foundation took it up and
developed it further as an open source under the name Apache Hive. It is used by different
companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Features of Hive
Page 1 of 64
User Interface
Hive is a data warehouse infrastructure software that can create interaction between user and
HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive
HD Insight (In Windows server).
Meta Store
The component that stores all the structure information of the various tables and partitions in the
warehouse including column and column type information, the serializers and deserializers
necessary to read and write data and the corresponding HDFS files where the data is stored.
HiveQL is similar to SQL for querying on schema info on the Metastore. It is one of the
replacements of traditional approach for MapReduce program. Instead of writing MapReduce
program in Java, we can write a query for MapReduce job and process it.
Execution Engine
The conjunction part of HiveQL process Engine and MapReduce is Hive Execution Engine.
Execution engine processes the query and generates results as same as MapReduce results. It
uses the flavor of MapReduce.
HDFS or HBASE
Page 2 of 64
Hadoop distributed file system or HBASE are the data storage techniques to store data into file
system.
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
[Link] Query
The Hive interface such as Command Line or Web UI sends query to Driver (any
database driver such as JDBC, ODBC, etc.) to execute.
[Link] Plan
The driver takes the help of query compiler that parses the query to check the syntax and query
plan or the requirement of query.
[Link] Metadata
Page 3 of 64
[Link] Metadata
[Link] Plan
The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing
and compiling of a query is complete.
[Link] Plan
[Link] Job
Internally, the process of execution job is a MapReduce job. The execution engine sends the job
to JobTracker, which is in Name node and it assigns this job to TaskTracker, which is in Data
node. Here, the query executes MapReduce job.
7.1..Metadata Ops
Meanwhile in execution, the execution engine can execute metadata operations with Metastore.
[Link] Result
[Link] Results
[Link] Results
HQL is very similar to traditional database. It stores data in tables, where each table consists of
columns and each column consists of specific number of rows.
Page 4 of 64
Each column has its own data type. Hive supports primitive as well as complex data types.
Primitive types like Integer, Bigint, Smallint, Tinyint, Float, Double Boolean, String, and Binary
are supported. Complex types include Associative array: map ,Structs: struct , and Lists: list .
Data Definition statements (DDL) like create table, alter table, drop table are supported. All
these DDL statements can be used on Database, tables, partitions, views, functions, Index, etc.
Data Manipulation statements (DML) like load, insert, select and explain are supported. Load is
used for taking data from HDFS and moving it into Hive. Insert is used for moving data from
one Hive table to another. Select is used for querying data.
● Local mode
● Map reduce mode
● If the Hadoop installed under pseudo mode with having one data node we use Hive in this
mode
● If the data size is smaller in term of limited to single local machine, we can use this mode
● Processing will be very fast on smaller data sets present in the local machine
● If Hadoop is having multiple data nodes and data is distributed across different node we
use Hive in this mode
● It will perform on large amount of data sets and query going to execute in parallel way
● Processing of large data sets with better performance can be achieved through this mode
In Hive, we can set this property to mention which mode Hive can work? By default, it works on
Map Reduce mode and for local mode you can have the following setting.
Hive to work in local mode set
SET [Link]=local;
From the Hive version 0.7 it supports a mode to run map reduce jobs in local mode
automatically.
Page 5 of 64
4.1.3Data Types in Hive
primitive Data Types also divide into 4 types there are mentioned below
DECIMAL(4,3) represents total of 4 digits, out of which 3 are decimal digits. Below is the chart
for all numeric types with their ranges and examples.
Page 6 of 64
By default in Hive integral values are taken as a INT data type unless the value of integral values
range are cross the range of INT values (Please check the above table for values).Suppose if we
need to use a low integral value like 100 it to be treated as TINYINT or SMALLINT or BIGINT
then we need to suffix the value with Y, S or L respectively.
Examples: 100Y – TINYINT, 100S – SMALLINT, 100L – BIGINT
In Hive String Data Types are Mainly divided into 3 types there are mentioned below
● STRING
● VARCHAR
● CHAR
In Hive String Data Types are Mainly divided into 3 types there are mentioned below
● STRING
● VARCHAR
Page 7 of 64
● CHAR
This String Data types are supported in Hive from only 0.14 and above versions only
Page 8 of 64
● DATE
● TIMESTAMP
This hive date/time data types in UNIX time stamp format for date/time related fields in hive.
DATE
Represented format is YYYY-MM-DD
Example : DATE ‘2014-12-07’
Date ranges allowed are 0000-01-01 to 9999-12-31
TIMESTAMP
TIMESTAMP use the format yyyy-mm-ddhh:mm:ss[.f...].
If the string value does not match this format, then NULL is returned.
cast(timestamp as
The year/month/day of the timestamp is returned as a date value.
date)
Miscellaneous Types
● BOOLEAN
● BINARY
Page 9 of 64
Hive BOOLEAN is similar to Java’s BOOLEAN types,it can stores true or false values only
TINYINT—>SMALLINT–>INT–>BIGINT–>FLOAT–>DOUBLE
TINYINT can be converted to any other numeric data type but BIGINT can only be converted to
FLOAT or DOUBLE but the reverse
Boolean & Binary data types will not be converted to any other data type implicitly.
Explicit type conversion can be done using the cast operator only.
Example: CAST(‘500’ AS INT) will convert the string ‘500’ to the integer value 500. But If
cast is used incorrectly as in CAST(‘Hello’ AS INT) , then cast operation will fail and returns
NULL .
The Above Primitive Data Types are almost same in all other Databases,Here hive also
supported the hive complex data types which are not available in many relational database
management systems.
● ARRAY
● MAP
● STRUCT
● UNIONTYPE
Page 10 of 64
ARRAY Data Type
Same as Array in java,An Ordered sequences of similar type elements that are index
using zero-based integers
Collection of key-value pairs. Fields are accessed using array notation of keys (e.g., [‘key’]).
if a column name is of type MAP with key→value pairs ‘first’→’John’ and ‘last’→’Doe’, then
the last name can be referenced using name[‘last’].
Example – For a column c of type STRUCT {a INT; b INT} the a field is accessed by the
expression c.a
UNIONTYPE
Example
Page 11 of 64
UNIONTYPE<int,double, array<string>,struct<a:int,b:string>>
DATABASE:
⮚ Hive is a database technology that can define databases and tables to analyze structured
data.
⮚ The theme for structured data analysis is to store the data in a tabular manner, and pass
queries to analyze it.
⮚ Hive contains a default database named default.
Here, IF NOT EXISTS is an optional clause, which notifies the user that a database with the
same name already exists. We can use SCHEMA in place of DATABASE in this command. The
following query is executed to create a database named userdb:
or
Page 12 of 64
userdb
Drop Database is a statement that drops all the tables and deletes the database. Its syntax is as
follows:
The following queries are used to drop a database. Let us assume that the database name is
userdb.
The following query drops the database using CASCADE. It means dropping respective tables
before dropping the database.
Page 13 of 64
[Link]:
Table Operations such as Creation, Altering, and Dropping tables in Hive can be observed in
this section.
In the above screenshot, we are creating a table with columns and altering the table name.
1. Creating table guru_sample with two column names such as "empid" and "empname"
2. Displaying tables present in guru99 database
3. Guru_sample displaying under tables
4. Altering table "guru_sample" as "guru_sampleNew"
5. Again when you execute "show" command, it will display the new name
Guru_sampleNew
Page 14 of 64
It is used to alter a table in Hive.
Syntax
The statement takes any of the following syntaxes based on what attributes we wish to modify in
a table.
ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_namenew_namenew_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
Rename To… Statement
The following query renames the table from employee to emp.
hive> ALTER TABLE employee RENAME TO emp;
[Link] Table Statement
When you drop a table from Hive Metastore, it removes the table/column data and their
metadata. It can be a normal table (stored in Metastore) or an external table (stored in local file
system); Hive treats both in the same manner, irrespective of their types.
Hive deals with two types of table structures like Internal and External tables depending on the
loading and design of schema in Hive.,
Page 15 of 64
⮚ Internal table file security is controlled solely via HIVE.
⮚ Deleting the table deletes the metadata & data from masternode and HDFS
respectively
⮚ Security needs to be managed within HIVE, probably at the schema level (depends on
organisation to organisation). HDFS security is out of scope in this case.
⮚ Internal Table is tightly coupled in [Link] this type of table, first we have to create
table and load the data.
⮚ We can call this one as data on schema.
⮚ By dropping this table, both data and schema will be removed.
⮚ The stored location of this table will be at /user/hive/warehouse.
External table:
⮚ External table stores files on the HDFS server but tables are not linked to the source file
completely.
⮚ If you delete an external table the file still remains on the HDFS server.
⮚ The file and the table link is there but read only.
⮚ External table files are accessible to anyone who has access to HDFS file structure and
therefore security needs to be managed at the HDFS file/folder level.
⮚ Meta data is maintained on master node and deleting an external table from HIVE, only
deletes the metadata not the data/file
⮚ External Table is loosely coupled in nature. Data will be available in [Link] table is
going to create on HDFS data.
Page 17 of 64
⮚ In other way, we can say like its creating schema on data.
⮚ At the time of dropping the table it drops only schema, the data will be still available in
HDFS as before.
⮚ External tables provide an option to create multiple schemas for the data stored in HDFS
instead of deleting the data every time whenever schema updates
LOCATION '/user/guru99hive/guruhive_external;
2. If we are not specifying the location at the time of table creation, we can load the data
manually
Page 18 of 64
Difference between Internal Vs External tables
SELECT statement :
The Hive Query Language (HiveQL) is a query language for Hive to process and analyze
structured data in a Metastore. This chapter explains how to use the SELECT statement with
WHERE clause.
Page 19 of 64
SELECT statement is used to retrieve the data from a table. WHERE clause works similar to a
condition. It filters the data using the condition and gives you a finite result. The built-in
operators and functions generate an expression, which fulfils the condition.
Syntax
Example
Let us take an example for SELECT…WHERE clause. Assume we have the employee table as
given below, with fields named Id, Name, Salary, Designation, and Dept. Generate a query to
retrieve the employee details who earn a salary of more than Rs 30000.
+------+--------------+-------------+-------------------+-------
-+
| ID | Name | Salary | Designation | Dept
|
+------+--------------+-------------+-------------------+-------
-+
|1201 |Gopal | 45000 | Technical manager | TP
|
|1202 |Manisha | 45000 | Proofreader | PR
|
|1203 |Masthanvali | 40000 | Technical writer | TP
|
Page 20 of 64
|1204 |Krian | 40000 | Hr Admin | HR
|
|1205 |Kranthi | 30000 | Op Admin | Admin
|
+------+--------------+-------------+-------------------+-------
-+
The following query retrieves the employee details using the above scenario:
hive> SELECT * FROM employee WHERE salary>30000;
On successful execution of the query, you get to see the following response:
+------+--------------+-------------+-------------------+-------
-+
| ID | Name | Salary | Designation | Dept
|
+------+--------------+-------------+-------------------+-------
-+
|1201 |Gopal | 45000 | Technical manager | TP
|
|1202 |Manisha | 45000 | Proofreader | PR
|
|1203 |Masthanvali | 40000 | Technical writer | TP
|
|1204 |Krian | 40000 | Hr Admin | HR
|
+------+--------------+-------------+-------------------+-------
-+
Page 21 of 64
|1203 |Masthanvali | 40000 | Technical writer | TP
|
+------+--------------+-------------+-------------------+-------
-+
The main work of SQL Partition is dividing the large amount of data into number of slices by
based on table single column only. SQL Partition is very hard comparing to Hive Partition
because SQL server only support one column for [Link] can be very harder to apply the
query,indexes take up more space,and SQL Server’s partitioning requires Enterprise Edition on
top of that expensive SAN you’re adding to cope with the extra space.
Hive Partition
The main work of Hive partition is also same as SQL Partition but the main difference between
SQL partition and hive partition is SQL partition is only supported for single column in table but
in Hive partition it supported for Multiple columns in a table .In Hive we can apply Hive
Partition concept on Managed tables and External [Link] we not crated dynamic partition for
hive, Hive also creates an automatic partition scheme when the table is created
Page 22 of 64
fields terminated by ‘|’
stored as textfile;
In the Above table showing how to create a simple Hive partition [Link] the above table
columns are cityid,creport,ctover and table is partitioned by [Link] above table is partitioned by
only one column city .
Load Data into Hive Partition Table
hive> load data local inpath ‘/home/mahesh/hive-related/[Link]’ into table
cityreport partition (city = ‘hyderabad’);
hive> load data local inpath ‘/home/mahesh/hive-related/[Link]’ into table
cityreport partition (city = ‘banglore’);
hive> load data local inpath ‘/home/mahesh/hive-related/[Link]’ into table cityreport
partition (city = ‘chennai’);
Now the total table data is partitioned by city,so when the query is applied on table it can easily
access the required row by the help partitions
hive> select * from cityreport where cityid = ‘1003_hyd’ and city=’hyderabad';
The above query directly searched in city=’hyderabad’ partiton and not search in other city
partition so the data look up time is decreased and the details displayed very quickly because of
Hive Partition
CREATETABLE customer (
id INT,
name STRING,
address1 STRING,
address2 STRING,
city STRING,
state STRING,
zip STRING
)
PARTITION BY (
REGION STRING,
country STRING
);
Page 23 of 64
This above approach is great once you’re using Hive in production but it can be tedious to
initially load a large data warehouse when you can only write to one partition at a time. There is
a better way.
AUTOMATIC PARTITIONING
Hive lets us change settings for a session using the SET command. Changing these settings
permanently would require opening a text file and restarting the Hive cluster – it’s not a difficult
change, but it’s outside of our scope.
SET [Link] = true;
i) Not to over partition your data in [Link] your partitions are relativity small(not so matched)
then data scanning and retrieving very difficult comparing to sample [Link] your table all
partitions columns should be close relation to each other columns in a table .
ii) If you want to use hive partition in your table you should enable below query in your
[Link] property.
[Link]=strict
we can also enable hive partition without changing [Link],just type the above command
in your hive terminal but it only works for that session of time only,If you close the terminal the
settings also cleared so better to set in [Link] file.
⮚ The table divide into number of partitions is called Hive Partition, The Hive
Page 24 of 64
⮚ In this table just use date column as the top-level partition and the employee_id as the
second-level partition leads to too many small partitions.
⮚ So here employee table is partition by date and bucketing by employee_id.
⮚ The value of this column will be hashed by a user-defined number into buckets. Records
with the same employee_id will always be stored in the same bucket.
In Partition there is so many chance to create thousands of tiny partitions but coming into Hive
Buckets we can’t create number of Hive Buckets the reason is we should declare the number of
buckets for a table in the time of table creation. In Hive Partition each partition will be created
as directory but in Hive Buckets each bucket will be created as [Link] Partition we
used PARTITIONED BY but in Hive Buckets we used CLUSTERED BY
Assuming that"Employees table" already created in Hive system. In this step, we will see the
loading of Data from employees table into table sample bucket.
Before we start moving employees data into buckets, make sure that it consist of column names
such as first_name, job_id, department, salary and country.
Here we are loading data into sample bucket from employees table.
From the above screenshot, we can see that the data from the employees table is transferred into
4 buckets created in step 1.
Check below screen and you will realize three files names as 000000_0, 000000_1, 000000_2,
and000000_3, are created these are our data files.
Page 26 of 64
Word Count In Hive
This
is
a
hadoop
Post
I mean we have to convert every line of data into multiple rows ,for this we have function called
explode in hive and this is also called table generating function.
a
a
and
Page 27 of 64
Big
data
Hadoop
Hadoop
is
is
Post
technology
This
[Link] group by
after second step , it is straight forward ,we have to apply group by to count word occurrences.
[Link]
It provides more flexibility since all records are not restricted by the same column names and
types defined across the entire table
It is designed for distributed data stores where very large scale of data storing needs (for example
Google or Facebook which collects terabits of data every day for their users). These type of data
storing may not require fixed schema, avoid join operations and typically scale horizontally.
Following 2 tables are simple example of a Customer table and an Order table wherein the
Customer's table is linked to the Order's table via a relationship.
● Document Databases : In this type, key is paired with a complex data structure called as
Document. Example : MongoDB
Page 28 of 64
● Graph stores : This type of database is ususally used to store networked data. Where in
we can relate data based on some existing data.
● Key-Value stores : These are the simplest NoSQL databases. In this each is stored with a
key to identify it. In some Key-value databases, we can even save the typr of the data
saved along, like in Redis.
● Wide-column stores : Used to store large data sets(store columns of data together).
Example : Cassandra(Used in Facebook), HBase etc.
NoSQL, the tables can probably look like the ones as shown below
Order Table:
Page 29 of 64
1. The first thing you will notice straightaway is that you don't have columns with special
column names defined, but instead each field has a key-value pair.
2. You will notice that in the customer's table that the first 3 keys are the same for all 3
rows, but the fourth key (City and Status) is different for the first 2 rows and not
applicable for the third row.
3. Likewise, in the Orders tables, the 2nd and 3rd row have no values defined for the 4
column (shipment date).
This is what makes NoSQL so special and unique and also very flexible. In our dynamic and
ever changing technology world, business owners now demand for a faster turnaround time to
software solutions.
By using flexible databases such as NoSQL databases, we can inculcate a faster turnaround time,
because we have more flexibility and less constraints in the way data can be defined.
Just imagine the amount of time spent in adding or editing columns to existing tables in a
relational database compared to the amount of effort required in adding the same in a NoSQL
database.
4.3 Cassandra
What is Cassandra
Apache Cassandra is highly scalable, high performance, distributed NoSQL database.
Cassandra is designed to handle huge amount of data across many commodity servers,
providing high availability without a single point of failure.
The data model of Cassandra is significantly different from what we normally see in an
RDBMS. This chapter provides an overview of how Cassandra stores its data.
Cluster
Cassandra database is distributed over several machines that operate together. The
outermost container is known as the Cluster. For failure handling, every node contains a
replica, and in case of a failure, the replica takes charge. Cassandra arranges the nodes in a
cluster, in a ring format, and assigns data to them.
Keyspace
Keyspace is the outermost container for data in Cassandra. The basic attributes of a
Keyspace in Cassandra are −
● Replication factor − It is the number of machines in the cluster that will receive
copies of the same data.
● Replica placement strategy − It is nothing but the strategy to place replicas in the
ring.
● There are three types of strategies such as:
● Column families − Keyspace is a container for a list of one or more column families.
A column family, in turn, is a container of a collection of rows. Each row contains
ordered columns. Column families represent the structure of your data. Each
keyspace has at least one and often many column families.
Page 31 of 64
CREATE KEYSPACE Keyspace name
● Spread Data Evenly Around the Cluster: To spread equal amount of data on each
node of Cassandra cluster, you have to choose integers as a primary key. Data is
spread to different nodes based on partition keys that are the first part of the
primary key.
● Minimize number of partitions read while querying data: Partition is used to
bind a group of records with the same partition key. When the read query is issued,
it collects data from different nodes from different partitions.
In the case of many partitions, all these partitions need to be visited for collecting the query
data. It does not mean that partitions should not be created. If your data is very large, you
Page 32 of 64
can't keep that huge amount of data on the single partition. The single partition will be
slowed down. So you must have a balanced number of partitions.
Syntax
Altering a KeySpace
ALTER KEYSPACE can be used to alter properties such as the number of replicas and the
durable_writes of a KeySpace. Given below is the syntax of this command.
Syntax
ALTER KEYSPACE <identifier> WITH <properties>
Example
ALTER KEYSPACE cse WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 2};
Verification
You can verify whether the table is created or not using the command Describe. If you use
this command over keyspaces, it will display all the keyspaces created as shown below.
Page 33 of 64
cse system system_traces
Using a Keyspace
You can use a created KeySpace using the keyword USE. Its syntax is as follows −
Syntax:USE <identifier>
Example
Creating a Table
You can create a table using the command CREATE TABLE. Given below is the syntax for
creating a table.
Syntax
CREATE (TABLE | COLUMNFAMILY) <tablename>
('<column-definition>' , '<column-definition>')
(WITH <option> AND <option>)
Example:
CREATE TABLE emp( emp_id int PRIMARY KEY, emp_name text, emp_city text,
emp_sal varint, emp_phone varint );
Verification
Page 34 of 64
The select statement will give you the schema. Verify the table using the select statement as
shown below.
(0 rows)
Here you can observe the table created with the given columns. Since we have deleted the
keyspace tutorialspoint, you will not find it in the keyspaces list.
Example
Let us assume there is a table called emp with columns (emp_id, emp_name, emp_city,
emp_phone, emp_sal) and you have to insert the following data into the emp table.
Page 35 of 64
select * from emp;
While updating rows, if a given row is unavailable, then UPDATE creates a fresh row. Given
below is the syntax of UPDATE command −
UPDATE <tablename>
SET <column name> = <new value>
<column name> = <value>....
WHERE <condition>
Example
Assume there is a table named emp. This table stores the details of employees of a certain
company, and it has the following details −
Page 36 of 64
Deleting Datafrom a Table
You can delete data from a table using the command DELETE. Its syntax is as follows −
Example
DELETE emp_sal FROM emp WHERE emp_id=3
Truncating a Table
You can truncate a table using thTRUNCATE <tablename>e TRUNCATE command. When
you truncate a table, all the rows of the table are deleted permanently. Given below is the
syntax of this command.
Syntax
Example
Page 37 of 64
TRUNCATE emp;
4.4 MongoDB
1. The _id field is added by MongoDB to uniquely identify the document in the collection.
2. What you can note is that the Order Data ( OrderID , Product and Quantity ) which in
RDBMS will normally be stored in a separate table, while in MongoDB it is actually
Page 38 of 64
stored as an embedded document in the collection itself. This is one of the key
differences of how data is modelled in MongoDB.
1. _id – This is a field required in every MongoDB document. The _id field represents a
unique value in the MongoDB document. The _id field is like the document's primary
key. If you create a new document without an _id field, MongoDB will automatically
create the field. So for example, if we see the example of the above customer table,
Mongo DB will add a 24 digit unique identifier to each document in the collection.
Page 39 of 64
exists within a single database. As seen from the introduction collections don't enforce
any sort of structure.
3. Cursor – This is a pointer to the result set of a query. Clients can iterate through a cursor
to retrieve results.
4. Database – This is a container for collections like in RDMS wherein it is a container for
tables. Each database gets its own set of files on the file system. A MongoDB server can
store multiple databases.
5. Document - A record in a MongoDB collection is basically called a document. The
document in turn will consist of field name and values.
6. Field - A name-value pair in a document. A document has zero or more fields. Fields are
analogous to columns in relational databases.
The following diagram shows an example of Fields with Key value pairs. So in the
example below CustomerID and 11 is one of the key value pair's defined in the
document.
Page 40 of 64
Just a quick note on the key difference between the _id field and a normal collection field. The
_id field is used to uniquely identify the documents in a collection and is automatically added by
MongoDB when the collection is created.
Dynamic Schemas
You must be wondering what does dynamic schema means? In Relational Databases like Oracle,
MySQL we define table structures, right? For example, if we want to save records of Student
Data, then we will have to create a table named Student, add columns to it, like student_id,
student_nameetc, this is called defined schema, where in we define the structure before saving
any data.
If in future we plan to add some more related data in our Student table, then we will have to add
a new column to our table. Which is easy, if we have less data in our tables, but what if we have
millions of records. Migration to the updated schema would be a hectic job. NoSQL databases
solve this problem, as in a NoSQL database, schema definition is not required.
Sharding
In Sharding, large databases are partitioned into small, faster and easily manageable databases.
The (classic) Relational Databases follow a vertical architecture where in a single server holds
the data, as all the data is related. Relational Databases does not provide Sharding feature by
default, to achieve this a lot of efforts has to be put in, because transactional
integrity(Inserting/Updating data in transactions), Multiple table JOINS etc cannot be easily
achieved in distributed architecture in case of Relational Databases.
NoSQL Databases have the Sharding feature as default. No additional efforts required. They
automatically spread the data across servers, fetch the data in the fastest time from the server
which is free, while maintaining the integrity of data.
Replication
Auto data replication is also supported in NoSQL databases by default. Hence, if one DB server
goes down, data is restored using its copy created on another server in network.
Page 41 of 64
Integrated Caching
Many NoSQL databases have support for Integrated Caching, where in the frequently demanded
data is stored in cache to make the queries faster.
We will discuss how to create a database in MongoDB, how to create a table in MongoDb, how
to insert data in a MongoDB Collection and how to update and delete data from
MongoDBtable(Collection)
Start MongoDB shell with 'mongod' command, you will see following screen as shown below.
Page 42 of 64
[Link] connection in MongoDB
Open an another instance of command prompt and connect MongoDB by supplying 'mongo'
command, you will see following screen ready to work in Mongo.
Page 43 of 64
[Link] all Databases in MongoDB
To see all exixtingdatabase in MongoDB, supply 'show dbs' command, you will see following
output containing all exiting database:
Page 44 of 64
[Link] a Database in MongoDB
In MongoDB it is not required to create a database before using it, just supply 'use
databaseName' and you will be switched to specified database even if the database don't exists,
mongo will create the database on the fly at the time you use it.
Same as creating a database, we dont need to create a table before using it, just switch to the
databse in which you want to create a new table and supply
'[Link]({column1:"value",column2:"value",...})' the very first time you insert
the data in a table Mongo will create the table on the fly if it does not exists already.
Page 45 of 64
[Link] or Retrieve data from a Table(Collection) in MongoDB
To select all rows from a table supply command '[Link]()', you will get an output as
shown below :
Page 46 of 64
[Link] data or a row in Table(Collection) in MongoDB
The Field Names of the document are "Employeeid" and "EmployeeName" and the Field values
are "1" and "Smith' respectively. A bunch of documents would then make up a collection in
MongoDB.
In this article, you will learn -
● Creating a database
● Creating a collection
Page 47 of 64
Creating a database using “use” command
Creating a database in MongoDB is as simple as issuing the "using" command. The following
example shows how this can be done.
Code Explanation:
1. The "use" command is used to create a database in MongoDB. If the database does not
exist a new one will be created.
Page 48 of 64
[Link]
(
{
"Employeeid" : 1,
"EmployeeName" : "Martin"
}
)
Code Explanation:
As seen above, by using the "insert" command the collection will be created.
Adding documents using insert() command
MongoDB provides the insert () command to insert documents into a collection. The following
example shows how this can be done.
Step 1) Write the "insert" command
Step 2) Within the "insert" command, add the required Field Name and Field Value for the
document which needs to be created.
Code Explanation:
1. The first part of the command is the "insert statement" which is the statement used to
insert a document into the collection.
2. The second part of the statement is to add the Field name and the Field value, in other
words, what is the document in the collection going to contain.
Page 49 of 64
If the command is executed successfully, the following Output will be shown
Output:
The output shows that the operation performed was an insert operation and that one record was
inserted into the collection.
Page 50 of 64
{
"Employeeid" : 1,
"EmployeeName" : "Smith"
},
{
"Employeeid: : 2,
"EmployeeName" : "Mohan"
},
{
"Employeeid: : 3,
"EmployeeName" : "Joe"
},
];
[Link](myEmployee);
If the command is executed successfully, the following Output will be shown
The output shows that those 3 documents were added to the collection.
Page 51 of 64
Printing in JSON format
JSON is a format called JavaScript Object Notation, and is just a way to store information in
an organized, easy-to-read manner. In our further examples, we are going to use the JSON print
functionality to see the output in a better format.
Let's look at an example of printing in JSON format
[Link]().forEach(printjson)
Code Explanation:
1. The first change is to append the function called for Each() to the find() function. What
this does is that it makes sure to explicitly go through each document in the collection.
In this way, you have more control of what you can do with each of the documents in the
collection.
2. The second change is to put the printjson command to the forEach statement. This will
cause each document in the collection to be displayed in JSON format.
The output clearly shows that all of the documents are printed in JSON style.
MongodbObjectId()
By default when inserting documents in the collection, if you don't add a field name with the _id
in the field name, then MongoDB will automatically add an Object id field.
Why do we need the ObjectId field? Well, MongoDB uses this as the primary key for the
collection so that each document can be uniquely identified in the collection.
Page 52 of 64
When you query the documents in a collection, you can see the ObjectId for each document in
the collection.
If you want to ensure that MongoDB does not create the _id Field when the collection is created
and if you want to specify your own id as the _id of the collection, then you need to explicitly
define this while creating the collection.
When explicitly creating an id field, it needs to be created with _id in its name.
Let's look at an example on how we can achieve this.
[Link]({_id:10, "EmployeeName" : "Smith"})
Code Explanation:
1. We are assuming that we are creating the first document in the collection and hence in
the above statement while creating the collection, we explicitly define the field _id and
define a value for it.
If the command is executed successfully and now use the find command to display the
documents in the collection, the following Output will be shown
Output:
Page 53 of 64
The output clearly shows that the _id field we defined while creating the collection is now used
as the primary key for the collection.
Performing Queries
The method of fetching or getting data from a MongoDB database is carried out by using
queries. While performing a query operation, one can also use criteria’s or conditions which can
be used to retrieve specific data from the database.
MongoDB provides a function called [Link] () which is used for retrieval of
documents from a MongoDB database.
During the course of this tutorial, you will see how this function is used in various ways to
achieve the purpose of document retrieval.
he basic query operations cover the simple operations such as getting all of the documents in a
MongoDB collection. Let’s look at an example of how we can accomplish this.
All of our code will be run in the MongoDBjavascript command shell. Consider that we have a
collection named ‘Employee’ in our MongoDB database and we execute the below command.
Page 54 of 64
Code Explanation:
Output:
The output shows all the documents which are present in the collection.
We can also add criteria to our queries so that we can fetch documents based on certain
conditions.
Page 55 of 64
[Link]({EmployeeName : "Smith"}).forEach(printjson);
1. Here we want to find for an Employee whose name is "Smith" in the collection , hence
we enter the filter criteria as EmployeeName : "Smith"
Output:
The output shows that only the document which contains "Smith" as the Employee Name is
returned.
Now, let's take a look at another code example which makes use of the greater than search
criteria. When this criteria is included, it actually searches those documents where the value of
the field is greater than the specified value.
[Link]({Employeeid : {$gt:2}}).forEach(printjson);
Code Explanation:
1. Here we want to find for all Employee's whose id is greater than 2. The $gt is called a
query selection operator, and what is just means is to use the greater than expression.
Output:
Page 56 of 64
All of the documents wherein the Employee id is greater than 2 is returned.
When the [Link] () function is used to search for documents in the collection, the
result returns a pointer to the collection of documents returned which is called a cursor.
By default, the cursor will be iterated automatically when the result of the query is returned. But
one can also explicitly go through the items returned in the cursor one by one. If you see the
below example, if we have 3 documents in our collection, the cursor will point to the first
document and then iterate through all of the documents of the collection.
Page 57 of 64
varmyEmployee = [Link]( { Employeeid : { $gt:2 }});
while([Link]())
print(tojson([Link]()));
}
Code Explanation:
1. First we take the result set of the query which finds the Employee's whose id is greater
than 2 and assign it to the JavaScript variable 'myEmployee'
2. Next we use the while loop to iterate through all of the documents which are returned as
part of the query.
3. Finally for each document, we print the details of that document in JSON readable
format.
Output:
Page 58 of 64
MongoDB Query Modifications using limit(), sort()
Mongo DB provides query modifiers such as the 'limit' and 'Orders' clause to provide more
flexibility when executing queries. We will take a look at the following query modifiers
1. Limits
This modifier is used to limit the number of documents which are returned in the result set for a
query. The following example shows how this can be done.
[Link]().limit(2).forEach(printjson);
Code Explanation:
1. The above code takes the find function which returns all of the documents in the
collection but then uses the limit clause to limit the number of documents being returned
to just 2.
Output:
The output clearly shows that since there is a limit modifier, so at most just 2 records are
returned as part of the result set based on the ObjectId in ascending order.
2. Orders
One can specify the order of documents to be returned based on ascending or descending order
of any key in the collection. The following example shows how this can be done.
[Link]().sort({Employeeis:-1}).forEach(printjson)
Page 59 of 64
Code Explanation:
1. The above code takes the sort function which returns all of the documents in the
collection but then uses the modifier to change the order in which the records are
returned. Here the -1 indicates that we want to return the documents based on the
descending order of Employee id.
Output:
The output clearly shows the documents being returned in descending order of the Employeeid.
The concept of aggregation is to carry out a computation on the results which are returned in a
query. For example, suppose you wanted to know what is the count of documents in a collection
as per the query fired, then MongoDB provides the count() function.
[Link]()
Code Explanation:
Page 60 of 64
Output:
The output clearly shows that 4 documents are there in the collection.
Performing Modifications
The other two classes of operations in MongoDB are the update and remove statements.
The update operations allow one to modify existing data, and the remove operations allow the
deletion of data from a collection.
Deleting Documents
In MongoDB, the [Link] () method is used to remove documents from a
collection. Either all of the documents can be removed from a collection or only those which
matches a specific condition.
If you just issue the remove command, all of the documents will be removed from the collection.
The following code example demonstrate how to remove a specific document from the
collection.
[Link]({Employeeid:22})
Code Explanation:
1. The above code use the remove function and specifies the criteria which in this case is to
remove the documents which have the Employee id as 22.
Page 61 of 64
The output will show that 1 document was modified.
MongoDBUpdate() Document
MongoDB provides the update() command to update the documents of a collection. To update
only the documents you want to update, you can add a criteria to the update statement so that
only selected documents are updated.
The basic parameters in the command is a condition for which document needs to be updated,
and the next is the modification which needs to be performed.
The following example shows how this can be done.
Step 1) Issue the update command
Step 2) Choose the condition which you want to use to decide which document needs to be
updated. In our example, we want to update the document which has the Employee id 22.
Step 3) Use the set command to modify the Field Name
Step 4) Choose which Field Name you want to modify and enter the new value accordingly.
[Link](
{"Employeeid" : 1},
{$set: { "EmployeeName" : "NewMartin"}});
If the command is executed successfully, the following Output will be shown
Output:
Page 62 of 64
The output clearly shows that one record matched the condition and hence the relevant field
value was modified.
Updating Multiple Values
To ensure that multiple documents are updated at the same time in MongoDB you need to use
the multi option because otherwise by default only one document is modified at a time.
The following example shows how this can be done.
In this example, we are going to first find the document which has the Employee id as "1" and
change the Employee name from "Martin" to "NewMartin"
Step 1) Issue the update command
Step 2) Choose the condition which you want to use to decide which document needs to be
updated. In our example, we want the document which has the Employee id of "1" to be updated.
Step 3) Choose which Field Name's you want to modify and enter their new value accordingly.
[Link]
(
{
Employeeid : 1
},
{
$set :
{
"EmployeeName" : "NewMartin"
"Employeeid" : 22
}
}
Page 63 of 64
)
If the command is executed successfully and if you run the "find" command to search for the
document with Employee id as 22 you will see the following Output will be shown
Output:
The output clearly shows that one record matched the condition and hence the relevant field
value was modified.
Page 64 of 64