NoSQL Unit 2 Week 5 Notes

The document covers key concepts in MongoDB including aggregation frameworks, data modeling, transactions, sharding, indexing, security, and replication. It explains how to use aggregation pipelines for data transformation, the differences between embedded and normalized data models, and the importance of indexing for query performance. Additionally, it discusses security measures like authentication and role-based access control, as well as the structure and benefits of replica sets for data redundancy and availability.

Uploaded by

coupanhub
Copyright
© All Rights Reserved

V20UDS502 – NOSQL

UNIT – 2, WEEK - 5
AGGREGATION, AGGREGATION PIPELINE, MAP-REDUCE
The aggregation framework lets you transform and combine documents in a collection.
Basically, you build a pipeline that processes a stream of documents through several
building blocks: filtering, projecting, grouping, sorting, limiting, and skipping.
AGGREGATION PIPELINE
Assuming that each article is stored as a document in MongoDB, you could create a pipeline
with several steps:
1. Project the authors out of each article document.
2. Group the authors by name, counting the number of occurrences.
3. Sort the authors by the occurrence count, descending.
4. Limit results to the first five.
Each of these steps maps to an aggregation framework operator:

1. {"$project" : {"author" : 1}}


This projects the author field in each document.
You can select fields to include by specifying "fieldname" : 1 or exclude fields with
"fieldname" : 0. The "_id" field is included by default.
After this operation, each document in the results looks like: {"_id" : id, "author" :
"authorName"}.

2. {"$group" : {"_id" : "$author", "count" : {"$sum" : 1}}}


This groups the authors by name and increments "count" for each document an author
appears in.
First, "_id" : "$author" specifies the field to group by, here "author".
Second, "count" : {"$sum" : 1} adds 1 to a "count" field for each document in the group.
Note that the incoming documents have no "count" field; it is a new field created by "$group".
At the end, each document in the results looks like: {"_id" : "authorName", "count" :
articleCount}.

3. {"$sort" : {"count" : -1}}


This reorders the result documents by the "count" field from greatest to least.

4. {"$limit" : 5}
This limits the result set to the first five result documents.
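
The four stages above can be traced by hand on a small data set. The following is a minimal sketch in plain JavaScript, not MongoDB's implementation: the sample articles array and all variable names are hypothetical, and each step only imitates what the corresponding stage does. In the shell, the four stage documents would instead be passed as an array to the collection's aggregate() method.

```javascript
// Hypothetical sample data: each article document has an _id and an author.
const articles = [
  { _id: 1, author: "ann", title: "A" },
  { _id: 2, author: "bob", title: "B" },
  { _id: 3, author: "ann", title: "C" },
  { _id: 4, author: "cy", title: "D" },
  { _id: 5, author: "ann", title: "E" },
  { _id: 6, author: "bob", title: "F" },
];

// 1. $project: keep only _id and author.
const projected = articles.map(({ _id, author }) => ({ _id, author }));

// 2. $group: one result per author, adding 1 to count for each document.
const counts = new Map();
for (const doc of projected) {
  counts.set(doc.author, (counts.get(doc.author) || 0) + 1);
}
const grouped = [...counts].map(([author, count]) => ({ _id: author, count }));

// 3. $sort: order by count, greatest first.
const sorted = [...grouped].sort((a, b) => b.count - a.count);

// 4. $limit: keep at most the first five results.
const topAuthors = sorted.slice(0, 5);

console.log(topAuthors);
```

With this sample data there are only three distinct authors, so the limit of five has no effect; it matters once the grouped result has more than five documents.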
MAPREDUCE
MapReduce is a powerful and flexible tool for aggregating data.
It can solve some problems that are too complex to express using the aggregation
framework’s query language.
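It starts with the map step, which maps an operation onto every document in a collection
and emits keys and values.
An intermediate step, often called shuffle, then groups the emitted values into a list for each key.
The reduce step takes this list of values and reduces it to a single element.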
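
The flow from map to reduce is easier to follow on a tiny example. The sketch below is plain JavaScript that imitates the idea in memory; it is not MongoDB's mapReduce command, and the sample documents and function names are hypothetical. It counts how many documents contain each field name, with an explicit grouping step between map and reduce.

```javascript
// Hypothetical documents: count how many documents contain each key.
const docs = [{ a: 1, b: 2 }, { a: 3 }, { b: 4, c: 5 }];

// Map step: emit a (key, 1) pair for every key of every document.
function map(doc) {
  return Object.keys(doc).map((key) => [key, 1]);
}

// Grouping step: collect the emitted values into one list per key.
function shuffle(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return groups;
}

// Reduce step: collapse a list of values into a single element.
function reduce(key, values) {
  return values.reduce((total, v) => total + v, 0);
}

const emitted = docs.flatMap((doc) => map(doc));
const keyCounts = {};
for (const [key, values] of shuffle(emitted)) {
  keyCounts[key] = reduce(key, values);
}

console.log(keyCounts); // { a: 2, b: 2, c: 1 }
```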

INTRODUCTION TO DATA MODELING


• MongoDB provides two types of data models:
- Embedded data model.
- Normalized data model.
• Based on the requirement, you can use either of the models while preparing your
document.
Embedded Data Model
In this model, related data is stored within a single document.
It reduces the need for joins and simplifies data retrieval since all relevant information is
in one place.
Well-suited for cases where data relationships are one-to-few or one-to-one.
Provides fast read performance but may lead to data duplication.
Useful for scenarios where data consistency is not critical, and data is read more
frequently than it's written.
Often used when performance and read speed are primary concerns.
In this model, you can embed all the related data in a single document. It is also known as
the de-normalized data model.
For example, if the details of an employee arrive as three different documents, namely
Personal_details, Contact, and Address, you can embed all three in a single document.
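
As an illustration, an embedded employee document might look like the sketch below. Only the three sub-document names come from the notes; every field name and value inside them is hypothetical.

```javascript
// Hypothetical employee document with all related data embedded.
const employee = {
  _id: 1,
  Personal_details: { First_Name: "Asha", Last_Name: "Rao" },
  Contact: { email: "asha@example.com", phone: "0000000000" },
  Address: { city: "Chennai", state: "Tamil Nadu" },
};

// A single read of the document gives access to every related field.
console.log(employee.Personal_details.First_Name, employee.Address.city);
```

Because everything lives in one document, no second query is needed to reach the address or contact details.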

Normalized Data Model


In this model, data is divided into separate documents or collections, and references are
used to establish relationships between them.
It minimizes data duplication and maintains data consistency, as there's only one copy of
shared data.
Suitable for scenarios where data relationships are many-to-many or one-to-many.
Requires additional queries and potentially more complex application logic to retrieve
related data.
Useful when data integrity and consistency are critical, and write operations are frequent.
Can be beneficial when handling large datasets with complex relationships.
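In this model, the original document refers to the sub-documents using references.
For example, the employee document above can be rewritten in the normalized model by storing
Personal_details, Contact, and Address as separate documents and referencing them from the
employee document.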
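
A normalized version of an employee record can be sketched as follows. This is plain JavaScript with arrays standing in for collections; the collection names, field names, and values are all hypothetical. The point it shows is that the employee document holds only references, so reading the full record takes one extra lookup per referenced collection.

```javascript
// Hypothetical collections, modeled as arrays; all values are illustrative.
const personalDetails = [{ _id: 101, firstName: "Asha", lastName: "Rao" }];
const contacts = [{ _id: 201, email: "asha@example.com" }];
const addresses = [{ _id: 301, city: "Chennai" }];

// The employee document stores only references (the _id values).
const employees = [{ _id: 1, personalDetailsId: 101, contactId: 201, addressId: 301 }];

// Resolving a reference is a separate lookup in another collection.
function findById(collection, id) {
  return collection.find((doc) => doc._id === id);
}

const emp = findById(employees, 1);
const resolved = {
  _id: emp._id,
  personalDetails: findById(personalDetails, emp.personalDetailsId),
  contact: findById(contacts, emp.contactId),
  address: findById(addresses, emp.addressId),
};

console.log(resolved.address.city);
```

Updating the address now touches exactly one document in one place, which is the consistency benefit described above.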

TRANSACTIONS
• In MongoDB, an operation on a single document is atomic.
• Because you can use embedded documents and arrays to capture relationships
between data in a single document structure instead of normalizing across multiple
documents and collections, this single-document atomicity obviates the need for
multi-document transactions for many practical use cases.
• For situations that require atomicity of reads and writes across multiple documents,
MongoDB supports multi-document transactions.
The behaviour of a transaction in MongoDB is governed by two
properties: ReadConcern and WriteConcern.
• The ReadConcern property controls the consistency and isolation of the data
we read from the database.
• Similarly, the WriteConcern property defines when we consider the data we write
to be consistent in the database.

SHARDING CLUSTERS

Sharding is a method for distributing data across multiple machines.


MongoDB uses sharding to support deployments with very large data sets and high
throughput operations.
There are two methods for addressing system growth: vertical and horizontal scaling.
Vertical Scaling involves increasing the capacity of a single server, such as using a more
powerful CPU, adding more RAM, or increasing the amount of storage space.
Horizontal Scaling involves dividing the system dataset and load over multiple servers,
adding additional servers to increase capacity as required.
MongoDB supports horizontal scaling through sharding.
A MongoDB sharded cluster consists of the following components:
• shard: Each shard contains a subset of the sharded data. Each shard can be deployed as
a replica set.
• mongos: The mongos acts as a query router, providing an interface between client
applications and the sharded cluster.
• config servers: Config servers store metadata and configuration settings for the cluster.
For testing and development, a sharded cluster can be deployed with a minimum number of
components:
• One mongos instance.
• A single shard replica set.
• A replica set config server.

A shard contains a subset of sharded data for a sharded cluster.


Together, the cluster's shards hold the entire data set for the cluster.
Each database in a sharded cluster has a primary shard that holds all the un-sharded
collections for that database.
Each database has its own primary shard. The primary shard has no relation to
the primary in a replica set.
INDEXES
Indexes support efficient execution of queries in MongoDB.
Without indexes, MongoDB must scan every document in a collection to return query
results.
If an appropriate index exists for a query, MongoDB uses the index to limit the number of
documents it must scan.
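
The difference between a collection scan and an index lookup can be made concrete by counting how many entries each approach examines. The sketch below is plain JavaScript under stated assumptions: the 1000-document collection is hypothetical, and a sorted array with binary search stands in for MongoDB's actual index structure.

```javascript
// Hypothetical collection of 1000 documents, each with a distinct score.
const collection = Array.from({ length: 1000 }, (_, i) => ({
  _id: i,
  score: (i * 37) % 1000,
}));

// Collection scan: every document is examined.
function collectionScan(docs, score) {
  let examined = 0;
  let match = null;
  for (const doc of docs) {
    examined++;
    if (doc.score === score) match = doc;
  }
  return { match, examined };
}

// Stand-in index: (key, position) entries kept sorted by key.
const scoreIndex = collection
  .map((doc, position) => ({ key: doc.score, position }))
  .sort((a, b) => a.key - b.key);

// Index lookup: binary search over the sorted keys.
function indexLookup(docs, index, score) {
  let lo = 0;
  let hi = index.length - 1;
  let examined = 0;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    examined++;
    if (index[mid].key === score) {
      return { match: docs[index[mid].position], examined };
    }
    if (index[mid].key < score) lo = mid + 1;
    else hi = mid - 1;
  }
  return { match: null, examined };
}

const full = collectionScan(collection, 500);
const viaIndex = indexLookup(collection, scoreIndex, 500);
console.log(full.examined, viaIndex.examined);
```

Both calls find the same document, but the scan examines all 1000 documents while the lookup examines at most 10 index entries.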

Create Index:
Use the createIndex() shell method.
Syntax:
db.<collection>.createIndex( { <field>: <value> }, { name: "<indexName>" } )
Example: create a single-key descending index on the name field
db.<collection>.createIndex( { name: -1 } )

Get Index:
db.<collection>.getIndexes()
Output:
[
{ v: 2, key: { _id: 1 }, name: '_id_' },
{ v: 2, key: { name: -1 }, name: 'name_-1' }
]
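
The name 'name_-1' in the output above was not supplied when the index was created. When no name is given, MongoDB builds one by joining each indexed field and its sort order with underscores. The sketch below reproduces that rule in plain JavaScript; the function name defaultIndexName is hypothetical, and the built-in _id index (named '_id_') is an exception that the sketch does not cover.

```javascript
// Build a default-style index name from a key specification such as { name: -1 }.
function defaultIndexName(keySpec) {
  return Object.entries(keySpec)
    .map(([field, direction]) => `${field}_${direction}`)
    .join("_");
}

console.log(defaultIndexName({ name: -1 })); // name_-1
console.log(defaultIndexName({ name: 1, gpa: -1 })); // name_1_gpa_-1
```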
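Drop Index:
To drop a single index by name:
db.<collection>.dropIndex("<indexName>")
To drop several indexes, pass their names as an array:
db.<collection>.dropIndexes( [ "<index1>", "<index2>", "<index3>" ] )
To drop all indexes except the index on _id:
db.<collection>.dropIndexes()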

INDEX TYPES
• Single Field Index
• Compound Index
• Multikey Index
• Text Index
• Wildcard Index
• Geospatial Index
• Hashed Index

Single Field Index


Single field indexes collect and sort data from a single field in each document in a
collection.
Syntax: db.<collection>.createIndex( { <field>: <sortOrder> } )
Example: db.<collection>.createIndex( { score: 1 } )

You can create a single-field index on any field in a document:
• Index on a Single Field
• Index on an Embedded Document
• Index on an Embedded Field
When you create an index, you specify:
• The field on which to create the index.
• The sort order for the indexed values: a sort order of 1 sorts values in ascending order,
and a sort order of -1 sorts values in descending order.

COMPOUND INDEX
Compound indexes collect and sort data from two or more fields in each document in a
collection.
Data is grouped by the first field in the index and then by each subsequent field.
For example, consider a compound index where documents are first grouped by userid in
ascending order (alphabetically). Then, the scores for each userid are sorted in
descending order.

Syntax:
db.<collection>.createIndex( {
   <field1>: <sortOrder>,
   <field2>: <sortOrder>,
   ...
   <fieldN>: <sortOrder>
} )

Example:
db.<collection>.createIndex( {
   name: 1,
   gpa: -1
} )
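
The ordering that a compound index imposes can be imitated with an ordinary sort. The sketch below is plain JavaScript with hypothetical sample data; it arranges entries the way an index on { userid: 1, score: -1 } would order them, with userid ascending and score descending inside each userid.

```javascript
// Hypothetical documents with a userid and a score.
const scores = [
  { userid: "ca2", score: 55 },
  { userid: "aa1", score: 45 },
  { userid: "ca2", score: 90 },
  { userid: "aa1", score: 75 },
  { userid: "nb1", score: 30 },
];

// Compare on the first field; fall back to the second only on a tie.
const indexOrder = [...scores].sort((a, b) => {
  if (a.userid !== b.userid) return a.userid < b.userid ? -1 : 1;
  return b.score - a.score;
});

console.log(indexOrder.map((d) => `${d.userid}:${d.score}`).join(" "));
// aa1:75 aa1:45 ca2:90 ca2:55 nb1:30
```

Because the entries are grouped by the first field, the order of fields in a compound index matters: swapping the two fields would produce a different arrangement.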
MULTIKEY INDEX
Multikey indexes collect and sort data from fields containing array values. Multikey
indexes improve performance for queries on array fields.
When you create an index on a field that contains an array value, MongoDB
automatically sets that index to be a multikey index.

Example: collection of blogposts


{
"_id": 1,
"title": "Introduction to Multikey Indexes",
"tags": ["indexing", "MongoDB", "performance"]
}
Multikey index in the MongoDB shell:
• db.<collection>.createIndex({ "tags": 1 })
To find all blog posts with the "MongoDB" tag:
• db.<collection>.find({ "tags": "MongoDB" })
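
What makes an index "multikey" is that it holds one entry per array element rather than one per document. The sketch below imitates this in plain JavaScript; the second and third posts and the variable names are hypothetical, and a Map stands in for the real index structure.

```javascript
// Hypothetical blog posts; the first matches the example document above.
const posts = [
  { _id: 1, title: "Introduction to Multikey Indexes", tags: ["indexing", "MongoDB", "performance"] },
  { _id: 2, title: "Replica Sets", tags: ["MongoDB", "replication"] },
  { _id: 3, title: "Other", tags: ["misc"] },
];

// Build one index entry per array element: tag -> list of _id values.
const tagIndex = new Map();
for (const post of posts) {
  for (const tag of post.tags) {
    if (!tagIndex.has(tag)) tagIndex.set(tag, []);
    tagIndex.get(tag).push(post._id);
  }
}

// Counterpart of the find on "tags": a single key lookup.
const matchingIds = tagIndex.get("MongoDB") || [];
console.log(matchingIds); // [ 1, 2 ]
```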

TEXT INDEX
Text indexes support text search queries on fields containing string content.
Text indexes improve performance when searching for specific words or phrases within
string content.
A collection can only have one text index, but that index can cover multiple fields.
Syntax:
db.<collection>.createIndex({ <field1>: "text", <field2>: "text", ... })

Example: a collection of articles
{
"_id": 1,
"title": "MongoDB Text Index Example",
"content": "In this example, we'll demonstrate how to create a text index in MongoDB."
}

Text index in the MongoDB shell:
db.<collection>.createIndex({ "content": "text" })
Now, you can perform full-text searches using the $text operator.
For instance, to find articles that contain the word "MongoDB," you can use the following
query:
db.<collection>.find({ $text: { $search: "MongoDB" } })
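
The idea behind a text index can be sketched as an inverted index that maps each word to the documents containing it. The code below is a simplified illustration in plain JavaScript, not MongoDB's implementation: it lowercases and splits on non-alphanumeric characters only, whereas a real text index also applies language-specific rules such as stemming and stop-word removal. The second document and all function names are hypothetical.

```javascript
// Hypothetical articles; the first reuses the example content above.
const articleDocs = [
  { _id: 1, content: "In this example, we'll demonstrate how to create a text index in MongoDB." },
  { _id: 2, content: "Replica sets provide redundancy." },
];

// Tokenize: lowercase, then split on anything that is not a letter or digit.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Inverted index: term -> set of _id values.
const textIndex = new Map();
for (const doc of articleDocs) {
  for (const term of tokenize(doc.content)) {
    if (!textIndex.has(term)) textIndex.set(term, new Set());
    textIndex.get(term).add(doc._id);
  }
}

// Case-insensitive single-word search.
function search(term) {
  return [...(textIndex.get(term.toLowerCase()) || [])];
}

console.log(search("MongoDB")); // [ 1 ]
```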

WILDCARD INDEX

MongoDB supports creating indexes on a field, or set of fields, to improve performance
for queries.
MongoDB supports flexible schemas, meaning document field names may differ within a
collection.
Use wildcard indexes to support queries against arbitrary or unknown fields.
To create a wildcard index, use the wildcard specifier ($**) as the index key:
• db.<collection>.createIndex( { "$**": <sortOrder> } )
Example: a product collection containing the following document:

{
   "product_name" : "Spy Coat",
   "attributes" : {
      "material" : [ "Tweed", "Wool", "Leather" ],
      "size" : {
         "length" : 72,
         "units" : "inches"
      }
   }
}

Index:
• db.<collection>.createIndex( { "attributes.$**" : 1 } )

Query:
• db.<collection>.find( { "attributes.size.length" : { $gt : 60 } } )

Output:
[
   {
      _id: ObjectId("63472196b1fac2ee2e957ef6"),
      product_name: 'Spy Coat',
      attributes: {
         material: [ 'Tweed', 'Wool', 'Leather' ],
         size: { length: 72, units: 'inches' }
      }
   }
]
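
One way to picture a wildcard index on "attributes.$**" is as an index over every field path found under attributes, whatever those paths turn out to be. The sketch below enumerates those paths in plain JavaScript. It is an illustration only: the flatten helper is hypothetical and the product values are taken from the output shown above.

```javascript
// Product document, using the values from the output above.
const product = {
  product_name: "Spy Coat",
  attributes: {
    material: ["Tweed", "Wool", "Leather"],
    size: { length: 72, units: "inches" },
  },
};

// Recursively collect [path, value] pairs for every scalar under a prefix.
function flatten(value, prefix, out = []) {
  if (Array.isArray(value)) {
    for (const item of value) flatten(item, prefix, out);
  } else if (value !== null && typeof value === "object") {
    for (const [key, child] of Object.entries(value)) {
      flatten(child, `${prefix}.${key}`, out);
    }
  } else {
    out.push([prefix, value]);
  }
  return out;
}

const pathEntries = flatten(product.attributes, "attributes");
console.log(pathEntries);

// A query on any of these paths can be answered from the collected entries.
const longerThan60 = pathEntries.some(
  ([path, value]) => path === "attributes.size.length" && value > 60
);
console.log(longerThan60); // true
```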

GEOSPATIAL INDEX

Geospatial indexes support queries on data stored as GeoJSON objects or legacy
coordinate pairs.
You can use geospatial indexes to improve performance for queries on geospatial data or
to run certain geospatial queries.

MongoDB provides two types of geospatial indexes:

• 2dsphere Indexes, which support queries that interpret geometry on a sphere.

• 2d Indexes, which support queries that interpret geometry on a flat surface.
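
The practical difference between flat and spherical geometry shows up when distances are computed. The sketch below, in plain JavaScript, compares a planar distance with a great-circle distance from the haversine formula. It illustrates the two interpretations only and is not how MongoDB evaluates geospatial queries; the function names and the mean Earth radius of 6371 km are assumptions of the sketch. Points are written as [longitude, latitude], the order GeoJSON uses.

```javascript
// Flat-surface distance between two coordinate pairs, in coordinate units.
function planarDistance([x1, y1], [x2, y2]) {
  return Math.hypot(x2 - x1, y2 - y1);
}

// Great-circle distance in kilometres using the haversine formula.
const EARTH_RADIUS_KM = 6371; // assumed mean Earth radius
function sphericalDistanceKm([lon1, lat1], [lon2, lat2]) {
  const rad = (deg) => (deg * Math.PI) / 180;
  const dLat = rad(lat2 - lat1);
  const dLon = rad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(rad(lat1)) * Math.cos(rad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
}

// One degree of longitude, measured at the equator and at 60 degrees north.
const flatEquator = planarDistance([0, 0], [1, 0]);
const flatNorth = planarDistance([0, 60], [1, 60]);
const sphereEquator = sphericalDistanceKm([0, 0], [1, 0]);
const sphereNorth = sphericalDistanceKm([0, 60], [1, 60]);

console.log(flatEquator, flatNorth);
console.log(sphereEquator.toFixed(1), sphereNorth.toFixed(1));
```

On a flat surface both pairs are exactly one unit apart. On a sphere the pair at 60 degrees north is roughly half as far apart as the pair at the equator, which is why the choice of index type depends on how the coordinates should be interpreted.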

HASHED INDEX

Hashed indexes collect and store hashes of the values of the indexed field.

Hashed indexes support sharding using hashed shard keys.

Hash-based sharding uses a hashed index of a field as the shard key to partition data
across your sharded cluster.
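
The partitioning idea can be sketched by hashing a shard key value and using the hash to choose a shard. The code below is plain JavaScript and a deliberate simplification: it uses 32-bit FNV-1a as a stand-in hash (MongoDB's own hash function is different), and it picks a shard with a modulo, whereas a real cluster assigns ranges of hash values to shards. The shard names and function names are hypothetical.

```javascript
// Stand-in hash function (32-bit FNV-1a); used for illustration only.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Hypothetical cluster with three shards.
const shardNames = ["shardA", "shardB", "shardC"];

// Choose a shard from the hash of the shard key value.
function shardFor(shardKeyValue) {
  return shardNames[fnv1a(String(shardKeyValue)) % shardNames.length];
}

// Count how many of 300 sequential key values land on each shard.
const placement = {};
for (let id = 1; id <= 300; id++) {
  const shard = shardFor(id);
  placement[shard] = (placement[shard] || 0) + 1;
}

console.log(placement);
```

The same key value always hashes to the same shard, so a router can locate a document from its shard key alone.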

SECURITY

MongoDB provides a Security Checklist, a list of recommended actions to
protect a MongoDB deployment.

Key security features include authentication and role-based access control, covered in
the sections that follow.


AUTHENTICATION

Authentication is the process of verifying the identity of a client. When access control
(authorization) is enabled, MongoDB requires all clients to authenticate themselves in
order to determine their access.

Although authentication and authorization are closely connected, authentication is distinct
from authorization:

• Authentication verifies the identity of a user. To authenticate as a user, you must
provide a username, password, and the authentication database associated with that user.

• Authorization determines the verified user's access to resources and operations.

SCRAM Authentication

• Salted Challenge Response Authentication Mechanism (SCRAM) is the default
authentication mechanism for MongoDB.

• When a user authenticates themselves, MongoDB uses SCRAM to verify the supplied
user credentials against the user's name, password and authentication database.

x.509 Certificate Authentication

• MongoDB supports x.509 certificate authentication for client authentication and
internal authentication of the members of replica sets and sharded clusters.

• x.509 certificate authentication requires a secure TLS/SSL (Transport Layer
Security/Secure Sockets Layer) connection.

ROLE BASED ACCESS CONTROL

MongoDB employs Role-Based Access Control (RBAC) to govern access to a MongoDB
system.

A user is granted one or more roles that determine the user's access to database resources
and operations.

Outside of role assignments, the user has no access to the system.


MongoDB does not enable access control by default. You can enable authorization using
the --auth option or the security.authorization setting.

Role: A role grants privileges to perform the specified actions on a resource.

Each privilege is either specified explicitly in the role or inherited from another role or
both.

To grant a role on a database, you must have the grantRole action on that database.

To revoke a role on a database, you must have the revokeRole action on that database.

To view a role's information, you must be either explicitly granted the role or must have
the viewRole action on the role's database.
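
The relationship between users, roles, and privileges can be sketched as a small lookup. The code below is plain JavaScript and only models the idea; it is not MongoDB's authorization code. The role names, the inventory resource, the users, and the helper functions are all hypothetical.

```javascript
// Hypothetical roles: a privilege pairs a resource with the actions allowed on it.
const roles = {
  reader: {
    privileges: [{ resource: "inventory", actions: ["find"] }],
    inherits: [],
  },
  editor: {
    privileges: [{ resource: "inventory", actions: ["insert", "update"] }],
    inherits: ["reader"],
  },
};

// Collect a role's privileges, including those inherited from other roles.
function privilegesOf(roleName, seen = new Set()) {
  if (seen.has(roleName) || !roles[roleName]) return [];
  seen.add(roleName);
  const role = roles[roleName];
  const inherited = role.inherits.flatMap((parent) => privilegesOf(parent, seen));
  return role.privileges.concat(inherited);
}

// A user may act only if one of the granted roles allows the action.
function isAllowed(user, resource, action) {
  return user.roles
    .flatMap((roleName) => privilegesOf(roleName))
    .some((p) => p.resource === resource && p.actions.includes(action));
}

const alice = { name: "alice", roles: ["editor"] };
const bob = { name: "bob", roles: [] };

console.log(isAllowed(alice, "inventory", "find")); // true, inherited from reader
console.log(isAllowed(alice, "inventory", "remove")); // false, no role grants it
console.log(isAllowed(bob, "inventory", "find")); // false, no roles means no access
```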

LDAP Authorization

MongoDB Enterprise supports querying a Lightweight Directory Access Protocol
(LDAP) server for the LDAP groups the authenticated user is a member of. MongoDB
maps the Distinguished Names (DN) of each returned group to roles on the admin
database.

MongoDB authorizes the user based on the mapped roles and their associated privileges.

BUILT IN ROLES

MongoDB grants access to data and commands through role-based authorization and
provides built-in roles that supply the different levels of access commonly needed in a
database system.

A role grants privileges to perform sets of actions on defined resources.

A given role applies to the database on which it is defined and can grant access down to a
collection level of granularity.

Each of MongoDB's built-in roles defines access at the database level for all non-system
collections in the role's database and at the collection level for all system collections.
USER DEFINED ROLES

MongoDB provides a number of built-in roles. However, if these roles cannot describe
the desired set of privileges, you can create new, user-defined roles.

To add a role, MongoDB provides the db.createRole() method.

REPLICATION

A replica set in MongoDB is a group of mongod processes that maintain the same data
set.

Replica sets provide redundancy and high availability, and are the basis for all production
deployments.
Redundancy and Data Availability

Replication provides redundancy and increases data availability.

With multiple copies of data on different database servers, replication provides a level of
fault tolerance against the loss of a single database server.

In some cases, replication can provide increased read capacity as clients can send read
operations to different servers.

Maintaining copies of data in different data centers can increase data locality and
availability for distributed applications.

You can also maintain additional copies for dedicated purposes, such as disaster recovery,
reporting, or backup.

A replica set contains several data-bearing nodes and, optionally, one arbiter node.

Of the data-bearing nodes, one and only one member is deemed the primary node, while
the other nodes are deemed secondary nodes.

The primary node receives all write operations.

A replica set can have only one primary capable of confirming writes with { w:
"majority" } write concern; although in some circumstances, another mongod instance
may transiently believe itself to also be primary.

The primary records all changes to its data sets in its operation log, known as the oplog.

The secondaries replicate the primary's oplog and apply the operations to their data sets
such that the secondaries' data sets reflect the primary's data set.
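
The oplog mechanism can be sketched in a few lines. The code below is plain JavaScript that models one primary and one secondary in memory. It illustrates the idea of recording operations and replaying them in order, and is not MongoDB's replication protocol; the operation format and the function names are hypothetical.

```javascript
// Hypothetical in-memory members: each holds a data set keyed by _id.
const primary = { data: new Map(), oplog: [] };
const secondary = { data: new Map(), applied: 0 };

// Apply one logged operation to a data set.
function applyOp(data, op) {
  if (op.type === "insert" || op.type === "update") data.set(op._id, op.doc);
  else if (op.type === "delete") data.delete(op._id);
}

// Every write goes to the primary, which applies it and records it in the oplog.
function write(op) {
  applyOp(primary.data, op);
  primary.oplog.push(op);
}

// The secondary copies the entries it has not yet seen and applies them in order.
function replicate() {
  while (secondary.applied < primary.oplog.length) {
    applyOp(secondary.data, primary.oplog[secondary.applied]);
    secondary.applied++;
  }
}

write({ type: "insert", _id: 1, doc: { name: "a" } });
write({ type: "update", _id: 1, doc: { name: "b" } });
write({ type: "insert", _id: 2, doc: { name: "c" } });
write({ type: "delete", _id: 2 });
replicate();

console.log([...secondary.data]);
```

After replicate() runs, the secondary holds the same single document as the primary, because it replayed the same operations in the same order.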

If the primary is unavailable, an eligible secondary will hold an election to elect itself the
new primary.
For example, in a 3-member replica set, if the primary becomes unavailable, this triggers an
election which selects one of the remaining secondaries as the new primary.
• In some circumstances, you may choose to add a mongod instance to a replica set as
an arbiter. An arbiter participates in elections but does not hold data.

• An arbiter will always be an arbiter whereas a primary may step down and become
a secondary and a secondary may become the primary during an election.
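
An election succeeds only when a candidate gathers votes from a majority of the voting members, which is why a 3-member set can survive the loss of one member. The sketch below checks that condition in plain JavaScript; the host names and the member objects are hypothetical, and the code models only the counting, not the election protocol itself.

```javascript
// Votes needed to win an election among the given number of voting members.
function majority(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// Hypothetical 3-member replica set; "up" marks members that are reachable.
const members = [
  { host: "node1", role: "primary", up: false }, // the primary has failed
  { host: "node2", role: "secondary", up: true },
  { host: "node3", role: "secondary", up: true },
];

const reachable = members.filter((m) => m.up).length;
const canElect = reachable >= majority(members.length);

console.log(majority(members.length), reachable, canElect); // 2 2 true
```

An arbiter counts as a voting member here even though it holds no data, so two data-bearing nodes plus one arbiter reach the same majority of 2.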
