Data Modelling with NoSQL databases
By Dr Shivakumar C
Content
• Why NoSQL? The Value of Relational Databases, Getting at Persistent Data,
Concurrency, Integration, A (Mostly) Standard Model, Impedance Mismatch,
Application and Integration Databases, Attack of the Clusters, the Emergence of
NoSQL.
• Aggregate Data Models: Aggregates: Example of Relations and Aggregates,
Consequences of Aggregate Orientation, Summarizing Aggregate-Oriented
Databases.
• More Details on Data Models: Relationships, Schemaless Databases,
Materialized Views, Modeling for Data Access.
Why NoSQL?
• NoSQL, often expanded as "Not Only SQL," is a term used to describe a broad category of database
management systems that differ from traditional relational (SQL) databases in their data model,
storage, and processing.
• There are several reasons why organizations might choose NoSQL databases over traditional SQL databases
for certain use cases:
• 1. Schema flexibility: NoSQL databases are often schema-less or schema-flexible, allowing you to store and
manage data without a predefined schema. This flexibility is particularly beneficial when dealing with
unstructured or semi-structured data, as it allows for easier adaptation to changing data requirements.
• 2. Scalability: Many NoSQL databases are designed to scale horizontally, meaning they can handle a growing
amount of data by adding more servers to a distributed database system. This makes NoSQL databases
suitable for applications with rapidly increasing data volumes and traffic.
• 3. Performance: NoSQL databases are optimized for specific use cases, providing high
performance for certain types of queries and operations. They are often designed to handle
large amounts of read and write operations efficiently, making them suitable for
applications that require low-latency data access.
• 4. Variety of data models: NoSQL databases support various data models, such as
document-oriented, key-value, column-family, and graph databases. This diversity allows
organizations to choose the most appropriate model for their specific application needs.
• 5. Agile development and iteration: The flexibility of NoSQL databases makes them
well-suited for agile development methodologies. Developers can quickly iterate on their
applications without being constrained by a rigid schema, making it easier to adapt to
changing requirements.
• 6. Handling large amounts of unstructured data: NoSQL databases are
often better equipped to handle unstructured or semi-structured data, which
is common in modern applications. This is particularly useful in scenarios
where data doesn't fit neatly into tables with predefined relationships.
• 7. Cost-effectiveness: NoSQL databases can be more cost-effective in
certain scenarios, especially when dealing with large-scale distributed systems.
They can leverage commodity hardware and scale horizontally, potentially
reducing infrastructure costs.
The Value of Relational Databases
• Relational databases have become such an embedded part of our computing
culture that it’s easy to take them for granted.
• It’s therefore useful to revisit the benefits they provide.
Getting at Persistent Data
• Probably the most obvious value of a database is keeping large amounts of
persistent data.
• Most computer architectures have the notion of two areas of memory: a fast
volatile “main memory” and a larger but slower “backing store.”
• Main memory is both limited in space and loses all data when you lose power
or something bad happens to the operating system.
• Therefore, to keep data around, we write it to a backing store, commonly
seen as a disk (although these days that disk can be persistent memory).
• The backing store can be organized in all sorts of ways. For many
productivity applications (such as word processors), it’s a file in the file
system of the operating system.
• For most enterprise applications, however, the backing store is a
database.
• The database allows more flexibility than a file system in storing large
amounts of data in a way that allows an application program to get at small
bits of that information quickly and easily.
Concurrency
• Enterprise applications tend to have many people looking at the same body of data at
once, possibly modifying that data.
• Most of the time they are working on different areas of that data, but occasionally they
operate on the same bit of data.
• As a result, we have to worry about coordinating these interactions to avoid such things as
double booking of hotel rooms.
• Concurrency is notoriously difficult to get right, with all sorts of errors that can trap even
the most careful programmers.
• Since enterprise applications can have lots of users and other systems all working
concurrently, there’s a lot of room for bad things to happen.
• Relational databases help handle this by controlling all access to their data
through transactions.
• While this isn’t a cure-all (you still have to handle a transactional error when
you try to book a room that’s just gone), the transactional mechanism has
worked well to contain the complexity of concurrency.
• Transactions also play a role in error handling. With transactions, you can
make a change, and if an error occurs during the processing of the change
you can roll back the transaction to clean things up.
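The room-booking case above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production design: the `rooms` and `payments` tables, the `book_room` helper, and the guest names are all assumptions made up for this example.

```python
import sqlite3

# Hypothetical schema: one row per room, plus a table of payments taken.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rooms (id INTEGER PRIMARY KEY, guest TEXT)")
conn.execute("CREATE TABLE payments (room_id INTEGER, guest TEXT)")
conn.execute("INSERT INTO rooms (id, guest) VALUES (101, NULL)")
conn.commit()


def book_room(conn, room_id, guest):
    """Record a payment and book the room as one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO payments (room_id, guest) VALUES (?, ?)",
                (room_id, guest),
            )
            cur = conn.execute(
                "UPDATE rooms SET guest = ? WHERE id = ? AND guest IS NULL",
                (guest, room_id),
            )
            if cur.rowcount == 0:
                # The room has just gone: abort so the payment is undone too.
                raise ValueError("room already booked")
        return True
    except ValueError:
        return False


first = book_room(conn, 101, "Asha")   # succeeds
second = book_room(conn, 101, "Ravi")  # fails and is rolled back
payments = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
guest = conn.execute("SELECT guest FROM rooms WHERE id = 101").fetchone()[0]
```

The application still has to handle the failed booking, but the rollback means the second payment row never becomes visible, so only one payment is recorded.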
Integration
• Enterprise applications live in a rich ecosystem that requires multiple applications,
written by different teams, to collaborate in order to get things done.
• This kind of inter-application collaboration is awkward because it means pushing
the human organizational boundaries.
• Applications often need to use the same data and updates made through one
application have to be visible to others.
• A common way to do this is shared database integration where multiple
applications store their data in a single database.
…
• Using a single database allows all the applications to use each other’s data
easily, while the database’s concurrency control handles multiple applications in the
same way as it handles multiple users in a single application.
A (Mostly) Standard Model
• Relational databases have succeeded because they provide the core benefits described above (persistence, concurrency control, and integration) in a (mostly) standard way.
• As a result, developers and database professionals can learn the basic relational
model and apply it in many projects.
• Although there are differences between different relational databases, the core
mechanisms remain the same:
• Different vendors’ SQL dialects are similar, and transactions operate in mostly the same
way.
• Note: Examples include MySQL, PostgreSQL, Oracle, Microsoft SQL Server, SQLite, and IBM Db2.
Impedance Mismatch
• Relational databases provide many advantages, but they are by no means perfect. Even from their
early days, there have been lots of frustrations with them.
• For application developers, the biggest frustration has been what’s commonly called the impedance
mismatch: the difference between the relational model and the in-memory data structures.
• The relational data model organizes data into a structure of tables and rows, or more properly,
relations and tuples.
• In the relational model, a tuple is a set of name-value pairs and a relation is a set of tuples.
• As a result, if you want to use a richer in-memory data structure, you have to translate it to a relational
representation to store it on disk. Hence the impedance mismatch: two different representations
that require translation.
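The translation step can be illustrated with a small sketch. The nested `order` structure, the two table names, and the `to_rows` / `from_rows` helpers are hypothetical; they only show the kind of mapping code that the mismatch forces on the application.

```python
# A nested in-memory order (field names and values are made up).
order = {
    "id": 99,
    "customer_id": 1,
    "items": [
        {"product_id": 27, "price": 32.45},
        {"product_id": 31, "price": 12.00},
    ],
    "shipping_address": {"city": "Chicago"},
}


def to_rows(order):
    """Flatten one nested order into flat tuples, grouped by table."""
    order_rows = [
        (order["id"], order["customer_id"], order["shipping_address"]["city"])
    ]
    item_rows = [
        (order["id"], item["product_id"], item["price"])
        for item in order["items"]
    ]
    return {"orders": order_rows, "order_items": item_rows}


def from_rows(rows):
    """Rebuild the nested structure from the flat rows."""
    order_id, customer_id, city = rows["orders"][0]
    return {
        "id": order_id,
        "customer_id": customer_id,
        "items": [
            {"product_id": product_id, "price": price}
            for (oid, product_id, price) in rows["order_items"]
            if oid == order_id
        ],
        "shipping_address": {"city": city},
    }


rows = to_rows(order)
rebuilt = from_rows(rows)
```

One in-memory object becomes one row in `orders` and two rows in `order_items`, and reading it back requires the reverse mapping.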
Relational Model vs. In-Memory Data Structures:
• Relational Model: Databases, especially relational databases, organize data
into tables with rows and columns. They use a structured query language
(SQL) for querying and manipulating data. Relationships between tables are
typically established using foreign keys.
• In-Memory Data Structures: In contrast, when application developers
work with data in their code, they often use in-memory data structures such
as objects, arrays, and graphs. These data structures are more aligned with
the programming language's native representations.
Challenges and Frustrations
• Object-Relational Mapping (ORM) Challenges: Developers face challenges
when trying to map between the relational model and in-memory data structures.
This mapping process is often manual or requires the use of ORM tools, and
mismatches can lead to complex, error-prone code.
• Performance Concerns: Retrieving data from a relational database and mapping it
to in-memory structures can be computationally expensive, especially when dealing
with complex queries or large datasets.
• Expressiveness Differences: SQL, as a declarative language, has a different
expressiveness than imperative programming languages. Bridging the gap between
the two can be challenging.
Solutions and Mitigations:
• ORM Tools: ORM tools, as mentioned earlier, help automate the mapping between the
relational model and in-memory data structures. Examples include Hibernate, Entity
Framework, and Django ORM.
• In-Memory Databases: Some applications use in-memory databases to reduce the
impedance mismatch. These databases store data in a format closer to in-memory data
structures, improving performance.
• Caching Strategies: Caching can be employed to store frequently accessed data in
memory, reducing the need for repeated database queries.
• Careful Design and Consideration: Developers need to be mindful of the differences
between the two models during application design. Choosing appropriate data structures
and designing efficient data access patterns can mitigate some challenges.
Ongoing Efforts:
• New Database Paradigms: NoSQL databases, which often use more flexible data
models, have emerged as alternatives to traditional relational databases. They can
sometimes provide a better fit for applications with specific requirements.
• Language and Database Integration: Some programming languages and databases are
working on improving integration. For example, some databases support JSON data types,
making it easier to work with semi-structured data in an object-oriented manner.
• The impedance mismatch remains a significant challenge for developers, and addressing it
often involves a combination of careful design, the use of appropriate tools, and ongoing
efforts in both the database and programming language communities.
An order, which looks like a single aggregate structure in
the UI, is split into many rows from many tables in a
relational database
Application and Integration Databases
• An application database is one that is only directly accessed by a single application
codebase that’s looked after by a single team.
• With an application database, only the team using the application needs to know
about the database structure, which makes it much easier to maintain and evolve
the schema.
• An integration database, by contrast, is shared by multiple applications, usually
developed by separate teams, which all store their data in a common database.
• This improves communication because all the applications are operating on a
consistent set of persistent data.
Attack of the Clusters
• Websites started tracking activity and structure in a very detailed way.
• Large sets of data appeared: links, social networks, activity in logs, mapping
data.
• With this growth in data came a growth in users as the biggest websites grew
to be vast estates regularly serving huge numbers of visitors.
• Coping with the increase in data and traffic required more computing
resources. To handle this kind of increase, you have two choices: up or out.
• Scaling up implies bigger machines, more processors, disk storage, and
memory. But bigger machines get more and more expensive, not to mention that
there are real limits as your size increases.
• The alternative is to use lots of small machines in a cluster.
• A cluster of small machines can use commodity hardware and ends up being
cheaper at these kinds of scales.
• It can also be more resilient: while individual machine failures are common, the
overall cluster can be built to keep going despite such failures, providing high
reliability.
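Scaling out means each machine holds only part of the data. A minimal sketch of one common way to do this, hash-based sharding, is shown here. The node names, the `node_for`, `put`, and `get` helpers, and the in-memory dictionaries standing in for machines are all assumptions for illustration; replication, which is what lets a cluster survive machine failures, is not shown.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical machine names


def node_for(key, nodes=NODES):
    """Hash the key so the same key always maps to the same node."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


# Each "machine" is simulated by its own dictionary.
cluster = {name: {} for name in NODES}


def put(key, value):
    cluster[node_for(key)][key] = value


def get(key):
    return cluster[node_for(key)].get(key)


for i in range(100):
    put(f"user:{i}", {"visits": i})

total_keys = sum(len(shard) for shard in cluster.values())
```

Simple modulo hashing like this moves most keys when a node is added or removed, which is why real systems often use schemes such as consistent hashing instead.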
The Emergence of NoSQL
• It’s a wonderful irony that the term “NoSQL” first made its appearance in
the late 90s as the name of an open-source relational database.
• The usage of “NoSQL” that we recognize today traces back to a meetup on
June 11, 2009 in San Francisco organized by Johan Oskarsson.
• The term “NoSQL” caught on like wildfire, but it’s never been a term that’s
had much in the way of a strong definition.
• The original call for the meetup asked for “open-source, distributed,
nonrelational databases.”
• The talks there were from Voldemort, Cassandra, Dynomite, HBase, Hypertable,
CouchDB, and MongoDB—but the term has never been confined to that original septet.
• There is the obvious point that NoSQL databases don’t use SQL.
• Although the term NoSQL is frequently applied to closed-source systems, there’s a notion
that NoSQL is an open-source phenomenon.
• Relational databases use ACID transactions to handle consistency across the whole
database.
• This inherently clashes with a cluster environment, so NoSQL databases offer a range of
options for consistency and distribution.
• The change is that now we see relational databases as one option for data
storage.
• This point of view is often referred to as polyglot persistence: using
different data stores in different circumstances.
• Instead of just picking a relational database because everyone does, we need
to understand the nature of the data we’re storing and how we want to
manipulate it.
Characteristics
• Schema-less or Schema-flexible:
• NoSQL databases typically allow for a flexible schema, meaning that each
record in a database can have a different set of fields, and new fields can be
added without requiring a predefined schema.
• Non-relational:
• Unlike traditional relational databases, NoSQL databases do not use a fixed
schema and are not based on the traditional tabular structure with rows and
columns.
• Horizontal Scalability:
• NoSQL databases are designed to scale horizontally, allowing them to handle
larger amounts of data and increased load by adding more servers to a
distributed database.
• Distributed Architecture:
• Many NoSQL databases are designed to be distributed across multiple
servers or nodes, providing improved performance and fault tolerance.
• High Performance:
• NoSQL databases are often optimized for specific types of queries and are
capable of providing high-performance read and write operations.
• Various Data Models:
• NoSQL databases support various data models such as document-oriented
(like MongoDB), key-value stores (like Redis), column-family stores (like
Apache Cassandra), and graph databases (like Neo4j).
• BASE (Basically Available, Soft state (temporary in nature), Eventually consistent):
• NoSQL databases often prioritize availability and partition tolerance over strict
consistency. This means that, in the event of network partitioning, the system may
continue to operate but may provide inconsistent results temporarily.
• Big Data and Unstructured Data Support:
• NoSQL databases are often used for handling large volumes of unstructured or
semi-structured data, making them suitable for big data applications.
• Simple API:
• NoSQL databases typically offer simple APIs for data access and
manipulation, which can be more developer-friendly compared to
SQL-based databases.
• Open Source:
• Many NoSQL databases are open source, allowing for community
collaboration and customization.
Advantages
• Flexible Schema
• Scalability
• High Performance
• Big Data Support
• Variety of Data Models
• Cost-Effective
• Support for Unstructured Data
Disadvantages
• Lack of Standardization:
• NoSQL databases lack a standardized query language, unlike SQL used in relational
databases. Each NoSQL database may have its own set of APIs and query
languages, making it challenging for developers to switch between different
systems.
• Limited ACID Transactions:
• NoSQL databases often prioritize performance and scalability over ACID
(Atomicity, Consistency, Isolation, Durability) transactions. This can be a limitation
in scenarios where strong transactional consistency is critical, such as in financial
applications.
• Data Consistency Challenges:
• NoSQL databases, especially those that follow the BASE (Basically Available, Soft
state, Eventually consistent) model, may exhibit eventual consistency rather than
immediate consistency. This can lead to challenges in maintaining data consistency
across distributed nodes.
• Limited Support for Complex Queries:
• NoSQL databases are optimized for specific types of queries, but they may lack the
robust support for complex queries and joins that relational databases offer. This
can be a limitation in scenarios requiring complex data retrieval and analysis.
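The eventual-consistency behaviour described under BASE can be sketched with a toy replicated store. This is a simulation under stated assumptions, not how any particular database works: the `EventuallyConsistentStore` class, its `sync` method, and the explicit list of pending updates are invented for illustration.

```python
class Replica:
    """One copy of the data, held on one (simulated) node."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Writes reach one replica at once and the others only on sync()."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # (replica_index, key, value) not yet applied

    def write(self, key, value):
        self.replicas[0].data[key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica_index):
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        """Apply pending updates in order; afterwards the replicas agree."""
        for index, key, value in self.pending:
            self.replicas[index].data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("room:101", "booked")
stale = store.read("room:101", replica_index=2)  # None: not propagated yet
store.sync()
fresh = store.read("room:101", replica_index=2)  # "booked"
```

Between the write and the sync, a reader on another replica sees old data. That window is the temporary inconsistency the bullets above refer to.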
When to use?
• Flexible Schema Requirements
• Scalability and High Throughput
• Rapid Development and Agile Iterations
• Big Data Applications
• High Availability and Fault Tolerance
• Real-time Applications (IoT)
Aggregate Data Models
• A data model is the model through which we perceive and manipulate our
data.
• For people using a database, the data model describes how we interact with
the data in the database.
• This is distinct from a storage model, which describes how the database
stores and manipulates the data internally.
• The relational model takes the information that we want to store and divides
it into tuples (rows).
• A tuple is a limited data structure: It captures a set of values, so you cannot
nest one tuple within another to get nested records, nor can you put a list of
values or tuples within another.
• This simplicity underpins the relational model.
• Aggregate orientation takes a different approach.
• It recognizes that often, you want to operate on data in units that have a more
complex structure than a set of tuples.
• It can be handy to think in terms of a complex record that allows lists and other
record structures to be nested inside it.
• Aggregate is a term that comes from Domain-Driven Design.
• In Domain-Driven Design, an aggregate is a collection of related objects that we
wish to treat as a unit.
• In particular, it is a unit for data manipulation and management of consistency.
Data model oriented around a relational database (using UML notation)
An aggregate data model
• We’ve used the black-diamond
composition marker in UML to show
how data fits into the aggregation
structure.
• The customer contains a list of billing
addresses;
• The order contains a list of order items, a
shipping address, and payments.
• The payment itself contains a billing
address for that payment.
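The two aggregates described above can be written out as nested records. This is an illustrative sketch: the field names and sample values are assumptions, and Python dictionaries with JSON serialization stand in for whatever format a given database uses.

```python
import json

# Customer aggregate: contains a list of billing addresses.
customer = {
    "id": 1,
    "name": "Asha",
    "billing_addresses": [{"city": "Chicago"}],
}

# Order aggregate: contains order items, a shipping address, and payments.
# Each payment carries its own billing address.
order = {
    "id": 99,
    "customer_id": 1,
    "order_items": [
        {"product_id": 27, "price": 32.45},
        {"product_id": 31, "price": 12.00},
    ],
    "shipping_address": {"city": "Chicago"},
    "payments": [
        {"txn_id": "example-txn-1", "billing_address": {"city": "Chicago"}},
    ],
}


def item_count(order):
    """Work on the aggregate as a unit, with no joins needed."""
    return len(order["order_items"])


# The whole aggregate can be stored and retrieved as a single value.
stored = json.dumps(order)
loaded = json.loads(stored)
```

The order refers to the customer by id rather than containing it, so the customer and the order remain two separate aggregates.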
Consequences of Aggregate Orientation
• Relational databases have no concept of aggregate within their data model, so we
call them aggregate-ignorant.
• In the NoSQL world, graph databases are also aggregate-ignorant.
• Being aggregate-ignorant is not a bad thing. It’s often difficult to draw aggregate
boundaries well, particularly if the same data is used in many different contexts.
• Aggregates have an important consequence for transactions.
• Relational databases allow you to manipulate any combination of rows from any
tables in a single transaction. Such transactions are called ACID transactions:
Atomic, Consistent, Isolated, and Durable.
• In general, it’s true that aggregate-oriented databases don’t have ACID
transactions that span multiple aggregates.
• Instead, they support atomic manipulation of a single aggregate at a time.
• This means that if we need to manipulate multiple aggregates in an atomic
way, we have to manage that ourselves in the application code.
More Details on Data Models
Relationships
▪ Aggregates are useful in that they put together data that is commonly accessed together.
▪ But there are still lots of cases where data that’s related is accessed differently.
▪ Consider the relationship between a customer and all of his orders.
▪ Some applications will want to access the order history whenever they access the customer;
▪ This fits in well with combining the customer with his order history into a single aggregate.
▪ Other applications, however, want to process orders individually and thus model orders as independent aggregates.
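The two ways of modelling the customer and order relationship can be compared side by side. The data and the helper names `order_history_embedded` and `order_history_referenced` are hypothetical and only illustrate the trade-off.

```python
# Option 1: one aggregate, with the order history embedded in the customer.
customer_with_orders = {
    "id": 1,
    "name": "Asha",
    "orders": [
        {"id": 99, "total": 44.45},
        {"id": 100, "total": 10.00},
    ],
}

# Option 2: independent aggregates, linked by the customer id.
customers = {1: {"id": 1, "name": "Asha"}}
orders = {
    99: {"id": 99, "customer_id": 1, "total": 44.45},
    100: {"id": 100, "customer_id": 1, "total": 10.00},
}


def order_history_embedded(customer):
    """One read of the customer gives the full history."""
    return [o["id"] for o in customer["orders"]]


def order_history_referenced(customer_id):
    """Needs a scan (or an index) over the separate order aggregates."""
    return sorted(
        o["id"] for o in orders.values() if o["customer_id"] == customer_id
    )


embedded = order_history_embedded(customer_with_orders)
referenced = order_history_referenced(1)
single_order = orders[100]  # option 2 can fetch one order directly
```

Option 1 suits applications that always read the history with the customer. Option 2 suits applications that process orders one at a time.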
▪ An important aspect of relationships between aggregates is how they handle updates.
▪ Aggregate oriented databases treat the aggregate as the unit of data-retrieval.
▪ Consequently, atomicity is only supported within the contents of a single aggregate.
▪ If you update multiple aggregates at once, you have to deal yourself with a failure partway through.
▪ Relational databases help you with this by allowing you to modify multiple records in a single transaction, providing ACID guarantees while altering many rows.
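What "dealing with a failure yourself" can look like is sketched here with a toy aggregate store. The `store` dictionary, the `put` and `transfer` helpers, and the `fail` flag that simulates a crashed write are all assumptions for illustration.

```python
import copy

# A toy aggregate store: each key maps to one aggregate (made-up data).
store = {
    "account:A": {"balance": 100},
    "account:B": {"balance": 50},
}


def put(key, aggregate, fail=False):
    """Replace one aggregate as a single step; `fail` simulates an error."""
    if fail:
        raise IOError(f"write to {key} failed")
    store[key] = aggregate


def transfer(src, dst, amount, fail_on_second=False):
    """Update two aggregates; undo the first write if the second fails."""
    original_src = copy.deepcopy(store[src])
    put(src, {"balance": store[src]["balance"] - amount})
    try:
        put(dst, {"balance": store[dst]["balance"] + amount},
            fail=fail_on_second)
    except IOError:
        put(src, original_src)  # compensating write, done by the application
        return False
    return True


ok = transfer("account:A", "account:B", 30)
failed = transfer("account:A", "account:B", 10, fail_on_second=True)
```

This is weaker than a real transaction: other readers can see the state between the two writes, and the compensating write could itself fail.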
Graph Databases
• Graph-based data models store data
in nodes that are connected by
edges.
• Unlike the aggregate-oriented models, graph
data models are widely used for storing
large volumes of complex, multidimensional
data that has many interconnections.
Use Cases:
• Graph-based Data Models are used in social networking sites to store
interconnections.
• They are used in fraud detection systems.
• They are also widely used in network and IT operations.
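A small sketch of the node-and-edge idea follows, using the social network use case. The names, the `edges` adjacency list, and the `within_hops` helper are made up for illustration. A real graph database would offer its own query language for this kind of traversal.

```python
from collections import deque

# Nodes are people; edges are "friend" links (made-up data).
edges = {
    "Anna": ["Barbara", "Carol"],
    "Barbara": ["Anna", "Dawn"],
    "Carol": ["Anna", "Dawn"],
    "Dawn": ["Barbara", "Carol", "Elizabeth"],
    "Elizabeth": ["Dawn"],
}


def within_hops(start, max_hops):
    """Breadth-first search: everyone reachable in at most max_hops edges."""
    seen = {start: 0}  # person -> number of hops from start
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if seen[person] == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in edges[person]:
            if friend not in seen:
                seen[friend] = seen[person] + 1
                queue.append(friend)
    return {p for p in seen if p != start}


friends = within_hops("Anna", 1)
friends_of_friends = within_hops("Anna", 2)
```

Following edges from node to node is the natural operation here, which is awkward to express when the same data is spread across rows or locked inside aggregates.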
Schemaless Databases
• A common theme across all the forms of NoSQL databases is that they are
schemaless.
• When you want to store data in a relational database, you first have to define
a schema: a defined structure for the database which says what tables exist,
which columns exist, and what data types each column can hold.
• Before you store some data, you have to have the schema defined for it.
• As well as handling changes, a schemaless store also makes it easier to deal
with nonuniform data: data where each record has a different set of fields.
• A schema puts all rows of a table into a straightjacket, which becomes
awkward if you have different kinds of data in different rows.
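Nonuniform data can be sketched as a list of records with different fields. The records and the `emails` and `fields_in_use` helpers are hypothetical. The point of the sketch is that the schema has not disappeared: it has moved into the code that reads the data.

```python
# A toy schemaless collection: each record is a dict with its own fields.
records = [
    {"id": 1, "name": "Asha", "email": "asha@example.com"},
    {"id": 2, "name": "Ravi", "phone": "555-0100"},
    {"id": 3, "name": "Mei", "email": "mei@example.com", "loyalty_points": 120},
]


def emails(collection):
    """The reading code must assume that `email` may be absent."""
    return [r["email"] for r in collection if "email" in r]


def fields_in_use(collection):
    """Discover which fields actually occur across all records."""
    return sorted({field for r in collection for field in r})


# Adding a record with a brand-new field needs no schema change.
records.append({"id": 4, "name": "Omar", "twitter": "@omar"})

known_emails = emails(records)
all_fields = fields_in_use(records)
```

In a relational table, the `phone`, `loyalty_points`, and `twitter` columns would have to exist for every row, mostly holding nulls.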
Materialized Views
• Views provide a mechanism to hide from the client whether data is derived
data or base data but can’t avoid the fact that some views are expensive to
compute.
• To cope with this, materialized views were invented, which are views that
are computed in advance and cached on disk.
• Materialized views are effective for data that is read heavily but can stand
being somewhat stale.
• Although NoSQL databases don’t have views, they may have precomputed
and cached queries, and they reuse the term “materialized view” to describe
them.
• Materialized views are also much more central to aggregate-oriented databases than
they are to relational systems, since most applications will have to deal with
some queries that don’t fit well with the aggregate structure.
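A query that cuts across aggregates, such as units sold per product, is a typical candidate for a materialized view. This sketch keeps a precomputed result next to the order aggregates. The data, the `units_sold` view, and the `rebuild_view` and `add_order` helpers are assumptions for illustration.

```python
from collections import defaultdict

# Base data: order aggregates keyed by order id (made-up values).
orders = {
    99: {"id": 99, "items": [{"product": "book", "qty": 2},
                             {"product": "pen", "qty": 1}]},
    100: {"id": 100, "items": [{"product": "book", "qty": 1}]},
}

# The "materialized view": units sold per product, computed in advance.
units_sold = defaultdict(int)


def rebuild_view():
    """Batch refresh: recompute the view from every order aggregate."""
    units_sold.clear()
    for order in orders.values():
        for item in order["items"]:
            units_sold[item["product"]] += item["qty"]


def add_order(order):
    """Eager refresh: update the view whenever the base data changes."""
    orders[order["id"]] = order
    for item in order["items"]:
        units_sold[item["product"]] += item["qty"]


rebuild_view()
add_order({"id": 101, "items": [{"product": "pen", "qty": 4}]})
snapshot = dict(units_sold)
```

The eager approach keeps the view fresh at the cost of slower writes. A batch refresh is cheaper per write but leaves the view stale between runs.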
Modelling for Data Access
• When modeling data aggregates, we
need to consider how the data is
going to be read as well as the
side effects on data related to
those aggregates.
• In this scenario, the application can read the customer’s information and all the related data by using the key.
• If the requirements are to read the orders or the products sold in each order, the whole object has to be read and then parsed on the client side to build the results.
• When references are needed, we could switch to document stores and then query inside the documents, or even change the data for the key-value store to split the value object into Customer and Order objects and then maintain these objects’ references to each other.
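Both key-value layouts described above can be sketched with a dictionary standing in for the store. The key naming scheme, the sample data, and the `products_sold` and `orders_for` helpers are assumptions for illustration.

```python
import json

kv = {}  # a toy key-value store: values are opaque strings

# Layout 1: one key holds the customer and all related data.
kv["customer:1"] = json.dumps({
    "name": "Asha",
    "orders": [
        {"id": 99, "products": ["book", "pen"]},
        {"id": 100, "products": ["book"]},
    ],
})


def products_sold(customer_key):
    """The store cannot look inside the value, so parse it on the client."""
    customer = json.loads(kv[customer_key])
    return sorted({p for o in customer["orders"] for p in o["products"]})


# Layout 2: split into Customer and Order values that reference each other.
kv["customer:2"] = json.dumps({"name": "Ravi", "order_ids": [200]})
kv["order:200"] = json.dumps({"customer_id": 2, "products": ["lamp"]})


def orders_for(customer_key):
    """Follow the references from the customer to each order."""
    customer = json.loads(kv[customer_key])
    return [json.loads(kv[f"order:{oid}"]) for oid in customer["order_ids"]]


sold = products_sold("customer:1")
ravi_orders = orders_for("customer:2")
```

Layout 1 needs a single read but always fetches everything. Layout 2 allows an order to be read on its own, but the application must keep the references in step.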