HND I COMPUTER & STATISTICS DATABASE DESING II NOTE
The Object Oriented (OO) Data Model in DBMS
Increasingly complex real-world problems demonstrated a need for a data model that
more closely represented the real world.
In the object oriented data model (OODM), both data and their relationships are
contained in a single structure known as an object.
In turn, the OODM is the basis for the object-oriented database management system
(OODBMS).
The Components of the Object Oriented Data Model
• An object is an abstraction of a real-world entity. In general terms, an object may be
considered equivalent to an ER model’s entity. More precisely, an object represents only
one occurrence of an entity. (The object’s semantic content is defined through several
of the items in this list.)
• Attributes describe the properties of an object. For example, a PERSON object includes
the attributes Name, Social Security Number, and Date of Birth.
• Objects that share similar characteristics are grouped in classes. A class is a collection
of similar objects with shared structure (attributes) and behavior (methods). In a general
sense, a class resembles the ER model’s entity set. However, a class is different from an
entity set in that it contains a set of procedures known as methods. A class’s method
represents a real-world action such as finding a selected PERSON’s name, changing a
PERSON’s name, or printing a PERSON’s address. In other words, methods are the
equivalent of procedures in traditional programming languages. In OO terms, methods
define an object’s behavior.
1
• Classes are organized in a class hierarchy. The class hierarchy resembles an upside-
down tree in which each class has only one parent. For example, the CUSTOMER class
and the EMPLOYEE class share a parent PERSON class. (Note the similarity to the
hierarchical data model in this respect.)
• Inheritance is the ability of an object within the class hierarchy to inherit the attributes
and methods of the classes above it. For example, two classes, CUSTOMER and
EMPLOYEE, can be created as subclasses from the class PERSON. In this case,
CUSTOMER and EMPLOYEE will inherit all attributes and methods from PERSON.
Object oriented data model is based upon real world situations. These situations are
represented as objects, with different attributes. All these object have multiple
relationships between them.
Elements of Object oriented data model
Objects
The real world entities and situations are represented as objects in the Object oriented
database model.
Attributes and Method
Every object has certain characteristics. These are represented using Attributes. The
behaviour of the objects is represented using Methods.
Class
Similar attributes and methods are grouped together using a class. An object can be
called as an instance of the class.
Inheritance
2
A new class can be derived from the original class. The derived class contains attributes
and methods of the original class as well as its own.
Example of the Object Oriented data model is −
Shape, Circle, Rectangle and Triangle are all objects in this model.
Circle has the attributes Center and Radius.
Rectangle has the attributes Length and Breath
Triangle has the attributes Base and Height.
The objects Circle, Rectangle and Triangle inherit from the object Shape.
CONCEPT OF OBJECT ORIENTED LANGUAGES
Object-oriented language (OOL) is a high-level computer programming language that
implements objects and their associated procedures within the programming context to
create software programs.
Object-oriented language uses an object-oriented programming technique that binds
related data and functions into an object and encourages reuse of these objects within
the same and other programs.
3
Object-oriented language was primarily designed to reduce complexity in typical
procedural languages through data binding and encapsulation techniques. In object-
oriented language, the objects created provide limited or no access to other functions or
methods within the program. This enables only authorized or inherited
methods/functions to access a particular object.
Object-oriented language typically supports the following features, at minimum:
The ability to create classes and their associated objects
Encapsulation
Inheritance
Java, C++ and Smalltalk are popular examples of object-oriented
Popular Object-Oriented Programming Languages
Java, Python, C++, Lisp, and Perl are all examples of popular object-oriented
programming languages. They support programming using the classes and objects
paradigm.
Five of the most popular object-oriented languages include:
1. Java
2. Python
3. C++
4. Ruby
5. C#
4
Java – Java is everywhere, and it is one of the most used and in-demand
languages of all time. Java’s motto is ‘write once, run anywhere,’ and that is
reflected in the number of platforms it runs on and places it’s used.
Python – Python is general purpose and used in many places. However, Python
has a strong foothold in machine learning and data science. It’s one of the
preferred languages for that new and ever-growing field.
C++ – C++has the speed of C with the functionality of classes and an object-
oriented paradigm. It’s a compiled, reliable, and powerful language. In fact, it’s
even used to build compilers and interpreters for other languages.
Ruby – Ruby is another general-purpose programming language. It was built for
simplicity. With that said, Ruby is an incredibly powerful language. The creator of
Ruby, Yukihiro “Matz” Matsumoto, has said “Ruby is very simple in appearance,
but is very complex inside, just like our human body.”
C# – C# is a programming language designed by Microsoft. It was designed to
improve upon existing concepts in C. C# powers the Microsoft .NET framework
alongside many web apps, games, desktop apps, and mobile apps.
Object-Oriented Programming Principles
Object-oriented languages come with four principles. These four principles are common
properties that define them and make them significantly more efficient. Some people
call them the four pillars of object-oriented programming.
The four pillars of object-oriented programming are:
1. Encapsulation
5
2. Abstraction
3. Inheritance
4. Polymorphism
The Four Principles
Let’s explore these four principles in greater depth.
Encapsulation – Encapsulation is the concept of binding data to functions, which
works as a security function to keep that data safe. For example, many OOP
languages prevent anything other than specific methods in the class owning the
data to access that data. This is often done explicitly to keep certain data safe.
Abstraction – Abstraction is the idea that if you’re doing one thing too often, it
should be its own entity. For example, if a programmer is repeatedly rewriting
the same function for different objects, that function could be abstracted to
become its own entity.
Inheritance – Inheritance is what it sounds like, and is most often defined as a “is
a” relationship. It lets certain classes be under the umbrella of other classes.
These classes can inherit information and functionality from that class, while also
being able to expand on that functionality. It lets us reuse code that we’ve
already typed in other classes.
Polymorphism – Polymorphism means ‘to take more than one shape.’ In the
sense of OOP languages, it refers to the ability to handle objects differently
depending on what they are. It allows us to define different methods for handling
objects based on their derived class. For example, if we needed to find the size of
something, we might change how we measure it based on what it is. If it’s a plank
6
of wood, I would measure it in inches. This is as opposed to a volume of water,
which I would measure by weight. If we utilize polymorphism, we can create a
function that can accept either object and still give us a sane measurement.
A database trigger is procedural code that is automatically executed in response to
certain events on a particular table or view in a database. The trigger is mostly used for
maintaining the integrity of the information on the database. For example, when a new
record (representing a new worker) is added to the employees table, new records
should also be created in the tables of the taxes, vacations and salaries. Triggers can also
be used to log historical data, for example to keep track of employees' previous salaries.
Database is a place to store data and has the capability of being accessed by other users
whether they are at home, in the office, or anywhere. The databases are normally used
by companies that are transacting with customers, businesses, and so on. Databases are
great for storing information since they are organized and easy to access. It is also easy
to use. Database is composed of forms, tables, rows, columns, fields, labels, and so on.
These are necessary since these items are the ones that will help organize any data
encoded into the form. Databases can also be altered or edited. There are different
kinds of databases each with their own unique operation used. Some of them can also
support other kinds of databases.
Database Forms
A database form is the front end of any database since it is here where the user encodes
the data to be stored. It is also here where the editing of information happens. The
database forms can range from simple to complex. Simple database forms can only have
a few items in it, such as the text field and text boxes and so on. Complex databases, on
the other hand, can include text boxes, text fields, graphics, and lots more. Though the
7
design of the database forms can be different, its use is still the same. This is where all
information are encoded and manipulated before being stored into the database.
The users of databases prefer to interact with the database forms since they are far
simpler to understand and to manipulate. The forms are the ones that provide the users
with the ability to format the data they wish to store.
OVERVIEW OF PHYSICAL STORAGE MEDIA AND TERTIARY STORAGE DEVICES
The main objective of the tertiary storage level is to provide huge storage capacity at
low cost. Several types of storage devices are available to be used at the tertiary storage
level in Hierarchical Storage Systems (HSS). They include: • Magnetic tapes • Optical
disks • Optical tapes These storage devices are composed of fixed storage drives and
removable media units. The storage drives are fixed to the computer system. The
removable media unit can be removed from the drives so that the storage capacity can
be expanded with more media units. When data on a media are accessed, the media
unit is accessed from their normal location. One of the storage drives on the computer
system is chosen. If there is a media unit in the storage drive, the old media unit is
unloaded and ejected. The new media unit is then loaded to the drive. Each type of
storage drive may handle the storage drives and media units differently. The magnetic
8
tapes are described below in the next section. Then, the optical tapes are presented.
Afterwards, the optical disks are briefly described before this chapter is summarized.
1. Several types of data storage exist in most computer systems. They vary in speed
of access, cost per unit of data, and reliability.
o Cache: most costly and fastest form of storage. Usually very small, and
managed by the operating system.
o Main Memory (MM): the storage area for data available to be operated
on.
General-purpose machine instructions operate on main memory.
Contents of main memory are usually lost in a power failure or
``crash''.
Usually too small (even with megabytes) and too expensive to store
the entire database.
o Flash memory: EEPROM (electrically erasable programmable read-only
memory).
Data in flash memory survive from power failure.
Reading data from flash memory takes about 10 nano-secs (roughly
as fast as from main memory), and writing data into flash memory
is more complicated: write-once takes about 4-10 microsecs.
To overwrite what has been written, one has to first erase the
entire bank of the memory. It may support only a limited number of
erase cycles ( to ).
It has found its popularity as a replacement for disks for storing
small volumes of data (5-10 megabytes).
o Magnetic-disk storage: primary medium for long-term storage.
Typically the entire database is stored on disk.
9
Data must be moved from disk to main memory in order for the
data to be operated on.
After operations are performed, data must be copied back to disk if
any changes were made.
Disk storage is called direct access storage as it is possible to read
data on the disk in any order (unlike sequential access).
Disk storage usually survives power failures and system crashes.
o Optical storage: CD-ROM (compact-disk read-only memory), WORM
(write-once read-many) disk (for archival storage of data), and Juke
box (containing a few drives and numerous disks loaded on demand).
o Tape Storage: used primarily for backup and archival data.
Cheaper, but much slower access, since tape must be read
sequentially from the beginning.
Used as protection from disk failures!
2. The storage device hierarchy is presented in Figure 1. Where the higher levels are
expensive (cost per bit), fast (access time), but the capacity is smaller.
10
Figure 1: Storage-device hierarchy
3. Another classification: Primary, secondary, and tertiary storage.
1. Primary storage: the fastest storage media, such as cash and main
memory.
2. Secondary (or on-line) storage: the next level of the hierarchy, e.g.,
magnetic disks.
3. Tertiary (or off-line) storage: magnetic tapes and optical disk juke boxes.
4. Volatility of storage. Volatile storage loses its contents when the power is
removed. Without power backup, data in the volatile storage (the part of the
hierarchy from main memory up) must be written to nonvolatile storage for
safekeeping.
File organization refers to the way data is stored in a file. File organization is very
important because it determines the methods of access, efficiency, flexibility and
11
storage devices to use. There are four methods of organizing files on a storage
media. This include:
sequential,
random,
serial and
indexed-sequential
1. Sequential file organization
Records are stored and accessed in a particular order sorted using a key field.
Retrieval requires searching sequentially through the entire file record by record
to the end.
Because the record in a file are sorted in a particular order, better file searching
methods like the binary search technique can be used to reduce the time used
for searching a file .
Since the records are sorted, it is possible to know in which half of the file a
particular record being searched is located, Hence this method repeatedly divides
the set of records in the file into two halves and searches only the half on which
the records is found.
For example, of the file has records with key fields 20, 30, 40, 50, 60 and the
computer is searching for a record with key field 50, it starts at 40 upwards in its
search, ignoring the first half of the set.
Advantages of sequential file organization
The sorting makes it easy to access records.
The binary chop technique can be used to reduce record search time by as much
as half the time taken.
12
Disadvantages of sequential file organization
The sorting does not remove the need to access other records as the search looks
for particular records.
Sequential records cannot support modern technologies that require fast access
to stored records.
The requirement that all records be of the same size is sometimes difficult to
enforce.
1. Random or direct file organization
Records are stored randomly but accessed directly.
To access a file stored randomly, a record key is used to determine where a
record is stored on the storage media.
Magnetic and optical disks allow data to be stored and accessed randomly.
Advantages of random file access
Quick retrieval of records.
The records can be of different sizes.
1. Serial file organization
Records in a file are stored and accessed one after another.
The records are not stored in any way on the storage medium this type of
organization is mainly used on magnetic tapes.
Advantages of serial file organization
It is simple
It is cheap
13
Disadvantages of serial file organization
It is cumbersome to access because you have to access all proceeding records
before retrieving the one being searched.
Wastage of space on medium in form of inter-record gap.
It cannot support modern high speed requirements for quick record access.
1. Indexed-sequential file organization method
Almost similar to sequential method only that, an index is used to enable the
computer to locate individual records on the storage media. For example, on
a magnetic drum, records are stored sequential on the tracks. However, each
record is assigned an index that can be used to access it directly.
Data Dictionary Concept
A Data Dictionary is a collection of names, definitions, and attributes about data
elements that are being used or captured in a database, information system, or part of a
research project. It describes the meanings and purposes of data elements within the
context of a project, and provides guidance on interpretation, accepted meanings and
representation. A Data Dictionary also provides metadata about data elements. The
metadata included in a Data Dictionary can assist in defining the scope and
characteristics of data elements, as well the rules for their usage and application.
Why Use a Data Dictionary?
Data Dictionaries are useful for a number of reasons. In short, they:
Assist in avoiding data inconsistencies across a project
Help define conventions that are to be used across a project
14
Provide consistency in the collection and use of data across multiple members of a
research team
Make data easier to analyze
Enforce the use of Data Standards
Data Dictionaries
A data dictionary is at the heart of any database management system. The data
dictionary contains important information, such as what files are in the database and
descriptions (called attributes) of the data contained in the files. This information is used
by the system to assess whether or not a particular process can be accomplished and
whether or not a particular user is authorised to carry it out. Information stored in the
data dictionary could normally be expected to include:
What data are available
Where on the storage devices they are located
Data attributes (for instance, data type numerical or alphanumerical)
How the data are used
Definitions of data security requirements (who is allowed to access the data,
who is allowed to update/amend them)
Relationships to other pieces of data
Definitions of data integrity requirements.
15
Basic Concepts
Many queries reference only a small proportion of the records in a file. For example, a
query like “Find all accounts at the Kano branch” or “Find the balance of account
number A-101122” references only a fraction of the account records. It is inefficient for
the system to read every record and to check the branch-name field for the name
“Kano,” or the account-number field for the value A-101122. Ideally, the system should
be able to locate these records directly. To allow these forms of access, we design
additional structures that we associate with files.
An index for a file in a database system works in much the same way as the index in this
textbook. If we want to learn about a particular topic (specified by a word or a phrase)
in this textbook, we can search for the topic in the index at the back of the book, find
the pages where it occurs, and then read the pages to find the information we are
looking for. The words in the index are in sorted order, making it easy to find the word
we are looking for. Moreover, the index is much smaller than the book, further reducing
the effort needed to find the words we are looking for.
Card catalogs in libraries worked in a similar manner (although they are rarely used any
longer). To find a book by a particular author, we would search in the author catalog,
and a card in the catalog tells us where to find the book. To assist us in searching the
catalog, the library would keep the cards in alphabetic order by authors, with one card
for each author of each book.
Database system indices play the same role as book indices or card catalogs in libraries.
For example, to retrieve an account record given the account number, the database
system would look up an index to find on which disk block the corresponding record
16
resides, and then fetch the disk block, to get the account record. Keeping a sorted list of
account numbers would not work well on very large databases with millions of
accounts, since the index would itself be very big; further, even though keeping the
index sorted reduces the search time, finding an account can still be rather time-
consuming. Instead, more sophisticated indexing techniques may be used.
We shall consider several techniques for both ordered indexing and hashing. No one
technique is the best. Rather, each technique is best suited to particular database
applications. Each technique must be evaluated on the basis of these factors:
Access types: The types of access that are supported efficiently. Access types can
include finding records with a specified attribute value and finding records whose
attribute values fall in a specified range.
Access time: The time it takes to find a particular data item, or set of items, using the
technique in question.
Insertion time: The time it takes to insert a new data item. This value includes the
time it takes to find the correct place to insert the new data item, as well as the time
it takes to update the index structure.
Deletion time: The time it takes to delete a data item. This value includes the time it
takes to find the item to be deleted, as well as the time it takes to update the index
structure.
Space overhead: The additional space occupied by an index structure. Provided that
the amount of additional space is moderate, it is usually worth while to sacrifice the
space to achieve improved performance.
B+ tree
17
As we have already seen in previous articles that B+ tree is a (key, value) storage
method in a tree like structure. B+ tree has one root, any number of intermediary nodes
(usually one) and a leaf node. Here all leaf nodes will have the actual records stored.
Intermediary nodes will have only pointers to the leaf nodes; it not has any data. Any
node will have only two leaves. This is the basic of any B+ tree.
Consider the STUDENT table below. This can be stored in B+ tree structure as shown
below. We can observe here that it divides the records into two and splits into left node
and right node. Left node will have all the values less than or equal to root node and the
right node will have values greater than root node. The intermediary nodes at level 2
will have only the pointers to the leaf nodes. The values shown in the intermediary
nodes are only the pointers to next level. All the leaf nodes will have the actual records
in a sorted order.
18
If we have to search for any record, they are all found at leaf node. Hence searching any
record will take same time because of equidistance of the leaf nodes. Also they are all
sorted. Hence searching a record is like a sequential search and does not take much
time.
Suppose a B+ tree has an order of n (it is the number of branches – above tree structure
has 5 branches altogether, hence order is 5), and then it can have n/2 to n intermediary
nodes and n/2 to n-1 leaf nodes. In our example above, n= 5 i.e.; it has 5 branches from
root. Then it can have intermediary nodes ranging from 3 to 5. And it can have leaf
nodes from 3 to 4.
19
The main goal of B+ tree is:
Sorted Intermediary and leaf nodes: Since it is a balanced tree, all nodes should
be sorted.
Fast traversal and Quick Search:
One should be able to traverse through the nodes very fast. That means, if we have to
search for any particular record, we should be able pass through the intermediary node
very easily. This is achieved by sorting the pointers at intermediary nodes and the
records in the leaf nodes.
Any record should be fetched very quickly. This is made by maintaining the balance in
the tree and keeping all the nodes at same distance.
No overflow pages: B+ tree allows all the intermediary and leaf nodes to be
partially filled – it will have some percentage defined while designing a B+ tree.
This percentage up to which nodes are filled is called fill factor. If a node reaches
the fill factor limit, then it is called overflow page. If a node is too empty then it is
called underflow. In our example above, intermediary node with 108 is
underflow. And leaf nodes are not partially filled, hence it is an overflow. In ideal
B+ tree, it should not have overflow or underflow except root node.
Searching a record in B+ Tree
Suppose we want to search 65 in the below B+ tree structure. First we will fetch for the
intermediary node which will direct to the leaf node that can contain record for 65. So
we find branch between 50 and 75 nodes in the intermediary node. Then we will be
redirected to the third leaf node at the end. Here DBMS will perform sequential search
to find 65. Suppose, instead of 65, we have to search for 60. What will happen in this
20
case? We will not be able to find in the leaf node. No insertions/update/delete is
allowed during the search in B+ tree.
Insertion in B+ tree
Suppose we have to insert a record 60 in below structure. It will go to 3 rd leaf node after
55. Since it is a balanced tree and that leaf node is already full, we cannot insert the
record there. But it should be inserted there without affecting the fill factor, balance
and order. So the only option here is to split the leaf node. But how do we split the
nodes?
21
The 3rd leaf node should have values (50, 55, 60, 65, 70) and its current root node is 50.
We will split the leaf node in the middle so that its balance is not altered. So we can
group (50, 55) and (60, 65, 70) into 2 leaf nodes. If these two has to be leaf nodes, the
intermediary node cannot branch from 50. It should have 60 added to it and then we
can have pointers to new leaf node.
This is how we insert a new entry when there is overflow. In normal scenario, it is simple
to find the node where it fits and place it in that leaf node.
Delete in B+ tree
Suppose we have to delete 60 from the above example. What will happen in this case?
We have to remove 60 from 4 th leaf node as well as from the intermediary node too. If
we remove it from intermediary node, the tree will not satisfy B+ tree rules. So we need
to modify it have a balanced tree. After deleting 60 from above B+ tree and re-arranging
nodes, it will appear as below.
22
Failure classification and storage structure and recovery and shadow paging.
Failure
System errors − Where the database system itself terminates an active transaction
because the DBMS is not able to execute it, or it has to stop because of some system
condition. For example, in case of deadlock or resource unavailability, the system aborts
an active transaction.
The different types of failure in DBMS
Operating system failure.
Main memory crash.
Transaction failure or abortion.
System generated error like integer overflow or divide-by-zero error.
Failure of supporting software.
Power failure.
Types of Database Failures
There are many types of failures that can affect database processing. Some failures
affect the main memory only, while others involve secondary storage. Following are the
types of failure:
23
Hardware failures: Hardware failures may include memory errors, disk crashes,
bad disk sectors, disk full errors and so on. Hardware failures can also be attributed to
design errors, inadequate (poor) quality control during fabrication, overloading (use of
under-capacity components) and wear out of mechanical parts.
Software failures: Software failures may include failures related to software’s
such as, operating system, DBMS software, application programs and so on.
System crashes: System crashes are due to hardware or software ...
Failure Classification
To find that where the problem has occurred, we generalize a failure into the following
categories:
1. Transaction failure
2. System crash
3. Disk failure
1. Transaction failure
The transaction failure occurs when it fails to execute or when it reaches a point
from where it can't go any further. If a few transaction or process is hurt, then
this is called as transaction failure.
Reasons for a transaction failure could be -
1. Logical errors: If a transaction cannot complete due to some code error or
an internal error condition, then the logical error occurs.
2. Syntax error: It occurs where the DBMS itself terminates an active
transaction because the database system is not able to execute it. For
24
example, the system aborts an active transaction, in case of deadlock or
resource unavailability.
2. System Crash
o System failure can occur due to power failure or other hardware or
software failure. Example: Operating system error.
Fail-stop assumption: In the system crash, non-volatile storage is assumed
not to be corrupted.
3. Disk Failure
o It occurs where hard-disk drives or storage drives used to fail frequently. It
was a common problem in the early days of technology evolution.
o Disk failure occurs due to the formation of bad sectors, disk head crash,
and unreachability to the disk or any other failure, which destroy all or
part of disk storage.
Storage structure
Databases are stored in file formats, which contain records. At physical level, the actual
data is stored in electromagnetic format on some device. These storage devices can be
broadly categorized into three types −
25
Primary Storage − The memory storage that is directly accessible to the CPU
comes under this category. CPU's internal memory (registers), fast memory
(cache), and main memory (RAM) are directly accessible to the CPU, as they are
all placed on the motherboard or CPU chipset. This storage is typically very
small, ultra-fast, and volatile. Primary storage requires continuous power supply
in order to maintain its state. In case of a power failure, all its data is lost.
Secondary Storage − Secondary storage devices are used to store data for future
use or as backup. Secondary storage includes memory devices that are not a
part of the CPU chipset or motherboard, for example, magnetic disks, optical
disks (DVD, CD, etc.), hard disks, flash drives, and magnetic tapes.
Tertiary Storage − Tertiary storage is used to store huge volumes of data. Since
such storage devices are external to the computer system, they are the slowest
in speed. These storage devices are mostly used to take the back up of an entire
system. Optical disks and magnetic tapes are widely used as tertiary storage.
Storage System in DBMS
A database system provides an ultimate view of the stored data. However, data in the
form of bits, bytes get stored in different storage devices.
26
In this section, we.
Types of Data Storage
For storing the data, there are different types used for storing the data:
o Primary Storage
o Secondary Storage
o Tertiary Storage
Memory Hierarchy
Thus, on arranging the above-described storage media in a hierarchy according to its
speed and cost, we conclude the below-described image:
27
In the image, the higher levels are expensive but fast. On moving down, the cost per bit
is decreasing, and the access time is increasing. Also, the storage media from the main
memory to up represents the volatile nature, and below the main memory, all are non-
volatile devices.
Recovery in DBMS
Database recovery is the process of restoring the database to a correct (consistent)
state in the event of a failure. In other words, it is the process of restoring the database
to the most recent consistent state that existed shortly before the time of system
failure.
Paging in DBMS
This is the method where all the transactions are executed in the primary memory or
the shadow copy of database. This mechanism assumes that there will not be any disk
failure and only one transaction executing at a time so that the shadow DB can hold the
data for that transaction. ...
28
Shadowing in DBMS
Shadow paging is a technique for providing atomicity and durability in database
systems. • Shadow paging is a copy-on-write technique for avoiding in-place updates of
pages. Instead, when a page is to be modified, a shadow page is allocated.
• Since the shadow page has no references (from other pages on disk), it can be
modified liberally, without concern for consistency constraints, etc. When the page is
ready to become durable, all pages that referred to the original are updated to refer to
the new replacement page instead. Because the page is "activated" only when it is
ready, it is atomic.
• This increases performance significantly by avoiding many writes on hotspots high up
in the referential hierarchy (e.g.: a file system superblock) at the cost of high commit
latency.
Shadow paging considers:
1. The database is partitioned into fixed-length blocks referred to as PAGES.
2. Page table has n entries – one for each database page.
3. Each contain pointer to a page on disk (1 to 1st page on database and so on…).
The idea is to maintain 2 pages tables during the life of transaction.
1. The current page tables
2. The shadow page table
When transaction starts, both page tables are identical
1. The shadow page table is never changed over the duration of the transaction.
2. The current page table may be changed when a transaction performs a write
operation.
3. All input and output operations use the current page table to locate database
pages on disk.
29
Advantages:
• No Overhead for writing log records.
• No Undo / No Redo algorithm.
• Recovery is faster.
Disadvantages:
• Data gets fragmented or scattered.
• After every transaction completion database page containing old version of modified
data need to be garbage collected.
• Hard to extend algorithm to allow transaction to run concurrently.
30
Decision Support System (DSS) and Data Analysis in Database Management System
Introduction:
Computerized decision support systems became practical with the development of
minicomputers, timeshare operating systems and distributed computing. The history of
the implementation of such systems begins in the mid-1960s. In a technology field as
diverse as DSS, chronicling history is neither neat nor linear. Different people perceive
the field of Decision Support Systems from various vantage points and report different
accounts of what happened and what was important (cf., Arnott & Pervan, 2005; Eom &
Lee, 1990b; McCosh & Correa-Perez, 2006; Power, 2003; Power, 2004a; Silver, 1991).
As technology evolved new computerized decision support applications were
developed and studied. Researchers used multiple frameworks to help build and
understand these systems.
Decision Support System (DSS):
A decision support system (DSS) is an information system that aids a business in
decision-making activities that require judgment, determination, and a sequence of
actions. While Data Analysis is the process of collecting and organizing data in order to
draw helpful conclusions from it. The process of data analysis uses analytical and logical
reasoning to gain information from the data. The information system assists the mid and
high-level management of an organization by analyzing huge volumes of unstructured
data and accumulating information that can help to solve problems and help in decision-
making.
A DSS is also human-powered, automated, or a combination of both.
Brief History of Decision Support Systems
31
Information Systems researchers and technologists have built and investigated
computerized Decision Support Systems (DSS) for approximately 40 years. This article
chronicles and explores the developments related to building and deploying DSS. The
journey begins with building model-driven DSS in the late 1960s, theory developments
in the 1970s, and implementation of financial planning systems, spreadsheet-based DSS
and Group DSS in the early and mid 1980s. Data warehouses, Executive Information
Systems, OLAP and Business Intelligence evolved in the late 1980s and early 1990s.
Finally, the chronicle ends with knowledge-driven DSS and the implementation of Web-
based DSS beginning in the mid-1990s. The field of computerized decision support is
expanding to use new technologies and to create new applications.
Purpose of a Decision Support System
A decision support system produces detailed information reports by gathering and
analyzing data. Hence, a DSS is different from a normal operations application, whose
goal is to collect data and not analyze it.
In an organization, a DSS is used by the planning departments such as the operations
department which collects data and creates a report that can be used by managers for
decision-making. Mainly, a DSS is used in sales projection, for inventory is a current
asset account found on the balance sheet, consisting of all raw materials, work in
progress, and finished goods that are operations related data, and to present
information to customers in an easy to understand manner.
Refer to the figure below for graphical representation:
32
Analysis of
business
data
Computer
program
application
Making
business Presentation
of business
decisions data
easier
Theoretically, a DSS can be employed in various knowledge domains from an
organization to forest management and the medical field. One of the main applications
of a DSS in an organization is real-time reporting. It can be very helpful for organizations
that take part in just-in-time (JIT)Just in Time (JIT) Method The Just in Time (JIT) style of
inventory management – also sometimes referred to as the Toyota Production System
(TPS) – is a strategy of managing inventory [Link] a JIT inventory system, the
organization requires real-time data of their inventory levels to place orders “just in
time” to prevent delays in production and cause a negative domino effect. Therefore, a
DSS is more tailored to the individual or organization that is making the decision rather
than a traditional system.
Components of a Decision Support System
The three main components of a DSS framework are:
1. Model Management System
33
2. The model management system S=stores models that managers can use in their
decision-making. The models are used in decision-making regarding the financial
health of the organization and forecasting demand for a good or service.
2. User Interface
The user interface includes tools that help the end-user of a DSS to navigate through the
system.
3. Knowledge Base
The knowledge base includes information from internal sources (information collected
in a transaction process system) and external sources (newspapers and online
databases).
Decision Support System - Components
Types of Decision Support Systems
1. Communication-driven: Allows companies to support tasks that require more
than one person to work on the task. It includes integrated tools such as
Microsoft SharePoint Workspace and Google Docs.
2. Model-driven: Allows access to and the management of financial, organizational,
and statistical models. Data is collected, and parameters are determined using
the information provided by users. The information is created into a decision-
making model to analyze situations. An example of a model-driven DSS is
Dicodess – an open-source model-driven DSS.
3. Knowledge-driven: Provides factual and specialized solutions to situations by
using stored facts, procedures, rules, or interactive decision-making structures
like flowcharts Flowchart Templates Flowcharts are great for describing business
processes concisely without compromising on structure and detail. Below are
four sample flowchart templates.
34
4. Document-driven: Manages unstructured information in different electronic
formats.
5. Data-driven: Helps companies to store and analyze internal and external data.
Advantages of a Decision Support System
A decision support system increases the speed and efficiency of decision-making
activities. It is possible, as a DSS can collect and analyze real-time data.
It promotes training within the organization, as specific skills must be developed
to implement and run a DSS within an organization.
It automates monotonous managerial processes, which means more of the
manager’s time can be spent on decision-making.
It improves interpersonal communication Being able to communicate effectively
is one of the most important life skills to learn. Communication is defined as
transferring information to produce greater understanding. It can be done vocally
(through verbal exchanges), through written media (books, websites, and
magazines), visually (using graphs, charts, and maps) or non-verbally within the
organization.
Disadvantages of a Decision Support System
The cost to develop and implement a DSS is a huge capital investment, which
makes it less accessible to smaller organizations.
A company can develop a dependence on a DSS, as it is integrated into daily
decision-making processes to improve efficiency and speed. However, managers
35
tend to rely on the system too much, which takes away the subjectivity aspect of
decision-making.
A DSS may lead to information overload because an information system tends to
consider all aspects of a problem. It creates a dilemma for end-users, as they are
left with multiple choices.
Implementation of a DSS can cause fear and backlash from lower-level
employees. It is because many of them are not comfortable with new technology
and are afraid of losing their jobs to technology.
At last a decision support system produces detailed information reports by gathering
and analyzing data. Hence, a DSS is different from a normal operations application,
whose goal is to collect data and not analyze it. As such, Data Analysis is concerned with
the NATURE and USE of data. It involves the identification of the data elements which
are needed to support the data processing system of the organization, the placing of
these elements into logical groups and the definition of the relationships between the
resulting groups.
Data Analysis:
In data analysis we analyze the data and build a systems representation in the form of a
data model (conceptual). A conceptual data model specifies the structure of the data
and the processes which use that data.
Data Analysis = establishing the nature of data.
Functional Analysis = establishing the use of data.
However, since Data and Functional Analysis are so intermixed, we shall use the term
Data Analysis to cover both.
36
In simple words, data analysis is the process of collecting and organizing data in order to
draw helpful conclusions from it. The process of data analysis uses analytical and logical
reasoning to gain information from the data.
The main purpose of data analysis is to find meaning in data so that the derived
knowledge can be used to make informed decisions.
Data analysis in today’s businesses
Data analytics is used in business to help organizations make better business decisions.
Whether it’s market research, product research, positioning, customer reviews,
sentiment analysis, or any other issue for which data exists, analyzing data will provide
insights that organizations need in order to make the right choices.
Data analytics is important for businesses today, because data-driven choices are the
only way to be truly confident in business decisions. Some successful businesses may be
created on a hunch, but almost all successful business choices are data-based.
Examples of data analysis
Data analysis is a somewhat abstract concept to understand without the help of
examples. So in order to better illustrate how and why data analysis is important for
businesses, here are the 4 types of data analysis and examples of each:
1. Descriptive Analysis: Descriptive data analysis looks at past data and tells what
happened. This is often used when tracking Key Performance Indicators (KPIs),
revenue, sales leads, and more.
2. Diagnostic Analysis: Diagnostic data analysis aims to determine why something
happened. Once your descriptive analysis shows that something negative or
positive happened, diagnostic analysis can be done to figure out the reason. A
business may see that leads increased in the month of October and use
diagnostic analysis to determine which marketing efforts contributed the most.
37
3. Predictive Analysis: Predictive data analysis predicts what is likely to happen in
the future. In this type of research, trends are derived from past data which are
then used to form predictions about the future. For example, to predict next
year’s revenue, data from previous years will be analyzed. If revenue has gone up
20% every year for many years, we would predict that revenue next year will be
20% higher than this year. This is a simple example, but predictive analysis can be
applied to much more complicated issues such as risk assessment, sales
forecasting, or qualifying leads.
4. Prescriptive Analysis: Prescriptive data analysis combines the information found
from the previous 3 types of data analysis and forms a plan of action for the
organization to face the issue or decision. This is where the data-driven choices
are made.
These 4 types of data analysis can be applied to any issue with data related to it.
Q2 What is Data Mining?
Data Mining is a process of finding potentially useful patterns from huge data sets. It is a
multi-disciplinary skill that uses machine learning, statistics, and AI to extract
information to evaluate future events probability. The insights derived from Data Mining
are used for marketing, fraud detection, scientific discovery, etc.
Types of Data Mining
Data mining can be performed on following types of data
Relational databases
Data warehouses
Advanced DB and information repositories
Object-oriented and object-relational databases
Transactional and Spatial databases
38
Heterogeneous and legacy databases
Multimedia and streaming database
Text databases
Text mining and Web mining
Data Mining Implementation Process
Data Mining Techniques
1. Classification:
This analysis is used to retrieve important and relevant information about data, and
metadata. This data mining method helps to classify data in different classes.
2. Clustering:
Clustering analysis is a data mining technique to identify data that are like each other.
This process helps to understand the differences and similarities between the data.
3. Regression:
Regression analysis is the data mining method of identifying and analyzing the
relationship between variables. It is used to identify the likelihood of a specific variable,
given the presence of other variables.
RECOVERY WITH CONCURRENT TRANSACTIONS
39
Concurrency control means that multiple transactions can be executed at the same time
and then the interleaved logs occur. However, there may be changes in transaction
results so maintain the order of execution of those transactions.
During recovery, it would be very difficult for the recovery system to backtrack all the
logs and then start recovering.
Recovery with concurrent transactions can be done in the following four ways.
1 Interaction with concurrency control.
2 Transaction rollback.
3 Checkpoints.
4 Restart recovery.
a) Interaction with concurrency control:
In this scheme, the recovery scheme depends greatly on the concurrency control
scheme that is used. So, to rollback a failed transaction, we must undo the updates
performed by the transaction.
b) Transaction rollback:
In this scheme, we rollback a failed transaction by using the log.
The system scans the log backward a failed transaction, for every log record found in the
log the system restores the data item.
c) Checkpoints:
A checkpoint is a process of saving a snapshot of the applications state so that it can
restart from that point in case of failure.
Checkpoint is a point of time at which a record is written onto the database form the
buffers.
Checkpoint shortens the recovery process.
40
When it reaches the checkpoint, then the transaction will be updated into the database,
and until that point, the entire log file will be removed from the file. Then the log file is
updated with the new step of transaction until the next checkpoint and so on.
The checkpoint is used to declare the point before which the DBMS was in the
consistent state, and all the transactions were committed.
To ease this situation, Check point Concept is used by the most DBMS.
In this scheme, we used checkpoints to reduce the number of log records that the
system must scan when it recovers from a crash.
In a concurrent transaction processing system, we require that the checkpoint log
record be of the form <checkpoint L>, where ‘L’ is a list of transactions active at the time
of the checkpoint.
A fuzzy checkpoint is a checkpoint where transactions are allowed to perform updates
even while buffer blocks are being written out.
d) Restart recovery:
When the system recovers from a crash, it constructs two lists.
The undo-list consists of transactions to be undone, and the redo-list consists of
transaction to be redone.
The system constructs the two lists as follows: Initially, they are both empty. The system
scans the log backward, examining each record, until it finds the first <checkpoint>
record.
Attention reader! Don’t stop learning now. Get hold of all the important CS Theory
concepts for SDE interviews with the at a student-friendly price and become industry
ready.
41
If it is a case of execution of two or more transactions, the interleaving of logs will
occur. It will become very inconvenient for the system of recovery to backtrack
every record and start the process of recovering.
Most database management systems make use of the concept of “checkpoint” to
make the situation more convenient and more manageable.
Whenever more than one transaction is being executed, then the interleaved of
logs occur. During recovery, it would become difficult for the recovery system to
backtrack all logs and then start recovering.
To ease this situation, 'checkpoint' concept is used by most DBMS.
The idea of “checkpoint” has already been extensively discussed under the Transaction
Processing part of the tutorial. So, please, take your time and revise the concepts before
proceeding further with the theories of DBMS.
BUFFER MANAGEMENT
Definition
The database buffer is a main-memory area used to cache database pages. Database
processes request pages from the buffer manager, whose responsibility is to minimize
the number of secondary memory accesses by keeping needed pages in the buffer.
Because typical database workloads are I/O-bound, the effectiveness of buffer
management is critical for system performance.
Historical Background
Buffer management was initially introduced in the 1970s, following the results in virtual
memory systems. One of the first systems to implement it was IBM System-R. The high
cost of main-memory in the early days forced the use of very small buffers, and
consequently moderate performance improvements.
42
Foundations
The buffer is a main-memory area subdivided into frames, and each frame can contain a
page from a secondary storage database file. Database pages are requested from the
buffer manager. If the requested page is in the buffer, it is immediately returned to the
requesting process with
BUFFER MANAGEMENT
DBMS application programs require input/output (I/O) operations, which are performed
by a component of operating system. These I/O operations normally use buffers to
match the speed of the processor and the relatively fast main (or primary) memories
with the slower secondary storages and also to minimize the number of I/O operations
between the main and secondary memories wherever possible. The buffers are the
reserved blocks of the main memory. The assignment and management of memory
blocks is called and the component of the operating system that performs this task is
called buffer manager. The buffer manager is responsible for the efficient management
of the database buffers that are used to transfer (flushing) pages.
BUFFER MANAGEMENT
The buffer manager is the software layer that is responsible for bringing pages from
physical disk to main memory as needed. The buffer manages the available main
memory by dividing the main memory into a collection of pages, which we called as
buffer pool. The main memory pages in the buffer pool are called frames.
Page Requests from Higher Level
43
Buffer Management in a DBMS
Data must be in RAM for DBMS to operate on it!
Buffer manager hides the fact that not all data is in RAM.
The goal of the buffer manager is to ensure that the data requests made by programs
are satisfied by copying data from secondary storage devices into buffer. Infact, if a
program performs reading from existing buffers. Similarly, if a program performs an
output statement: it calls the buffer manager for output operation - to satisfy the
requests by writing to the buffers. Therefore, we can say that input and output
operation occurs between the program and the buffer area only. If an input operation
does not find the requested page in the buffer pool, then the buffer manager (software)
will have to do a physical transfer the page from the secondary memory (disk) to a free
block in buffer pool and then make the requested page placed in the buffer pool, that is,
available to the program requesting the original input operation. A similar scenario will
take place in the revers order for an output operation. That is, the buffer manager
(software) makes a new empty buffer a available to the program for outputting the
page. If there is no space in the buffer pool then the buffer manager (software)
physically transfer one of the page from buffer pool to the disk (secondary memory) to
44
provide the empty space in the buffer pool for the output operation of the program to
display the page in the buffer pool.
In addition to the buffer pool itself, the buffer manager maintains same block keeping
information and two variables for each frame in the pool; pin - count and dirty. The
number of times the page is requested in the frame - each time the pin - count variable
is incremented for that frame (because that page is in this frame). For satisfying each
request of the user; the pin - counter variable is decremented each time for that frame.
Thus, if a page is requested the pin - count is incremented; if it fulfills the request the
pin - count is decremented. In addition to this, if the page has been modified the
Boolean variable; dirty is set as 'on'. Otherwise, 'off '.
45
46