Database Management System Overview
Database Management System Overview
BACHELOR OF COMPUTER
APPLICATIONS
SEMESTER 3
DCA2102
DATABASE MANAGEMENT SYSTEM
Unit 1
Database Management System Concepts
Table of Contents
1. INTRODUCTION
In this unit, we will introduce the basic concepts of DBMS. Database technology development
evolved rapidly in the three decades since the rise and eventual dominance of relational
database systems. While many specialized database systems (spatial, object-oriented,
multimedia, etc.) have found substantial user communities in the science and engineering
sections, relational systems remain the preferred database technology for business
enterprises.
1.1 Objectives:
By the end of Unit 1, the learners should be able to:
2. SIGNIFICANCE OF DATABASE
The database is a collection of related data. A data item is the smallest identified unit of data
that has value in the real world – for example, last name, first name, street address, ID
number, or political party – and is the fundamental component of a file in a file system. A
group of related data items considered as a single unit by an application is called a record.
Examples of types of records are salesperson, customer, order, product,and department.
A file is a collection of several records of a single type.
Database System: It is the DBMS software together with the data [Link], the
applications are also included.
Examples of database management systems software are ORACLE, SQL Server, MS Access,
DB2, SYBASE, INFORMIX, etc.
A database can handle accounting and filing, and business inventory, and use the information
in its files to prepare summaries, estimates, and other reports. There can be a database that
stores books, newspaper articles, magazines, and comics. On practically every issue, there is
already a well-defined market for specialised knowledge for a small number of customers.
Self-Assessment Questions – 1
So we can say that databases form an essential part of almost all enterprises today. During
the last four decades of the twentieth century, the useof databases grew in all enterprises.
Few people engaged directly with database systems in the early days, but they did so
indirectly — through printed reports such as credit card statements, or through agents such
as bank tellers and airline reservation agents – without recognising it. Then automated teller
machines came along and let users interact directly with databases. Phone interfaces to
computers also facilitated users to deal with databases directly. For example – a caller could
dial a number, and press phone keys to enter information or selectalternative options, to find
flight arrival/departure times.
During the late 1990s, the internet revolution sharply increased direct user access to
databases. Organizations converted many of their phone interfaces to databases into Web
interfaces and facilitates a different kinds of services and information available online. For
example, when you access a Web site, information about you may be retrieved from a
database, to selectwhich advertisements should be shown to you.
When you explore a book or music collection in an online retailer, you are accessing data
stored in a database.
When you go to a bank's website to get your account balance and transaction history, the
data is pulled from the bank's database system. Moreover, data related to Web access may
be stored in a database. Thus, although user interfaces hide details of access to a database,
and most people are unaware they are dealing with a database, accessingdatabases forms an
essential part of almost everyone’s life these days. The importance of database systems can
be judged in another way – today, database system vendors like Oracle are among the largest
software companies in the world, and database systems form an essential part of the product
line of more diversified companies like IBM and Microsoft.
Self-Assessment Questions – 2
4. DATA INDEPENDENCE
We can define data independence as the capability to modify the schema definition at one
level without disturbing a schema definition at the next higher level. It is usually understood
from two points of view: physical data independence and logical data independence.
Physical data independence permits changes in the physical storage devices or organization
of the files to be made without causing changes in the conceptual view or any of the external
views and hence in the application programs using the database. Thus, the files may migrate
from one physicalmedia to another or the file structure may change without requiring any
changes in the application programs.
Logical data independence refers that application programs need not be changed if fields are
added/deleted to/from an existing record. Logical data independence indicates that the
conceptual schema can be changed without affecting the existing external schemas. In a
database environment, data independence is advantageous, as it allows for changes at one
level of the database, without requiring any change in other levels. The mappings between
the layers absorb these changes.. Since application programs are heavily dependent on the
logical structure of the data they access, so it is more typical to achieve logical data
independence than physical independence.
In many respects, the concept of data independence is similar to the conceptof abstract data
type in programming languages like C++. Both hide implementation details from the users.
This facilitates users to concentrate on the general structure rather than low-level complex
implementation details.
Self-Assessment Questions – 3
Data modeling is preceded by planning and analysis. The effort done to this stage is
proportional to the scope of the database. The planning and analysis of a database help to
serve the needs of an enterprise and requiresmore effort than one intended to serve a small
workgroup.
Entities
These are the main data objects about which information is to be collected; they usually
denote a person, place, thing, or event of informational interest.A particular occurrence of an
entity is said to be an entity instance or sometimes an entity occurrence. Suppose a
Company database; here department, division, project, skill, employee, and location are all
examples of entities. In Unit 3, you'll obtain more examples of entities (Entity- Relationship
Model)
Attributes
These are characteristics of entities that provide details about them.
Employees may have attributes such as emp-id, emp-name, phone-no, fax-no, job-title, emp-
address, and so on.
There are two types of attributes: identifiers and descriptors. An identifier (or key) is used
to uniquely identify an instance of an entity also known as a key attribute; a descriptor (or
monkey attribute) is used to specify a non-unique characteristic of a particular entity
instance. Both identifiers and descriptors may consist of single or composite attribute.
The relationship construct is a diamond box that connects the associated entities. The
relationship name can be written inside or just outside the diamond box.
A role is the name of one end of a relationship when each end needs a distinct name for clarity
of the relationship. The individual duties of each entity in the relationship are clearly defined
by the entity names paired with the relationship name. However, in some cases, role names
should be used to clarify ambiguities. Role names are generally nouns. More information
about relationships and their types is given in Unit 3 (Entity-Relationship Model).
Relationships Types
The contents of a database must conform to specific limitations defined by an E-R enterprise
schema.. Cardinality ratios or mapping cardinalities shows the number of entities to which
another entity can be associated through a relationship set.
For a binary relationship set R between two entity sets A and B, the mapping cardinality
must be one of the following:
• One-to-one (1:1): Each entity in A is linked to only one entity in B, and each entity in
B is linked to only one entity in A.
• A one-to-many (1:N) relationship exists between an entity in A and any number of
entities in B.
However, an entity in B can only be linked to one entity in A.
• Many-to-one (N:1) means that an entity in A is linked to only one entity in B.
However, one entity in B can be linked to any number of entities in A.
• Many-to-many (M: N" ): An entity in A is linked to an unlimited number of entities in
B. A single entity in B can be linked to any number of entities in A
Self-Assessment Questions – 4
Out of several advantages, one of the main advantages of using a database system is that the
organization can exert, through the DBA, centralized management and control over the data.
Any application that necessitates a change in the structure of a data record must make prior
arrangements with the DBA, who will make the appropriate changes.
6.1 Advantages
Sharing Data
A database permits the sharing of data under its control by any number of users or
application programs.
It also reduces the need for further processing to find the relevant data in a huge amount of
data.
Further
Any DBMS redundancies are managed, and the system assures that numerous copies of the
same data are consistent.
Data Integrity
Centralized control can also ensure that sufficient checks are incorporated in the DBMS to
facilitate data integrity. When we talk about data integrity, we're referring to the fact that
the data is accurate.
As a result, data values entering for storage could be double-checked to ensure that they are
within a certain range and are in the correct format. For example, consider the value for
the age of an employee may be in the range of 16 and 75. Another integrity check that
should be incorporated into the database is to ensure that if there is a reference to a certain
object, that objectmust exist. In the case of an automatic teller machine, for example, a user
isnot allowed to transfer funds from a nonexistent saving account to anexisting one.
Data Security
Data is of vital importance for any organization and may be confidential. Such confidential
data must be protected from unauthorized persons. The DBA, as the owner of the data in the
DBMS, can ensure that suitable access protocols are followed, such as proper authentication
schemas for DBMS access and additional checks before granting access to sensitive data.
For different types of data and procedures, multiple levels of security could be established.
The enforcement of security could be data value-dependent (e.g., a manager has access to
the salary details of employees in his or her department only), as well as data-type
dependent (e.g. the manager cannot access the medical history ofany employees, including
those in his/her department).
Conflict Resolution
Since the database is under the control of the DBA, she/ he should resolve the conflicting
requirements of various users and applications. In other words, the DBA selects the
appropriate file format and access mechanism for response-critical applications while
enabling less important apps to continue to use the database with a delayed response time.
Disadvantages
The expense of the DBMS system is a big disadvantage.. In addition to the cost of purchasing
or developing the software, the hardware has to be upgraded to facilitate the extensive
programs and the workspaces required for their storage and execution. The processing
overheadintroduced by the DBMS to implement security, integrity and sharing of the data
results in a degradation of the response and throughput times. Furthermore, an additional
While centralization reduces duplication, the lack of duplication requires that the database
be sufficiently backed up, so that, the data can be recoveredin the case of failure. In a DBMS
setting, backup and recovery operations are fairly difficult, and this is worsened in a
concurrent multi - user database [Link], a database system needs a certain
amount of controlled redundancies and duplication to allow access to related data items.
Due to centralization, the data is accessible from a single source namely the database.
As a result, the severity of security breaches and disruptions to the organization's operations
due to downtime and failures is increased.
Many of the problems caused by downtime and failures are reduced when a centralised
database is replaced with a federation of independent and cooperating distributed
databases.
Self-Assessment Questions – 5
7. DBMS Vs RDBMS
Currently, the market-leading DBMS products are all SQL DBMS products. They were
originally based on the relational database management model - but did not implement the
model completely accurately. First of all, the SQL DBMS permits the data to be queried based
on any column in any table. Relational/SQL data is easier to query than CODASYL,
hierarchical, or some other model.
Secondly, since the relational model is based on set theory its usefulness and accuracy have
a basis in mathematics. A basis in mathematics thatindeed is centuries old and proven.
Thirdly, a relational database describes data in terms of its natural structure only – which
means, it excludes all details having to do with machine representation. The comparison
between DBMS and RDBMS is shown in table 1.1.
Self-Assessment Questions – 6
8. SUMMARY
A database system is a collection of related files along with information about their
definition, interpretation, maintenance, and manipulation. A DBMS is a vital software
component of the database system. It consists of the collection of interrelated data and
programs to access that data. The main goal of a DBMS is to provide an environment that is
both efficient and convenient to use in retrieving desired information from and storing
information in the database.
The DBMS not only makes the integrated collection of reliable and accurate data available to
multiple applications and users but also prohibits unauthorized users to access the data. The
DBMS has its advantages and disadvantages. DBMS is different from RDBMS.
9. TERMINAL QUESTIONS
1. List out the database implicit properties.
2. What are the representative applications of Databases? List them.
3. Differentiate between physical data independence and logical dataindependence.
4. What are entities and attributes? Give one example.
5. What are relationships? Explain the relationship types.
6. Explain the Advantages and Disadvantages of the DBMS.
7. Compare DBMS with RDBMS.
10. ANSWERS
Self Assessment Questions
1. Data item
2. Universe of Discourse
3. Software
4. Database system
5. Airlines
6. Database
7. Automated teller
8. Telecommunication
9. Two
10. Physical
11. Logical
12. Difficult
13. data model
14. Planning
15. Entities
16. Instance
17. One-to-one
18. One
19. Centralized
20. Consistent
21. Data integrity
22. Vital
23. Cost
24. Easier
25. set theory
Terminal Questions
1. Implicit properties of a database are: (i) A database represents some aspect of the real
world, sometimes called the mini-world or universe of discourse (UoD). (ii) A database
is a logically coherent collection of data with some inherent meaning and so on. (Refer
to section 1.2 for detail)
2. List of representative applications of Databases are Banking, Airlines, Universities,
Credit card transactions, Telecommunication, etc. (Refer to section 1.3 for detail.)
3. Physical data independence allows changes in the physical storage devices or
organization of the files to be made without requiring changesin the conceptual view
whereas Logical data independence implies that application programs need not be
changed if fields are added to an existing record. (Refer to section 1.4 for detail)
4. Entities are the principal data objects about which information is to be collected.
Attributes are characteristics of entities that provide descriptive details about them.
(Refer to section 1.5 for detail)
5. Relationships represent real-world associations among one or more entities. A
relationship is indicated by the connectivity between entityoccurrences: one-to-one,
one-to-many, and many-to-many. (Refertosection 1.5.2 for detail)
6. The advantages of using a database system are that the organization can exert, via the
DBA, centralized management and control over the data apart from Reduction of
Redundancies, Data integrity, Data security, and Conflict resolution. (Refer to section
1.6.1 for detail)
7. Relational/SQL data is easier to query than hierarchical, CODASYL, or some other
model. SQL DBMS products were originally based on the relational database
management model – but did not implement the model fully or completely accurately.
(Refer to section 1.7)
BACHELOR OF COMPUTER
APPLICATIONS
SEMESTER 3
DCA2102
DATABASE MANAGEMENT SYSTEM
Unit 2
Database System Architecture
Table of Contents
2.4 Mapping - -
3 MySQL Architecture 2 2 8 – 10
4 SQL Server 2000 Architecture 3 3 11 – 12
5 Oracle Architecture 4 4 13 – 14
6 Database Management System Facilities - 5
1. INTRODUCTION
The architecture of a database system is greatly influenced by the underlying computer
system on which the database system runs. Database systems can be centralized, or client–
server, where one server machine executes work on behalf of multiple client machines.
Database systems can also be designed to exploit parallel computer architectures.
Distributed databases span multiple geographically separated machines. Distributed query
processing and directory systems are also described in this unit.
Also, the architecture of DBMS packages has evolved from the early monolithic systems,
where the whole DBMS software package was one tightly integrated system, to the modern
DBMS packages that are modularin design, with a client/server system architecture. A DBMS
can be considered as a buffer between application programs, end-users, and a database
designed to fulfill features of data independence. In 1975 the American National Standards
Institute Standards Planning and Requirements Committee (ANSI-SPARC) proposed a three-
level architecture for this buffer. This architecture identified three levels of abstraction.
These levels are sometimes referred to as schemas or views.
1.1 Objectives:
By the end of Unit 2, you should be able to understand:
❖ Three views of data i.e., three levels of architecture of dbms
❖ Architectures of mysql, sql server, and oracle architecture
❖ Database management systems - facilities and structure
❖ Database manager and database administrator
❖ Distributed processing and client-server architecture.
The view at each of these levels is described by a schema. A schema as mentioned earlier is
an outline or a plan that describes the records and relationships existing in the view. The
schema also describes the way in which entities at one level of abstraction can be mapped to
the next level. The overall design of the database is called the database schema. A database
schema includes such information as:
• Characteristics of data items such as entities and attributes
• Logical structure and relationship among those data items
• Format for storage representation
• Integrity parameters such as physical authorization and backup politics.
The concept of a database schema corresponds to programming language notion of the type
definition. A variable of a given type has a particular value ata given instant in time. The
Since each view is defined by a schema, there exist, several schemas in the database, and
these schemas are partitioned following three levels of data abstraction or views. At the
lower level, we have the physical schema; at the intermediate level, we have the conceptual
schema, while at the higher level we have a subschema. In general, a database system
supports one physical schema, one conceptual schema, and several subschemas.
Each external view is described through a schema called an external schema or subschema.
The external schema consists of the definition of thelogical records and the relationships in
the external view. The externalschema also contains the method of deriving the objects in
the external view from the objects in the conceptual view. The object includes entities,
attributes, and relationships.
which contains the definition of the stored record, the method of representing the data fields,
and the access aids used.
2.4 Mapping
Two mappings are required in a database system with three different views as shown in
Figure 2.1. A mapping between the external and conceptual levels gives the correspondence
among the records and the relationships of the external and conceptual levels.
Similarly, there is a mapping from a conceptual record to an internal one. Aninternal record
is a record at the internal level, not necessarily a stored record on a physical storage
device. The internal record of figure 2.2 maybe split up into two or more physical records.
The physical database is the data that is stored on secondary storage devices. It is made up
of records with certain data structures and organized in files. Consequently, there is an
additional mapping from the internal record to one or more stored recordson secondary
storage devices.
Self-Assessment Questions – 1
3. MYSQL ARCHITECTURE
MySQL is based on a tiered architecture, consisting of both primary subsystems and support
components, that interact with each other to read, parse, and execute queries, and to cache
and return query results.
Primary Subsystems
The MySQL architecture consists of five primary subsystems that worktogether to respond
to a request made to the MySQL database server:
• The Query Engine
• The Storage Manager
• The Buffer Manager
• The Transaction Manager
• The Recovery Manager
The organization of these features is shown in Figure 2.2. We’ll explaineach one briefly
to help you gain a better understanding of how the parts fit together.
The Syntax Parser decomposes the SQL commands it receives from callingprograms into a
form that can be understood by the MySQL engine. The objects that will be used are
identified, along with the correctness of the syntax. The Syntax Parser also checks the objects
being referenced to ensure that the privilege level of the calling program allows it to use
them.
The Query Optimizer then streamlines the syntax for use by the Execution Component, which
then prepares the most efficient plan of query execution. The Query Optimizer checks to see
which index should be used to retrieve the data as quickly and efficiently as possible. It
chooses one from among the several ways it has found to execute the query and then creates
a plan of execution that can be understood by the Execution Component.
The Execution Component then interprets the execution plan and, based onthe information
it has received, makes requests of the other components to retrieve the records.
The Storage Manager: The Storage Manager interfaces with the operating system (OS) to
write data to the disk efficiently. Because the storage functions reside in a separate
subsystem, the MySQL engine operates at a level of abstraction away from the operating
system. This means that if you port to a different operating system that uses a different
storagemechanism, for example, you can rewrite only the storage portion of the code while
leaving the rest of the engine as it is. With the help of MySQL’s Function Libraries, the Storage
Manager writes to disk all of the data in the user tables, indexes, and logs as well as the
internal system data.
The Buffer Manager: This subsystem handles all memory management issues between
requests for data by the Query Engine and the Storage Manager. MySQL makes aggressive
use of memory to cache result sets that can be returned as-is rather than making duplicate
requests to the Storage Manager; this cache is maintained in the Buffer Manager.
This is also the area where new records can be cached while waiting for the availability of
targeted tables and indexes. If any new data is needed, it’s requested from the Storage
Manager and placed in the buffer, before then being sent to the Query Engine.
and releases locks on various objects being used in transactions. Each transactional table
handler implements its own Transaction Managerto handle all locking and concurrency
needs.
The Recovery Manager: The Recovery Manager’s job is to keep copies of data for retrieval
later, in case of a loss of data. It also logs commands that modify the data and other significant
events inside the database.
The data storage needs of a modern corporation or government organization are very
complex. Some examples are:
• Online Transaction Processing (OLTP) systems must be capable ofhandling thousands
of orders placed at the same time.
• Organizations have many users who must continue working when they do not have
access to the network. Examples are mobile disconnected users, such as traveling sales
representatives or regional inspectors. These users must synchronize the data on a
notebook or laptop with thecurrent data in the corporate system, disconnect from the
network, record the results of their work while in the field, and then finally reconnect
with the corporate network and merge the results of their fieldwork into the corporate
data store.
Self-Assessment Questions – 3
12. ________ systems must be capable of handling thousands of orders placed at the
same time.
13. In ___________the data storage mechanism must be transparent to the users who
purchase the application.
5. ORACLE ARCHITECTURE
The Oracle server consists of physical files and memory components. Figure 2.4 displays the
architecture of the Oracle Database 9i. It is broadly divided into memory components which
form the Oracle instance and the physical database components, where different kinds of
data are stored.
Self-Assessment Questions – 4
The database management system maintains the information on the file structure, the
method used to efficiently access the relevant data (i.e., the access method). It also provides
a method whereby the application programs indicate their data requirements. The
application program could use a subset of the conceptual data definition language or a
separate language. The database system also contains mapping functions that allowit to
interpret the stored data for the application program.
The internal schema is specified in a somewhat similar data definition language called data
storage definition language. The definition of the internal view is compiled and maintained
by the DBMS. The compiled internal schema specifies the implementation details of the
internal database, including the access methods employed. This information is handled by
the DBMS; the user need not be aware of these details.
modification of existing data. The first of these data manipulation operations is called a
query. A query is a statement in the DML that requests the retrieval of data from the
database. The subset of the DML used to pose a query is known as a query language; however,
we use the terms DML and query language synonymously.
The DML provides commands to select and retrieve data from the [Link] are
also provided to insert, update, and delete records. They could be used in an interactive
mode or embedded in conventional programming languages such as Assembler, COBOL,
FORTRAN, Pascal,or PL/I. The data manipulation functions provided by the DBMS can be
invoked in application programs directly by procedure calls or by preprocessor statements.
The latter would be replaced by appropriate procedure calls by either a preprocessor or the
compiler.
Self-Assessment Questions – 5
The major components of a DBMS structure are explained below in Figure 2.5.
DDL Compiler: The DDL compiler converts the data definition statements into a set of
tables. These tables contain information concerning the database and are in a form that can
be used by other components of the DBMS.
File Manager: The file manager manages the allocation of space on disk storage and the data
structure used to represent information stored on disk. The file manager can be
implemented using an interface to the existing file subsystem provided by the operating
system of the host computer, or it can include a file subsystem written especially for the
DBMS.
Since the movement of data to and from the disk is slow relative to the speed of control
processing unit of computers, it is imperative that the database system structure data to
minimize the need to move data between disk and main memory. A database manager is a
program module that provides the interface between the low-level data stored in the
database and the application programs and queries submitted to the system. It is responsible
for interfacing with the file system.
One of the functions of a database manager is to convert the user's queries coming directly
via the query processor or indirectly via an applicationprogram from the user's logical view
to the physical file system. In addition, the tasks of enforcing constraints to maintain the
consistency and integrity of the data as well as its security are also performed by the
database manager. Synchronizing the simultaneous operations performed by concurrent
users is under the control of the data manager. It also performs backup and recovery
operations. Let us now summarize the important responsibilities of a Database manager:
• Interaction with a file manager: The raw data is stored on the disk using the file
system which is usually provided by a conventional operating system. The database
manager translates the various DML statements into low-level file system commands.
Thus the database manager is responsible for the actual storing, retrieving, and
updating of data in the database.
• Integrity enforcement: The data values stored in the database must satisfy certain
types of consistency constraints. For example, the balance of a bank account may
never fall below a prescribed amount(for example ` 200). Similarly, the number of
holidays per year an employee may be having should not exceed 25 days. These
constraints must be specified explicitly by the DBA. If such constraints are specified,
then the database manager can check whether updates to the database result in the
violation of any of these constraints and if so appropriate action may be imposed.
• Security enforcement: As discussed above, not every user of thedatabase needs to
have access to the entire content of the database. It is the job of the database manager
to enforce these securityrequirements.
• Backup and recovery: A computer system like any other mechanical or electrical
device is subject to failure. There are a variety of causes of
• such failure, including disk crash, power failure, and s/w errors. In eachof these cases,
information concerning the database is lost. It is the responsibility of the database
manager to detect such failures andrestore the database to a state that existed prior to
the occurrence of thefailure. This is usually accomplished through backup and recovery
procedures.
• Concurrency Control: When several users update the database concurrently, the
consistency of data may no longer be preserved. It is necessary for the system to control
the interaction among the concurrentusers, and achieving such control is one of the
responsibilities of the database manager.
Query Processor: The database user retrieves data by formulating a query in the data
manipulation language provided with the database. The query processor is used to interpret
the online user's query and convert it into an efficient series of operations, in a form capable
of being sent to the data manager for execution. The query processor uses the data dictionary
to findthe structure of the relevant portion of the database and uses this information in
modifying the query and preparing an optimal plan to access the database.
Mappings between the internal and the conceptual levels, as well as between the internal
and the conceptual levels, as well as between the conceptual and external levels, are also
defined by the DBA. Ensuring that appropriate measures are in place to maintain the
integrity of the database and that the database is not accessible to unauthorized users is
another responsibility. The DBA is responsible for granting permission to the users of the
database and stores the profile of each user in the database. This profile describes the
permissible activities of a user on that portion of the database accessible to the user via one
or more user views. The user profile can be used by the database system to verify that a
particular user can perform a given operation on the database.
The DBA is also responsible for defining procedures to recover the database from failures
due to human, natural, or hardware causes with minimal loss of data. This recovery
procedure should enable the organization to continue to function, and the intact portion of
the database should continue to be available.
they were used becomes more and more difficult. Of course, it is possible fora programmer
who has coined the available names to bear them in mind,but should the same author come
back to his program after a significant time, or should another programmer have to modify
the program, it would befound that it is extremely difficult to make a reliable account of the
purpose the data files were used for.
The problem becomes even more difficult when the number of data types that an
organization has in its database increases. It has also now perceived that the data of an
organization is a valuable corporate resource, and therefore some kind of an inventory, and
catalog of it must be maintained to assist in both the utilization and management of the
resource.
A comprehensive data dictionary would define data items, how they fit into the data
structure and how they relate to other entities inthe database. With the comprehensive
base of information, the data dictionary can serve several useful purposes connecting across
the whole spectrum of planning, determining information requirements, designing and
implementing operations, and revision. There is now a greater emphasis on having an
integrated system in which the data dictionary is part of the DBMS. In such a case the data
dictionary would store the information concerning the external, conceptual, and internal
levels of the databases. It would combine the source of each data field value that is from
where the authenticate value is obtained, the frequency of its use, and the audit trail
regarding the updates, including user identification with the time of each update.
The DBA uses the data dictionary in every phase of a database life cycle, starting from the
embryonic data gathering phase to the design, implementation, and maintenance phases.
The documentation provided by a data dictionary is as valuable to end-users and managers
as it is essential to the programmers. Users can plan their applications with the database
only if they know exactly what is stored in it. For example, the description of a data item in a
data dictionary may include its origin and other text description in plain English, in addition
to its data format. Thus users and managers will be able to see exactly what is available in
the database. You could consider a data dictionary to be a road map that guides users to
access information within a large database.
An ideal data dictionary should include everything a DBA wants to know about the database.
A data dictionary is implemented as a database so that users can query its content by either
interactive or batch processing. Whether or not the cost of acquiring a data dictionary system
is justifiable depends on the size and complexity of the information system. The cost-
effectiveness of a data dictionary increases as the complexity of an information system
increases. A data dictionary can be a great asset not only to the DBA for database design,
implementation, and maintenance but also to managers or end-users in their project
planning.
Self-Assessment Questions – 6
8. DISTRIBUTED PROCESSING
It is possible to make the distinction between two dimensions of distributed database
systems: distributed data and distributed processing. In this Unit, we provide an overview
of the topic of distributed processing. Distribution can also be discussed in terms of the
distribution of functions or processing. Figure 2.6 illustrates a system characterized by
distributed processing. Here, we also have four sites connected by a communications
network. However, in this configuration only site 1 stores any data. Sites 2, 3, and 4 act as
clients to this database, perhaps running particular information system applications. It is
for this reason we include client-server database systems within our discussion of
distributed database systems. Although most current client-server database systems do not
effectively distribute data, they do distribute functionality.
• Interface subsystem: This subsystem is responsible for managing interaction with the
end-user. It is generally referred to as the user interface
• Rules subsystem: This subsystem manages the application logic in terms of a defined
model of business rules. In a database application, these will comprise functions, which
primarily enforce both inherent and additional constraints on data
• Transaction subsystem: This subsystem acts as a link between the data subsystem and
the rules and interface subsystems. Querying, insertion, and update activity are
triggered at the interface, validated by the rules subsystem, and packaged as units
(transactions) that will initiate actions (responses or changes) in the data subsystem
• Data subsystem: This subsystem is responsible for managing the underlying data
needed by an application.
Most people tend to equate the term client-server with a network of PCs or workstations
connected by a network to a remote machine running a database server.
The server in this configuration is either set up as a file server or SQLserver. In a file
server situation, an SQL query expressed by a client will issue a request to the server for the
appropriate files needed by the query. The client will perform the query and extract the
relevant data. In an SQL server situation, the SQL statement will travel down the
communication line from client to server.
The server then executes the query and sends back only the extracted [Link], because
of the reduced communication traffic, most DBMS now offer SQL server facilities.
Self-Assessment Questions – 7
9. SUMMARY
This Unit gives the descriptions and different components of MySQL, SQL Server 2000, and
Oracle 9i architecture. The DBMS structure is defined by three views of data. A DBMS is a
major software system consisting of a number of elements. It provides users with DDL for
defining the external and conceptual view of the data and DML for manipulating the data
stored in thedatabase. The database manager is the component of DBMS that provides the
interface between the user and the file system. The database administration defines and
maintains the three levels of the database as well as the mapping between levels to insulate
die higher levels from changes that take place in the lower levels. The DBA is responsible for
implementing measures for ensuring the security, integrity, and recovery of the database.
11. ANSWERS
Self-Assessment Questions
1. Database
2. Subschema
3. External
4. Internal
5. One
6. Secondary
7. Five
8. Syntax Parser
9. Storage Manager
10. Query Engine
11. Transaction Manager
12. Online Transaction Processing (OLTP)
13. Independent Software Vendors (ISVs)
14. Three
15. Oracle server
16. Oracle Instance
17. data definition language (DDL)
18. data model
19. Procedural
20. DML Precompiler
21. Tables
22. database manager
23. query processor
24. DBA
25. Authorization
26. data dictionary
27. batch processing
28. Interface subsystem
29. data subsystem
30. rules
31. subordinate
32. local area network
Terminal Questions
1. The three-level architecture of DBMS includes the External Level or Subschema, the
Conceptual Level or Conceptual Schema, and the Internal Level or Physical Schema.
(Refer section 2 for detail)
2. The MySQL architecture consists of five primary subsystems that work together to
respond to a request made to the MySQL database server: the Query Engine, the Storage
Manager, the Buffer Manager, the Transaction Manager, and the Recovery Manager.
(Refer section 3 for detail)
3. Microsoft® SQL Server 2000 is applied in the different fields that meet the data storage
requirements of the largest data processing systems and commercial Web sites, yet at
the same time can provide easy-to-use data storage services to an individual or small
business. (Refer section 4 for detail)
4. The Oracle server consists of physical files and memory components. E.g. the Oracle
9i Database product is made up of three main components namely: (i) The Oracle
Server (ii) The Oracle Instance (iii) The Oracle Database (Refer section 5 for detail)
5. Two main types of facilities that are supported by the DBMS are the data definition
facility or data definition language (DDL), the data manipulation facility, or data
manipulation language (DML). (Refer section 6 for detail)
6. A database manager is a program module which provides the interface between the
low-level data stored in the database and the application programs and queries
submitted to the system. The DBA administers the three levels of the database and, in
consultation with the overall usercommunity, sets up the definition of the global view
or conceptual level of the database. (Refer section 7 for detail)
7. The main feature of Distributed processing is to connect communication networks
having different sites which can store data. They need to connect for processing stored
data or functions. (Refer section 8 for detail)
8. The main advantage of Client/Server Architecture is that the client process always
initiates requests and the server process always responds to requests so this type of
architecture is applicable with the network of PCs or workstations connected by a
network to a remote machine running a database server. (Refer section 8.2 for detail)
BACHELOR OF COMPUTER
APPLICATIONS
SEMESTER 3
DCA2102
DATABASE MANAGEMENT SYSTEM
Unit 3
Database Models and Implementation
Table of Contents
1. INTRODUCTION
A data model is an integrated collection of concepts for describing data, relationships
between data, and constraints on the data. A data modelallows us to treat a database as an
abstract machine. In other words, wecan concentrate on the principles of design divorced
from an immediate concern with implementation. Data models constitute formal languages
for defining data structures declaring integrity and for manipulating data. A data model is a
mechanism for specifying the schema of some database. Data models establish the principles
underlying DBMS. Every database, andindeed every DBMS, must adhere to the principles of
some data model. However, the term data model is somewhat ambiguous. In the database
literature, the term is used in a number of different senses, two of which are the most
important: that of architecture for data, and that of an integrated set of data requirements.
1.1 Objectives:
After completing Unit 3, you should be able to:
❖ Define the concepts of the Relational Model and its implementations
❖ Understand the concepts of other Data Models like Hierarchical Model, Network Data
Model, Object/Relational Model, Object-Oriented Model, and the Associative Database
Model.
❖ Understand the Entity-Relationship Model and its applications
❖ Differentiate between data models
This set of principles that define a data model may be divided into three major parts:
• Data definition – a set of principles concerned with how data isstructured.
• Data manipulation – a set of principles concerned with how data isoperated upon.
• Data integrity – a set of principles concerned with determining whichstates are
valid for a database.
Data definition involves defining an organization for data, a set of templates for the
organization of data. Data manipulation concerns the process of howthe data is accessed,
and how it is changed in the database. Data integrityis very much linked with the idea of
data manipulation, in the sense that integrity concerns the idea of what are valid changes
and invalid changes todata.
A database based on the relational model developed by E.F. Codd allows the definition of
data structures, storage and retrieval operations, and integrity constraints. In such a
database the data and relations between them are organized in tables. A table is a collection
of records and each record in a table contains the same fields. Properties of Relational
Tables:
• Values Are Atomic
• Each Row is Unique
• Column Values Are of the Same Kind
• The Sequence of Columns is Insignificant
• The Sequence of Rows is Insignificant
• Each Column has a unique Name
In the relational data model, the database is represented as a group of related tables. The
relational data model was introduced in 1970. It iscurrently the most popular model. The
mathematical simplicity, and ease of visualization of the relational data model, have
contributed to its success. The relational data model is based on the mathematics of set
theory, whosebasic components are the following.
A Relation may be defined in multiple ways. The Relation Schema R, denoted by R (A1, A2,
An), is made up of a relation name R and is a list of attributes A1, A2, An.
A tuple is an ordered set of values. Each value is derived from an appropriate domain. Each
row in the CUSTOMER table may be referred to as a tuple in the table and would consist of
four values.
<1759, "Rama Krishna", "101 Main 3rd Cross Manipal", "0820-2653487"> is a tuple
belonging to the CUSTOMER relation.
A relation may be regarded as a set of tuples (rows). Columns in a table are also called
attributes of the relation.
A domain has a logical definition, for example, “India_Phone_numbers” is the set of 10 digit
mobile numbers. A domain may have a data type or a format defined for it. The
India_Phone_numbers may have a format: (ddd)- ddddddd where each d is a decimal digit.
E.g., Dates have various formats such as a month, name, date, year or yyyy-mm-dd, or
dd,mm,yyyy, etc.
An attribute designates the role played by the domain. E.g., the domainDate may be used
to define attributes “Invoice-date” and “Payment-date”. The relation is formed over the
cartesian product of the sets; each set has values from a domain; that domain is used in a
specific role which isconveyed by the attribute name. For example, attribute Cust-name is
defined over the domain of strings of 25 characters. The role these strings play in the
CUSTOMER relation is that of the name of the customers.
Given R(A1, A2, , An)
r(R) dom (A1) X dom (A2) X X dom(An)
Where R is a schema of the relation and r(R) is a specific "value" orpopulation of R. Here R is
also called the intension of a relation and r is also called the extension of a relation.
Let S1 and S2 be domains, S1 = {0,1} and S2 = {a,b,c}. The Relation R canbe written as
R S1 X S2
r(R) = {<0,a> , <0,b> , <1,c> }
is one possible “state” or “population” or “extension” r(R), defined over domains S1 and S2.
It has three tuples.
Characteristics of Relations
• Ordering of tuples in a relation r(R): The tuples are not considered to be ordered,
even though they appear to be in the tabular form.
Key Constraints
Superkey of R: A set of attributes SK of R such that no two tuples in any valid relation
instance r(R) will have the same value for SK. That is, for any distinct tuples t1 and t2 in r(R),
t1[SK] ≠ t2[SK].
Key of R: A "minimal" superkey; that is, a superkey K such that removal of any attribute from
K results in a set of attributes that is not a superkey.
It has two keys Key1 = {State, Reg#}, Key2 = {SerialNo}, which are also superkeys.
{SerialNo, Make} is a superkey but not a key.
If a relation has several candidate keys, one is chosen arbitrarily to be theprimary key. The
primary key attributes are underlined.
Entity Integrity
Relational Database Schema: A set S of relation schemas that belong tothe same database.
S is the name of the database.
S = {R1, R2, ..., Rn}
Entity Integrity: The primary key attributes PK of each relation schemaR in S cannot
have null values in any tuple of r(R). This is because primary key values are used to identify
the individual tuples.
t[PK] null for any tuple t in r(R)
Referential Integrity
A constraint involving two relations (the previous constraints involve a singlerelation). Used
to specify a relationship among tuples in two relations, the referencing relation, and the
referenced relation.
Tuples in the referencing relation R1 have attributes FK (called foreign key attributes) that
reference the primary key attributes PK of the referenced relation R2. A tuple t1 in R1 is said
to reference a tuple t2 in R2 if t1[FK] = t2[PK].
Attribute: A table column. Other commonly used terms for attribute are property and field.
The set of permissible values for each attribute is called the domain for that attribute.
Tuple: A table row. A tuple is an instance of an entity or relationship or whatever is
represented by the relation.
Key: A single attribute or combination of attributes whose values uniquely identify the
tuples of the relation. That is, each row has a different value for the key attribute(s). The
relational model requires that every relation have a key and that:
There are two restrictions on the relational model that are sometimes circumvented in
practice:
• Duplicate tuples are not permitted. If two tuples are entered with the same value for
each and every attribute, they are considered to be the same tuple.
• No ordering of tuples within a relation is assumed. In practice, however, one method or
another of ordering tuples is often used.
One of the main advantages of the relational model is that it is conceptually simple and more
importantly based on the mathematical theory of relation. It also frees the users from details
of the storage structure and access methods. The relational model like all other models
consists of three basic components:
• a set of domains
• a set of relations operation on relations
• integrity rules
In this unit, we first provide the formal definition of a relational data model. The basic
operations of relational algebra are discussed in Unit 4.
In general, we say that a relation defined over n domains has a degree n or is n-ary. The
elements of this set are n-tuples. We shall distinguish between the definition of a relation
and the relation itself. We shall say that the definition of a relation gives a name to the
relation and specifies the components over which it is defined. These components are
referred to as relation attributes or attributes for short. An attribute has a domain associated
with it from which it takes on values. The relation itself, on the other hand, is the set of tuples
which constitute it at a given instance of time.
For example, a statement which says that a relation Supplier is built over attributes S#, P#,
SCITY having domains integer, character string respectively is the definition of the relation
Supplier. The relation itself is shown below. It must be noted at the time the definition of a
relation is just given, a relation with no tuples in it, i.e. a null relation is just given, a relation
with no tuples in it, i.e. a null relation, is created.
Supplier
S# P# SCITY
10 1 BANGALORE
10 2 BANGALORE
10 3 BANGALORE
11 1 BOMBAY
11 2 BOMBAY
Database designers can work with familiar tabular structures and data definition languages
(DDLs) while assimilating new object-management possibilities. Query and procedural
languages and call interfaces in ORDBMSs are familiar: SQL3, vendor procedural languages,
and ODBC, JDBC, and proprietary call interfaces are all extensions of RDBMS languages and
interfaces. And the leading vendors are, of course, quite well known: IBM, Inform ix, and
Oracle.
In contrast to a relational DBMS where a complex data structure must be flattened out to fit
into tables or joined together from those tables to form the in-memory structure, object
DBMSs have no performance overhead to storeor retrieve a web or hierarchy of interrelated
objects. This one-to-one mapping of object programming language objects to database
objects has two benefits over other storage approaches: it provides higher performance
management of objects, and it enables better management of the complex interrelationships
between objects. This makes object DBMSs better suited to support applications such as
financial portfolio risk analysis systems, telecommunications service applications, WWW
(World Wide Web) document structures, design and manufacturing systems, and hospital
patient record systems, which have complex relationships between data.
Self-Assessment Questions – 1
3. ENTITY-RELATIONSHIP MODEL
There are three basic notions that the E-R data model employs: entity sets, relationship sets,
and attributes. Consider an example COMPANY Database, the COMPANY is organized into
DEPARTMENTs. Each department has a name, number, and employee who manages the
department. We keep track of the start date of the department manager. Each department
controlsa number of PROJECTs. Each project has a name, number and is located at a single
location.
We store each EMPLOYEE’s social security number, address, salary, sex, and birth date. Each
employee works for one department but may work on several projects.
We keep track of the number of hours per week that an employee currently works on each
project. We also keep track of the direct supervisor of each employee.
Each employee may have a number of DEPENDENTs. For eachdependent, we keep track of
their name, sex, birth date, and relationship to the employee.
An entity is a “thing” or object in the real world that is distinguishable fromall other
objects. For example, each employee in an enterprise or companyis an entity. An entity has
a set of properties, and the values for some set of properties may uniquely identify an entity.
An entity set is a set of entities of the same type that share the same properties or attributes.
Attributes are properties used to describe an [Link] example, an EMPLOYEE entity may
have a Name, SSN, Address, Sex, BirthDate. A specific entity will have a value for each of its
attributes. For example a specific employee entity may have Name='John Smith',
SSN='123456789', Address ='731, Fondren, Houston, TX', Sex='M', BirthDate='09-JAN-55‘.
Each attribute has a value set (or data type) associated with it – e.g. integer, string, subrange,
enumerated type.
The attribute used in the E-R model can be characterized by the following attribute types.
• Simple
– Each entity has a single atomic value for the attribute. For example, SSN or Sex.
• Composite
– The attribute may be composed of several components. For example, Address
(Apt#, House#, Street, City, State, ZipCode, Country) or Name (FirstName,
An attribute of an entity type for which each entity must have a unique value, is called a key
attribute of the entity type. For example, SSN is the key attribute of the EMPLOYEE relation
schema. A key attribute may be composite. For example, Vehicle Tag Number is a key of the
CAR entity type with components (Number, State).
• Identify the entities: Look for general nouns in the requirement specification
document which are of business interest to business users.
• Find relationships: Identify the natural relationship and their cardinalities between
the entities.
• Identify the key attributes for every entity: Identify the attribute or set of attributes
which can identify the instance of the entity uniquely.
• Identify other relevant attributes: Identify other attributes which are of interest to
business users and which they want to store the informationin the database.
• Complete E-R diagram: Draw a complete E-R diagram with all attributes, including
the primary key.
• Review your results with your business users: Look at the list of attributes
associated with each entity to see if anything has been omitted.
A relationship relates to two or more distinct entities with a specific meaning. For example,
EMPLOYEE John Smith works on the Product X PROJECT orEMPLOYEE Franklin manages the
Research DEPARTMENT. Relationships of the same type are grouped or typed into a
relationship type. For example,the WORKS_ON relationship type in which EMPLOYEEs and
e1 R1 d1
e2 R2
d2
e3 R3
e4 d3
R4
e5
R5
e6
R6
Example:
Suppose that a DEPENDENT entity is identified by the dependent’s first name and birthdate,
and the specific EMPLOYEE that the dependent is related to. DEPENDENT is a weak entity
type with EMPLOYEE as its identifying entity type via the identifying relationship type
DEPENDENT_OF
Relationships Types
An E-R enterprise schema may define certain constraints to which the contents of a database
must conform. Mapping cardinalities, or cardinality ratios, express the number of entities
to which another entity can be associated via a relationship set.
For a binary relationship set R between entity sets A and B, the mapping cardinality must be
one of the following:
❖ One-to-one (1:1): An entity in A is associated with at most one entity in B, and an entity
B is associated with at most one entity in A. See Figure 3.2 (a).
❖ One-to-many (1:N): An entity in A is associated with any number of entities in B. An
entity in B, however, can be associated with at most one entity in A. See Figure 3.2 (b).
❖ Many-to-one (N:1): An entity in A is associated with at most one entityin B. An entity
in B, however, can be associated with any number of entities in A. See Figure 3.2 (c).
❖ Many-to-many (M:N): An entity in A is associated with any number of entities in B. An
entity in B is associated with any number of entities in A. See Figure 3.2 (d).
A B
A B
B1 A1 B1
A1
B2 A2 B2
A2
B3 A3 B3
A3
(a) (b)
A1 B1
A1 B1
A2 B2
A2 B2
A3 B3
A3 B3
A B A B
(c) (d)
Offers Course
Department
2. One course is taken up by multiple students and one student enrolls for multiple
courses, hence the relationship is many-to-many.
Enrolled Student
Course
by
3. Each department has one “Head of Department” (HOD), hence relationis one-to-one.
Self-Assessment Questions – 2
15. There are ____________basic notions that E-R data model employs.
16. An ______________is a “thing” or object in the real world that is
distinguishable from all other objects.
17. An entity set is a set of entities of the same type that share the same
properties or _____________.
18. Attributes are properties used to describe an ___________.
19. Each entity has a single atomic value for the attribute is called _________
attribute.
20. Address is an example for _________attribute.
21. The value for ______type of attribute can be derived from thevalues of
other related attributes or entities.
22. An attribute of an entity type for which each entity must have a unique
value is called a of the entity type.
23. One of the following is a demerit of the ER modeling.
a. Gives a higher level abstraction of the system.
b. Can be generalized and specialized based on needs.
c. Physical design derived from E-R Model may have some amount of
ambiguities or inconsistency.
d. Intuitive and helps in physical database creation
24. How many basics are there in E-R data model?
a. Three
b. Four
c. Five
d. Six
25. An is a “thing” or object in the real world that is
distinguishable from all other objects.
a. Relation
b. Entity
c. Attribute
d. Simple attribute
26. Pick out the composite attribute from the list of attributes
a. Sex
b. Address
c. SSN
d. Department number
27. ‘Color of the car and degrees of students’ are examples for the .
a. Null attribute
b. Derived attribute
c. Single valued
d. Multi valued
28. Identifying the natural relationship and their cardinalities between the
entities is a step of .
a. Identify the entities
b. Find relationships
c. Identify the key attributes for every entity
d. Identify other relevant attributes
29. ER diagram includes a graphical notation, which depicts entity classesas
.
a. Rectangles
b. Ovals
c. Diamonds
d. Circles
30. A is an association among several entities.
a. Relationship
b. Key
c. Partial key
d. Entity
31. An entity that does not have a key attribute is called
a. Weak entity types
b. Entity Types
c. Null attribute
d. Derived attribute
An Associative database has two fundamental data structures. There is aset of "Items" and
a set of "Links" that connect them together. In the "Item" structure entries have a unique
identifier, a name, and a type. Each entry in the "links" structure also has a unique identifier
together with the identifiers for the relevant "source", "verb" and "target" (the subject, verb,
and object from our sentences). This can be illustrated with the following two diagrams. For
clarity, the question of item type has been ignored in this illustration.
Items
Identifier Name
12 Red
41 Is a
76 Colour
14 Mary
81 Vegetarian
43 Eats
82 Plants
15 Ski Lessons
39 Start at
83 08:00
42 On
85 Sunday
Links
Identifier Source Verb Target
101 12 41 76
103 14 41 81
124 81 43 82
105 15 39 83
107 105 42 85
The last entry in the Links structure (107) shows how another entry (identifier 105) has
become the Source for that entry. The two entries show how you could store the "ski lesson"
sentence within the Links structure. Readers with some familiarity with Microsoft Access
will have seen the "AutoNumber" facility, that can be used to create a unique numeric
identifierfor any row in a given table. Here we see an identifier being assigned on a database-
wide basis. The number itself has no significance, it is simply required to be unique.
The Associative model structure is economical with storage space as there is no need to hold
available "spaces" for data that is not available, even if itis a normal part of a given data set.
This contrasts with relational databases.A relational database stores a minimum of a single
"null" byte for missing data items in any given row. Some relational databases reserve the
maximum space for a given column in every row. The Associative database makes the
storage of "custom" data for different users, or for other varying needs, straightforward and
"inexpensive" in terms of maintenance or network resources. If there is a need to store
different data about, say, different customers or customer groups in different countries, then
an Associative database can manage this more efficiently than a relational database.
The Associative model differentiates between what it calls Entities and Associations. An
entity is defined as being discrete and having an independent existence. An association is
something that depends upon one or more other things. An example may help with this. A
person or companyis an entity while a supplier or a customer or an employee is association
– their existence depends upon the role being played at any one time. Indeed it is possible
for an Entity to have multiple business roles simultaneously, each being recorded as an
association. If circumstances change, one or more of the associations may die away but the
entity would endure. The difference may seem a little moot at first but is designed to simplify
rather than complicate the data model.
to particular users or subject groups. Data that is relevant to one user group could be
invisible to another, and indeed may be replaced by an alternate data set.
The concept of a record is missing from the Associative model. To assemble all of the current
information on something as complex as (say) a sales order, the data storage will need to be
re-visited many times. This is a potential disadvantage, although it should be recognized
that a well a normalized relational database would probably also require a number of
data store reads to establish a similar data set. Some rough calculations based upon a small
personal sample would suggest that the Associative database would require more than four
times as many data reads as a relational database. If the process of reading a sequence of
links can be optimized, then this may go some way to minimizing the difference as
experienced by the user.
Self-Assessment Questions – 3
33. In the "Item" structure entries have a ______________ identifier, a name and a
type.
34. A relational database stores a minimum of a single byte formissing
data items in any given row.
35. A ___________is a list of Chapters.
36. The combination of Chapters and Profiles can simplify the ofthe
database to particular users.
37. The concept of a is missing from the Associative model.
5. SUMMARY
In this unit, we reviewed three major traditional data models used in current DBMSs. These
three models are primitive, classic and semantic data models. In the Primitive data, models
approach objects are represented by record structures grouped in file structures. The main
operations available are read and write operations over records. Classic data models are the
hierarchical, network and relational data models.
The hierarchical data model is an extension of the primitive data model discussed above. The
network is an extension of the hierarchical approach. The relational data model is a
fundamental departure from the hierarchical and network approaches. The main problem
with classic data models such as the relational data model is that they maintain a
fundamental record- orientation.
The hierarchical model evolved from the file-based system. It uses tree-type data structure
to represent the relationship among records. The hierarchical data model restricts each
record type to only t h e parent record type. Eachparent record type can have any
number of children record types.
In a network model, one child record may have more than one parent node. A network can
be converted into one or more trees by introducing redundant nodes.
The relational model is based on a collection of tables. A table is also called a relation. A tree
or network structure can be converted into a relational structure by separating each node in
the data structure into a relation.
The entity-relationship diagrams are useful in representing the relationship among entities.
They help in logical database design. We have also presented implementation schemes of
each of the traditional database models.
6. TERMINAL QUESTIONS
1. Distinguish between three major types of the architectural data model.
2. Describe the differences in meaning between the terms relation and relation
schema.
3. Explain the distinctions among the terms primary key, candidate key, and superkey.
4. Construct an E-R diagram for a car-insurance company whose customers own one or
more cars each. Each car has associated with it zero to any number of recorded
accidents.
5. Construct an E-R diagram for a hospital with a set of patients and a set of medical
doctors. Associate with each patient a log of the various tests and examinations
conducted.
6. Explain the difference between a weak and a strong entity set.
7. We can convert any weak entity set to a strong entity set by simply adding appropriate
attributes. Why, then, do we have weak entity sets? Self Assessment Questions
7. ANSWERS
Self-Assessment Questions
1. Data integrity
2. Record
3. Relational
4. Unique
5. Ordered
6. Three
7. Primary
8. Domain
9. Tuples
10. Tree
11. one to many
12. hierarchical
13. object
14. unification
15. Three
16. Entity
17. Attributes
18. Entity
19. Simple
20. Composite
21. Derived
22. key attribute
23. Physical design derived from E-R Model may have some amount ofambiguities or
inconsistencies.
24. Three
25. Entity
26. Address
27. Multi-valued
28. Find relationships
29. Rectangles
30. Relationship
31. Week-entity types
32. Two
33. Unique
34. "null"
35. Profile
36. Tailoring
37. Record
Terminal Questions
1. In terms of three generations, the architectural data models are:
(i) Primitive data models (ii) Classic data models (iii) Semantic data models. These
three data models can be distinguished in terms ofobject representation, hierarchical
network formation, and record-orientation. (Refer section 2)
2. A relation is a collection of tuples, each of which contains values for a fixed number of
attributes whereas the Relation Schema R, denoted byR (A1, A2, ..... An), is made up of
a relation name R and is a list of attributes A1, A2, An. (Refer section 2.1)
3. A different set of attributes which are able to identify any row in thedatabase is known
as super key. And minimal super key is termed as candidate key i.e. among set of super
keys one with minimum number ofattributes. If a relation has several candidate keys,
one is chosen arbitrarily to be the primary key. (Refer section 2.1)
4. To construct E-R diagram, identify car as entities, find naturalrelationships between
car and customer, identify the key attributes for every entity, identify other relevant
attributes related to car, customer, car-insurance to complete E-R diagram. Also review
results. (Refer section 3.1 for detail)
5. Follow the same steps as shown in Q. No. 4 and refer section 3.1 for detail)
6. A strong entity is independent of other entities and can exist on its own. A weak entity
is dependent on one or more other entities in order for it toexist. (Refer section 3.3 for
detail)
7. A weak entity is required because it is dependent on one or more other entities in order
for it to exist and it is often prevalent in database. (Refersection 3.3 for detail)
BACHELOR OF COMPUTER
APPLICATIONS
SEMESTER 3
DCA2102
DATABASE MANAGEMENT SYSTEM
Unit 4
File Organization for Conventional DBMS
Table of Contents
Indexed Sequential Access Method (ISAM) 11, 12, 13, 14, 15,
5 19 - 24
16, 17, 18
6 Virtual Storage Access Method (VSAM) 19 25 - 26
7 Summary 27
8 Terminal Questions 27
9 Answers 28
1. INTRODUCTION
Although a database system provides a high-level view of data, ultimately data have to be
stored as bits on one or more storage devices. A vast majority of databases today store data
on a magnetic disk and fetch data into main space memory for processing, or copy data onto
tapes and other backup devices for archival storage. The physical characteristics of storage
devices play a major role in the way data are stored, in particular because access to a random
piece of data on disk is much slower than memory access: Disk access takes tens of
milliseconds, whereas memory access takes a tenth of a microsecond.
1.1 Objectives:
By the end of unit 4, the learners should be able to understand:
• Tape storage. Tape storage is used primarily for backup and archival data. Although
magnetic tape is much cheaper than disks, access todata is much slower, because
the tape must be accessed sequentially from the beginning. For this reason, tape storage
is referred to assequential-access storage. In contrast, disk storage is referred to as
direct-access storage because it is possible to read data from any location on a disk.
The fastest storage media – for example, cache and the main memory – are referred to as
primary storage. The media in the next level in the hierarchy – for example, magnetic
disks – are referred to as secondary storage or online storage. The media in the lowest
level in the hierarchy –for example, magnetic tape and optical disk jukeboxes – are referred
to as tertiary storage, or offline storage.
In addition to the speed and cost of the various storage systems, there is also the issue of
storage volatility. Volatile storage loses its contents when the power to the device is
removed. In the hierarchy shown in Figure 4.1,the storage systems from main memory up
are volatile, whereas the storagesystems below main memory are nonvolatile. In the absence
of expensive battery and generator backup systems, data must be written to nonvolatile
storage for safekeeping.
The disk surface is logically divided into tracks, which are subdivided into sectors. A sector
is the smallest unit of information that can be read from or written to the disk. The read-
write head stores information on a sector magnetically as reversals of the direction of
magnetization of the magnetic material. There may be hundreds of concentric tracks on a
disk surface, containing thousands of sectors.
Each side of a platter of a disk has a read-write head, which moves across the platter to access
different tracks. A disk typically contains many platters, and the read-write heads of all the
tracks are mounted on a single assemblycalled a disk arm, and move together. The disk
platters mounted on a spindle and the heads mounted on a disk arm are together known
ashead–disk assemblies.
A fixed-head disk has a separate head for each track. This arrangement allows the computer
to switch from track to track quickly, without having to move the head assembly, but
because of the large number of heads, the device is extremely expensive. Some disk systems
have multiple disk arms, allowing more than one track on the same platter to be accessed at
a time. Fixed-head disks and multiple-arm disks were used in high-performance mainframe
systems, but are no longer in production.
A disk controller interfaces between the computer system and the actual hardware of the
disk drive. It accepts high-level commands to read or writea sector and initiates actions,
such as moving the disk arm to the right track and reading or writing the data. Disk
controllers also attach checksums to each sector that is written; the checksum is computed
from the data written to the sector. When the sector is read back, the controller computes
the checksum again from the retrieved data and compares it with the stored checksum; if the
data are corrupted, with a high probability the newly computed checksum will not match the
stored checksum. If such an error occurs, the controller will retry the read several times; if
the error continues to occur, the controller will signal a read failure. A disk controller
interface is shown in Figure 4.3.
In the storage area network (SAN) architecture, large numbers of disks areconnected by a
high-speed network to a number of server computers. The disks are usually organized locally
using redundant arrays of independentdisks (RAID) storage organizations, but the RAID
organization may be hidden from the server computers: the disk subsystems pretend each
RAID system is a very large and very reliable disk.
Access time is the time from when a read or write request is issued to when data transfer
begins. To access (that is, to read or write) data on a given sector of a disk, the arm first must
move, so that it is positioned over the correct track, and then must wait for the sector to
appear under it as the disk rotates. The time for repositioning the arm is called the seek
time, andit increases with the distance that the arm must move. Typical seek times range
from 2 to 30 milliseconds, depending on how far the track is from the initial arm position.
Smaller disks tend to have lower seek time since the head has to travel a smaller distance.
The average seek time is the average of the seek times, measured over a sequence of
(uniformly distributed) random requests. If all tracks have the same number of sectors, and
we disregard the time required for the head to start moving and to stop moving, we can show
that the average seek time isone-third of the worst case seek time. Taking these factors into
account, the average seek time is around one-half of the maximum seek time. Average seek
time currently ranges between 4 milliseconds and 10 milliseconds, depending on the disk
model.
Once the seek time has started, the time spent waiting for the sector to be accessed to appear
under the head is called the rotational latency time. On an average, one-half of a rotation
of the disk is required for the beginning of the desired sector to appear under the head. Thus,
the average latency time of the disk is one-half the time for a full rotation of thedisk.
The access time is then the sum of the seek time and the latency, and ranges from 8 to 20
milliseconds. Once the first sector of the data to be accessed has come under the head, the
data transfer begins. The data- transfer rate is the rate at which data can be retrieved from
or stored on the disk.
The final commonly used measure of a disk is the meantime to failure (MTTF), which is a
measure of the reliability of the disk. The mean time to failure of a disk (or of any other
system) is the amount of time that, on average, we can expect the system to run continuously
without any failure.
Since access to data on disk is several orders of magnitude slower than access to data in main
memory, equipment designers have focused on techniques for improving the speed of access
to blocks on disk.
• Nonvolatile write buffers: Since the contents of the main memory are lost in a power
failure, information about database updates has to be recorded on the disk to survive
possible system crashes. For this reason, the performance of update-intensive
• Log disk: Another approach to reducing write latencies is to use a log disk – that is, a
disk devoted to writing a sequential log – in much the same way as a nonvolatile RAM
buffer. All access to the log disk is sequential, essentially eliminating seek time and
several consecutive blocks can be written at once, making write to the log disk several
times faster than random writes. As before, the data have to be written to their actual
location on disk as well, but the log disk can do the write later, without the database
system having to wait for the write to be completed. Furthermore, the log disk can
reorder the writes to minimize disk arm movement. If the system crashes before some
writes to the actual disk location have been completed, when the system comes back, it
reads the log disk to find those writes that had not been completed, and carries them
out then. File systems that support log disks as above are called journaling file
systems.
Self-Assessment Questions – 1
3. FILE ORGANIZATION
A file is organized logically as a sequence of records. These records are mapped onto disk
blocks. Files are provided as a basic construct in operating systems, so we shall assume the
existence of an underlying file system. We need to consider ways of representing logical data
models in terms of files.
Although blocks are of a fixed size determined by the physical properties of the disk and by
the operating system, record sizes vary. In a relational database, tuples of distinct relations
are generally of different sizes. A file containing account records is shown in Figure 4.4.
One approach to mapping the database to files is to use several files and store records of only
one fixed-length in any given file.
If we assume that each character occupies 1 byte and that a real occupies8 bytes, our
account record is 40 bytes long. A simple approach is to usethe first 40 bytes for the first
record, the next 40 bytes for the second record,and so on (Figure 4.5). However, there are
two problems with this simple approach:
1. It is difficult to delete a record from this structure. The space occupiedby the record
to be deleted must be filled with some other record of the file, or we must have a way
of marking deleted records so that they can be ignored.
2. Unless the block size happens to be a multiple of 40 (which is unlikely), some records
will cross block boundaries. That is, part of the record will be stored in one block and
part in another. It would thus require two-block accesses to read or write such a record.
Fig 4.5: File of Figure 4.4, with record 2 deleted and all records moved.
When a record is deleted, we could move the record that came after it into the space formerly
occupied by the deleted record, and so on, until every record following the deleted record
has been moved ahead (Figure 4.5). Such an approach requires moving a large number of
records. It might be easier simply to move the final record of the file into the space occupied
by the deleted record (Figure 4.6).
Fig 4.6: File of Figure 4.4, with record 2 deleted and final record moved.
It is undesirable to move records to occupy the space freed by a deleted record since doing
so requires additional block accesses. Since insertions tend to be more frequent than
deletions, it is acceptable to leave open the space occupied by the deleted record and to wait
for a subsequent insertion before reusing the space. A simple marker on a deleted record is
not sufficient, since it is hard to find this available space when an insertion isbeing done.
Thus, we need to introduce an additional structure.
At the beginning of the file, we allocate a certain number of bytes as a file header.
The header will contain a variety of information about the file. For now, all we need to store
there is the address of the first record whose contents are deleted. We use this first record
to store the address of the second availablerecord, and so on. Intuitively, we can think of
these stored addresses as pointers, since they point to the location of a record.
Fig 4.7: File of Figure 4.4, with free list after deletion of records 1, 4 and 6
The deleted records thus form a linked list, which is often referred to as a free list. Figure
4.7 shows the file of Figure 4.4, with the free list, after records 1, 4, and 6 have been deleted.
We define account-info as an array with an arbitrary number of elements. That is, the type
definition does not limit the number of elements in the array, although any actual record
will have a specific number of elements in its array. There is no limit on how large a record
can be (up to, of course,the size of the disk storage!).
The slotted-page structure appears in Figure 4.8. There is a header at thebeginning of each
block, containing the following information:
1. The number of record entries in the header
2. The end of free space in the block
3. An array whose entries contain the location and size of each record
The actual records are allocated contiguously in the block, starting from the end of the block.
The free space in the block is contiguous, between thefinal entry in the header array, and
the first record. If a record is inserted, space is allocated for it at the end of free space, and
an entry containing its size and location is added to the header.
If a record is deleted, the space that it occupies is freed, and its entry is set to deleted (its size
is set to −1, for example). Further, the records in the block before the deleted record are
moved, so that the free space created by the deletion gets occupied, and all free space is again
between the final entry in the header array and the first record. The end-of-free-space
pointer in the header is appropriately updated as well.
• Sequential file organization: Records are stored in sequential order, according to the
value of a “search key” of each record.
Figure 4.9 shows a sequential file of account records taken from our banking example. In
that example, the records are stored in search-key order, using branchname as the search
key.
The sequential file organization allows records to be read in sorted order; that can be useful
for display purposes, as well as for certain query-processing algorithms.
It is difficult, however, to maintain physical sequential order as records are inserted and
deleted, since it is costly to move many records as a result of asingle insertion or deletion.
We can manage deletion by using pointer chains, as we saw previously.
Figure 4.10 shows the file of Figure 4.9 after the insertion of the record (North Town,
A-888, 800). The structure in Figure 4.10 allows the fast insertion of new records, but
forces sequential file-processing applications to process records in an order that does
not match the physical order of the records.
Students(sid: string, name: string, login: string, age: integer, gpa: real)
To answer a range selection such as “Find all students with a gpa higher than 3.0" we must
identify the first such student by doing a binary search of the file and then scan the file from
that point on. If the file is large, the initial binary search can be quite expensive; can we
improve upon this method?
One idea is to create a second file with one record per page in the original (data) file, of the
form (first key on-page, pointer to page), again sorted by the key attribute (which is gpa in
our example). The format of a page in the second index file is illustrated in Figure 4.12.
We refer to pairs of the form <key, pointer> as entries. Notice that each index page contains
one pointer more than the number of key searches. Key serves as a separator for the
contents of the pages pointed to by the pointers to its left and right. This structure is
illustrated in Figure 4.13.
We can do a binary search of the index file to identify the page containing the first key (gpa)
value that satisfies the range selection (in our example, the first student with gpa over 3.0)
and follows the pointer to the page containing the first data record with that key value. We
can then scan the data file sequentially from that point on to retrieve other qualifying
records. This example uses the index to find the first data page containing a Students record
with gpa greater than 3.0, and the data file is scanned from that point on to retrieve other
such Student records.
Because the size of an entry in the index file (key value and page id) is likelyto be much
smaller than the size of a page, and only one such entry exists per page of the data file, the
index file is likely to be much smaller than the data file; thus, a binary search of the index file
is much faster than a binary search of the data file. However, a binary search of the index file
could still be fairly expensive, and the index file is typically still large enough to make inserts
and deletes expensive.
The potential large size of the index file motivates the ISAM idea: Why not apply the previous
step of building an auxiliary file on the index file and soon recursively until the final auxiliary
file fits on one page? This repeated construction of a one-level index leads to a tree structure
that is illustrated inFigure 4.14.
The data entries of the ISAM index are in the leaf pages of the tree and additional overflow
pages that are chained to some leaf page. In addition, some systems carefully organize the
layout of pages so that page boundaries correspond closely to the physical characteristics of
the underlying storage device. The ISAM structure is completely static (except for the
overflow pages, of which it is hoped, there will be few) and facilitates such low-level
optimizations.
Each tree node is a disk page, and all the data resides in the leaf pages. This corresponds to
an index that uses Alternative (1) for data entries, in terms of the alternatives; we can create
an index with Alternative (2) by storing the data records in a separate file and storing <key,
rid> pairs in the leaf pages of the ISAM index. When the file is created, all leaf pages are
allocated sequentially and sorted on the search key value. (If Alternatives (2) or (3) are
used, the data records are created and sorted before allocating the leaf pages of the ISAM
index.) The non-leaf level pages are then allocated. If there are several inserts to the file
subsequently, so that more entries are inserted into a leaf than will fit onto a single
page, additional pages are needed because the index structure is static. These additional
pages are allocated from an overflow area. The allocation of pages is illustrated in Figure
4.15.
The basic operations of insertion, deletion, and search are all quite straightforward. For an
equality selection search, we start at the root node and determine which subtree to search
by comparing the value in thesearch field of the given record with the key values in the node.
For a range query, the starting point in the data (or leaf) level is determined similarly, and
data pages are then retrieved sequentially. For inserts and deletes, the appropriate page is
determined as for a search, and the record is inserted ordeleted with overflow pages added
if necessary.
The following example illustrates the ISAM index structure. Consider the tree shown in
Figure 4.16. All searches begin at the root. For example, to locate a record with the key-value
27, we start at the root and follow the left pointer, since 27 < 40. We then follow the middle
pointer, since 20 <= 27 < 33. For a range search, we find the first qualifying data entry as for
an equality selection and then retrieve primary leaf pages sequentially (also retrieving
overflow pages as needed by following pointers from the primary pages). The primary leaf
pages are assumed to be allocated sequentially this assumption is reasonable because the
number of such pages is known when the tree is created and does not change subsequently
under inserts and deletes and so no ‘next leaf page' pointers are needed.
We assume that each leaf page can contain two entries. If we now insert a record with key-
value 23, the entry 23* belongs on the second data page, which already contains 20* and
27* and has no more space. We deal with this situation by adding an overflow page and
putting 23* in the overflow page. Chains of overflow pages can easily develop.
For instance, inserting 48*, 41*, and 42* leads to an overflow chain of two pages. The tree of
Figure 4.16 with all these insertions is shown in Figure 4.17.
The deletion of an entry k* is handled by simply removing the entry. If this entry is on an
overflow page and the overflow page becomes empty, the page can be removed. If the entry
is on a primary page and deletion makes the primary page empty, the simplest approach is
to simply leave the empty primary page as it is; it serves as a placeholder for future insertions
(and possibly non-empty overflow pages, because we do not move records from the overflow
pages to the primary page when deletions on the primary page create space).
Thus, the number of primary leaf pages is fixed at file creation time. Notice that deleting
entries could lead to a situation in which key values that appear in the index, levels do not
appear in the leaves! Since index levels are used only to direct a search to the correct leaf
page, this situation is not aproblem. The tree of Figure 4.17 is shown in Figure 4.18 after the
deletion of theentries 42*, 51*, and 97*.
Note that after deleting 51*, the key-value 51 continues to appear in the index level. A
subsequent search for 51* would go to the correct leaf page and determine that the entry is
not in the tree.
The non-leaf pages direct a search to the correct leaf page. The number of disks I/Os is equal
to the number of levels of the tree and is equal to logFN, where N is the number of primary
leaf pages and the fan-out F is the number of children per index page. This number is
considerably less than the number of disk I/Os for binary search, which is log2N; in fact, it is
reduced further by pinning the root page in memory. The cost of access viaa one-level index
is log2(N=F). If we consider a file with 1,000,000 records, 10 records per leaf page, and 100
entries per index page, the cost (in page I/Os) of a file scan is 100,000, a binary search of the
sorted data file is 17, a binary search of a one-level index is 10, and the ISAM file (assuming
no overflow) is 3.
You can use VSAM to organize records into four types of data sets: key-sequenced, entry-
sequenced, linear, or relative record. The primarydifference among these types of data sets
is the way their records are stored and accessed.
Several additional methods of accessing data in VSAM are not listed here. Most applications
use VSAM for keyed data.
VSAM works with a logical data area known as a control interval (CI) that is diagrammed in
Figure 4.19. The default CI size is 4K bytes, but it can be up to 32K bytes. The CI contains data
records, unused space, record descriptor fields (RDFs), and a CI descriptor field.
Multiple CIs are placed in a control area (CA). A VSAM data set consists of control areas and
index records. One form of index record is the sequence set, in which the lowest-level index
is pointing to a control interval.
VSAM data is always variable-length and records are automatically blocked in control
intervals. The RECFM attributes (F, FB, V, VB, U) do not apply to VSAM, nor does the BLKSIZE
attribute. You can use the Access Method Services (AMS) utility to define and delete VSAM
structures, such as files and indexes.
7. SUMMARY
This unit has covered storage devices, file organization, indexed, and virtual access methods
which are used in DBMS. We discussed that several types of data storage media are used in
t h e computer system and are classified bythe speed with which data can be accessed, by
the cost per unit of data to buy the medium, and by the medium’s reliability. We also
discussed about the File organizations and ways of representing logical data models in terms
of files. We also illustrated the ISAM index structure and the know-how of accessing files.
Another accessing method discussed is Virtual Storage Access Method (VSAM) that applies
to both a data set type and are used to manage various user data types. We also found out
that VSAM keeps disk records in a unique format that is not understandable by other access
methods.
8. TERMINAL QUESTIONS
1. Explain various storage devices and their characteristics.
2. Explain various performance measures of disks.
3. Explain various file organizations in detail.
4. Explain indexed and virtual storage access methods.
9. ANSWERS
Self-Assessment Questions
1. Cache
2. Lost
3. Power failure.
4. magnetic disk
5. optical
6. sequential
7. circular
8. disk controller
9. storage area network (SAN)
10. Access
11. average seek time
12. block
13. elevator algorithm
Terminal Questions
1. Various storage devices available in most computer systems are: main memory, flash
memory, magnetic disk storage, optical storage, tape storage. Each of these storage
devices has their own characteristics. (Refer section 4.2 for detail)
2. Various performance measures of disk are Access time, average seek time, capacity,
latency time, data-transfer rate, reliability. (Refersection 4.2.3 for detail)
3. File are organized in Fixed-length records, Variable-length records. (Refer section 4.3
for detail)
4. In Indexed sequential access method (ISAM), index files are createdwith records in a
page and pointer are used to do binary search for a key. In Virtual Storage access
method, four types of data sets namely key-sequenced, entry-sequenced, linear, or
relative record are used.
BACHELOR OF COMPUTER
APPLICATIONS
SEMESTER 3
DCA2102
DATABASE MANAGEMENT SYSTEM
Unit 5
An Introduction to RDBMS
Table of Contents
1. INTRODUCTION
A relational database is a collection of relations with distinct relation names. The
relational database schema is the collection of schemas for the relations in the database.
For example, a university database contains schemas for relations called Students, Faculty,
Courses, Rooms, Enrolled Numbers, Teachers, etc. An instance of a relational database is a
collection of relation instances, one per relation schema in the database schema; of course,
each relation instance must satisfy the domain constraints in its schema.
1.1 Objectives:
By the end of the Unit 5, the learners should be able to understand:
❖ The relation, relation schema, and domain name concept
❖ The relational database components like tables, columns, etc.
❖ Properties of RDBMS including entity-relationship models
❖ Optimization in RDBMS
❖ System catalogs in RDBMS
We use the example of student information in a university database to illustrate the parts of
a relation schema:
Students(sid: string, name: string, login: string, age: integer, gpa: real)
This says, for instance, that the field name sid has a domain name string. The set of values
associated with the domain string is the set of all character strings. We now turn to the
instances of a relation. An instance of a relationis a set of tuples, also called records, in
which each tuple has the same number of fields as the relation schema. A relation instance
can be thought of as a table in which each tuple is a row, and all rows have the same number
of fields. (The term relation instance is often abbreviated to just relation when there is no
confusion with other aspects of a relation such as its schema.)
An instance of the Students relation appears in Figure 5.1. The instance S1 contains six tuples
and has, as we expect from the schema, five fields. Notethat no two rows are identical.
This is a requirement of the relational model - each relation is defined to bea set of unique
tuples or rows. The order in which the rows are listed is not important.
Figure 5.2 shows the same relation instance. If the fields are named, as in our schema
definitions and figures depicting relation instances, the order of fields does not matter either.
However, an alternative convention is to list fields in a specific order and to refer to a field
by its position. Thus sid is field 1 of Students, login is field 3, and so on. If this convention is
used, the order of fields is significant. Most database systems use a combination of these
conventions. For example, in SQL the named fields’ convention is used in statements that
retrieve tuples, and the ordered fields’ convention is commonly used when inserting tuples.
A relation schema specifies the domain of each field or column in therelation instance.
These domain constraints in the schema specify an important condition that we want each
instance of the relation to satisfy: The values that appear in a column must be drawn from
the domain associated with that column. Thus, the domain of a field is essentially the type of
that field, in programming language terms, and restricts the values that can appear inthe
field. Domain constraints are so fundamental in the relational model that we will henceforth
consider only relation instances that satisfy them; therefore, relation instance means
relation instance that satisfies the domainconstraints in the relation schema.
Self-Assessment Questions – 1
Operations are used to manipulate data and structures in a database. Whenusing operations,
you must adhere to a predefined set of integrity rules.
Integrity rules are laws that govern the operations allowed on data in a database. This
ensures data accuracy and consistency.
A Table is a basic storage structure of an RDBMS and consists of columns and rows as shown
in Figure 5.4. A table represents an entity. For example,the S_DEPT table stores information
about the departments of an organization.
A Row is a combination of column values in a table and is identified by a primary key. Rows
are also known as records. For example, a row in the table S_DEPT contains information
about one department.
A Column is a collection of one type of data in a table. Columns represent the attributes of an
object. Each column has a column name and contains values that are bound by the same type
and size. For example, a column in the table S_DEPT specifies the names of the departments
in the organization.
A Field is an intersection of a row and a column. A field contains one data value. If there is
no data in the field, the field is said to contain a NULLvalue.
A Foreign key is a column or set of columns that refers to a primary key of another table.
You use foreign keys to establish principle connections between, or within, tables. A foreign
key must either match a primary key or else be NULL. Rows are connected logically when
required. The logical connections are based upon conditions that define a relationship
between corresponding values, typically between a primary key and a matching foreign key.
This relational method of linking provides great flexibility as it is independent of physical
links between records. The primary and foreign key concept is shown in Figure 5.5.
Self-Assessment Questions – 2
4. RDBMS PROPERTIES
An RDBMS is easily accessible. You execute commands in the Structured Query Language
(SQL) to manipulate data. SQL is the International Standards Organization (ISO) standard
language for interacting with an RDBMS. The interaction between the SQL and database is
shown in Figure 5.6.
An RDBMS provides full data independence. The organization of the data isindependent of
the applications that use it. You do not need to specify the access routes to tables or know
how data is physically arranged in a database.
A relational database is a collection of individual, named objects. The basic unit of data
storage in a relational database is called a table. A tableconsists of rows and columns used
to store values. For access purposes, the order of rows and columns is insignificant. You can
control the access order as required.
When querying the database, you use conditional operations such as joins and restrictions.
A join combines data from separate database rows. A restriction limits the specific rows
returned by a query as shown in Figure 5.7. We can learn more details about join
operations under relational algebra in Unit 7.
An RDBMS enables data sharing between users. At the same time, you can ensure the
consistency of data across multiple tables by using integrity constraints. An RDBMS uses
various types of data integrity constraints. These types include an entity, column, referential
and user-defined constraints.
The constraint, entity, ensure the uniqueness of rows, and the constraint column ensures
consistency of the type of data within a column. The other type, referential, ensures the
validity of foreign keys, and user-definedconstraints are used to enforce specific business
rules.
An RDBMS minimizes the redundancy of data. This means that similar data is not repeated
in multiple tables.
enterprise, without attention to the efficiency of the physical database design. The entity-
relationship diagrams are then turned into a logical schema in which the database is
implemented.
Short definitions of some of the basic terms that are used for describing important entity-
relationship concepts are:
4. Keys: The key uniquely differentiates one entity instance from all others in the entity.
A key is an identifier.
b) Candidate Key
When multiple possible identifiers exist, each of them is acandidate key.
c) Concatenated Key.
Key i s made up of parts which, when combined, become a uniqueidentifier.
Multiple attribute keys are concatenated keys.
e) Foreign Keys. Foreign keys reference a related table through the primary key of
that related table.
An ER schema may identify certain constraints to which the content of the data must
conform.
Self-Assessment Questions – 3
We can study more about select, project operations in relational algebra in Unit 7.
Self-Assessment Questions – 4
18. The goal of a query optimizer is to find a good evaluation plan for a given __________.
19. Optimizing a relational algebra expression involves _ _ _ _ _ basic steps.
A fundamental property of a database system is that it maintains a description of all the data
that it contains. A relational DBMS maintains information about every relation and index that
it contains. The DBMS also maintains information about views, for which no tuples are stored
explicitly; rather, a definition of the view is stored and used to compute the tuples that belong
in the view when the view is queried. This information is stored in a collection of relations,
maintained by the system, called the catalog relations; an example of a catalog relation is
shown in Figure 5.8. The catalog relations are also called the system catalog, the catalog, or
the data dictionary. The system catalog is sometimes referred to as metadata;that is, not
data, but descriptive information about the data. The information in the system catalog is
used extensively for query optimization.
In addition, statistics about relations and indexes are stored in the system catalogs and
updated periodically (not every time the underlying relations are modified). The following
information is commonly stored:
Cardinality: The number of tuples NTuples(R) for each relation R.
Size: The number of pages NPages(R) for each relation R.
Index Cardinality: Number of distinct key values NKeys(I) for each index I.
Index Size: The number of pages INPages(I) for each index I.
Index Height: The number of nonleaf levels IHeight(I) for each tree index I.
Index Range: The minimum present key value ILow(I) and the maximumpresent key
value IHigh(I) for each index I.
Figure 5.8 shows the tuples in the Attribute Cat relation that describe the attributes of these
two relations. Notice that in addition to the tuples describing Students and Faculty, other
tuples (the first four listed) describe the four attributes of the Attribute Cat relation itself!
These other tuples illustrate an important point: the catalog relations describe all the
relations inthe database, including the catalog relations themselves. When information about
a relation is needed, it is obtained from the system catalog. Of course,at the implementation
level, whenever the DBMS needs to find the schema of a catalog relation, the code that
retrieves this information must be handled specially.
The fact that the system catalog is also a collection of relations is very useful. For example,
catalog relations can be queried just like any other relation, using the query language of the
DBMS! Further, all the techniques available for implementing and managing relations apply
directly to catalog relations. The choice of catalog relations and their schemas is not unique
and is made by the implementor of the DBMS. Real systems vary in their catalog schema
design, but the catalog is always implemented as a collection of relations, and it essentially
describes all the data stored in the database.
Self-Assessment Questions – 5
7. SUMMARY
In this unit, we had an informal look at the relational model initially. We see that the main
construct for representing data in the relational model is a relation that consists of a
relation schema and a relation instance. We also dealt with relational database
management systems and discussed RDBMS properties. We also briefly presented about the
entity-relationship model as a tool for analyzing the semantic features of an application that
are independent of events. Finally, we discussed the optimization and system catalogs in
relational database management systems. We dealt with the catalog relations also known as
system catalog, the catalog, or the data dictionary. The system catalog is sometimes referred
to as metadata which is not data, but descriptive information about the data.
8. TERMINAL QUESTIONS
1. Create a relational schema to hold information about insurance policies and
policyholders. Insurance policies have properties policyNo, startDate, premium,
renewalDate, policyType. Policyolders are characterized by holderNo, holderName,
holderAddress and holderTelno.
a. What would be suitable primary keys in this database?
b. How do we relate insurance policy information with policyholderinformation?
c. Declare a suitable domain for policy type.
2. What is the goal of query optimization? Why is it important?
3. What information is stored in the system catalogs?
4. What are the benefits of making the system catalogs relations?
9. ANSWERS
Self Assessment Questions
1. relation schema
2. records
3. domain constraints
4. Structures
5. Manipulate
6. Consistency
7. Table
8. Primary
9. Field
10. SQL
11. data storage
12. join
13. sharing
14. redundancy
15. Multivalued
16. key
17. unique
18. query
19. two
20. System
21. Definition
22. Cardinality
Terminal Questions
1 . ( a ) You can create a relational schema of your own. Although in thiscase, Policy No.
would be a suitable primary key.
(b) Insurance policy information and policy holder information can berelated by Fields
and Tuples.
(c) You can declare it in your own or give the domain name of existingpolicy type of
any insurance company. (Refer entire unit for details)
2. The goal of a query optimizer is to find a good evaluation plan for agiven query.
(Refer section 5)
3. Information for each relation (like relation name, the file name, file structure, the
attribute name, and type of each of its attributes, the index name of each index on the
relation, the integrity constraints on the relation) are stored in the system catalogs.
(Refer section 6.1 for detail)
4. The main benefits of making the system catalogs relations is that the catalog relations
can be queried just like any other relation, using the query language of the DBMS.
(Refer section 6.1 for detail)
BACHELOR OF COMPUTER
APPLICATIONS
SEMESTER 3
DCA2102
DATABASE MANAGEMENT SYSTEM
Unit 6: SQL – 1 1
DCA2102: Database Management System Manipal University Jaipur (MUJ)
Unit 6
SQL – 1
Table of Contents
4.2 Subqueries - -
10 - 23
4.3 GROUP BY Feature - -
4.4 Updating the Database - -
4.5 Data Definition Facilities - -
5 Summary - - 24
6 Terminal Questions 1, 2 - 24 - 25
7 Answers - - 25 - 26
Unit 6: SQL – 1 2
DCA2102: Database Management System Manipal University Jaipur (MUJ)
1. INTRODUCTION TO SQL
SQL is an acronym for Structured Query Language. It is available in a number of database
management packages based on the relational model of data, for example, in DB2 of the IBM
and UNIFY of the UNIFY corporation.
It allows for data definition, manipulation, and data control for a relational database. The
data definition facilities of SQL permit the definition of relations and various alternative
views of relations. Further, the data control facility gives features for one user to authorize
other users to access his data. This facility also permits assertions to be made about data
integrity. All the three major facilities of SQL, namely, data manipulation, data definition, and
data control are bound together in one integrated language framework.
1.1 Objectives:
After going through Unit 6, you should be able to:
Unit 6: SQL – 1 3
DCA2102: Database Management System Manipal University Jaipur (MUJ)
Unit 6: SQL – 1 4
DCA2102: Database Management System Manipal University Jaipur (MUJ)
These statements are now universally accepted, as is their functionality, although the degree
to which these commands support this functionality varies somewhat between products.
Compare the functionality of different implementations of UPDATE for example.
Data Control
This deals with three issues:
a) Recovery and Concurrency
Concurrency is concerned with the manner in which multiple users operate upon the
database.
Each user can either reflect the updates of a transaction by using the COMMIT or can cancel
all the updates of a transaction by using ROLLBACK.
b) Security
Security has two aspects to it.
The first is the VIEW mechanism. A view of a relation can be created which hides the sensitive
information and defines only that part of a relation that should be visible. A user can then be
allowed to access this view.
CREATE VIEW LOCAL AS
SELECT * FROM SUPPLIER
WHERE [Link] = 'Delhi'
The above view reveals only the suppliers of Delhi.
The second is by using GRANT operation. This shall grant one or moreaccess rights to
perform the data manipulative operations on the relations.
c) Integrity Constraints
For example, one can specify that an attribute of a relation will not take onnull values.
Unit 6: SQL – 1 5
DCA2102: Database Management System Manipal University Jaipur (MUJ)
Self-Assessment Questions – 1
Unit 6: SQL – 1 6
DCA2102: Database Management System Manipal University Jaipur (MUJ)
3. DATA DEFINITION
Data definition in SQL is via the create statement. The statement can be used to create a table,
index, or view (i.e., a virtual table based on existing tables). To create a table, the create
statement specifies the name of the table and the names and data types of each column of
the table. Its format is:
create table< relation >( < attribute list >)
The data types supported by SQL depend on the particular implementation. However, the
following data types are generally included: integer, decimal, real (i.e., floating point values),
and character strings, both of fixed size and varying length. A number of ranges of values for
the integer data type are generally supported, for example, integer and short int. The decimal
value declaration requires the specification of the total number of decimal digits forthe value
and (optionally), the number of digits on the right of the decimal point. The number of
fractional decimal digits is assumed to be zero if only the total number of digits is specified.
< data type >:: = < integer > |< short int >| < char (n) > | < float >| < decimal (p[,q]) >
In addition, some implementations can support additional data types suchas bit strings,
graphical strings, logical, data, and time. Some DBMSs support the concept of date. One
possible implementation of date could be as eight unsigned decimal digits representing the
data in the yyyymmdd format. Here yyyy represents the year, mm represents the month and
dd represents the day. Two dates can be compared to find the one that islarger and hence
occurring later. The system ensures that only t h e legal date values are inserted (20081016
for the date would be illegal) and functions are provided to perform operations such as
adding a number of days to a date to come up with another date or subtracting a date from
the current date to find the number of days, months, or years. Date constants are provided
in either the format is given above or as a character string in one of the following formats:
mm/dd/yy; mm/dd/yyyy; dd-mmm-yy; dd-mmm-yyyy. In this text, we represent a date
constant as eight unsigned decimal digits in the format yyyymmdd.
Unit 6: SQL – 1 7
DCA2102: Database Management System Manipal University Jaipur (MUJ)
The employee relation for the hotel database can be defined using the create table statement
given below. Here, the Empl_No is specified to be not null to disallow this unique identifier
from having a null value. SQL supports the concept of null values and, unless a column is
declared withthe not null option, it could be assigned a null value.
The create index statement allows the creation of an index for an already existing relation.
The columns to be used in the generation of the index are also specified. The index is named
and the ordering for each column used in the index can be specified as either ascending or
descending. The clusteroption could be specified to indicate that the records are to be placed
in physical proximity to each other. The unique option specifies that only one record could
exist at any time with a given value for the column(s) specified in the statement to create the
index. (Even though this is just an access aid and a wrong place to declare the primary key).
Such columns, for instance, could form the primary key of the relation and hence duplicate
tuples are notallowed. One case is the ORDER relation where the key is the combination of
the attribute Bill#, Dish#. In the case of an existing relation, an attempt to create an index
Unit 6: SQL – 1 8
DCA2102: Database Management System Manipal University Jaipur (MUJ)
with the unique option will not succeed if the relation does not satisfy this uniqueness
criterion. The syntax of the create index statement is shown below:
create [unique] index name-of-index
on existing-table-name
(column-name [ascending or descending]
[, column-name[order]….])
[cluster]
The following statement causes an index called emp index to be built on thecolumn Name
and Pay_Rate. The entries in the index are ascending by Name value and descending by
Pay_Rate. In this example, there are no restrictions on the number of records with the same
Name and Pay_Rate.
Create index empindex
on EMPLOYEE (Nameasc, Pay_Rate desc);
An existing relation or index could be deleted from the database by the drop SQL statement.
The syntax of the drop statement is as follows:
Self-Assessment Questions – 2
Unit 6: SQL – 1 9
DCA2102: Database Management System Manipal University Jaipur (MUJ)
There are parts that are supplied by suppliers. S contains the details about each supplier.
Turnover for a supplier is in terms of lakhs of rupees. Information regarding suppliers of
specific parts is contained in SP, whereas information about the parts themselves is
contained in P.
The column names following SELECT are to be retrieved from the relations specified in the
FROM part. WHERE specifies the condition that the tuples must satisfy in order to be part of
the result.
Unit 6: SQL – 1 10
DCA2102: Database Management System Manipal University Jaipur (MUJ)
Below we shall state first the retrieval query in English and then specify its SQL equivalent.
Unqualified Retrieval
1. Get the part numbers of all the parts being supplied.
SELECT P#
FROM SP
P#
1
1
2
2
3
3
1
Part numbers getting repeated? That's right. SELECT does not eliminateduplicate rows
(unlike the project operation of the relational algebra). In order to do that
2. Get the part numbers of all the parts being supplied with no duplication.
SELECT DISTINCT P#
FROM SP
P#
1
2
3
If all the columns of the relation are to be retrieved then one needn't list all of them. A*
can be specified after SELECT to indicative retrieval ofthe entire relation.
Unit 6: SQL – 1 11
DCA2102: Database Management System Manipal University Jaipur (MUJ)
5. Get the details of suppliers who operate from Bombay with a turnover
[Link] S.*
FROM S
WHERE CITY = 'BOMBAY' AND TURNOVER > 50
S# SNAME SCITY TURNOVER
11 NARMADA BOMBAY 100
The above form is a conjunction of comparison predicates. Acomparison predicate is of the
form
scalar-expr O scalar-expr
Unit 6: SQL – 1 12
DCA2102: Database Management System Manipal University Jaipur (MUJ)
The use of BETWEEN gives the range within which the values must [Link] the value should
lie outside a range then BETWEEN is to be precededby NOT. For example,
SELECT P#
FROM P
WHERE WEIGHT NOT BETWEEN 25 AND 35
P# WEIGHT
3 45
would retrieve all part numbers whose weight is less than 25 or greater than 35 as shown
above.
LIKE Predicate
This predicate is used for pattern matching. A column of type char can be compared with a
string constant. The use of the word LIKE doesn't look for an exact match but a form of wild
string match. A % or - can appear in the string constant where
% stands for a sequence of n (>=0) characters
- stands for a single character
Examples
ADDRESS LIKE '%Bangalore%' – ADDRESS should have Bangaloresomewhat as a part of
it if the match is to succeed.
STRANGE STRING LIKE '\-%' ESCAPE\'
Here, the normal meaning of -is overridden with the use of the escapecharacter. STRANGE
above will match any string beginning with –
Unit 6: SQL – 1 13
DCA2102: Database Management System Manipal University Jaipur (MUJ)
7. Get the names and cities of suppliers whose names begin with
CSELECT SNAME SCITY
FROM S
WHERE SNAME LIKE 'C%'
SNAME SCITY
CAUVERY BANGALORE
When the data is to be retrieved from more than one relation, then both the relation
names are specified in the FROM clause and the join condition in the WHERE part.
8. Get pairs of supplier numbers such that both operate from the same city.
SELECT FIRST.S#, SECOND.S#
FROM S FIRST, S SECOND
WHERE [Link] = [Link]
AND FIRST.S#< > SECOND.S#
FIRST and SECOND are tuple variables, both ranging over S. The last line eliminates a
supplier getting compared with him.
S# S#
11 13
13 11
But, we see that suppliers with numbers 11 and 13 are getting compared twice. Can that
be avoided? How about < instead of< >?
SELECT FIRST.S#, SECOND.S#
FROM S FIRST, S SECOND
WHERE [Link] = [Link]
AND FIRST.S# SECOND.S#
IN Predicate
This is to be used whenever you want to test whether an attribute valueis one of a set of
values. For example,
Unit 6: SQL – 1 14
DCA2102: Database Management System Manipal University Jaipur (MUJ)
4.2 Subqueries
The expression following WHERE can be either a simple predicate asexplained above or it
can be a query itself! This part of the query following WHERE is called a Subquery.
A subquery, which in turn is a query, can have its own subquery and the process of specifying
subqueries can continue ad infinitum! More practically,the process ends once the query has
been fully expressed as an SQL statement.
Subqueries can appear when using the comparison predicate, the IN predicate and when
quantifiers are used.
Comparison Predicate
10. Get the supplier numbers of suppliers who are located in the same city as TAPI.
SELECT S.S#, SNAME
FROM S
WHERE [Link] = (SELECT CITY
FROM S
WHERE SNAME = 'TAPI‘)
The inner select (subquery) retrieves the city of the supplier named TAPI. The outer select
(the main one) then compares the city of each supplier in the supplier relation and picks up
those where the comparison succeeds.
Unit 6: SQL – 1 15
DCA2102: Database Management System Manipal University Jaipur (MUJ)
S# SNAME
11 NARMADA
13 TAPI
Notice that the subquery appears after the comparison operator. The formatof this form of
expression
scalar-expr operator subquery
IN Predicate
In this form, the subquery selects a set of values. The outer query checkswhether the
value of a specified attribute is in this set.
11. Get the names of suppliers who supply part 2
SELECT [Link]
FROM S
WHERE S.S# IN
(SELECT SP.S#
FROM SP
WHERE SP.P#= 2)
SNAME
CAUVERY
NARMADA
Unit 6: SQL – 1 16
DCA2102: Database Management System Manipal University Jaipur (MUJ)
Quantified predicates
The two quantifiers that can be used are ALL and ANY. Any stands for the existential
quantifier and ALL for the universal quantifier.
Let's first look at ANY. It can be specified in a comparison predicate just following the
comparison operator. That is,
scalar-expr O ANY subquery
The subquery is first evaluated to give a set of values. The above expression is true if the
scalar-expr is O comparable with any of the values that form the result of the subquery.
12. Get the part numbers for parts whose cost is less than the current maximum cost.
SELECT P#, COST
FROM P
WHERE COST< ANY
(SELECTCOST
FROM P)
P# COST
1 10
2 15
The inner select gets the cost of all the parts. In the outer select, a P# is selected if its cost is
less than some element of the set selected in the earlier step.
Unit 6: SQL – 1 17
DCA2102: Database Management System Manipal University Jaipur (MUJ)
If the function is followed by the word DISTINCT then unique values are used. On the other
hand, if ALL follows the functions then all the values are used for evaluating the function. ALL
is default.
COUNT(*) has a special meaning in that it counts the number of rows of a relation. COUNT in
any other form must make use of DISTINCT. In other words, except when rows are counted,
COUNT always returns the number of distinct values in a column.
15. Get the part numbers whose cost is greater than the average cost.
SELECT P#
FROM P
WHERE COST >
(SELECT AVG (COST)
FROM P)
16. Get the names of suppliers who supply from a city where there is atleast one more
supplier
SELECT SNAME
FROM S FIRST
WHERE 2 >=
(SELECT COUNT (CITY)
FROM S
WHERE CITY = [Link])
Unit 6: SQL – 1 18
DCA2102: Database Management System Manipal University Jaipur (MUJ)
1 125
2 80
3 110
GROUP BY groups together all the rows which have the same value for P#. The function SUM
is then applied to each group. That is, the result consists of a part number along with the total
quantity in which it is supplied.
Whenever GROUP BY is used then the phrase WHERE is to be replaced by HAVING. The
meaning of HAVING is the same as WHERE except that the condition is now applicable to
each group.
18. Get the part numbers for parts supplied by more than one supplier.
SELECT P#
FROM SP
GROUP BY P#
HAVING COUNT(*) > 1
Each group contains one or more tuples which have the same part number. COUNT(*) is
applied to each such group.
Unit 6: SQL – 1 19
DCA2102: Database Management System Manipal University Jaipur (MUJ)
INSERT
The insertion facility allows new tuples to be inserted into given relations. Attributes which
are not specified by the insertion statement are given null values. Consider
1. Add a part with 14, weight 10, colored red, with the cost and selling price as 20 and 60
respectively
INSERT INTO P: < 14, 10, 'red', 20, 60 >
If all the fields are not known then a tuple can still be added. The attributes whose values are
not specified will have a null value.
If all the fields are not known then a tuple can still be added. The attributes whose values are
not specified will have a null value.
INSERT INTO P:
< 15,'green' >
The values for fields the weight, cost and the selling price which are not specified are
assumed to be null.
2. Let us assume that there is a relation called RED-PART with one column P#.
INSERT INTO RED-PART
SELECT P#
FROM P
WHERE COLOUR = 'red'
The various attributes of P having red color are identified and inserted into the relation RED-
PART.
Unit 6: SQL – 1 20
DCA2102: Database Management System Manipal University Jaipur (MUJ)
DELETE
The deletion facility removes specified tuples from the database. Consider
1. Delete supplier 13
DELETE S
WHERE S# = 13
Since S# is the primary key only one tuple will be deleted from S.
UPDATE
When columns are to be modified, SET clause is used. This clause specifies the update to be
made to selected tuples.
1. Change the city of supplier 13 to Bangalore and increase the turnoverby 20 lakhs.
UPDATE S
SET CITY = 'BANGALORE'
TURNOVER = TURNOVER + 20
WHERE S# = 13
Unit 6: SQL – 1 21
DCA2102: Database Management System Manipal University Jaipur (MUJ)
CREATE statement allows to define a relation. The name of the relation to be created and its
various fields together with their data types must be specified. If a certain attribute is barred
from containing null values then a NONULL specification must be made for it.
It must be noted that the word TABLE is used in this syntax instead of
RELATION.
Example
CREATE TABLE DEPT
(DNO CHAR (2) NONULL,
DNAME VARCHAR (12),
LOC VARCHAR (12))
VIEW
A very important aspect of data definition is the ability to define alternative views of data.
The process of specifying an alternative view is very similar tothat of framing a query. The
derived relation is stored and can be used thereafter as an object of the various commands.
It is also possible to defineother views on top of the newly created relation.
Example
DEFINE VIEW D50 AS
SELECT EMPNO, NAME, JOB
FROM EMP
WHERE DNO = 50
D50 contains the employee number, name, and job of those employees who are in
department 50.
Unit 6: SQL – 1 22
DCA2102: Database Management System Manipal University Jaipur (MUJ)
Self-Assessment Questions – 3
Unit 6: SQL – 1 23
DCA2102: Database Management System Manipal University Jaipur (MUJ)
5. SUMMARY
This unit discussed that most commercial relational DBMSs support some form of the SQL
data manipulation language, and this creates different dialects of SQL. We dealt that SQL
commands can be roughly divided into three major categories: (i) commands to create and
maintain the database structure (ii) commands that manipulate the data in such
structures (iii) commands that control the use of the database. We discussed about
subqueries, functions, GROUP BY features, updating the database, and data definition
facilities.
6. TERMINAL QUESTIONS
1. Consider the insurance database of Figure 6.1, where the primary keys are underlined.
Construct the following SQL queries for this relational database.
a) Find the total number of people who owned cars that were involvedin accidents in
1989.
b) Find the number of accidents in which the cars belonging to “JohnSmith” were
involved.
c) Add a new accident to the database; assume any values for requiredattributes.
d) Delete the Mazda belonging to “John Smith”.
Update the damage amount for the car with license number “AABB2000” in the accident with
report number “AR2197” to $3000.
Unit 6: SQL – 1 24
DCA2102: Database Management System Manipal University Jaipur (MUJ)
2. Consider the employee database of Figure 6.2, where the primary keys are underlined.
Give an expression in SQL for each of the following queries.
a) Find the names of all employees who work for First BankCorporation.
b) Find the names and cities of residence of all employees who workfor First Bank
Corporation.
c) Find the names, street addresses, and cities of residence of allemployees
who work for First Bank Corporation and earn more than $10,000.
d) Find all employees in the database who live in the same cities as thecompanies
for which they work.
3. Consider the relational database of Figure 6.2. Using SQL, define a viewconsisting of
manager-name and the average salary of all employees who work for that manager.
Explain why the database system should not allow updates to be expressed in terms of
this view.
4. Describe the circumstances in which you would choose to use embedded SQL rather
than SQL alone or only a general-purpose programming language.
7. ANSWERS
.
Self-Assessment Questions
1. Three
2. Two
3. DML
4. Concurrency
5. View
6. Create
7. Implementation
8. Alter
9. Drop
10. Retrieve
11. Select
12. Duplicate
13. BETWEEN
14. Matching
15. Comparison
Unit 6: SQL – 1 25
DCA2102: Database Management System Manipal University Jaipur (MUJ)
Terminal Questions
1. Refer the whole unit to construct SQL queries of the given insurance database and
check it practically to confirm results.
2. Updates should not be allowed in view because there is no way to determine how to
change the underlying data and define alternative views of data. (Refer section 6.4 for
detail)
3. Writing queries in SQL is easier than coding the same queries in a general-purpose
programming language. Embedded SQL is lesscomplicated since it avoids the clutter of
the ODBC or JDBC function calls, but requires a specialized pre-processor. (Refer whole
unit for detail)
Unit 6: SQL – 1 26
DCA2102: Database Management System Manipal University Jaipur (MUJ)
BACHELOR OF COMPUTER
APPLICATIONS
SEMESTER 3
DCA2102
DATABASE MANAGEMENT SYSTEM
Unit 7: SQL – 2 1
DCA2102: Database Management System Manipal University Jaipur (MUJ)
Unit 7
SQL – 2
Table of Contents
Unit 7: SQL – 2 2
DCA2102: Database Management System Manipal University Jaipur (MUJ)
1. INTRODUCTION
In the last Unit, SQL was introduced, and depicted that this language is available in a number
of database management packages based on the relational model of data. It was also said that
SQL is used for data definition,manipulation, and data control for a relational database and
illustrated that all the three major facilities of SQL, namely, data manipulation, data
definition, and data control are bound together in one integrated language framework.
In this unit, an introduction about the Views is given which are useful for hiding unneeded
information and for collecting together information from more thanone relation into a single
view. This Unit also introduces Embedded SQL which uses SQL commands within the host
language program.
We discuss about the transaction processing which consists of a sequenceof operations and
which must appear to be atomic.
1.1 Objectives:
After going through Unit 7, the learners should be able to:
Unit 7: SQL – 2 3
DCA2102: Database Management System Manipal University Jaipur (MUJ)
2. VIEWS
A view is a virtual table that does not actually exist. It is made up of a query on other
tables in the database. It could include only certain columns or rows from a table or from
many tables. A view that restricts the user to certain rows is called a horizontal view, and a
vertical view restricts the user to certain columns. You are not restricted to purely horizontal
or vertical slices of data.
A view can be as complicated as you like. You can have grouped views, where the query
contains a GROUP BY clause. This makes the view a summary of the data in a table or tables.
If the list of column names is omitted the columns in the view take the same name as in the
underlying tables. You must specify column names if the query includes calculated columns
or two columns with the same name. There are several advantages to views, including
Security: Users can be given access to only the rows/columns of tablesthat concern them.
The table may contain data for the whole firm but they only see data for their department.
Data integrity: The WITH CHECK OPTION clause is used to enforce the query conditions on
any updates to the view. If the view shows data for a particular office the user can only enter
data for that office.
Simplicity: Even if a view is a multi-table query, querying the view still lookslike a single-
table query.
Protection from change: If the structure of the database changes, the user's view of the data
can remain the same.
Performance: A view may look like a single table but underneath the DBMSis usually still
running multi-table queries. If the view is complex then even simple queries can take a long
time.
Update restrictions: Updating the data through a view may or may not be possible. If the
view is complex, the DBMS may decide it can't perform updates and make the view read-
only.
Unit 7: SQL – 2 4
DCA2102: Database Management System Manipal University Jaipur (MUJ)
The ISO standard specifies five conditions that a view must meet in order to allow updates:
• The view must not have a DISTINCT clause.
• The view must only name one table in the FROM clause.
• All the columns must be real columns – no expressions, calculated columns, or column
functions
• The WHERE clause must not contain a sub-query
• There must be no GROUP BY or HAVING clause
You will find that most dialects of SQL are not quite so restrictive. The underlying principle
is that updates are allowed if the rows and columns of the view are traceable back to actual
rows and columns in tables.
A view is a relation (virtual other than base) and can be used in query expressions, that is,
queries can be written using the view as a relation. Views generally are not stored, since the
data in the base relations may change. The base relations on which a view is based are
sometimes called the existing relations. The definition of a view in a create view statement
is stored in the system catalog. Having been defined, it can be used as if the view really
represented a real relation. However, such a virtual relation defined by a view is recomputed
whenever a query refers to it.
Self-Assessment Questions – 1
Unit 7: SQL – 2 5
DCA2102: Database Management System Manipal University Jaipur (MUJ)
3. EMBEDDED SQL
We have looked at a wide range of SQL query constructs, treating SQL as an independent
language in its own right. A relational DBMS supports an interactive SQL interface, and users
can directly enter SQL commands. Thissimple approach is fine as long as the task at hand can
be accomplished entirely with SQL commands. In practice, we often encounter situations in
which we need the greater flexibility of a general-purpose programming language, in
addition to the data manipulation facilities provided by SQL. For example, we may want to
integrate a database application with a nice graphical user interface, or we may want to ask
a query that cannot be expressed in SQL.
To deal with such situations, the SQL standard defines how SQL commands can be executed
from within a program in a host language such as C or Java. The use of SQL commands within
a host language program is called embedded SQL. Details of embedded SQL also depend on
the host language. Although similar capabilities are supported for a variety of host languages,
the syntax sometimes varies.
There are, however, two complications to bear in mind. First, the data types recognized by
SQL may not be recognized by the host language, and vice versa. This mismatch is typically
addressed by casting data values appropriately before passing them to or from SQL
commands. (SQL, like C and other programming languages, provides an operator to cast
values of one type into values of another type.) The second complication has to do with the
fact that SQL is set-oriented; commands operate on and produce tables, which are sets (or
multisets) of rows. Programming languages do not typically have a data type that
corresponds to sets or multisets of rows. Thus, although SQL commands deal with tables, the
interface to the host language is constrained to be one row at a time.
Unit 7: SQL – 2 6
DCA2102: Database Management System Manipal University Jaipur (MUJ)
In our discussion of embedded SQL, we assume that the host language is Cfor concreteness
because minor differences exist in how SQL statements are embedded in different host
languages.
The first question that arises is which SQL types correspond to the variousC types since we
have just declared a collection of C variables whose values are intended to be read (and
possibly set) in an SQL run-timeenvironment when an SQL statement that refers to them is
executed. The SQL-92 standard defines such a correspondence between the host language
types and SQL types for a number of host languages. In our example c_sname has the type
CHARACTER(20) when referred to in an SQL statement, c_sid has the type INTEGER, c_rating
has the type SMALLINT, and c_age has the type REAL.
An important point to consider is that SQL needs some way to report what went wrong if an
error condition arises when executing an SQL statement. The SQL-92 standard recognizes
two special variables for reporting errors, SQLCODE and SQLSTATE. SQLCODE is the older
of the two and is defined to return some negative value when an error condition arises,
without specifying further just what error a particular negative integer denotes.
Unit 7: SQL – 2 7
DCA2102: Database Management System Manipal University Jaipur (MUJ)
SQLSTATE, introduced in the SQL-92 standard for the first time, associatespredefined values
with several common error conditions, thereby introducing some uniformity to how errors
are reported. One of these two variables must be declared. The appropriate C type for
SQLCODE is long and the appropriate C type for SQLSTATE is char[6], that is, a character
string that is five characters long. (Recall the null-terminator in C strings!) In this unit, we
will assume that SQLSTATE is declared.
As a simple example, the following embedded SQL statement inserts a row, whose column
values are based on the values of the host language variables contained in it, into the Sailors
relation:
EXEC SQL INSERT INTO Sailors VALUES (:c_sname, :c_sid, :c_rating, :c_age);
Observe that a semicolon terminates the command, as per the convention for terminating
statements in C.
The SQLSTATE variable should be checked for errors and exceptions after each embedded
SQL statement. SQL provides the WHENEVER command to simplify this tedious task:
The intent is that after each embedded SQL statement is executed, the value of SQLSTATE
should be checked. If SQLERROR is specified and the value of SQLSTATE indicates an
exception, control is transferred to stmt, which is presumably responsible for
error/exception handling. Control is alsotransferred to stmt if NOT FOUND is specified and
the value of SQLSTATE is 02000, which denotes NO DATA.
Unit 7: SQL – 2 8
DCA2102: Database Management System Manipal University Jaipur (MUJ)
Self-Assessment Questions – 2
6. The use of SQL commands within a host language program is called __________ .
7. The _________variable should be checked for errors andexceptions after
each embedded SQL statement.
4. TRANSACTION PROCESSING
A user writes data access/update programs in terms of the high-level query and update
language supported by the DBMS. To understand how the DBMS handles such requests, with
respect to concurrency control andrecovery, it is convenient to regard an execution of a user
program, or transaction, as a series of reads and writes of database objects:
To read a database object, it is first brought into main memory (specifically, some frame in
the buffer pool) from the disk, and then its value is copied into a program variable.
To write a database object, an in-memory copy of the object is first modified and then written
to disk.
Database ‘objects’ are the units in which programs read or write [Link] units could
be pages, records, and so on, but this is dependent on the DBMS and is not central to the
principles underlying concurrency control or recovery. In this unit, we will consider a
database to be a fixed collection of independent objects. When objects are added to or deleted
from a database, or there are relationships between database objects that wewant to exploit
for performance, some additional issues arise.
There are four important properties of transactions that a DBMS must ensure to maintain
data in the face of concurrent access and system failures:
1. Users should be able to regard the execution of each transaction as atomic: either all
actions are carried out or none are. Users should not have to worry about the effect of
incomplete transactions (say, when a system crash occurs).
2. Each transaction, run by itself with no concurrent execution of other transactions, must
preserve the consistency of the database. This property is called consistency, and the
Unit 7: SQL – 2 9
DCA2102: Database Management System Manipal University Jaipur (MUJ)
DBMS assumes that it holds for each transaction. Ensuring this property of a
transaction is the responsibility of the user.
3. Users should be able to understand a transaction without consideringthe effect of
other concurrently executing transactions, even if the DBMSinterleaves the actions of
several transactions for performance reasons. This property is sometimes referred to
as isolation: Transactions are isolated, or protected, from the effects of concurrently
scheduling other transactions.
4. Once the DBMS informs the user that a transaction has been successfully completed, its
effects should persist even if the system crashes before all its changes are reflected on
the disk. This property is called durability.
The acronym ACID is sometimes used to refer to the four properties of transactions that we
have presented here: atomicity, consistency, isolation, and durability. We now consider how
each of these properties is ensured in a DBMS.
The isolation property is ensured by guaranteeing that even though actions of several
transactions might be interleaved, the net effect is identical to executing all transactions one
after the other in some serial order. For example, if two transactions T1 and T2 are executed
concurrently, the net effect is guaranteed to be equivalent to executing (all of) T1 followed
Unit 7: SQL – 2 10
DCA2102: Database Management System Manipal University Jaipur (MUJ)
Database Consistency is the property in that every transaction sees a consistent database
instance. Database consistency follows from transaction atomicity, isolation, and
transaction consistency. Next, we discuss how atomicity and durability are guaranteed in a
DBMS.
Of course, since users think of transactions as being atomic, a transaction that is interrupted
in the middle may leave the database in an inconsistent state. Thus a DBMS must find a way
to remove the effects of partial transactions from the database, that is, it must ensure
transaction atomicity: either all of a transaction's actions are carried out, or none are. A
DBMS ensures transaction atomicity by undoing the actions of incomplete transactions. This
means that users can ignore incomplete transactions in thinking about how the database is
modified by transactions over time. Tobe able to do this, the DBMS maintains a record, called
the log, of all, writes to the database.
The log is also used to ensure durability. If the system crashes before the changes made by
a completed transaction are written to disk, the log isused to remember and restore these
changes when the system restarts.
Unit 7: SQL – 2 11
DCA2102: Database Management System Manipal University Jaipur (MUJ)
5. SUMMARY
In this unit, we dealt with the SQL data definition language used to create relations with
specified schemas. We discussed Views, embedded SQL, and transaction processing
properties. The SQL DDL supports a number of types including date and time types. SQL
queries can be invoked from host languages, via embedded SQL.
7 6. TERMINAL QUESTIONS
1. What is View? With an example, use the format of view statement tocreate view.
2. What is an Embedded SQL statement? Describe briefly.
3. Explain any two important properties of transactions that a DBMS must ensure to
maintain data in the face of concurrent access and system failures.
Unit 7: SQL – 2 12
DCA2102: Database Management System Manipal University Jaipur (MUJ)
7. ANSWERS
Self-Assessment Questions
1. Virtual
2. Tables
3. can remain the same.
4. DISTINCT
5. Existing
6. embedded SQL
7. SQLSTATE
8. Modified
9. Consistency
10. four properties
11. Users
12. Three
13. Log
Terminal Questions
1. A view is a virtual table which does not actually exist. It is made up of a query on other
tables in the database. It could include only certaincolumns or rows from a table or
from many tables. (Refer section 7.2 for detail)
2. The use of SQL commands within a host language program is called embedded SQL.
(Refer section 7.3 for detail)
3. There are four important properties of transactions that a DBMS must ensure to
maintain data in the face of concurrent access and system failures. They are (i)
Atomicity (ii) Consistency (iii) Isolation (iv) Durability. (Refer section 7.4 for detail)
Unit 7: SQL – 2 13
DCA2102: Database Management System Manipal University Jaipur (MUJ)
BACHELOR OF COMPUTER
APPLICATIONS
SEMESTER 3
DCA2102
DATABASE MANAGEMENT SYSTEM
Unit 8
Relational Algebra
Table of Contents
1. INTRODUCTION
Relational algebras received little attention until the publication of E.F. Codd's relational
model of data in 1970. Codd proposed such an algebra as a basis for database query
languages. The first query language to be based on Codd's algebra was ISBL, and this
pioneering work has been acclaimed by many authorities as having shown the way to make
Codd's idea into a useful language. Even the query language of SQL is loosely based on a
relational algebra, though the operands in SQL (tables) are not exactly relations and several
useful theorems about relational algebra do not hold inthe SQL counterpart (arguably to the
detriment of optimizers and/or users).
1.1 Objectives:
By the end of Unit 8, the learners should be able to understand:
2. BASIC OPERATIONS
Basic operations are the traditional set operations: union, difference, intersection, and
Cartesian product. Three of these four basic operations- union, intersection, and difference
– require that operand relations be union compatible. Two relations are union compatible if
they have the same parity and one-to-one correspondence of the attributes with the
corresponding attributes defined over the same domain. The Cartesian product can be
defined on any two relations. Two relations P and Q are said to be union compatible if both
P and Q are of the same degree n and the domain of the corresponding n attributes are
identical,
i.e.,
if P= { P1,..,Pn } and Q={ Q1,...Qn } then
Dom(Pi) = Dom(Qi) for i = { 1,2...n }
where Dom(Pi) represents the domain of the attribute Pi.
Example 1
In the examples to follow, we utilize two relations P and Q given in Figure 8.1. R is a
computed result relation. We assume that the relations P and Qin Figure 8.1 represent
employees working on the development of software application packages J1 (say) and J2
(say) respectively.
P:
Id Name
101 Jones
103 Smith
104 Lalonde
107 Evan
110 Drew
112 Smith
Q:
Id Name
103 Smith
104 Lalonde
106 Byron
110 Drew
Fig 8.1: Union compatible relations
The resultant relation, R = P U Q, has tuples drawn from P and Q such that
R = { t | P v t Q }
The result relation R contains tuples that are in either P or Q or in both of them. The duplicate
tuples are eliminated.
Remember that from our definition of union compatibility the degree of the relations P and
R is the same. The cardinality of the resultant relation depends on the duplication of tuples
in P and Q.
From the above expression, we can see that if all the tuples in Q were contained in P, then |R|
= |P| and R= P, while if the tuples in P and Q were disjoint, then |R| = |P|+|Q|.
Example 2
R, the union of P and Q given in Figure 8.1 in the above example 1 is shownin Figure 8.2(a).
R represents employees working on the packages J1 or J2, or both of these packages. Since a
relation does not have duplicate tuples, an employee working on both J1 and J2 will appear
in the relation R only once.
R:
Id Name
101 Jones
103 Smith
104 Lalonde
106 Byron
107 Evan
110 Drew
112 Smith
a) P Q
R:
Id Name
101 Jones
107 Evan
112 Smith
b) P-Q
R:
Id Name
103 Smith
104 Lalonde
110 Drew
c) PQ
Fig 8.2: Results of (a) union (b) difference and (c) intersection operations
2.2 Difference ( - )
The difference operation removes common tuples from the first relation.
R = P-Q such that
R = { t | t P t / Q }
Example 3
R, the result of P-Q, gives employees working only on package J1. (figure 8.2(b) in example
2). Employees working on both packages J1 and J2 have been removed.
Example 4
The resultant relation PQ is the set of all employees working on both the packages. (figure
8.2(c) of example 2).
It is, however, more convenient to write an expression with a single intersection operation
than one involving a pair of difference operations.
Note that in these examples the operand and the result relation schemes, including the
attribute names, are identical i.e. P=Q=R. If the attribute names of compatible relations are
not identical, the naming of the attributes of the result relation will have to be resolved.
R=PxQ
where a tuple r R is given by { t1|| t2 | t1 P t2 Q }, i.e. the result relation is obtained
by concatenating each tuple in relation P with each tuplein relation Q. Here, we represent
the concatenation operation.
Example 5
The Cartesian product of the PERSONNEL relation and SOFTWARE_PACKAGE relations of
figure 8.3(a) is shown in figure 8.3(b). Note that the relations P and Q from figure 8.1 of
Example 1 are a subset of the PERSONNEL relation.
PERSONNEL:
Id Name
101 Jones
103 Smith
104 Lalonde
107 Evan
110 Drew
112 Smith
Software Packages:
S
J1
J2
(a)
PERSONNEL:
Id P. Name S
101 Jones J1
101 Jones J2
103 Smith J1
103 Smith J2
104 Lalonde J1
104 Lalonde J2
106 Byron J1
106 Byron J2
107 Evan J1
107 Evan J2
110 Drew J1
110 Drew J2
112 Smith J1
112 Smith J2
(b)
Fig 8.3: (a) PERSONNEL (Emp#, Name) and SOFTWARE_PACKAGE(S)
represent employees and software packages respectively; (b) the Cartesianproduct of
PERSONNEL and SOFTWARE_PACKAGES.
Self-Assessment Questions – 1
The basic set operations, which provide a very limited data manipulation facility, have been
supplemented by the definition of the following operations: projection, selection, join, and
division. These operations arerepresented by symbols , , ⨝ and respectively. Projection
and selectionare unary operations; join and divisions are binary.
Id Name Name
101 Jones Jones
103 Smith Smith
104 Lalonde Lalonde
106 Byron Byron
107 Evan Evan
110 Drew Drew
112 Smith Smith
We defined the projection of a tuple ti over the attribute A, denoted ti[A] or A(ti), as (a),
where a is the value of tuple ti over the attribute A. Similarly, we define the projection of a
relation T, denoted by T[A] or A(ti), on the attribute A. This is defined in terms of the
projection for each tuple in ti belonging to T on the attribute A as:
T[A] = { ai | ti [A] = ai ti T}
where T[A] is a single attribute relation and | T[A] | £ T. The cardinality T[A] may be less
than the cardinality |T| because of the deletion of any duplicates in the result. A case in point
is illustrated in figure 8.4.
T[X] = { ti[A] | ti T }
Here A belongs to X where ti[A] represents the concatenation of all ti[A] for all A X (A
belongs to X)
Simply stated, the projection of a relation P on the set of attribute names Y belongs to P is the
projection of each tuple of the relation P on the set of attribute names Y.
Note that the projection operation reduces the arity if the number of attributes in X is less
than the arity of the relation. The projection operation may also reduce the cardinality of the
result relation since duplicate tuples are removed. (Note that the projection operation
produces a relation as the result. By definition, a relation cannot have duplicate tuples. In
most commercial implementations of the relational model, however, the duplicates would
still be present in the result).
the tuples are included in the result. To have a tuple included in the result relation, the
specified selection conditions or predicates must be satisfied byit. The selection operation is
sometimes known as the restriction operation.
PERSONNEL:
Id Name
101 Jones
103 Smith
104 Lalonde
107 Evan
110 Drew
112 Smith
Results of Selection
Id Name
101 Jones
103 Smith
104 Lalonde
Any finite number of predicates connected by Boolean operators may be specified in the
selection operation. The predicates may define a comparison between two domain-
compatible attributes or between an attribute and a constant value; if the comparison is
between attribute A1 andconstant c1, then c1 belongs to Dom (A1).
Given a relation P and a predicate expression B, the selections of those tuples of relation P
that satisfy the predicate B is a relation R written as:
R = B (P)
The above expression could be read as "select those tuples t from P in which the predicate
B(t) is true". The set of tuples with R is in this case defined as follows:
R= { t | t to P B (t) }
Example 6
In figure 8.6 we encounter the following relations:
ASSIGNMENT (Emp#, Prod#, Job#)
JOB_FUNCTION (Job#, Title)
EMPLOYEE:
Emp# Name Profession
101 Jones Analyst
103 Smith Programmer
104 Lalonde Receptionist
106 Byron Receptionist
107 Evan VPR & D
110 Drew VP operations
112 Smith Manager
PRODUCT:
Prod# Prod-Name Prod-details
HEAP1 HEAP-SORT ISS Module
BINS9 BINARY-SEARCH ISS/R Module
FM6 FILE-MANAGER ISS/R-PC Subsys
B++1 B++_TREE ISS/R Turbo Sys
B++2 B++_TREE VPR & D
a)
JOB-FUNCTION:
Job# Title
1000 CEO
900 President
800 Manager
700 Chief Programmer
600 Analyst
ASSIGNMENT
Emp# Prod# Job#
107 HEAP1 800
101 HEAP1 600
110 BINS9 800
103 HEAP1 700
101 BINS9 700
110 FM6 800
107 B++1 800
b)
Fig 8.6: (a) Relation schemes for employee role in development teams
b) Sample relations
Suppose we want to respond to the query “Get product number of assignments whose
development teams have a chief programmer”. This requires first computing the Cartesian
product of the ASSIGNMENT and JOB_FUNCTION relations. Let us name this product relation
TEMP. This is followed by selecting those tuples of TEMP where the attribute Title has the
value chief programmer and the value of the attribute Job# in ASSIGNMENT and
JOB_FUNCTION are the same. The required result, shown below is obtained by projecting
these tuples on the attribute Prod#. The operations are specified below.
TEMP = (ASSIGNMENT X JOB_FUNCTION)
Prod#( Title = 'chief programmer' [Link]# (TEMP))
Prod#
HEAP1
BINS9
In another method of responding to this query, we can first select those tuples from the
JOB_FUNCTION relation so that the value of the attribute Title is a chief programmer. Let us
call this set of tuples the relation TEMP1. We then compute the Cartesian product of TEMP1
and ASSIGNMENT, calling the product TEMP2. This is followed by a projection on Prod# over
TEMP2 to give us the required response. These operations are specified below:
TEMP1 = ( Title = 'chief programmer' (JOB_FUNCTION))
TEMP2 = ( [Link]# = JOB_FUNCTION.Job# (ASSIGNMENTX TEMP1))
Prod# (TEMP2) gives the required result.
Notice that in the selection operation that follows the Cartesian product we take only those
tuples where the value of the attributes [Link]#and JOB_FUNCTION.Job# are the
same. These combined operations of the cartesian product followed by selection are the join
operation. Note that we have qualified the identically named attributes by the name of the
corresponding relation to distinguishing them.
In case of the join of a relation with itself, we would need to rename either the attributes of
one of the copies of the relation or the relation name itself. We illustrate this in example 7.
In general, the join condition may have more than one term, necessitating the use of the
subscript in the comparison operator. Now we shall define the different types of join
operations.
In these discussions, we use P, Q, R, and so on to represent both the relation scheme and the
collection or bag of underlying domains of the attributes. We call it a bag of domains because
more than one attribute may be defined on the same domain.
Typically, P С Q may be null and this guarantees the uniqueness of attributenames in the
result relation. When the same attribute name occurs in thetwo schemes we use qualified
names.
Two common and very useful variants of the join are the equi-join and the natural join. In
the equi-join the comparison operator theta (i=1,2,...,n) is always the equality operator (=).
Similarly, in the natural join, the comparison operator is always the equality operator.
However, only one of the two sets of domain compatible attributes involved in the natural
join is Ai from P and Bi from Q, for i = 1,...n, the natural join predicate is a conjunction of
terms of the following form:
( t1[Ai] = t2[Bi] ) for i = 1,2...n
Domain compatibility requires that the domains of Ai and Bi be compatible, and for this
reason relation scheme P and Q have attributes defined on common domains, i.e., P С Q¹.
Therefore, join attributes have common domains in the relation schemes P and Q.
Consequently, only one set of the join attributes on these common domains needs to be
preserved in the result relation. This is achieved by taking a projection after the join
operation, thereby eliminating the duplicate attributes. If the relations P and Q have
attributes with the same domain, but different attribute names, then renaming or projection
may be specified.
Example 7
Given the EMPLOYEE and SALARY relations of figure 8.7(i), if we have to find the salary of
employees by name, we join the tuples in the relation EMPLOYEE with those in SALARY such
that the value of the attribute Id in EMPLOYEE is the same as that in SALARY. The natural
join takes the predicate expression to be [Link] = [Link]. The result of the natural
join is shown in figure 8.7(ii). When using the natural join, we do not need to specify this
predicate. The expression to specify the operation of finding the salary of employees by name
is given as follows. Here we project the result of the natural join operation on the attributes
Name and Salary:
SALARY:
Id Salary
101 67
103 55
104 75
107 80
[Link]# [Link]#
107 107
107 101
107 103
101 107
101 101
101 103
110 101
103 107
103 101
103 103
101 110
Example 8
Given the relations P and Q, as shown in figure 8.8(a), the result of dividingP by Q is the
relation R and it has two tuples. For each tuple in R, its product with the tuples of Q must be
in P. In our example (a1,b1) and (a2,b2) must both be tuples in P; the same is true for (a5,b1)
and (a5,b2).
Simply stated, the cartesian product of Q and R is a subset of P. In figure 8.8(b), the result
relation R has four tuples; the cartesian product of R and Qgives a resulting relation which is
again a subset of P. In figure 8.8(c) since there are no tuples in P with a value b3 for the
attribute B (i.e., selecting B=b3(P) = 0), we have an empty relation R, which has a cardinality
of zero.
Fig 8.8: Examples of the division operation. (a) R = P / Q; (b) R = P / Q(P is same in
part i); (c) R=P/Q (P is same as in part i); (d) R= P/Q (P is same as in part i)
In figure 8.8(d), the relation Q is empty. The result relation can be definedas the projection
of P on the attributes in P-Q. However, it is usual to disallow division by an empty relation.
Let us treat the Q as representing one set of properties (the properties are defined on the Q,
each tuple in Q representing an instance of these properties) and the relation r as
representing entities with these properties (entities are defined on P-Q, and the properties
are, as before, defined on Q); note that P Q must be equal to P. Each tuple in P represents
an object with some given property. The resultant relation R, then, is the set of entities that
possesses all the properties specified in Q. The two entities a1 and a5 possess all the
properties, i.e., b1 and b2. The other entities in P, a2, a3, and a4, only possess one, not both, of
the properties. The divisionoperation is useful when a query involves the phrase "for all
objects having all the specified properties." Note that both P-Q and Q in general representa
set of attributes. It should be clear that Q is not a subset of P.
Self-Assessment Questions – 2
4. SUMMARY
In this unit, we discussed that relational algebra defines a set of algebraic operations that
operate on tables, and output tables as their results. These operations can be combined to
get expressions that express desired queries. The algebra defines the basic operations used
within relational query languages.
The operations in relational algebra can be divided into two types. One is Basic operations
and the other is Additional operations that can be expressed interms of the basic operations.
We used relational algebra with the assignment operator to express these modifications.
5. TERMINAL QUESTIONS
1. Explain the statement that relational algebra operators can be composed. Why is the
ability to compose operators importantly?
2. Given two relations R1 and R2, where R1 contains N1 tuples, R2 contains N2 tuples, and
N2 > N1 > 0, give the minimum and maximum possible sizes (in tuples) for the result
relation produced by each of the following relational algebra expressions. In each case,
state any assumptions about the schemas for R1 and R2 that are needed to makethe
expression meaningful:
a. R1 υ R2
b. R1 ∩ R2
c. R1− R2
d. R1 x R2
e. a=5 (R1)
f. a (R1)
g. R1/R2
3. Consider the following schema:
Suppliers(sid: integer, sname: string, address: string)
Parts(pid: integer, pname: string, color: string)
Catalog(sid: integer, pid: integer, cost: real)
The key fields are underlined, and the domain of each field is listed after thefield name. Thus
sid is the key for Suppliers, pid is the key for Parts, and sidand pid together form the key for
Catalog. The Catalog relation lists the prices charged for parts by Suppliers. Write the
following queries in relational algebra.
1. Find the names of suppliers who supply some red part.
2. Find the sids of suppliers who supply some red or green part.
3. Find the sids of suppliers who supply some red part or are at 221Packer Ave.
4. Find the sids of suppliers who supply some red part and some greenpart.
5. Find the sids of suppliers who supply every part.
6. Find the sids of suppliers who supply every red part.
7. Find the sids of suppliers who supply every red or green part.
8. Find the sids of suppliers who supply every red part or supply everygreen part.
6. ANSWERS
Self-Assessment Questions
1. Set
2. Two
3. Union
4. Common
5. Intersection
6. Concatenation
7. Commutative Difference
8. Vertical
9. Reduce
10. Horizontal
11. Join
12. Relationships
13. natural join
14. equi-join
15. comparison
Terminal Questions
1. A set of basic operators in the Relational Algebra can take relations as their operands
and return the result as a relation. Hence the output of one operation may be used as
input to another operation. In other words operators are composed. (Refer section 3
for detail)
2. Refer section 2 for detail.
3. Refer section 2 and 3 for detail.
BACHELOR OF COMPUTER
APPLICATIONS
SEMESTER 3
DCA2102
DATABASE MANAGEMENT SYSTEM
Unit 9
Relational Calculus
Table of Contents
1. INTRODUCTION
Relational calculus is an alternative to relational algebra. In contrast to algebra, which is
procedural, the calculus is nonprocedural, or declarative, in that it allows us to describe
the set of answers, without being explicitabout how they should be computed. Relational
calculus has had a big influence on the design of commercial query languages such as SQL
and, especially, Query-by-Example (QBE).
The variant of the calculus that we present in detail is called the tuple relational calculus
(TRC). Variables in TRC take on tuples as values. In another variant, called the domain
relational calculus (DRC), the variables range over field values. TRC has had more of an
influence on SQL, while DRC has strongly influenced QBE.
1.1 Objectives:
After learning this unit, you should be able to:
For Example: Find all sailors with a rating above 7. (Sailors- S is a relation)
{ S | S ε Sailors ^ S: rating > 7 }
When this query is evaluated on an instance of the Sailors relation, the tuplevariable S is
instantiated successively with each tuple, and the test [Link]>7 is applied. The answer
contains those instances of S that pass this test.
A formula is recursively defined to be one of the following, where p and q are themselves
formulas, and p(R) denotes a formula in which the variableR appears:
In the last two clauses above, the quantifiers “For any” and “For all” are said to bind the
variable R. A variable is said to be free in a formula or subformula (a formula contained in a
larger formula) if the subformula does not contain an occurrence of a quantifier that binds
it.
We observe that every variable in a TRC formula appears in a subformula that is atomic, and
every relation schema specifies a domain for each field; this observation ensures that each
variable in a TRC formula has a well-defined domain from which values for the variable are
drawn. That is, each variable has a well-defined type, in the programming language sense.
Informally, an atomic formula R ε Rel gives R the type of tuples in Rel, and comparisons such
as R.a op S.b and R.a op constant induce type restrictions on the field R.a. If a variable R does
not appear in an atomic formula of the form R ε Rel (i.e., it appears only in atomic formulas
that are comparisons), we will follow the convention that the type of R is a tuple, whose fields
include all (and only) fields of R that appear in the formula.
We will not define types of variables formally, but the type of a variable should be clear in
most cases, and the important point to note is that comparisons of values having different
types should always fail. (In discussions of relational calculus, the simplifying assumption is
often made that there is a single domain of constants and that this is the domain associated
with each field of each relation.)
A TRC query is defined to be an expression of the form { T | p(T) }, where T is the only free
variable in the formula p.
A query is evaluated on a given instance of the database. Let each free variable in a formula
F be bound to a tuple value. For the given assignment of tuples to variables, with respect
to the given database instance,F evaluates to (or simply ‘is') true if one of the following
holds:
• F is an atomic formula R ε Rel, and R is assigned a tuple in the instanceof relation Rel.
• F is a comparison R.a op S.b, R.a op constant, or constant op R.a, and the tuples assigned
to R and S have field values R.a and S.b that make the comparison true.
• F is of the form ¬p, and p is not true or of the form p ^ q and both p andq are true; or of
the form p ν q, and one of them is true, or of the form (p)q and q is true whenever p is
true.
• F is of the form For Any R(p(R)), and there is some assignment of tuplesto the free
variables in p(R), including the variable R that makes the formula p(R) true.
• F is of the form For all R(p(R)), and there is some assignment of tuplesto the free
variables in p(R) that makes the formula p(R) true no matter what tuple is assigned to
R.
Similarly, we use the notation For all R ε Rel(p(R)) for all R(R ε Rel) p(R)).Find the names
and ages of sailors with a rating above 7.
This query illustrates a useful convention: P is considered to be a tuple variable with exactly
two fields, which are called name and age, because these are the only fields of P that are
mentioned and P does not range over any of the relations in the query; that is, there is no
subformula of the formP ε Relname. The result of this query is a relation with two fields,
name and age. The atomic formulas [Link] = [Link] and [Link] = [Link] give values to the
fields of an answer tuple P. On instances B1, R2, and S3, the answer is the set of tuples
<Lubber, 55:5>, <Andy, 25:5>, <Rusty, 35.0>, <Zorba, 16.0>, and <Horatio, 35.0>.
Find the sailor name, boat id, and reservation date for each reservation.
For each reserve tuple, we look for a tuple in Sailors with the same sid. Given a pair of such
tuples, we construct an answer tuple P with fields sname, bid, and day by copying the
corresponding fields from these two tuples. This query illustrates how we can combine
values from different relations in each answer tuple. The answer to this query on instances
B1, R2, and S3 is shown in Figure 9.4.
This query can be read as follows: “Retrieve all sailor tuples for which thereexists a tuple in
Reserves, having the same value in the sid field, and with bid = 103." That is, for each sailor
tuple, we look for a tuple in Reserves thatshows that this sailor has reserved boat 103. The
answer tuple P contains just one field, sname.
(Q2) Find the names of sailors who have reserved a red boat.
This query can be read as follows: “Retrieve all sailor tuples S for which there exist tuples R
in Reserves and B in Boats such that [Link] = [Link], [Link] = [Link], and [Link] = ‘red’.” Another
way to write this query, which corresponds more closely to this reading, is as follows:
(Q7) Find the names of sailors who have reserved at least two boats.
Contrast this query with the algebra version and see how much simpler the calculus version
is. In part, this difference is due to the cumbersome renaming of fields in the algebra version,
but the calculus version really is simpler.
(Q9) Find the names of sailors who have reserved all boats.
This query was expressed using the division operator in relational algebra. Notice how easy
it is expressed in calculus. The calculus query directly reflects how we might express the
query in English: “Find sailors S such thatfor all boats B there is a Reserves tuple showing
that sailor S has reserved boat B.”
This query can be read as follows: For each candidate (sailor), if a boat is red, the sailor must
have reserved it. That is, for a candidate sailor, a boat being red must imply the sailor having
reserved it. Observe that since we can return an entire sailor tuple as the answer instead of
just the sailor's name, we have avoided introducing a new free variable (e.g., the variable P
in the previous example) to hold the answer values. In instances B1, R2, and S3, the answer
contains the Sailors tuples with sids 22 and 31.
We can write this query without using implication, by observing that an expression of the
form p=>q is logically equivalent to ¬pνq:
This query should be read as follows: “Find sailors S such that for all boats B, either the boat
is not red, or a Reserves tuple shows that sailor S has reserved boat B.”
Self-Assessment Questions – 1
A DRC formula is defined in a manner that is very similar to the definition of a TRC formula.
The main difference is that the variables are now domain variables. Let op denote an
operator in the set { <, >,=, ≤, ≥, ≠ } and let X and Y be domain variables.
A formula is recursively defined to be one of the following, where p and q are themselves
formulas, and p(X) denotes a formula in which the variable X appears: any atomic formula.
The reader is invited to compare this definition with the definition of TRC formulas and see
how closely these two definitions correspond. We will not define the semantics of DRC
formulas formally; this is left as an exercise forthe reader.
This differs from the TRC version in giving each attribute a (variable) name. The condition <
I,N,T,A > ε Sailors ensures that the domain variables I, N, T,and A are restricted to be fields of
the same tuple. In comparison with the TRC query, we can say T > 7 instead of [Link] > 7,
but we must specifythe tuple < I,N, T,A> in the result, rather than just S.
(Q1) Find the names of sailors who have reserved boat 103.
Notice that only the sname field is retained in the answer and that only N isa free variable.
We use the notation For any Ir, Br, D(…) as a shorthand for For any Ir(For any Br(For any D(:
: :))).
Very often, all the quantified variables appear in a single relation, as in this example. An even
more compact notation in this case is For any < Ir, Br, D> ε Reserves. With this notation, which
we will use henceforth, the above query would be as follows:
The comparison with the corresponding TRC formula should now be straightforward. This
query can also be written as follows; notice the repetition of variable I and the use of the
constant 103:
(Q2) Find the names of sailors who have reserved a red boat.
(Q7) Find the names of sailors who have reserved at least two boats
〈〉
〈〉|{N | I , T, A
( I, N,T, A
Sailors
Br1, Br2, D1, D2 (
I, Br1, D1 Reserves
I, Br2, D2 Reserves Br1 Br2)
〈〉〈〉Notice how the repeated use of variable I ensures that the same sailor has reserved both
the boats in question.
(Q9) Find the names of sailors who have reserved all boats.
This query can be read as follows: “Find all values of N such that there is some tuple <I,N,T,A>
in Sailors satisfying the following condition: for every <B, BN, C >, either this is not a
tuple in Boats or there is some tuple <Ir, Br,D> in Reserves that proves that Sailor I has
reserved boat B.” The For all quantifier allows the domain variables B, BN, and C to range
overall values in their respective attribute domains, and the pattern ‘ ¬(<B, BN,C> ε Boats) ν
’ is necessary to restrict attention to those values that appear in tuples of Boats. This pattern
is common in DRC formulas, and the notation For all <B, BN,C> ε Boats can be used as a
shorthand instead. This is similar to the notation introduced earlier for for any. With this
notationthe query would be written as follows:
Here, we find all sailors such that for every red boat there is a tuple inReserves that
shows the sailor has reserved it.
Self-Assessment Questions – 2
6. A _______is a variable that ranges over the values in the domainof some attribute.
7. A DRC formula is defined in a manner that is very similar to thedefinition of a
_______.
8. The main difference between the DRC and TRC is that the variablesare now
_______variables.
Consider the query { S | ¬(S ε Sailors) }. This query is syntactically correct. However, it asks
for all tuples S such that S is not in (the given instance of) Sailors. The set of such S tuples is
obviously infinite, in the context of infinitedomains such as the set of all integers. This simple
example illustrates an unsafe query. It is desirable to restrict relational calculus to disallow
unsafe queries.
We now sketch how calculus queries are restricted to be safe. Consider a set I of relation
instances, with one instance per relation that appears in the query Q. Let Dom(Q,I) be the set
of all constants that appear in these relation instances I, or in the formulation of the query Q
itself. Since we only allow finite instances I, Dom(Q, I) is also finite.
For a calculus formula Q to be considered safe, at a minimum we want to ensure that for any
given I, the set of answers for Q contains only values that are in Dom(Q, I).
While this restriction is obviously required, it is not enough. Not only do we want the set of
answers to be composed of constants in Dom(Q,I), we wish to compute the set of answers by
only examining tuples that contain constants in Dom(Q, I)! This wish leads to a subtle point
associated with theuse of quantifiers For all and For any: Given a TRC formula of the form
For any R(p(R)), we want to find all values for variable R, that make this formulatrue by
checking only tuples that contain constants in Dom(Q, I). Similarly, given a TRC formula of
the form For all R(p(R)), we want to find any values for variable R, that make this formula
false, by checking only tuples that contain constants in Dom(Q, I).
Note that this definition is not constructive, that is, it does not tell us how to check if a query
is safe.
The query Q = { S |¬(S 2 Sailors) } is unsafe by this definition. Dom(Q,I) is the set of all values
that appear in (an instance I of) Sailors. Consider the instance S1 shown in Figure 9.1. The
answer to this query obviously includes values that do not appear in Dom(Q, S1). Returning
to the questionof expressiveness, we can show that every query that can be expressed using
a safe relational calculus query, can also be expressed as a relational algebra query. The
expressive power of relational algebra is often used as ametric of how powerful a relational
database query language is. If a query language can express all the queries that we can
express in relational algebra, it is said to be relationally complete. A practical query
language isexpected to be relationally complete; in addition, commercial query languages
typically support features that allow us to express some queries that cannot be expressed in
relational algebra.
Self-Assessment Questions – 3
9. Can every query that can be expressed in relational algebra also beexpressed
in relational calculus?
10. If a query language can express all the queries that we can express inrelational
algebra, it is said to be __________.
5. SUMMARY
In this unit, we dealt with that instead of describing a query by how to compute the output
relation, a relational calculus query describes the tuples in the output relation. The language
for specifying the output tuples is essentially arestricted subset of first-order predicate logic.
In tuple relational calculus, variables take on tuple values and in domain relational calculus,
variables take on field values, but the two versions of the calculus are very similar.
All relational algebra queries can be expressed in relational calculus. If we restrict ourselves
to safe queries on the calculus, the converse also holds. An important criterion for
commercial query languages is that they should berelationally complete in the sense that
they can express all relational algebra queries.
6. TERMINAL QUESTION
1. What is relational completeness? If a query language is relationallycomplete, can you
write any desired query in that language?
2. What is an unsafe query? Give an example and explain why it isimportant to disallow
such queries.
3. Let the following relation schemas be given:
R = (A,B,C)
S = (D,E, F)
Let relations r(R) and s(S) be given. Give an expression in the tuplerelational
calculus that is equivalent to each of the following:
a. ΠA(r)
b. σB =17 (r)
c. r×s
d. ΠA,F (σC =D(r × s))
Let R = (A, B, C), and let r1 and r2 both be relations on schema R. Give an expression in
the domain relational calculus that is equivalent to eachof the following:
a. ΠA(r1)
b. σB =17 (r1)
c. r1 ∪ r2
d. r1 ∩ r2
e. r1 − r2
f. ΠA,B(r1) ∪ B,C(r2)
4. How do you differentiate relational algebra and relational calculus?
7. ANSWERS
Self-Assessment Questions
1. tuple variable
2. tuple relational calculus
3. bind
4. TRC
5. Instance
6. domain variable
7. TRC formula
8. Domain
9. Yes
10. relationally complete
Terminal Questions
1. If a query language can express all the queries that we can express in relational
algebra, it is said to be relationally complete. Yes, we can write the desired query in
that language if features are supported. (Refer section 4)
2. Queries where the set of S tuples is obviously infinite in the context of infinite
domains such as the set of all integers then such queries are unsafe queries.
3. Refer the whole unit for detail.
4. All relational algebra queries can be expressed in relational calculus. If we restrict
ourselves to safe queries on the calculus, the converse also holds.
BACHELOR OF COMPUTER
APPLICATIONS
SEMESTER 3
DCA2102
DATABASE MANAGEMENT SYSTEM
Unit 10
Normalization
Table of Contents
1. INTRODUCTION
The basic objective of normalization is to reduce redundancy which means that information
is to be stored only once. Storing information several times leads to wastage of storage space,
and an increase in the total size of the data stored. Relations are normalized so that when
relations in a database are tobe altered during the lifetime of the database, we do not lose
information or introduce inconsistencies. The type of alterations normally needed for
relations are:
• Insertion of new data values to a relation. This should be possiblewithout being
forced to leave blank fields for some attributes.
• Deletion of a tuple, namely, a row of a relation. This should be possiblewithout losing
vital information unknowingly.
• Updating or changing a value of an attribute in a tuple. This should be possible
without exhaustively searching all the tuples in the relation.
1.1 Objectives
After going through this unit, the reader should be able to:
2. FUNCTIONAL DEPENDENCY
As the concept of dependency is very important, it is essential that we first understand it well
and then proceed to the idea of normalization. There is nofool-proof algorithmic method of
identifying dependency. We have to use our commonsense and judgment to specify
dependencies.
Let X and Y be the two attributes of a relation. Given the value of X, if there is only one value
of Y corresponding to it, then Y is said to be functionally dependent on X. This is indicated by
the notation:
X→Y
For example, given the value of the item code, there is only one value of the item name for it.
Thus item name is functionally dependent on the item code. This is shown as:
Item code → item name
Similarly in Table 10.1, given an order number, the date of the order is known.
Thus: Order no➔ Order date
Thus :
Order no., Item code → Qty., Price
As another example, consider the relation
In this relation, Name is functionally dependent on Roll no. In fact, given thevalue of Roll
no., the values of all other attributes can be uniquely
determined. Name and Department are not functionally dependent, because,given the name
of a student, one cannot find his department uniquely. This is due to the fact that there may
be more than one student with the same name. Name in this case is not a key. Department
and Year of study are not functionally dependent, as Year of study pertains to a student,
whereas Department is an independent attribute. The functional dependency in this relation
is shown in the following figure as a dependency diagram. Such dependency diagrams shown
in figure 10.1 are very useful in normalization.
Relation Key: Consider the relation of Table 10.1. Given the Vendor code, the Vendor name
and Address are uniquely determined. Thus Vendor code is the relation key. Given a relation,
if the value of an attribute X uniquely determines the value of all other attributes in a row,
then X is said to be the key of that relation. Sometimes more than one attribute is needed to
uniquely determine other attributes in a relation row. In that case, such a set of attributes is
the key.
In table 10.1, Order no. and Item code together form the key. In the relation "Supplies"
(Vendor code, Item code, Qty supplied, Date of supply, Price/unit) Vendor code and Item
code together form the key. This dependency is shown in the following diagram (figure 10.2).
Observe that in the figure the fact that Vendor code and Item code together form a composite
key is clearly shown by enclosing them together in a rectangle.
1. Let X and Y be the two attributes of a relation, The Functional Dependency can
be written as _________.
2. Functional dependency may also be based on a attribute.
3. ANOMALIES IN A DATABASE
Consider the following relation scheme pertaining to the information about a student
maintained by a university:
STDINF(Name, Course, Phone_No, Major, Prof, Grade)
Table 10.2 shows some tuples of a relation on the relation scheme STDINF(Name, Course,
Phone_No, Major, Prof, Grade). The functionaldependencies among its attributes are shown
in Figure 10.3. The key of the relation is Name Course and the relation has, in addition, the
following functional dependencies {Name➔ Phone_No, Name➔ Major, Name Course➔
Grade, Course➔ Prof }.
The relation scheme STDINF can lead to several undesirable problems: Redundancy: The
aim of the database system is to reduce redundancy,meaning that information is to be
stored only once. Storing information several times leads to the waste of storage space
and an increase in the total size of the data stored.
Updates to the database with such redundancies have the potential ofbecoming inconsistent
as explained below. In the relation of table 10.2, the Major and Phone_No. of a student are
stored several times in the database:once for each course that is or was taken by a student.
Update Anomalies: Multiple copies of the same fact may lead to update anomalies or
inconsistencies when an update is made, and only some of the multiple copies are updated.
Thus, a change in the Phone_No. of Jones must be made, for consistency, in all tuples
pertaining to the student Jones. If one of the three tuples of Figure 10.3 is not changed to
reflect the new Phone_No. of Jones, there will be an inconsistency in the data.
Insertion Anomalies: If this is the only relation in the database showing the association
between a faculty member and the course he or she teaches, the fact that a given professor
is teaching a given course cannot be entered into the database unless a student is registered
in the course. Also, if another relation also establishes a relationship between a course and a
professor, who teaches that course, the information stored in these relations has to be
consistent.
Deletion Anomalies: If the only student registered in a given course discontinues the
course, the information as to which professor is offering thecourse will be lost, if this is the
only relation in the database showing the association between a faculty member and the
course she or he teaches. If another relation in the database also establishes the relationship
between a course and a professor, who teaches that course, the deletion of the last tuple in
STDINF for a given course will not cause the information about the course's teacher to be
lost.
The problems of database inconsistency and redundancy of data are similarto the problems
that exist in the hierarchical and network models. These problems are addressed in the
network model by the introduction of virtual fields and in the hierarchical model by the
introduction of virtual records. In the relational model, the above problems can be remedied
by decomposition. We define decomposition as follows:
Definition: Decomposition
The decomposition of a relation scheme R = (A1 ,A2,...,An) is its replacement by a set of
relation schemes {R1,R2,...,Rm} such that R1 R for 1 i m and R1 R2 Rm = R.
The problems in the relation scheme STDINF can be resolved if we replace it with the
following relation schemes:
STUDENT _ INFO (Name,Phone_No,Major)
TRANSCRIPT (Name,Course,Grade)
TEACHER (Course, Prof)
The first relation scheme gives the phone number and the major of each student, and such
information will be stored only once for each student. Anychange in the phone number will
thus require a change in only one tuple of this relation.
The second relation scheme stores the grade of each student in each course that the student
is or was enrolled in. The third relation scheme records the teacher of each course. One of
the disadvantages of replacing the original relation scheme STDINF with the three relation
schemes is that the retrieval of certain information requires a natural join operation to be
performed. For instance, to find the majors of a student who obtained a grade of A in course
353 requires a join to be performed: (STUDENT_INFO |x| TRANSCRIPT). The same
information could be derived from the original relation STDINF by selection and projection.
When we replace the original scheme STDINF with the relation schemes STUDENT_INFO,
TRANSCRIPT, and TEACHER, the consistency and referential integrity constraints have to
be enforced. The referential integrity enforcement implies that if a tuple in the relation
TRANSCRIPT exists, such as (Jones, 353, in prog), a tuple must exist in STUDENT_INFO with
Name =Jones, and furthermore, a tuple must exist in STUDENT_INFO with Course = 353.
The attribute Name, which forms part of the key of the relation TRANSCRIPT, is a key of the
relation STUDENT_INFO. Such an attribute (or a group of attributes), which establishes a
relationship between specific tuples (of the same or two distinct relations), is called a foreign
key. Notice that the attribute Course in relation to TRANSCRIPT is also a foreign key since it
is a key of the relation TEACHER.
Note that the decomposition of STDINF into the relation schemes STUDENT (Name,
Phone_No, Major, Grade) and COURSE (Course, Prof.) is a bad decomposition for the
following reasons:
1. Redundancy and update anomaly, because the data for the attributesPhone_no and
Major are repeated.
2. Loss of information, because we lose the fact that a student has a given grade in a
particular list.
Self-Assessment Questions – 2
The idea of normalizing relations to higher and higher normal forms is toattain the goal
of having a set of ideal relations meeting the above criteria.
Self-Assessment Questions – 3
5. FIRST NORMALIZATION
The relation shown in table 10.1 is said to be in the First Normal Form, abbreviated as 1NF.
This form is also called a flat-file. There are nocomposite attributes, and every attribute is
single and describes one property. How do we convert an un-normalized table into the first
normal form? Consider the following example:
Table 10.3 is not in the first normal form because the Location column contains multiple
values. For example, the first row includes values "Delhi" and"Kolkata". To bring this table
to its first normal form, we split the table into two tables and now we have the resulting
tables:
1 Production 1 Delhi
2 Sales 1 Kolkata
3 Marketing 2 Mumbai
4 Research 3 Chennai
4 Goa
4 Gurugram
Now the first normal form is satisfied, as the columns on each table all hold just one value.
Converting a relation to the 1NF form is the first essential step in normalization. There are
successive higher normal forms known as 2NF, 3NF, BNCF, 4NF, and 5NF. Each form is an
improvement over the earlier form. In other words, 2NF is an improvement on 1NF, 3NF is
an improvement on 2NF, and so on. A higher normal form relation is a subsetof a lower
normal form as shown in the following figure 10.4. The higher normalization steps are based
on three important concepts:
This relation is in 1NF. The key is (Order no., Item code). The dependency diagram for
attributes of this relation is shown in figure 10.5. The non-key attribute Price/Unit is
functionally dependent on the Item code which is part of the relation key. Also, the non-key
attribute Order date is functionally dependent on Order no. which is a part of the relation
key. Thusthe relation is not in 2NF. It can be transformed to 2NF by splitting it into three
relations as shown in table 10.4.
In table 10.4 the relation Orders has Order no. as the key. The relation Order details
have the composite key Order no. and Item code.
In both relations, the non-key attributes are functionally dependent on the whole key.
Observe that by transforming to 2NF relations the repetition of the Order date (table 10.1)
has been removed. Further, if an order for an item is cancelled, the price of an item is not
lost. For example, if Order no.1886 for item code 4629 is cancelled in table 10.1, then the
fourth row will beremoved, and the price of the item is lost. In table 10.4 only the fourth row
oftable 10.4(b) is omitted. The item price is not lost as it is available in table 10.4(c). The date
of the order is also not lost as it is in table 10.4(a).
Table 10.4: Splitting of Relation given in table 10.1 into 2NF Relations
a) Orders
Order no. Order date
1456 260289
1886 040389
1788 040489
b) Order Details
Order no. Item Code Qty.
1456 3687 52
1456 4627 38
1456 3214 20
1886 4629 45
1886 4627 30
1788 4627 40
c) Prices
Item code Price/unit
3687 50.40
4627 60.20
3214 17.50
4629 20.25
These relations in 2NF form meet all the "ideal" conditions specified. Observe that the three
relations obtained are self-contained. There is no duplication of data within a relation.
Self-Assessment Questions – 5
Observe that given the year of the student, his hostel is known and vice versa. The
dependency of the hostel on year leads to duplication of data as is evident from table 10.5. If
it is decided to ask all first-year students to move to Kaveri hostel, and all second-year
students to Ganga hostel, this change should be made in many places in table 10.5. Also, when
a student's year ofstudy changes, his hostel change also should be noted in table 10.5. This is
undesirable. Table 10.5 is said to be in 3NF if it is in 2NF and no non-key attribute is
functionally dependent on any other non-key attribute. Table 10.5 is thus not in 3NF. To
transform it to 3NF, we should introduce another relation that includes the functionally
related non-key attributes.
Let us consider another example of a relation. The relation Employee is given below and its
dependency diagram in figure 10.7.
Employee (Employee code, Employee name, Dept., Salary, Project no., Termination date of
project).
As can be seen from the figure, the termination date of a project is dependent on Project no.
Thus this relation is not in 3NF. The 3NF relations are:
Employee (Employee code, Employee name, Salary, Project no.) Project (Project no.
Termination date)
12. A __________Form normalization will be needed where all attributes ina relation
tuple are not functionally dependent only on the key attribute.
13. If two non-key attributes are functionally _______, then there will be no
unnecessary duplication of data.
14. In 3 NF dependency between attributes is a ______property and hasto be stated
in the problem specification.
It is assumed that
1. A professor can work in more than one department
2. The percentage of the time he spends in each department is given.
3. Each department has only one Head of Department.
The relationship diagram for the above relation is given in figure 10.8. Table 10.7 gives the
relation attributes. The two possible composite keys are Professor Code and Dept. or
Professor Code and Head of Dept. Observe that department as well as Head of Dept. are not
non-key attributes. They are a part of a composite key.
Fig 10.8: Dependency diagram of Professor relation Table 10.7: Normalization of Relation
"Professor"
The relation given in table 10.7 is 3NF. Observe, however, that the namesof Dept. and Head
of Dept. are duplicated. Further, if Professor P2 resigns, rows 3 and 4 are deleted. We lose
the information that Rao is the Head of the Department of Chemistry.
The normalization of the relation is done by creating new relation for [Link] Head of Dept.
and deleting Head of Dept. from Professor Relation. The normalized relations are shown in
the following table 10.8 and the dependency diagrams for these new relations are in figure
10.8.
b)
Department Head of Dept.
Physics Ghosh
Mathematics Krishnan
Chemistry Rao
The dependency diagram gives the important clue to this normalization stepas is clear from
figures 10.8 and 10.9.
Table 10.9 gives the relation for this problem and figure 10.10 the dependency diagram(s).
Table 10.9: Vendor-supply-projects Relation
Vendor Code Item code Project no.1
V1 I1 P1
V1 I2 P1
V1 I1 P3
V1 I2 P3
V2 I2 P1
V2 I3 P1
V3 I1 P2
V3 I1 P2
The relation given in table 10.9 has a number of problems. For example:
If vendor V1 has to supply to project P2, but the item is not yet decided, then a row with a
blank for item code has to be introduced. The information about item 1 is stored twice for
vendor V3. Observe that the relation given inTable 10.8 is in 3NF and also in BCNF. It still has
the problems mentioned above. The problem is reduced by expressing this relation as two
relations in the Fourth Normal Form (4NF). A relation is in 4NF if it has no more than one
independent multivalued dependency, or one independent multivalued dependency with a
functional dependency.
Table 10.9 can be expressed as the two 4NF relations given in Table 10.10. The fact that
vendors are capable of supplying certain items and that they are assigned to supply for some
projects are independently specified in the 4NF relation.
b) Vendor project
Vendor Code Project no.
V1 P1
V1 P3
V2 P1
V3 P1
V3 P2
These relations still have a problem. Even though vendor V1's capability to supply items and
his allotment to supply for specified projects may not need it. We thus need another relation
that specifies this. This is called the 5NF form. The 5NF relations are the relations in Table
10.10(a) and 10.10(b) together with the relation given in table 10.11.
Table 10.11: 5NF Additional Relation
Project no. Item code
P1 I1
P1 I2
P2 I1
P3 I1
P3 I3
In table 10.12 we summarize the normalization steps already explained
10. SUMMARY
In this unit, we discussed that there is no fool-proof algorithmic method of identifying
dependency and hence we have to use our commonsense and judgment to specify
dependencies. We dealt about the importance of having a consistent database without
repetition of data and pointed out the anomalies that could be introduced in a database with
an undesirable scheme. We also discussed the properties of normalized relations. Then we
discussed the several forms of normalization that could help in removing these anomalies.
12. ANSWERS
Self-Assessment Questions
1. X ➔ Y
2. composite
3. only once
4. anomalies
5. decomposition
6. Duplicated
7. self-contained
8. flat
9. lower
10. 1NF
11. non-key
12. Third Normal
13. Dependent
14. Semantic
15. BCNF
16. Independent
17. two
Terminal Questions
1. Basic purpose of 4NF transformation is Normalization of data when attributes in a
relation have multi-valued dependency. It is done by adding one relation relating
attributes with multivalued dependency to the two relations with multivalued
dependency. (Refer section 9)
2. Anomalies found in relational databases are Update Anomalies,Insertion Anomalies,
and Deletion Anomalies (Refer section 3 for detail)
3. A functional dependency occurs when one attribute in a relation uniquely determines
another attribute. This can be written X -> Y which would be the same as stating "Y is
functionally dependent upon X".
4. Refer section 2 for detail
5. Yes, it is possible for R to be in BCNF if R has more than one possiblekey. (Refer section
8 for detail)
6. The given pair of dependencies means functional dependencies. (Refersection 10.2 for
detail)
7. Refer whole unit for detail.
8. Refer whole unit for detail.
9. Refer whole unit for detail.
DCA2101
COMPUTER ORIENTED NUMERICAL
METHODS
Unit 11
Query Processing and Optimization
Table of Contents
1. INTRODUCTION
In the previous unit, you have learned that normalization is needed to reduce redundancy,
and relations are normalized so that when relations in a database are to be altered during
the lifetime of the database, we do not lose information or introduce inconsistencies. In this
unit, we learn about the query processing and optimization techniques used in Database
management systems.
Query processing is a set of activities to obtain the desired information from a database
system in a predictable and reliable fashion. These activities are
(i) parsing a query and translating it into the form such that it can be optimized (ii)
Optimization of query data and (iii) finally evaluating for an execution plan over physical
data model, using operations on file structures, indices, etc. In SQL, queries are expressed in
high-level declarative form. So the query has to be processed and optimized so that the query
of the internal form gets a suitable execution strategy for processing. This optimization helps
get the result in a lesser time.
There can be several possible strategies for processing a query depending upon its
complexity. One should select a good strategy for processing a query.
This unit will introduce you to the basic concepts of query processing process and query
optimization strategies in the relational database domain.
1.1 Objectives
2. QUERY INTERPRETATION
There are three phases that a query passes through during the DBMS processing of that
query: (i) Parsing and translation (ii) Optimization
(iii) Evaluation. Figure 11.1 shows the steps in the query processing process.
Parsing and translation are done because Query in High-Level language is suitable for human
use only. For example in SQL, queries are expressed in high-level declarative form and a high-
level relational query is generally non-procedural in nature. It simply informs about “what“
rather than informing “how” to get a query. Internally a query should be represented in a
more useful form, like relational algebra. So, first, the system must translate the query into
its internal form. After parsing and translating the query into a relational algebra expression,
the query is then transformed into a form, usually a query tree or graph, which can be
handled by the optimization engine (Query Optimizer).
After the query is been processed and translated into the form that can be optimized, the
question arises why we need to optimize such a form of data. A very efficient methodology
should be applied to find the result of the query in the existing database structure. If
optimization is done then it can be queried using the information in the main memory, with
little or no disk access. After optimization, various analyses are performed on the query data,
thereby helping in giving solutions for valid evaluation plans for execution of the result.
Execution of the query requires disk access. The problem again is that the transfer of data
from disk is slow, relative to the speed of the main memory and the CPU. So it is better to
think about saving the disk accesses also. Keeping all these limitations in mind, optimization
is done to find out the best possible methods.
There are two main approaches to finding the best possible solutions. They are
(i) Rewriting the query in a more effective manner. (ii) Estimating the cost of various
execution strategies for the query.
Usually, both strategies are combined because one has to solve it in a very less amount of
time.
In the network and hierarchical systems, optimization is left for the most part to the
application programmer. It is so because one has to know about the entire application
program and it is not easy to transform a hierarchical or network query into another one.
When the optimization phase is over then the detailed strategy is observed and the best
evaluation plan is found by the Evaluation engine for processing the query. Here we choose
specific indices to use, and the order in which tuples are to be processed, etc.
The best evaluation plan is chosen out of multiple methods of executing a query and then
executed to get the result.
Self-Assessment Questions - 1
3. EQUIVALENCE OF EXPRESSIONS
It is always better to find out a query-processing strategy to find a relational algebra
expression that is equivalent to the given query. It gives efficiency in execution.
An SQL query is first translated into an equivalent extended relational algebra expression–
represented as a query tree data structure–that is then optimized. This can be done by first
decomposing the SQL queries into query blocks. This basic unit can be translated into an
extended relational algebra expression and then optimized. In fact, Nested queries are not
query blocks, but are identified as separate query blocks.
Let us look at the SQL query in the example of a University database on the FACULTY relation.
FACULTY (FID: string, FNAME: string, LNAME:string, DEPT: string, SALARY: real, ENO:
int, DNO: int, SEX:string)
FACULTY FID FNAME LNAME DEPT SALARY DNO ENO SEX
SELECT
LNAME, FNAME
FROM FACULTY
WHERE SALARY > (SELECT MAX (SALARY)
FROM FACULTY
WHERE DNO=2);
This query includes a nested subquery and hence would be decomposed into two blocks. The
inner block is
FROM FACULTY
Where c represents the result returned from the inner query block. The inner block can be
translated into the extended relational algebra expression
δ<MAX SALARY> (σ<DNO=2> (FACULTY))
And the outer block into the expression
π <FNAME, LNAME> (σ<SALARY > c> (FACULTY))
After the SQL query is decomposed then the query optimizer chooses an execution plan for
each block. It should be noted that in the above example, the inner block needs to be
evaluated only once to produce the maximum salary, which is then used–as the constant c–
by the outer block. This is known as an uncorrelated nested query. It is difficult to optimize
the more complex correlated nested queries, where a tuple variable from the outer block
appears in the WHERE-clause of the inner block.
Self-Assessment Questions - 2
3. Nested query blocks are not identified as separate query blocks. (True/False)
4. It is difficult to optimize the more complex correlated nested queries, where a tuple
variable from the outer block appears in the ________of the inner block.
In this section, we will discuss the algorithm for executing query operations. They are (i)
EXTERNAL sorting, (ii) SELECT operation, (iii) JOIN operation,
(iv) PROJECT & SET operation, (v) AGGREGATE operation, and (vi) Outer Join.
External Sorting is one of the Primary algorithms used in query processing that are suitable
for large files of records stored on disk that do not fit entirely in main memory, as most
database files do.
Whenever an SQL query specifies an ORDER BY clause, the query result must be sorted.
Sorting is also a key component in sort-merge algorithms used for JOIN and other operations
(such as UNION and INTERSECTION), and in duplicate elimination algorithms for the
PROJECT operation (when an SQL query specifies the DISTINCT option in the SELECT
clause).
The typical external sorting algorithm uses a sort-merge strategy. This algorithm consists
of two phases: (i) Sorting Phase (ii) Merging Phase.
This sort-merge algorithm also like other database algorithms requires buffer space in the
main memory, where the actual sorting and merging of the runs (portions or pieces of the
file) is performed.
In the sorting phase, runs (portions or pieces of the file) which can fit in the available buffer
space are read into the main memory, which is sorted using an internal sorting algorithm,
and written back to disk as temporary sorted subfiles (or runs). The size of a run and the
number of initial runs are dictated by the number of file blocks (b) and the available
buffer space.
In the merging phase, the sorted runs are merged during one or more passes. The degree
of merging is the number of runs that can be merged together in each pass. In each pass, one
buffer block is needed to hold one block from each of the runs being merged, and one block
is needed for containing one block of the merge result.
This is another form of an algorithm for executing query operations. The execution of the
SELECT operation depends on the file having specific access paths and may apply only to
certain types of selection conditions.
There may be a number of search algorithms to select records from a file. These algorithms
scan a file and are also known as file scans because they scan the records of a file to search
for and retrieve records that satisfy a selection condition. A search algorithm may involve an
index for searching and this type of index search is called an index scan. The search methods
given below are examples of some of the search algorithms that can be used to implement a
select operation:
i) Linear search (brute force): This method is used to retrieve every record in the file, and
test whether its attribute values satisfy the selection condition.
ii) Binary search: This method can be used if the selection condition involves an equality
comparison on a key attribute on which the file is ordered.
iii) Using a primary index (or hash key): This method is used in the selection condition and
involves an equality comparison on a key attribute with a primary index (or hash key).
iv) Using a primary index to retrieve multiple records: This method is used in the
comparison condition is >, >=, <, or <= on a key field with a primary index.
v) Using a clustering index to retrieve multiple records: It is used in the selection condition
that involves an equality comparison on a non-key attribute with a clustering index.
vi) Using a secondary (BPlus-tree) index on an equality comparison: This search method
can be used to retrieve a single record if the indexing field is a key (has unique values)
or to retrieve multiple records if the indexing field is not a key. This can also be used
for comparisons involving >, >=, <, or <=.
Linear search (brute force) applies to any file, but all the other methods depend on having
the appropriate access path on the attribute used in the selection condition. Binary search is
more efficient than linear search if the selection condition involves an equality comparison
on a key attribute on which the file is ordered. The primary index, Clustering index, and
Secondary index can be used to retrieve records in a certain range. Queries involving such
conditions are called range queries.
The JOIN operation is one of the most time-consuming operations in query processing. There
are many possible ways to implement a two-way join, which is a join on two files. Joins
involving more than two files are called multiway joins. The number of possible ways to
execute multiway joins grows very rapidly. Let us consider the algorithms for join operations
of the form
P⨝A=B Q
For relations P and Q and where A and B are domain-compatible attributes of P and Q,
respectively.
i) Nested-loop join (brute force): In this method, for each record t in P (outer loop), every
records from Q (inner loop) has to be retrieved and tested whether the two records
satisfy the join condition t[A] = s[B].
ii) Single-loop join (using an access structure to retrieve the matching records): In this
method, suppose if an index (or hash key) exists for one of the two join attributes, B of
Q then each record t in P is retrieved one at a time (single loop), and then the access
structure is used to retrieve all matching records s directly from Q that satisfy s[B] =
t[A].
iii) Sort–merge join: In this method, the records of P and Q has to be physically sorted
(ordered) by the value of the join attributes A and B, respectively. Then both files are
scanned concurrently in order of the join attributes, matching the records that have the
same values for A and B. If the files are not sorted, they may be sorted first by using
external sorting. Pairs of file blocks are copied into memory buffers in order and the
records of each file are scanned only once each for matching with the other file–unless
both A and B are non-key attributes. In such a case, the method needs to be modified
slightly. We use P(i) to refer to the ith record in P. A variation of the sort-merge join can
be used when secondary indexes exist on both join attributes. The indexes provide the
ability to access (scan) the records in order of the join attributes, but the records
themselves are physically scattered all over the file blocks. This method may be quite
inefficient if every record access involves accessing a different disk block.
iv) Hash-join: In this method, the records of files P and Q are both hashed to the same hash
file, using the same hashing function on the join attributes A of P and B of Q as hash
keys. First, it passes through the first phase known as the partitioning phase, a single
pass through the file with fewer records (say, P) hashes its records to the hash file
buckets. It is called the partitioning phase because the records of P are partitioned
into the hash buckets. In the second phase, called the probing phase, a single pass
through the other file (Q) then hashes each of its records to probe the appropriate
bucket, and that record is combined with all matching records from P in that bucket.
Here we assume that the smaller of the two files fit entirely into memory buckets after
the first phase although this type of assumption is not always required.
Project and set operations are very useful and used in certain cases for query optimization.
Hashing can also be used to eliminate duplicates. Each record is hashed and inserted into a
bucket of the hash file in memory and it is checked against those already in the bucket. If it
is a duplicate, it is not inserted
Set operations: Set operations are UNION, INTERSECTION, SET DIFFERENCE, and
CARTESIAN PRODUCT. But the CARTESIAN
PRODUCT operation like R x S is quite expensive because its result includes a record for each
combination of records from R and S. Moreover the attributes of the result include all
attributes of R and S. Hence CARTESIAN PRODUCT operation is avoided and other equivalent
operations are used for query optimizations.
The other three set operations UNION, INTERSECTION, and SET DIFFERENCE apply only to
union-compatible relations, which have the same number of attributes and the same
attribute domains.
Hashing can also be used to implement UNION, INTERSECTION, and SET DIFFERENCE.
We use aggregate operations to find out the aggregate values for query optimizations. MIN,
MAX, COUNT, AVERAGE, and SUM are the aggregate operators used to compute the aggregate
values. But these can be computed by a table scan or by using an appropriate index. For
example:
FROM FACULTY
If an ascending index on salary exists for the FACULTY relation, it can be used (otherwise we
can scan the entire table).
The index can also be used for the COUNT, AVERAGE, and SUM aggregates but the index must
be dense, that is, there must be an index entry for every record in the main file.
When a GROUP BY clause is used in a query, the aggregate operator must be applied
separately to each group of tuples.
In order to do this, the table is first partitioned into subsets of tuples, where each partition
(group) has the same value for the grouping attributes. In this case, the computation is more
complex. Let us consider the following query:
FROM FACULTY
GROUP BY DNO;
Usually either sorting or hashing is used on the grouping attributes to partition the file into
the appropriate groups and then the algorithm computes the aggregate function for the
tuples in each group which have the same grouping attribute values.
Outer join also can be used for query optimizations. Outer Join can be computed by modifying
one of the join algorithms, such as nested-loop join or single-loop join. The sort-merge and
hash-join algorithms can also be extended to compute outer joins.
Outer join can also be computed by executing a combination of relational algebra operators.
In this case, the cost of the outer join would be the sum of the costs of the associated steps
i.e. inner join, projections, and union.
Self-Assessment Questions - 3
The heuristic rule is applied for Query Optimization by modifying the internal representation
of the query. This form of query is generally in the form of query tree or a query graph data
structure. Although some optimization techniques were based on query graphs, nowadays,
this technique is not applied because query graphs cannot show the order of operation which
is needed by the query optimizer for query execution. So this unit will deal mainly with the
Heuristic Optimization of the query tree.
A heuristic rule is applied to the initial query expression and produces the heuristically
transformed equivalent query expressions. This is performed by transforming an initial
expression (tree) into an equivalent expression (tree) which is made more efficient for
execution. This rule works well in most cases but is not always guaranteed.
Execution of the query tree consists of executing an internal node operation whenever its
operands are available and then replacing that internal node with the relation that results
from executing the operation.
The query tree is a tree data structure that represents the relational algebra expression in
the query optimization process. The leaf nodes in the query tree correspond to the input
relations of the query. The internal nodes represent the relational algebra operations. The
system will execute an internal node operation whenever its operands are available and then
the internal node is replaced by the relation which is obtained from executing the operation.
The execution is terminated when the root node is executed and produces the result relation
for the query.
For example – Consider the following two tables with the following schema:
Emp(EmpID, Name, Sal, DeptID)
Dept(DeptID, Dname, Location)
To find the name of employees working in the Marketing department, a query in relational
algebra can be written as shown in Q as well as a query tree as shown in fig. 11.2.
The query can also be represented by a query graph. In this case, the relations in the query
are represented by relational nodes and are represented by single circles. Constant nodes
are used to represent constant values and are displayed as double circles or ovals. Selection
and join conditions are represented by the graph edges. And finally, the attributes to be
retrieved from each relation are displayed in square brackets above each relation. The graph
query representation does not give an order of performing the operations because there is
only a single graph corresponding to each query.
Hence query trees are better than query graphs because the query optimizer needs to show
the order of operations for query execution which is not possible in query graphs.
Two relational algebra expressions are said to be equivalent if the two expressions generate
two relations of the same set of attributes and contain the same set of tuples although their
attributes may be ordered differently.
Hence while doing Heuristic Optimization on Query Trees, generally, many different
relational algebra expressions can be found which can be equivalent to correspond to the
same query. And for every relational algebra expression a query tree can be drawn. And
hence there can be many different query trees to correspond to the same query.
The query parser generates a standard initial query tree to correspond to the SQL query
without optimizing. When the simple standard form of the query tree is found out then the
heuristic query optimizer transforms this initial query tree into a final query tree that is
efficient to execute.
1) First of all SELECT operations are broken up with conjunctive operations into a cascade
of SELECT operations.
2) Then SELECT operations are moved down far to the query tree as is permitted by the
attributes involved in the select condition.
3) Leaf nodes of the tree are rearranged by :
o positioning the leaf node relation with the most restrictive SELECT
operations so they are executed first in the query representation,
o and making sure that the ordering of leaf nodes does not cause CARTESIAN PRODUCT
operation.
4) CARTESIAN PRODUCT operations are combined with a subsequent SELECT operation
in the tree into a JOIN operation if the condition represents a join condition.
5) Lists of projection attributes are broken down and moved down the tree as far as
possible by creating new PROJECT operations as needed.
6) And lastly subtrees are identified that represent groups of operations that can be
executed by a single algorithm.
Using the above rules the optimized query Q is shown in fig 11.3. It is optimized as now we
need to transfer less data from the Dept table for performing joining with data of Emp table.
These steps are applied using general transformation rules for relational algebra operations
into equivalent ones. While transforming, the meaning of the operations and the resulting
relations should not mismatch.
Self-Assessment Questions - 4
9. In query graph Constant nodes are used to represent constant values and are
displayed as double circles or__________ .
10. The query parser generates a standard initial query tree to correspond to SQL
query without __________ . (Choose correct option)
a) optimizing
b) parsing
c) evaluation
d) sorting
This is one of the alternative approaches to query optimization and uses constraints
specified on the database scheme.
Suppose there is a query which says to retrieve the names of all students in a University
whose age is more than their faculty. And if we had a constraint on the database schema that
stated no student is older than faculty. In such a case if the semantic query optimizer checks
for the existence of this constraint then it need not to execute the query at all. But searching
through many constraints to find constraints applicable to a given query can be quite time-
consuming.
We saw that the output of the Parsing and Translating step in the query processing is a
relational algebra expression. If the query is complex, then the expressions consist of several
multiple operations and interact with various relations. The evaluation of the expression
becomes very costly in terms of both time and memory space in such a case. So it is very
important to consider how to evaluate an expression containing multiple operations. The
two approaches used for evaluating expression are materialization and pipelining.
Depending upon the situation sometimes Materialization is applicable and sometimes
Pipelining is applicable.
Materialization
Pipelining
Pipelining of different operations can improve query evaluation efficiency by reducing the
number of temporary relations that are produced. To achieve this reduction, we can combine
several operations into a pipeline of operations. We can implement the pipeline by creating
a query execution code. Depending upon different situations, we can use pipelining to reduce
the number of temporary files, thus reducing the cost of query evaluation.
Self-Assessment Questions - 5
11. If the query is complex, then the expressions consist of several multiple operations
and interact with various _________ .
12. We can combine several operations into a pipeline of operations to improve ________
evaluation.
There are different components involved in the cost of query execution and different types
of information needed in cost functions. This information is kept in the DBMS catalog.
i) Access cost to secondary storage: This is the cost measured by the search performed
while reading, and writing data blocks that reside on secondary storage, mainly on disk.
The cost of searching for records in a file depends on the type of access structures on
that file, such as ordering, hashing, and primary or secondary indexes. In addition,
access cost is affected by the way file blocks are allocated contiguously on the same disk
cylinder or scattered on the disk. This cost is usually more important in the case of a
large database since disk accesses are slow compared to in-memory operations.
ii) Storage cost: This is the cost of storing any intermediate files that are generated by an
execution strategy for the query.
iii) Computation cost: This is the cost measured by the performance of in-memory
operations on the data buffers during query execution. Such operations include
searching for and sorting records, merging records for a join, and performing
computations on field values. This cost is important when the database is small where
almost all data reside in the memory.
iv) Memory usage cost: This is the cost pertaining to the number of memory buffers
needed during query execution.
v) Communication cost: This is the cost of shipping the query and its results from the
database site to the site or terminal where the query originated. In the distributed
system, this cost should be minimized.
There can be other factors also that may be the cost components in a cost function which
may not be so much of importance. Therefore, some cost functions consider only disk access
cost as the reasonable measure of the cost of a query-evaluation plan.
As already mentioned, Query optimizers use the statistic information stored in the DBMS
catalog to estimate the cost of a plan. The relevant catalog information about the relation
includes:
Moreover some information about indices is also used along with the relation information.
They are:
The above given statistical information are the simplified one. In a real database
management system, the optimizer may have more information to improve the accuracy of
their cost estimates.
This statistical information maintained in the DBMS catalog along with the measures of
query cost based on the number of disk accesses helps in estimating the cost for different
relational algebra operations
Self-Assessment Questions - 6
13. Memory usage cost is the cost pertaining to the number of ________ buffers needed
during query execution.
14. Query optimizers use the statistic information stored in ________________ catalog to
estimate the cost of a plan.
We can use multiple processors for parallel computation to make the processing faster.
There are many cases where multiple processors may be available for parallel computation
of the join. The architecture may be different, including database machines. We will consider
an architecture where all processors have access to all disks, and all processors share main
memory.
In Parallel-join, pairs to be tested are split over several processors. Each processor computes
part of the join, and then the results are assembled (merged).
Ideally, the overall work of computing join is partitioned evenly over all processors. If such
a split is achieved without any overhead, a parallel join using N processors will take 1/N
times as long as the same join would take on a single processor.
The speedups between several processors are more or less similar because the overhead is
incurred in partitioning the work among the processors and in collecting the results
computed by each processor. If the split is not even then the final result cannot be obtained
until the last processor has finished. Speedup also depends upon the processors competing
for shared system resources. In such a case for e.g., for relation A and B where A ⨝ B, if each
processor uses its own partition of A, and the main memory cannot hold the entire B, the
processors need to synchronize the access of B so as to reduce the number of times that each
block of B must be read in from the disk.
Multiway Join can be pipelined if several joins are computed in parallel over other
processors. For example:
1) r1⨝r2⨝r3⨝r4 can be computed by first computing “ t1← r1⨝r2 ” and “t2 ← r3⨝r4”,
and then “ t1⨝t2 ”
2) And, it can be computed in the pipelined way: For r1 ⨝ r2 ⨝ r3 ⨝ r4 One Processor (say
P1) can be assigned to process r1 ⨝ r2, another processor (say P2) to process r3 ⨝ r4,
and other processors (say P3) to process the join of the tuples being generated by P1
and P2.
We can also organize the physical resources to make the strategies for query evaluation more
efficient and better.
The database can be partitioned over several disks for accessing in parallel to avoid
contention between several processors. We must also distribute data among disks to exploit
parallel disk access.
For the parallel 2-way join, tuples of individual relations can be split among several disks.
This phenomenon is known as disk striping. For example in the hash-join algorithm, we can
assign tuples to disks based on the hash function value where all groups of tuples that share
a bucket are assigned to the same disk. Either each group can be assigned to the same disk,
if possible, or the groups can be distributed uniformly among the available disks in order to
exploit parallel disk access.
For the pipeline-join, it is better to keep each relation on one disk and the distinct relations
be assigned to separate disks to the degree possible.
Although physical organization can be optimized differently for different queries, one must
organize the physical resources for a better expectation of result. If the physical organization
is done in a proper way then only the query optimizer can choose the best technique by
estimating the cost of each technique on the given physical organization.
Self-Assessment Questions - 7
15. In Parallel-join pairs to be tested are split over several processors. (True/False)
16. __________ can be pipelined if several joins are computed in parallel.
10. SUMMARY
When the queries are expressed in the high-level declarative form then the queries have to
be processed and optimized so that the query of the internal form gets a suitable execution
strategy for processing. To obtain this, several query processing and optimization strategies
have to be performed.
Query processing and optimization is a set of activities to obtain the desired information
from a database system such that the result is obtained in lesser time with low cost. The
queries are parsed and translated in equivalent relational algebra expression which is then
easier for optimizing it using different rules and algorithms.
We use External Sorting, Select operation, Join Operation, Project and Set Operation,
Aggregate Operations, and Outer Join for executing query operations. We also do Heuristic
Optimization of Query Trees or semantic optimization where ever appropriate and then the
optimized results are obtained for the query evaluation plan. We convert query trees into a
query evaluation plan.
After the query evaluation plan, the cost is estimated to find out the lowest cost possible.
Costs are estimated using several components that are included while executing plans and
statistical information stored in the DBMS catalog. Finally, the plan is executed over a
physical data model, using operations on file structures, indices, etc.
12. ANSWERS
Self Assessment Questions
1. True
2. Query optimizer
3. False
4. WHERE clause
5. Main
6. Range
7. False
8. d)-CARTESIAN PRODUCT.
9. ovals
10. a)-optimizing.
11. Relations
12. Query
13. Memory
14. DBMS
15. True
16. Multiway join
Terminal Questions
1. There are three phases that a query passes through during the DBMS processing of that
query: (i) Parsing and translation (ii) Optimization
(iii) Evaluation. (Refer section 2 for detail)
2. A sort-merge strategy in external sorting consists of two phases: Sorting Phase (ii)
Merging Phase. In the sorting phase, runs (portions or pieces of the file) which can fit
in the available buffer space are read into main memory, which is sorted using an
internal sorting algorithm, and written back to disk as temporarily sorted subfiles (or
runs). In the merging phase, the sorted runs are merged during one or more passes.
(Refer section 4.1 for detail)
3. The heuristic rule is applied for Query Optimization by modifying the internal
representation of the query. A heuristic rule is applied to the initial query expression
and produces the heuristically transformed equivalent query expressions. This is
performed by transforming an initial expression (tree) into an equivalent expression
(tree) which is made more efficient for execution. This rule works well in most cases
but not always guaranteed. (Refer section 5 for detail)
4. The measure of query cost includes the components like access cost to secondary
storage, storage cost, computation cost, memory usage cost, and communication cost.
(Refer section 8.1 for detail)
5. Join strategies for parallel processing are Parallel join, pipelined multiway join along
with the way physical resources are organized. (Refer section 9 for detail)
DCA2101
COMPUTER ORIENTED NUMERICAL
METHODS
Unit 12
Distributed Databases
Table of Contents
1. INTRODUCTION
The processors in a distributed system may vary in size and function. They may include small
microcomputers, workstations, minicomputers, and large general-purpose computer
systems. These processors are referred to by a number of different names such as sites,
nodes, computers, and so on, depending on the context in which they are mentioned. We
mainly use the term site, in order to emphasize the physical distribution of these systems.
A distributed database system consists of a collection of sites, each of which may participate
in the execution of transactions which access data at one site, or several sites. The main
difference between centralized and distributed database systems is that, in the former, the
data resides in one single location, while in the latter, the data resides in several locations.
As we shall see, this distribution of data is the cause of many difficulties that will be
addressed in this chapter.
1.1 Objectives
After going through this unit, the learners should be able to:
A distributed database system consists of a collection of sites, each of which maintains a local
database system. Each site is able to process local transactions, those transactions that
access data only in that single site. In addition, a site may participate in the execution of
global transactions, those transactions that access data in several sites. The execution of
global transactions requires communication among the sites.
The sites in the system can be connected physically in a variety of ways. The various
topologies are represented as graphs, whose nodes correspond to sites. An edge from node
A to node B corresponds to a direct connection between the two sites. Some of the most
common configurations are depicted in Figure 12.1. The major differences among these
configurations involve:
• Installation cost: The cost of physically linking the sites in the system.
• Communication cost: The cost of time and money to send a message from site A to site
B.
• Reliability: The frequency with which a link or site fails.
• Availability: The degree to which data can be accessed despite the failure of some links
or sites.
As we shall see, these differences play an important role in choosing the appropriate
mechanism for handling the distribution of data. The sites of a distributed database system
may be distributed physically, either over a large geographical area (such as the all Indian
states) or over a small geographical area such as a single building or a number of adjacent
buildings). The former type of network is referred to as a long-haul network, while the latter
is referred to as a local-area network.
Since the sites in long-haul networks are distributed physically over a large geographical
area, the communication links are likely to be relatively slow and less reliable as compared
with local-area networks. Typical long-haul links are telephone lines, microwave links, and
satellite channels. In contrast, since all the sites in local-area networks are close to each
other, communication links are of higher speed and lower error rate than their counterparts
in long-haul networks. The most common links are twisted pair, baseband coaxial,
broadband coaxial, and fiber optics.
Let us illustrate these concepts by considering a banking system consisting of four branches
located in four different cities. Each branch has its own computer with a database consisting
of all the accounts maintained at that branch. Each such installation is thus a site. There also
exists one single site which maintains information about all the branches of the bank.
Suppose that the database systems at the various sites are based on the relational model.
Thus, each branch maintains (among others) the relation deposit (Deposit-scheme) where
the site containing information about the four branches maintains the relation branch
(Branch-scheme), where
There are other relations maintained at the various sites which are ignored for the purpose
of our example.
A local transaction is a transaction that accesses accounts in the one single site, at which the
transaction was initiated. A global transaction, on the other hand, is one which either
accesses accounts in a site different from the one at which the transaction was initiated, or
accesses accounts in several different sites. To illustrate the difference between these two
types of transactions, consider the transaction to add $ 50 to account number 177 located at
the Delhi branch. If the transaction was initiated at the Delhi branch, then it is considered
local; otherwise, it is considered global. A transaction to transfer $ 50 from account 177 to
account 305, which is located at the Bombay branch, is a global transaction since accounts in
two different sites are accessed as a result of its execution.
What makes the above configuration a distributed database system are the facts that:
Self-Assessment Questions - 1
There are several reasons for building distributed database systems, including sharing of
data, reliability, and availability, and speedup of query processing. However, along with
these advantages come several disadvantages, including software development cost, greater
potential for bugs, and increased processing overhead. In this section, we shall elaborate
briefly on each of these.
The primary advantage of distributed database systems is the ability to share and access data
in a reliable and efficient manner.
If a number of different sites are connected to each other, then a user at one site may be able
to access data that is available at another site. For example, in the distributed banking
system, it is possible for a user in one branch to access data in another branch. Without this
capability, a user wishing to transfer funds from one branch to another would have to resort
to some external mechanism for such a transfer. This external mechanism would, in effect,
be a single centralized database.
The primary advantage to accomplishing data sharing by means of data distribution is that
each site is able to retain a degree of control over data stored locally. In a centralized system,
the database administrator of the central site controls the database. In a distributed system,
there is a global database administrator responsible for the entire system. A part of these
responsibilities is delegated to the local database administrator for each site. Depending
upon the design of the distributed database system, each local administrator may have a
different degree of autonomy which is often a major advantage of distributed databases.
If one site fails in a distributed system, the remaining sited may be able to continue operating.
In particular, if data are replicated in several sites, transactions needing a particular data
item may find it in several sites. Thus, the failure of a site does not necessarily imply the
shutdown of the system.
The failure of one site must be detected by the system, and appropriate action may be needed
to recover from the failure. The system must no longer use the service of the failed site.
Finally, when the failed site recovers or is repaired, mechanisms must be available to
integrate it smoothly back into the system.
Although recovery from failure is more complex in distributed systems than in a centralized
system, the ability of most of the systems to continue to operate despite the failure of one
site results in increased availability. Availability is crucial for database systems used for real-
time applications. Loss of access to data, for example, in an airline may result in the loss of
potential ticket buyers to competitors.
If a query involves data at several sites, it may be possible to split the query into subqueries
that can be executed in parallel by several sites. Such parallel computation allows for faster
processing of a user's query. In those cases in which data is replicated, queries may be
directed by the system to the least heavily loaded sites.
The primary disadvantage of distributed database systems is the added complexity required
to ensure proper coordination among the sites. This increased complexity takes the form of:
Greater potential for bugs: Since the sites that comprise the distributed system operate in
parallel, it is harder to ensure the correctness of algorithms. The potential exists for
extremely subtle bugs. The art of constructing distributed algorithms remains an active and
important area for research.
In choosing the design for a database system, the designer must balance the advantages
against the disadvantages of distribution of data design ranging from fully distributed
designs to designs, which include a large degree of centralization.
Self-Assessment Questions - 2
The principles of database design that we discussed earlier apply to distributed databases as
well. In this section, we focus on those design issues that are specific to distributed
databases.
Consider a relation that is to be stored in the database. There are several issues involved in
storing this relation in the distributed database, including:
Replication: The system maintains several identical replicas (copies) of the relation. Each
replica is stored at a different site, resulting in data replication. The alternative to replication
is to store only one copy of the relation.
Fragmentation: The relation is partitioned into several fragments. Each fragment is stored
at a different site.
Replication and Fragmentation: This is a combination of the above two notions. The
relation is partitioned into several fragments. The system maintains several identical
replicas of each such fragment.
If relation r is replicated, a copy of relation r is stored in two or more sites. In the most
extreme case, we have full replication, in which a copy is stored on every site in the system.
Availability: If one of the sites containing relation r fails, then the relation r may be found in
another site. Thus, the system may continue to process queries involving despite the failure
of one site.
Increased parallelism: In the case where the majority of access to the relation r results in
only the reading of the relation, the several sites can process queries involving r in parallel.
The more replicas of r there are, the greater the chance that the needed data is found on the
site where the transaction is executing. Hence, data replication minimizes the movement of
data between sites.
Increase overhead on update: The system must ensure that all replicas of a relation r are
consistent since otherwise erroneous computations may result. This implies that whenever
r is updated, this update must be propagated to all sites containing replicas, resulting in
increased overhead. For example, in a banking system, where account information is
replicated on various sites, it is necessary that transactions assure that the balance in a
particular account agrees on all sites.
In general, replication enhances the performance of reading operations and increases the
availability of data to read transactions. However, update transactions incur greater
overhead. The problem of controlling concurrent updates by several transactions to
replicated data is more complex than the centralized approach to concurrency control. We
may simplify the management of replicas of relation r by choosing one of them as the primary
copy of r. For example, in a banking system, an account may be associated with the site in
which the account has been opened. Similarly, in an airline reservation system, a flight may
be associated with the site at which the flight originates.
If the relation r if fragmented, r is divided into a number of fragments r1, r2 ,rn. These
fragments contain sufficient information to reconstruct the
original relation r. As we shall see, this reconstruction can take place through the application
of either the union operation or a special type of join operation on the various fragments.
There are three different schemes for fragmenting a relation: horizontal fragmentation,
vertical fragmentation, and mixed fragmentation. Horizontal fragmentation splits the
relation by assigning each tuple of r to one or more fragments. Vertical fragmentation splits
the relation by decomposing the scheme R of relation r in a special way that we shall discuss
and mixed fragmentation is a combination of horizontal and vertical fragments. These three
schemes can be applied successively to the same relation, resulting in a number of different
fragments. Note that some information may appear in several fragments.
Below we discuss the various ways for fragmenting a relation. We shall illustrate these by
fragmenting the relation deposit, with the scheme:
Horizontal Fragmentation:
It refers to the division of a relation into subsets of tuples (rows). Each fragment has unique
rows and all fragments have the same attributes (columns). The relation r is partitioned into
a number of subsets r1, r2, rn.
Each subset consists of a number of tuples of relation r. Each tuple of relation r must belong
to one of the fragments so that the original relation can be reconstructed if needed.
A fragment may be defined as a selection on the global relation r. That is, a predicate P1 is
used to construct fragment ri as follows:
ri =Pi(r)
The reconstruction of the relation r can be obtained by taking the union of all fragments, that
is,
r = n i = 1 ri
To illustrate this, suppose that the relation r is the deposit relation in Figure 12.2. This
relation can be divided into n different fragments, each of which consists of tuples of
accounts belonging to a particular branch.
a)
branch-name account-number customer-name balance
Bombay 305 Lowman 500
Bombay 226 Camp 336
Bombay 155 Khan 62
b)
branch-name account-number customer-name balance
Delhi 177 Camp 205
Delhi 402 Khan 10000
Delhi 408 Khan 1123
Delhi 639 Khan 750
If the banking system has only two branches, Bombay and Delhi, then there are two different
fragments:
These two fragments are shown in figure 12.3. Fragment deposit1 is stored in the Bombay
site. Fragment deposit 2 is stored in the Delhi site.
In our example, the fragments are disjoint. By changing the selection predicates used to
construct the fragments; we may have a particular tuple of r appear in more than one of the
r1. This is a form of data replication about which we shall say more at the end of this section.
Vertical Fragmentation:
In its most simple form, vertical fragmentation is the same as decomposition. Vertical
fragmentation of r(R) involves the definition of several subsets R1, R2,... ,Rn of R such that ri =
r.
Each fragment ri of r is defined by:
ri = Ri(r)
relation r can be reconstructed from the natural join:
r = r1 ⨝ r2 ⨝ r 3 ⨝……. ⨝ r n
In Figure 12.4, we show the relation deposit, the deposit relation of Figure 12.2 with tuple-
ids added. Figure 12.5 shows a vertical decomposition of the scheme Deposit-scheme tuple-
id into:
deposit3=Deposit-scheme-3(Deposit')
deposit4 = Deposit-scheme-4(Deposit')
a)
branch-name customer-name tuple-id
Bombay Lowman 1
Bombay Camp 2
Delhi Camp 3
Delhi Khan 4
Bombay Khan 5
Delhi Khan 6
Delhi Green 7
b)
account-number balance tuple-id
305 500 1
226 336 2
117 205 3
402 10000 4
155 62 5
408 1123 6
639 750 7
is a special form of natural join. The join attribute is tuple-id. Since the tuple-id value
represents an address, it is possible to pair a tuple of deposit3 with the corresponding tuple
of deposit4 by using the address given by the tuple-id value. This address allows direct
retrieval of the tuple without the need for an index. Thus, this natural join may be computed
much more efficiently than typical natural joins.
Mixed Fragmentation:
horizontal and vertical fragments of global relations and there is a need to process
transactions or queries that would access these data fragments.
Example – Suppose we want to create a fragment that consists of the account number and
balance of all the customers who belongs to Delhi branch, then we can do the same using
mixed fragmentation as:
OR
Self-Assessment Questions - 3
11. Replication is the system that maintains several ________ of the relation.
12. Data replication ___________ movement of data between sites.
13. In general, replication enhances the performance of read operations and
increases the __________ of data to read transactions.
14. Horizontal fragmentation ____________ the relation by assigning each tuple of r to
one or more fragments.
15. ___________fragmentation splits the relation by decomposing the scheme R of
relation r in a special way.
16. In its most simple form, vertical fragmentation is the same as ____________ .
17. Mixed fragmentation is combination of _____________ and ___________ .
5. SUMMARY
A distributed database system consists of a collection of sites, each of which maintains a local
database system. Each site is able to process local transactions, those transactions that
access data only in that single site. In addition, a site may participate in the execution of
global transactions, those transactions that access data n several sites. The execution of
global transactions requires communication among the sites.
There are several reasons for building distributed database systems, including sharing of
data, reliability, and availability, and speed of query processing. However, along with these
advantages come several disadvantages, including software development cost, greater
potential for bugs, and increased processing overhead. The primary disadvantage of
distributed database systems is the added complexity required to ensure proper
coordination among the sites.
There are several issues involved in storing, a relation in the distributed database, including
replication and fragmentation. It is essential that the system minimize the degree to which a
user needs to be aware of how a relation is stored.
6. TERMINAL QUESTIONS
7. ANSWERS
Self-Assessment Questions
1. Local
2. Installation cost.
3. Links or sites.
4. Long-haul network
5. Local transaction
6. Share and access
7. Global database
8. Detected
9. Complexity
10. Costly
11. identical replicas
12. minimizes
13. availability
14. splits
15. Vertical
16. Decomposition
17. Horizontal fragmentation, Vertical fragmentation
Terminal Questions
1. The main difference between centralized and distributed database systems is that, in a
centralized database, the data resides in one single location, while in the distributed
database the data resides in several locations. (Refer section 3.1 for detail)
2. In local-area networks, all the sites are close to each other, so communication links are
of higher speed and lower error rate than Wide area networks. (Refer section 2 for
detail)
3. Replication of data is useful when the interconnection of sites fails often but data is
always required to be updated. Fragmentation of data is useful when each fragment has
to be stored in a different site.
4. a) Possible types of failure in a distributed system are Failures of sites, and failures of
links.
Failures of links can be in centralized systems. (Refer section 4 for detail)
5. a) Site A can distinguish if B goes down.
b) Site A can distinguish if the link between A and B goes down.
c) Site A can distinguish if B is extremely overloaded and the response time is 100 times
longer than normal. (Refer section 3 for detail)
6. If the data are replicated then the data can be recovered even if the link or site fails in
the distributed system. (Refer section 3 for detail)
7. Data replication is done by maintaining several identical replicas (copies) of the
relation in several sites in distributed databases. Maintenance of remote backup site is
done when one of the site fails due to some reason.
8. The advantages of a distributed database management system over a centralized DBMS
is that even if there is a link failure or site failure data can be updated. (Refer section 3
for detail)
BACHELOR OF COMPUTER
APPLICATIONS
SEMESTER 3
DCA2102
DATABASE MANAGEMENT SYSTEM
Unit 13
Object Oriented Database Management
System
Table of Contents
Fig No /
SL SAQ /
Topic Table / Page No
No Activity
Graph
1 Introduction 3–4
1.1 Objectives
2 Next Generation Data Base System 1 5
3 New Database Application 1 6–7
4 Object Oriented Database Management System 1 8 – 10
5 Features of Object Oriented System 1 11
1. INTRODUCTION
Since the 1960s, Data Base Management Systems (DBMS) have been widely used in the data
processing environment. The support of characteristics such as data sharing, independence,
consistency, and integrity is the main reason for its success which traditional file
management system does not inherently offer.
A database system is usually organized according to a data model. In Unit 3, we discussed
the three most popular models: hierarchical, network, and relational. The difference among
all these three models is in the way of organizing records, although they are record-based.
They were mainly designed to process a large amount of relatively simple and fixed format
data. DBMS, based on these models along with sophisticated indexing and query
optimization techniques, have served business-oriented database applications especially
well.
RDBMSs were originally designed for mainframe computer and business data processing
applications. Moreover, relational systems were optimized for environments with a large
number of users, who issue short queries. But today's application has moved from
centralized computer-aided design (CAD), multimedia systems, software engineering, and
knowledge database. These operations require complex operations and data structure
representation. For example, a multimedia database may contain variable length text,
graphics, images, audio, and video data. Finally, a knowledge base system requires data-rich
in semantics.
Existing commercial DBMS, both small and large have proven inadequate for these
applications. The traditional database notion of storing data in two- dimensional tables, or
in flat files, breaks down quickly in the face of complex data structures and data types used
in today's applications.
Research in modeling and processing complex data has gone in two directions:
a. Extending the functionality of RDBMS
b. Developing and implementing OODBMS that is based on object oriented programming
paradigm.
OODBMSs are designed for use in today's application areas such as multimedia, CAD, office
automation, etc. In this unit, we will touch upon some of the basic issues related to OODBMS.
1.1 Objectives
By the end of this unit, you should be able to:
❖ Define what is object oriented DBMS
❖ Differentiate between RDBMS and OODBMS
❖ List next-generation database systems
❖ List advantages of object oriented DBMS
Self-Assessment Questions - 1
1. OODBMS is an acronym for .
2. KDBMS is an acronym for .
3. The new DBMS can support new generations of traditional business _______.
In general, these applications require the representation of complex data elements as well
as complex relationships among them. Users in these environments have found relational
technology inadequate in terms of flexibility, modeling power, and efficiency.
Self-Assessment Questions - 2
CASE is an acronym for .
The data is typically stored as sequences of bytes with variable
lengths, and segments of data are linked together for easy reference.
Artificial intelligence and expert systems represent information as facts and rules
that can be collectively viewed as a .
Object
The term object means a combination of data and program that represents some real-world
entity. For example, consider an employee named Amit; Amit is 25 years old, and his salary
is $25000. Then Amit may be represented in a computer program as an object. The data part
of this object would be (name: Amit, age: 25, salary: $25000). The program part of the object
may be a collection of programs (hire, retrieve the data, change age, change salary, fire). The
data part consists of data of any type. For the Amit object, the string is used for the name,
integer for age, and monetary for salary; but in general, even any user-defined type, such as
Employee, may be used. In the Amit object, the name, age, and salary are called attributes of
the object.
Encapsulation
Often, an object is said to encapsulate data and program. This means that the users cannot
see the inside of the object, but can use the object by calling the program part of the object.
This is not much different from procedure calls in conventional programming; the users call
a procedure by supplying values for input parameters and receive results in output
parameters.
Amit, Ankit, and Anup are each an Employee object. The data part of each of these objects
consists of the attributes Name, Age, and salary. Each of these Employee objects has the same
program part (hire, retrieve the data, change age, change salary, fire). Each program in the
program part is called a method. The term class refers to the collection of all objects that
have the same attributes and methods. In our example, the Amit, Ankit, and Anup objects
belong to the class Employee, since they all have the same attributes and methods. This class
may be used as the type of an attribute of any object. At this time, there is only one class in
the system namely, the class Employee; and three objects that belong to the class namely
Amit, Ankit, and Anup objects.
programs need not be defined and written from scratch. New classes can be created by
adding attributes and methods of existing classes, thereby reducing the opportunity to
introduce new errors to existing classes.
Self-Assessment Questions - 3
The term means a combination of data and programthat
represents some real-world entity.
The means that the users cannot see the inside of theobject but can use the
object by calling the program part of the object.
The term inheritance is sometimes called .
Each program in the program part is called a .
In general, a class may inherit from one or more existing classes and,the
inheritance structure is called an inheritance hierarchy or ____________.
Real-world modeling: Object-oriented systems tend to model the real world in a more
complete fashion than do traditional methods. Objects are organized into classes of objects,
and, objects are associated with behaviors. The model is based on objects rather than on data
and processing.
High code reusability: When a new object is created, it will automatically inherit the data
attributes and characteristics of the class from which it was spawned. The new object will
also inherit the data and behaviors from all superclasses in which it participates.
processing.
Self-Assessment Questions - 4
The primary goal of object-oriented development is the assurance thatthe system
will enjoy a longer life while having far smaller .
The Real World Model is based on objects rather than on
and processing.
Object-oriented systems promise to be far more thantraditional
systems.
In systems designed with object-oriented languages, objects are created during the running
of a program and are destroyed when the program ends. Providing a database that can store
the objects between runs of a program offers both increased flexibility and increased
security. The ability to store the objects also allows the objects to be shared in a distributed
environment. An object-oriented database can allow only the actively used objects to be
loaded into memory, and thus minimizes or preempts the need for virtual memory paging.
This is especially useful in large-scale systems. Persistent objects also allow objects to be
stored for each version. This version control is useful not only for testing applications but
also for many object-oriented design applications where version control is a functional
requirement of the application itself. Access to other data sources can also be facilitated with
object-oriented databases, especially those built as hybrid relational systems, which can
access relational tables as well as other object types.
Object-oriented databases also offer many of the benefits that were formerly found only in
expert systems. With an object-oriented database, the relationships between objects and the
constraints in objects are maintained by the database management system, that is, the
objects themselves. The rules associated with the expert system are essentially replaced by
the object schema and the methods. As many expert systems currently do not have adequate
database support, object-oriented databases afford the possibility of offering expert system
functionality with much better performance.
Object-oriented databases offer benefits over current hierarchical and relational database
models. They enable support of complex applications not supported well by the other
models. They enable programmability and performance, improve navigational access, and
simplify concurrency control. They lower the risks associated with referential integrity, and
they provide a better user metaphor than the relational model.
Object-oriented databases by definition allow the inclusion of more of the code (i.e. the
object's methods) in the database itself. This incremental knowledge about the application
has a number of potential benefits for the database system itself, including the ability to
optimize query processing and to control the concurrent execution of transactions.
Object-oriented databases can store not only complex application components, but also
larger structures. Although relational systems can support a large number of tuples (i.e.,
rows in a table), individual types are limited in size. Object-oriented databases with large
objects do not suffer performance degradation, because the objects do not need to be broken
apart and reassembled by applications, regardless of the complexity of the properties of the
application objects.
Since objects contain direct references to other objects, complex data sets can be efficiently
assembled using these direct references. The ability to search by direct references
significantly improves navigational access. In contrast, complex data sets in relational
databases must be assembled by the application program using the slow process of joining
tables.
For the programmer, one of the challenges in building a database is the data manipulation
language (DML) of the database. DMLs for relational databases usually differ from the
programming language used to construct the rest of the application. This contrast is due to
differences in the programming paradigms and mismatches of type systems. The
programmer must learn two languages, two toolsets, and two paradigms because neither
alone has the functionality to build an entire application. Certain types of programming tools,
such as application generators and fourth-generation languages (4GLs) have emerged to
produce code for the entire application, thereby bridging the mismatch between the
programming language and the DML, but most of these tools compromise the application
programming process.
With object-oriented databases, much of this problem is eliminated. The DML can be
extended so that the application can be written in the DML. Or an object-oriented application
language, for example, C++ can be extended to be the DML. More of the application can be
built into the database itself. Class libraries can also assist the programmer in speeding the
creation of databases. Class libraries encourage the reuse of existing code and help minimize
the cost of later modifications. Programming is easier because the data structures model the
problem more closely. Having the data and procedures encapsulated in a single object makes
it less likely that a change to one object will affect the integrity of other objects in the
database. Concurrency control is also simplified with an object-oriented database. In a
relational database, the application needs to look into each record in each table explicitly
because related data is re-represented across a number of tables. Integrity, a key
requirement for databases, can be better supported with an object-oriented database
because the application can lock all the relevant data in one operation. Referential integrity
is better supported in an object-oriented database because the pointers are maintained and
updated by the database itself. Finally, object-oriented databases offer a better user
metaphor than relational databases. The tuple or table, although enabling a well-defined
implementation strategy, is not an intuitive modeling framework, especially outside the
domain of numbers. Objects offer a more natural and encompassing modeling metaphor.
Self-Assessment Questions - 5
Programming languages, including OOPLs, are designed with one userand a
relatively small in mind.
Providing a database that can store the objects between runs of aprogram
offers both increased flexibility and .
Object- oriented databases also offer many of the benefits that wereformerly
found only in .
Object-oriented databases offer benefits over current
and database models.
Object-oriented databases can store not only complex applicationcomponents
but also .
RDBs force the users to represent hierarchical data (or complex nested data or compound
data) such as bill of materials in terms of tuples in multiple relations. This is awkward to
start with. Further, to retrieve data thus spread out in multiple relations. RDBs must resort
to joins, a general expensive operation. The data type of an attribute of an object in OOPLs
may be a primitive type or an arbitrary user-defined type (class). The fact that an object may
have an attribute whose value may be another object naturally leads to nested object
representation, which in turn allows hierarchical data to be naturally (i.e., hierarchically)
represented.
RDBs offers a set of primitive built-in data types for use as domains of columns of relation,
but they do not offer any means of adding user-defined data types. The built-in data types
are basically all numbers and short symbols. RDBs are not designed to allow new data types
to be added and therefore often require major surgery to the system architecture and code
to add any new data type. Adding a new data type to a database system means allowing its
use as the data type of an attribute – that is, storage of data of that type, querying, and
updating of such data. Object encapsulation in OOPLs does not impose any restriction on the
types of data, that the data may be primitive types or user-defined types. Further, new data
types may be created as new classes, possibly even as subclasses of existing classes,
inheriting their attributes and methods.
Object encapsulation is the basis for the storage and management of programs as well as
data in the database. RDBs now supports stored procedures – that is, they allow programs
to be written in some procedural language and stored in the database for later loading and
execution. However, the stored procedures in RDBs are not encapsulated with the data – that
is, they are not associated with any linear relation or any tuple of a relation. Further, since
RDBs do not have the inheritance mechanism, the stored procedures cannot automatically
be reused.
Self-Assessment Questions - 6
The data type facilities are the keys to eliminating of the important
deficiencies of RDBs.
Adding a new data type to a system means allowing its use as the data
type of an attribute – that is, storage of data of that type, querying, and updating
of such data.
RDBMSs were never designed to allow for the nested structure. These types of applications
are extensively found in CAD/CAE, aerospace, etc. OODBM can easily support these
applications. Moreover, it is much easier and more natural to navigate through these
complex structures in the form of objects that model the real world in OODBMS, rather than
tables, tuples, and records in RDBMS.
It is hard to confuse a relational database with an object-oriented database. The normalized
relational model is based on a fairly elegant mathematical theory. Relational databases
derive a virtual structure at run time, based on values from sets of data stored in tables.
Database constructs views of the data by selecting data from multiple tables and loading it
into a single table (OODBs traverse the data from the object to object).
Relational databases have a limited number of simple, built-in data types, such as integer and
string, and a limited number of built-in operations that can handle these data types. You can
create complex data types in a relational database, but you must do it on a linear basis, such
as combining fields into records. And the operations on these new complex types are
restricted, again, to these defined for the basic types (as opposed to arbitrary data types or
sub-classing with inheritance as found in OODBs).
The object model supports browsing of object class libraries, which allows the reuse, rather
than the reinvention, of commonly used data elements. Objects in an OODB survive multiple
sessions; they are persistent. If you delete an object stored in a relational database, other
objects may be left with references to the deleted one and may now be incorrect. The
integrity of the data thus becomes suspect and creates inconsistent versions.
In the relational database, complex objects must be broken up and stored in separate tables.
This can only be done in a sequential procedure, with the next retrieval relying on the
outcome of the previous. The relational database does not understand a global request, and
thus cannot optimize multiple requests; OODBs can issue a single message (request) that
contains multiple transactions.
The relational model, however, suffers at least one major disadvantage. It is difficult to
express the semantics of complex objects with only a table model for data storage. Although
relational databases are adequate for accounting or other typical transaction–processing
applications, where the data types are simple and few in number, the relational model offers
limited help when data types become numerous and complex.
Object-oriented databases are favored for applications where the relationships among
elements in the database carry the key information. Relational databases are favored when
the values of the database elements carry the key information. That is, object-oriented
models, capture the structure of the data; relational models organize the data itself. If a
record can be understood in isolation, then the relational database is probably suitable. If a
record makes sense only in the context of other records, then an object-oriented database is
more appropriate.
Engineering and technical applications were the first applications to require databases that
handle complex data types and capture the structure of the data. Applications such as
mechanical and electrical computer-aided design (MCAD and ECAD) have always used
nontraditional forms of data, representing such phenomena as three-dimensional images
and VLSI circuit designs. Currently, these application programs store their data in
application-specific file structures. The data-intensiveness of these applications is not only
in a large amount of data that needs to be programmed into the database but in the
complexity of the data itself. In these design-based applications, relationships among
elements in the database carry key information for the user. Functional requirements for
complex cross-references, structural dependences, and version management all require a
richer representation than what is provided by hierarchical or relational databases. The
comparison of OODBMS and RDBMS is shown in table 13.1.
OODBMS RDBMS
The data can be used only through The data can be partitioned on basis
their classes’ methods. of users’ requirements and onthe
specific users’ applications.
Data and methods non- Data normalization aims at reducing
redundancy is achieved through or eliminating data redundancy. It is
encapsulation and inheritance. used in the stage of designing the
Inheritance helps to reduce the database and not in the stage of
redundancy of methods. developing the applications.
Object-oriented databases are Relational databases are favored
favored for applications where the when the values of the database
relationships among elementsin elements carry the key information.
the database carry the key
information.
It follows object oriented It does not have object oriented
properties. properties.
Self-Assessment Questions - 7
Relational databases derive a structure at run time based onvalues
from sets of data stored in tables.
Objects in an OODB survive multiple sessions; they are .
In the relational database, complex objects must be broken up and stored in
separate .
Object-oriented databases are favoured for applications where the relationships
among elements in the database carry the information.
Relational databases are favoured when the values of the _________________
elements carry the key information.
Self-Assessment Questions - 8
27. Most of the research projects in object-oriented databases have pursued
_______approach.
28. A number of programming languages have been extended with
_____constructs.
29. Database languages can be embedded in host languages.
10. SUMMARY
During the past decade, object-oriented technology has found its way into database user
interfaces, operating systems, programming languages, expert systems, and the like. In this
unit, we discussed the advantages of OODBMS and the deficiencies of RDBMS. We discussed
that OODB must provide standard database facilities found in today's relational database
system (RDBS), including nonprocedural query facility for retrieving objects, automatic
query optimization and processing, dynamic schema changes, automatic management of
access methods to improve query processing performance, automatic transaction
management, concurrency control, recovery from system crashes, and security and
authorization.
Object-Oriented database product has already been in the market for several years, and
several vendors of RDBMS are now declaring that they will extend their products with
object-oriented capabilities. Nowadays, OODBMS is evolving in the industry to overcome all
the drawbacks of RDBMS.
12. ANSWERS
Terminal Questions
1. Existing commercial DBMS, both small and large have proven inadequate for the
applications like computer-aided design (CAD), multimedia systems, software
engineering, and knowledge database where operations require complex operations
and data structure representation. (Refer section 1, 3 and 7 for detail)
2. Multi-media data are those data which includes text, numbers, images, graphics, and
digital audio and video. Such multimedia data is typically stored as sequences of bytes
with variable lengths, and segments of data are linked together for easy reference.
(Refer section 3 for detail)
3. Multi-media data requires storage for the sequences of bytes with variable lengths, and
segments of data that can be linked together for easy reference. The variable-length
data structure cannot fit well into the relational framework, which mainly deals with
fixed-format records. (Refer section 3)
4. In object oriented database systems, the data type of an attribute of an object in OOPLs
may be (i) a primitive type (ii) an arbitrary user-defined type (class). (Refer section 7
for detail)
5. Both RDBMS and OODBMS are used to store and retrieve information to and from
DBMS. But RDBMS does not support user-defined data types (class) and does not allow
nested structure whereas OODBMS supports all of these and is more natural to navigate
through the complex structures in the form of objects that model the real world in
OODBMS, rather than a table, tuples, and records in RDBMS. (Refer section 8 for detail)
BACHELOR OF COMPUTER
APPLICATIONS
SEMESTER 3
DCA2102
DATABASE MANAGEMENT SYSTEM
Unit 14
Object Relational Mapping
Table of Contents
SL Topic Fig No / SAQ / Page No
No Table / Activity
Graph
1 Introduction
3
1.1 Objectives
2 Significance of Mapping 1 4–5
3 Mapping Basics 1 6–7
4 Mapping a Class Inheritance Tree 1 8–9
5 Mapping Object Relationships
5.1 Types of relationships
5.2 Implementation of object relationships
5.3 Implementation of relational database relationships 10 – 17
1. INTRODUCTION
We have already discussed so far that the database management task is done so that all the
data working with the application system can be retained, resulting in information and
knowledge. This task may be done by structured query language database management
systems (SQL DBMS) or by object oriented programming techniques or other traditional
methods. SQL DBMS manipulates scalar values such as integers and strings which are
organized within normalized tables whereas the Object-oriented programming technique
implements data management by manipulating the object that represents non-scalar values.
In order to maintain and manage a database, a programmer can either convert the object
values into groups of simpler values for storage in the database (and convert them back upon
retrieval) or can use simple scalar values within the program.
In this unit, we will discuss Object-relational mapping (ORM or O/RM or O/R mapping)
which is a programming technique for converting data between two different paradigms (a
relational database and an object-oriented programming language). We will also discuss the
significance of mapping, modeling, and implementation aspects related to Database
management.
1.1 Objectives
2. SIGNIFICANCE OF MAPPING
We have already mentioned that Object-relational mapping (ORM or O/RM or O/R mapping)
is a programming technique for converting data between two different paradigms (a
relational database and an object-oriented programming language) where data
representations, data manipulation, modeling techniques and other entities related to it are
different. It helps to create a "virtual object database" that can be used from within the
programming language so that the objects can be translated into forms that can be stored in
the database for easy retrieval while preserving the properties of the objects and their
relationships.
Persistent Object
While dealing with ORM you have to know what a persistent object is. It is the object that can
automatically store and retrieve itself in the database while preserving the properties of
objects and their relationships.
Although the advantage of ORM is that it often reduces the amount of code needed to be
written, the disadvantage would be due to some O/R mapping tools that do not perform well
during bulk deletions of data. Hence ORM software has been pointed to as a major factor in
producing poorly designed databases.
Now let us discuss what the significance of mapping is. When we write an application, for
example in Java, which stores data in an RDBMS by relying on data-aware controls (e.g. data-
aware GUI components) to interface directly with the database then such applications are
not object oriented, and by using such two-layer (GUI/Database) applications, we lose the
benefits of object oriented design principle i.e. encapsulation. Applications that use data-
aware widgets or controls, the client and the database are very tightly coupled where GUI
code, business logic, and SQL statements are all interwoven throughout the application
source code. So, any changes in the database schema will surely cascade into unexpected
failures resulting in a maintenance nightmare. This type of problem where technical
difficulties are often encountered when a relational database management system (RDBMS)
is being used by a program written in an object-oriented programming language or style is
known as object-relational impedance mismatch. This type of problem occurs when objects
Self-Assessment Questions - 1
1. Object-relational impedance mismatch occurs when objects or class definitions are
mapped in a straightforward way to________________.
2. The main objective of Object-Relational mapping is to solve object-relational _________.
3. MAPPING BASICS
We can see that classes of OOP language can be mapped to RDBMS tables. For e.g. JAVA
classes can be mapped to RDBMS tables.
We can map a persistent class and a table in so many ways. The simplest mapping between
a persistent class and a table is one-to-one. In this case, all attributes in the persistent class
are represented by all columns of a table. Each instance of a business class is then stored in
a row of the table.
Although this type of mapping is straightforward, it might conflict with the existing object
and entity-relation (ER) models. This is because the goal of object modeling is to model a
business process using real-world objects, whereas the goal of ER modeling is normalization
and fast retrieval of data.
Let us take an example of the Visual Business Sight Framework ™ (VBSF) which is an object-
relational Java framework that allows Java objects to be easily saved and retrieved from
relational databases. Here a Java developer has to work with two different models: (i) an
object model and (ii) a relational model. As Java has to access relational databases and must
deal with tables, rows, and columns, Developers have to go work with tedious routines that
convert rows and columns into Java objects. This is known as the impedance mismatch
between Java objects and relational databases. VBSF contains an object-relational mapping
engine that automatically implements object persistence to relational databases, allowing a
programmer to forget about JDBC and stay focused on the object model.
As a result, VBSF supports two types of class-to-table mappings that help overcome the
differences between relational and object models: SUBSET and SUPERSET.
SUBSET Mapping: This type of mapping is used when all the attributes of a persistent class
are mapped to the same table. It is also used when a persistent class is not concerned with
some of the columns (not part of the business model) of its corresponding table in the
database.
The attributes of a persistent class with a subset mapping represent either a portion of the
columns in a table or all of the columns in the table. Subset mappings can also create
"projection classes" for tables with a large number of columns. A projection class allows a
user to select a row for full retrieval from the database. The full row can be mapped to
another persistent class. This type of design reduces the amount of information passed
across the network. Subset mappings are also used to map a class inheritance tree to a table
using filters.
SUPERSET Mapping: This is done when a persistent class with a superset mapping contains
attributes derived from columns of multiple tables. This type of mapping is also known as
table spanning. Superset mappings can be used to create "view classes" that hide the
underlying data model. It also can map a class inheritance tree to the database using a
Vertical mapping approach. VBSF fully supports performing insertions, updates, and
deletions of objects with this type of mapping, while transparently updating and maintaining
all foreign key columns that join the tables.
Self-Assessment Questions - 2
3. SUBSET Mapping is used when all the attributes of a persistent class are mapped to a
different table. (True/False)
4. SUPERSET Mapping is done when a persistent class with a superset mapping contains
attributes derived from columns of .
A class inheritance tree can be mapped in the RDBMS as vertical mapping, horizontal
mapping, and filtered mapping.
VBSF supports all three methods. A combination of these types of mappings can also be used
within an inheritance tree.
• Vertical Mapping: In vertical mapping, each class in the tree, whether abstract or
concrete, is mapped to a different table. All branch and leaf tables in the tree must be
linked to their parent tables. This is done by the use of a foreign key column that
references the primary key of the parent table.
• Horizontal Mapping: Under horizontal mapping, each concrete class in the tree is
mapped to a different table. Each of these mapped tables contains columns for all
attributes in its concrete class, plus all attributes inherited from all its abstract parent
classes. In other words, abstract classes are not mapped to their own table. This
approach results in fast performance and is simple to design. However, if an attribute
of an abstract parent class is changed, then potentially many tables must be modified.
Consequently, this method is most useful if the inheritance tree is more method-driven
than attribute-driven. In Short, if a substantial number of classes inherit a large number
of attributes from an abstract parent class, then the vertical or filtered methods might
be better choices.
• Filtered Mapping: In filtered mapping, all concrete classes in the tree are mapped to
the same table. The table must contain columns for all attributes of all the abstract and
concrete classes in the inheritance tree (or the part of the tree using this mapping). In
addition, a filter column is created in the table. The value of the filter column is used to
distinguish between subclasses. Abstract classes are not mapped to this table. This
approach provides adequate performance but violates table-normalization rules. More
specifically, it could lead to a substantial number of NULL columns in the table resulting
in wastage of space. Consequently, this method is most useful if most of the attributes
are inherited from the abstract parent classes.
Self-Assessment Questions - 3
5. In vertical mapping, each class in the tree, whether abstract or concrete, is mapped
to a table.
6. In filtered mapping, a filter column is created in the table and the value of the filter
column is used to distinguish between subclasses. (True/False)
It is also possible to define a one-to-one relationship in the relational model using a join
table. VBSF also supports this scenario.
In the object model, there are two types of one-to-many relationships: aggregation
(part-of), and association (acquaintance).
Under VBSF an aggregation relationship is defined by means of an Owned Collection
attribute, and an association relationship by means of a Referenced Collection attribute.
The difference between the two is that in an owned relationship when the owner is
updated in the database, all objects in all its owned collections are automatically
updated (this default behavior can be overridden at runtime if necessary).
To use foreign keys, an embedded foreign key column is defined in each of the tables
involved in the relationship. Each foreign key column holds the key to the other table.
object(s) does not know about the original object. Uni-directional relationships are
easier to implement than bi-directional relationships.
Example - The holds relationship between Employee and Position. Employee objects
know about the position that they hold, but Position objects do not know which
employee holds it (also there is no need to do so).
• Bi-directional relationships: A bi-directional relationship exists when the objects
on both end of the relationship know each other.
Example – The works-in relationship between the Employee and Department.
Employee objects know in which department they work and Department objects also
know what employees work in them.
VBSF automatically and transparently takes care of updating owned objects in the database
whenever an owner object is updated in the database. This behavior is recursive, so if any
owned objects are in turn owners (i.e. they hold other owned collections), all their owned
objects will also be updated in the database too, and so on. In this manner, whole object
graphs can be updated in the database in one operation. This type of behavior is known as
persistence by reachability.
Referenced relationships (Association): This type of relationship exists when (for e.g. in
a one-to-many association) the class on the one side of the relationship merely references
other classes and does not own it. In this case (one-to-many association) an object on the one
side (also referred to as the holder of the collection) of the relationship cannot create, update,
or delete objects on the other side of the relationship and can only retrieve them. So the
holder has to use a Referenced Collection attribute to reference its associated collection
instead of an Owned Collection attribute. A Referenced Collection attribute can also be used
to hold collections of objects of the same class as the holder of the collection.
A Referenced Collection attribute can handle the association in the database by using either
an embedded foreign key column or a join table. To handle the association using a join table,
the referenced collection attribute must be mapped to a join table.
When the multiplicity is one (e.g. 0...1 or 1) the relationship is implemented with a reference
to an object by the use of a getter operation, and a setter operation. The attribute(s) and
operations required to implement a relationship are often referred to as scaffolding.
When the multiplicity is many (e.g. N, 0...*, 1...*) the relationship is implemented via a
collection attribute, such as an Array or a HashSet in Java, and operations to manipulate that
array.
When a relationship is uni-directional, the code is implemented only by the object that knows
about the other object(s). When the relationship is Bi-directional then codes are
implemented by both classes.
In the case of a one-to-one relationship, the foreign key needs to be implemented by one of
the tables. To implement a one-to-many relationship a foreign key is implemented from the
“one table” to the “many tables”. It can be also chosen to overbuild the database schema and
implement a one-to-many relationship via an associative table, effectively making it a many-
to-many relationship.
A consistent key strategy within the database can simplify the relationship mapping efforts
to a greater extent. To make it simpler the first step is to prefer single-column keys and the
next step is to use a globally unique surrogate key.
Because foreign keys are used to join tables, all relationships in a relational database are
effectively bi-directional. This is why it doesn’t matter in which table you implement a one-
to-one relationship; the code to join the two tables is virtually the same.
We will mention the mappings from the point of view of mapping the object relationships
into the relational database. Note that in some cases, we may also choose to design mapping.
A general rule of thumb with relationship mapping is that the multiplicities should be kept
the same. Therefore a one-to-one object relationship maps to a one-to-one data relationship,
a one-to-many maps to a one-to-many, and a many-to-many maps to a many-to-many
relationship. But this doesn’t have to be the case always; a one-to-one object relationship can
be implemented into a one-to-many or even a many-to-many data relationship. This is
because a one-to-one data relationship is a subset of a one-to-many data relationship and a
one-to-many relationship is a subset of a many-to-many relationship.
manually traverse the relationship in the code, taking a lazy read approach (where the other
object is read at the time) when required by the application.
In relational databases the attributes contained in an associative table are traditionally the
combination of the keys in the tables involved in the relationship. The name of an associative
table is typically either the combination of the names of the tables that it associates with or
the name of the association that it implements.
The rule for mapping is that the multiplicities should "cross over" once the associative table
is introduced. A multiplicity of 1 is always introduced on the outside edges of the relationship
within the data schema to preserve the overall multiplicity of the original relationship.
Many-to-many relationships are interesting because of the addition of the associative table.
But two business classes are being mapped to three data tables to support this relationship.
And so this creates another extra work to do.
Although this mapping seems straightforward on the surface, there are several issues that
are needed to be taken into consideration. These issues become apparent when you consider
basic persistence functionality for the aggregate. These issues are (i) Reading the data in the
proper sequence
(ii) not to include the sequence number in the key (iii) updating the sequence numbers after
rearranging the order items (iv) updating sequence numbers after deleting an order item (v)
Considering sequence number gaps greater than one.
For example, an employee may manage several other employees. The aggregate relationship
that the Team class has with itself is recursive. For e.g. a team may be a part of one or more
other teams.
The many-to-many recursive aggregation is mapped to the Subteams associative table in the
same way a normal many-to-many relationship is mapped. The only difference is that both
columns are foreign keys into the same table. Similarly, the one-to-many association is
mapped in the same way a normal one-to-many relationship is mapped.
Self-Assessment Questions - 4
7. Based on directionality, objects relationship can be
relationships and relationships.
8. In Referenced Relationships (Association) a holder has to use a
to reference its associated collection instead of an
Owned Collection attribute.
9. While implementing relational database relationships, in a one-to-manyrelationship a
key is implemented from the “one table” tothe “many
table”.
10. A general rule of thumb with relationship mapping is that the multiplicities in
relationship should be kept . (Choose correct
option)
a) same
b) different
c) uni-directional
d) bi-directional
The join is said to be transactional if one or more additional fields such as a quantity or date
are defined in a join table. The join is said to be transparent If the join only contains foreign
key columns.
Transactional Joins
In a transactional join, the join table is modeled by two or more transaction classes. Each
transaction class is owned by one of the classes involved in the many-to-many relationship.
Transparent Joins
In Transparent joins, the join table is not modeled by a class. Here each class involved in the
relationship has to get methods that return its related objects to the other class. This is
implemented using a Referenced Collection attribute on each class involved in the
relationship.
When two classes are related via a join table, one of the classes must be designated as the
join manager. The join manager has no functionality in the object model. Only one
persistent class can be assigned as the join manager of a join table.
Self-Join Tables
A join does not always have to involve two different tables. You can join a table to itself,
creating a self-join. Joining a table to itself can be useful when you want to compare values
in a column to other values in the same column.
Self-Assessment Questions - 5
11. The join is said to be of a , if one or more additionalfields such as
a quantity or date is defined in a join table. (Choose correct option).
a) transactional
b) transparent
c) self-join
d) join manager
12. When two classes are related via join table, one of the classes must bedesignated as the
.
LiteSQL: It is an open-source mapper and is a C++ library that integrates C++ objects tightly
to a relational database and thus provides an object persistence layer. LiteSQL supports
SQLite3, PostgreSQL and MySQL as backends. This library is distributed under the terms of
the BSD License.
Java: The open-source mappers using JAVA are given below:
Similarly, the name of other Open-source object-relational mapping software using JAVA are
Java Data Objects (JDO), Java Persistence API (JPA), DataNucleus, JPOX, Object Relational
Bridge, OpenJPA, ORMLite, QuickDB ORM, Sobat, etc.
.NET: The names of open source mappers using .NET are given below:
Similarly, the name of other open source using .NET ( some may be with commercial
support) mappers are Base One Foundation Component Library, BLToolkit, ECO, Habanero,
iBATIS, Neo, nHydrate, [Link], Picasso, Quick Objects, SubSonic and AgileFx.
There are other open-source mappers also using languages like Python, Ruby, Perl, Scala,
Delphi, VB6, etc. For example, the open-source mappers using Python are Django, SQLObject,
Storm, XRecord, Autumn and Tryton ActiveRecord, Datamapper, and iBATIS. The sequel is
used in Ruby.
Self-Assessment Questions - 6
13. Hibernate is an open-source mapper and provides relational Persistence for Java and
.NET. (True/False)
14. Castle ActiveRecord is an implementation of the ActiveRecord pattern for _____.
8. SUMMARY
There is a hot debate among many whether the object-relational impedance mismatch is a
problem or not. Few programmers advocate the use of Object-Oriented databases as a
solution for impedance mismatch and not ORM. The objective of Object-Relational mapping
is to solve the problem of object-relational impedance mismatch.
Classes of Objects can be mapped to the RDBMS table. For e.g. VBSF supports two types of
class-to-table mappings that help overcome the differences between relational and object
models: SUBSET and SUPERSET. A class inheritance tree can be mapped in the RDBMS as:
vertical mapping, horizontal mapping, and filtered mapping.
Object relationships for mapping to the relational model can be categorized based on
multiplicity, directionality, ownership, and reference.
A general rule of thumb with relationship mapping is that the multiplicities should be kept
the same.
The join is said to be transactional if one or more additional fields such as a quantity or date
is defined in a join table. The join is said to be transparent. If the join only contains foreign
key columns.
There are many open-source object relations mapping software supported by C++, JAVA,
Python, Ruby, Perl, Scala, Delphi, VB6, etc.
9. TERMINAL QUESTIONS
10. ANSWERS
Self-Assessment Questions
1. database tables or relational schema
2. impedance mismatch
3. False
4. multiple tables
5. different
6. True
7. uni-directional, bi-directional
8. Referenced Collection attribute
9. Foreign
10. a) same
11. a) transactional
12. join manager
13. True
14. .NET
Terminal Questions
DCA2101
COMPUTER ORIENTED NUMERICAL
METHODS
Unit 15
Technological Trends in DBMS
Table of Contents
1. INTRODUCTION
In the previous unit, we studied object-relational mapping and several associated aspects
such as mapping a class inheritance tree, mapping object relationships, modeling with join
tables, and open-source object-relational mapping software. In this unit, we will cover
technological trends in the database management system.
The rapidly evolving and dynamic nature of systems-driven research imposes special
requirements on the technology, design, approach, and architecture of computational
infrastructure including database applications. Several technological advances exploded in
the field of databases during the last few years. In this unit, you will study advanced database
technological developments including cloud computing, temporal database, big data, and
NoSQL databases.
1.1 Objectives
2. CLOUD COMPUTING
Cloud computing has a background in the combination of both client/server computing and
peer-to-peer distributed computing. Cloud computing is a natural evolution of the extensive
adoption of virtualization, service-oriented architecture, and autonomic and utility
computing.
Before even getting deep into cloud computing technology, it is important to understand the
key element of “cloud” which represents computers that are organized in the form of a
network that fulfills the purpose of service-oriented architecture to give out information and
software. Basically, cloud computing technology is set apart from the traditional method
because the resources from computers are arranged in such a manner that the applications
can function irrespective of the server configuration which uses them.
This methodology makes less use of the resources degrading the necessity of using hardware
for the working of the applications. Cloud in cloud computing technology takes up the idea
of using the internet to run software on an individual’s computer. These days internet seems
to be a hub of everything therefore everyone prefers to use software that is entirely based
on the web and can also work on this software using a simple web browser.
To understand cloud computing technology, think of the cloud so that it will have layers
inside divided into two parts: back end and front end. The front end layer consists of
everything visible to a normal human who is using the technology and also gets an
opportunity to interact with it. The back end consists of both hardware and software
required to make the front end interface function properly.
The set of computers in the cloud computing technology are put together so that any
application can take any resource it wishes to take and also use up the complete power as it
usually does if it functions on one single machine. Cloud computing also provides scope for
flexibility that is the number of resources being consumed can vary depending on the task at
hand, which means that the resources can either decrease or increase according to the job.
Trends have been changing rapidly, and the number of people using these cloud computing
methods had only been seen to be increasing without any questions of halting. Though it is
a good thing to know, however it has its own set of risks of which the most primary one is
that if for any reason the internet is down, access to data over another system will be
tampered therefore stopping the work at least for some time. It might even disappear for
longer durations if the internet bill is not paid at the specified time.
The key to cloud computing is the “cloud” a massive network of servers or even individual
PCs interconnected in a grid. These computers run in parallel, combining the resources of
each to generate supercomputing like power. What, exactly, is the “cloud”? Put simply, the
cloud is a collection of computers and servers that are publicly accessible via the Internet.
This hardware is typically owned and operated by a third party on a consolidated basis in
one or more data center locations. The machines can run any combination of operating
systems; it’s the processing power of the machines that matter, not what their desktops look
like. As shown in Figure 15.1, individual users connect to the cloud from their own personal
computers or portable devices, over the Internet. To these individual users, the cloud is seen
as a single application, device, or document. The hardware in the cloud (and the operating
system that manages the hardware connections) is invisible.
This cloud architecture is deceptively simple, although it does require some intelligent
management to connect all those computers together and assign task processing to
multitudes of users. As you can see in Figure 15.2, it all starts with the front-end interface
seen by individual users. This is how users select a task or service (either starting an
application or opening a document). The user’s request then gets passed to the system
management, which finds the correct resources and then calls the system’s appropriate
provisioning services. These services carve out the necessary resources in the cloud, launch
the appropriate web application, and either create or open the requested document. After
the web application is launched, the system’s monitoring and metering functions track the
usage of the cloud so that resources are apportioned and attributed to the proper user(s).
As you can see, the key to the notion of cloud computing is the automation of many
management tasks. The system isn’t a cloud if it requires human management to allocate
processes to resources. What you have in this instance is merely a twenty-first-century
version of the old-fashioned data center–based client/server computing. For the system to
attain cloud status, manual management must be replaced by automated processes.
Self-Assessment Questions - 1
Cloud storage is a model of networked online storage where data is stored in virtualized
pools of storage which are generally hosted by third parties Any web-based application or
service offered via cloud computing is called cloud service.
Cloud storage
One of the primary uses of cloud computing is for data storage. With cloud storage, data is
stored on multiple third-party servers, rather than on the dedicated servers used in
traditional networked data storage. When storing data, the user sees a virtual server, that is,
it appears as if the data is stored in a particular place with a specific name. But that place
doesn’t exist in reality. It’s just an assumed name used to reference virtual space carved out
of the cloud. In reality, the user’s data could be stored on any one or more of the computers
used to create the cloud. The actual storage location may even differ from day to day or even
minute to minute, as the cloud dynamically manages available storage space. But even
though the location is virtual, the user sees a “static” location for his data and can actually
manage his storage space as if it were connected to his own PC. Cloud storage has both
financial and security-associated advantages. Financially, virtual resources in the cloud are
typically cheaper than dedicated physical resources connected to a personal computer or
network. As for security, data stored in the cloud is secure from accidental erasure or
hardware crashes, because it is duplicated across multiple physical machines; since multiple
copies of the data are kept continually, the cloud continues to function as normal even if one
or more machines go offline. If one machine crashes, the data is duplicated on other machines
in the cloud.
Cloud service
Cloud computing is a general term for anything that involves delivering hosted services over
the Internet. These services are broadly divided into three categories:
• Infrastructure-as-a-Service (IaaS),
• Platform-as-a-Service (PaaS)
• Software-as-a-Service (SaaS).
The name cloud computing was inspired by the cloud symbol that's often used to represent
the Internet in flowcharts and diagrams.
A cloud service has three distinct characteristics that differentiate it from traditional hosting.
It is sold on demand, “typically by the minute or the hour; it is elastic – a user can have as
much or as little of a service as they want at any given time;” and the service is fully managed
by the provider (the consumer needs nothing but a personal computer and Internet access).
Significant innovations in virtualization and distributed computing, as well as improved
access to high-speed Internet and a weak economy, have accelerated interest in cloud
computing.
A cloud can be private or public. A public cloud sells services to anyone on the Internet.
(Currently, Amazon Web Services is the largest public cloud provider.) A private cloud is a
proprietary network or a data center that supplies hosted services to a limited number of
people. When a service provider uses public cloud resources to create their private cloud,
the result is called a virtual private cloud. Private or public, the goal of cloud computing is to
provide easy, scalable access to computing resources and IT services.
Infrastructure-as-a-Service like Amazon Web Services provides virtual server instance API
(Application programming interface) to start, stop, access and configure their virtual servers
and storage. In the enterprise, cloud computing allows a company to pay for only as much
capacity as is needed, and bring more online as soon as required. Because this pay-for-what-
you-use model resembles the way electricity, fuel and water are consumed, it's sometimes
referred to as utility computing.
In the software-as-a-service cloud model, the vendor supplies the hardware infrastructure,
the software product and interacts with the user through a front-end portal. SaaS is a very
broad market. Services can be anything from Web-based email to inventory control and
database processing. Because the service provider hosts both the application and the data,
the end-user is free to use the service from anywhere.
Figure 15.3: Responsibilities of vendor and user for different types of services
salesforce: One of the most popular cloud computing SaaS applications is Salesforce CRM.
This was one of the first software with a multitenant platform that charged based on usage
instead of buying the software, deploying, and maintaining the same. You access the software
over the Internet.
Google Apps: Google Apps is a suite of cloud computing SaaS applications that includes e-
mail (Gmail), Organizer (Google Calendar), Word Processing documents (Google Docs), and
others. Figure 15.4 illustrates the various components of Google Apps. It has a free edition
with few applications and other editions with a lot more functionality. Google’s Web-based
messaging and collaboration apps require no server-side hardware or software and need
minimal administration, creating tremendous time and cost savings for businesses.
Office 365: Office 365 is the familiar Microsoft Office now available on the cloud as SaaS. It
is now available as a per-user per-month subscription. You do not need to install the
software on your PC. You just need a web browser to access the service. Figure 15.5
illustrates the various components of Office 365.
Zoho: One of the leading companies which were started in India that has cloud-based SaaS
is Zoho. It has applications similar to the ones offered by salesforce, Office 365, and Google
Apps. Figure 15.6 illustrates the various components which the Zoho supports.
[Link]: [Link] is PaaS offering from [Link]. It is a platform for creating and
deploying applications for social enterprises. Because there are no servers or software to
buy or manage, you can focus solely on building applications that include built-in social and
mobile functionality, business processes, reporting, and search. Your applications run on a
secure, proven service of [Link] that scales, tunes and backs up data automatically.
Windows Azure iPlatform: Windows Azure iPlatform is a cloud platform (PaaS) that
enables you to quickly build, deploy and manage Windows applications across a global
network of Microsoft managed datacenters.
Google App Engine: Google App Engine is a PaaS cloud computing platform used for
developing and hosting web applications in Google-managed datacenters.
Dropbox - Dropbox is an IaaS that provides a Web-based file hosting service. It uses cloud
storage to enable users to store and share files and folders with others across the Internet
using file synchronization.
Sify mystorage: Sify mystorage is a IaaS and provides a cloud-based online storage and
backup solution.
Amazon Web Services: Amazon Web Services (AWS) is a collection of remote computing
services (also called web services) that together make up a cloud computing IaaS platform,
offered over the Internet by [Link]. The most central and well-known of these services
are Amazon EC2 for resizable compute capacity and Amazon S3 Cloud Storage.
features ‘Elastic’ plans where customers can pay for their cloud infrastructure by the hour,
thereby availing of true ‘Pay-As-You-Use’ and ‘On-Demand’ infrastructure.
Traditional companies like Oracle and SAP sold software as licenses with Annual
Maintenance Contract. Cloud companies on the other hand have a subscription model where
customers pay based on usage. This allows companies to try new applications at a very low
cost and when they are comfortable, move all users to use the service. Cloud computing
technology implies the fundamental challenges of how IT operations are managed and
therefore, business as a whole. Traditional companies such as SAP, Oracle, Microsoft, and
Google are now trying to get a big piece of data in the cloud.
Self-Assessment Questions - 2
5. TEMPORAL DATABASE
Temporal databases are used to record time-referenced data. Basically, the majority of the
database technologies are temporal. For example:
Temporal databases are best suited for applications where information is organized on time
constraints. Therefore, the temporal database set a good example to demonstrate the
requirement for the development of a combined set of concepts for the use of application
developers. The framing (objective, design, coding, interface and implementation) of a
temporal database is designed by application developers and designers.
There are numerous applications where time is an important factor in storing information.
For example:
In the case of temporal applications, even the two instances utilized might be simply
expanded. For example, in the COMPANY database, it may be desirable to keep the PROJECT,
JOB, and SALARY histories of all the employees.
It can be applied to the UNIVERSITY database as well, to store the grade history of the
STUDENT. The details about the YEAR, SEMESTER, COURSE, and each SECTION are also
included in the database.
Actually, it can be easily concluded that some temporal information is stored by many of the
database applications. But it is also observed that many users try to ignore temporal features
as it adds complexity to the applications.
Time can be interpreted as valid time (when data occurred or is true in reality) or transaction
time (when data was entered into the database).
A bitemporal database stores data with respect to both valid and transaction time – they
store the history of data with respect to valid time and transaction time.
It is always possible to identify application domains of temporal data since any data can be
represented as temporal data. The question is: when is such transition (considering all
different states of data) required and important? The answer often is how vital it is to have
temporal data for an organization and the benefits that it will bring. Often, when
organizations are storing archives of data then a Temporal Database Management System
(TDBMS) will be useful for manipulating temporal data.
The first and second solution does not involve any changes to existing database technology
and may be simple to form as we just build new methods for temporal support on top of the
existing database system that will be used.
The third solution involves developing a whole new database system with temporal support.
This will be difficult as the underlying principles used by commercial DBMS to optimize
operations must be reformed and a lot of theoretical work needs to be carried out to show
that the new system is fully complete, all new and modified operations perform as required.
The amount of time and manpower required for this approach is similar to that needed by
commercial vendors to develop DBMS that we all are familiar with today.
Thus, this project cannot consider the third solution when developing a temporal database,
as this is out of reach in terms of time and manpower available.
The project adopts the second solution, by using a relational database system to model and
store temporal relations and hence, produces a temporal database.
Self-Assessment Questions - 3
Activity 1
Explain how temporal databases include all database applications to organise their
information.
6. BIG DATA
Big data is a term that describes the large volume of data – both structured and unstructured
– that inundates a business on a day-to-day basis. But it’s not the amount of data that’s
important. It’s what organizations do with the data that matters. Big data can be analyzed for
insights that lead to better decisions and strategic business moves.
While the term “big data” is relatively new, the act of gathering and storing large amounts of
information for eventual analysis is ages-old. The concept gained momentum in the early
2000s when industry analysts articulated the now-mainstream definition of big data as the
three Vs:
Velocity – Data streams in at an unprecedented speed and must be dealt with in a timely
manner. RFID tags, sensors, and smart metering are driving the need to deal with torrents of
data in near-real-time.
Variety – Data comes in all types of formats – from structured, numeric data in traditional
databases to unstructured text documents, email, video, audio, stock ticker data, and
financial transactions.
In real-world big data is used for several applications. Some of the big data applications
include the following:
Self-Assessment Questions - 4
7. NOSQL DATABASES
NoSQL is especially useful when an enterprise needs to access and analyze massive amounts
of unstructured data or data that are stored remotely on multiple virtual servers in the cloud.
Contrary to misconceptions caused by its name, NoSQL does not prohibit SQL. While it's true
that some NoSQL systems are entirely non-relational, others simply avoid selected relational
functionality such as fixed table schemas and join operations. For example, instead of using
tables, a NoSQL database might organize data into objects, key/value pairs, or tuples.
Arguably, the most popular NoSQL database is Apache Cassandra. Cassandra, which was
once Facebook’s proprietary database, was released as open-source in 2008. Other NoSQL
implementations include SimpleDB, Google BigTable, Apache Hadoop, MemcacheDB,
Voldemort, Hbase, and Hypertable. Companies that use NoSQL include Twitter, NetFlix, and
LinkedIn.
Key-Value Store – It has a Big Hash Table of keys & values. A key-value store is a NoSQL
database optimized for read-heavy application workloads such as - social networking,
gaming, media sharing, and Q&A portals, or compute-intensive workloads such as - a
recommendation engine. In-memory caching improves application performance by storing
critical pieces of data in memory for low-latency access. The cached information may include
the results of I/O-intensive database queries or the results of computationally-intensive
calculations. Some examples of such databases are- Amazon S3 (Dynamo), and Riak.
Column-based Store – Each storage block contains data from only one column. In a column-
oriented NoSQL database, data is stored in cells grouped in columns of data rather than as
rows of data. Columns are logically grouped into column families. Column families can
contain a virtually unlimited number of columns that can be created at runtime or the
definition of the schema. Read and write is done using columns rather than rows. Some
examples of such databases include - HBase and Cassandra.
While most relational DBMS store data in rows, the advantage of storing data in columns, is
fast search/ access and data aggregation. Relational databases store a single row as a
continuous disk entry. Different rows are stored in different places on the disk, whereas
Columnar databases store all the cells corresponding to a column as a continuous disk entry
thus making the search/access faster. For example: To query the titles from a bunch of a
million articles will be a painstaking task while using relational databases as it will go over
each location to get item titles. On the other hand, with just one disk access, the title of all the
items can be easily obtained.
Graph-based – A network database that uses edges and nodes to represent and store data.
In a Graph-Based NoSQL Database, you will not find the rigid format of SQL or the tables and
columns representation, a flexible graphical representation is instead used which is perfect
to address scalability concerns. Graph structures are used with edges, nodes and properties
which provides index-free adjacency. Data can be easily transformed from one model to the
other using a Graph Based NoSQL database. These databases use edges and nodes to
represent and store data. Nodes are organised by some relationships with one another,
which is represented by edges between the nodes. Both the nodes and the relationships have
some defined properties. Example of this database includes - Neo4J, Giraph.
In comparison to relational databases, NoSQL databases are more scalable and provide
superior performance. Some of the advantages of NoSQL are given below:
Replication
In contrast to SQL databases, most NoSQL databases allow automatic database replication to
increase availability in the event of outages or planned maintenance events. More
sophisticated NoSQL databases are fully self-healing, offering automated failover and
recovery, as well as the ability to distribute the database across multiple geographic regions
to withstand regional failures and enable data localization.
Integrated Caching
In SQL database systems caching tier can be facilitated by a number of products. These
systems can substantially improve read performance, but they do not improve write
performance, and they add operational complexity to system deployments. If your
application is dominated by reads then a distributed cache could be considered, but if your
application has just a modest write volume, then a distributed cache may not improve the
overall experience of your end-users and will add complexity in managing cache invalidation.
Most NoSQL database technologies have excellent inbuilt caching capabilities, keeping
frequently-used data in system memory as much as possible and removing the need for a
separate caching layer. Additionally, some NoSQL databases also offer a fully managed,
integrated in-memory database management layer for workloads demanding the highest
throughput and lowest latency.
Dynamic Schemas
You are aware of the fact that Relational databases require schemas to be defined before you
can add data. For example, if you want to store data about your customers such as first and
last name, phone numbers, address, city, and state – a SQL database needs to know what you
are storing in advance.
Whereas NoSQL databases provide the flexibility to allow the insertion of data without a
predefined schema. That makes it easy to make significant application changes in real-time,
without worrying about service interruptions – which means code integration is more
reliable, development is faster, and less database administrator time is required. Developers
typically had to add application-side code to enforce data quality controls, such as data types
or permissible values, mandating the presence of specific fields. More sophisticated NoSQL
databases allow validation rules to be applied within the database, allowing users to enforce
governance across data while maintaining the agility benefits of a dynamic schema.
Auto-sharding
Relational databases are structured, therefore they usually scale vertically – a single server
has to host the entire database to ensure acceptable performance for cross-table joins and
transactions. This gets expensive quickly, places limits on scale, and creates a relatively small
number of failure points for database infrastructure. The solution to allow rapidly growing
applications is to scale horizontally, by adding servers instead of concentrating more
capacity on a single server.
'Sharding' a database across many server instances cannot be achieved automatically with
SQL databases. On the other hand NoSQL databases, usually allow auto-sharding, meaning
that they natively and automatically spread data across an arbitrary number of servers,
without requiring the application to even be aware of the composition of the server pool.
Data and query load are automatically balanced across servers, and when a server goes
down, it can be quickly and transparently replaced with no application disruption.
Despite the above advantages, NoSQL has also a few disadvantages, which are given below:
Data Consistency
Many NoSQL databases don’t perform ACID transactions a tried and true technique for
ensuring data consistency across the entire database as it is moved around. Instead, NoSQL
follows the principle of "eventual consistency." This provides some performance benefits,
but it poses the risk that data on one database node may go out of sync with data on another
node.
Security
In comparison to SQL databases, NoSQL databases are generally subject to a fairly long list
of security issues. Most of them will likely be solved in time (a few already have been solved
on certain NoSQL platforms) as NoSQL continues to mature. But currently, security is a
limiting factor for NoSQL deployment.
Self-Assessment Questions - 5
8. SUMMARY
Let us recapitulate the important points discussed in this unit:
9. TERMINAL QUESTIONS
1. Describe the functioning of cloud computing.
2. Explain cloud architecture.
3. What are cloud services?
4. Explain the role of cloud computing in industrial applications.
5. Explain temporal database.
6. What is big data? List the applications where big data can be used.
7. Explain the 3 Vs of big data.
8. Describe the types of NoSQL.
9. Explain the advantages and disadvantages of NoSQL databases.
10. Compare SQL databases and NoSQL databases.
10. ANSWERS
1. Cloud
2. True
3. User interface
4. third-party servers
5. SaaS, PaaS, IaaS
6. Salesforce
7. Office 365
8. Amazon Web Services
9. Time-referenced
10. d. all of the above
11. Variety
12. Google
13. True
14. Graph
15. False
Terminal Questions
1. Fundamentally, the cloud computing technology is different as compared to the
traditional method because cloud computing is the delivery of computing as a service
rather than a product. (For more details refer section 2.1)
2. The key to cloud computing is the “cloud” a massive network of servers or even
individual PCs interconnected in a grid. (For more details refer section 2.2)
3. Different categories of cloud services are a software as a Service (SaaS), Platform as a
Service (PaaS), and Infrastructure as a Service (IaaS). (For more details refer section 3)
4. The cloud computing industry has seen a rapid rise in the number of vendors, with each
vendor trying to get the first-mover advantage. (For more details refer section 4)
5. The temporal database works on time-referenced data. (Refer Section 5 for more
details.)
6. Big data is a term that describes the large volume of data – both structured and
unstructured – that inundates a business on a day-to-day basis. (For more details refer
section 6).
7. Volume, Velocity, and Variety. (Refer Section 6 for more details)
8. There are 4 basic types of NoSQL databases: Key-Value Store, Document-based Store,
Column-based Store, and Graph-based database. (Refer Section 7.1 for more details)
9. Refer Section 7.2.
10. Refer Section 7.3.