DBMS Overview: Concepts and Architecture
DBMS Overview: Concepts and Architecture
IT-601
Contracts: 3L
Credits- 3
Module 1
Introduction :Concept & Overview of DBMS, Data Models, Database Languages, , Database Users, Three
Schema architecture of DBMS.
What is a database?
It is a collection of data, typically describing the activities of one or more related organizations, e.g., a
University, an airline reservation or a banking database.
1) Data redundancy and consistency:- Different files have different formats of programs written in
different programming languages by different users. So the same information may be duplicated in
several files. It may lead to data inconsistency. If a customer changes his address, then it may be
reflected in one copy of data but not in the other.
2) Difficulty in accessing data:- The file system environment does not allow needed data to be retrieved
in a convenient and efficient manner.
3) Data isolation:- Data is scattered in various files; so it gets isolated because file may be in different
formats.
4) Integrity problems:- Data values stored in the database must satisfy consistency constraints. Problem
occurs when constraints involve several data items from different files.
5) Atomicity problems:- If failure occurs, data must be stored to constant state that existed prior to
failure. For example, if in a bank account, a person abc is transferring Rs 5000 to the account of pqr, and
abc has withdrawn the money but before it
gets deposited to the pqr’s account, the system failure occurs, then Rs5000 should be deposited back to
abc’s bank account.
6) Concurrent access anomalies:- Many systems allow multiple users to update data simultaneously.
Concurrent updates should not result in inconsistent data.
7) Security problems:- Not every user of the database system should be able to access all data. Data
base should be protected from access by unauthorized users.
Major purpose of dbms is to provide users with abstract view of data i.e. the system hides certain details
of how the data are stored and maintained.
Since database system users are not computer trained, developers hide the complexity from users
through 3 levels of abstraction, to simplify user’s interaction with the system.
1) Physical level of data abstraction:
This is the lowest level of abstraction which describes how data are actually stored.
2) Logical level of data abstraction:
This level hides what data are actually stored in the database and what relationship exists among them.
3) View Level of data abstraction:
View provides security mechanism to prevent user from accessing certain parts of database.
Explanation:
Physical level: describes how a record (e.g., customer) is stored.
Logical level: describes data stored in database, and the relationships among the data.
type customer = record
name : string;
street : string;
city : integer;
end;
View level: application programs hide details of data types. Views can also hide information (e.g. salary)
for security purposes.
DATA INDEPENDENCE:
It is the capacity to change the conceptual schema without having to change external schemas or
application programs. We may change the conceptual schema to expand the database (by adding a
record type or data item), or to reduce the database (by
removing a record type or data item). In the latter case, external schemas that refer only to the
remaining data should not be affected. Only the view definition and the mappings need be changed in a
DBMS that supports logical data independence.
Application programs that reference the external schema constructs must work as before, after the
conceptual schema undergoes a logical reorganization. Changes to constraints can be applied also to the
conceptual schema without affecting the external schemas or application programs.
Independent from the database model it is important to differentiate between the description of the
database and the database itself. The description of the database is called database scheme or also
metadata. The database scheme is defined during the database design process and changes very rarely
afterwards.
The actual content of the database, the data, changes often over the years. A database state at a
specific time defined through the currently existing content and relationship and their attributes is
called a database instance. The following illustration shows that a database scheme could be looked at
like a template or building plan for one or several database instances.
When designing a database it is differentiated between two levels of abstraction and their respective
data schemes, the conceptual and the logical data scheme.
The conceptual data scheme orients itself exclusively by the database application and therefore by the
real world. It does not consider any data technical infrastructure like DBMS or computer systems, which
are eventually employed. Entity relationship diagrams and relations are tools for the development of a
conceptual scheme.
When designing a database the conceptual data scheme is derived from the logical data scheme
Relational Database Design. This derivation results in a logical data scheme for one specific application
and one specific DBMS. A DB-Development System converts then the logical scheme directly into
instructions for the DBMS.
DATA MODELS:
Many data models have been proposed, and we can categorize them according to the types of concepts
they use to describe the database structure.
High-level or conceptual data models provide concepts that are close to the way many users perceive
data, whereas lowlevel or physical data models provide concepts that describe the details of how data
is stored in the computer. Concepts provided by low-level data models are generally meant for
computer specialists, not for typical end users. Between these two extremes is a class of
representational (or implementation) data models, which provide concepts that may be understood by
end users but that are not too far removed from the way data is organized within the computer.
Representational data models hide some details of data storage but can be implemented on a computer
system in a direct way. Conceptual data models use concepts such as entities, attributes, and
relationships.
An entity represents a real-world object or concept, such as an employee or a project, that is described
in the database. An attribute represents some property of interest that further describes an entity, such
as the employee’s name or salary. A relationship among two or more entities represents an interaction
among the entities, which is explained by the Entity-Relationship model—a popular high-level
conceptual data model. Representational or implementation data models are the models used most
frequently in traditional commercial DBMSs, and they include the widely-used relational data model, as
well as the so-called legacy data models—the network and hierarchical
models—that have been widely used in the past. We can regard object data models as a new family of
higherlevel implementation data models that are closer to conceptual
data models. Object data models are also frequently utilized as highlevel conceptual models,
particularly in the software engineering domain.
Physical data models describe how data is stored in the computer by representing information such as
record formats, record orderings, and access paths. An access path is a structure that makes the search
for particular database records efficient.
The simple structure of a hierarchical database became a disadvantage when the data had a more
complex structure. In an order-processing database, for example, a single order might participate in
three different parent/child relationships, linking the order to the customer who placed it, the
salesperson who took it, and the product ordered. The structure of this type of data simply didn't fit the
strict hierarchy of IMS. To deal with applications such as order processing, a new network data model
was developed. The network data model extended the hierarchical model by allowing a record to
participate
in multiple parent/child relationships. For a programmer, accessing a network database was very similar
to accessing a hierarchical database. An application program could:
• find a specific parent record by key (such as a customer number),
• Move down to the first child in a particular set (the first order placed by this customer),
• Move sideways from one child to the next in the set (the next order placed by the same customer), or
• move up from a child to its parent in another set (the salesperson who took the order).
Once again the programmer had to navigate the database record-by-record, this time specifying which
relationship to navigate as well as the direction.
Network databases had several advantages:
• Flexibility: Multiple parent/child relationships allowed a network database to represent data that did
not have a simple hierarchical structure.
• Standardization: The CODASYL standard boosted the popularity of the network model, and
minicomputer vendors such as Digital Equipment Corporation and Data General implemented network
databases.
• Performance: Despite their greater complexity, network databases boasted performance approaching
that of hierarchical databases. Sets were represented by pointers to physical data records, and on some
systems, the database administrator could specify data clustering based on a set relationship. Network
databases had their disadvantages, too. Like hierarchical databases, they were very rigid. The set
relationships and the structure of the records had to be specified in advance.
Changing the database structure typically required rebuilding the entire database.
Database Languages:
Data Definition Language (DDL) statements are used to define the database structure or schema. Some
examples:
CREATE - to create objects in the database
ALTER - alters the structure of the database
DROP - delete objects from the database
TRUNCATE - remove all records from a table, including all spaces allocated for the records are removed
COMMENT - add comments to the data dictionary
RENAME - rename an object
Data Manipulation Language (DML) statements are used for managing data within schema objects.
Some examples:
SELECT - retrieve data from the a database
INSERT - insert data into a table
UPDATE - updates existing data within a table
DELETE - deletes all records from a table, the space for the records remain
MERGE - UPSERT operation (insert or update)
CALL - call a PL/SQL or Java subprogram
EXPLAIN PLAN - explain access path to data
LOCK TABLE - control concurrency
Transaction Control (TCL) statements are used to manage the changes made by DML statements. It
allows statements to be grouped together into logical transactions.
COMMIT - save work done
SAVEPOINT - identify a point in a transaction to which you can later roll back
ROLLBACK - restore database to original since the last COMMIT
SET TRANSACTION - Change transaction options like isolation level and what rollback segment to use.
If a table is dropped, all the relationships with other tables will no longer be valid, the integrity
constraints will be dropped, grant or access privileges on the table will also be dropped, if you want use
the table again it has to be recreated with the integrity constraints, access privileges and the
relationships with other tables should be established again. But, if a table is truncated, the table
structure remains the same, therefore any of the above problems will not exist.
DBMS developer: person or a group of people who design and implement DBMS software.
Database Administrator (DBA): person or a group of people who define, modify, and manage a
database using an existing DBMS.
Database User (Naive User, Sophisticated User, and DB Application Programmer): Person who uses a
DBS.
1) Naive users:
These are the unsophisticated users who interact with the system by invoking one of the application
programs that have been written previously.
E.g. consider a user who checks for account balance information over the World Wide Web. Such a user
access a form, enters the account number and password etc. And the application program on the
internet then retrieves the account balance using given account information which s passed to the user.
2) Application programmers:
These are computer professionals who write application programs, used to develop user interfaces. The
application programmer uses Rapid Application Development (RAD) toolkit or special type of
programming languages which include special features to facilitate generation of forms and display of
date on screen.
3) Sophisticated users:
These users interact with the database using database query language. They submit their query to the
query processor.
Then Data Manipulation Language (DML) functions are performed on the database to retrieve the data.
Tools used by these users are OLAP (Online Analytical Processing) and data mining tools.
4) Specialized users:
These users write specialized database applications to retrieve data. These applications can be used to
retrieve data with complex data types e.g. graphics data and audio data.
b) Database Administrator (DBA)
A person having who has central control over data and programs that access the data is called DBA.
Following are the functions of the DBA.
1) Schema definition: DBA creates database schema by executing Data Definition Language (DDL)
statements.
2) Storage structure and access method definition
3) Schema and physical organization modification: If any changes are to be made in the original
schema, to fit the need of your organization, then these changes are carried out by the DBA.
4) Granting of authorization foe data access: DBA can decide which parts of data can be accessed by
which users. Before any user access the data, dbms checks which rights are granted to the user by the
DBA.
5) Routine maintenance: DBA has to take periodic backups of the database, ensure that enough disk
space is available to store new data, ensure that performance of dbms ix not degraded by any operation
carried out by the users.
-----------Database System
Centralized control of the database is exerted by a person or group of persons under the supervision of
a high – level administrator. This person or group is referred to as the database administrator (DBA).
They are the users who are most familiar with the database and are responsible for creating, modifying,
and maintaining its three levels. Database Administrator is responsible to manage the DBMS’s use and
ensure that the database is functioning properly.
DBA administers the three levels of database and consultation with the overall user community, sets up
the definition of the global view of the various users and applications and is responsible the definition
and implementation of the internal level, including the storage structure and access methods to be used
for the optimum performance of the DBMS. DBA is responsible for granting permission to the users of
the database and stores the profile of each user in the database.
Responsibilities of DBA
Deciding the information content of the database – It is the DBA’s job to decide exactly what
information is to be held in the database – in other words, to identify the entities of interest to the
enterprise and to identify the information to be recorded about those entities. Having done this, the
DBA must then define the content of the database by writing the conceptual schema.
Deciding the storage structure and access strategy – The DBA must also decide how the data is to be
represented in the database, and must specify the representation by writing the storage structure
definition. In addition, the associated mapping between storage structure definition and the conceptual
schema must also be specified.
Liaising with the users – It is the business of the DBA to liaise with users, to ensure that the data they
require is available, and to write the necessary external schemas. In addition, the mapping between any
given external schema and the conceptual schema must also be specified. In practice the external DDL
will probably include the means for specifying the mapping, but the schema and the mapping should be
clearly distinguishable.
Defining authorization checks and validation procedures – Authorization checks and validation
procedures may be considered as logical extensions of the conceptual schema. The conceptual DDL will
include facilities for specifying such checks and procedures.
The goal of the three-schema architecture, illustrated in above Figure, is to separate the user
applications and the physical database. In this architecture, schemas can be defined at the following
three levels:
1. The internal level has an internal schema, which describes the physical storage structure of the
database. The internal schema uses a physical data model and describes the complete details of data
storage and access paths for the database.
2. The conceptual level has a conceptual schema, which describes the structure of the whole database
for a community of users. The conceptual schema hides the details of physical storage structures and
concentrates on describing entities, data types, relationships, user operations, and constraints. A high-
level data model or an implementation data model can be used at this level.
3. The external or view level includes a number of external schemas or user views. Each external
schema describes the part of the database that a particular user group is interested in and hides the rest
of the database from that user group. A high-level
data model or an implementation data model can be used at this level.
The three-schema architecture is a convenient tool for the user to visualize the schema levels in a
database system. In most DBMSs that support user views, external schemas are specified in the same
data model that describes the conceptual-level information. Some DBMSs allow different data models to
be used at the conceptual and external levels. Notice that the three schemas are only descriptions of
data; the only data that actually exists is at the physical level. In a DBMS based on the three schema
architecture, each user group refers only to its own external schema. Hence, the DBMS must transform
a request specified on an external schema into a request against the conceptual schema, and then into a
request on the internal schema for processing over the stored database. If the request is database
retrieval, the data extracted from the stored database must be reformatted to match the user’s external
view. The processes of transforming requests and results between levels are called mappings. These
mappings may be time-consuming, so some DBMSs—especially those that are meant to support small
databases—do not support external views. Even in such systems, however, a certain amount of mapping
is necessary to transform requests between the conceptual and internal levels.
Module 2
Entity-Relationship Model
Basic concepts, Design Issues, Mapping Constraints, Keys, Entity-Relationship Diagram, Weak
Entity Sets, Extended E-R features.
ENTITY RELATIONSHIP MODEL
ENTITY
The basic object that the ER model represents is an entity, which is a "thing" in the real world
with an independent existence.
An entity may be an object with a physical existence—a particular person, car, house, or
employee—or it may be an object with a conceptual existence—a company, a job, or a
university course.
ATTRIBUTES
• Composite attributes can be divided into smaller subparts, which represent more basic attributes with
independent meanings.
• For example, the Address attribute of the employee entity can be sub-divided into Street_Name, City,
State, and Zip.
• Attributes that are not divisible are called simple or atomic attributes.
• Composite attributes can form a hierarchy; for example, Name can be subdivided into three simple
attributes, First_Name, Middle Name, Last_Name.
• The value of a composite attribute is the concatenation of the values of its constituent simple
attributes.
Composite Attributes
• Attributes which have only one value for a entity are called single valued attributes.
• E.g. For a student entity, RollNo attribute has only one single value.
• But phone number attribute may have multiple values. Such values are called Multi-valued attributes.
• Two or more attribute values are related—for example, the Age and Birth Date attributes of a person.
• For a particular person entity, the value of Age can be determined from the current (today’s) date and
the value of that person’s Birth Date.
• The Age attribute is hence called a derived
• The attribute from which another attribute value is derived is called stored attribute.
• In the above example, date of birth is the stored attribute.
• Take another example, if we have to calculate the interest on some principal amount for a given time,
and for a particular rate of interest, we can simply use the interest formula.
o Interest=NPR/100;
• In this case, interest is the derived attribute whereas principal amount(P), time(N) and rate of
interest(R) are all stored attributes.
KEYS
An important constraint on the entities of an entity type is the key or uniqueness constraint on
attributes.
• A key is an attribute (also known as column or field) or a combination of attribute that is used to
identify records.
• Sometimes we might have to retrieve data from more than one table, in those cases we require to join
tables with the help of keys.
• The purpose of the key is to bind data together across tables without repeating all of the data in every
table
• Such an attribute is called a key attribute, and its values can be used to identify each entity uniquely.
• For example, the Name attribute is a key of the COMPANY entity type because no two companies are
allowed to have the same name.
• For the PERSON entity type, a typical key attribute is SocialSecurityNumber.
• Sometimes, several attributes together form a key, meaning that the combination of the attribute
values must be distinct for each entity.
• If a set of attributes possesses this property, we can define a composite attribute that becomes a key
attribute of the entity type.
The various types of key with e.g. in SQL are mentioned below, (For examples let suppose we have an
Employee Table with attributes ‘ID’ , ‘Name’ ,’Address’ , ‘Department_ID’ ,’Salary’)
(I) Super Key – An attribute or a combination of attribute that is used to identify the records uniquely is
known as Super Key. A table can have many Super Keys.
7 Name, Address, Department_ID ………… So on as any combination which can identify the records
uniquely will be a Super Key.
(II) Candidate Key – It can be defined as minimal Super Key or irreducible Super Key. In other words an
attribute or a combination of attribute that identifies the record uniquely but none of its proper subsets
can identify the records uniquely.
E.g. of Candidate Key
1 Code
2 Name, Address
For above table we have only two Candidate Keys (i.e. Irreducible Super Key) used to identify the
records from the table uniquely. Code Key can identify the record uniquely and similarly combination of
Name and Address can identify the record uniquely,
but neither Name nor Address can be used to identify the records uniquely as it might be possible that
we have two employees with similar name or two employees from the same house.
(III) Primary Key – A Candidate Key that is used by the database designer for unique identification of
each row in a table is known as Primary Key. A Primary Key can consist of one or more attributes of a
table.
E.g. of Primary Key - Database designer can use one of the Candidate Key as a Primary Key. In this case
we have “Code” and “Name, Address” as Candidate Key, we will consider “Code” Key as a Primary Key as
the other key is the combination of more than one attribute.
(IV) Foreign Key – A foreign key is an attribute or combination of attribute in one base table that points
to the candidate key (generally it is the primary key) of another table. The purpose of the foreign key is
to ensure referential integrity of the data i.e. only values that are supposed to appear in the database
are permitted.
E.g. of Foreign Key – Let consider we have another table i.e.
Department Table with Attributes “Department_ID”, “Department_Name”, “Manager_ID”,
”Location_ID” with Department_ID as an Primary Key. Now the Department_ID attribute of Employee
Table (dependent or child table) can be defined as the Foreign Key as it can reference to the
Department_ID attribute of the Departments table (the referenced
or parent table), a Foreign Key value must match an existing value in the parent table or be NULL.
(V) Composite Key – If we use multiple attributes to create a Primary Key then that Primary Key is called
Composite Key (also called a Compound Key or Concatenated Key).
E.g. of Composite Key, if we have used “Name, Address” as a Primary Key then it will be our Composite
Key.
(VI) Alternate Key – Alternate Key can be any of the Candidate Keys except for the Primary Key.
E.g. of Alternate Key is “Name, Address” as it is the only other Candidate Key which is not a Primary Key.
(VII) Secondary Key – The attributes that are not even the Super Key but can be still used for
identification of records (not unique) are known as Secondary Key.
E.g. of Secondary Key can be Name, Address, Salary, Department_ID etc. as they can identify the
records but they might not be unique.
RELATION
• There are several implicit relationships among the various entity types.
• In fact, whenever an attribute of one entity type refers to another entity type, some relationship
exists.
• For example, the attribute Manager of department refers to an employee who manages the
department.
• In the ER model, these references should not be represented as relationships or relation. There is a
relation “borrower” in the entities customer and account which can be shown as follows:
CARDINALITY
Mapping cardinalities, or cardinality ratios, express the number of entities to which another entity can
be associated via a relationship set.
For a relationship set R between entity sets A and B, the mapping cardinality must be one of the
following:
There are three types of relationships
1) One to one
2) One to many
3) Many to many
2.5.1 One to one:
An entity in A is associated with at most one entity in B, and an entity in B is associated with at most one
entity in A.
2.5.2 One to many:
An entity in A is associated with any number (zero or more) of entities in B. An entity in B, however, can
be associated with at most one entity in A.
2.5.3 Many to one:
An entity in A is associated with at most one entity in B. An entity in B, however, can be associated with
any number (zero or more) of entities in A.
2.5.4 Many to many:
An entity in A is associated with any number (zero or more) of entities in B, and an entity in B is
associated with any number (zero or more) of entities in A.
PARTICIPATION
• The participation of an entity set E in a relationship set R is said to be total if every entity in E
participates in at least one relationship in R.
• If only some entities in E participate in relationships in R, the participation of entity set E in
relationship R is said to be partial.
• For example, we expect every loan entity to be related to at least one customer through the borrower
relationship.
• Therefore the participation of loan in the relationship set borrower is total.
• In contrast, an individual can be a bank customer whether or not she has a loan with the bank.
• Hence, it is possible that only some of the customer entities are related to the loan entity set through
the borrower relationship, and the participation of customer in the borrower relationship set is
therefore partial.
WEAK ENTITIES
• An entity set may not have sufficient attributes to form a primary key.
• Such an entity set is termed a weak entity set.
• An entity set that has a primary key is termed a strong entity set.
• As an illustration, consider the entity set payment, which has the three attributes: payment-number,
payment-date, and paymentamount.
• Payment numbers are typically sequential numbers, starting from 1, generated separately for each
loan.
• Thus, al-though each payment entity is distinct, payments for different loans may share the same
payment number. Thus, this entity set does not have a primary key; it is a weak entity set.
• For a weak entity set to be meaningful, it must be associated with another entity set, called the
identifying or owner entity set.
• Every weak entity must be associated with an identifying entity; that is, the weak entity set is said to
be existence dependent on the identifying entity set.
• The identifying entity set is said to own the weak entity set that it identifies.
• The relationship associating the weak entity set with the identifying entity set is called the identifying
relationship.
• The identifying relationship is many to one from the weak entity set to the identifying entity set, and
the participation of the weak entity set in the relationship is total.
• In our example, the identifying entity set for payment is loan, and a relationship loan-payment that
associates payment entities with their corresponding loan entities is the identifying relationship.
Department of Information Technology
Nivedita Ray De Sarkar Page
1616
Database Management System
• Although a weak entity set does not have a primary key, we nevertheless need a means of
distinguishing among all those entities in the weak entity set that depend on one particular strong
entity.
• The discriminator of a weak entity set is a set of attributes that allows this distinction to be made.
• For example, the discriminator of the weak entity set payment is the attribute payment-number, since,
for each loan, a payment number uniquely identifies one single payment for that loan.
• The discriminator of a weak entity set is also called the partial key of the entity set.
• The primary key of a weak entity set is formed by the primary key of the identifying entity set, plus the
weak entity set’s discriminator.
• In the case of the entity set payment, its primary key is {loannumber, payment-number}, where loan-
number is the primary key of the identifying entity set, namely loan, and paymentnumber distinguishes
payment entities within the same loan.
• The identifying relationship set should have no descriptive attributes, since any required attributes can
be associated with the weak entity set
• A weak entity set can participate in relationships other than the identifying relationship.
• For instance, the payment entity could participate in a relationship with the account entity set,
identifying the account from which the payment was made.
• A weak entity set may participate as owner in an identifying relationship with another weak entity set.
• It is also possible to have a weak entity set with more than one identifying entity set.
• A particular weak entity would then be identified by a combination of entities, one from each
identifying entity set.
• The primary key of the weak entity set would consist of the union of the primary keys of the
identifying entity sets, plus the discriminator of the weak entity set.
• In E-R diagrams, a doubly outlined box indicates a weak entity set, and a doubly outlined diamond
indicates the corresponding identifying relationship.
• The weak entity set payment depends on the strong entity set loan via the relationship set loan-
payment.
• The figure also illustrates the use of double lines to indicate total participation—the participation of
the (weak) entity set payment in the relationship loan-payment is total, meaning that every payment
must be related via loan-payment to some loan.
Finally, the arrow from loan-payment to loan indicates that each payment is for a single loan. The
discriminator of a weak entity set also is underlined, but with a dashed, rather than a solid, line.
Specialization:
• An entity set may include sub groupings of entities that are distinct in some way from other entities in
the set.
• For instance, a subset of entities within an entity set may have attributes that are not shared by all the
entities in the entity set.
The E-R model provides a means for representing these distinctive entity groupings.
• Consider an entity set person, with attributes name, street, and city. A person may be further
classified as one of the following:
• Customer
• Employee
• Each of these person types is described by a set of attributes that includes all the attributes of entity
set person plus possibly additional attributes.
• For example, customer entities may be described further by the attribute customer-id, whereas
employee entities may be described further by the attributes employee-id and salary.
• The process of designating sub groupings within an entity set is called specialization.
• The specialization of person allows us to distinguish among persons according to whether they are
employees or customers.
• As another example, suppose the bank wishes to divide accounts into two categories, checking
account and savings account. Savings accounts need a minimum balance, but the bank may set interest
rates differently for different customers, offering better rates to favored customers.
• Checking accounts have a fixed interest rate, but offer an overdraft facility; the overdraft amount on a
checking account must be recorded.
• The bank could then create two specializations of account, namely savings-account and checking-
account.
• As we saw earlier, account entities are described by the attributes account-number and balance.
• The entity set savings-account would have all the attributes of account and an additional attribute
interest-rate.
• The entity set checking-account would have all the attributes of account, and an additional attribute
overdraft-amount.
• We can apply specialization repeatedly to refine a design scheme. For instance, bank employees may
be further classified as one of the following:
• Officer
• Teller
• Secretary
• Each of these employee types is described by a set of attributes that includes all the attributes of
entity set employee plus additional attributes. For example, officer entities may be described further by
the attribute office-number, teller entities by the attributes station-number and hours-per-week, and
secretary entities by the attribute hours-per-week. Further, secretary entities may participate in a
relationship secretary-for, which identifies which employees are assisted by a secretary.
• An entity set may be specialized by more than one distinguishing feature. In our example, the
distinguishing feature among employee entities is the job the employee performs.
Another, coexistent, specialization could be based on whether the person is a temporary (limited-term)
employee or a permanent employee, resulting in the entity sets temporaryemployee and permanent-
employee. When more than one specialization is formed on an entity set, a particular entity may belong
to multiple specializations. For instance, a given employee may be a temporary employee who is a
secretary.
• In terms of an E-R diagram, specialization is depicted by a triangle component labeled ISA. The label
ISA stands for “is a” and represents, for example, that a customer “is a” person. The ISA relationship may
also be referred to as a super class subclass relationship. Higher- and lower-level entity sets are depicted
as regular entity sets—that is, as rectangles containing the name of the entity set.
Generalization:
• The refinement from an initial entity set into successive levels of entity sub groupings represents a
top-down design process in which distinctions are made explicit. The design process may also proceed in
a bottom-up manner, in which multiple entity sets are synthesized into a higher-level entity set on the
basis of common features. The database designer may have first identified a customer entity set with
the attributes name, street, city, and customer-id, and an employee entity set with the attributes name,
street, city, employee-id, and salary. There are similarities between the customer entity set and the
employee entity set in the sense that they have several attributes in common. This commonality can be
expressed by generalization, which is a containment relationship that exists between a higher-level
entity set and one or more lower-level entity sets. In our example, person is the higher-level entity set
and customer and employee are lower-level entity sets.
• Higher- and lower-level entity sets also may be designated by the terms super class and subclass,
respectively. The person entity set is the super class of the customer and employee subclasses.
• For all practical purposes, generalization is a simple inversion of specialization. We will apply both
processes, in combination, in the course of designing the E-R schema for an enterprise. In terms of the E-
R diagram itself, we do not distinguish between specialization and generalization. New levels of entity
representation will be distinguished (specialization) or synthesized (generalization) as the design schema
comes to express fully the database application and the user requirements of the database.
Aggregation:
• One limitation of the E-R model is that it cannot express relationships among relationships.
• To illustrate the need for such a construct, consider the ternary relationship works-on, which we saw
earlier, between a employee, branch,and job.
• Now, suppose we want to record managers for tasks performed by an employee at a branch; that is,
we want to record managers for (employee, branch, job)combinations. Let us assume that there is an
entity set manager.
• One alternative for representing this relationship is to create a quaternary relationship manages
between employee, branch, job, and manager. (A quaternary relationship is required—a binary
relationship between manager and employee would not permit us to represent which (branch, job)
combinations of an employee are managed by which manager.)
• Using the basic E-R modeling constructs, we obtain the E-R diagram as follows:
• It appears that the relationship sets works-on and manages can be combined into one single
relationship set.
• Nevertheless, we should not combine them into a single relationship, since some employee, branch,
job combinations many not have a manager.
• There is redundant information in the resultant figure, however, since every employee, branch, job
combination in manages is also in works-on.
• If the manager were a value rather than an manager entity, we could instead make manager a multi
valued attribute of the relationship works-on.
• But doing so makes it more difficult (logically as well as in execution cost) to find, for example,
employee-branch-job triples for which a manager is responsible. Since the manager is a manager entity,
this alternative is ruled out in any case.
• The best way to model a situation such as the one just described is to use aggregation.
• Aggregation is an abstraction through which relationships are treated as higher-level entities.
• Following figure shows a notation for aggregation commonly used to represent the above situation.
Conversion of relationships:
a. For every one-to-one relationship, any one of the entity’s primary key is taken as foreign key
to the other entity to establish the relationship.
b. For every n-to-1 or 1-to-n relationship the primary key of the entity in the side 1 of the ER
schema is taken as foreign key to the other side to establish the relationship.
c. For a n-to-m relationship create a relation for the relationship set where the primary key will
consist of the primary keys of the entities present taken as foreign key.
d. For multivalued attributes create a relation taking the multivalued attributes and the
entity’s primary key as a foreign key.
Module 3
Relational Model : Structure of relational Databases, Relational Algebra, Relational Calculus, Extended
Relational Algebra Operations, Views, Modifications Of the Database.
The relational model represents the database as a collection of relations. Informally, each relation
resembles a table of values or, to some extent, a "flat" file of records. When a relation is thought of as a
table of values, each row in the table represents a collection of related data values. In the relational
model, each row in the table represents a fact that typically corresponds to a real world entity or
relationship. The table name and column names are used to help in interpreting the meaning of the
values in each row. In the formal relational model terminology, a row is called a tuple, a column header
is called an attribute, and the table is called a relation. The data type describing the types of values that
can appear in each column is called a domain. We now define these terms—domain, tuple, attribute,
and relation—more precisely.
To preserve the consistency and correctness of its stored data, a relational DBMS typically imposes one
or more data integrity constraints. These constraints restrict the data values that can be inserted into
the database or created by a database update.
Several different types of data integrity constraints are commonly found in relational databases,
including:
• Required data: Some columns in a database must contain a valid data value in every row; they are not
allowed to contain missing or NULL values. In the sample database, every order must have an associated
customer who placed the order. The DBMS can
be asked to prevent NULL values in this column.
• Validity checking: Every column in a database has a domain, a set of data values that are legal for that
column. The DBMS can be asked to prevent other data values in these columns.
• Entity integrity: The primary key of a table must contain a unique value in each row, which is different
from the values in all other rows. Duplicate values are illegal, because they wouldn't allow the database
to distinguish one entity from another. The DBMS can be asked to enforce this unique values constraint.
• Referential integrity: A foreign key in a relational database links each row in the child table containing
the foreign key to the row of the parent table containing the matching primary key value. The DBMS can
be asked to enforce this foreign key/primary key
constraint.
• Other data relationships: The real-world situation modeled by a database will often have additional
constraints that govern the legal data values that may appear in the database.
The DBMS can be asked to check modifications to the tables to make sure that their values are
constrained in this way.
Business rules: Updates to a database may be constrained by business rules governing the real-world
transactions that are represented by the updates.
• Consistency: Many real-world transactions cause multiple updates to a database. The DBMS can be
asked to enforce this type of consistency rule or to support applications that implement such rules.
RELATIONAL ALGEBRA:
The relational algebra is a procedural query language. It consists of a set of operations that take one or
two relations as input and produce a new relation as their result. The fundamental operations in the
relational algebra are select, project, union, set difference, Cartesian product, and rename. In addition
to the fundamental operations, there are several other operations—namely, set intersection, natural
join, division, and assignment. We will define these operations in terms of the fundamental operations.
Relational algebra is the basic set of operations for the relational model
These operations enable a user to specify basic retrieval requests (or queries)
The result of an operation is a new relation, which may have been formed from one or more
input relations
This property makes the algebra “closed” (all objects in relational algebra are relations)
The algebra operations thus produce new relations
----------These can be further manipulated using operations of the same algebra
A sequence of relational algebra operations forms a relational algebra expression
------------- The result of a relational algebra expression is also a relation that represents the result of
a database query (or retrieval request).
Fundamental Operations
The select, project, and rename operations are called unary operations,
because they operate on one relation. The other three operations operate on pairs of relations and are,
therefore, called binary operations.
The select operation selects tuples that satisfy a given predicate. We use the lowercase Greek letter
sigma (σ) to denote selection. The predicate appears as a subscript to σ.
The argument relation is in parentheses after the σ. Thus, to select those tuples of the loan relation
where the branch is “Perryridge,” we write
σbranch-name =“Perryridge” (loan)
We can find all tuples in which the amount lent is more than $1200 by writing σamount>1200 (loan)
In general, we allow comparisons using =, _=, <, ≤, >, ≥ in the selection predicate.
Furthermore, we can combine several predicates into a larger predicate by using the connectives and
( ), or ( ), and not (¬).
Thus, to find those tuples pertaining to loans of more than $1200 made by the Perryridge branch, we
write:
σbranch-name =“Perryridge” amount>1200 (loan)
The selection predicate may include comparisons between two attributes. To illustrate, consider the
relation loan-officer that consists of three attributes: customer-name, banker-name, and loan-number,
which specifies that a particular banker is the loan
officer for a loan that belongs to some customer. To find all customers who have the same name as their
loan officer, we can write σcustomer-name =banker-name (loan-officer).
Relational algebra, an offshoot of first-order logic (and of algebra of sets), deals with a set of finitary
relations which is closed under certain operators. These operators operate on one or more relations to
yield a relation.
As in any algebra, some operators are primitive and the others, being definable in terms of the primitive
ones, are derived. It is useful if the choice of primitive operators parallels the usual choice of primitive
Department of Information Technology
Nivedita Ray De Sarkar Page
2424
Database Management System
logical operators. Although it is well known that the usual choice in logic of AND, OR and NOT is
somewhat arbitrary, Codd made a similar arbitrary choice for his algebra. The six primitive operators of
Codd's algebra are the selection, the projection, the Cartesian product (also called the cross product or
cross join), the set union, the set difference, and the rename. (Actually, Codd omitted the rename, but
the compelling case for its inclusion was shown by the inventors of ISBL.) These six operators are
fundamental in the sense that none of them can be omitted without losing expressive power. Many
other operators have been defined in terms of these six. Among the most important are set intersection,
division, and the natural join. In fact ISBL made a compelling case for replacing the Cartesian product
with the natural join, of which the Cartesian product is a degenerate case. Altogether, the operators of
relational algebra have identical expressive power to that of domain relational calculus or tuple
relational calculus. However, for the reasons given in the Introduction above, relational algebra has
strictly less expressive power than that of first-order predicate calculus without function symbols.
Relational algebra actually corresponds to a subset of first-order logic that is Horn clauses without
recursion and negation.
Set operators
Although three of the six basic operators are taken from set theory, there are additional constraints that
are present in their relational algebra counterparts: For set union and set difference, the two relations
involved must be union-compatible—that is, the two relations must have the same set of attributes. As
set intersection can be defined in terms of set difference, the two relations involved in set intersection
must also be union-compatible.
The Cartesian product is defined differently from the one defined in set theory in the sense that tuples
are considered to be 'shallow' for the purposes of the operation. That is, unlike in set theory, where the
Cartesian product of a n-tuple by an m-tuple is a set of 2-tuples, the Cartesian product in relational
algebra has the 2-tuple "flattened" into an n+m-tuple. More formally, R × S is defined as follows:
R × S = {r s | r R, s S}
In addition, for the Cartesian product to be defined the two relations involved must have disjoint
headers — that is, they must not have a common attribute name.
Projection (π)
A projection is a unary operation written as where a1,...,an is a set of attribute names. The result of
such projection is defined as the set that is obtained when all tuples
in R are restricted to the set {a1,...,an}.
Selection (σ)
A generalized selection is a unary operation written as where is a propositional formula that consists of
atoms as allowed in the normal selection and the logical operators (and), (or) and (negation). This
selection selects all those tuples in R
for which holds.
Rename (ρ)
A rename is a unary operation written as ρa / b(R) where the result is identical to R except that the b
field in all tuples is renamed to an a field. This is simply used to rename the attribute of a relation or the
relation itself.
UNION Operation
Example:
To retrieve the social security numbers of all employees who either work in department 5
(RESULT1 below) or directly supervise an employee who works in department 5 (RESULT2
below)
We can use the UNION operation as follows:
DEP5_EMPS ← σDNO=5 (EMPLOYEE)
RESULT1 ← π SSN(DEP5_EMPS)
RESULT2(SSN) ← πSUPERSSN(DEP5_EMPS)
RESULT ← RESULT1 ∪ RESULT2
The union operation produces the tuples that are in either RESULT1 or RESULT2 or both.
Type Compatibility of operands is required for the binary set operation UNION ∪, (also for
INTERSECTION ∩, and SET DIFFERENCE –)
R1(A1, A2, ..., An) and R2(B1, B2, ..., Bn) are type compatible if:
--- they have the same number of attributes, and
--- the domains of corresponding attributes are type compatible (i.e. dom(Ai)=dom(Bi) for i=1, 2, ..., n).
The resulting relation for R1∪R2 (also for R1∩R2, or R1–R2, see next slides) has the same
attribute names as the first operand relation R1 (by convention).
INTERSECTION is denoted by ∩
The result of the operation R ∩ S, is a relation that includes all tuples that are in both R and S
The attribute names in the result will be the same as the attribute names in R
The two operand relations R and S must be “type compatible”
R ∪ (S ∪ T) = (R ∪ S) ∪ T
(R ∩ S) ∩ T = R ∩ (S ∩ T)
The minus operation is not commutative; that is, in general
R–S≠S–R
The sequence of CARTESIAN PRODECT followed by SELECT is used quite commonly to identify
and select related tuples from two relations
A special operation, called JOIN combines this sequence into a single operation
This operation is very important for any relational database with more than a single relation,
because it allows us combine related tuples from various relations
The general form of a join operation on two relations R(A1, A2, . . ., An) and S(B1, B2, . . ., Bm)
is: R <join condition>S
where R and S can be any relations that result from general relational algebra expressions.
Example: Suppose that we want to retrieve the name of the manager of each department.
To get the manager’s name, we need to combine each DEPARTMENT tuple with the EMPLOYEE
tuple whose SSN value matches the MGRSSN value in the department tuple.
We do this by using the join operation.
DEPT_MGR ← DEPARTMENT MGRSSN=SSN EMPLOYEE
MGRSSN=SSN is the join condition Combines each department record with the employee who
manages the department
The join condition can also be specified as [Link]= [Link]
EQUIJOIN Operation
The most common use of join involves join conditions with equality comparisons only
Such a join, where the only comparison operator used is =, is called an EQUIJOIN.
In the result of an EQUIJOIN we always have one or more pairs of attributes (whose names need not be
identical) that have identical values in every tuple.
Another variation of JOIN called NATURAL JOIN — denoted by * — was created to get rid of the second
(superfluous) attribute in an EQUIJOIN condition.
because one of each pair of attributes with identical values is superfluous
The standard definition of natural join requires that the two join attributes, or each pair of
corresponding join attributes, have the same name in both relations
If this is not the case, a renaming operation is applied first.
DIVISION Operation
In NATURAL JOIN and EQUIJOIN, tuples without a matching (or related) tuple are eliminated from the
join result
Tuples with null in the join attributes are also eliminated
This amounts to loss of information.
A set of operations, called OUTER joins, can be used when we want to keep all the tuples in R, or all
those in S, or all those in both relations in the result of the join, regardless of whether or not they have
matching tuples in the other relation.
The left outer join operation keeps every tuple in the first or left relation R in R S; if no matching tuple is
found in S, then the attributes of S in the join result are filled or “padded” with null values.
A similar operation, right outer join, keeps every tuple in the second or right relation S in the
result of R S.
A third operation, full outer join, denoted by keeps all tuples in both the left and the right
relations when no matching tuples are found, padding them with null values as needed.
A type of request that cannot be expressed in the basic relational algebra is to specify
mathematical aggregate functions on collections of values from the database.
Examples of such functions include retrieving the average or total salary of all employees or the
total number of employee tuples.
These functions are used in simple statistical queries that summarize information from the
database tuples.
Common functions applied to collections of numeric values include
SUM, AVERAGE, MAXIMUM, and MINIMUM.
TheAggregate
Use of the COUNT function is used
Functional for counting
operation ℱ tuples or values.
ℱMAX Salary (EMPLOYEE) retrieves the maximum salary value from the EMPLOYEE relation
ℱMIN Salary (EMPLOYEE) retrieves the minimum Salary value from the EMPLOYEE relation
ℱSUM Salary (EMPLOYEE) retrieves the sum of the Salary from the EMPLOYEE relation
ℱCOUNT SSN, AVERAGE Salary (EMPLOYEE) computes the count (number) of employees and
their average salary
Note: count just counts the number of rows, without removing duplicates
Relational Calculus
➲ A relational calculus expression creates a new relation, which is specified in terms of variables that
range over rows of the stored database relations (in tuple calculus) or over columns of the stored
relations (in domain calculus).
➲ In a calculus expression, there is no order of operations to specify how to retrieve the query
result—a calculus expression specifies only what information the result should contain.
This is the main distinguishing feature between relational algebra and relational calculus.
The result of such a query is the set of all tuples t that satisfy COND (t).
➲ Example: To find the first and last names of all employees whose salary is above $50,000, we can
write the following tuple calculus expression:
{[Link], [Link] | EMPLOYEE(t) AND [Link]>50000}
➲ The condition EMPLOYEE(t) specifies that the range relation of tuple variable t is EMPLOYEE.
➲ The first and last name (PROJECTION π FNAME, LNAME) of each EMPLOYEE tuple t that satisfies the
condition [Link]>50000 (SELECTION σ SALARY >50000) will be retrieved.
➲ In query above, using the expression not(PROJECT(x)) inside the universally quantified formula
evaluates to true all tuples x that are not in the PROJECT relation.
Then we exclude the tuples we are not interested in from R itself. The expression
not([Link]=5) evaluates to true all tuples x that are in the project relation but are not
controlled by department 5.
➲ Finally, we specify a condition that must hold on all the remaining tuples in R.
( (∃ w)(WORKS_ON(w) and [Link]=[Link] and [Link]=[Link])
SQL
SQL is a standard language for accessing and manipulating databases.
SELECT * Example
The following SQL stament selects all the columns from the "Customers" table:
Example
SELECT * FROM Customers;
Example
SELECT DISTINCT City FROM Customers;
The SQL WHERE Clause
The WHERE clause is used to extract only those records that fulfill a specified criterion.
SQL WHERE Syntax
SELECT column_name,column_name
FROM table_name
WHERE column_name operator value
WHERE Clause Example
The following SQL statement selects all the customers from the country "Mexico", in the "Customers"
table:
Example
SELECT * FROM Customers
WHERE City='Kolkata';
Example
SELECT * FROM Customers
WHERE Country='India'
AND City='Kolkata';
OR Operator Example
The following SQL statement selects all customers from the city "Berlin" OR "München", in the
"Customers" table:
Example
The following SQL statement selects all customers from the country "Germany" AND the city must be
equal to "Berlin" OR "München", in the "Customers" table:
Example
The ORDER BY keyword is used to sort the result-set by one or more columns.
The ORDER BY keyword sorts the records in ascending order by default. To sort the records in a
descending order, you can use the DESC keyword.
ORDER BY Example
The following SQL stament selects all customers from the "Customers" table, sorted by the "Country"
column:
Example
The following SQL stament selects all customers from the "Customers" table, sorted DESCENDING by the
"Country" column:
Example
SELECT * FROM Customers
ORDER BY Country DESC
The first form does not specify the column names where the data will be inserted, only their values:
INSERT INTO table_name
VALUES (value1,value2,value3,...)
The second form specifies both the column names and the values to be inserted:
INSERT INTO table_name (column1,column2,column3,...)
VALUES (value1,value2,value3,...)
Be careful when updating records. If we had omitted the WHERE clause in the example above, like this:
UPDATE Customers
SET ContactName=’BBB’, City='Delhi';
Note: Notice the WHERE clause in the DELETE syntax. The WHERE clause specifies which record or
records that should be deleted. If you omit the WHERE clause, all records will be deleted!
It is possible to delete all rows in a table without deleting the table. This means that the table structure,
attributes, and indexes will be intact:
or
Example
SELECT *
FROM Persons
WHERE ROWNUM <=5;
SELECT column_name(s)
FROM table_name
WHERE column_name LIKE pattern
The following SQL statement selects the customers in cities starting with an "s" from the "Customers"
table:
Example
SELECT * FROM Customers
WHERE City LIKE 'K%';
The "%" sign can be used to define wildcards (missing letters in the pattern) both before and after the
pattern.
The following SQL statement selects the customers in a city ending with an "s" from the "Customers"
table:
Example
SELECT * FROM Customers
WHERE City LIKE '%s';
The following SQL statement selects the customers in a country containing the pattern "land" from the
"Customers" table:
Example
SELECT * FROM Customers
WHERE Country LIKE '%land%';
Using the NOT keyword allows you to select records that does NOT match the pattern.
The following SQL statement selects the customers in a country NOT containing the pattern "land" from
the "Customers" table:
Example
SELECT * FROM Customers
WHERE Country NOT LIKE '%land%';
The IN Operator
SQL IN Syntax
SELECT column_name(s)
FROM table_name
WHERE column_name IN (value1,value2,...)
IN Operator Example
The following SQL statement selects the companies in the cities "Paris" or "London" from the
"Customers" table:
Example
SELECT * FROM Customers
WHERE City IN ('Kolkata','Delhi');
The BETWEEN operator selects a range of data between two values. The values can be numbers, text, or
dates.
The following SQL statement selects the customers with "CustomerName" alphabetically between
"ABD" and "JKL" from the "Customers" table:
Example
SELECT * FROM Customers
WHERE CustomerName
BETWEEN ‘ABD' AND 'JKL';
To display the companies outside the range in the previous example, use NOT BETWEEN:
Example
SQL Alias
You can give a table or a column another name by using an alias. This can be a good thing to do if you
have very long or complex table names or column names.
Alias Example
Using the "Customers" and "Orders" tables, we give the table aliases of "c" and "o" respectively.
The following SQL statement selects all the orders from the customer "ABD":
Example
SELECT [Link], [Link]
FROM Customers AS c,
Orders AS o
WHERE [Link]=ABD';
SQL JOIN
The JOIN keyword is used in an SQL statement to query data from two or more tables, based on a
relationship between certain columns in these tables.
A primary key is a column (or a combination of columns) with a unique value for each row. Each primary
key value must be unique within the table. The purpose is to bind data together, across tables, without
repeating all of the data in every table.
Before we continue with examples, we will list the types of JOIN you can use, and the differences
between them.
JOIN: Return rows when there is at least one match in both tables
LEFT JOIN: Return all rows from the left table, even if there are no matches in the right table
RIGHT JOIN: Return all rows from the right table, even if there are no matches in the left table
FULL JOIN: Return rows when there is a match in one of the tables
The INNER JOIN keyword returns rows when there is at least one match in both tables.
The INNER JOIN keyword returns rows when there is at least one match in both tables. If there are rows
in "Persons" that do not have matches in "Orders", those rows will NOT be listed.
The LEFT JOIN keyword returns all rows from the left table (table_name1), even if there are no matches
in the right table (table_name2).
1 77895 3
2 44678 3
3 22456 1
4 24562 1
5 34764 15
Now we want to list all the persons and their orders - if any, from the tables above.
Svendson Tove
The LEFT JOIN keyword returns all the rows from the left table (Persons), even if there are no matches in
the right table (Orders).
The RIGHT JOIN keyword returns all the rows from the right table (table_name2), even if there are no
matches in the left table (table_name1).
1 77895 3
2 44678 3
3 22456 1
4 24562 1
5 34764 15
Now we want to list all the orders with containing persons - if any, from the tables above.
34764
The RIGHT JOIN keyword returns all the rows from the right table (Orders), even if there are no matches
in the left table (Persons).
The FULL JOIN keyword return rows when there is a match in one of the tables.
1 77895 3
2 44678 3
3 22456 1
4 24562 1
5 34764 15
Now we want to list all the persons and their orders, and all the orders with their persons.
Svendson Tove
34764
The FULL JOIN keyword returns all the rows from the left table (Persons), and all the rows from the right
table (Orders). If there are rows in "Persons" that do not have matches in "Orders", or if there are rows
in "Orders" that do not have matches in "Persons", those rows will be listed as well.
The UNION operator is used to combine the result-set of two or more SELECT statements.
Notice that each SELECT statement within the UNION must have the same number of columns. The
columns must also have similar data types. Also, the columns in each SELECT statement must be in the
same order.
Note: The UNION operator selects only distinct values by default. To allow duplicate values, use UNION
ALL.
PS: The column names in the result-set of a UNION are always equal to the column names in the first
SELECT statement in the UNION.
"Employees_Norway":
E_ID E_Name
01 Hansen, Ola
02 Svendson, Tove
03 Svendson, Stephen
04 Pettersen, Kari
"Employees_USA":
E_ID E_Name
01 Turner, Sally
02 Kent, Clark
03 Svendson, Stephen
04 Scott, Stephen
Now we want to list all the different employees in Norway and USA.
E_Name
Hansen, Ola
Svendson, Tove
Svendson, Stephen
Pettersen, Kari
Turner, Sally
Kent, Clark
Scott, Stephen
Note: This command cannot be used to list all employees in Norway and USA. In the example above we
have two employees with equal names, and only one of them will be listed. The UNION command
selects only distinct values.
Result
E_Name
Hansen, Ola
Svendson, Tove
Svendson, Stephen
Pettersen, Kari
Turner, Sally
Kent, Clark
Svendson, Stephen
Scott, Stephen
The SELECT INTO statement selects data from one table and inserts it into a different table.
The SELECT INTO statement is most often used to create backup copies of tables.
SELECT *
INTO new_table_name [IN externaldatabase]
FROM old_tablename
Or we can select only the columns we want into the new table:
SELECT column_name(s)
INTO new_table_name [IN externaldatabase]
FROM old_tablename
Make a Backup Copy - Now we want to make an exact copy of the data in our "Persons" table.
SELECT *
INTO Persons_Backup
FROM Persons
We can also use the IN clause to copy the table into another database:
SELECT *
INTO Persons_Backup IN '[Link]'
FROM Persons
We can also copy only a few fields into the new table:
SELECT LastName,FirstName
INTO Persons_Backup
FROM Persons
The following SQL statement creates a "Persons_Backup" table with only the persons who lives in the
city "Sandnes":
SELECT LastName,Firstname
INTO Persons_Backup
FROM Persons
WHERE City='Sandnes'
The following example creates a "Persons_Order_Backup" table contains data from the two tables
"Persons" and "Orders":
SELECT [Link],[Link]
INTO Persons_Order_Backup
FROM Persons
INNER JOIN Orders
ON Persons.P_Id=Orders.P_Id
column_name1 data_type,
column_name2 data_type,
column_name3 data_type,
....
)
The data type specifies what type of data the column can hold. For a complete reference of all the data
types available in MS Access, MySQL, and SQL Server, go to our complete Data Types reference.
Now we want to create a table called "Persons" that contains five columns: P_Id, LastName, FirstName,
Address, and City.
The P_Id column is of type int and will hold a number. The LastName, FirstName, Address, and City
columns are of type varchar with a maximum length of 255 characters.
The empty table can be filled with data with the INSERT INTO statement.
Integrity Constraints
Integrity constraints provide a way of ensuring that changes made to the database by authorized users
do not result in a loss of data consistency.
Data Integrity
Integrity ensures that the data in a database is both accurate and complete, in other words, that the
data makes sense. There are at least five different types of integrity that needs to be considered:
1. Domain constraints
2. Entity integrity
3. Column constraints
4. User-defined integrity constraints
5. Referential integrity
Domain Constraints
A domain is defined as the set of all unique values permitted for an attribute. For example, a domain of
date is the set of all possible valid dates, a domain of integer is all possible whole numbers, a domain of
day-of-week is Monday, Tuesday ... Sunday.
This in effect is defining rules for a particular attribute. If it is determined that an attribute is a date then
it should be implemented in the database to prevent invalid dates being entered.
A classic example of this is where the data from a legacy system is loaded into a newly designed
database. The new system is well designed. Columns that hold dates are defined as such whereas, in the
old system, they were held as character strings. Much data is rejected because of invalid dates, eg 30
February 2000.
If the system supports domain constraints then this invalid data would not have stored in the first place.
That is, the integrity of the database is being preserved.
Entity Integrity
A requirement of E F Codd in his seminal paper is that a primary key of an entity, or any part of it, can
never take a null value.
Oracle, and most other relational database management systems, will enforce this.
Column Constraints
During the data analysis phase, business rules will identify any column constraints. For example, a salary
cannot be negative; an employee number must be in the range 1000 - 2000, etc.
Oracle, and some other RDBMSs, will allow storage of these rules within the database itself. That is, in a
central data dictionary.
Referential Integrity
This is the most common type of integrity constraint. This is used to manage the relationships between
primary and foreign keys.
Let's assume the department and employee entities have been implemented as tables in a relational
database system.
When entering a new employee, the department in which they work needs to be specified. Department
number is the foreign key in the employee table and the primary key in the department table.
In order to preserve the integrity of the data in the database there are a set of rules that need to be
observed:
If inserting an employee in the employee table, the insertion should only be allowed if their
department number exists in the department table
If deleting a department in the department table, the deletion should only be allowed if there
are no employees working in that department
If changing the value of a department number in the department table, the update should only
be allowed if there are no employees working in the department whose number is being
changed
If changing the value of a department number in the employee table, the update should only be
allowed if the new value exists in the department table
If any of the above is allowed to happen then we have data in an inconsistent state. The integrity of the
data is compromised - the data does not make sense.
Oracle, and other relational database management systems, will allow enforcement of referential
integrity rules. If implemented, and a user tries to break any of the rules, an error message will be given
and the change will not take place.
SQL Constraints
Constraints are used to limit the type of data that can go into a table.
Constraints can be specified when a table is created (with the CREATE TABLE statement) or after the
table is created (with the ALTER TABLE statement).
NOT NULL
UNIQUE
PRIMARY KEY
FOREIGN KEY
CHECK
DEFAULT
The NOT NULL constraint enforces a column to NOT accept NULL values.
The NOT NULL constraint enforces a field to always contain a value. This means that you cannot insert a
new record, or update a record without adding a value to this field.
The following SQL enforces the "P_Id" column and the "LastName" column to not accept NULL values:
CREATE TABLE Persons
(
P_Id number NOT NULL,
LastName varchar(255) NOT NULL,
FirstName varchar(255),
Address varchar(255),
City varchar(255)
)
The following SQL creates a UNIQUE constraint on the "P_Id" column when the "Persons" table is
created:
Address varchar(255),
City varchar(255),
CONSTRAINT uc_PersonID UNIQUE (P_Id,LastName)
)
To create a UNIQUE constraint on the "P_Id" column when the table is already created, use the
following SQL:
To allow naming of a UNIQUE constraint, and for defining a UNIQUE constraint on multiple columns, use
the following SQL syntax:
The PRIMARY KEY constraint uniquely identifies each record in a database table.
Primary keys must contain unique values.
A primary key column cannot contain NULL values.
Each table should have a primary key, and each table can have only ONE primary key.
The following SQL creates a PRIMARY KEY on the "P_Id" column when the "Persons" table is created:
City varchar(255)
)
To allow naming of a PRIMARY KEY constraint, and for defining a PRIMARY KEY constraint on multiple
columns, use the following SQL syntax:
Note: In the example above there is only ONE PRIMARY KEY (pk_PersonID). However, the value of the
pk_PersonID is made up of two columns (P_Id and LastName).
To create a PRIMARY KEY constraint on the "P_Id" column when the table is already created, use the
following SQL:
ALTER TABLE Persons
ADD PRIMARY KEY (P_Id)
To allow naming of a PRIMARY KEY constraint, and for defining a PRIMARY KEY constraint on multiple
columns, use the following SQL syntax:
ALTER TABLE Persons
ADD CONSTRAINT pk_PersonID PRIMARY KEY (P_Id,LastName)
Note: If you use the ALTER TABLE statement to add a primary key, the primary key column(s) must
already have been declared to not contain NULL values (when the table was first created).
Let's illustrate the foreign key with an example. Look at the following two tables:
Note that the "P_Id" column in the "Orders" table points to the "P_Id" column in the "Persons" table.
The "P_Id" column in the "Persons" table is the PRIMARY KEY in the "Persons" table.
The "P_Id" column in the "Orders" table is a FOREIGN KEY in the "Orders" table.
The FOREIGN KEY constraint is used to prevent actions that would destroy links between tables.
The FOREIGN KEY constraint also prevents invalid data from being inserted into the foreign key column,
because it has to be one of the values contained in the table it points to.
The following SQL creates a FOREIGN KEY on the "P_Id" column when the "Orders" table is created:
To allow naming of a FOREIGN KEY constraint, and for defining a FOREIGN KEY constraint on multiple
columns, use the following SQL syntax:
To create a FOREIGN KEY constraint on the "P_Id" column when the "Orders" table is already created,
use the following SQL:
To allow naming of a FOREIGN KEY constraint, and for defining a FOREIGN KEY constraint on multiple
columns, use the following SQL syntax:
The CHECK constraint is used to limit the value range that can be placed in a column.
If you define a CHECK constraint on a single column it allows only certain values for this column.
If you define a CHECK constraint on a table it can limit the values in certain columns based on values in
other columns in the row.
The following SQL creates a CHECK constraint on the "P_Id" column when the "Persons" table is created.
The CHECK constraint specifies that the column "P_Id" must only include integers greater than 0.
To create a CHECK constraint on the "P_Id" column when the table is already created, use the following
SQL:
To allow naming of a CHECK constraint, and for defining a CHECK constraint on multiple columns, use
the following SQL syntax:
The following SQL creates a DEFAULT constraint on the "City" column when the "Persons" table is
created:
The DEFAULT constraint can also be used to insert system values, by using functions like GETDATE():
To create a DEFAULT constraint on the "City" column when the table is already created, use the
following SQL:
Indexes allow the database application to find data fast; without reading the whole table.
Indexes
An index can be created in a table to find data more quickly and efficiently.
The users cannot see the indexes, they are just used to speed up searches/queries.
Note: Updating a table with indexes takes more time than updating a table without (because the
indexes also need an update). So you should only create indexes on columns (and tables) that will be
frequently searched against.
Note: The syntax for creating indexes varies amongst different databases. Therefore: Check the syntax
for creating indexes in your database.
The SQL statement below creates an index named "PIndex" on the "LastName" column in the "Persons"
table:
If you want to create an index on a combination of columns, you can list the column names within the
parentheses, separated by commas:
What if we only want to delete the data inside the table, and not the table itself?
The ALTER TABLE statement is used to add, delete, or modify columns in an existing table.
My SQL / Oracle:
Notice that the new column, "DateOfBirth", is of type date and is going to hold a date. The data type
specifies what type of data the column can hold.
Now we want to change the data type of the column named "DateOfBirth" in the "Persons" table.
We use the following SQL statement:
ALTER TABLE Persons
ALTER COLUMN DateOfBirth year
Notice that the "DateOfBirth" column is now of type year and is going to hold a year in a two-digit or
four-digit format.
Next, we want to delete the column named "DateOfBirth" in the "Persons" table.
A view contains rows and columns, just like a real table. The fields in a view are fields from one or more
real tables in the database.
You can add SQL functions, WHERE, and JOIN statements to a view and present the data as if the data
were coming from one single table.
Note: A view always shows up-to-date data! The database engine recreates the data, using the view's
SQL statement, every time a user queries a view.
If you have the Northwind database you can see that it has several views installed by default.
The view "Current Product List" lists all active products (products that are not discontinued) from the
"Products" table. The view is created with the following SQL:
CREATE VIEW [Current Product List] AS
SELECT ProductID,ProductName
FROM Products
WHERE Discontinued=No
We can query the view above as follows:
SELECT * FROM [Current Product List]
Another view in the Northwind sample database selects every product in the "Products" table with a
unit price higher than the average unit price:
We can also add a condition to the query. Now we want to see the total sale only for the category
"Beverages":
Now we want to add the "Category" column to the "Current Product List" view. We will update the view
with the following SQL:
SQL aggregate functions return a single value, calculated from values in a column.
SQL scalar functions return a single value, based on the input value.
Tip: The aggregate functions and the scalar functions will be explained in details in the next chapters.
950
Now we want to find the customers that have an OrderPrice value higher than the average OrderPrice
value.
Hansen
Nilsen
Jensen
The COUNT() function returns the number of rows that matches a specified criteria.
The COUNT(column_name) function returns the number of values (NULL values will not be counted) of
the specified column:
The COUNT(DISTINCT column_name) function returns the number of distinct values of the specified
column:
Note: COUNT(DISTINCT) works with ORACLE and Microsoft SQL Server, but not with Microsoft Access.
NumberOfOrders
Now we want to count the number of unique customers in the "Orders" table.
We use the following SQL statement:
SELECT COUNT(DISTINCT Customer) AS NumberOfCustomers FROM Orders
The result-set will look like this:
NumberOfCustomers
3
which is the number of unique customers (Hansen, Nilsen, and Jensen) in the "Orders" table.
The MAX() function returns the largest value of the selected column.
LargestOrderPrice
2000
The MIN() function returns the smallest value of the selected column.
The GROUP BY statement is used in conjunction with the aggregate functions to group the result-set by
one or more columns.
Now we want to find the total sum (total order) of each customer.
Customer SUM(OrderPrice)
Hansen 2000
Nilsen 1700
Jensen 2000
Customer SUM(OrderPrice)
Hansen 5700
Nilsen 5700
Hansen 5700
Hansen 5700
Jensen 5700
Nilsen 5700
Explanation of why the above SELECT statement cannot be used: The SELECT statement above has two
columns specified (Customer and SUM(OrderPrice). The "SUM(OrderPrice)" returns a single value (that
is the total sum of the "OrderPrice" column), while "Customer" returns 6 values (one value for each row
in the "Orders" table). This will therefore not give us the correct result. However, you have seen that the
GROUP BY statement solves this problem.
We can also use the GROUP BY statement on more than one column, like this:
The HAVING clause was added to SQL because the WHERE keyword could not be used with aggregate
functions.
Now we want to find if any of the customers have a total order of less than 2000.
Customer SUM(OrderPrice)
Nilsen 1700
Now we want to find if the customers "Hansen" or "Jensen" have a total order of more than 1500.
Customer SUM(OrderPrice)
Hansen 2000
Jensen 2000
Cursor
A cursor is a control structure that enables traversal over the records in a database. Cursors facilitate
subsequent processing in conjunction with the traversal, such as retrieval, addition and removal of
database records.
A cursor is a temporary work area created in the system memory when a SQL statement is executed. A
cursor contains information on a select statement and the rows of data accessed by it. This temporary
work area is used to store the data retrieved from the database, and manipulate this data. A cursor can
hold more than one row, but can process only one row at a time. The set of rows the cursor holds is
called the active set.
Cursors are used by database programmers to process individual rows returned by database system
queries. Cursors enable manipulation of whole result-sets at once -- a capability that most procedural
programming languages lack. In this scenario, a cursor enables the rows in a result-set to be processed
sequentially.
Implicit cursors:
These are created by default when DML statements like, INSERT, UPDATE, and DELETE statements are
executed. They are also created when a SELECT statement that returns just one row is executed.
Oracle provides few attributes called as implicit cursor attributes to check the status of DML operations.
The cursor attributes available are %FOUND, %NOTFOUND, %ROWCOUNT, and %ISOPEN.
For example, When you execute INSERT, UPDATE, or DELETE statements the cursor attributes tell us
whether any rows are affected and how many have been affected.
When a SELECT... INTO statement is executed in a PL/SQL Block, implicit cursor attributes can be used to
find out whether any row has been returned by the SELECT statement. PL/SQL returns an error when no
data is selected.
The status of the cursor for each of these attributes are defined in the below table.
For Example: Consider the PL/SQL Block that uses implicit cursor attributes as shown below:
Explicit cursors:
They must be created when you are executing a SELECT statement that returns more than one row.
Even though the cursor stores multiple records, only one record can be processed at a time, which is
called as current row. When you fetch a row the current row position moves to next row.
An explicit cursor is defined in the declaration section of the PL/SQL Block. It is created on a SELECT
Statement which returns more than one row. We can provide a suitable name for the cursor.
execution
section of the PL/SQL program.
How to access an Explicit Cursor?
These are the three steps in accessing the cursor.
1) Open the cursor.
2) Fetch the records in the cursor one at a time.
3) Close the cursor.
General Syntax to open a cursor is:
OPEN cursor_name;
General Syntax to fetch records from a cursor is:
FETCH cursor_name INTO record_name;
OR
FETCH cursor_name INTO variable_list;
General Syntax to close a cursor is:
CLOSE cursor_name;
When a cursor is opened, the first row becomes the current row. When the data is fetched it is
copied to the record or variables and the logical pointer moves to the next row and it becomes
the current row. On every fetch statement, the pointer moves to the next row. If you want to
fetch after the last row, the program will throw an error. When there is more than one row in a
cursor we can use loops along with explicit cursor attributes to fetch all the records.
Points to remember while fetching a row:
We can fetch the rows in a cursor to a PL/SQL Record or a list of variables created in the PL/SQL
Block.
If you are fetching a cursor to a PL/SQL Record, the record should have the same structure as
the cursor.
If you are fetching a cursor to a list of variables, the variables should be listed in the same order
in the fetch statement as the columns are present in the cursor.
General Form of using an explicit cursor is:
DECLARE
variables;
records;
create a cursor;
BEGIN
OPEN cursor;
FETCH cursor;
process the records;
CLOSE cursor;
END;
Example 1:
1> DECLARE
2> emp_rec emp_tbl%rowtype;
3> CURSOR emp_cur IS
4> SELECT *
5> FROM
6> WHERE salary > 10;
7> BEGIN
8> OPEN emp_cur;
9> FETCH emp_cur INTO emp_rec;
10> dbms_output.put_line (emp_rec.first_name || ' ' || emp_rec.last_name);
11> CLOSE emp_cur;
12> END;
In the above example, first we are creating a record ‘emp_rec’ of the same structure as of table
‘emp_tbl’ in line no 2. We can also create a record with a cursor by replacing the table name
with the cursor name. Second, we are declaring a cursor ‘emp_cur’ from a select query in line no
3 - 6. Third, we are opening the cursor in the execution section in line no 8. Fourth, we are
fetching the cursor to the record in line no 9. Fifth, we are displaying the first_name and
last_name of the employee in the record emp_rec in line no 10. Sixth, we are closing the cursor
in line no 11.
What are Explicit Cursor Attributes?
Oracle provides some attributes known as Explicit Cursor Attributes to control the data
processing while using cursors. We use these attributes to avoid errors while accessing cursors
through OPEN, FETCH and CLOSE Statements.
When does an error occur while accessing an explicit cursor?
a) When we try to open a cursor which is not closed in the previous operation.
b) When we try to fetch a cursor after the last operation.
These are the attributes available to check the status of an explicit cursor.
Both implicit and explicit cursors have the same functionality, but they differ in the way they are
accessed.
Disadvantages of cursors
Fetching a row from the cursor may result in a network round trip each time. This uses much more
network bandwidth than would ordinarily be needed for the execution of a single SQL statement like
DELETE. Repeated network round trips can severely impact the speed of the operation using the cursor.
Some DBMSs try to reduce this impact by using block fetch. Block fetch implies that multiple rows are
sent together from the server to the client. The client stores a whole block of rows in a local buffer and
retrieves the rows from there until that buffer is exhausted.
Cursors allocate resources on the server, for instance locks, packages, processes, temporary storage, etc.
Stored Procedure
A stored procedure or in simple a proc is a named PL/SQL block which performs one or more specific
task. This is similar to a procedure in other programming languages. A procedure has a header and a
body. The header consists of the name of the procedure and the parameters or variables passed to the
procedure. The body consists or declaration section, execution section and exception section similar to a
general PL/SQL Block. A procedure is similar to an anonymous PL/SQL Block but it is named for repeated
usage.
Stored procedures are typically used for data validation (integrated into the database) or access control
mechanisms. Extensive or complex processing that requires the execution of several SQL statements is
moved into stored procedures, and all applications call the procedures. One can use nested stored
procedures, by executing one stored procedure from within another.
Stored procedures are similar to user-defined functions (UDFs). The major difference is that UDFs can be
used like any other expression within SQL statements, whereas stored procedures must be invoked
using the CALL statement.
Stored procedures may return result sets, i.e. the results of a SELECT statement. Stored procedures may
also contain declared variables for processing data and cursors that allow it to loop through multiple
rows in a table. Stored procedure languages typically include IF, WHILE, LOOP, REPEAT, and CASE
statements, and more. Stored procedures can receive variables, return results or modify variables and
return them, depending on how and where the variable is declared.
procedure_name;
Disadvantages
Stored procedure languages are quite often vendor-specific. If you want to switch to using another
vendor's database, then you have to rewrite your stored procedures. Stored procedure languages from
different vendors have different levels of sophistication; for example, Oracle's PL/SQL has more
languages features and built-in features than Microsoft's T-SQL.
Normalization
Data normalization is a set of rules and techniques concerned with:
It follows a set of rules worked out by E F Codd in 1970. A normalized relational database provides
several benefits:
This table is not in First Normal Form because field Telephone_No containing more than one value. Also
the first record is not having any values. To make the table in the first normal form we need to split the
A103 rows into two and supply the blank value with null. So the resultant table is as follows:-
Here repetitive data cannot be avoided since they represent unique items.
The table is not in 3NF as the Dept_Name is not directly dependent on the EMP_ID but on the Dept_No
and the Emp_Id is associated to the Dept_No because each employee should belong to some
department. So the table is split into two following relations to make it in 3NF.
FDs :
Emp_ID EmpName
Emp_ID Emp_DOB
Emp_ID Emp_Add
Emp_ID Dept_No Transitive dependency
Dept_no Dept_Name
R2 (Dept_No,Dept_Name)
Dept_No Dept_Name
11234 Accounts
12344 IT
Anomalies can also occur where a relation contains several candidate keys where:
The keys contain more than one attribute (they are composite keys).
No two buildings on any of the university campuses have the same name, thus
ROOM/BLDG ----->CAMPUS. As the determinant is not a candidate key this table is NOT in Boyce-Codd
normal form.
R2 (room/bldg, campus)
campus room/bldg
A A212
B O305
A A102
1 Programming Golf
1 Programming Bowling
1 Analysis Golf
1 Analysis Bowling
2 Analysis Golf
2 Analysis Gardening
2 Management Golf
2 Management
Gardening
This table is difficult to maintain since adding a new hobby requires multiple new rows corresponding to
each skill. This problem is created by the pair of multi-valued dependencies EMPLOYEE#--->SKILLS and
EMPLOYEE#--->HOBBIES.
A much better alternative would be to decompose INFO into two relations:
Skills (employee#, skill)
E_no skills
1 Programming
1 Analysis
2 Analysis
2
Management
Hobbies (employee#, hobby)
E_no hobbies
1 Golf
1 Bowling
2 Golf
2
Gardening
Fourth normal form (4NF) is a normal form used in database normalization. Introduced by Ronald Fagin
in 1977, 4NF is the next level of normalization after Boyce–Codd normal form (BCNF). Whereas the
second, third, and Boyce–Codd normal forms are concerned with functional dependencies, 4NF is
concerned with a more general type of dependency known as a multivalued dependency. A table is in
4NF if and only if, for every one of its non-trivial multivalued dependencies X →→ Y, X is a superkey—
that is, X is either a candidate key or a superset thereof.
Multivalued dependencies
If the column headings in a relational database table are divided into three disjoint groupings X, Y, and Z,
then, in the context of a particular row, we can refer to the data beneath each group of headings as x, y,
and z respectively. A multivalued dependency X →→ Y signifies that if we choose any x actually
occurring in the table (call this choice xc), and compile a list of all the xcyz combinations that occur in the
table, we will find that xc is associated with the same y entries regardless of z.
A trivial multivalued dependency X →→ Y is one in which Y consists of all columns belonging to X. That
is, a subset of attributes in a table has a trivial multivalued dependency on the remaining subset of
attributes.
Example
Each row indicates that a given restaurant can deliver a given variety of pizza to a given area.
The table has no non-key attributes because its only key is {Restaurant, Pizza Variety, Delivery Area}.
Therefore it meets all normal forms up to BCNF. If we assume, however, that pizza varieties offered by a
restaurant are not affected by delivery area, then it does not meet 4NF. The problem is that the table
features two non-trivial multivalued dependencies on the {Restaurant} attribute (which is not a
superkey). The dependencies are:
These non-trivial multivalued dependencies on a non-superkey reflect the fact that the varieties of pizza
a restaurant offers are independent from the areas to which the restaurant delivers. This state of affairs
leads to redundancy in the table: for example, we are told three times that A1 Pizza offers Stuffed Crust,
and if A1 Pizza starts producing Cheese Crust pizzas then we will need to add multiple rows, one for each
of A1 Pizza's delivery areas. There is, moreover, nothing to prevent us from doing this incorrectly: we
might add Cheese Crust rows for all but one of A1 Pizza's delivery areas, thereby failing to respect the
multivalued dependency {Restaurant} →→ {Pizza Variety}.
To eliminate the possibility of these anomalies, we must place the facts about varieties offered into a
different table from the facts about delivery areas, yielding two tables that are both in 4NF:
Varieties By Restaurant
Restaurant Pizza Variety
A1 Pizza Thick Crust
A1 Pizza Stuffed Crust
Elite Pizza Thin Crust
Elite Pizza Stuffed Crust
Domino's Pizza Thick Crust
Domino's Pizza Thin Crust
In contrast, if the pizza varieties offered by a restaurant sometimes did legitimately vary from one
delivery area to another, the original three-column table would satisfy 4NF.
For any one you must know the other two (cyclical).
Problem:- The problem with the above table structure is that if L starts to sell Jeans then how many
records must you create to record this fact? The problem is there are pair wise cyclical dependencies in
the primary key. That is, in order to determine the item you must know the buyer and vendor, and to
determine the vendor you must know the buyer and the item, and finally to know the buyer you must
know the vendor and the item.
Solution: - The solution is to break this one table into three tables; Buyer-Vendor, Buyer-Item, and
Vendor-Item. So following tables are in the 5NF.
Buyer-Vendor
buyer vendor
Sally L
Mary L
Sally J
Mary J
Buyer-Item
buyer item
Sally Tshirts
Mary Tshirts
Sally Jeans
Mary Jeans
Sally Sneakers
Vendor-Item
vendor item
L Tshirt
J Jeans
J Sneakers
Note:- There is also one more normal form i.e. 6 NF. A table is in sixth normal form (6NF) or Domain-Key
normal form (DKNF) if it is in 5NF and if all constraints and dependencies that should hold on the relation
can be enforced simply by enforcing the domain constraints and the key constraints specified on the
relation.
A table is said to be in the 5NF if and only if every join dependency in it is implied by the candidate keys.
A join dependency *{A, B, … Z} on R is implied by the candidate key(s) of R if and only if each of A, B, …, Z
is a superkey for R.
Example
The table's predicate is: Products of the type designated by Product Type, made by the brand designated
by Brand, are available from the travelling salesman designated by Travelling Salesman.
In the absence of any rules restricting the valid possible combinations of Travelling Salesman, Brand, and
Product Type, the three-attribute table above is necessary in order to model the situation correctly.
Suppose, however, that the following rule applies: A Travelling Salesman has certain Brands and certain
Product Types in his repertoire. If Brand B is in his repertoire, and Product Type P is in his repertoire, then
(assuming Brand B makes Product Type P), the Travelling Salesman must offer products of Product Type
P made by Brand B.
Note how this setup helps to remove redundancy. Suppose that Jack Schneider starts selling Robusto's
products. In the previous setup we would have to add two new entries since Jack Schneider is able to
sell two Product Types covered by Robusto: Breadboxes and Vacuum Cleaners. With the new setup we
need only add a single entry (in Brands By Travelling Salesman).
Query Optimization
Query Processing
Query Processing: Activities involved in retrieving data from the database.
Aims of QP:
transform query written in high-level language (e.g. SQL), into correct and efficient execution
strategy
expressed in low-level language (implementing RA);
execute the strategy to retrieve required data.
Query Optimization
Query Optimization: Activity of choosing an efficient execution strategy for processing query.
As there are many equivalent transformations of same high-level query, aim of QO is to choose
one that minimizes resource usage.
Generally, reduce total execution time of query.
Problem computationally intractable with large number of relations, so strategy adopted is
reduced to finding near optimum solution.
Transactions
Concurrent execution of user programs is essential for good DBMS performance. Because disk accesses
are frequent, and relatively slow, it is important to keep the cpu humming by working on several user
programs concurrently. A user’s program may carry out many operations on the data retrieved from the
database, but the DBMS is only concerned about what data is read/written from/to the database.
A transaction is the DBMS’s abstract view of a user program: a sequence of reads and writes.
A transaction is a unit of a program execution that accesses and possibly modifies various data objects
(tuples, relations).
DBMS has to maintain the following properties of transactions:
Atomicity: A transaction is an atomic unit of processing, and it either has to be performed in its entirety
or not at all.
Consistency: A successful execution of a transaction must take a consistent database state to a (new)
consistent database state. (; integrity constraints)
Isolation: A transaction must not make its medications visible to other transactions until it is committed,
i.e., each transaction is unaware of other transactions executing concurrently in the system. (;
concurrency control)
Durability: Once a transaction has committed its changes, these changes must never get lost due to
subsequent (system) failures. (; recovery)
Model used for representing database modifications of a transaction:
Problems that can occur for certain transaction schedules without appropriate concurrency control
mechanisms:
Checks for serializability are based on precedence graph that describes dependencies among concurrent
transactions; if the graph has no cycle, then the transactions are serializable.
They can be executed concurrently without affecting each others transaction result.
Dirty-reads and non-repeatable reads are not possible. Furthermore, this mode guarantees serializability
(but does not provide much parallelism).
Checkpoint :
[Checkpoint] is appended to the log when writes of all previously committed transactions are effected
on the database.
Purpose:
Reduce the work required for recovery.
This is a note for myself about how to check whether a schedule is view serializable, conflict serializable,
or not.
There is various resources in the internet about how to do this, but the examples are a bit scattered, so
in this post I just want to make a neat note on how to do it properly with several examples that can
cover many possibilities as well.
T1 | R1(X) R1(X)
T2 | R2(Y) R2(Y) W2(X)
T3 | W3(Y)
——————————————————
Two or more actions are said to be in conflict if:
For our example, we have a conflict on X (T1 reads it and T2 writes it).
We also have a conflict on Y (T2 reads it and T3 writes it).
The easiest way to check for a cyclic transaction, is to paste the schedule to this web app.
But if you are like me, you won’t be satisfied getting answer from someone without understanding how
exactly they come up with the the answer (though sometimes I don’t care either..). So let’s find out how
to do exactly just that.
We have T1 that reads X after T2 writes it, so draw arrow from T2 -> T1
We have T2 that writes X after T1 reads it, so draw arrow from T1 -> T2
We also have T3 that writes Y after T2 reads it, so draw arrow from T2 -> T3
We can see that there is a cycle between T1 and T2, so the graph is cyclic, and therefore it is not conflict
serializable.
To check for view serializability of a schedule, you must create all possible combinations of the
Transactions. So let’s say you have three transactions, then you need to check for these combinations:
< T1, T2, T3 >
A Tx reads an initial data in a Schedule, the same Tx also should read the initial data in one of the
transaction combination.
For our example, at least T1 should occur before T2, because T1 reads initial value X. If T2 occurs
before T1, then T1 reads X value after T2 writes. So remove these combinations:
o < T2, T1, T3 >
o < T2, T3, T1 >
o < T3, T2, T1 >
A Tx reads a data after another Tx has written in a Schedule, the same Tx also should read the data after
another Tx has written it in one of the transaction combination.
In our example, T1 reads X after T2 writes, so it means that T2 should occur before T1. But wait
a minute, isn’t it we’ve just said that T1 should occur before T1 on the previous condition?
That’s what a cycle in the graph also caused. We need T1 before T2 and at the same time we
need T2 before T1. Because of this, none of the combinations can satisfy these two conditions,
so it is not view serializable.
A Tx writes the final value for a data in a Schedule, the same Tx should also write the final data in one of
the transaction combination.
Although we’ve said that our schedule is not view serializable, let’s just take a look which
combination satisfy this condition. We know that T2 writes Y and X, but in our schedule the T3
writes the last value of Y. So the combination that satisfy these condition should have T2 occur
before T3. If we are to choose the combinations, keep everything else but remove these:
o < T1, T3, T2 >
o < T3, T1, T2 >
o < T3, T2, T1 >
A Tx reads an initial data in a Schedule, the same Tx also should read the initial data in one of the
transaction combination.
Here, at least T2 must occur first, though it actually doesn’t matter also because no one else
writes X, so we still keep all our Tx combinations.
Department of Information Technology
Nivedita Ray De Sarkar Page
1081081
Database Management System
A Tx reads a data after another Tx has written in a Schedule, the same Tx also should read the data after
another Tx has written it in one of the transaction combination.
Now, this means that T1 must occur after T3 because T1 reads Z after T3 writes it. So we remove
o < T1, T2, T3 >
o < T1, T3, T2 >
o < T2, T1, T3 >
A Tx writes the final value for a data in a Schedule, the same Tx should also write the final data in one of
the transaction combination.
Here, T1 must occur last after T3 or T2 because if not T3 or T2 will overwrites Y that T1 writes in
our schedule. So we remove < T3, T1, T2 >
So the two combinations left satisfy the view serializability this time, they are:
Locking
The most common way in which access to items is controlled is by “locks.” Lock manager is the part of a
DBMS that records, for each item A, whether one or more transactions are reading or writing any part of
A. If so, the manager will forbid another transaction from gaining access to A, provided the type of
access (read or write) could cause a conflict, such as the duplicate selling of an airline seat.
A timestamp is a centrally dispensed number assigned to each transaction in strictly increasing order.
The DBMS guarantees that the final result from competing transactions will appear as if the transactions
had executed serially in timestamp order.
A log file maintains a record of all changes to the database, including the ID of the perpetrating
transaction, a before-image of each modified object.
The log file enables recovery from a failure that loses the memory buffer’s contents but doesn’t corrupt
the database. You scan the log backward and reverse transactions by rewriting their before-images. You
then scan it forward and reverse transactions by rewriting their after-images.
• once you release your first lock, you can’t acquire any more locks
1. Expanding phase(growing phase): locks are acquired and no locks are released.
2. Shrinking phase: locks are released and no locks are acquired.
The strict two-phase locking (S2PL) class of schedules is the intersection of the 2PL class with the class
of schedules possessing the Strictness property.
To comply with the S2PL protocol a transaction needs to comply with 2PL, and release its write
(exclusive) locks only after it has ended, i.e., being either committed or aborted. On the other hand, read
(shared) locks are released regularly during phase 2.
S2PL is a special case of 2PL, i.e., the S2PL class is a proper subclass of 2PL
• Modify 2 Φ locking protocol so that transactions hold all their locks until after they commit.
In terms of the logical operations to be performed on the data, relational tables provide a beautiful
mechanism for all of the three above tasks. The storage of a Database in a computer memory is mainly
concerned with:
1. The need to store a set of tables [each table can be stored as an independent file];
2. The attributes in a table are often accessed together [Why?].
Therefore, it makes sense to store the different attribute values in each record contiguously. For each
record of a table, the attributes MUST be stored in the same sequence [Why?]
3. Since there is no prescribed order in which records must be stored in a table, we may choose the
sequence in which we store the different records of a table. We shall see that this observation is quite
useful.
It is useful to look at how tables are stored on the HD in a little more detail. Each row (called record) of a
table can contain different amount of data [Why?].
Therefore, the record is stored with each subsequent attribute separated by the next by a special ASCII
character called a field separator. Of course, in each block, we may place many records. Each record is
separated from the next, again by another special ASCII character called the record separator. For our
discussion, we shall assume that a record is stored in a block only if there is sufficient space to store it
completely (namely, we do not store part of the record in one block, and the remaining in the next
block). There are some cases where each record of a table can be larger in size than one sector on the
HD, for example, a DB storing street maps, where each map segment is a JPEG image of over 3MBytes.
However, we will ignore such special cases for our discussion.
HEAP files:
The simplest method of storing a DB table is to store all the records of the table in the order in which
they are created, on contiguous blocks, in a large file. Such files are called HEAP files, or a
PILE. We shall examine the storage methods in terms of the operations we need to perform on the
Database.
Operation:
Insert a new record
Method:
The heap file data records the address of the first block, and the file size (in blocks). It is therefore easy
to calculate the last block of the file, which is directly copied into the buffer. The new record is inserted
at the end of the last existing record, and the block is written back to the disk.
Performance:
Very fast -- in the worst case, the DBMS needs to read two blocks of data and write back one block (i.e.
read the last block, if it does not have sufficient empty space, read the next available block and finally,
write the record to this block). If the time for transfer of 1 block = t, then the worst case time required is
2t. In addition, there is a fixed overhead cost associated with updating the size of the file, which sectors
it occupies, etc. -- but this is negligible.
Operation:
Search for a record
Method:
Linear search: (i) The first block to the RAM; (ii) CPU searches each record in the block to match the
search criterion; (iii) If no match is found, copy the next block into RAM and go to (ii). Performance:
Poor. Assume that a table occupies B blocks in size. In the worst case, we will find the data in the last
Block (or not find it at all); worst case time =Bt. On the average, if the searched data is
uniformly distributed, the average search time= Bt/2, which is still very poor.
Operation:
Update a cell in a record
Method:
Linear search: (i) The first block to the RAM; (ii) CPU searches each record in the block to match the
search criterion; (iii) If no match is found, copy the next block into RAM and go to (ii); (iv) else, if a record
in the block matches the criterion, update the record, and write it back.
Performance:
It is easy to see that the worst case time =Bt+t, and the average time= Bt/2+t(assuming that the time to
write a block of data = time to read a block = t).
[Note: it is possible that the update increases the size of the block greater than 1 sector; convince
yourself that this will introduce a constant, and relatively negligible, overhead]
Operation:
Delete a record
Method:
Search the record that is to be deleted (requires linear search).
Performance:
Poor (analysis is same as for “update a record” case).
Problem:
After the record is deleted, the block has some extra (unused) space. What to do about the
unused space?
Different DBMS may use a different policy about this. Typical policies may be: (a) Delete the space and
rewrite the block. At periodic intervals (few days), read the entire file into a large
RAM buffer and write it back into a new file. (b) For each deleted record, instead of actually deleting the
record, just use an extra bit per record, which is the 'RECORD_DELETED' marker.
IF: RECORD_DELETED == 1, the record is ignored in all searches.
Approach (b) may have some advantage in terms of recovery of data deleted by mistake, or in cases
where a deleted record must be recovered. In any case, after fixed intervals, the file is updated just as in
case (a), to recover wasted space. Heaps are quite inefficient when we need to search for data in large
database tables.
Sorted Files:
The simplest method to make efficient searching is to organize the table in a Sorted File. The entire file is
sorted by increasing value of one of the attributes, called the ordering attribute
(or ordering field). If the ordering field) is also a key attribute, it is called the ordering key. Figure shows
an example of a table sorted by the "SSN" attribute.
Operation:
Search for a record with a given value of the ordering attribute.
Method:
Binary search
For a file of b blocks, look in the block number |b/2|. If the searched value is in this block, we are done;
If the searched value is larger than the last ordering attribute value in this block, repeat the binary
search in blocks between ( ⎡b/2⎤+ 1) and b; otherwise repeat the search in blocks between 1 and ( ⎡b/2⎤-
1).
Performance:
Let us consider the worst case time spent for a sorted file of size b blocks. In each iteration, we need to
check one block (= t units of time); Worst case we do not find the record in until the last remaining
block. In the 2nd iteration, at most b/2 blocks will remain to be searched; in the 3rd iteration, (b/2)/2 =
b/2
2blocks remain. After i-iterations, b/2i-1 blocks remain to be searched. Of course, if only one block
remains to be searched, no further sub-division of the list of blocks is required, namely, we stop when
b/2i-1= 1, which gives: b = 2i-1, or i = (1 + lg2b). Total number of blocks searched after i-iterations is i,
and the total time is t(1 + lg2b).
Let’s compare the relative worst-case search time between heap storage and sorted file storage. Let’s
assume a file occupies 8192 blocks. Stored as a heap file, the worst case time is 8192t; stored as a sorted
file, the worst case time is t( 1 + lg28192) = t(1+ 13) = 14t. The search is 8192/14 ≈585 times faster!
Operation:
Delete a record/update a value in a record (other than the ordering attribute value)
Method:
First, search the record to be deleted (using binary search).
After the record is deleted, we still need to perform the usual file compacting once every few days.
Performance:
Quite fast. The worst case time is the same as that for search, plus ‘t’ units of time to write
the modified block.
Operation:
Insert a new record/update a record where the ordering attribute is changed
Method:
If we insert the new record in its correct ordered position, we would need to shift every subsequent
record in the table! This is of course very time consuming. Consider the case that a new record is
inserted for ‘SSN=1003’ in our example of figure 7. Since this record goes into the first block, the record
for ‘SSN=1008’ will need to be removed to make space for the new record. The ‘SSN=1008’ record must
be shifted to Block-2, but this may require ‘SSN=1086’ to be shifted out of Block 2, and so on. Total
worst-case time spent is approximately ≈Search for the insertion point ≈t(1+ lg2b) + (Read and Write
each block = 2bt)
Performance: Very inefficient.
Solution:
Overflow file. The table is stored as a sorted file. However, when a new record is inserted, it is stored in
a different file, called an overflow file (thus each table has its own main file and its own overflow file).
The main file is sorted, while the overflow file is a heap file.
Periodically, the overflow file is merged with the ordered file. Thus every insert operation is very fast
(constant time) -- just add the record at the end of the overflow file.
For other operations, we first use binary searching in the regular file. If we cannot find a matching
record, we then perform a linear search in the overflow file. Note that since the overflow file is not
ordered, it requires a linear search.
ed for the hashed table. For simplicity, let’s assume that the table is stored in blocks [1, ..., MxS]. The
first bucket occupies blocks (1, ... S), the 2nd bucket is stored from blocks (S+1, ..., 2S), etc.
Operation:
Insert a record.
Procedure:
1. Using the Hashing attribute value, calculate hashing key, K;
2. Calculate hashing address, [in our example, (K mod M)];
3. Place the record in first sector with enough space, starting from address Sx(K mod M)
Performance:
In general, the maximum time that it will take to do an insertion will be tS +t( tS to read the S blocks in
this bucket, and t units to write the block where the record is inserted).
Operation:
Search for a record
Procedure:
Compute the hashing address, and look in the blocks starting from the address. If all blocks of this
bucket are full, look into the blocks for the next bucket until an empty block is encountered
Performance:
Quite fast (Almost constant time)
Operation:
Deletion of a record.
Procedure, Performance:
Similar to that for Searching for a record.
All the above are techniques used by the DBMS's to physically store each of the tables in a Database. In
all of the above, we assume that each table is arranged according to the value of a sorting key. However,
in practical use, a record may be searched by the value of some other key. This is a very common
requirement, and often it is essential to get a fast data operation on a table based on more than one
searching attribute. DBMS's provide another mechanism to quickly search for records, at the cost of
using some extra disk space: INDEXES.
Basics of indexes
The main idea of an index is similar to the physical index. Relational DB’s can have two different types of
indexes; we shall study each type below Primary Indexes.
A primary index file is an index that is constructed using the sorting attribute of the main file.
In the primary index file, we store the value of the sorting attribute of the main file in the first record in
each block of the main file. Corresponding to each entry, we store the address of the block. Notice that
the index file is much smaller (i.e. has very few blocks) compared to the main file.
Operation:
Search for a record in the main file
Procedure:
First perform a binary search on the primary index file, to find the address of the corresponding data.
Suppose we are searching for ‘SSN=1208’. From the primary index file we can tell that if a record for
‘SSN=1208’ exists, it must be in Block 3. We then fetch the contents of Block3 from the HD, and search in
this block for the record for ‘SSN=1208’.
Performance:
Very fast! The worst case search time to search for the address in the primary index file of size P blocks
is ≈t(1 + lg2P); an additional t units is needed to fetch the correct block from the main file.
Thus the total time ≈t(2 + lg2b)
Similarly you can analyze the performance of other operations.
Problem:
The Primary Index will work only if the main file is a sorted file. What if we want to insert a new record
into the main file? Recall that inserting a record in a sorted file is time-consuming, therefore records are
initially stored in an overflow file. Earlier we studied how simple heap files can be used for overflow
files; in reality, most DBMS’s will use a special structure in storing such records, called a balanced tree
(or B-tree). However, in the following solution to our problem, we shall just assume that overflow files
are heaps.
Solution:
The new records are inserted into an unordered (heap) in the overflow file for the table.
Periodically, the ordered and overflow tables are merged together; at this time, the main file is sorted
again, and the Primary Index file is accordingly updated.
Thus any search for a record first looks for the INDEX file, and searches for the record in the indicated
Block.
If the record is not found, then a linear search is conducted in the overflow file for a possible match.
Secondary Indexes
It is possible to construct indexes for any attribute of a table. Of course, since the main file can only be
sorted by one field (usually the primary key field), therefore the contents of secondary indexes is
different from that of primary indexes.
Note that the data file was sorted using SSN. Obviously, we cannot order the records of the main file
using our secondary index key attribute, ‘Lname’. The Secondary Index is a two column file storing the
block address of every secondary index attribute value of the table.
You can also create Secondary Indexes for attributes that are non-unique. The idea is similar, though the
storage details are slightly different (for example, if there are three persons with
‘Lname=Wong’, then the three records may be located in different blocks -- thus the secondary
index file must record multiple entries for the same ordering attribute (Lname), one for each record --
and the search algorithm must account for each occurrence. Of course, you can create as many
secondary indexes as you wish.