0% found this document useful (0 votes)
23 views103 pages

Data vs. Information in Databases

Uploaded by

hanishsagarm
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
23 views103 pages

Data vs. Information in Databases

Uploaded by

hanishsagarm
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

IBDP CS Database Chapter – A1

Database : Basic
Concepts
A 1.1Data and information
Data is meaningless. To be useful, data must be interpreted to produce
information.

Data Processing Information


A 1.2Key difference:
@ Data usually refers to raw data, or unprocessed data.
@ It is the basic form of data, data that hasn’t been analyzed or processed in any
manner.
@ Once the data is analyzed, it is considered as information.
@ Information is "knowledge communicated or received concerning a particular
fact or circumstance."
@ Information is a sequence of symbols that can be interpreted as a message. It
provides knowledge or insight about a certain matter.
Some differences between data and information:
@ Data is used as input for the computer system. Information is the output of data.

@ Data is unprocessed facts figures. Information is processed data.

@ Data doesn’t depend on Information. Information depends on data.

@ Data is not specific. Information is specific.

@ Data is a single unit. A group of data which carries news is called Information.

@ Data doesn’t carry a meaning. Information must carry a logical meaning.

@ Data is the raw material. Information is the product.


A1.3 Information System
An Information System (IS) is any combination of
information technology and people's activities
using that technology to support operations,
management, and decision-making.

In a very broad sense, the term information system


is frequently used to refer to the interaction
between people, algorithmic processes, data and
technology.
Database
A database system is a term that is typically used to encapsulate the
constructs of a data model, database Management system (DBMS)
and database.
Databases are a component within an
information system.
A database is an organised pool of
logically-related data. Data is stored
within the data structures of the database.
A DBMS is a suite of computer software
providing the interface between users and
a database(s).
Components of DBMS

Data Definition. Defining new data structures for a database, removing data
structures from the database, modifying the structure of existing data.

Data Maintenance. Inserting new data into existing data structures, updating
data in existing data structures, deleting data from existing data structures.

Data Retrieval. Querying existing data by end-users and extracting data for
use by application programs.

Data Control. Creating and monitoring users of the database, restricting


access to data in the database and monitoring the performance of databases.
A 1.4 What is Transaction in DBMS?
A transaction is a set of changes that
must all be made together.

Process :
For example, to ensure data
consistency when moving money
between two accounts it is necessary to
complete two operations (debiting one
account and crediting the other).
Unless both operations are carried out
successfully, the transaction will be
rolled back.
A 1.5 Transaction state

Consistent state, by ensuring that interdependent operations on the


system are either all completed successfully or all canceled successfully.
Rollback
The Rollback transaction is a transaction which rolls back the transaction to the
beginning of the transaction. The transaction can be rolled back completely by
specifying the transaction name in the Rollback statement or to cancel any changes to
a database during current transaction. It is permissible to use before Commit
transaction.

Roll forward
Recovering a database by applying different transactions that recorded in the
database log files. It is nothing but re-doing the changes made by a transaction i.e.
after the committed transaction and to over write the changed value again to ensure
consistency.
A 1.7 Properties of Database Transactions
A transaction is a sequence of operations performed as a single logical
unit of work. A logical unit of work must exhibit four properties, called
the atomicity, consistency, isolation, and durability (ACID) properties, to
qualify as a transaction.

Atomicity Consistency Isolation Durability


Atomicity

• A transaction must be an atomic unit of work; either all of its data


modifications are performed, or none of them is performed.

• A transaction must be fully complete, saved (committed) or completely


undone (rolled back). A sale in a retail store database illustrates a
scenario which explains atomicity, e.g., the sale consists of an inventory
reduction and a record of incoming cash. Both either happen together
or do not happen - it's all or nothing.
Consistency

• The transaction must be fully compliant with the state of the database
as it was prior to the transaction. In other words, the transaction cannot
break the database’s constraints. For example, if a database table’s
Phone Number column can only contain numerals, then consistency
dictates that any transaction attempting to enter an alphabetical letter
may not commit.
Isolation

• Transaction data must not be available to other transactions until the


original transaction is committed or rolled back.

• Modifications made by concurrent transactions must be isolated from


the modifications made by any other concurrent transactions. A
transaction either recognizes data in the state it was in before another
concurrent transaction modified it, or it recognizes the data after the
second transaction has completed, but it does not recognize an
intermediate state.
Durability

• After a transaction has completed, its effects are permanently in place


in the system. The modifications persist even in the event of a system
failure.
• Transaction data changes must be available, even in the event of
database failure.
Concurrency Control
In a single-user database, the user can modify data in the database without concern for
other users modifying the same data at the same time. However, in a multiuser database, the
statements within multiple simultaneous transactions can update the same data. Transactions
executing at the same time need to produce meaningful and consistent results.
Therefore, control of data concurrency and data consistency is vital in a multiuser database.

Data concurrency means that Data consistency means that each user
many users can access data at sees a consistent view of the data,
the same time. including visible changes made by the
user's own transactions and transactions of
other users.
To describe consistent transaction behavior
when transactions run at the same time,
database researchers have defined a
transaction isolation model
called serializability. The serializable
mode of transaction behavior tries to
ensure that transactions run in such a way
that they appear to be executed one at a
time, or serially, rather than concurrently.
When multiple users attempt to make modifications to a data at the same, some level
of control should be established to that having one user’s modification affect
adversely can be prevented. The process of controlling this is called concurrency
control.
There are three common ways that databases manage data concurrency and they are
as follows:

1. Pessimistic concurrency control

2. Optimistic concurrency control

3. Last in wins
Pessimistic Optimistic concurrency
concurrency control
control With this method, a row
in this method, a cannot be available to
row is available to other users while the data
the users when the is currently being updated.
record is being During updating, the
fetched and stays database examined the
with the user until it
row in the database to
is updated within
the database . determine whether or not
any change has been
made. An attempt to
update a record that has
already been changed can
be flagged as concurrency
violation.
Last in wins –
with this method, any row can never be available to users while the data is currently
being updated but there is no effort made to compare updates with the original
record.
The record would simply be written out. The potential effect would be overwriting
any changes that are being made by other concurrent users since the last refresh of
the record.
Some DBMS have multi-version concurrency control. This works by automatically providing read
consistency to a query which results in a situation where all data seen by query can only come from a
single point in time, or a term known as statement level read consistency. Another read consistency is
the transaction level read consistency.
The RDMS uses the information stored in the rollback segments which contain old values of data that
have been altered by recently committed or uncommitted transactions to get a consistent view.
Both consistency and concurrency are closely related to each other in database systems.
Two users (Julie and Scott) updating different fields
of the same employee record at the same time
Pessimistic Concurrency Control
Scenario: Julie opens the record and locks it.
While she edits the last name, Scott is blocked and cannot make any changes
until Julie saves and releases the lock.
Result: Only Julie can update the record at that moment. Scott must wait, and
his changes will be made after Julie finishes.
How it works: It locks the data when one user begins to edit, preventing others
from making changes until the lock is released.
➢ Data integrity is preserved, but can cause delays or deadlocks in high-use
systems.
Optimistic Concurrency Control
Scenario:
● Julie and Scott both load the same employee record at the same time.

● Julie edits and saves the last name.

● Scott edits and tries to save the phone extension after Julie.

● Before committing Scott's changes, the system checks:

○ If the record was modified since Scott last read it.

○ It detects Julie’s update.

○ One of the following happens:

■ Reject Scott’s changes and prompt him to reload the latest record.

■ Merge the changes if they are on different fields (e.g., Julie changed last name, Scott changed phone
extension) — this depends on the system's design.
How it works: Assumes that conflicts are rare. Each user edits a local copy, and the system checks for changes before
saving.

➢ Safer and non-blocking, but needs conflict resolution logic.


Last-In-Wins (Overwriting)

Scenario:

○ Julie and Scott both load the record.

○ Julie saves her change to the last name.

○ Scott saves his change to the phone extension afterward.

○ Scott’s save overwrites the entire record, including the last name field Julie changed—even if he didn't
touch it.

How it works: Whichever user saves their changes last, their version overwrites the previous one.

➢ Risky because Julie’s changes are silently lost.


➢ This is a naïve approach, often used in poorly designed systems or when concurrency isn’t considered
important.
RDBMS
Relational database management System - languages use specific programming languages such as SQL
to store, retrieve and manipulate stored data. Scripting languages used in web development such as
PHP use "query" functions to send these commands to the target database.
g

g
Select Insert Update

Queries that encapsulate a "select" Queries that encapsulate an "insert" Queries that encapsulate an
command are used to retrieve data command are used to add new rows "update" command modify rows of
from databases. The command of data to a database. The data that are already stored in a
specifies conditions to retrieve rows command specifies which table within given database table. The command
of data and return them to the the database will receive the data. can update one or more fields in
program. every row in which another given
field holds a specified value.
CustomerID CustomerName ContactName Address City PostalCode Country

1
Alfreds Futterkiste Maria Anders Obere Str. 57 Berlin 12209 Germany

2 Ana Trujillo Avda. de la


Ana Trujillo México D.F. 05021 Mexico
Emparedados Constitución 2222
3 Antonio Moreno
Antonio Moreno Mataderos 2312 México D.F. 05023 Mexico
Taquería
4
Around the Horn Thomas Hardy 120 Hanover Sq. London WA1 1DP UK

5
Berglunds snabbköp Christina Berglund Berguvsvägen 8 Luleå S-958 22 Sweden

SELECT column_name, column_name FROM table_name WHERE column_name operator value;

SELECT * FROM Customers WHERE Country='Mexico';

SELECT * FROM Customers WHERE CustomerID=1;


UPDATE table_name
SET column1=value1, column2=value2,...
WHERE some_column=some_value;

UPDATE Customers
SET ContactName='Alfred Schmidt', City='Frankfurt'
WHERE CustomerID=1;

UPDATE Customers
SET ContactName='Juan'
WHERE Country='Mexico';
A 1.9 Data validation

Data validation guarantees to your


application that every data value is
correct and accurate. You can design
data validation into your application
with several differing approaches: user
interface code, application code, or
database constraints.
Types of validation
There are a number of validation types that can be used to check the data that is being entered.

Validation type How it works Example usage

The last one or two digits in a code are used to check the
Check digit Bar code readers in supermarkets use check digits
other digits are correct
A National Insurance number is in the form LL 99 99 99 L
Format check Checks the data is in the right format
where L is any letter and 9 is any number

Length check Checks the data isn't too short or too long A password which needs to be six letters long

Lookup table Looks up acceptable values in a table There are only seven possible days of the week

Presence check Checks that data has been entered into a field In most databases a key field cannot be left blank

Number of hours worked must be less than 50 and more


Range check Checks that a value falls within the specified range
than 0

Spell check Looks up words in a dictionary When word processing


Data Verification
Data verification refers to the process wherein the data is checked for accuracy and
inconsistencies after data migration are done. This helps to determine whether data
was accurately translated when data is transported from one source to another, is
complete, and supports processes in the new system.
There are two main methods of verification:

Double entry - entering the data twice and Proofreading data / Visual Varification - this
comparing the two copies. This effectively doubles method involves someone checking the data
the workload, and as most people are paid by entered against the original document. This is also
the hour, it costs more too. time-consuming and costly.
Exercise Questions :
1. A tourist has booked a double room, non-smoking, at the Overlook Hotel. The tourist used an online
booking service called Best Fare that compares offers from different hotels.
Best Fare confirms the transaction by communicating only the booking code 3EP9RE to the tourist. The
tourist later receives an email from the hotel, with further details.

(a) Outline the difference between data and information, in relation to the booking code. [2]

Data (the booking code) are unorganized/unstructured/unprocessed terms;


Whereas information (hotel’s email) is organized/structured/processed data;

OR

Data lacks meaning e.g. on their own data elements, Code, name, room;
Whereas information is interpreted data (the code interpreted by the system) and has meaning (eg
email provides meaningful information to the tourist about the use of the Code);
(b) Define the term database transaction. [2]

o The minimal step of operation/update to be performed on a database;


o That guarantees consistency/integrity of the database;
o And recovery upon failure;

For example:
Minimal set of actions that complete/make the booking;
So that database is accurate/consistent/has integrity;
So that database is updated to prevent others booking room;
So that recovery can be run on failure;
(c) Describe the use of transactions to maintain data consistency. [3]

o Modifications on data are made persistent in the database only if the transaction terminates;
o A roll-back operation is performed if a failure occurs prior to termination of the transaction;
o And this keeps the database in the original consistent status;

For example:
Transactions conform to rules before update to database;
Transactions must complete fully to make the changes permanent/persistent;
Allows roll-back operation in case of failure;
To ensure database is consistent while performing transactions;
(d) Explain what is meant by isolation in the transaction described above. [2]

An example must be seen for the second mark to be awarded. (It means concurrent control;)
It says/defines/specifies how and when the changes of a process are visible to concurrent operations;

For example:
using locks on data to prevent concurrent writing that would lead to inconsistency as in the case of
two people booking the same room;
(e) Explain the benefits offered by data sharing in databases, for two of the stakeholders in this
scenario. [6]
General explanation (“What it is”): Tourist
Can quickly/efficiently access;
Data sharing/replication: all/part of A wide choice of hotels/offers/prices;
one/several source databases are Best Fare
shared/replicated according to the Can efficiently display offers on a wide selection of hotels;
needs of different user groups; Can develop/extend existing software when new hotels are created;
Can develop/re-use the same application without major
The information needed by the modifications;
group is made ‘closer’ to the user; Hotel
The hotel needs to rely on new applications and increasing speed is
Hence, less focus on an issue (cost/maintenance);
transmission/traffic (expensive or Allowing the sharing of information in the source database gives
slow or not available or not more visibility to the hotel, making it more competitive;
convenient), but requires more The hotel can continue to locally operate only on part of the data in
storage space (but storage is the source database, using less advanced software;
cheap); [2] Allows hotel to change prices and offers in real-time; [4]
IBDP CS Database Chapter – A2

The relational database


model
DBMS
DBMS stands for "Database Management System." In short, a DBMS is a
database program.
A database management system (DBMS) is software for creating and
managing databases. The DBMS provides users and programmers with a
systematic way to create, retrieve, update and manage data.
RDBMS
Short for relational database management system and pronounced as separate
letters, a type of database management system (RDBMS) that stores data in
the form of related tables. Relational databases are powerful because they
require few assumptions about how data is related or how it will be extracted
from the database. As a result, the same database can be viewed in many
different ways.

An important feature of relational systems is that a single database can be


spread across several tables. This differs from flat-file databases, in which a
database is self-contained in a single table.
The main differences between DBMS and RDBMS are given below:
DBMS RDBMS
DBMS applications store data as file. RDBMS applications store data in a tabular form.
In DBMS, data is generally stored in either a hierarchical form In RDBMS, the tables have an identifier called primary key and the
or a navigational form. data values are stored in the form of tables.
Normalization is not present in DBMS. Normalization is present in RDBMS.
DBMS does not apply any security with regards to data RDBMS defines the integrity constraint for the purpose of ACID
manipulation. (Atomocity, Consistency, Isolation and Durability) property.
DBMS uses file system to store data, so there will be no relation in RDBMS, data values are stored in the form of tables, so
between the tables. a relationship between these data values will be stored in the form of
a table as well.
DBMS has to provide some uniform methods to access the RDBMS system supports a tabular structure of the data and a
stored information. relationship between them to access the stored information.
DBMS does not support distributed database. RDBMS supports distributed database.
DBMS is meant to be for small organization and deal with RDBMS is designed to handle large amount of data. it
small data. it supports single user. supports multiple users.
Examples of DBMS are file systems, xml etc. Example of RDBMS are mysql, postgre, sql server, oracle etc.
Functions and Tools of a DBMS
1. Data Dictionary Management
▪ Data Dictionary is where the DBMS stores definitions of the data elements and their
relationships (metadata).
▪ The DBMS uses this function to look up the required data component structures and
relationships. When programs access data in a database they are basically going
through the DBMS.
▪ This function removes structural and data dependency and
provides the user with data abstraction. This makes things a
lot easier on the end user. The Data Dictionary is often
hidden from the user and is used by Database Administrators
and Programmers.
Functions and Tools of a DBMS
2. Data Storage Management
▪ This particular function is used for the storage of
data and any related data entry forms or screen
definitions, report definitions, data validation
rules, procedural code, and structures that can
handle video and picture formats.

▪ Users do not need to know how data is stored or


manipulated.

▪ Also involved with this structure is a term called


performance tuning that relates to a database’s
efficiency in relation to storage and access
speed.
Functions and Tools of a DBMS
3. Data Transformation and Presentation
▪ This function exists to transform any data entered into required data
structures.
▪ By using the data transformation and presentation function the DBMS
can determine the difference between logical and physical data
formats.
Functions and Tools of a DBMS
4. Security Management
▪ This is one of the most important functions in the DBMS. Security
management sets rules that determine specific users that are allowed to
access the database.
▪ Users are given a username and password or sometimes through
biometric authentication (such as a fingerprint or retina scan) but these
types of authentication tend to be more costly.
▪ This function also sets restraints on what
specific data any user can see or manage.
Functions and Tools of a DBMS
5. Multiuser Access Control

▪ Data integrity and data consistency are the


basis of this function.
▪ Multiuser access control is a very useful tool in
a DBMS, it enables multiple users to access the
database simultaneously without affecting the
integrity of the database
Functions and Tools of a DBMS
6. Backup and Recovery Management
▪ Backup and recovery is brought to mind whenever there is potential
outside threats to a database.
▪ For example if there is a power outage,
recovery management is how long it takes
to recover the database after the outage.
▪ Backup management refers to the data
safety and integrity; for example backing
up all your mp3 files on a disk.
Functions and Tools of a DBMS
7. Data Integrity Management
▪ The DBMS enforces these rules to reduce data redundancy
▪ When data is stored in more than one place unnecessarily, and
maximizing data consistency, making sure database is returning
correct/ same answer each time for same question asked.
Functions and Tools of a DBMS
8. Database Access Languages and Application Programming Interfaces
▪ A query language is a nonprocedural language.
▪ An example of this is SQL (structured query language).
▪ SQL is the most common query language supported by the majority of
DBMS vendors.
▪ The use of this language makes it easy for user to specify what they want
done without the headache of explaining how to specifically do it.
Functions and Tools of a DBMS
9. Database Communication Interfaces
▪ This refers to how a DBMS can accept different end user requests
through different network environments.
▪ An example of this can be easily related to the internet.
▪ A DBMS can provide access to the database using the Internet through
Web Browsers (Mozilla Firefox, Internet Explorer, Netscape).
Functions and Tools of a DBMS
10. Transaction Management
▪ This refers to how a DBMS must supply a method that will guarantee
that all the updates in a given transaction are made or not.
▪ All transactions must follow what is called the ACID properties.
How a DBMS can be used to promote data security?
Features involve:
▪ Data Validation
✔ data validation is the process of ensuring that a program operates on clean, correct and
useful data. It uses routines, often called "validation rules" that check for correctness,
meaningfulness, and security of data that are input to the system.
▪ Access Rights
✔ The permissions that are granted to a user, or to an application, to read, write and erase files in
the computer. Access rights can be tied to a particular client or server, to folders within that
machine or to specific programs and data files. See access control list and privilege.
• Data Locking.
✔ A database lock is used to “lock” some data in a database so that only one database user/session
may update that particular data. So, database locks exist to prevent two or more database users
from updating the same exact piece of data at the same exact time.
Database Schema
• A database schema is the skeleton structure
that represents the logical view of the entire
database. It defines how the data is organized
and how the relations among them are
associated. It formulates all the constraints that
are to be applied on the data.
• A database schema defines its entities and the
relationship among them. It contains a
descriptive detail of the database, which can
be depicted by means of schema diagrams. It’s
the database designers who design the schema
to help programmers understand the database
and make it useful.
The characteristics of the three levels of the schema
▪ The logical architecture describes how data in the database is perceived by users.
▪ It is not concerned with how the data is handled and processed by the DBMS, but only with
how it looks.
▪ The method of data storage on the underlying file system is not revealed, and the users can
manipulate the data without worrying about where it is located or how it is actually stored.
This results in the database having different levels of abstraction.
It divides the system into three levels of abstraction : the internal or physical level, the
conceptual level, and the external or view level.
1. The External Level or View Level
2. The Logical Level or Conceptual Level
3. The Internal or Physical Level
The External or View Level:
The external or view level is the highest level of abstraction of database. It
provides a window on the conceptual view, which allows the user to see only
the data of interest to them.
The user can be either an application program or an end user. There can
be many external views as any number of external schema can be defined
and they can overlap each other.
It consist of the definition of logical records and relationships in the external
view. It also contains the method of deriving the objects such as entities,
attributes and relationships in the external view from the conceptual view.
The Conceptual Level or Global Level:
The conceptual level presents a logical view of the entire database as a
unified whole.
It allows the user to bring all the data in the database together and see
it in a consistent manner. Hence , there is only one conceptual schema per
database.
The first stage in the design of a database is to define the conceptual
view, and a DBMS provides a data definition language for this purpose.
it describes all the records and relationships included in the database.
The data definition language used to create the conceptual level must
not specify any physical storage considerations that should be handled
by the physical level. It does not provide any storage or access details,
but defines the information content only.
The Internal or Physical Level:
The collection of files permanently stored on secondary storage devices
is known as the physical database.
The physical or internal level is the one closest to the physical storage,
and it provide a low level description of the physical database, and an
interface between the operating system’s file system and the record
structures used in higher level of abstraction.
It is at this level that record types and methods of storage are defined,
as well as how stored fields are represented, what physical sequence the
stored records are in, and what other physical structures exist.
IBDP CS Database Chapter – A3

Further aspects of
database management
What is a database administrator?
Database administration is more of an operational or technical level
function responsible for physical database design, security enforcement,
and database performance. Tasks include maintaining the data
dictionary, monitoring performance, and enforcing organizational
standards and security.
Explain the role or Functions of a database administrator.
1. Schema Definition:

• The DBA definition the logical Schema of the database. A Schema refers to the overall logical structure of the database.

• According to this schema, database will be developed to store required data for an organization.

2. Storage Structure and Access Method Definition:

• The DBA decides how the data is to be represented in the stored database.

3. Assisting Application Programmers:

• The DBA provides assistance to application programmers to develop application programs.

4. Physical Organization Modification:

• The DBA modifies the physical organization of the database to reflect the changing needs of the organization or to improve performance.
Explain the role or Functions of a database administrator. (Continue…..)
5. Approving Data Access:

• The DBA determines which user needs access to which part of the database.

• According to this, various types of authorizations are granted to different users.

6. Monitoring Performance:

• The DBA monitors performance of the [Link] DBA ensures that better performance is maintained by making changes in
physical or logical schema if required.

7. Backup and Recovery:

• Database should not be lost or damaged.

• The DBA ensures this periodically backing up the database on magnetic tapes or remote servers.

• In case of failure, such as virus attack database is recovered from this backup.
Explain how end-users can interact
with a database.
Describe different methods of
database recovery.
Database recovery is the process of restoring the database to the most
recent consistent state that existed just before the failure.

▪ Three states of database recovery:


• Pre-condition: At any given point in time the database is in a consistent state.
• Condition: Occurs some kind of system failure.
• Post-condition: Restore the database to the consistent state that existed before
the failure
Types of failures
Transaction failure
A transaction has to abort when it fails to execute or when it reaches a point
from where it can’t go any further. This is called transaction failure where
only a few transactions or processes are hurt.
Reasons for a transaction failure could be −
Logical errors − Where a transaction cannot complete because it has some
code error or any internal error condition.
System errors − Where the database system itself terminates an active
transaction because the DBMS is not able to execute it, or it has to stop
because of some system condition. For example, in case of deadlock or
resource unavailability, the system aborts an active transaction.
Types of failures (Continue…)
System Crash
There are problems − external to the system − that may cause the system to stop
abruptly and cause the system to crash. For example, interruptions in power supply
may cause the failure of underlying hardware or software failure.
Examples may include operating system errors.

Disk Failure
In early days of technology evolution, it was a common problem where hard-disk
drives or storage drives used to fail frequently.
Disk failures include formation of bad sectors, unreachability to the disk, disk head
crash or any other failure, which destroys all or a part of disk storage.
Recovery and Atomicity
When a system crashes, it may have several transactions being executed and various files
opened for them to modify the data items. Transactions are made of various operations, which
are atomic in nature. But according to ACID properties of DBMS, atomicity of transactions as a
whole must be maintained, that is, either all the operations are executed or none.
When a DBMS recovers from a crash, it should maintain the following −
• It should check the states of all the transactions, which were being executed.
• A transaction may be in the middle of some operation; the DBMS must ensure the atomicity of the
transaction in this case.
• It should check whether the transaction can be completed now or it needs to be rolled back.
• No transactions would be allowed to leave the DBMS in an inconsistent state.
There are two types of techniques, which can help a DBMS in recovering as well as
maintaining the atomicity of a transaction −
• Maintaining the logs of each transaction, and writing them onto some stable storage before actually
modifying the database.
• Maintaining shadow paging, where the changes are done on a volatile memory, and later, the actual
database is updated.
Log-based Recovery
Log is a sequence of records, which maintains the records of actions performed by a transaction. It is important that the
logs are written prior to the actual modification and stored on a stable storage media, which is failsafe.
Log-based recovery works as follows −
•The log file is kept on a stable storage media.
•When a transaction enters the system and starts execution, it writes a log about it.
<Tn, Start>
•When the transaction modifies an item X, it write logs as follows −
<Tn, X, V1, V2>
It reads Tn has changed the value of X, from V1 to V2.
•When the transaction finishes, it logs −
<Tn, commit>
The database can be modified using two approaches −
•Deferred database modification − All logs are written on to the stable storage and the database is updated when a
transaction commits.
•Immediate database modification − Each log follows an actual database modification. That is, the database is modified
immediately after every operation.
Recovery with Concurrent Transactions
When more than one transaction are being executed in parallel, the logs
are interleaved. At the time of recovery, it would become hard for the
recovery system to backtrack all logs, and then start recovering. To ease
this situation, most modern DBMS use the concept of 'checkpoints'.
Checkpoint
Keeping and maintaining logs in real time and in real environment may
fill out all the memory space available in the system. As time passes, the
log file may grow too big to be handled at all. Checkpoint is a
mechanism where all the previous logs are removed from the system and
stored permanently in a storage disk. Checkpoint declares a point
before which the DBMS was in consistent state, and all the transactions
were committed.
When a system with concurrent transactions crashes and recovers, it
behaves in the following manner −

• The recovery system reads the logs backwards from the end to the last checkpoint.
• It maintains two lists, an undo-list and a redo-list.
• If the recovery system sees a log with <Tn, Start> and <Tn, Commit> or just <Tn, Commit>, it puts the transaction in
the redo-list.
• If the recovery system sees a log with <Tn, Start> but no commit or abort log found, it puts the transaction in
undo-list.
All the transactions in the undo-list are then undone and their logs are removed. All the transactions in the redo-list and
their previous logs are removed and then redone before saving their logs.
Outline how integrated database systems function.
The term integrated database is used to describe two different database structures: connection of multiple databases or a
database built into another application or tool. Although the initial description would appear to mean that these two structures
are quite different, they are not. In fact, they possess many of the same structures.

• A database is a collection of data from a range of different sources organized under one structure.

• The database itself if made up of a series of semi-independent data tables. Each table holds specific information, typically
with one focus per table.

• The database can create links between the separate tables, based on common elements, information, or programming
logic. Queries can be written to create reports accessing all the different tables in the database.

• Connecting multiple databases into an integrated database is a very complex task.

• Connections between the different databases must be made and tested from a variety of perspectives to ensure that the
logic is sound.

• The different databases can also exist in different formats or programs. This adds to the complexity but may be achieved
using advanced programming logic and powerful resources.
Outline the use of databases in areas such as stock
control, police records, health records, employee data.
Suggest methods to ensure the privacy of the
personal data and the responsibility of those
holding personal data not to sell or divulge it in
any way.
• Access control :
• Auditing : Database auditing involves observing a database so as to be aware of the
actions of database users. Database administrators and consultants often set up auditing for
security purposes, for example, to ensure that those without the permission to access
information do not access it.
• Authentication
• Encryption
• Integrity controls : Data integrity is the maintenance of, and the assurance of the accuracy
and consistency of, data over its entire life-cycle
• Backups
• Application security
Discuss the need for some databases to be open
to interrogation by other parties (police,
government, etc).
Explain the difference between data
matching and data mining.
• Data matching (also known as record or data linkage, entity resolution,
object identification, or field matching) is the task of identifying,
matching and merging records that correspond to the same entities
from several databases or even within one database.

• Data Mining is defined as the procedure of extracting information from


huge sets of data. In other words, we can say that data mining is mining
knowledge from data. The tutorial starts off with a basic overview and
the terminologies involved in data mining and then gradually moves on
to cover topics such as knowledge discovery, query language,
classification and prediction, decision tree induction, cluster analysis,
and how to mine the Web.
DATA MATCHING
Data Mining
IBDP CS Database Chapter – A4

HL Extension
Further database models
and database analysis
Data Models
The structure of the database is called the data models. A Collection of
conceptual tools for describing data, data relationship, data semantic
and consistency constraint.
❑ Relational model
❑ Network model
❑ Hierarchical model
❑ Object-based logical models
❑ Unifying model
Relational model
The most common model, the relational model sorts data
into tables, also known as relations, each of which consists
of columns and rows. Each column lists an attribute of the
entity in question, such as price, zip code, or birth date.
Together, the attributes in a relation are called a domain. A
particular attribute or combination of attributes is chosen as
a primary key that can be referred to in other tables, when
it’s called a foreign key.
Each row, also called a tuple, includes data about a specific
instance of the entity in question, such as a particular
employee.
The model also accounts for the types of relationships
between those tables, including one-to-one, one-to-many,
and many-to-many relationships. Here’s an example:
Hierarchical model
The hierarchical model organizes data into a tree-like structure, where each record has a
single parent or root. Sibling records are sorted in a particular order. That order is used as
the physical order for storing the database. This model is good for describing many
real-world relationships.
Network model
The network model builds on the hierarchical
model by allowing many-to-many relationships
between linked records, implying multiple
parent records. Based on mathematical set
theory, the model is constructed with sets of
related records. Each set consists of one owner
or parent record and one or more member or
child records. A record can be a member or
child in multiple sets, allowing this model to
convey complex relationships.
It was most popular in the 70s after it was
formally defined by the Conference on Data
Systems Languages (CODASYL).
Object-oriented database model
This model defines a database as a collection of objects, or
reusable software elements, with associated features and
methods. There are several kinds of object-oriented
databases:
A multimedia database incorporates media, such as
images, that could not be stored in a relational database.
A hypertext database allows any object to link to any other
object. It’s useful for organizing lots of disparate data, but
it’s not ideal for numerical analysis.
The object-oriented database model is the best known
post-relational database model, since it incorporates tables,
but isn’t limited to tables. Such models are also known as
hybrid database models.
Entity-relationship model
This model captures the relationships between real-world entities much like the network model, but it
isn’t as directly tied to the physical structure of the database. Instead, it’s often used for designing a
database conceptually.

Here, the people, places, and


things about which data points
are stored are referred to as
entities, each of which has
certain attributes that
together make up their
domain. The cardinality, or
relationships between entities,
are mapped as well.
Spatial Data Model
A spatial database, or geodatabase is
a database that is optimized to store
and query data that represents
objects defined in a geometric space.
Most spatial databases allow
representing simple geometric objects
such as points, lines and polygons.
Some spatial databases handle more
complex structures such as 3D objects,
topological coverages
Multidimensional Data Model
multidimensional database
model or management
system (MDDBMS) - implies
the ability to rapidly process
the data in the database so
that answers can be
generated quickly. A number
of vendors provide products
that use multidimensional
databases. Approaches to
how data is stored and the
user interface vary.
Evaluate the use of object-oriented databases
as opposed to relational databases.
OODBMS RDBMS
Main objectives: data encapsulation and independence. Main objective: ensuring data independence from application programs.
Independence of classes: classes can be reorganized without affecting Data independence: Data can be reorganized and modified without
the mode of using them. affecting the mode of using them.
OODBMS store data and methods. RDBMS store only data.
Encapsulation: the data can be used only through their classes’ methods. Data partitioning: data can be partitioned depending on the requirements
of the users and on the specific users applications.
Active objects: the objects active. Requests cause objects to execute their Passive data: the data are passive. Certain operations, which are limited,
methods. can be automatically brought into use when the data are used.
Complexity: the structure of data may be complex, involving different Simplicity: users perceive data as columns, rows/tuples and tables.
types of data.
Chained data: data can be chained so that the methods of classes may Separate Tables: each relation/table is separate. The
bring about increased performance. Structured data such as BLOBS Join Operator refers data from separate tables.
(binary large objects) are used for sound, image, video etc.

Non-redundancy of methods: data and methods non-redundancy is Data non-redundancy: data normalization aims at eliminating or reducing
achieved through encapsulation and inheritance. Inheritance helps to data redundancy. It is used in the stage of designing the database and not
reduce the redundancy of methods. in the stage of developing the applications.
Optimizing classes: the data for an object can be interrelated and RDBMS performance is related to the level of complexity of the data
stored together, so that they may all be accessed by the access structure.
mechanism.
Consistent conceptual model: the models used for analysis, designing, Different conceptual model: the model of data structure and data access
programming and accessing represented by tables and JOINS is different from the model of analysis,
the database are similar. designing and programming.
The classes of objects directly represent the concepts of applications. The project must be converted in relational and access tables in accordance
with SQL.
Define the term data warehouse.
A data warehouse is constructed by integrating data from multiple
heterogeneous sources. It supports analytical reporting, structured
and/or ad hoc queries and decision making.
Data warehousing is the process of constructing and using a data
warehouse. A data warehouse is constructed by integrating data from
multiple heterogeneous sources that support analytical reporting,
structured and/or ad hoc queries, and decision making. Data
warehousing involves data cleaning, data integration, and data
consolidations.
Describe a range of situations suitable
for data warehousing.
Explain why data warehousing is time
dependent.
• Data in a warehouse is only valid for a period of time.
Describe how data in a warehouse is updated
in real time.
• Data is refreshed from data in operational systems.
Describe the advantages of using data
warehousing.
• A single manageable structure to support decision-making. Allows
complex queries to be run across a number of business areas.
Explain the need for ETL processes in data
warehousing.
ETL Technology is an important component of the Data Warehousing Architecture. It
is used to copy data from Operational Applications to the Data Warehouse Staging
Area, from the DW Staging Area into the Data Warehouse and finally from the Data
Warehouse into a set of conformed Data Marts that are accessible by decision
makers.
The ETL software extracts data,
transforms values of inconsistent data,
cleanses "bad" data, filters data and
loads data into a target database. The
scheduling of ETL jobs is critical. Should
there be a failure in one ETL job, the
remaining ETL jobs must respond
appropriately.
ETL (Extract, Transform and Load) is a process in data warehousing responsible for
pulling data out of the source systems and placing it into a data warehouse. ETL
involves the following tasks:
• Extracting the data from source systems (SAP, ERP, other oprational systems),
data from different source systems is converted into one consolidated data
warehouse format which is ready for transformation processing.
• Transforming the data may involve the following tasks: applying business rules
(so-called derivations, e.g., calculating new measures and dimensions), cleaning
(e.g., mapping NULL to 0 or "Male" to "M" and "Female" to "F" etc.), filtering
(e.g., selecting only certain columns to load), splitting a column into multiple
columns and vice versa, joining together data from multiple sources (e.g.,
lookup, merge), transposing rows and columns, applying any kind of simple or
complex data validation (e.g., if the first 3 columns in a row are empty then reject
the row from processing)
• Loading the data into a data warehouse or data repository other reporting
applications
• Extract
• The Extract step covers the data extraction from the source system and makes it accessible for further processing.
The main objective of the extract step is to retrieve all the required data from the source system with as little
resources as possible. The extract step should be designed in a way that it does not negatively affect the source
system in terms or performance, response time or any kind of locking.
• There are several ways to perform the extract:
• Update notification - if the source system is able to provide a notification that a record has been changed and
describe the change, this is the easiest way to get the data.
• Incremental extract - some systems may not be able to provide notification that an update has occurred, but they
are able to identify which records have been modified and provide an extract of such records. During further ETL
steps, the system needs to identify changes and propagate it down. Note, that by using daily extract, we may not
be able to handle deleted records properly.
• Full extract - some systems are not able to identify which data has been changed at all, so a full extract is the
only way one can get the data out of the system. The full extract requires keeping a copy of the last extract in the
same format in order to be able to identify changes. Full extract handles deletions as well.
• When using Incremental or Full extracts, the extract frequency is extremely important. Particularly for full extracts;
the data volumes can be in tens of gigabytes.
• Transform
• The transform step applies a set of rules to transform the data from the
source to the target. This includes converting any measured data to the
same dimension (i.e. conformed dimension) using the same units so that
they can later be joined. The transformation step also requires joining
data from several sources, generating aggregates, generating
surrogate keys, sorting, deriving new calculated values, and applying
advanced validation rules.
• Load
• During the load step, it is necessary to ensure that the load is
performed correctly and with as little resources as possible. The target
of the Load process is often a database. In order to make the load
process efficient, it is helpful to disable any constraints and indexes
before the load and enable them back only after the load completes.
The referential integrity needs to be maintained by ETL tool to ensure
consistency.
Describe how ETL processes can be used to
clean up data for a data warehouse.
The cleaning step is one of the most important as it ensures the quality of the data in
the data warehouse. Cleaning should perform basic data unification rules, such as:
• Making identifiers unique (sex categories Male/Female/Unknown, M/F/null,
Man/Woman/Not Available are translated to standard Male/Female/Unknown)
• Convert null values into standardized Not Available/Not Provided value
• Convert phone numbers, ZIP codes to a standardized form
• Validate address fields, convert them into proper naming, e.g. Street/St/St./Str./Str
• Validate address fields against each other (State/Country, City/State, City/ZIP
code, City/Street).
Compare the different forms of discovering
patterns using data mining.
describe the conceptual approach used by:
Cluster analysis : Cluster is a group of objects that belongs to the same class. In other words, similar objects are
grouped in one cluster and dissimilar objects are grouped in another cluster.
Associations : Association rules are created by analyzing data for frequent if/then patterns and using the
criteria support and confidence to identify the most important relationships. Support is an indication of how frequently
the items appear in the database. Confidence indicates the number of times the if/then statements have been found to
be true. In data mining, association rules are useful for analyzing and predicting customer behavior. They play an
important part in shopping basket data analysis, product clustering, catalog design and store layout.
Classifications : Classification is a data mining function that assigns items in a collection to target categories or classes.
The goal of classification is to accurately predict the target class for each case in the data. For example, a classification
model could be used to identify loan applicants as low, medium, or high credit risks.
Sequential patterns : Sequential pattern mining is a topic of data mining concerned with finding statistically relevant
patterns between data examples where the values are delivered in a sequence. It is usually presumed that the values
are discrete, and thus time series mining is closely related, but usually considered a different activity. Sequential pattern
mining is a special case of structured data mining.
Forecasting:
Describe situations that benefit from data
mining.
Examples can be cited such as the use of mining techniques by banks to
identify fraudulent credit card use; retailers can use mining techniques to
identify subsets of the population likely to respond to a particular
promotion.
Describe how predictive modelling is used.
The use of classification techniques such as “decision tree induction”
or “back propagation in neural networks”. The determination of
values for rows of a database useful for predictions.
Explain the nature of database segmentation.
The partitioning of a database according to some feature in common in
the rows.
Explain the nature and purpose of link
analysis.
The use of rules to establish associations between individual records in a
data set.
Describe the process of deviation detection.
The detection of outlying data can be subjected to statistical techniques
in order to identify unusual events or data subsets.

You might also like