Data vs. Information in Databases
Data vs. Information in Databases
Database : Basic
Concepts
A 1.1Data and information
Data is meaningless. To be useful, data must be interpreted to produce
information.
@ Data is a single unit. A group of data which carries news is called Information.
Data Definition. Defining new data structures for a database, removing data
structures from the database, modifying the structure of existing data.
Data Maintenance. Inserting new data into existing data structures, updating
data in existing data structures, deleting data from existing data structures.
Data Retrieval. Querying existing data by end-users and extracting data for
use by application programs.
Process :
For example, to ensure data
consistency when moving money
between two accounts it is necessary to
complete two operations (debiting one
account and crediting the other).
Unless both operations are carried out
successfully, the transaction will be
rolled back.
A 1.5 Transaction state
Roll forward
Recovering a database by applying different transactions that recorded in the
database log files. It is nothing but re-doing the changes made by a transaction i.e.
after the committed transaction and to over write the changed value again to ensure
consistency.
A 1.7 Properties of Database Transactions
A transaction is a sequence of operations performed as a single logical
unit of work. A logical unit of work must exhibit four properties, called
the atomicity, consistency, isolation, and durability (ACID) properties, to
qualify as a transaction.
• The transaction must be fully compliant with the state of the database
as it was prior to the transaction. In other words, the transaction cannot
break the database’s constraints. For example, if a database table’s
Phone Number column can only contain numerals, then consistency
dictates that any transaction attempting to enter an alphabetical letter
may not commit.
Isolation
Data concurrency means that Data consistency means that each user
many users can access data at sees a consistent view of the data,
the same time. including visible changes made by the
user's own transactions and transactions of
other users.
To describe consistent transaction behavior
when transactions run at the same time,
database researchers have defined a
transaction isolation model
called serializability. The serializable
mode of transaction behavior tries to
ensure that transactions run in such a way
that they appear to be executed one at a
time, or serially, rather than concurrently.
When multiple users attempt to make modifications to a data at the same, some level
of control should be established to that having one user’s modification affect
adversely can be prevented. The process of controlling this is called concurrency
control.
There are three common ways that databases manage data concurrency and they are
as follows:
3. Last in wins
Pessimistic Optimistic concurrency
concurrency control
control With this method, a row
in this method, a cannot be available to
row is available to other users while the data
the users when the is currently being updated.
record is being During updating, the
fetched and stays database examined the
with the user until it
row in the database to
is updated within
the database . determine whether or not
any change has been
made. An attempt to
update a record that has
already been changed can
be flagged as concurrency
violation.
Last in wins –
with this method, any row can never be available to users while the data is currently
being updated but there is no effort made to compare updates with the original
record.
The record would simply be written out. The potential effect would be overwriting
any changes that are being made by other concurrent users since the last refresh of
the record.
Some DBMS have multi-version concurrency control. This works by automatically providing read
consistency to a query which results in a situation where all data seen by query can only come from a
single point in time, or a term known as statement level read consistency. Another read consistency is
the transaction level read consistency.
The RDMS uses the information stored in the rollback segments which contain old values of data that
have been altered by recently committed or uncommitted transactions to get a consistent view.
Both consistency and concurrency are closely related to each other in database systems.
Two users (Julie and Scott) updating different fields
of the same employee record at the same time
Pessimistic Concurrency Control
Scenario: Julie opens the record and locks it.
While she edits the last name, Scott is blocked and cannot make any changes
until Julie saves and releases the lock.
Result: Only Julie can update the record at that moment. Scott must wait, and
his changes will be made after Julie finishes.
How it works: It locks the data when one user begins to edit, preventing others
from making changes until the lock is released.
➢ Data integrity is preserved, but can cause delays or deadlocks in high-use
systems.
Optimistic Concurrency Control
Scenario:
● Julie and Scott both load the same employee record at the same time.
● Scott edits and tries to save the phone extension after Julie.
■ Reject Scott’s changes and prompt him to reload the latest record.
■ Merge the changes if they are on different fields (e.g., Julie changed last name, Scott changed phone
extension) — this depends on the system's design.
How it works: Assumes that conflicts are rare. Each user edits a local copy, and the system checks for changes before
saving.
Scenario:
○ Scott’s save overwrites the entire record, including the last name field Julie changed—even if he didn't
touch it.
How it works: Whichever user saves their changes last, their version overwrites the previous one.
g
Select Insert Update
Queries that encapsulate a "select" Queries that encapsulate an "insert" Queries that encapsulate an
command are used to retrieve data command are used to add new rows "update" command modify rows of
from databases. The command of data to a database. The data that are already stored in a
specifies conditions to retrieve rows command specifies which table within given database table. The command
of data and return them to the the database will receive the data. can update one or more fields in
program. every row in which another given
field holds a specified value.
CustomerID CustomerName ContactName Address City PostalCode Country
1
Alfreds Futterkiste Maria Anders Obere Str. 57 Berlin 12209 Germany
5
Berglunds snabbköp Christina Berglund Berguvsvägen 8 Luleå S-958 22 Sweden
UPDATE Customers
SET ContactName='Alfred Schmidt', City='Frankfurt'
WHERE CustomerID=1;
UPDATE Customers
SET ContactName='Juan'
WHERE Country='Mexico';
A 1.9 Data validation
The last one or two digits in a code are used to check the
Check digit Bar code readers in supermarkets use check digits
other digits are correct
A National Insurance number is in the form LL 99 99 99 L
Format check Checks the data is in the right format
where L is any letter and 9 is any number
Length check Checks the data isn't too short or too long A password which needs to be six letters long
Lookup table Looks up acceptable values in a table There are only seven possible days of the week
Presence check Checks that data has been entered into a field In most databases a key field cannot be left blank
Double entry - entering the data twice and Proofreading data / Visual Varification - this
comparing the two copies. This effectively doubles method involves someone checking the data
the workload, and as most people are paid by entered against the original document. This is also
the hour, it costs more too. time-consuming and costly.
Exercise Questions :
1. A tourist has booked a double room, non-smoking, at the Overlook Hotel. The tourist used an online
booking service called Best Fare that compares offers from different hotels.
Best Fare confirms the transaction by communicating only the booking code 3EP9RE to the tourist. The
tourist later receives an email from the hotel, with further details.
(a) Outline the difference between data and information, in relation to the booking code. [2]
OR
Data lacks meaning e.g. on their own data elements, Code, name, room;
Whereas information is interpreted data (the code interpreted by the system) and has meaning (eg
email provides meaningful information to the tourist about the use of the Code);
(b) Define the term database transaction. [2]
For example:
Minimal set of actions that complete/make the booking;
So that database is accurate/consistent/has integrity;
So that database is updated to prevent others booking room;
So that recovery can be run on failure;
(c) Describe the use of transactions to maintain data consistency. [3]
o Modifications on data are made persistent in the database only if the transaction terminates;
o A roll-back operation is performed if a failure occurs prior to termination of the transaction;
o And this keeps the database in the original consistent status;
For example:
Transactions conform to rules before update to database;
Transactions must complete fully to make the changes permanent/persistent;
Allows roll-back operation in case of failure;
To ensure database is consistent while performing transactions;
(d) Explain what is meant by isolation in the transaction described above. [2]
An example must be seen for the second mark to be awarded. (It means concurrent control;)
It says/defines/specifies how and when the changes of a process are visible to concurrent operations;
For example:
using locks on data to prevent concurrent writing that would lead to inconsistency as in the case of
two people booking the same room;
(e) Explain the benefits offered by data sharing in databases, for two of the stakeholders in this
scenario. [6]
General explanation (“What it is”): Tourist
Can quickly/efficiently access;
Data sharing/replication: all/part of A wide choice of hotels/offers/prices;
one/several source databases are Best Fare
shared/replicated according to the Can efficiently display offers on a wide selection of hotels;
needs of different user groups; Can develop/extend existing software when new hotels are created;
Can develop/re-use the same application without major
The information needed by the modifications;
group is made ‘closer’ to the user; Hotel
The hotel needs to rely on new applications and increasing speed is
Hence, less focus on an issue (cost/maintenance);
transmission/traffic (expensive or Allowing the sharing of information in the source database gives
slow or not available or not more visibility to the hotel, making it more competitive;
convenient), but requires more The hotel can continue to locally operate only on part of the data in
storage space (but storage is the source database, using less advanced software;
cheap); [2] Allows hotel to change prices and offers in real-time; [4]
IBDP CS Database Chapter – A2
Further aspects of
database management
What is a database administrator?
Database administration is more of an operational or technical level
function responsible for physical database design, security enforcement,
and database performance. Tasks include maintaining the data
dictionary, monitoring performance, and enforcing organizational
standards and security.
Explain the role or Functions of a database administrator.
1. Schema Definition:
• The DBA definition the logical Schema of the database. A Schema refers to the overall logical structure of the database.
• According to this schema, database will be developed to store required data for an organization.
• The DBA decides how the data is to be represented in the stored database.
• The DBA modifies the physical organization of the database to reflect the changing needs of the organization or to improve performance.
Explain the role or Functions of a database administrator. (Continue…..)
5. Approving Data Access:
• The DBA determines which user needs access to which part of the database.
6. Monitoring Performance:
• The DBA monitors performance of the [Link] DBA ensures that better performance is maintained by making changes in
physical or logical schema if required.
• The DBA ensures this periodically backing up the database on magnetic tapes or remote servers.
• In case of failure, such as virus attack database is recovered from this backup.
Explain how end-users can interact
with a database.
Describe different methods of
database recovery.
Database recovery is the process of restoring the database to the most
recent consistent state that existed just before the failure.
Disk Failure
In early days of technology evolution, it was a common problem where hard-disk
drives or storage drives used to fail frequently.
Disk failures include formation of bad sectors, unreachability to the disk, disk head
crash or any other failure, which destroys all or a part of disk storage.
Recovery and Atomicity
When a system crashes, it may have several transactions being executed and various files
opened for them to modify the data items. Transactions are made of various operations, which
are atomic in nature. But according to ACID properties of DBMS, atomicity of transactions as a
whole must be maintained, that is, either all the operations are executed or none.
When a DBMS recovers from a crash, it should maintain the following −
• It should check the states of all the transactions, which were being executed.
• A transaction may be in the middle of some operation; the DBMS must ensure the atomicity of the
transaction in this case.
• It should check whether the transaction can be completed now or it needs to be rolled back.
• No transactions would be allowed to leave the DBMS in an inconsistent state.
There are two types of techniques, which can help a DBMS in recovering as well as
maintaining the atomicity of a transaction −
• Maintaining the logs of each transaction, and writing them onto some stable storage before actually
modifying the database.
• Maintaining shadow paging, where the changes are done on a volatile memory, and later, the actual
database is updated.
Log-based Recovery
Log is a sequence of records, which maintains the records of actions performed by a transaction. It is important that the
logs are written prior to the actual modification and stored on a stable storage media, which is failsafe.
Log-based recovery works as follows −
•The log file is kept on a stable storage media.
•When a transaction enters the system and starts execution, it writes a log about it.
<Tn, Start>
•When the transaction modifies an item X, it write logs as follows −
<Tn, X, V1, V2>
It reads Tn has changed the value of X, from V1 to V2.
•When the transaction finishes, it logs −
<Tn, commit>
The database can be modified using two approaches −
•Deferred database modification − All logs are written on to the stable storage and the database is updated when a
transaction commits.
•Immediate database modification − Each log follows an actual database modification. That is, the database is modified
immediately after every operation.
Recovery with Concurrent Transactions
When more than one transaction are being executed in parallel, the logs
are interleaved. At the time of recovery, it would become hard for the
recovery system to backtrack all logs, and then start recovering. To ease
this situation, most modern DBMS use the concept of 'checkpoints'.
Checkpoint
Keeping and maintaining logs in real time and in real environment may
fill out all the memory space available in the system. As time passes, the
log file may grow too big to be handled at all. Checkpoint is a
mechanism where all the previous logs are removed from the system and
stored permanently in a storage disk. Checkpoint declares a point
before which the DBMS was in consistent state, and all the transactions
were committed.
When a system with concurrent transactions crashes and recovers, it
behaves in the following manner −
• The recovery system reads the logs backwards from the end to the last checkpoint.
• It maintains two lists, an undo-list and a redo-list.
• If the recovery system sees a log with <Tn, Start> and <Tn, Commit> or just <Tn, Commit>, it puts the transaction in
the redo-list.
• If the recovery system sees a log with <Tn, Start> but no commit or abort log found, it puts the transaction in
undo-list.
All the transactions in the undo-list are then undone and their logs are removed. All the transactions in the redo-list and
their previous logs are removed and then redone before saving their logs.
Outline how integrated database systems function.
The term integrated database is used to describe two different database structures: connection of multiple databases or a
database built into another application or tool. Although the initial description would appear to mean that these two structures
are quite different, they are not. In fact, they possess many of the same structures.
• A database is a collection of data from a range of different sources organized under one structure.
• The database itself if made up of a series of semi-independent data tables. Each table holds specific information, typically
with one focus per table.
• The database can create links between the separate tables, based on common elements, information, or programming
logic. Queries can be written to create reports accessing all the different tables in the database.
• Connections between the different databases must be made and tested from a variety of perspectives to ensure that the
logic is sound.
• The different databases can also exist in different formats or programs. This adds to the complexity but may be achieved
using advanced programming logic and powerful resources.
Outline the use of databases in areas such as stock
control, police records, health records, employee data.
Suggest methods to ensure the privacy of the
personal data and the responsibility of those
holding personal data not to sell or divulge it in
any way.
• Access control :
• Auditing : Database auditing involves observing a database so as to be aware of the
actions of database users. Database administrators and consultants often set up auditing for
security purposes, for example, to ensure that those without the permission to access
information do not access it.
• Authentication
• Encryption
• Integrity controls : Data integrity is the maintenance of, and the assurance of the accuracy
and consistency of, data over its entire life-cycle
• Backups
• Application security
Discuss the need for some databases to be open
to interrogation by other parties (police,
government, etc).
Explain the difference between data
matching and data mining.
• Data matching (also known as record or data linkage, entity resolution,
object identification, or field matching) is the task of identifying,
matching and merging records that correspond to the same entities
from several databases or even within one database.
HL Extension
Further database models
and database analysis
Data Models
The structure of the database is called the data models. A Collection of
conceptual tools for describing data, data relationship, data semantic
and consistency constraint.
❑ Relational model
❑ Network model
❑ Hierarchical model
❑ Object-based logical models
❑ Unifying model
Relational model
The most common model, the relational model sorts data
into tables, also known as relations, each of which consists
of columns and rows. Each column lists an attribute of the
entity in question, such as price, zip code, or birth date.
Together, the attributes in a relation are called a domain. A
particular attribute or combination of attributes is chosen as
a primary key that can be referred to in other tables, when
it’s called a foreign key.
Each row, also called a tuple, includes data about a specific
instance of the entity in question, such as a particular
employee.
The model also accounts for the types of relationships
between those tables, including one-to-one, one-to-many,
and many-to-many relationships. Here’s an example:
Hierarchical model
The hierarchical model organizes data into a tree-like structure, where each record has a
single parent or root. Sibling records are sorted in a particular order. That order is used as
the physical order for storing the database. This model is good for describing many
real-world relationships.
Network model
The network model builds on the hierarchical
model by allowing many-to-many relationships
between linked records, implying multiple
parent records. Based on mathematical set
theory, the model is constructed with sets of
related records. Each set consists of one owner
or parent record and one or more member or
child records. A record can be a member or
child in multiple sets, allowing this model to
convey complex relationships.
It was most popular in the 70s after it was
formally defined by the Conference on Data
Systems Languages (CODASYL).
Object-oriented database model
This model defines a database as a collection of objects, or
reusable software elements, with associated features and
methods. There are several kinds of object-oriented
databases:
A multimedia database incorporates media, such as
images, that could not be stored in a relational database.
A hypertext database allows any object to link to any other
object. It’s useful for organizing lots of disparate data, but
it’s not ideal for numerical analysis.
The object-oriented database model is the best known
post-relational database model, since it incorporates tables,
but isn’t limited to tables. Such models are also known as
hybrid database models.
Entity-relationship model
This model captures the relationships between real-world entities much like the network model, but it
isn’t as directly tied to the physical structure of the database. Instead, it’s often used for designing a
database conceptually.
Non-redundancy of methods: data and methods non-redundancy is Data non-redundancy: data normalization aims at eliminating or reducing
achieved through encapsulation and inheritance. Inheritance helps to data redundancy. It is used in the stage of designing the database and not
reduce the redundancy of methods. in the stage of developing the applications.
Optimizing classes: the data for an object can be interrelated and RDBMS performance is related to the level of complexity of the data
stored together, so that they may all be accessed by the access structure.
mechanism.
Consistent conceptual model: the models used for analysis, designing, Different conceptual model: the model of data structure and data access
programming and accessing represented by tables and JOINS is different from the model of analysis,
the database are similar. designing and programming.
The classes of objects directly represent the concepts of applications. The project must be converted in relational and access tables in accordance
with SQL.
Define the term data warehouse.
A data warehouse is constructed by integrating data from multiple
heterogeneous sources. It supports analytical reporting, structured
and/or ad hoc queries and decision making.
Data warehousing is the process of constructing and using a data
warehouse. A data warehouse is constructed by integrating data from
multiple heterogeneous sources that support analytical reporting,
structured and/or ad hoc queries, and decision making. Data
warehousing involves data cleaning, data integration, and data
consolidations.
Describe a range of situations suitable
for data warehousing.
Explain why data warehousing is time
dependent.
• Data in a warehouse is only valid for a period of time.
Describe how data in a warehouse is updated
in real time.
• Data is refreshed from data in operational systems.
Describe the advantages of using data
warehousing.
• A single manageable structure to support decision-making. Allows
complex queries to be run across a number of business areas.
Explain the need for ETL processes in data
warehousing.
ETL Technology is an important component of the Data Warehousing Architecture. It
is used to copy data from Operational Applications to the Data Warehouse Staging
Area, from the DW Staging Area into the Data Warehouse and finally from the Data
Warehouse into a set of conformed Data Marts that are accessible by decision
makers.
The ETL software extracts data,
transforms values of inconsistent data,
cleanses "bad" data, filters data and
loads data into a target database. The
scheduling of ETL jobs is critical. Should
there be a failure in one ETL job, the
remaining ETL jobs must respond
appropriately.
ETL (Extract, Transform and Load) is a process in data warehousing responsible for
pulling data out of the source systems and placing it into a data warehouse. ETL
involves the following tasks:
• Extracting the data from source systems (SAP, ERP, other oprational systems),
data from different source systems is converted into one consolidated data
warehouse format which is ready for transformation processing.
• Transforming the data may involve the following tasks: applying business rules
(so-called derivations, e.g., calculating new measures and dimensions), cleaning
(e.g., mapping NULL to 0 or "Male" to "M" and "Female" to "F" etc.), filtering
(e.g., selecting only certain columns to load), splitting a column into multiple
columns and vice versa, joining together data from multiple sources (e.g.,
lookup, merge), transposing rows and columns, applying any kind of simple or
complex data validation (e.g., if the first 3 columns in a row are empty then reject
the row from processing)
• Loading the data into a data warehouse or data repository other reporting
applications
• Extract
• The Extract step covers the data extraction from the source system and makes it accessible for further processing.
The main objective of the extract step is to retrieve all the required data from the source system with as little
resources as possible. The extract step should be designed in a way that it does not negatively affect the source
system in terms or performance, response time or any kind of locking.
• There are several ways to perform the extract:
• Update notification - if the source system is able to provide a notification that a record has been changed and
describe the change, this is the easiest way to get the data.
• Incremental extract - some systems may not be able to provide notification that an update has occurred, but they
are able to identify which records have been modified and provide an extract of such records. During further ETL
steps, the system needs to identify changes and propagate it down. Note, that by using daily extract, we may not
be able to handle deleted records properly.
• Full extract - some systems are not able to identify which data has been changed at all, so a full extract is the
only way one can get the data out of the system. The full extract requires keeping a copy of the last extract in the
same format in order to be able to identify changes. Full extract handles deletions as well.
• When using Incremental or Full extracts, the extract frequency is extremely important. Particularly for full extracts;
the data volumes can be in tens of gigabytes.
• Transform
• The transform step applies a set of rules to transform the data from the
source to the target. This includes converting any measured data to the
same dimension (i.e. conformed dimension) using the same units so that
they can later be joined. The transformation step also requires joining
data from several sources, generating aggregates, generating
surrogate keys, sorting, deriving new calculated values, and applying
advanced validation rules.
• Load
• During the load step, it is necessary to ensure that the load is
performed correctly and with as little resources as possible. The target
of the Load process is often a database. In order to make the load
process efficient, it is helpful to disable any constraints and indexes
before the load and enable them back only after the load completes.
The referential integrity needs to be maintained by ETL tool to ensure
consistency.
Describe how ETL processes can be used to
clean up data for a data warehouse.
The cleaning step is one of the most important as it ensures the quality of the data in
the data warehouse. Cleaning should perform basic data unification rules, such as:
• Making identifiers unique (sex categories Male/Female/Unknown, M/F/null,
Man/Woman/Not Available are translated to standard Male/Female/Unknown)
• Convert null values into standardized Not Available/Not Provided value
• Convert phone numbers, ZIP codes to a standardized form
• Validate address fields, convert them into proper naming, e.g. Street/St/St./Str./Str
• Validate address fields against each other (State/Country, City/State, City/ZIP
code, City/Street).
Compare the different forms of discovering
patterns using data mining.
describe the conceptual approach used by:
Cluster analysis : Cluster is a group of objects that belongs to the same class. In other words, similar objects are
grouped in one cluster and dissimilar objects are grouped in another cluster.
Associations : Association rules are created by analyzing data for frequent if/then patterns and using the
criteria support and confidence to identify the most important relationships. Support is an indication of how frequently
the items appear in the database. Confidence indicates the number of times the if/then statements have been found to
be true. In data mining, association rules are useful for analyzing and predicting customer behavior. They play an
important part in shopping basket data analysis, product clustering, catalog design and store layout.
Classifications : Classification is a data mining function that assigns items in a collection to target categories or classes.
The goal of classification is to accurately predict the target class for each case in the data. For example, a classification
model could be used to identify loan applicants as low, medium, or high credit risks.
Sequential patterns : Sequential pattern mining is a topic of data mining concerned with finding statistically relevant
patterns between data examples where the values are delivered in a sequence. It is usually presumed that the values
are discrete, and thus time series mining is closely related, but usually considered a different activity. Sequential pattern
mining is a special case of structured data mining.
Forecasting:
Describe situations that benefit from data
mining.
Examples can be cited such as the use of mining techniques by banks to
identify fraudulent credit card use; retailers can use mining techniques to
identify subsets of the population likely to respond to a particular
promotion.
Describe how predictive modelling is used.
The use of classification techniques such as “decision tree induction”
or “back propagation in neural networks”. The determination of
values for rows of a database useful for predictions.
Explain the nature of database segmentation.
The partitioning of a database according to some feature in common in
the rows.
Explain the nature and purpose of link
analysis.
The use of rules to establish associations between individual records in a
data set.
Describe the process of deviation detection.
The detection of outlying data can be subjected to statistical techniques
in order to identify unusual events or data subsets.