Multibase: Integrating Distributed Databases
Multibase: Integrating Distributed Databases
database systems*
by JOHN MILES SMITH, PHILIP A. BERNSTEIN, UMESHWAR DAYAL, NATHAN
GOODMAN, TERRY LANDERS, KEN W. T. LIN, and EUGENE WONG
Computer Corporation of America
Cambridge, Massachusetts
ABSTRACT dent databases, each with its own schema. Such databases are
nonintegrated. Furthermore, these databases may be man-
Multibase is a software system for integrating access to pre- aged by different database management systems (DBMS),
existing, heterogeneous, distributed databases. The system perhaps on different hardware. In this case, in addition to
suppresses differences of DBMS, language, and data models being nonintegrated the databases are distributed and hetero-
among the databases and provides users with a unified global geneous. Thus, the real world of nonintegrated, hetero-
schema and a single high-level query language. Autonomy for geneous, distributed databases differs greatly from the more
updating is retained with the local databases. The architecture ideal world of an integrated database.
of Multibase does not require any changes to local databases Nonintegrated, heterogeneous, distributed databases arise
or DBMSs. There are three principal research goals of the for several reasons. First, many of these databases were cre-
project. The first goal is to develop appropriate language ated before the benefits of integrated databases were well
constructs for accessing and integrating heterogeneous data- understood. In those days, total integration was not a prin-
bases. The second goal is to discover effective global and local cipal database design goal. Second, the lack of a central data-
optimization techniques. The final goal is to design methods base administrator for some enterprises has made it difficult
for handling incompatible data representations and inconsis- for independent organizations within an enterprise to produce
tent data. Currently the project is in the first year of a planned an integrated database suitable for all of them. Third, the
three year effort. This paper describes the basic architecture large size of many data processing applications has made dis-
of Multibase and identifies some of the avenues to be taken in tribution a necessity, simply to handle the volume of work.
subsequent research. Since integrated distributed DBMSs have not been available,
it has been necessary to implement applications on different
machines. Since different applications often have different
1. INTRODUCTION performance and functionality requirements, different
DBMSs were often selected to run on these machines to meet
these different requirements. Many data processing organiza-
What is Multibase? tions have experienced these problems, so there are many
nonintegrated, heterogeneous, distributed databases in the
The database approach to data processing requires that all world.
of the data relevant to an enterprise be stored in an integrated A principal problem in using databases of this type is that
database. By "integrated," we mean that a single schema of integrated retrieval. In such databases, each independent
(i.e., database description) describes the entire database, that database has its own schema, expressed in its own data model,
all accesses to the database are expressed relative to that and can be accessed only by its own retrieval language. Since
schema, and that such accesses are processed against a single different databases in general have different schemata, differ-
(logical) copy of the database. Unfortunately, in the real ent data models, and different retrieval languages, many diffi-
world many databases are not integrated. Often, the data culties arise in formulating and implementing retrieval re-
relevant to an enterprise is implemented by many indepen- quests (called queries) that require data from more than one
database. These difficulties include the following: resolving
* This research was jointly supported by the Defense Advanced Research incompatibilities between the databases, such as differences
Projects Agency of the Department of Defense and the Naval Electronic Sys- of data types and conflicting schema names; resolving incon-
tems Command and was monitored by the Naval Electronic Systems Command sistencies between copies of the same information stored in
under Contract No. N00039-80-C-0402. The views and conclusions contained in different databases; and transforming a query expressed in the
this document are those of the authors and should not be interpreted as neces-
sarily representing the official policies, either expressed or implied, of the
user's language4nto a set of queries expressed in the many
Defense Advanced Research Projects Agency or the Naval Electronic Systems different languages supported by the different sites. Imple-
Command or the U.S. Government. menting such a query usually consumes months of program-
487
488 National Computer Conference, 1981
Schema Architecture
use the same language DAPLEX as both the query and map-
query over global schema ping language. The process of constructing the global schema
from the local schemata is discussed in Section 3.
query translator
Query Processor
query over disjoint union of LSs & IS
The query processor translates a query over the disjoint
union of LSs and IS into a query processing strategy. This
query processor strategy includes the following: a set of queries, each of which
is posed against exactly one LS or the IS; a set of "move"
query over IS
operations to ship the results of these queries between the
query over
LSI local DBMSs and the query processor; and a set of queries
that is executed locally by the query processor to integrate the
results of the LS and IS queries. The main goal of this trans-
lation is to minimize the total cost of evaluating the query,
LDI1 LDIn LDI
where cost is measured by local processing time and commu-
query over LHS
nication volume.
query over
LHS1 A query processing strategy is produced in two steps. First,
the query is translated into an internal representation called a
query graph. Using this representation, the query processor
DBMS1 DBMSn
isolates those subqueries of the given query (which are essen-
Figure 3—Run-time query processing subsystem tially subgraphs of the query graph) that can be entirely eval-
uated at one local DBMS. Thus, the result of the first step is
the set of single-site subqueries of the given query.
The second step is to combine the single-site queries with
move operations and local queries issued by the query pro-
Query Translator cessor. Move operations serve two purposes. First, they are
used to gather the results of the single-site queries back to the
The query translator receives global queries expressed in query processor. These results can be integrated by the query
DAPLEX over the GS and translates them into queries ex- processor by executing a query local to itself. The integrated
pressed in an internal language over the disjoint union of LSs results may be the answer to the query, in which case they are
and IS. returned to the user. Second, they may be used as input to
To perform the translation, the query translator must use other single-site queries. In this case, a move operation is
the mapping that defines how entity types and functions of the issued to ship the data to the local DBMS that needs it. The
GS are constituted from the entity types and functions of the method by which single-site queries, move operations, and
LS and the IS. The query translator uses these mapping defi- queries local to the query processor are sequenced to produce
nitions to substitute global entity types and global functions in a correct and efficient strategy is discussed in Section 4.
the global query by their mapping definitions. The substi-
tution results in a query containing only entity types and func-
tions of the LSs and the IS. Therefore references by the global
query to entities in the GS are now expressed as references to Local Database Interface (LDI)
the actual entities at particular sites that implement the global
GS. Any extra data needed from the integration database to Local queries posed against the LSs are sent by the query
resolve incompatibilities among LSs is now explicitly refer- processor to the LDIs in an internal format. The LDI trans-
enced in the translated query. lates these local queries into programs in the local DML and
The query produced by the query translator only references programming language over the local host schema (LHS).
data in the LS and the IS. Thus, we can imagine that this query This translation is optimized to minimize the processing time
is posed against a database state that is the disjoint union of of the translated query. When the local DBMS uses a high
the LSs together with the IS. This disjoint union is a homoge- level (i.e., set-at-a-time) language, such as DAPLEX, this
neous and centralized view of the distributed heterogeneous translation is fairly direct. However, when the local DBMS
database. uses a low level (i.e., record-at-a-time) language, such as
The language used for defining the mapping between sche- CODASYL DML embedded in COBOL, this translation may
mata must be compatible with the global DML. Otherwise, it be quite complex and may require nontrivial optimization.
would be awkward to translate the query from the GS to LSs Translation methods for a file system and CODASYL lan-
and IS using conventional query modification techniques. guage are described in Section 4.
(Query modification composes the given query, which is a To do the translation, the LDI must have information about
function from GS states to answer states, with the mapping how entity types and functions in the LS are mapped to ob-
from LS and IS states to GS states, to produce a query from jects in the LHS. These mappings are defined using the rules
LS and IS states to answer states.2) Therefore, we propose to discussed below.
Integrating Heterogeneous Distributed Database Systems 491
3. SCHEMA INTEGRATION ARCHITECTURE record type corresponds to an entity type, and the attributes
of the record type correspond to functions defined on the
"Schema Integration" is the process of defining a global sche- entity type.
ma and its mapping from the existing local schemata. The If an attribute of a record type is a key (in CODASYL '
general architecture of this design process is discussed in this terminology, a key is the data item(s) declared "NO DUPLI-
section. CATE ALLOWED") then the corresponding function must
There is one local host schema (LHS) for each local data- be a totally defined one-to-one mapping. If the attribute is a
base. Each LHS can be expressed in a relational, CODASYL, repeating group (declared to have multiple occurrences in a
or a file language. To merge these LHSs we must convert them CODASYL model), then the corresponding function is a set-
into a common data model first. Otherwise, we would be valued function.
mixing relations from a relational model with record types and A set type in the CODASYL model is a mapping between
set types from a CODASYL model. Thus the first step of an owner record type and one or several member record
schema integration is to translate LHSs into Local Schemata types. A set type maps an owner record to a set of member
(LS) defined in the Functional Data Model of DAPLEX. records, or, conversely, a set type maps a member record to
The second step is to merge LSs into a GS. To do this, an a unique owner record. Therefore, a set type resembles a
integration schema which defines an integration database is function that maps an owner entity to a set of member enti-
often needed. An integration database contains: information ties, or, conversely, maps a member entity to a unique owner
about mapping between different scales used by different LSs entity.
for the same entity type; statistical information about im- In a CODASYL model, a set type implies not only certain
precise data; and other information needed for reconciling semantic information but also the existence of access paths.
inconsistency between copies of the same data stored in differ- For example a set type "work-in" between "department" and
ent databases. The integration schema and LSs are then used "employee" record types implies that the employees owned
to define a global schema. by a department work in that department. But it also implies
The overall architecture of schema integration consists of that there is an access path from a department record to the
employee records owned by that department and another ac-
a) a global schema, cess path from each employee record to its own department
b) a mapping language, record. Since the LSs will be used for query optimization, we
c) local schemata (LS) and an integration schema (IS),
d) a mechanized local-to-host schema translator, and
e) local host schemata (LHS) and local DBMSs.
type shipclass i s entity and the integration of incompatible data, are discussed in
classname string(1..24)
length stringd, subsequent sections.
draft stringd,
beam stringd,
displacement stringd,
endurance stringd, Merging Entity Types and Functions
consists-of set of ship;
end entity;
To merge two entity types, say El and E2 in Figure 9, into
type ship i s entity an entity type, say E in Figure 12, the database designer must
UIC s t r i n g d . .6) ;
VCN s t r i n g d . .5) ; first determine whether the set of entities of type El is disjoint
name stringd 26) ; from the set of entities of type E2. If El and E2 are disjoint,
type stringCl 4 ) ;
flag stringd 2); then E is simply the union of El and E2. If El and E2 are not
owner stringd 2); disjoint, then the condition under which two entities from El
hull : s t r i n g d . .4) ;
positions : se_£ Q£ trackhist; and E2 respectively are identical must be specified. To specify
consists-of_inv : shipclass; the condition under which entities are identical, entities of El
end entity:
and E2 must be able to be identified by their attributes.
Therefore, for each entity type to be merged, a function or
type system i s entity combination of functions of the entity type must be a primary
all-class : set of shipclass
all-ship : set of ship; key. Two entities from two entity types being merged can then
end entity;
Relation Platform
VesselName char(26)
class char(25)
type char(6) type EI is entity type E2 is entity
shipidl : integer; shipid2 : integer;
hull char(6) classl : codel; class2 : code2;
flag char(2) end entity: sM entity;
category char(4)
{ PIF
NOSICID
IRCS
char(4)
char(8)
char(8)
El
ell
shipidl
1212
classl
cl
E2
e21
shipid2
3440
class2
d2
el2 1240 c3 e22 3651 d3
Relation Position
{ PIF
NOSICID
DTG
latitude
char(4)
char(8)
char(10)
char(5)
el3 2341 c5 e23
Figure 9—Local schemata
4411 d4
longitude char(6)
bearing char(3) type code is. entity
course char(3) end entity;
speed char(3)
Define <a n£u function
* primary key f : (codel union code2) -> code.
Figure 7—A relational model Figure 10—Integration database
494 National Computer Conference, 1981
Creation of a New Entity Type and its Functions type supply2 is entity
sno : integer;
pno : integer;
end entity;
Merging two entity types into a single entity type is a special
Figure 14—Two local schemata
case of creating a new entity type. Essentially, a new entity
type may be created which is a combination of the existing
entity types. However, this combination does not create new
objects in the database. Rather, it simply presents many exist- Suppose a global schema with two entity types, "supplier"
ing objects of different types as objects of a single type to the and "parts," is to be designed from two local schemata shown
global schema users. Properties of the new global entities are in Figure 14. The global schema must capture all the informa-
simply those that previously existed in the local schemata. tion contained in both schemata. Notice that in the second
However, in some cases, a database designer may want to schema, "supplier" and "parts" entities do not exist, but their
design a more sophisticated global schema in which new (vir- existence is implied by the presence of supplier numbers and
tual) objects derive their properties (attributes) from many part numbers: "sno" and "pno." To capture this information,
dissimilar existing objects. An example is used to illustrate virtual "supplier" and "parts" entities corresponding to those
this process, and general principles can be drawn from the "sno" and "pno" must be created in the global schema. A
example. definition of the global schema is shown in Figure 15. Notice
that in the definition primary keys "[Link]" and
"[Link]" are used to map the new entities to existing entities
in the first schema and the implied entities in the second
type E jjs entity
shipid : integer; schema.
class : code;
end entity;
end loop; 1. Scale difference. For example, in one database four val-
Figure 13—The mapping definition of entity type E ues (cold, cool, warm, hot) are used to classify climates
Integrating Heterogeneous Distributed Database Systems 495
local schemata, integration schema, and the mapping defini- data to a central site which has large memory and computing
tions among them. It uses this information to parse, translate, power and do most of the processing there. In doing this
and decompose queries over the global schema into local planning, the "ACCESS PLANNER" tries to produce steps
queries over local schemata, and coordinates execution of the which minimize the cost of processing the query. The meaning
local queries. The structure of a GDM and its interface with of "cost" depends on the individual systems being integrated.
local DBMSs is shown in Figure 17. It may mean the amount of data moved between sites, or the
A query expressed in DAPLEX over the global schema is amount of processing time.
first parsed by the parser and a parse tree is generated. Com- The execution of the access plan is coordinated by the
ponents of the parse tree, which are entities and functions of "EXECUTION STRATEGIST." It sequences the steps of
the global schema, are then replaced by their corresponding the access plan and it makes sure that the data needed by a
definitions, which are expressed in terms of the local schemata step are there before the step is initiated.
LSs. The result is a parse tree consisting of entities and func- The "EXECUTION STRATEGIST" communicates with
tions of the local schemata. The parser is part of the query local DBMSs through the Local Database Interface (LDI).
translator. The LDIs receive "data move" and "local processing" steps
The parse tree is then simplified to eliminate the inefficient from the "EXECUTION STRATEGIST," translate these
boolean components. For example, the boolean expression steps into programs in the local query language or Data Ma-
"(a > 5)or(a < 20)" is reduced to "true," and "(a>5)and nipulation Language (DML), or call local routines to process
(a < 2)" is reduced to "false."The query simplifier is also part these steps, and translate the results of these steps into the
of the query translator. format expected by the "EXECUTION STRATEGIST."
The parse tree is then decomposed by the decomposer into The LDI may reside in a GDM if the local site does not have
subtrees. Each subtree represents a local query referencing enough memory or cpu power; otherwise it resides with the
only entities and functions of a single local schema. individual local DBMS at the local site.
The "ACCESS PLANNER" transforms the local queries The query processor to be described in this section is orient-
into "data movement" and "local processing" steps. De- ed towards the initial breadboard system. It is designed to
pending on the memory size and processing power of each handle restricted versions of the user interface language and
individual site, and the capacity of the communication chan- view mapping language with reasonable efficiency. Subse-
nels, the "ACCESS PLANNER" may move data and distrib- quent research is needed to extend the query processor to
ute the computing load among several sites, or it may move efficiently handle the unrestricted languages.
Within the "Query Processor," the database is modelled as
a collection of entity types and links. A link L from entity type
R to entity type S is a function from entities of S to entities of
R; S is called the owner entity type and R is called the member
entity type relative to L. We assume that if L links R to S, then
L, R, and S are all stored at the same site. We also assume that
Query Translator there is a database schema describing the entity types and
links of the database.
rParser, View Global Schema We will sketch the Multibase query processing strategy in
Mapper, Query and Views three steps. First, we define the set of queries that can be
posed. Second, we define the set of basic operations that
Local Schemata
Multibase is capable of executing. Third, we describe how to
LSi translate a query into a sequence of basic operations that solve
the query. Finally, we describe how to translate a local query
posed over a CODASYL local host schema into a program in
Query Processor
Integration a low level Data Manipulation Language.
Schema
:Decomposer,
Access Planner
Query Optimizer
Queries
Workspace
A query consists of a target list and a qualification. A target
list consists of a set of function terms of the form A(R) where
EXECUTION STRATEGIST R is an entity type and A is a non-link function of R. A
qualification is a conjunction of selection clauses, join clauses,
and link clauses. A selection clause is a formula of the form
(A(R) op k) where A(R) is a function term, op is one of
LDI1 LDI2 LDI3 { = , ^ , < , > , ^ , ^ = } and k is a constant. A join clause is a
formula of the form (A(R) = B(S)) where A(R) and B(S) are
DBMS1 DBMS2 DBMS3 Integration function terms. A link clause is a formula of the form
Database
(L(R) = S) where L is a link from R to S.
Figure 17—Run time query processing subsystem Let r and s be entities in R and S respectively. We say that
Integrating Heterogeneous Distributed Database Systems 497
This can be computed as follows. Suppose A l , . . . , A n be further decomposed into two or more tree queries. (In the
are fields of R l , . . . ,Rn respectively where R l , . . . ,Rn breadboard version of Multibase, we will only handle queries
are record types of Q. ( R l , . . . ,Rn need not be distinct.) whose CODASYL subqueries are tree queries; if some CO-
Augment the qualification of Q' by adding the clauses DASYL subquery is cyclic, the query cannot be processed.)
([Link] = kl)...([Link] = kn). And execute the fol- Having extracted the File and CODASYL subqueries, we
lowing program. must now choose an order for these subqueries to be exe-
cuted. As a first-cut solution, we propose to solve all File and
Result: = 0; CODASYL subqueries before processing the results of any of
for each s in S loop these subqueries at the GDM. This strategy will be an es-
kl: = [Link];....; kn: = [Link]; pecially poor performer if a File or CODASYL subquery has
Result: = Result U Q'; no selection clauses. For such cases, we recommend use of
end loop; File and CODASYL semijoin operations, so that the results
5. GDM Queries. The GDM can process any natural query of some subqueries can be used to reduce the cost of other
Q provided (1) all entity types referenced in Q are stored subqueries. However, this tactic brings us into the realm of
at the GDM, and (2) Q contains no link clauses. Suppose new query optimization algorithms and will require further
Q references entity types R l , . . . ,Rn. Q is processed by research.
constructing a request to the local DBMS (the Datacom-
puter for the initial breadboard system) of the form:
Processing CODASYL Tree Queries
for each rl in Rl where (selection clauses on Rl)
for each r2 in R2 where (selection clauses on R2) Let Q be a CODASYL tree query and QG its tree. The
and (join clauses on Rl and R2) following algorithm compiles Q into a program that solves Q.
The program contains statements of the form:
1. for r in set(s) loop... end loop ; where S owns R via
for each rn in Rn where (selection clauses onRn) set ;
and (join clauses on Rl and Rn) 2. r: = set inv(s); where R owns S via set. Note that set-inv
and (join clauses on R2 and Rn) is the inverse function of set and is always a function.
Algorithm
and (join clauses on Rn-1 and Rn).
print (target list). 1. Do a pre-order traversal of QG. The result is a list of the
nodes of QG. Call this list P.
It is important that the "for" statements be in a "reason- 2. Let R and S be nodes of QG; with R the parent of S.
able" order for performance reasons. Optimization Cases
techniques developed by Wong for the SDD-1 DM3 are R is the root of QG; replace "R" by "for r in R
directly applicable. loop" in P.
R owns S: replace "S" by "/or s in set(r)" in P.
S owns R: replace "S" by "s: = setinv(r)" in P.
Query Decomposition
3. Push loop independent assignments up as high as possi-
To solve a query Q, we must decompose it into a sequence ble.
of basic operations. Our basic strategy is to find subqueries of 4. Add an "output (target list)" statement, add selections,
Q that can be entirely solved at File and CODASYL sites, and joins as high as possible, tack on enough ends to
move the results of these subqueries to the GDM, and solve balance the fors.
the remainder of the query at the GDM.
To follow this strategy, we must isolate File and CODASYL As an example let QG be the query graph of Figure 18.
subqueries of Q. File subqueries are easy to find. We simply
find entity types in Q that are stored at File sites. For each 1. Preorder traversal: R,S,T,U,V.
such entity type R, we produce a subquery consisting of the 2. for r in R loop
selection clauses on R. for s in Ll(r) loop
Let QG be the query graph of Q. To find CODASYL t: = L2 inv(r)
subqueries, we begin by deleting from QG all entity types not for u in L3(t) loop
stored at a CODASYL site and all join clauses. Each con- v: = L4 inv(t)
nected component of the resulting graph includes entity types 3. Push up T and V; add an output statement; add ends to
and links that are stored at the same site, because no link can balance the fors,
connect two entity types stored at different sites (c.f., the for r in R loop
section on "Overall Architecture"). If a connected com- t: = L2_inv(r);
ponent is a tree, then it corresponds to a tree query and can v: = L4_inv(r)
be solved by the CODASYL site. If it has a cycle, then it must for s in Ll(r) loop
Heterogeneous Distributed Database Systems 499
The Integration Schema (IS) in the Multibase architecture plays a crucial role in addressing data integration issues, such as data inconsistencies and converting different scales used by Local Schemata (LS) for the same entity type. The IS enables merging LSs into a Global Schema (GS) by defining necessary mappings and integration information, thus facilitating a unified and coherent data view .
The Multibase system addresses data modeling differences by translating Local Host Schemata (LHS) into Local Schemata (LS) using the Functional Data Model. This step creates a common model for different local models like relational, file, or CODASYL, ensuring the higher system levels don't need to manage different data model formats .
The Multibase system utilizes the functional data model to express all Local Schemata (LS) uniformly, enabling easy mapping into the Integration Schema (IS) and Global Schema (GS). This unification simplifies handling heterogeneous databases and supports efficient global query formulation .
The Multibase system's design emphasizes extendability by allowing the addition of new features without major modifications. This is achieved through a flexible architecture and thoughtful design of schemas and mappings that can be adapted or expanded with minimal disruption as technological needs evolve .
DAPLEX functions within the Multibase system by operating on data in the functional data model, enabling comprehensive schema interaction. It simplifies user operations by utilizing high-level constructs that align well with relational and network data structures, making it user-friendly and reducing the complexity of writing queries over an integrated system .
The query graph in the Multibase system is important as it visually represents query relationships and aids in query optimization by depicting join and selection clauses. This aids in formulating efficient query execution plans, ensuring optimized resource use and faster query processing .
Schema design aid is critical as it provides the necessary tools for database designers to define Local Schemata (LS), the Global Schema (GS), and the mappings among them. It ensures that the system can integrate multiple databases efficiently, maintaining consistency and accuracy in global query results .
The Multibase system handles query decomposition by isolating subqueries solvable at File and CODASYL sites, then transferring these results to the Global Data Manager (GDM) for further processing. This strategy is significant because it allows for more localized query execution, minimizes data movement, and supports efficient query processing through distributed databases .
The Multibase query processing architecture consists of the query translator, query processor, local database interfaces (LDI), and local DBMSs. The query translator converts global queries into queries referencing local schemata. The query processor decomposes these into local queries and optimizes their execution. The LDI translates these local queries into the local Data Manipulation Language (DML) and formats the results for the query processor .
The Multibase system design leaves existing interfaces to local DBMSs intact to ensure compatibility. This approach is significant because it protects the investment in existing software, allowing the integration of various database systems without invalidating them .