Chapter 7 - Consistency and
Replication
CoSc4191 : Distributed Systems
Department of Computer Science
Semera University
Lecture by: Wolday G (MSc)
An important issue in distributed systems is the
replication of data.
Data are generally replicated to enhance reliability or
improve performance.
One of the major problems is keeping replicas consistent.
Informally, this means that when one copy is updated we
need to ensure that the other copies are updated as well;
otherwise the replicas will no longer be the same.
Reasons for Replication
There are two primary reasons for replicating data including
reliability and performance:
Reliability:-
if a file is replicated, we can switch to other replicas if there is a
crash on our replica
we can provide better protection against corrupted data; similar
to mirroring
l
performance:-
if the system has to scale in size and geographical area
place a copy of data in the proximity of the process using them,
reducing the time of access and increasing its performance; for
example a Web server is accessed by thousands of clients from
all over the world
caching is strongly related to replication; normally by clients
in replication, more bandwidth is consumed to keep replicas consistent
(cost) and a possibility of inconsistency
l
Replication as Scaling Technique
replication and caching are widely applied as scaling techniques
Scaling with respect to size and geographical area may also require
replication.
when an increasing number of processes needs to access data that
are managed by a single server.
placing a copy of data in the proximity of the process using them, the
time to access the data decreases.
l
If replication helps to improve reliability and performance, who
could be Against it ?
problem with replication is that having multiple copies may
lead to consistency problems.
Whenever a copy is modified, that copy becomes different
from the rest.
Consequently, modifications have to be carried out on all
copies to ensure consistency.
Exactly when and how those modifications need to be carried
out determines the price of replication.
Consistency Models
Consistency Models: A consistency model is essentially a
contract between processes and the data store. It says that if
processes agree to obey certain rules, the store promises to
work correctly.
Normally, a process that performs a read operation on a data
item, expects the operation to return a value that shows the
results of the last write operation on that data.
[Link] Centric Consistency Model
Replicating data poses consistency problems that cannot be
solved efficiently in a general way. Only if we loosen consistency
can there be hope for attaining efficient solutions.
consistency has always been discussed
in terms of read and write operations on shared
data available by means of (distributed) shared
memory, a (distributed) shared database, or a
(distributed) file system
we use the broader term data store, which may
be physically distributed across multiple machines
assume also that each process has a local copy of
the data store and write operations are
propagated to the other copies
A data operation is classified as a write operation
when it changes the data, and is otherwise
classitied as a read operation.
the general organization of a logical data store, physically distributed and
replicated across multiple processes
Conti ..
in a distributed system and in the absence of a global clock
and with several copies, it is difficult to know which is the
last write operation
to simplify the implementation, each consistency model
restricts what read operations return
data-centric consistency models to be discussed
1)Sequential Consistency
2) Causal Consistency
the following notations and assumptions will be used
Wi(x)a means write by process Pi to data item x with the value “a” has
been done
Ri(x)b means a read by process Pi to data item x returning the value b
has been done
1 Sequential Consistency: a data store is said to be
sequentially consistent when it satisfies the following condition:
The result of any execution is the same as if the (read and write)
operations by all processes on the data store were executed in
some sequential order and the operations of-each individual
process appear in this sequence in the order specified by its
program.
all processes see the same interleaving of operations
time does not play a role; no reference to the “most recent” write
operation
example: four processes operating on the same data item x
(a) sequentially consistent data store
the write operation of P2 appears to have taken place before that of
P1; but for all processes
(b) data store that is not sequentially consistent
to P3, it appears as if the data item has first been changed to b, and later to
a; but P4 , will conclude that the final value is b
not all processes see the same interleaving of write operations
[Link] Consistency
it is a weakening of sequential consistency
it distinguishes between events that are
potentially causally related and those that are not
e.g., y = x+5; a write on y that follows a read on x; the
writing of y may have depended on the value of x
otherwise the two events are concurrent
two processes write two different variables
if event B is caused or influenced by an earlier event, A,
causality requires that everyone else must first see A,
then B
a data store is said to be causally consistent, if it
obeys the following condition:
Writes that are potentially causally related must be
seen by all processes in the same order. Concurrent
writes may be seen in a different order on different
machines.
W2(x)b and W1(x)c are concurrent, not a requirement for processes to see
them in the same order
(a) this sequence is allowed with a casually-consistent store, but not
with sequentially consistent store
a) a violation of a causally-consistent store
b) a correct sequence of events in a causally-consistent store (R(x)a is
removed) but not with sequentially consistent store
Client centric consistency models
The consistency models described in the previous
section aim at providing a system wide consistent view
on a data store.
An important assumption is that concurrent
processes may be simultaneously updating the data
store, and that it is necessary to provide consistency in
the face of such concurrency.
Being able to handle-concurrent operations on
shared data while maintaining sequential consistency
is fundamental to distributed systems.
For performance reasons, sequential consistency
may possibly be guaranteed only when processes use
synchronization mechanisms such as transactions or
locks.
there are four client-centric consistency models
Assumptions
consider a data store that is physically distributed across multiple
machines
a process reads and writes to a locally (or nearest) available copy
and updates are propagated
assume that data items have an associated owner, the only
process permitted to modify that item, hence write- write conflicts
are avoided
the following notations are used
xi[t] denotes the version of the data item x at local copy Li at
time t
version xi[t] is the result of a series of write operations at Li
that took place since initialization; denote this set by
WS(xi[t])
if operations in WS(xi [t1 ]) have also been performed at
local copy Lj at a later time t 2 , we write WS(xi [t1 ];xj [t2]); it
means that WS(x i [t 1 ]) is part of WS(x j [t 2 ])
Conti …
The data stores we consider are characterized by the lack of
simultaneous updates, or when such updates happen, they
can easily be resolved. Most operations involve reading data.
Client centric Models
1. Monotonic Read: If a process reads the value of a data item
x, any successive read operation on x by that process will
always return that same value or a more recent value. In other
words, monotonic-read consistency guarantees that if a
process has seen a value of x at time t, it will never see an
older version of x at a later time.
the read operations performed by a single process P at two different
local copies of the same data store
a) a monotonic-read consistent data store
b) a data store that does not provide monotonic reads; there is no
guaranty that when R(x 2 ) is executed WS (x 2 ) also contains WS (x 1 )
2. Monotonic write
it may be required that write operations propagate in
the correct order to all copies of the data store
in a monotonic-write consistent data store the
following condition holds:
A write operation by a process on a data item x is
completed before any successive write operation on x
by the same process
completing a write operation means that the copy on
which a successive operation is performed reflects the
effect of a previous write operation by the same
process, no matter where that operation was initiated
the write operations performed by a single process P at two different
local copies of the same data store
a) a monotonic-write consistent data store
b) a data store that does not provide monotonic-write consistency
3. Read Your
writes:
a data store is said to provide read-your-writes consistency,
if the following condition holds:
The effect of a write operation by a process on data item
x will always be seen by a successive read operation on x
by the same process
In other words, a write operation is always completed
before a successive read operation by the same process,
no matter where that read operation takes place.
a) a data store that provides read-your-writes consistency
b) a data store that does not
4. Write follow Reads:
updates are propagated as the result of previous
read operations
a data store is said to provide writes-follow-reads
consistency, if the following condition holds:
A write operation by a process on a data item x
following a previous read operation on x by the same
process is guaranteed to take place on the same or a
more recent value of x that was read
In other words, any successive write operation by a
process on a data item x will be performed on a copy
of x that is up to date with the value most recently
read by that process.
a) a writes-follow-reads consistent data store
b) a data store that does not provide writes-follow-reads consistency
Replica management
Replica Placement :
A key issue for any distributed system that supports replication
is to decide where, when, and by whom replicas should be
placed, and subsequently which mechanisms to use for keeping
the replicas consistent.
The placement problem itself should be split into two sub-
problems: that of placing replica servers, and that of
placing content.
Replica-server placement is concerned with finding the
best locations to place a server that can host (part of) a
data store.
Content placement deals with finding the best servers
for placing content. Note that this often means that we are
looking for the optimal placement of only a single data item.
Obviously, before content placement can take place, replica
servers will have to be placed first.
Distribution protocols
Where, when, by whom copies of data are to be placed?
Figure shows: The logical organization of different kinds of copies of a
data store into three concentric rings.
Permanent Replica:
Permanent replicas can be considered as the initial set of
replicas that constitute a distributed data store.
In many cases, the number of permanent replicas is small.
Consider, for example, a Web site. Distribution of a Web site
generally comes in one of two forms.
The first kind of distribution is one in which the files that
constitute a site are replicated across a limited number of
servers at a single location.
Whenever a request comes in, it is forwarded to one of the
servers, for instance, using a round-robin strategy.
The second form of distributed Web sites is what is called
mirroring. In this case, a Web site is copied to a limited number
of servers, called mirror sites which are geographically spread
across the Internet. In most cases, clients simply choose one
of the various mirror sites from a list offered to them. Mirrored
Web sites have in common with cluster-based Web sites that
Server-Initiated Replicas
In contrast to permanent replicas, server-initiated replicas are
copies of a data store that exist to enhance performance and
which are created at the initiative of the (owner of the) data
store.
Consider, for example, a Web server placed in New York.
Normally, this server can handle incoming requests quite easily,
but it may happen that over a couple of days a sudden burst of
requests come in from an unexpected location far from the
server.
In that case, it may be worthwhile to install a number of
temporary replicas in regions where requests are coming from.
Client-initiated Replicas
An important kind of replica is the one initiated by a client.
Client-initiated replicas are more commonly known as (client)
caches.
In essence, a cache is a local storage facility that is used by a
client to temporarily store a copy of the data it has just
requested. In principle, managing the cache is left entirely to the
client.
The data store from where the data had been fetched has
nothing to do with keeping cached data consistent. However, as
we shall see, there are many occasions in which the client can
rely on participation from the data store to inform it when cached
data has become stale. Client caches are used only to improve
access times to data.