Lecture 3
Communication in Distributed Systems
The single most important
difference between a distributed
system and a uniprocessor
system is the
“interprocess
communication.”
• In a uniprocessor system, interprocess
communication assumes the existence of shared
memory.
• A typical example is the producer-consumer
problem.
• One process writes to - buffer -reads from
another process
• The most basic form of synchronization, the
semaphone requires one word (the semaphore
variable) to be shared.
• In a distributed system, there’s no shared memory,
so the entire nature of interprocess communication
must be completely rethought from scratch.
• All communication in distributed system is based on
message passing.
• E.g. Proc. A wants to communicate with Proc. B
• [Link] first builds a message in its own address space
• [Link] executes a system call
• [Link] OS fetches the message and sends it through
network to B.
• A and B have to agree on the meaning of the
bits being sent. For example,
• How many volts should be used to signal a 0-bit? 1-bit?
• How does the receiver know which is the last bit of the
message?
• How can it detect if a message has been damaged or lost?
• What should it do if it finds out?
• How long are numbers, strings, and other data items? And
how are they represented?
OSI (Open System Interconnection
Reference
Machine 1
model) Machine 2
Process A Process B
Application protocol
Application Application
Presentation protocol
Presentation Presentation
Interface Interface
Session protocol
Session Sessionn
Transport protocol
Transport Transport
Network protocol
Network Network
Data link protocol
Data link Data link
Physical protocol
Physical Physical
Network
Client-Server Model
Request
Client Server
Reply
Kernel Kernel
Network
Client-Server Model Layer
5 Request/Reply
4
3
2 Data link
1 Physical
Advantages
• Simplicity: The client sends a request and gets an
answer. No connection has to be established.
• Efficiency: just 3 layers. Getting packets from client
to server and back is handled by 1 and 2 by
hardware: an Ethernet or Token ring. No routing is
needed and no connections are established, so
layers 3 and 4 are not needed. Layer 5 defines the
set of legal requests and replies to these requests.
• two system calls: send (dest, &mptr), receive (addr,
&mptr)
#include <header.h>
int copy (char *src, char *dst) /* procedure to copy file using the server */
{ struct message m1; /* message buffer */
long position; /* current file position */
long client = 110; /* client’s address */
initialize(); /* prepare for execution */
position = 0;
Addressing
1. The server’s address is simply hardwired as a
constant
[Link] # + Process #: 243.4 199.0
[Link] # + local-id
Disadvantage: it is not transparent to the user. If the server is changed
from 243 to 170, the program has to be changed.
4. Assign each process a unique address that does
not contain an embedded machine number.
One way to achieve this is to have a centralized process address allocator
that simply maintains a counter. Upon receiving a request for an address,
it simply returns the current value of the counter and increment it by one.
Disadvantage: centralize does not scale to large systems.
5. Let each process pick its own id from a large,
sparse address space, such as the space of 64-bit
binary integers.
Problem: how does the sending kernel know what machine to send the
message to?
Solution:
a) The sender can broadcast a special “locate packet” containing the
address of the destination process.
b) All the kernel check to see if the address is theirs.
c) If so, send back “here I am” message giving their network address
(machine number).
Disadvantage: broadcasting puts extra load on the system.
6. Provide an extra machine to map high-level (ASCII)
service names to machine addresses. Servers can be
referred to by ASCII strings in the program.
Disadvantage: centralized component: the name server
7. Use special hardware. Let process pick random
address. Instead of locating them by broadcasting,
locate them by hardware.
Blocking versus Nonblocking
Primitives
Client blocked
Client running Client running
Trap to Return from kernel,
kernel, process released
Process blocked
Message being sent
Blocking send primitive
Nonblocking send primitive
Client
blocked
Client running
Client running
Trap Return
Message Message being sent
copied to
kernel
buffer
Nonblocking primitives
• Advantage: can continue execution without waiting.
• Disadvantage: the sender cannot modify the
message buffer until the message has been sent
and it does not know when the transfer can
complete. It can hardly avoid touching the buffer
forever.
Solutions to the drawbacks of
nonblocking primitives
1. To have the kernel copy the message to an
internal kernel buffer and then allow process to
continue.
Problem: extra copies reduce the system performance.
2. Interrupt the sender when the message has been
sent
Problem: user-level interrupts make programming tricky,
difficult, and subject to race conditions.
Buffered versus Unbuffered
Primitives
No buffer allocated. Fine if receive() is called before
send().
Buffers allocated, freed, and managed to store the
incoming message. Usually a mailbox created.
Reliable versus Unreliable
Primitives
• The system has no guarantee about message being
delivered.
• The receiving machine sent an acknowledgement
back. Only when this ack is received, will the
sending kernel free the user (client) process.
• Use reply as ack.
Implementing the client-server
model
Item Option 1 Option 2 Option 3
Addressing Machine number Sparse process ASCII names
address looked up via
server
Blocking Blocking Nonblocking with Nonblocking with
primitives copy to kernel interrupt
Buffering Unbuffered, Unbuffered, Mailboxes
discarding temporarily keeping
unexpected unexpected messages
messages
Reliability Unreliable Request-Ack-Reply Request-Reply-Ack
Ack
Acknowledgement
• Long messages can be split into multiple packets. For
example, one message: 1-1, 1-2, 1-3; another message:
2-1, 2-2, 2-3, 2-4.
• Ack each individual packet
Advantage: if a packet is lost, only that packet has to be retransmitted.
Disadvantage: require more packets on the network.
• Ack entire message
Advantage: fewer packets
Disadvantage: more complicated recovery when a packet is lost. (Because
retransmit the entire message).
Code Packet type From To Description
REQ Request Client Server The client wants service
REP Reply Server Client Reply from the server to the
client
ACK Ack Either Other The previous packet arrived
AYA Are you Client Server Probe to see if the server has
alive? crashed
IAA I am alive Server Client The server has not crashed
TA Try again Server Client The server has no room
AU Address Server Client No process is using this
unknown address
Some examples of packet
exchanges for client-server
communication
REQ
Client Server
REP
REQ
Client ACK Server
REP
ACK
REQ
ACK
Client AYA Server
IAA
REP
ACK
Remote Procedure Call
• The idea behind RPC is to make a remote procedure
call look as much as possible like a local one.
• A remote procedure call occurs in the following
steps:
Remote procedure call steps:
• The client procedure calls the client stub in the normal way.
• The client stub builds a message and traps to the kernel.
• The kernel sends the message to the remote kernel.
• The remote kernel gives the message to the server stub.
• The server stub unpacks the parameters and calls the server.
• The server does the work and returns the result to the stub.
• The server stub packs it in a message and traps to the kernel.
• The remote kernel sends the message to the client’s kernel.
• The client’s kernel gives the message to the client stub.
• The stub unpacks the result and returns to the client.
Remote Procedure Call
Client machine Client stub Server stub Server machine
Call Pack parameters Unpack Call
parameters
Client Server
Unpack result
Return Pack result
Return
Kernel Kernel
Message transport
over the network
Parameter Passing
• little endian: bytes are numbered from right to left
• big endian: bytes are numbered from left to right
2 1 0
0 3 0 0 5
7 6 5 4
L L I J
1 2 3
5 0 0 0 0
4 5 6 7
J I L L
How to let two kinds of machines
talk to each other?
• a standard should be agreed upon for representing
each of the basic data types, given a parameter list
(n parameters) and a message.
• devise a network standard or canonical form for
integers, characters, Booleans, floating-point
numbers, and so on.
• Convert to either little endian/big endian. But
inefficient.
• use native format and indicate in the first byte of
the message which format this is.
How are pointers passed?
• not to use pointers. Highly undesirable.
• copy the array into the message and send it to the
server. When the server finishes, the array can be
copied back to the client.
• distinguish input array or output array. If input, no
need to be copied back. If output, no need to be
sent over to the server.
• still cannot handle the most general case of a
pointer to an arbitrary data structure such as a
complex graph.
How can a client locate the server?
hardwire the server network address into the client.
Disadvantage: inflexible.
use dynamic binding to match up clients and servers.
Dynamic Binding
• Server: exports the server interface.
• The server registers with a binder (a program), that
is, gives the binder its name, its version number, a
unique identifier, and a handle.
• The server can also deregister when it is no longer
prepared to offer service.
How the client locates the server?
• When the client calls one of the remote procedure “read”
for the first time, the client stub sees that is not yet bound
to a server.
• The client stub sends message to the binder asking to
import version 3.1 of the file-server interface.
• The binder checks to see if one or more servers have already
exported an interface with this name and version number.
• If no server is willing to support this interface, the “read” call
fails; else if a suitable server exists, the binder gives its
handle and unique identifier to the client stub.
• The client stub uses the handle as the address to send the
request message to.
Advantages
• It can handle multiple servers that support the
same interface
• The binder can spread the clients randomly over the
servers to even the load
• It can also poll the servers periodically,
automatically deregistering any server that fails to
respond, to achieve a degree of fault tolerance
• It can also assist in authentication. Because a server
could specify it only wished to be used by a specific
list of users
Disadvantage
• the extra overhead of exporting and importing interfaces cost time.
Server Crashes
• The server can crash before the execution or after
the execution
• The client cannot distinguish these two.
• The client can:
• Wait until the server reboots and try the operation
again (at least once semantics).
• Gives up immediately and reports back failure (at
most once semantics).
• Guarantee nothing.
Client Crashes
• If a client sends a request to a server and crashes
before the server replies, then a computation is
active and no parent is waiting for the result. Such
an unwanted computation is called an orphan.
Problems with orphans
• They waste CPU cycles
• They can lock files or tie up valuable resources
• If the client reboots and does the RPC again, but the
reply from the orphan comes back immediately
afterward, confusion can result
What to do with orphans?
• Extermination: Before a client stub sends an RPC
message, it makes a log entry telling what it is
about to do. After a reboot, the log is checked and
the orphan is explicitly killed off.
• Disadvantage: the expense of writing a disk record
for every RPC; it may not even work, since orphans
themselves may do RPCs, thus creating
grandorphans or further descendants that are
impossible to locate.
• Reincarnation: Divide time up into sequentially
numbered epochs. When a client reboots, it
broadcasts a message to all machines declaring the
start of a new epoch. When such a broadcast comes
in, all remote computations are killed.
• Gentle reincarnation: when an epoch broadcast
comes in, each machine checks to see if it has any
remote computations, and if so, tries to locate their
owner. Only if the owner cannot be found is the
computation killed.
• Expiration: Each RPC is given a standard amount of
time, T, to do the job. If it cannot finish, it must
explicitly ask for another quantum. On the other
hand, if after a crash the server waits a time T
before rebooting, all orphans are sure to be gone.
• None of the above methods are desirable.
Implementation Issues
• the choice of the RPC protocol: connection-oriented or
connectionless protocol?
• general-purpose protocol or specifically designed protocol
for RPC?
• packet and message length
• Acknowledgements
• Flow control
overrun error: with some designs, a chip cannot accept two
back-to-back packets because after receiving the first one,
the chip is temporarily disabled during the packet-arrived
interrupt, so it misses the start of the second one.
How to deal with overrun error?
• If the problem is caused by the chip being disabled
temporarily while it is processing an interrupt, a
smart sender can insert a delay between packets to
give the receiver just enough time.
• If the problem is caused by the finite buffer capacity
of the network chip, say n packets, the sender can
send n packets, followed by a substantial gap.
Timer Management
Current time Current time
14200 14200
Process table
14205
Process 3 0
14216
1
14212 0
Process 2
2
14212
14216
Process 0 3
14205
Group Communication
• RPC can have one-to-one communication (unicast) one-to-many
communication (multicast) and one-to-all communication
(broadcast).
• Multicasting can be implemented using broadcast. Each machine
receives a message. If the message is not for this machine, then
discard.
• Closed groups: only the member of the group can
send messages to the group. Outsiders cannot.
• Open groups: any process in the system can send
messages to the group.
• Peer group: all the group members are equal.
Advantage: symmetric and has no single point of failure.
Disadvantage: decision making is difficult. A vote has to be taken.
• Hierarchical group: coordinator
Advantage and disadvantage: opposite to the above
Group Membership Management
• Centralized way: group server maintains a complete data
base of all the groups and their exact membership.
Advantage: straightforward, efficient, and easy to implement.
Disadvantage: single point of failure.
• Distributed way: an outsider sends to message to all group
members to join and sends a goodbye message to everyone
to leave.
Group Addressing
• A process just sends a message to a group address
and it is delivered to all the members. The sender is
not aware of the size of the group or whether
communication is implemented by multicasting,
broadcasting, or unicasting.
• Require the sender to provide an explicit list of all
destinations (e.g., IP addresses).
• Each message contains a predicate (Boolean
expression) to be evaluated. If it is true, accept; If
false, discard.
Send and Receive Primitives
• If we wish to merge RPC and group communication, to send
a message, one of the parameters of send indicates the
destination. If it is a process address, a single message is
sent to that one process. If it is a group address, a message
is sent to all members of the group.
Atomicity
• How to guarantee atomic broadcast and fault
tolerance?
• The sender starts out by sending a message to all
members of the group. Timers are set and
retransmissions sent where necessary. When a process
receives a message, if it has not yet seen this particular
message, it, too, sends the message to all members of
the group (again with times and retransmissions if
necessary). If it has already seen the message, this step
is not necessary and the message is discarded. No
matter how many machines crash or how many packets
are lost, eventually all the surviving processes will get
the message.
Message Ordering
• Use global time ordering, consistent time ordering.