Multiprocessors and multi
computers
-prajwala T R
Dept. of CSE
PESIT
Multiprocessor system interconnects
Network characteristics
• Timing
– Synchronous
– asynchronous
• Switching
– Circuit switching
– Packet switching
• Control
– Centralized
– distributed
Hierarchical bus systems
• Local bus-
buses implemented within processor chip or
PCB
provides communication path among
components mounted on board
Memory bus
Data bus
• Backplane bus
• Is printed circuit on which many connectors
are used to plug in functional boards
• VME bus
• multibusII
• Futurebus+
Backplane bus
• I/O bus
– SCSCI-small computer system interface bus
– Made of coaxial cables with taps connecting to
disks,printer.
– Interface logic
– Ex: encore bus consists of 32 bit address,64 bit
data path and 14 bit vector bus
– Clock speed 12.5MHz
SCSI bus cable
Encore ultramax multiprocessor
architecture
Cross bar switch
• Single stage
• Multistage network
– Blocking ex:omega and baseline network
– non blocking(all possible connections between
i/o)
• Cross bar networks
Single stage
Cross bar networks
• Single stage, permutation and non blocking
network
• Unary switch set to open or close and
establishes point to point connections
• N X M or N=M
• All processors send request asynchronously
and independently
design
• Multiplexer
• Arbitration logic
• Acknowledgement signal
• Memory read or writ
• 16 processors then 4 bit control lines
• Advantages
– High bandwidth
– Interface is cheaper
– Single processor send many requests to multiple
modules
• Disadvantages
– Cost effective only for small number of processors
– Not expandable once built
Multiport memory
• Solution intermediate to bus and switch
• Only one of n processor requests is honored at
a time.
• Drawback
– Not scalable
– Large number of interconnection cables
Multiport memory
Multistage and combining networks
• Omega network
• Base line networks
• Hotspot problem
– ex: memory module
– Semaphore
– Degrade performance
Fetch and add primitive
• Increments content of memory loation.
• Atomic operation
• X,e-value, increments
• When using multiprocessor, when one process
is allowed to make change no other process
can access intermediate result
• Switch performs addition of increments.
• Disadvantage-
• Requires additional switch cycles to make
entire operation atomic.
• Rela time systems ex:IBMRP3
– 512 processors
– Omega network of 128 ports
– Bandwidth 13Gbps
– 50Mhz clock
• 2 methods to solve cache coherence problem
– Snoopy protocol- to monitor the values
– Directory based protocols-no broadcasting of
values. A central directory is maintained for
modifications made in the cache
Snoopy protocols
• Snoopy protocols are used to ensure
coherence of cache.
• The mechanism are
– write invalidate
– Write update
– Write through caches
– Write back caches
– Write once protocol
Snoopy protocols contd…
• Write invalidate protocol
– Will invalidate all remote copies when local cache
block is updates
• Write update policy
– Broadcast new data to all caches containing the
copy of block
Snoopy protocols contd…
• Write through caches
– I and j processors
– VALID o INVALID
• Possible operations:
– Read by same processR(i)
– Read by different processorR( j )
– Write by same processor W(i) Write by different
processor W( j )
– Replace by same processor Z(i) Replace by different
processor Z( j )
Write back caches
• Data item states: o
– RO : Read Only (Valid state)
– RW : Read Write (Valid state)
– INV : Invalid state
• Possible operations:
– Read by same processor R(i)
– Read by different processor R( j )
– Write by same processor W(i)
– Write by different processor W( j )
– Replace by same processor Z(i) Replace by different
processor Z( j )
Write back cache
Snoopy protocols contd..
• Write-once Protocol
• First write using write-through policy
• Subsequent writes using write-back policy
• In both cases, data item copy in remote caches is invalidated
• Data item states:
– Valid :cache block consistent with main memory copy
– Reserved : data has been written exactly once and is consistent
with main memory copy
– Dirty : data is written more than once but is not consistent with
main memory copy
– Invalid :block not found in cache or is inconsistent with main
memory copy
Read hit, read miss, write hit, write miss
• Read hit: The information is supplied by the c
• Read miss: The data is read from main
memory. Check for dirty or reserved states
• Write hit-if in dirty or reserved state update to
dirty state
• Write miss-invalid state
Multilevel cache coherence
• An write invalidate is sent vertically up inorder
to invalidate the shared caches at higher level.
• Higher level caches keep track of dirty blocks.
Protocol Performance issues
Directory based protocols
• Snoopy protocols broadcast the information.
• In large network this is expensive
• Write invalidate protocol leads heavy bus
traffic
• Write update protocol –the updated data may
not be used by remote processors a lot
• Hence use directory based protocol.
• Cache directories-
– List of cached locations
– Number of pointers to specify the copies of block
– Dirty bit
• Cache directories store information on where
copies of cache block resides, list of cached
locations
• Central directory scheme
– Duplicates all cache directories.
– Consistency must be maintained.
– Drawbacks-
• Contention
• Long term searches
• Distributed directory scheme
– Each memory module holds its own directories.
– State information is local to the memory module.
– If read miss in cache 2-request sent to memory
module and memory module controller
retransmits data in cache 1.
– If write hit of c1-controller sends invalidation to all
caches.
Types of directories
• Full map directories
– Each directory has n entries where n is number of
processors.
– 2 bit-entry for processor(valid),dirty bit(whether
block overwritten)
• Steps
– Cache c3 finds block containing x is valid.
– C3 issues write request to memory module
containing x.
– Memory module invalidates requests of c1 and c2
– C1 and c2 set the b it indicating x is no longer
valid.
– Memory module sends write request to c3
– Cache c3 updates value of x
Limited directories
• Directory size problem is solved- entries only if
cache block has the value X else no entry.
• Dirix
– i –number of pointers
– X-no broadcast scheme
• Full map scheme without broadcast
• i<n pointers
• Dir2NB-pointer replacement-eviction
• Directory-set associative mapping
• Scalable protocols
• Dir I B-
– Allow more than I copies of each block of data to
exist
Chained directories
• Singly linked chain
– Initially no shared copies of x
– P1 reads x from shared memory along with chain
termination pointer.
– If p2 requires cache data it is read from p1 along
with CT.
– Memory then keeps a pointer to c2.
– Gossip protocol –info passed from individual to
individual .
• Doubly linked chain
– 2 pointer-backward and forward chain pointers.
– More memory because of more storage of 2
pointers
• Cache design alternatives
• Shared caches-no private cache,will reduce
main memory access time,second level cache
• Shared data can be non cacheable.
• Cache flushing-at synchronization ,I/O and
process migration.
Atomic operation
• Synchronization primitives
– Test and set lock
– Lock-1 set
– Lock 0-reset
– Spin lock
• Wired barrier synchronization
– Wired NOR logic
– Control vector –X
– Common monitor-Y
– Xi connected to input
– Yi output
• Xi -1-process is initiated
• Barrier set to 1-synchronization
• Only one barrier line is needed to initiate and
complete single synchronization operation
3 generations of multicomputers
Message passing schemes
• Message formats
– Fixed length of packets.
– Destination addr, sequence number
– Further dived to flits-flow control digits
– Store and forward routing-packets
– Wormhole-flits
– Size of packets-64-512 bits
Store and forward routing
• Basic units of transfer are packets
• Transmitted through series of intermediate
nodes.
• Buffers are used to store packets, then
transferred to output channels
• Latency is α number of hops
Wormhole routing
• Flits are used.
• Transmission from source to destination is
done through routers
• All flits are transmitted in order ,as
inseparable companions.
• Header flit,dataflits
• Latency is independent of distance or number
of hops
Asynchronous pipelining
• Handshaking protocol
• 1 –bit ready/request line is used between
adjacent routers.
• No global clock
virtual channels
• Virtual channel is a logical link between 2
nodes.
• Flit buffer in source node and a physical
channel between them and flit buffer at
receiver node.
• One source buffer is paired with one receiver..
buffer to form virtual channel
• Physical channel is time shared by all virtual
channels
deadlocks
• Why?
– 4 flits from 4 messages occupy 4 channels
– Circular waits
• How to detect deadlock?
– Channel dependence graph
• Deadlock avoidance
– Use virtual channels
– Can be bidirectional or unidirectional
Flow control strategies
• Packet collision
• Elements –scr buffer and dest buffer holding
slit,channel
• Packet collision resolution-
– Which packet will be allocated to channel?
– What will be done to packet which is denied
Solution 1
• Virtual cut through routing scheme
– Packet 2 temporarily stored in buffer.
– Adv- not wasting resource
– Disadv-requires use of large buffer, storage delay
– Packet buffer should not have cycles
Solution 2
• Blocking flow control
– Second packet is blocked but not abandoned
Solution 3
• Discard and retransmit
– Drops packet
– Disadv-wastage of resources,unstable delivery
rate
– Requires packet retransmission and
acknowledgement.
Solution 4
• Detour after blocked
– Results in idling the resources
– Offers flexibility
– Rerouted packet enters live stock which wastes
resources
Dimension order routing
• Deterministic
– Communication path is completely predetermined
by source and destination.
– Dimension order routing- X Y routing, E cube
routing
• Adaptive routing depends on network
conditions
E cube routing
• N=2n
• Source (s),dest(d),intermediate node (v)
(0,1…n-1),
[Link] bit ri=si-1 XOR di-1
2.V XOR 2i-1 if ri=[Link] ri=0 skip
[Link] to i+1 dimension until dest reached.
example
Adaptive routing
Multicast routing algorithms
• Communication patters
– Unicast
– Broadcast
– multicast
Routing efficiency
• Channel bandwidth
• Communication delay
• Implemented by replicating packet at
intermediate node and multiple copies of
packet reach destination.
Virtual networks
Network portioning