PIPELINING, INTRODUCTION TO PARALLEL PROCESSING
AND OPERATING SYSTEM
1
PIPELINING
greater performance can be achieved by taking advantage of
improvements in technology, such as faster circuitry.
In addition, organizational enhancements to the processor can
improve performance.
We have already seen some examples of this, such as the use of
multiple registers rather than a single accumulator, and the use
of a cache memory.
Another approach is Instruction Pipelining
2
PIPELINING
new inputs are accepted at one end before previously
accepted inputs appear as outputs at the other end
an instruction has a number of stages.
Two stage Instruction Pipeline has two stages:
Fetch
Execute
3
TWO-STAGE PIPELINING
4
PROBLEMS WITH TWO-STAGE PIPELINE
The execution time will generally be longer than the
fetch time. Thus, the fetch stage may have to wait for
some time before it can empty its buffer.
A conditional branch instruction makes the address of
the next instruction to be fetched unknown. Thus, the
fetch stage must wait until it receives the next instruction
address from the execute stage. The execute stage may
then have to wait while the next instruction is fetched.
5
SIX-STAGE PIPELINE
To gain further speedup, the pipeline must have more stages.
Fetch instruction (FI): Read the next expected instruction into a
buffer.
Decode instruction (DI): Determine the opcode and the operand
specifiers.
Calculate operands (CO): Calculate the effective address of each
source operand. This may involve displacement, register indirect,
indirect, or other forms of address calculation.
Fetch operands (FO): Fetch each operand from memory.
Operands in registers need not be fetched.
Execute instruction (EI): Perform the indicated operation and
store the result, if any, in the specified destination operand
location.
Write operand (WO): Store the result in memory. 6
SIX-STAGE PIPELINE
With this decomposition, the various stages will be of
more nearly equal duration.
For the sake of illustration, let us assume equal duration.
Using this assumption, Figure below shows that a six-
stage pipeline can reduce the execution time for 9
instructions from 54 time units to 14 time units.
7
SIX-STAGE PIPELINE
8
SIX-STAGE PIPELINE
Factors that limit performance
Ifthe six stages are not of equal duration, there will be some
waiting involved at various pipeline stages
Conditional Brach instruction-invalidate several instruction
fetches
Interrupt
9
BRANCH PENALTY
10
THE LOGIC NEEDED FOR
SIX STAGE
INSTRUCTION PIPELINE
11
DEALING WITH BRANCHES
A variety of approaches have been taken for dealing with
conditional branches:
Multiple streams
Prefetch branch target
Loop buffer
Branch prediction
Delayed branch
12
MULTIPLE STREAMS
A simple pipeline suffers a penalty for a branch
instruction because it must choose one of two
instructions to fetch next and may make the wrong
choice. A brute-force approach is to replicate the initial
portions of the pipeline and allow the pipeline to fetch
both instructions, making use of two streams.
There are two problems with this approach:
With multiple pipelines there are contention delays for access
to the registers and to memory.
Additional branch instructions may enter the pipeline (either
stream) before the original branch decision is resolved. Each
such instruction needs an additional stream. 13
PREFETCH BRANCH TARGET
When a conditional branch is recognized, the target of
the branch is prefetched, in addition to the instruction
following the branch.
This target is then saved until the branch instruction is
executed. If the branch is taken, the target has already
been prefetched.
14
LOOP BUFFER
A loop buffer is a small, very-high-speed memory
maintained by the instruction fetch stage of the pipeline
and containing the n most recently fetched instructions, in
sequence.
If a branch is to be taken, the hardware first checks
whether the branch target is within the buffer. If so, the
next instruction is fetched from the buffer
Benefits:
Reduces memory access
useful for the rather common occurrence of IF–THEN and IF–
THEN–ELSE sequences
well suited to dealing with loops, or iterations; hence the name 15
loop buffer
BRANCH PREDICTION (1)
Predict never taken
Assume that jump will not happen
Always fetch next instruction
68020 & VAX 11/780
VAX will not prefetch after branch if a page fault would
result (O/S v CPU design)
Predict always taken
Assume that jump will happen
Always fetch target instruction
16
BRANCH PREDICTION (2)
Predict by Opcode
Some instructions are more likely to result in a jump than
others
Can get up to 75% success
Taken/Not taken switch
Based on previous history
Good for loops
17
BRANCH PREDICTION (3)
Delayed Branch
Do not take jump until you have to
Rearrange instructions
18
BRANCH PREDICTION FLOWCHART
19
PARALLEL PROCESSING
A traditional way to increase system performance is
to use multiple processors that can execute in parallel to
support a given workload.
The two most common multiple-processor
organizations are symmetric multiprocessors (SMPs) and
clusters.
More recently, nonuniform memory access (NUMA)
systems have been introduced commercially.
20
Symmetric multiprocessors
An SMP consists of multiple similar processors within
the same computer, interconnected by a bus or some sort
of switching arrangement.
The most critical problem to address in an SMP is
that of cache coherence.
Each processor has its own cache and so it is possible
for a given line of data to be present in more than one
cache.
21
If such a line is altered in one cache, then both
main memory and the other cache have an invalid
version of that line. Cache coherence protocols are
designed to cope with this problem.
chip multiprocessing :
When more than one processor are implemented on a
single chip, the configuration is referred to as chip
multiprocessing.
22
Multithreaded processor:
A related design scheme is to replicate some of the
components of a single processor so that the processor can
execute multiple threads concurrently; this is known as a
multithreaded processor.
Cluster:
A cluster is a group of interconnected, whole computers
working together as a unified computing resource that can
create the illusion of being one machine.
The term whole computer means a system that can run on
its own, apart from the cluster. 23
Non-uniform memory access
A NUMA system is a shared-memory multiprocessor in
which the access time from a given processor to a word
in memory varies with the location of the memory word.
24
MULTIPLE PROCESSOR ORGANIZATION
I. Single instruction, single data stream - SISD
II. Single instruction, multiple data stream - SIMD
III. Multiple instruction, single data stream - MISD
IV. Multiple instruction, multiple data stream- MIMD
25
SINGLE INSTRUCTION, SINGLE DATA
STREAM - SISD
Single processor
Single instruction stream
Data stored in single memory
Uni-processor
26
SINGLE INSTRUCTION, MULTIPLE DATA
STREAM - SIMD
Single machine instruction Controls simultaneous
execution of a Number of processing elements on
Lockstep basis
Each processing element has associated data memory
Each instruction executed on different set of data by
different processors
Vector and array processors
27
MULTIPLE INSTRUCTION, SINGLE
DATA STREAM - MISD
Sequence of data Transmitted to set of processors
Each processor executes different instruction sequence
Never been implemented
28
MULTIPLE INSTRUCTION, MULTIPLE DATA STREAM- MIMD
Set of processors Simultaneously execute different
instruction sequences on Different sets of data
SMPs, clusters and NUMA systems
29
TAXONOMY OF PARALLEL PROCESSOR
ARCHITECTURES
30
OPERATING SYSTEM
Operating Systems Definition: The set of software
products that jointly controls the system resources and
processes using these resources on a computer.
It can be thought of as having two objectives:
Convenience: An OS makes a computer more
convenient to use.
Efficiency: An OS allows the computer system resources
to be used in an efficient manner.
31
An operating system (OS) is a software program that
manages the hardware and software resources of a
computer.
The OS performs basic tasks, such as:
Controlling and allocating memory
Prioritizing the processing of instructions
Controlling input and output devices
Facilitating networking
Managing files.
32
Layers and Views of a Computer System:
33
Operating System Services:
Program creation
Program execution
Access to I/O devices
Controlled access to files
System access
Error detection and response
Accounting
34
Kernel:
An operating system generally consists of two parts:
• Kernel space (kernel mode) and
• User space (user mode).
Kernel is the smallest and central component of an
operating system.
Its services include managing memory and devices and
also to provide an interface for software applications to
use the resources.
Additional services such as managing protection of
35
programs and multitasking may be included depending
on architecture of operating system.
There are three broad categories of kernel models
available, namely:
I. Monolithic kernel
II. Microkernel
III. Exokernel
36
Why Operating Systems?
Operating systems provide a layer of abstraction
between the programmer and the machine
It screens the complexities of the computer from the
programmer
It allows the computer to be treated as a virtual machine
37
Operating system types:
Real-time:
A real-time operating system is a multitasking operating
system that aims at executing real-time applications.
Real-time operating systems often use specialized
scheduling algorithms so that they can achieve a
deterministic nature of behavior.
The main objective of Real-Time operating systems is
their quick and predictable response to events. 38
Multi-user:
A multi-user operating system allows multiple users to
access a computer system concurrently.
Time-sharing system can be classified as multi-user
systems as they enable a multiple user access to a
computer through the sharing of time.
Single-user operating systems, as opposed to a multi-
user operating system, are usable by a single user at a
time.
39
Being able to use multiple accounts on a Windows
operating system does not make it a multi-user system.
For a UNIX-like operating system, it is possible for two
users to login at a time and this capability of the OS
makes it a multi-user operating system.
Below are some examples of multi-user operating
systems.
• Linux
• Unix
• Windows 2000
40
Distributed:
A distributed operating system manages a group of
independent computers and makes them appear to be a
single computer.
The development of networked computers that could
be linked and communicate with each other gave rise to
distributed computing.
Distributed computations are carried out on more than
one machine. When computers in a group work in
cooperation, they make a distributed system. 41
Embedded:
Embedded operating systems are designed to be used
in embedded computer systems.
They are designed to operate on small machines like
PDAs with less autonomy.
They are able to operate with a limited number of
resources.
They are very compact and extremely efficient by
design.
Windows CE and Minix 3 are some examples of 42
embedded operating systems.
CORE I3, CORE I5, CORE I7 — THE
DIFFERENCE IN A NUTSHELL
Core i7s are better than Core i5s, which are in turn better
than Core i3s.
Core i7 does not have seven cores nor does Core i3 have
three cores.
The numbers are simply indicative of their relative
processing powers.
43
RELATIVE PROCESSING POWER
Determined:
number of cores,
clock speed (in GHz), size of cache,
Intel technologies like Turbo Boost and Hyper-Threading
44
NUMBER OF CORES
The more cores there are, the more tasks (known as
threads) can be served at the same time.
The lowest number of cores can be found in Core i3
CPUs, i.e., which have only two cores. Currently, all
Core i3s are dual-core processors
45
INTEL TURBO BOOST
The Intel Turbo Boost Technology allows a processor to
dynamically increase its clock speed whenever the need
arises.
The maximum amount that Turbo Boost can raise clock
speed at any given time is dependent on the number of
active cores, the estimated current consumption, the
estimated power consumption, and the processor
temperature.
Because none of the Core i3 CPUs have Turbo Boost, the
i5-4570T can outrun them whenever it needs to.
46
CACHE SIZE
Whenever the CPU finds that it keeps on using the same
data over and over, it stores that data in its cache
with a larger cache, more data can be accessed quickly.
The Haswell (fourth generation) Core i3 processors have
either 3MB or 4MB of cache.
The Haswell Core i5s have either 4MB or 6MB of cache.
Finally, all Core i7 CPUs have 8MB of cache, except for
i7-4770R, which has 6MB.
This is clearly one reason why an i7 outperforms an i5
— and why an i5 outperforms an i3.
47
HYPER-THREADING
Strictly speaking, only one thread can be served by one core at a
time.
So if a CPU is a dual core, then supposedly only two threads can
be served simultaneously.
However, Intel has a technology called Hyper-Threading. This
enables a single core to serve multiple threads.
For instance, a Core i3, which is only a dual core, can actually
serve two threads per core. In other words, a total of four threads
can run simultaneously.
Thus, even if Core i5 processors are quad cores, since they don’t
support Hyper-Threading (again, except the i5-4570T) the
number of threads they can serve at the same time is just about
48
equal to those of their Core i3 counterparts
HYPER-THREADING
Core i7s Not only are they quad cores, they also support
Hyper-Threading.
Thus, a total of eight threads can run on them at the same
time.
Combine that with 8MB of cache and Intel Turbo Boost
Technology, which all of them have, and you’ll see what
sets the Core i7 apart from its siblings.
49
The End!
Thank You So Much!
50