Fall 2025 (Abasyn University, Islamabad)
Sunday, 12 noon to 3 pm
Lecture # 3 (14 Sep 2025)
Dr. Adnan Masood
Mobile & WhatsApp: 0334-5344375
Email: am34029@[Link]
There are two kinds of parallelism in applications:
◦ Data-Level Parallelism (DLP) arises because there are many data items that can
be operated on at the same time.
◦ Task-Level Parallelism (TLP) arises because tasks of work are created that can
operate independently and largely in parallel.
Computer hardware in turn can exploit these two kinds of
application parallelism in four major ways:
◦ Instruction-Level Parallelism exploits data-level parallelism at modest levels
using pipelining etc. and at medium levels using speculative execution etc.
◦ Vector Architectures and Graphics Processing Units (GPUs) exploit data-level
parallelism by applying a single instruction to a collection of data in parallel.
◦ Thread-Level Parallelism exploits either data-level parallelism or task-level
parallelism in a tightly coupled hardware model.
◦ Request-Level Parallelism exploits parallelism among largely decoupled tasks
specified by the programmer or the operating system.
In 1966, Michael Flynn studied the parallel computing efforts of the 1960s and found a simple classification
that we still use today. He placed all computers in one of four categories. His taxonomy is a coarse model:
Single instruction stream, single data stream (SISD)—This category
is the uniprocessor, but it can exploit instruction-level parallelism.
Single instruction stream, multiple data streams (SIMD)—The same
instruction is executed by multiple processors using different data
streams. SIMD computers exploit data-level parallelism by applying
the same operations to multiple items of data in parallel.
Multiple instruction streams, single data stream (MISD)—No
commercial multiprocessor of this type has been built to date.
Multiple instruction streams, multiple data streams (MIMD)—Each
processor fetches its own instructions and operates on its own data,
and it targets task-level parallelism.
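The four categories above form a two-by-two grid over the number of instruction streams and the number of data streams. A minimal Python sketch of that lookup follows; the function name `flynn_class` and its boolean parameters are hypothetical, chosen only for illustration.

```python
def flynn_class(multiple_instruction_streams: bool, multiple_data_streams: bool) -> str:
    """Return Flynn's category name (illustrative helper, not a standard API)."""
    instruction_part = "M" if multiple_instruction_streams else "S"
    data_part = "M" if multiple_data_streams else "S"
    return f"{instruction_part}I{data_part}D"


# A uniprocessor: one instruction stream, one data stream.
print(flynn_class(False, False))
# A vector unit or GPU: one instruction applied to many data items.
print(flynn_class(False, True))
# A multiprocessor where each processor fetches its own instructions and data.
print(flynn_class(True, True))
```

The sketch only names the category; as noted above, the taxonomy is a coarse model and does not capture every design detail of a real machine.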
Old View: Several years ago, the term computer architecture often
referred only to instruction set design. Other aspects of computer
design were called implementation, often implying that
implementation is less challenging.
Current View: The architect’s or designer’s job is much more than
instruction set design, and the technical hurdles in the other
aspects of the project are likely more challenging than those
encountered in instruction set design.
Instruction set architecture (ISA) refers to the portion of the
computer (instruction set) visible to the programmer or compiler
writer. The ISA serves as the boundary between the software and
hardware.
This review of ISA uses examples from 80x86, Advanced RISC
Machine (ARM), and RISC-V to illustrate the seven dimensions of an
ISA.
The most popular RISC processors come from ARM; they were in
14.8 billion chips shipped in 2015, roughly 50 times as many
chips as shipped with 80x86 processors.
Class of ISA: Nearly all ISAs today are general-purpose register architectures
Memory addressing: All three use byte addressing to access memory operands
Addressing modes: Specify the address of a memory object
Types and sizes of operands: All three support operand sizes of 8, 16, 32, and 64 bits
Operations: The general categories of operations are data transfer, arithmetic logical, control, and floating point
Control flow instructions: All three support conditional branches, unconditional jumps, procedure calls, and returns
Encoding an ISA: Two types of encoding, fixed length and variable length
ARM and RISC-V instructions are 32 bits long
80x86 instructions are variable length, ranging from 1 to 18 bytes (144 bits)
Class of ISA
Nearly all ISAs today are classified as general-purpose register
architectures, where the operands are either registers or memory
locations.
The 80x86 has 16 general-purpose registers and 16 that can hold
floating point data, while RISC-V has 32 general-purpose and 32
floating-point registers.
The two popular versions of this class are:
◦ Register-memory ISAs (80x86), which can access memory as part
of many instructions.
◦ Load-store ISAs (ARM and RISC-V), which can access memory
only with load or store instructions.
All ISAs announced since 1985 are load-store.
Memory Addressing
Virtually all desktop and server computers, including the 80x86,
ARM, and RISC-V, use byte addressing to access memory operands.
Addressing Modes
In addition to specifying registers and constant operands,
addressing modes specify the address of a memory object.
Types and Sizes of Operands
Like most ISAs, 80x86, ARM, and RISC-V support operand sizes of 8-
bit (ASCII character), 16-bit (Unicode character or half word), 32-bit
(integer or word), 64-bit (double word or long integer), and IEEE
754 floating point in 32-bit (single precision) and 64-bit (double
precision). The 80x86 also supports 80-bit floating point (extended
double precision).
Operations
The general categories of operations are data transfer, arithmetic
logical, control, and floating point.
RISC-V is a simple and easy-to-pipeline ISA, and it is representative
of the RISC architectures being used in 2017. The 80x86 has a much
richer and larger set of operations.
Control Flow Instructions
Virtually all ISAs, including these three, support conditional
branches, unconditional jumps, procedure calls, and returns.
All three use program counter (PC)-relative addressing, where the
branch address is specified by an address field that is added to the
PC.
Encoding an ISA
There are two basic choices on encoding: fixed length and variable
length. All ARM and RISC-V instructions are 32 bits long, which
simplifies instruction decoding.
The 80x86 encoding is variable length, ranging from 1 to 18 bytes (144 bits).
Variable length instructions can take less space than fixed-length
instructions, so a program compiled for the 80x86 is usually smaller
than the same program compiled for RISC-V.
The implementation of a computer has two components:
organization and hardware. Organization includes the high-level
aspects of a computer’s design, such as the memory system, the
memory interconnect, and the design of the processor (CPU).
The term microarchitecture is also used instead of organization. For
example, two processors with the same ISAs but different
organizations are the AMD Opteron and the Intel Core i7. Both
processors implement the x86 instruction set, but they have very
different pipeline and cache organizations.
The switch to multiple processors per microprocessor led to the
term core to also be used for processor. Instead of saying
multiprocessor microprocessor, the term multicore has caught on.
Hardware refers to the specifics of a computer, including the
detailed logic design and the packaging technology of the
computer.
Often a line of computers contains computers with identical ISAs
and nearly identical organizations, but they differ in the detailed
hardware implementation. For example, the Intel Core i7 and the
Intel Xeon E7 are nearly identical but offer different clock rates and
different memory systems, making the Xeon E7 more effective for
server computers.
In our course textbook, the word architecture covers all three
aspects of computer design—ISA, organization or
microarchitecture, and hardware.
Computer architects must design a computer to meet functional
requirements as well as price, power, performance, and availability
goals. Figure 1.7 summarizes requirements to consider in designing
a new computer.
Often, architects also must determine what the functional
requirements are, which can be a major task. The requirements
may be specific features inspired by the market.
Application software often drives the choice of certain functional
requirements by determining how the computer will be used. If a
large body of software exists for a certain ISA, the architect may
decide that a new computer should implement an existing
instruction set.
If an ISA is to be successful, it must be designed to survive rapid
changes in computer technology as a successful new ISA may last
decades—for example, the core of the IBM mainframe has been in
use for nearly 50 years.
To plan for the evolution of a computer, the designer must be aware
of rapid changes in implementation technology. Five
implementation technologies, which change at a dramatic pace, are
critical to modern implementations:
◦ IC logic technology
◦ Semiconductor DRAM (dynamic random-access memory)
◦ Semiconductor Flash memory
◦ Magnetic disk technology
◦ Network technology
IC Logic Technology
Transistor density increases by about 35% per year, quadrupling
somewhat over four years. Increases in die size are less predictable
and slower, ranging from 10% to 20% per year.
The combined effect is a growth rate in transistor count on a chip of
about 40% to 55% per year, or doubling every 18 to 24 months
(Moore’s law).
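The link between an annual growth rate and a doubling (or quadrupling) time is compound growth. The following is a small Python sketch of that arithmetic, using only the percentages quoted above; the helper name `years_to_multiply` is an assumption made for illustration.

```python
import math


def years_to_multiply(annual_growth: float, factor: float = 2.0) -> float:
    """Years needed to grow by `factor` at a compound annual growth rate.

    annual_growth is a fraction, e.g. 0.35 for 35% per year.
    """
    return math.log(factor) / math.log(1.0 + annual_growth)


# Transistor density: 35% per year, time to quadruple.
print(round(years_to_multiply(0.35, factor=4.0), 2), "years to quadruple")
# Transistor count per chip: 40% to 55% per year, time to double, in months.
print(round(12 * years_to_multiply(0.40), 1), "months to double at 40%")
print(round(12 * years_to_multiply(0.55), 1), "months to double at 55%")
```

The results come out near 4.6 years for quadrupling and roughly 19 to 25 months for doubling, which is consistent with "somewhat over four years" and with the approximate 18 to 24 month range.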
Moore’s Law is no more. The number of devices per chip is still
increasing, but at a decelerating rate.
Unlike in the Moore’s Law era, the doubling time is expected to be
stretched with each new technology generation.
Typical NPN transistor. Size of die is roughly 1 mm × 1 mm.
The F8680 is an x86-compatible system-on-chip (SoC).
Semiconductor DRAM (Dynamic Random-Access Memory)
This technology is the foundation of main memory. The growth of
DRAM has slowed dramatically, from quadrupling every three years
to doubling every two to three years.
The 8-gigabit DRAM was shipped in 2014, but the 16-gigabit DRAM
reached that state in 2019.
The rate of improvement has continued to slow over the editions of
the course textbook, as Figure 1.8 shows.
Figure 1.8 Change in rate of improvement in DRAM capacity over time.
Semiconductor Flash Memory
Flash memory was invented by Fujio Masuoka at Toshiba in 1980
and is based on Electrically Erasable Programmable Read-only
Memory (EEPROM) technology.
This nonvolatile semiconductor memory is the standard storage
device in PMDs, and its rapidly increasing popularity has fueled its
rapid growth rate in capacity.
Recently, the capacity per Flash chip increased by about 50%–60%
per year, doubling roughly every 2 years. In 2019, Flash memory
was 8–10 times cheaper per bit than DRAM.
Magnetic Disk Technology
This technology is central to server and warehouse scale storage.
Prior to 1990, density increased by about 30% per year, doubling in
three years. It rose to 60% per year thereafter, and increased to
100% per year in 1996.
Between 2004 and 2011, it dropped back to about 40% per year, or
doubled every two years. Recently, disk improvement has slowed to
less than 5% per year.
Hard disk drives are now 8–10 times cheaper per bit than Flash and
200–300 times cheaper per bit than DRAM.
Network Technology
Network performance depends both on the performance of
switches and on the performance of the transmission system.
These rapidly changing technologies shape the design of a
computer that, with speed and technology enhancements, may
have a lifetime of three to five years.
Key technologies such as Flash change sufficiently that the designer
must plan for these changes. Indeed, designers often design for the
next technology, knowing that, when a product begins shipping in
volume, the next technology may be the most cost-effective or
may have performance advantages.
Traditionally, cost has decreased at about the rate at which density
increases.
Although technology improves continuously, the impact of these
improvements can be in discrete leaps, as a threshold that allows a
new capability is reached.
Bandwidth or throughput is the total amount of work done in a given
time, such as megabytes per second for a disk transfer. Latency or
response time is the time between the start and the completion of an
event, such as milliseconds for a disk access.
Figure 1.9 plots the relative improvement in bandwidth and latency for
microprocessors, memory, networks, and disks.
Performance is the primary differentiator for microprocessors and
networks, so they have seen the greatest gains: 32,000–40,000 x in
bandwidth and 50–90 x in latency.
Capacity is generally more important than performance for memory and
disks, so capacity has improved more, yet bandwidth advances of 400–
2400 x are still much greater than gains in latency of 8–9 x.
Conclusion: Bandwidth has outpaced latency. A simple rule of thumb is
that bandwidth grows by at least the square of improvement in latency.
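The rule of thumb can be checked against the ranges quoted above. The Python sketch below does so; pairing the low ends together and the high ends together is an assumption made for illustration, and the helper name `satisfies_square_rule` is hypothetical.

```python
def satisfies_square_rule(bandwidth_gain: float, latency_gain: float) -> bool:
    """True if the bandwidth gain is at least the square of the latency gain."""
    return bandwidth_gain >= latency_gain ** 2


# (bandwidth gain, latency gain) pairs taken from the ranges quoted above.
quoted_gains = {
    "microprocessors and networks": [(32_000, 50), (40_000, 90)],
    "memory and disks": [(400, 8), (2_400, 9)],
}

for name, pairs in quoted_gains.items():
    for bandwidth_gain, latency_gain in pairs:
        ok = satisfies_square_rule(bandwidth_gain, latency_gain)
        print(f"{name}: bandwidth {bandwidth_gain}x, "
              f"latency squared {latency_gain ** 2}x, rule holds: {ok}")
```

For all four pairs the bandwidth gain exceeds the squared latency gain, so the quoted figures agree with the rule.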
Figure 1.9 Log–log plot of bandwidth and latency milestones.
IC processes are characterized by the feature size, which is the
minimum size of a transistor or a wire in either the x or y
dimension.
Feature sizes decreased from 10 µm in 1971 to 0.016 µm in 2017; in
fact, units have been switched, so production in 2017 was referred
to as “16 nm,” and 7 nm chips were under way.
As the transistor count per square millimeter of silicon is
determined by the transistor surface area, the density of transistors
increases quadratically (that is, proportionally to the square) with a
linear decrease in feature size.
The increase in transistor performance is more complex;
approximately, transistor performance (that is, switching speed)
improves linearly with decreasing feature size.
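The quadratic density relationship can be illustrated with the two feature sizes quoted above. This is an idealized Python sketch that assumes density scales exactly as one over the square of the feature size; real processes deviate from this, and the helper name `density_gain` is hypothetical.

```python
def density_gain(old_feature_um: float, new_feature_um: float) -> float:
    """Ideal gain in transistors per unit area, assuming density ~ 1 / feature_size**2."""
    linear_shrink = old_feature_um / new_feature_um
    return linear_shrink ** 2


# Feature sizes quoted above: 10 um in 1971 and 0.016 um (16 nm) in 2017.
linear_shrink = 10 / 0.016
print(f"linear shrink: {linear_shrink:.0f}x")
print(f"ideal density gain: {density_gain(10, 0.016):.0f}x")
```

Under this idealized model, a 625-fold linear shrink corresponds to a density gain of roughly 390,000 times, while the approximate linear rule for switching speed would give only about 625 times.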
Although transistors generally improve in performance with
decreased feature size, wires in an IC do not. The signal delay for a
wire increases in proportion to the product of its resistance and
capacitance.
As feature size shrinks, wires get shorter, but the resistance and
capacitance per unit length get worse. This relationship is complex,
since both resistance and capacitance depend on detailed aspects of
the process, the geometry of a wire, the loading on a wire, and
even the adjacency to other structures.
In general, wire delay scales poorly compared to transistor
performance. It has become a major design limitation for large ICs
and is often more critical than transistor switching delay.
Power = Energy / Time
Power is the rate of doing work, and work is energy.
Energy is the biggest challenge facing the computer designer in two
ways:
◦ Power must be brought in and distributed around the chip, and
modern microprocessors use hundreds of pins and multiple
interconnect layers just for power and ground.
◦ Power is dissipated as heat and must be removed (P = I²R).
From the viewpoint of a system designer, there are three primary
concerns:
◦ The maximum power a processor ever requires
◦ Thermal design power (TDP) i.e. the sustained power
consumption
◦ Energy and energy efficiency
Meeting the maximum power demand of a processor can be
important to ensuring correct operation. For example, if a processor
attempts to draw more power than a power supply system can
provide (by drawing more current than the system can supply, since
Power = Voltage × Current), the result is typically a voltage drop,
which can cause the device to malfunction.
Modern processors can vary widely in power consumption with high
peak currents. Hence, they provide voltage indexing methods that
allow the processor to slow down and regulate voltage within a
wider margin, which obviously decreases performance.
The sustained power consumption (continued power over a period of time)
is widely called the thermal design power (TDP), since it determines the
cooling requirement.
A typical power supply for a system is usually sized to exceed the TDP, and
a cooling system is usually designed to match or exceed TDP. Failure to
provide adequate cooling will allow the junction temperature in the
processor to exceed its maximum value, resulting in device failure and
possibly permanent damage.
Modern processors provide two features to assist in managing heat, since
the maximum power (and hence heat and temperature rise) can exceed
the long-term average specified by the TDP.
First, as the thermal temperature approaches the junction temperature
limit, circuitry reduces the clock rate, thereby reducing power. Should this
technique not be successful, a second thermal overload trip is activated
to power down the chip.
Since power is energy per unit time (1 watt = 1 joule per second),
energy is always a better metric than power for comparing
processors because it is tied to a specific task and the time required
for that task.
In particular, the energy to complete a workload is equal to the
average power times the execution time for the workload.
Thus, if we want to know which of two processors is more efficient
for a given task, we should compare energy consumption (not
power) for executing the task.
Example: Processor A may have a 20% higher average power
consumption than processor B, but if A executes the task in only
70% of the time needed by B, its energy consumption will be 1.2 ×
0.7 = 0.84 of B's, which is clearly better.
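The comparison in the example can be written as a short calculation. The Python sketch below uses the relation stated above (energy = average power × execution time); the function name `relative_energy` and its parameters are assumptions made for illustration.

```python
def relative_energy(power_ratio: float, time_ratio: float) -> float:
    """Energy of processor A relative to B, from energy = average power x execution time."""
    return power_ratio * time_ratio


# Processor A: 20% higher average power, finishes in 70% of B's time.
ratio = relative_energy(power_ratio=1.2, time_ratio=0.7)
print(f"A uses {ratio:.2f} of B's energy for the same task")
```

A ratio below 1 means A is the more energy-efficient processor for this task, even though its average power is higher.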