SUPERSCALAR, INSTRUCTION-LEVEL & MACHINE PARALLELISM
COMPUTER ARCHITECTURE AND ORGANIZATION (CT 211)
UNIVERSITY OF DODOMA
GROUP INFORMATION
• COURSE: COMPUTER ARCHITECTURE AND ORGANIZATION
(CT 211)
• GROUP NUMBER: 09
• PROGRAMME: CE2
• FACILITATOR: MR. BAKII JUMA
• ACADEMIC YEAR: 2025 / 2026
HARDWARE SUPPORT AND SUPERSCALAR
SUPERSCALAR
• In computer architecture, superscalar refers to a CPU design that can
execute more than one instruction per clock cycle by using multiple execution
units (ALU, FPU, load/store units) in parallel.
• A superscalar processor fetches, decodes, and executes multiple instructions
simultaneously during a single clock cycle.
A SUPERSCALAR CPU HAS
• Multiple execution units
• Advanced instruction scheduling
• Instruction-level parallelism (ILP) detection
KEY FEATURES OF SUPERSCALAR CPU
• Multiple instructions per cycle: a superscalar CPU can fetch, decode, and execute more than one
instruction in a single clock cycle. It issues independent instructions simultaneously to different
execution units.
• Dynamic instruction scheduling: the CPU decides at runtime the order in which instructions are
executed.
• Out-of-order execution (in many designs): instructions are executed as soon as their operands
are available, rather than strictly in program order.
• Register renaming: superscalar CPUs rename registers to eliminate false dependencies (WAR and
WAW hazards), which are register-name conflicts rather than true data dependencies.
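The register renaming bullet can be illustrated with a small Python sketch. This is only a toy model, not how any particular CPU implements renaming: the function name `rename`, the physical register names `p0, p1, ...`, and the three-instruction program are all assumptions made for the example.

```python
from itertools import count

def rename(instructions, arch_regs):
    """Toy register renaming (illustrative only).

    instructions: list of (dest, src1, src2) architectural register names.
    arch_regs:    all architectural registers used by the program.
    Returns the same instructions rewritten with physical register names.
    """
    fresh = count()
    # Every architectural register starts mapped to its own physical register.
    mapping = {reg: f"p{next(fresh)}" for reg in arch_regs}
    renamed = []
    for dest, src1, src2 in instructions:
        # Read the current mappings of the sources first (keeps true RAW dependencies).
        s1, s2 = mapping[src1], mapping[src2]
        # Every write gets a brand-new physical register (removes WAR and WAW conflicts).
        mapping[dest] = f"p{next(fresh)}"
        renamed.append((mapping[dest], s1, s2))
    return renamed

# Hypothetical program: r1 is written twice (WAW) and read in between (WAR).
program = [
    ("r1", "r2", "r3"),  # I1: r1 = r2 + r3
    ("r4", "r1", "r5"),  # I2: r4 = r1 + r5   (true dependence on I1)
    ("r1", "r6", "r7"),  # I3: r1 = r6 + r7   (reuses the name r1)
]
renamed = rename(program, ["r1", "r2", "r3", "r4", "r5", "r6", "r7"])
```

After renaming, I1 and I3 write different physical registers, so I3 no longer has to wait for I1 or I2, while I2 still reads the register that I1 writes.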
ADVANTAGES OF SUPERSCALAR ARCHITECTURE
• Higher performance: a superscalar CPU can execute multiple instructions in a single clock cycle
instead of just one. By issuing instructions in parallel to different execution units (such as ALU,
FPU, and load/store units), the CPU completes more work per cycle, significantly increasing
overall performance.
• Better use of hardware resources: superscalar processors have multiple execution units. Instead
of leaving these units idle, the CPU schedules independent instructions to run
simultaneously. This maximizes hardware utilization and reduces wasted processing power.
• Faster program execution: because instructions are executed in parallel, programs finish in
fewer clock cycles. This leads to faster execution of applications, improved responsiveness, and
better performance for compute-intensive tasks such as multimedia processing, scientific
computing, and gaming.
LIMITATIONS OF SUPERSCALAR ARCHITECTURE
• Complex hardware design: superscalar CPUs must analyze multiple instructions at the same
time to decide which can run in parallel.
• Higher power consumption: to execute multiple instructions per clock cycle, superscalar
processors include:
• Multiple execution units (ALUs, FPUs, load/store units)
• Large instruction windows and buffers
• Sophisticated scheduling and prediction hardware
All this extra hardware consumes more power, generates more heat, and reduces battery life in
mobile devices.
• Diminishing returns if instructions are not independent
Superscalar performance depends heavily on instruction-level parallelism (ILP).
When instructions are not independent, the CPU cannot issue several of them in the same cycle,
so performance gains become limited.
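The diminishing-returns point can be made concrete with a rough cycle-count model in Python. The model is an assumption made for illustration, not a description of a real processor: instructions issue in program order, every instruction takes one cycle, and a result can be read in the cycle after it is produced. The function name `count_cycles` and both sample programs are hypothetical.

```python
def count_cycles(instrs, width):
    """Cycles needed to issue `instrs` in order on a core that issues `width` per cycle.

    instrs: list of (dest, sources) pairs. Assumes one-cycle latency for every instruction.
    """
    if not instrs:
        return 0
    ready = {}                 # register -> first cycle in which its value can be read
    cycle, slots_used = 1, 0
    for dest, sources in instrs:
        earliest = max([ready.get(s, 1) for s in sources], default=1)
        if slots_used == width:        # no issue slot left in this cycle
            cycle, slots_used = cycle + 1, 0
        if earliest > cycle:           # an operand is not ready yet: wait for it
            cycle, slots_used = earliest, 0
        slots_used += 1
        ready[dest] = cycle + 1
    return cycle

# Four independent additions versus a chain where each one needs the previous result.
independent = [("a", ("b", "c")), ("d", ("e", "f")), ("g", ("h", "i")), ("j", ("k", "l"))]
dependent = [("a", ("b", "c")), ("d", ("a", "e")), ("f", ("d", "g")), ("h", ("f", "i"))]

print(count_cycles(independent, 1), count_cycles(independent, 2))  # prints "4 2"
print(count_cycles(dependent, 1), count_cycles(dependent, 2))      # prints "4 4"
```

Under these assumptions, doubling the issue width halves the cycle count for the independent program but gives no gain at all for the dependent chain.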
INSTRUCTION-LEVEL PARALLELISM (ILP)
The ability to execute multiple independent instructions simultaneously.
Example: in a program, the instructions A = B + C and D = E + F do not depend on each other,
so they can run in parallel.
• In contrast, with A = B + C and D = A + F the second instruction needs the result of the
first, so the two cannot run in parallel.
• ILP depends on the program structure:
Independent instructions → high ILP
Dependent instructions → low ILP
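The two examples above can be checked mechanically. The Python sketch below tests only for a read-after-write dependence, which is the kind shown in the examples; the function name `reads_result_of` and the tuple encoding of an instruction are assumptions made for illustration.

```python
def reads_result_of(first, second):
    """True if `second` reads the value that `first` writes (read-after-write)."""
    dest, _ = first
    _, sources = second
    return dest in sources

# Each instruction is encoded as (destination, (source1, source2)).
i1 = ("A", ("B", "C"))   # A = B + C
i2 = ("D", ("E", "F"))   # D = E + F
i3 = ("D", ("A", "F"))   # D = A + F

print(reads_result_of(i1, i2))  # prints "False": the pair can run in parallel
print(reads_result_of(i1, i3))  # prints "True": i3 must wait for i1
```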
HOW ILP WORKS
Modern CPUs divide instruction execution into several stages so that different instructions
can occupy different stages at the same time. The stages include:
• Instruction fetch (IF): read the instruction from memory.
• Instruction decode (ID): analyze the instruction and generate the control signals needed
to execute it.
• Execute (EX): perform the operation (arithmetic, logic, etc.).
• Memory access (MEM): read or write data (if required).
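The benefit of overlapping stages can be estimated with simple arithmetic. The Python sketch below assumes an idealized pipeline (one cycle per stage, no stalls or hazards), which real programs rarely achieve; the function names are hypothetical.

```python
def unpipelined_cycles(n_instructions, n_stages):
    # Without pipelining, each instruction passes through every stage before the next starts.
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages):
    # Ideal pipeline: the first instruction needs n_stages cycles to finish,
    # and each later instruction finishes one cycle after the previous one.
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)

# Ten instructions through the four stages listed above (IF, ID, EX, MEM).
print(unpipelined_cycles(10, 4))  # prints "40"
print(pipelined_cycles(10, 4))    # prints "13"
```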
MACHINE (HARDWARE) PARALLELISM
The ability of the CPU hardware to execute multiple instructions at once.
• Depends on:
CPU design (cores): whether the cores are superscalar
Issue width: how many instructions can be issued per cycle
Number of execution units
• Independent of the program logic; it depends only on the CPU design.
HOW MACHINE-LEVEL PARALLELISM (MLP) WORKS
• Multi-core processors: each core can execute its own instruction stream
independently.
• Simultaneous multi-threading (SMT) / Hyper-Threading: a single CPU core runs multiple
threads concurrently, filling execution units that would otherwise sit idle.
• Multiple processors (SMP, symmetric multiprocessing): two or more physical CPUs work
together to execute multiple tasks in parallel.
HARDWARE TECHNIQUES FOR PERFORMANCE ENHANCEMENT
• PIPELINING: breaks instruction execution into stages (fetch, decode, execute, etc.) so that
multiple instructions are in progress at once, each in a different stage.
• Example: a pipeline split into 10 stages instead of 5 does less work per stage, so the CPU
can run at a higher clock frequency.
• SUPERSCALAR EXECUTION: the CPU issues multiple instructions per clock cycle, using
several parallel execution units (ALU, FPU, load/store unit). This allows true parallel
instruction execution.
• Example: two ALUs allow two arithmetic instructions to execute at once.
• OUT-OF-ORDER EXECUTION: the CPU does not wait for stalled instructions. It executes
other independent instructions first, not necessarily in program order.
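Out-of-order execution can be sketched as a small scheduling loop in Python. This is a simplified model under stated assumptions, not a real scheduler: registers are assumed to be already renamed (each is written once), the issue width and latencies are made up, and the names `out_of_order_issue`, `I1` to `I4`, and `r1` to `r9` are hypothetical.

```python
def out_of_order_issue(instrs, width):
    """Return {name: issue_cycle} for a toy out-of-order core.

    instrs: list of (name, dest, sources, latency) in program order.
    Each cycle, up to `width` instructions whose operands are ready may issue,
    even if an older instruction is still waiting.
    """
    ready_at = {}          # register -> first cycle in which its value can be read
    waiting = list(instrs)
    issue_cycle = {}
    cycle = 1
    while waiting:
        issued = []
        unresolved = set() # destinations of older instructions that have not issued yet
        for ins in waiting:
            name, dest, sources, latency = ins
            can_issue = (
                len(issued) < width
                and not unresolved.intersection(sources)
                and all(ready_at.get(s, 1) <= cycle for s in sources)
            )
            if can_issue:
                issued.append(ins)
                issue_cycle[name] = cycle
                ready_at[dest] = cycle + latency
            else:
                unresolved.add(dest)
        waiting = [ins for ins in waiting if ins not in issued]
        cycle += 1
    return issue_cycle

program = [
    ("I1", "r1", ("r9",), 3),       # slow load: r1 = memory[r9]
    ("I2", "r2", ("r1", "r8"), 1),  # needs the load result, so it stalls
    ("I3", "r3", ("r7", "r6"), 1),  # independent of the load
    ("I4", "r4", ("r3", "r5"), 1),  # depends only on I3
]
schedule = out_of_order_issue(program, width=2)
print(schedule)  # prints "{'I1': 1, 'I3': 1, 'I4': 2, 'I2': 4}"
```

In this model I3 and I4 issue while I2 is still waiting for the load; a strictly in-order core would hold both of them behind I2.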
• BRANCH PREDICTION: predicts the outcome of conditional branches to avoid delays.
• SPECULATIVE EXECUTION: executes instructions ahead of time based on branch
prediction.
If the prediction is correct → results are kept. If wrong → results are discarded.
When predictions are mostly correct, this improves performance by keeping the pipeline busy
instead of stalling.
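One common textbook scheme for branch prediction is the 2-bit saturating counter. The slides do not say which predictor is meant, so the Python sketch below is only an example of the idea; the function name `count_correct` and the starting state are assumptions.

```python
def count_correct(outcomes, state=2):
    """Count correct predictions of a 2-bit saturating counter.

    outcomes: sequence of booleans, True meaning the branch was taken.
    state:    0-1 predict "not taken", 2-3 predict "taken" (assumed start: 2).
    """
    correct = 0
    for taken in outcomes:
        prediction = state >= 2
        if prediction == taken:
            correct += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct

# A loop branch taken nine times and then not taken once on exit.
loop_branch = [True] * 9 + [False]
print(count_correct(loop_branch))        # prints "9"

# A branch that alternates every time is much harder for this predictor.
alternating = [True, False] * 4
print(count_correct(alternating))        # prints "4"
```

Each wrong prediction in this model corresponds to speculative work that would have to be discarded.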
• CACHING & MEMORY HIERARCHY IMPROVEMENTS
• CACHING: uses small, fast memory between the CPU and main memory, which reduces memory
access time.
• MEMORY HIERARCHY: faster RAM types (DDR3 → DDR4 → DDR5)
• WIDER MEMORY BUSES
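To illustrate why caching reduces memory access time, the Python sketch below counts hits in a tiny direct-mapped cache. Direct mapping, the cache size, the block size, and the address trace are all assumptions chosen for the example; the function name `count_hits` is hypothetical.

```python
def count_hits(addresses, num_lines, block_size):
    """Count cache hits for a sequence of byte addresses in a direct-mapped cache."""
    lines = [None] * num_lines       # block number currently stored in each cache line
    hits = 0
    for addr in addresses:
        block = addr // block_size   # which memory block the address belongs to
        index = block % num_lines    # the single line where that block may live
        if lines[index] == block:
            hits += 1                # served from the fast cache
        else:
            lines[index] = block     # miss: fetch the block from main memory
    return hits

# Nearby addresses fall in the same 16-byte block, so most accesses hit.
trace = [0, 4, 8, 12, 16, 0]
print(count_hits(trace, num_lines=4, block_size=16))  # prints "4"
```

Only the first access to each block goes to main memory in this trace; the other four accesses are served by the cache.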
CONCLUSION
• Superscalar processors improve performance by executing
multiple instructions per cycle using parallel hardware units
and advanced techniques like pipelining, branch prediction,
and out-of-order execution. However, data dependencies,
hardware complexity, and prediction failures limit
the achievable speedup.