0% found this document useful (0 votes)
5 views12 pages

Mod1 Parallel Processing Answers

The document provides an overview of parallel processing and computer architecture, covering topics such as the theory of parallelism, Flynn's classification, and the evolution of parallelism in uniprocessor computers. It discusses the differences between implicit and explicit parallelism, levels of parallelism, and the roles of hardware and software in achieving parallelism. Additionally, it compares RISC and CISC architectures and highlights the importance of pipelining and multiple functional units in enhancing performance.

Uploaded by

ashiksingh151
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views12 pages

Mod1 Parallel Processing Answers

The document provides an overview of parallel processing and computer architecture, covering topics such as the theory of parallelism, Flynn's classification, and the evolution of parallelism in uniprocessor computers. It discusses the differences between implicit and explicit parallelism, levels of parallelism, and the roles of hardware and software in achieving parallelism. Additionally, it compares RISC and CISC architectures and highlights the importance of pipelining and multiple functional units in enhancing performance.

Uploaded by

ashiksingh151
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

MOD - 1

Parallel Processing & Computer Architecture

Semester Examination Answer Key

Computer Organization & Architecture

Subject Computer Organization & Architecture

Module Module 1 — Parallel Processing

Total Questions 10

Marks per Question 4 Marks

Total Marks 40 Marks


Q1. Parallel Processing & Theory of Parallelism vs Sequential Processing
Parallel Processing is a computing model where multiple processors or processing units execute multiple
tasks or parts of a task simultaneously, reducing total execution time.

Theory of Parallelism
The theory states that any large computational problem can be broken into smaller sub-problems that are
solved concurrently. It is governed by two key laws:

• Amdahl's Law: Speedup is limited by the sequential (non-parallelizable) fraction of a program. If


fraction f is sequential: Speedup ≤ 1/f
• Gustafson's Law: As problem size grows, the parallel fraction dominates, making speedup scale
better with more processors.
Amdahl's Law:
Speedup = 1 / [f + (1-f)/p]

Where: f = sequential fraction, p = number of processors

Example: f=0.2, p=10


Speedup = 1 / [0.2 + 0.8/10] = 1 / 0.28 = 3.57x (not 10x!)

Parallel vs Sequential Processing


Feature Sequential Processing Parallel Processing

Execution One task at a time Multiple tasks simultaneously

Hardware Single processor Multiple processors/cores

Speed Limited by clock rate Scales with processor count

Complexity Simple programming model Complex (sync, communication)

Use Case Simple desktop applications HPC, AI, simulations

Example Single-core CPU Multi-core CPU, GPU, cluster

✦ Conclusion: Parallel processing overcomes the physical limits of single-processor speed by distributing
work — essential for modern AI, scientific computing, and big data.

Q2. Flynn's Classification — SISD, SIMD, MISD, MIMD with Block Diagrams
Flynn's Classification (1966) categorizes computer architectures based on the number of concurrent
Instruction Streams (I) and Data Streams (D).

1. SISD — Single Instruction, Single Data


Traditional uniprocessor. One instruction processes one data item at a time.
[Control Unit] --> [Processor] --> [Memory]
1 Instruction 1 Data
Example: Classic single-core CPU (Intel 8086)

2. SIMD — Single Instruction, Multiple Data


One instruction is broadcast to multiple processors, each operating on different data.
[Control Unit]
/ | \
[PE1] [PE2] [PE3] <- Processing Elements
| | |
[D1] [D2] [D3] <- Different Data
Example: GPU, MMX/SSE extensions, vector processors

3. MISD — Multiple Instruction, Single Data


Multiple instructions operate on the same data stream. Mostly theoretical; used in fault-tolerant systems.
[CU1] --> [PE1] --|
[CU2] --> [PE2] --|--> Same Data Stream
[CU3] --> [PE3] --|
Example: Space Shuttle flight computer (fault tolerance)

4. MIMD — Multiple Instruction, Multiple Data


Each processor executes its own instruction on its own data. Most powerful and widely used model.
[CU1]+[PE1] <-> [D1]
[CU2]+[PE2] <-> [D2] <- Independent processors
[CU3]+[PE3] <-> [D3]
Example: Multi-core CPUs, distributed clusters, supercomputers

Class Instruction Streams Data Streams Example

SISD 1 1 Single-core CPU

SIMD 1 Multiple GPU, Vector CPU

MISD Multiple 1 Fault-tolerant systems

MIMD Multiple Multiple Multi-core, Clusters

✦ Conclusion: MIMD is the dominant architecture today — all modern multi-core processors and distributed
systems fall into this category.
Q3. Evolution of Parallelism in Uni-processor Computers — Look-ahead &
Functional Redundancy
Even a single processor can exhibit parallel behavior through architectural techniques that overlap
operations internally. This is called Intra-processor Parallelism.

Evolution Timeline
1950s: Single-cycle execution (fully sequential)
1960s: Instruction Look-ahead introduced
1970s: Pipelining (overlapping instruction stages)
1980s: Multiple Functional Units (FPU, ALU separation)
1990s: Superscalar (multiple issue per cycle)
2000s: Out-of-order execution, branch prediction
2010s: Hyper-threading (SMT) — 2 threads per core

A) Instruction Look-ahead
The processor fetches and decodes future instructions while the current instruction is still executing.
This creates an overlap between fetch and execution stages.
• A fetch buffer (instruction queue) holds upcoming instructions.
• While EX stage works on I1, fetch stage already has I2, I3 ready.
• Eliminates fetch latency from critical path.
• Predecessor concept to full pipelining.
Without Look-ahead: Fetch I1 → Exec I1 → Fetch I2 → Exec I2
With Look-ahead: Fetch I1 → Exec I1
Fetch I2 → Exec I2 (overlap!)

B) Functional Redundancy (Multiple Functional Units)


The processor has multiple independent execution units — separate hardware for different operation
types — allowing multiple operations to proceed in parallel.

• Integer ALU: Handles add, subtract, logical operations.


• Floating-Point Unit (FPU): Handles float multiply, divide independently.
• Load/Store Unit: Handles memory operations independently.
• Branch Unit: Evaluates branch conditions independently.
Example — Parallel execution with functional units:
Cycle 1: ALU: ADD R1,R2 | FPU: MUL F1,F2 | LSU: LOAD R3
Cycle 2: ALU: SUB R4,R5 | FPU: ADD F3,F4 | LSU: STORE R6
(3 operations per cycle on a single-core processor!)

✦ Conclusion: These techniques allowed single processors to achieve parallelism without requiring multiple
cores — laying the foundation for modern superscalar design.

Q4. Implicit Parallelism vs Explicit Parallelism — Roles of Compiler and


Programmer
Implicit Parallelism
Parallelism is automatically detected and exploited by the compiler or hardware. The programmer
writes sequential code — no parallel constructs needed.

• Compiler analyzes data dependencies and schedules parallel operations.


• Hardware (superscalar, out-of-order) reorders instructions at runtime.
• Programmer is unaware of parallelism — writes normal code.
• Example: Compiler auto-vectorization of loops, superscalar CPUs.
// Programmer writes:
for(i=0; i<n; i++) C[i] = A[i] + B[i];

// Compiler auto-vectorizes to SIMD:


VADD XMM0, [A], [B] // adds 4 floats at once
// No programmer involvement needed!

Explicit Parallelism
The programmer explicitly specifies parallel regions, thread creation, synchronization, and data
distribution using parallel programming models.

• Programmer uses directives, libraries, or language constructs.


• Examples: OpenMP (shared memory), MPI (message passing), CUDA (GPU).
• More control over performance — but more complex to program.
• Programmer manages thread creation, synchronization, and load balancing.
// Programmer explicitly parallelizes:
#pragma omp parallel for // OpenMP directive
for(i=0; i<n; i++) C[i] = A[i] + B[i];
// Compiler creates threads — programmer directed it!

Aspect Implicit Parallelism Explicit Parallelism

Who detects ILP Compiler / Hardware Programmer

Programming effort Low High

Control Limited Full control

Correctness risk Low (automated) High (race conditions)

Scalability Limited (hardware bound) High

Tools Superscalar, OOO CPUs OpenMP, MPI, CUDA

Best for General-purpose code HPC, scientific computing

✦ Conclusion: Implicit parallelism suits everyday programming; explicit parallelism is necessary for
maximizing performance in high-performance computing applications.
Q5. Levels of Parallelism — Instruction Level to Job/Program Level
Parallelism in computer systems exists at multiple levels, each with different granularity and scope. Finer
granularity = more frequent but smaller parallel units.

Parallelism Hierarchy (Fine to Coarse)


FINE GRAIN COARSE GRAIN
<--------------------------------------------------------->
Bit Instruction Thread Process Job/Program
Level Level Level Level Level

1. Bit-Level Parallelism
• Processing multiple bits simultaneously (e.g., 32-bit vs 64-bit ALU).
• A 64-bit processor adds two 64-bit numbers in one operation vs 2 ops for 32-bit.
2. Instruction-Level Parallelism (ILP)
• Multiple instructions executed simultaneously within a single processor.
• Achieved via pipelining, superscalar, out-of-order execution.
• Example: Executing ADD and MUL in the same clock cycle.
3. Thread-Level Parallelism (TLP)
• Multiple threads of the same program run concurrently.
• Each thread shares memory space but has its own PC and registers.
• Example: Multi-threaded web server handling 100 requests simultaneously.
4. Process-Level Parallelism
• Multiple independent processes run on different processors.
• Each process has its own memory space — communicate via IPC/message passing.
• Example: Parallel compilation of different source files.
5. Job/Program-Level Parallelism
• Entirely independent programs run simultaneously on separate systems.
• Coarsest grain — minimal communication between programs.
• Example: Batch processing jobs on a computing cluster.

Level Granularity Managed By Example

Bit Finest Hardware design 64-bit ALU

Instruction Fine Hardware/Compiler Pipelining, Superscalar

Thread Medium OS + Hardware Multi-threading

Process Coarse Operating System Parallel compilation

Job/Program Coarsest User / Scheduler Cluster batch jobs

✦ Conclusion: Modern systems exploit all levels simultaneously — ILP inside cores, TLP across cores, and
job-level parallelism across clusters.

Q6. Hardware Parallelism vs Software Parallelism — Role of Multiple


Functional Units
Hardware Parallelism
Parallelism built into the physical structure of the processor. The hardware itself can execute multiple
operations simultaneously without programmer intervention.

• Multiple Functional Units: Separate ALU, FPU, Load/Store, Branch units.


• Pipelining: Overlapping stages of different instructions.
• Multiple Cores: Independent processors on one chip.
• SIMD Units: Single instruction operates on multiple data (SSE, AVX).

Software Parallelism
Parallelism expressed through program structure, algorithms, and parallel programming constructs.

• Parallel algorithms: Divide-and-conquer, map-reduce.


• Threads: POSIX threads (pthreads), Java threads.
• Message Passing: MPI for distributed memory systems.
• GPU Programming: CUDA, OpenCL for massively parallel execution.

How Multiple Functional Units Contribute to Hardware Parallelism


Traditional Single FU:
Cycle 1: ADD
Cycle 2: MUL (waits for ADD to finish)
Cycle 3: LOAD
→ 3 cycles for 3 operations

With Multiple Functional Units:


Cycle 1: ADD (ALU) | MUL (FPU) | LOAD (LSU)
→ 1 cycle for 3 operations! (3x speedup)

Feature Hardware Parallelism Software Parallelism

Source Physical hardware design Program code & algorithms

Control Automatic (transparent) Explicit by programmer

Flexibility Fixed at manufacture Highly flexible

Examples Multi-core, FPU, SIMD OpenMP, MPI, CUDA

Limits Die area, power, cost Algorithm complexity

✦ Conclusion: Hardware and software parallelism are complementary — hardware provides the capability,
software exploits it effectively.
Q7. Instruction Level Parallelism (ILP) — How Compilers Exploit ILP
Instruction Level Parallelism (ILP) is the ability to execute multiple instructions from a sequential
program simultaneously by identifying instructions that are independent of each other (no data or control
dependencies).

Sources of ILP
• Independent instructions: No shared registers or memory between them.
• Loop unrolling: Expanding loop body to expose more independent instructions.
• Out-of-order execution: Hardware reorders instructions to avoid stalls.
Example — Independent instructions (can execute in parallel):
ADD R1, R2, R3 (writes R1)
MUL R4, R5, R6 (writes R4, reads R5,R6 — independent of above!)
LOAD R7, [R8] (independent of both!)
→ All 3 can execute in parallel!

How Compilers Help Exploit ILP


1. Instruction Scheduling
Compiler reorders instructions to avoid pipeline stalls while maintaining correct results.
Original: Reordered by compiler:
LOAD R1, [addr] LOAD R1, [addr] <- start load early
ADD R2, R1, R3 LOAD R4, [addr2] <- fill load latency slot
LOAD R4, [addr2] ADD R2, R1, R3 <- R1 ready now
ADD R5, R4, R6 ADD R5, R4, R6

2. Loop Unrolling
• Compiler replicates loop body multiple times to reduce loop overhead and expose parallelism.
• More independent instructions become visible to the hardware.
Original loop (1 iteration): Unrolled (4 iterations):
for(i=0;i<n;i++) A[0]+=B[0]; A[1]+=B[1];
A[i] += B[i]; A[2]+=B[2]; A[3]+=B[3];
// All 4 can run in parallel!

3. Register Renaming
Compiler assigns different physical registers to eliminate false WAR and WAW dependencies, exposing
more ILP.

4. Software Pipelining
Compiler interleaves instructions from different loop iterations to keep all functional units busy.

Compiler Technique Hazard Removed Benefit

Instruction Scheduling RAW stalls Fills pipeline gaps

Loop Unrolling Loop overhead Exposes parallel ops

Register Renaming WAR, WAW dependencies Removes false hazards

Software Pipelining Structural hazards Maximizes FU utilization

Dead Code Elimination Wasted cycles Cleaner instruction stream


✦ Conclusion: Compilers are the first line of ILP exploitation — they transform sequential code into
instruction sequences that maximize hardware parallelism utilization.

Q8. RISC vs CISC Architecture — Register Usage, Instruction Length,


Addressing, Pipelining
RISC (Reduced Instruction Set Computer) uses a small set of simple, fixed-length instructions. CISC
(Complex Instruction Set Computer) uses a large set of complex, variable-length instructions.

Detailed Comparison
Parameter RISC CISC

Instruction Set Small (~50-100 instructions) Large (~200-300+ instructions)

Instruction Length Fixed (32 bits) Variable (1-15 bytes)

Execution Time 1 instruction = 1 clock cycle 1 instruction = many cycles

Register Usage Many registers (32+) Few registers (8-16)

Addressing Modes Few (load/store only) Many (complex modes)

Pipelining Easy (fixed length stages) Hard (variable length)

Memory Access Only LOAD/STORE instructions Any instruction can access memory

Compiler Complex compiler needed Simpler compiler

Hardware Simple, fast Complex microcode

Examples ARM, MIPS, RISC-V Intel x86, AMD x64

Pipelining Difference (Key Point)


RISC Pipelining (easy — all stages equal duration):
IF --> ID --> EX --> MEM --> WB (all 1 cycle each)

CISC Pipelining (hard — variable instruction length):


[FETCH varies] --> [DECODE complex] --> [EXECUTE 1-20 cycles]
Solution: x86 processors internally translate CISC to RISC micro-ops!

✦ Conclusion: Modern CPUs blur the line — x86 (CISC) processors internally use RISC-like
micro-operations, combining the compatibility of CISC with the speed of RISC.
Q9. Parallelism in Uni-processor Computers — Pipelining & Multiple
Functional Units
A single processor can exhibit parallel behavior through two primary architectural techniques: Pipelining
and Multiple Functional Units. These allow a uniprocessor to work on several operations simultaneously.

A) Pipelining
The processor is divided into stages; different stages work on different instructions simultaneously — like
an assembly line.
Time → 1 2 3 4 5 6 7
I1: IF ID EX MEM WB
I2: IF ID EX MEM WB
I3: IF ID EX MEM WB
I4: IF ID EX MEM

At cycle 4: All 4 instructions are in different stages simultaneously!


→ Single processor behaves as if 4 things happen at once.

• Throughput: One instruction completes per clock cycle (steady state).


• Speedup: Up to k times (k = number of stages) for large instruction sequences.
• Hazards (data, control, structural) can disrupt the pipeline.

B) Multiple Functional Units


Separate dedicated hardware units handle different types of operations simultaneously within the same
clock cycle.
Functional Units in a Modern Processor:
■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
■ ALU 1 ■ ALU 2 ■ FPU ■ LSU ■
■ (integer)■ (integer)■ (float) ■(mem ops) ■
■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■

Cycle 1: ADD R1,R2 ■ MUL F1,F2 ■ LOAD R3,[R4]


(ALU 1) (FPU) (LSU)
→ 3 different operations complete in 1 cycle!

• Pipelining exploits parallelism across time (temporal parallelism).


• Multiple FUs exploit parallelism across space (spatial parallelism).
• Combined: Modern superscalar CPUs use both — up to 6 instructions/cycle.

Technique Type of Parallelism How it Works Speedup Factor

Pipelining Temporal Overlap instruction stages Up to k (stages)

Multiple FUs Spatial Simultaneous different ops Up to n (FU count)

Combined (SS) Both Multiple pipelines + FUs Up to k × n

✦ Conclusion: Pipelining and multiple functional units transform a single processor into a parallel engine —
this is why a modern CPU at 3GHz can execute billions of instructions per second.

Q10. SISD vs SIMD — Control Units, Processors, Data Streams & Diagrams
SISD — Single Instruction, Single Data
The classical von Neumann architecture. One control unit issues one instruction at a time, and one
processor operates on one data item.

• Control Units: 1 — issues instructions sequentially.


• Processors: 1 — one ALU/processor.
• Instruction Stream: 1 — one sequence of instructions.
• Data Stream: 1 — one data item processed per instruction.
SISD Block Diagram:

[Instruction Memory]
|
| (1 Instruction)
v
[Control Unit]
|
| (1 Operation)
v
[Processor / ALU]
|
| (1 Result)
v
[Data Memory]

Example: ADD R1, R2 → adds two numbers, stores one result.

SIMD — Single Instruction, Multiple Data


One control unit broadcasts a single instruction to multiple processing elements (PEs), each operating
on its own data simultaneously.

• Control Units: 1 — issues one instruction to all PEs.


• Processors: Multiple (N Processing Elements) — all do the same op.
• Instruction Stream: 1 — same instruction broadcast to all PEs.
• Data Streams: Multiple (N) — each PE has its own data.
SIMD Block Diagram:

[Control Unit]
/ | \
/ | \
[PE 1] [PE 2] [PE 3] ... [PE N]
| | | |
[D1] [D2] [D3] [DN]
(Data Memory partitioned across PEs)

Example: VADD → PE1: A[0]+B[0], PE2: A[1]+B[1], PE3: A[2]+B[2]


All additions happen in ONE clock cycle!

Feature SISD SIMD

Control Units 1 1 (shared broadcast)

Processors 1 Multiple (N PEs)


Feature SISD SIMD

Instruction Stream 1 (sequential) 1 (broadcast to all)

Data Streams 1 N (one per PE)

Parallelism None (sequential) Data parallelism

Speed for arrays N cycles for N elements 1 cycle for N elements

Hardware cost Low High (N processors)

Programming Simple Requires vectorization

Examples x86 single-core GPU, SSE/AVX, Cray-1

Key Insight
SIMD delivers N× speedup for data-parallel workloads (image processing, ML, signal processing) with
only 1 control unit — making it extremely efficient for regular, repetitive computations.
✦ Conclusion: SISD is the foundation of general computing; SIMD is the engine of modern AI, graphics, and
scientific computing where the same operation is applied to millions of data elements.

All answers prepared for 4-mark university semester examination level — concise, structured, with diagrams and tables.

You might also like