0% found this document useful (0 votes)

12 views12 pages

OsChapter 6

This document is an uncorrected proof intended for internal use only, containing solutions to various problems related to parallelism and performance in computing. It discusses concepts such as data-level and task-level parallelism, speedup factors in algorithms, and issues like cache misses and false sharing. The content emphasizes the importance of restructuring code for parallel execution and provides examples and calculations to illustrate these principles.

Uploaded by

liotim911202

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

12 views12 pages

OsChapter 6

Uploaded by

liotim911202

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

To protect the rights of the author(s) and publisher we inform you that this PDF is an uncorrected proof for

internal business use only by the author(s), editor(s),

reviewer(s), Elsevier and typesetter MPS. It is not allowed to publish this proof online or in print. This proof copy is the copyright property of the publisher and
is confidential until formal publication.

6
Solutions
To protect the rights of the author(s) and publisher we inform you that this PDF is an uncorrected proof for internal business use only by the author(s), editor(s),
reviewer(s), Elsevier and typesetter MPS. It is not allowed to publish this proof online or in print. This proof copy is the copyright property of the publisher and
is confidential until formal publication.

Chapter 6 Solutions S-3

6.1 There is no single right answer for this question. The purpose is to get
students to think about parallelism present in their daily lives. The answer
should have at least 10 activities identified.
6.1.1 Any reasonable answer is correct here.
6.1.2 Any reasonable answer is correct here.
6.1.3 Any reasonable answer is correct here.
6.1.4 The student is asked to quantify the savings due to parallelism. The answer
should consider the amount of overlap provided through parallelism and
should be less than or equal to (if no parallelism was possible) the original
time computed if each activity was carried out serially.

6.2
6.2.1 For this set of resources, we can pipeline the preparation. We assume that
we do not have to reheat the oven for each cake.
Preheat Oven
Mix ingredients in bowl for Cake 1
Fill cake pan with contents of bowl and bake Cake 1. Mix ingredients for
Cake 2 in bowl.
Finish baking Cake 1. Empty cake pan. Fill cake pan with bowl contents for
Cake 2 and bake Cake 2. Mix ingredients in bowl for Cake 3.
Finish baking Cake 2. Empty cake pan. Fill cake pan with bowl contents for
Cake 3 and bake Cake 3.
Finish baking Cake 3. Empty cake pan.
6.2.2 Now we have 3 bowls, 3 cake pans and 3 mixers. We will name them A, B,
and C.
Preheat Oven
Mix incredients in bowl A for Cake 1
Fill cake pan A with contents of bowl A and bake for Cake 1. Mix ingredients
for Cake 2 in bowl A.
Finish baking Cake 1. Empty cake pan A. Fill cake pan A with contents of
bowl A for Cake 2. Mix ingredients in bowl A for Cake 3.
Finishing baking Cake 2. Empty cake pan A. Fill cake pan A with contents of
bowl A for Cake 3.
To protect the rights of the author(s) and publisher we inform you that this PDF is an uncorrected proof for internal business use only by the author(s), editor(s),
reviewer(s), Elsevier and typesetter MPS. It is not allowed to publish this proof online or in print. This proof copy is the copyright property of the publisher and
is confidential until formal publication.

S-4 Chapter 6 Solutions

Finish baking Cake 3. Empty cake pan A.

The point here is that we cannot carry out any of these items in parallel
because we either have one person doing the work, or we have limited
capacity in our oven.
6.2.3 Each step can be done in parallel for each cake. The time to bake 1 cake, 2
cakes or 3 cakes is exactly the same.
6.2.4 The loop computation is equivalent to the steps involved to make one
cake. Given that we have multiple processors (or ovens and cooks), we can
execute instructions (or cook multiple cakes) in parallel. The instructions
in the loop (or cooking steps) may have some dependencies on prior
instructions (or cooking steps) in the loop body (cooking a single cake).
Data-level parallelism occurs when loop iterations are independent (i.e., no
loop carried dependencies).
Task-level parallelism includes any instructions that can be computed on
parallel execution units, are similar to the independent operations involved
in making multiple cakes.

6.3
6.3.1 While binary search has very good serial performance, it is difficult to
parallelize without modifying the code. So part A asks to compute the
speedup factor, but increasing X beyond 2 or 3 should have no benefits.
While we can perform the comparison of low and high on one core, the
computation for mid on a second core, and the comparison for A[mid] on
a third core, without some restructuring or speculative execution, we will
not obtain any speedup. The answer should include a graph, showing that
no speedup is obtained after the values of 1, 2, or 3 (this value depends
somewhat on the assumption made) for Y.
6.3.2 In this question, we suggest that we can increase the number of cores (to
each the number of array elements). Again, given the current code, we
really cannot obtain any benefit from these extra cores. But if we create
threads to compare the N elements to the value X and perform these
in parallel, then we can get ideal speedup (Y times speedup), and the
comparison can be completed in the amount of time to perform a single
comparison.
6.4 This problem illustrates that some computations can be done in parallel if
serial code is restructured. But more importantly, we may want to provide
for SIMD operations in our ISA, and allow for data-level parallelism when
performing the same operation on multiple data items.
To protect the rights of the author(s) and publisher we inform you that this PDF is an uncorrected proof for internal business use only by the author(s), editor(s),
reviewer(s), Elsevier and typesetter MPS. It is not allowed to publish this proof online or in print. This proof copy is the copyright property of the publisher and
is confidential until formal publication.

Chapter 6 Solutions S-5

6.4.1 As shown below, each iteration of the loop requires 16 cycles. The loop runs
999 times. Thus, the total number of cycles is 16 × 999 + 3 = 15984.
li $s0, 8000
add $s1, $a0, $s0
addi $s2, $a0, 16
loop: l.d $f0, -16($s2)
l.d $f2, -8($s2)
stall
stall
stall
stall
stall
stall
add.d $f4, $f0, $f2
stall
stall
stall
stall
s.d $f4, 0($s2)
addi $s2, $s2, 8
bne $s2, $s1, loop

6.4.2 The following code removes one stall per iteration:

a. li $s0, 8000
add $s1, $a0, $s0
addi $s2, $a0, 16
loop: l.d $f0, -16($s2)
l.d $f2, -8($s2)
stall
stall
stall
stall
stall
stall
add.d $f4, $f0, $f2
addi $s2, $s2, 8
stall
stall
stall
s.d $f4, 0($s2)
bne $s2, $s1, loop
b. Thus, the new loop takes 15×999=14958 cycles.
To protect the rights of the author(s) and publisher we inform you that this PDF is an uncorrected proof for internal business use only by the author(s), editor(s),
reviewer(s), Elsevier and typesetter MPS. It is not allowed to publish this proof online or in print. This proof copy is the copyright property of the publisher and
is confidential until formal publication.

S-6 Chapter 6 Solutions

6.4.3 Array elements D[j] and D[j−1] will have loop carried dependencies.
The value loaded into D0 during iteration i was produced during iteration
i−1.
6.4.4 li $s0, 8000
add $s1, $a0, $s0
addi $s2, $a0, 16
l.d $f0, 0($s2)
l.d $f2, 8($s2)
addi $s2, $s2, 8
stall
stall
stall
stall
stall
loop: add.d $f4, $f0, $f2
addi $s2, $s2, 8
mov.d $f0, $f2
mov.d $f2, $f4
stall
s.d $f2, 0($s2)
ble $s2, $s1,loop
This loop takes seven cycles and runs 999 times. Thus, the total number of
cycles is 7 × 999 + 10 = 7003.
6.4.5 l.d $f0, 0($s2)
l.d $f2, 8($s2)
li $s0, 8000
add $s1, $a0, $s0
addi $s2, $s2,16
stall
stall
stall
loop: add.d $f4, $f0, $f2
stall
stall
stall
stall
add.d $f0, $f4, $f2
s.d $f2, 0($s2)
stall
stall
stall
add.d $f2, $f4, $f0
s.d $f0, 0($s2)
addi $s2, $s2,24
stall
To protect the rights of the author(s) and publisher we inform you that this PDF is an uncorrected proof for internal business use only by the author(s), editor(s),
reviewer(s), Elsevier and typesetter MPS. It is not allowed to publish this proof online or in print. This proof copy is the copyright property of the publisher and
is confidential until formal publication.

Chapter 6 Solutions S-7

stall
s.d $f2,-8($s2)
bne $s2,$s1,loop
The unrolled loop takes 17 cycles, but runs only 333 times. Thus, the total
number of cycles is 17 × 333 + 10 = 5671.
6.4.6 Include two copies of the loop: The unrolled loop and the original loop.
Suppose you unrolled the loop U times. Run the unrolled loop until the
number of iterations left is less than U. (In some sense your unrolled loop
will be doing this: for (i = 0; i + U < MAX; i+= U).) At this
point, switch to the unrolled loop. (In some sense, your original loop will be
doing this: for (; i < MAX; i++) .)
6.4.7 It is not possible to use message passing to improve performance—even
if the message passing system has no latency. There is simply not enough
work that can be done in parallel to benefit from using multiple CPUs. All
the work that can be done in parallel can be scheduled between dependent
floating point instructions.

6.5
6.5.1 This problem is again a divide and conquer problem, but utilizes recursion
to produce a very compact piece of code. In part A the student is asked to
compute the speedup when the number of cores is small. When forming the
lists, we spawn a thread for the computation of left in the MergeSort code,
and spawn a thread for the computation of the right. If we consider this
recursively, form initial elements in the array, we can utilize 1 + 2 + 4 + 8 +
16 + …. log2 (m) processors to obtain speedup.
6.5.2 In this question, log2 (m) is the largest value of Y for which we can obtain
any speedup without restructuring. But if we had m cores, we could
perform sorting using a very different algorithm. For instance, if we have
greater than m/2 cores, we can compare all pairs of data elements, swap
the elements if the left element is greater than the right element, and then
repeat this step m times. So this is one possible answer for the question. It
is known as parallel comparison sort. Various comparison sort algorithms
include odd-even sort and cocktail sort.
6.6
6.6.1 This problem presents an “embarrassingly parallel” computation and
asks the student to find the speedup obtained on a four-core system. The
computations involved are: (m × p × n) multiplications and (m × p × (n− 1))
additions. The multiplications and additions associated with a single
element in C are dependent (we cannot start summing up the results of the
multiplications for an element until two products are available). So in this
question, the speedup should be very close to 4.
To protect the rights of the author(s) and publisher we inform you that this PDF is an uncorrected proof for internal business use only by the author(s), editor(s),
reviewer(s), Elsevier and typesetter MPS. It is not allowed to publish this proof online or in print. This proof copy is the copyright property of the publisher and
is confidential until formal publication.

S-8 Chapter 6 Solutions

6.6.2 This question asks about how speedup is affected due to cache misses caused
by the four cores all working on different matrix elements that map to the
same cache line. Each update would incur the cost of a cache miss, and so
will reduce the speedup obtained by a factor of 3 times the cost of servicing
a cache miss.
6.6.3 In this question, we are asked how to fix this problem. The easiest way to
solve the false sharing problem is to compute the elements in C by traversing
the matrix across columns instead of rows (i.e., using index-j instead of
index-i). These elements will be mapped to different cache lines. Then we
just need to make sure we process the matrix index that is computed (i, j)
and (i + 1, j) on the same core. This will eliminate false sharing.

6.7
6.7.1 x = 2, y = 2, w = 1, z = 0
x = 2, y = 2, w = 3, z = 0
x = 2, y = 2, w = 5, z = 0
x = 2, y = 2, w = 1, z = 2
x = 2, y = 2, w = 3, z = 2
x = 2, y = 2, w = 5, z = 2
x = 2, y = 2, w = 1, z = 4
x = 2, y = 2, w = 3, z = 4
x = 3, y = 2, w = 5, z = 4
6.7.2 We could set synchronization instructions after each operation so that all
cores see the same value on all nodes.

6.8
6.8.1 If every philosopher simultaneously picks up the left fork, then there will be
no right fork to pick up. This will lead to starvation.
6.8.2 The basic solution is that whenever a philosopher wants to eat, she checks
both forks. If they are free, then she eats. Otherwise, she waits until a
neighbor contacts her. Whenever a philosopher finishes eating, she checks
to see if her neighbors want to eat and are waiting. If so, then she releases
the fork to one of them and lets them eat. The difficulty is to first be able to
obtain both forks without another philosopher interrupting the transition
between checking and acquisition. We can implement this a number of
ways, but a simple way is to accept requests for forks in a centralized queue,
and give out forks based on the priority defined by being closest to the head
of the queue. This provides both deadlock prevention and fairness.
To protect the rights of the author(s) and publisher we inform you that this PDF is an uncorrected proof for internal business use only by the author(s), editor(s),
reviewer(s), Elsevier and typesetter MPS. It is not allowed to publish this proof online or in print. This proof copy is the copyright property of the publisher and
is confidential until formal publication.

Chapter 6 Solutions S-9

6.8.3 There are a number or right answers here, but basically showing a case
where the request of the head of the queue does not have the closest forks
available, though there are forks available for other philosophers.
6.8.4 By periodically repeating the request, the request will move to the head of
the queue. This only partially solves the problem unless you can guarantee
that all philosophers eat for exactly the same amount of time, and can use
this time to schedule the issuance of the repeated request.
6.9
6.9.1
Core 1 Core 2
A3 B1, B4
A1, A2 B1, B4
A1, A4 B2
A1 B3

6.9.2 Answer is same as 6.9.1.

6.9.3 FU1 FU2

A1 A2
A1
A1
B1 B2
B1
A3
A4
B2
B4

6.9.4 FU1 FU2

A1 B1
A1 B1
A1 B2
A2 B3
A3 B4
A4

6.10 This is an open-ended question.

6.10.1 There is no solution to this problem (it is an open-ended question-no

need to change the solutions document).

6.10.2 There is no solution to this problem (it is an open-ended question-no

need to change the solutions document).
To protect the rights of the author(s) and publisher we inform you that this PDF is an uncorrected proof for internal business use only by the author(s), editor(s),
reviewer(s), Elsevier and typesetter MPS. It is not allowed to publish this proof online or in print. This proof copy is the copyright property of the publisher and
is confidential until formal publication.

S-10 Chapter 6 Solutions

6.11

6.11.1 The answer should include an MIPS program that includes four different
processes that will compute ¼ of the sums. Assuming that memory latency
is not an issue, the program should get linear speed when run on the four
processors (there is no communication necessary between threads). If
memory is being considered in the answer, then the array blocking should
consider preserving spatial locality so that false sharing is not created.

6.11.2 Since this program is highly data parallel and there are no data dependencies,
an 8 × speedup should be observed. In terms of instructions, the SIMD
machine should have fewer instructions (though this will depend upon
the SIMD extensions).

6.12 This is an open-ended question that could have many possible answers.
The key is that the student learns about MISD and compares it to an SIMD
machine.

6.12.1 There is no solution to this problem (it is an open-ended question-no

need to change the solutions document).

6.12.2 There is no solution to this problem (it is an open-ended question-no

need to change the solutions document).

6.13 This is an open-ended question that could have many answers. The key is
that the students learn about warps.

6.13.1 There is no solution to this problem (it is an open-ended question-no

need to change the solutions document).

6.14 This is an open-ended programming assignment. The code should be

tested for correctness.

6.14.1 There is no solution to this problem (it is an open-ended question-no

need to change the solutions document).

6.14.2 There is no solution to this problem (it is an open-ended question-no

need to change the solutions document).

6.15 This question will require the students to research on the Internet both the
AMD Fusion architecture and the Intel QuickPath technology. The key is
that students become aware of these technologies. The actual bandwidth
and latency values should be available right off the company websites, and
will change as the technology evolves.

6.15.1 There is no solution to this problem (it is a lab-based question-no need to

change the solutions document).
To protect the rights of the author(s) and publisher we inform you that this PDF is an uncorrected proof for internal business use only by the author(s), editor(s),
reviewer(s), Elsevier and typesetter MPS. It is not allowed to publish this proof online or in print. This proof copy is the copyright property of the publisher and
is confidential until formal publication.

Chapter 6 Solutions S-11

6.16

6.16.1 For an n-cube of order N (2N nodes), the interconnection network can
sustain N − 1 broken links and still guarantee that there is a path to all nodes in the
network.

6.16.2 The plot below shows the number of network links that can fail and still
guarantee that the network is not disconnected.
100000

10000
Number of faulty links

1000
n-cube
fully connected
100

1
2 4 8 16
Network order

6.17
6.17.1 Major differences between these suites include:
Whetstone—designed for floating point performance specifically
PARSEC—these workloads are focused on multithreaded programs.
6.17.2 Only the PARSEC benchmarks should be impacted by sharing and
synchronization. This should not be a factor in Whetstone.

6.18
6.18.1 Any reasonable C program that performs the transformation should be
accepted.
6.18.2 The storage space should be equal to (R + R) times the size of a single
precision floating point number + (m + 1) times the size of the index,
where R is the number of non-zero elements and m is the number of rows.
We will assume each floating-point number is 4 bytes, and each index is a
short unsigned integer that is 2 bytes. For Matrix X this equals 111 bytes.
6.18.3 The answer should include results for both a brute-force and a computation
using the Yale Sparse Matrix Format.
To protect the rights of the author(s) and publisher we inform you that this PDF is an uncorrected proof for internal business use only by the author(s), editor(s),
reviewer(s), Elsevier and typesetter MPS. It is not allowed to publish this proof online or in print. This proof copy is the copyright property of the publisher and
is confidential until formal publication.

S-12 Chapter 6 Solutions

6.18.4 There are a number of more efficient formats, but their impact should be
marginal for the small matrices used in this problem.

6.19
6.19.1 This question presents three different CPU models to consider when
executing the following code:
if (X[i][j] > Y[i][j])
count++;

6.19.2 There are a number of acceptable answers here, but they should consider
the capabilities of each CPU and also its frequency. What follows is one
possible answer:
Since X and Y are FP numbers, we should utilize the vector processor (CPU
C) to issue two loads, eight matrix elements in parallel from A and eight
matrix elements from B, into a single vector register and then perform a
vector subtract. We would then issue two vector stores to put the result in
memory.
Since the vector processor does not have comparison instructions, we would
have CPU A perform two parallel conditional jumps based on floating
point registers. We would increment two counts based on the conditional
compare. Finally, we could just add the two counts for the entire matrix. We
would not need to use core B.
6.20 This question looks at the amount of queuing that is occurring in the system
given a maximum transaction processing rate, and the latency observed
on average by a transaction. The latency includes both the service time
(which is computed by the maximum rate) and the queue time.
6.20.1 So for a max transaction processing rate of 5000/sec, and we have four
cores contributing, we would see an average latency of 0.8 ms if there was
no queuing taking place. Thus, each core must have 1.25 transactions
either executing or in some amount of completion on average.
So the answers are:
Latency Max TP rate Avg. # requests per core
1 ms 5000/sec 1.25
2 ms 5000/sec 2.5
1 ms 10,000/sec 2.5
2 ms 10,000/sec 5

6.20.2 We should be able to double the maximum transaction rate by doubling

the number of cores.
6.20.3 The reason this does not happen is due to memory contention on the
shared memory system.

Chapter 6 Solutions: Parallelism Insights
No ratings yet
Chapter 6 Solutions: Parallelism Insights
11 pages
Parallel Computing in Cake Baking
No ratings yet
Parallel Computing in Cake Baking
1 page
RISC-V Assembly Control Flow Lab
No ratings yet
RISC-V Assembly Control Flow Lab
5 pages
Homework #3 Solution: Department of Electrical and Computer Engineering University of Wisconsin - Madison
No ratings yet
Homework #3 Solution: Department of Electrical and Computer Engineering University of Wisconsin - Madison
7 pages
Arch 151 Quiz 1 Solutions
No ratings yet
Arch 151 Quiz 1 Solutions
5 pages
Software Pipelining vs. Loop Unrolling
No ratings yet
Software Pipelining vs. Loop Unrolling
13 pages
Intro To Static Pipelining: CS252 Graduate Computer Architecture
No ratings yet
Intro To Static Pipelining: CS252 Graduate Computer Architecture
52 pages
Digital Systems Design Using Verilog 1st Edition Roth Solutions Manual
No ratings yet
Digital Systems Design Using Verilog 1st Edition Roth Solutions Manual
105 pages
EE352 Spring 2010 Homework 3 Solutions
No ratings yet
EE352 Spring 2010 Homework 3 Solutions
8 pages
Lsli 11
No ratings yet
Lsli 11
37 pages
Parallelism in Computer Algorithms
No ratings yet
Parallelism in Computer Algorithms
7 pages
Two-Bit Counter Design with J-K Flip-Flops
No ratings yet
Two-Bit Counter Design with J-K Flip-Flops
7 pages
VLIW Machine Homework Solutions
No ratings yet
VLIW Machine Homework Solutions
16 pages
Question Bank For Computer Organization and Design 4th Edition by David A PattersonJohn L Hennessy
No ratings yet
Question Bank For Computer Organization and Design 4th Edition by David A PattersonJohn L Hennessy
8 pages
Chapter 04
No ratings yet
Chapter 04
25 pages
Loop Cycle Count with Data Dependencies
No ratings yet
Loop Cycle Count with Data Dependencies
6 pages
CS146 Computer Architecture Homework 2
No ratings yet
CS146 Computer Architecture Homework 2
5 pages
Parallel Execution of DO Loops Techniques
No ratings yet
Parallel Execution of DO Loops Techniques
11 pages
Concurrency and Parallelism in Computing
No ratings yet
Concurrency and Parallelism in Computing
3 pages
VLIW Machine Scheduling Quiz 4
No ratings yet
VLIW Machine Scheduling Quiz 4
17 pages
VLIW and ILP in Superscalar Architectures
No ratings yet
VLIW and ILP in Superscalar Architectures
34 pages
Advanced Computer Architectures Exam Solutions
No ratings yet
Advanced Computer Architectures Exam Solutions
13 pages
Advanced Computer Architectures Exam
No ratings yet
Advanced Computer Architectures Exam
10 pages
Computer Architecture Cycle Test Questions
No ratings yet
Computer Architecture Cycle Test Questions
10 pages
CS 433 Midterm Exam Instructions
No ratings yet
CS 433 Midterm Exam Instructions
9 pages
CS146 Fall 2017 Midterm Exam Guide
No ratings yet
CS146 Fall 2017 Midterm Exam Guide
12 pages
Chapter 4 Solutions Overview
No ratings yet
Chapter 4 Solutions Overview
52 pages
Lec18-Static BRANCH PREDICTION VLIW
No ratings yet
Lec18-Static BRANCH PREDICTION VLIW
40 pages
CENG 400 Exam: Computer Organization
No ratings yet
CENG 400 Exam: Computer Organization
10 pages
Reg2Loc Control in CPU Instructions
No ratings yet
Reg2Loc Control in CPU Instructions
24 pages
Challenges in Parallel Processing
No ratings yet
Challenges in Parallel Processing
82 pages
FAT CAO Pyp
No ratings yet
FAT CAO Pyp
4 pages
VLSI DSP Systems Overview and Design
No ratings yet
VLSI DSP Systems Overview and Design
25 pages
Understanding Iteration Bound in DSP
No ratings yet
Understanding Iteration Bound in DSP
25 pages
VLSI DSP Systems: Design & Implementation
No ratings yet
VLSI DSP Systems: Design & Implementation
455 pages
VLSI DSP Systems: Design & Optimization
No ratings yet
VLSI DSP Systems: Design & Optimization
25 pages
VLSI DSP Systems: Design & Optimization
No ratings yet
VLSI DSP Systems: Design & Optimization
25 pages
VLSI DSP Systems: Design & Iteration Bound
50% (4)
VLSI DSP Systems: Design & Iteration Bound
25 pages
VLIW and Vector Machine Quiz 2016
No ratings yet
VLIW and Vector Machine Quiz 2016
40 pages
ELE 375 Midterm Sample Questions
No ratings yet
ELE 375 Midterm Sample Questions
4 pages
Assignment 1 Answer Key PDF
No ratings yet
Assignment 1 Answer Key PDF
12 pages
CSE 6230 Homework 3: Vectorization Guide
No ratings yet
CSE 6230 Homework 3: Vectorization Guide
16 pages
Software Pipelining in Loop Optimization
No ratings yet
Software Pipelining in Loop Optimization
25 pages
CS2100 Tutorial 10: MIPS Pipelining Answers
100% (1)
CS2100 Tutorial 10: MIPS Pipelining Answers
7 pages
Generating Execution Paths in C Code
No ratings yet
Generating Execution Paths in C Code
11 pages
HW 5
No ratings yet
HW 5
4 pages
ECE 313 Exam 1: MIPS Performance Analysis
No ratings yet
ECE 313 Exam 1: MIPS Performance Analysis
5 pages
Enhancing Computer Architecture Performance
No ratings yet
Enhancing Computer Architecture Performance
71 pages
Synchronization Tools and Race Conditions
No ratings yet
Synchronization Tools and Race Conditions
14 pages
MIPS Code and ALU Design Exercises
No ratings yet
MIPS Code and ALU Design Exercises
6 pages
ECE 452 Midterm Exam: Computer Design
No ratings yet
ECE 452 Midterm Exam: Computer Design
9 pages
Code Scheduling and Loop Unrolling Analysis
No ratings yet
Code Scheduling and Loop Unrolling Analysis
2 pages
Inherent Parallelism in Software Design
No ratings yet
Inherent Parallelism in Software Design
12 pages
Instruction-Level Parallelism Overview
No ratings yet
Instruction-Level Parallelism Overview
5 pages
Speculation in Computer Architecture
No ratings yet
Speculation in Computer Architecture
22 pages
Computer Organization and Design RISC-V 1st Edition Patterson Solutions Manual Direct Link
100% (5)
Computer Organization and Design RISC-V 1st Edition Patterson Solutions Manual Direct Link
176 pages
CS120 Fall 2023 Final Exam Guide
No ratings yet
CS120 Fall 2023 Final Exam Guide
4 pages
Efficient Algorithm for Arithmetic Units
No ratings yet
Efficient Algorithm for Arithmetic Units
9 pages
Digital Systems Design Using VHDL 3rd Edition Roth Solutions Manual
No ratings yet
Digital Systems Design Using VHDL 3rd Edition Roth Solutions Manual
100 pages
Java Programming Concepts and Course Outline
No ratings yet
Java Programming Concepts and Course Outline
3 pages
SIS Performance Optimization Guide
No ratings yet
SIS Performance Optimization Guide
14 pages
Flynn's Taxonomy
No ratings yet
Flynn's Taxonomy
2 pages
Parallel Apriori Algorithm Implementation
No ratings yet
Parallel Apriori Algorithm Implementation
4 pages
Ignou Unit 4
No ratings yet
Ignou Unit 4
22 pages
Simultaneous Multithreading Explained
No ratings yet
Simultaneous Multithreading Explained
21 pages
Dynamic Partitioning in Informatica
No ratings yet
Dynamic Partitioning in Informatica
3 pages
Alpa: Automating Model Parallelism in DL
No ratings yet
Alpa: Automating Model Parallelism in DL
20 pages
Understanding Parallel Computing
No ratings yet
Understanding Parallel Computing
6 pages
B.E. Computer Science Syllabus 2022
No ratings yet
B.E. Computer Science Syllabus 2022
84 pages
Multithreading for Ultra-Low-Latency HFT
No ratings yet
Multithreading for Ultra-Low-Latency HFT
6 pages
PRAM Algorithms for Parallel Sorting
No ratings yet
PRAM Algorithms for Parallel Sorting
16 pages
4+1 View Model of Software Architecture
100% (1)
4+1 View Model of Software Architecture
9 pages
Efficient Object Detection in Remote Sensing
No ratings yet
Efficient Object Detection in Remote Sensing
29 pages
Computer Organization Course Syllabus
No ratings yet
Computer Organization Course Syllabus
2 pages
The 18 J
No ratings yet
The 18 J
46 pages
Future Directions in Computer Architecture
No ratings yet
Future Directions in Computer Architecture
46 pages
Cloud Computing Overview and Models
100% (1)
Cloud Computing Overview and Models
122 pages
Linux Scalability Analysis on 48 Cores
No ratings yet
Linux Scalability Analysis on 48 Cores
49 pages
Computer Organization Exam Question Bank
No ratings yet
Computer Organization Exam Question Bank
2 pages
Data Warehousing Infrastructure Essentials
No ratings yet
Data Warehousing Infrastructure Essentials
32 pages
Hitachi Accelerated Flash 2.0
No ratings yet
Hitachi Accelerated Flash 2.0
26 pages
AI-Enabled MCU/SOC Systems for Edge Computing
No ratings yet
AI-Enabled MCU/SOC Systems for Edge Computing
12 pages
x86 Virtualization and Cloud Computing Overview
No ratings yet
x86 Virtualization and Cloud Computing Overview
34 pages
Data Dependency in Parallel Computing
No ratings yet
Data Dependency in Parallel Computing
39 pages
Advanced Computer Architecture Course
No ratings yet
Advanced Computer Architecture Course
125 pages
Embedded Processor Systems Test 2 Guide
No ratings yet
Embedded Processor Systems Test 2 Guide
6 pages
Parallel Programming in HPC Systems
No ratings yet
Parallel Programming in HPC Systems
330 pages
NLP Techniques and Applications
No ratings yet
NLP Techniques and Applications
25 pages
CUDA Programming for GPGPU Training
0% (1)
CUDA Programming for GPGPU Training
4 pages

OsChapter 6

Uploaded by

OsChapter 6

Uploaded by

To protect the rights of the author(s) and publisher we inform you that this PDF is an uncorrected proof for

internal business use only by the author(s), editor(s),

Chapter 6 Solutions S-3

S-4 Chapter 6 Solutions

Finish baking Cake 3. Empty cake pan A.

Chapter 6 Solutions S-5

6.4.2 The following code removes one stall per iteration:

S-6 Chapter 6 Solutions

Chapter 6 Solutions S-7

S-8 Chapter 6 Solutions

Chapter 6 Solutions S-9

6.9.2 Answer is same as 6.9.1.

6.9.3 FU1 FU2

6.9.4 FU1 FU2

6.10 This is an open-ended question.

6.10.1 There is no solution to this problem (it is an open-ended question-no

6.10.2 There is no solution to this problem (it is an open-ended question-no

S-10 Chapter 6 Solutions

6.12.1 There is no solution to this problem (it is an open-ended question-no

6.12.2 There is no solution to this problem (it is an open-ended question-no

6.13.1 There is no solution to this problem (it is an open-ended question-no

6.14 This is an open-ended programming assignment. The code should be

6.14.1 There is no solution to this problem (it is an open-ended question-no

6.14.2 There is no solution to this problem (it is an open-ended question-no

6.15.1 There is no solution to this problem (it is a lab-based question-no need to

Chapter 6 Solutions S-11

S-12 Chapter 6 Solutions

6.20.2 We should be able to double the maximum transaction rate by doubling

You might also like