Storage Systems
CSE 598d, Spring 2007
Lecture 11: Disk scheduling
Feb 27, 2007
(ACK: Several slides borrowed from
Shiva Chaitanya)
Disk Access Time:
Components
• CPU time to issue and process I/O
• contention for controller
• contention for bus
• contention for memory
• verifying block correctness with
checksums (retransmissions)
• waiting in scheduling queues
• ...
Disk Scheduling
Seek time is a dominant factor of total disk I/O time
Let operating system or disk controller choose which
request
to serve next depending on the head’s current position
and requested block’s position on disk
Disk scheduling is much more difficult than CPU
scheduling
– a mechanical device – hard to determine (accurate)
access times
– disk accesses cannot be preempted – runs until it
finishes
– disk I/O often the main performance bottleneck
Scheduling at Multiple
Locations!
S/W, H/W Components between an
application and the disk:
- File system
- Device driver
- SCSI bus Scheduling locations
- RAID controller (if employing RAID)
- Some bus
- Disk controller
Why?
- Why not do it only at FS/DD level?
- Why not do it only within the disk?
Scheduling at Multiple
Locations!
Why?
Key ideas that disk scheduling employs:
Request re-ordering for seek/positioning
minimization
Exploit temporal locality
Anticipation for sequential streams
Introduce non-work conserving behavior!
Exploit spatial locality
Coalesce consecutively placed requests
Free-block scheduling
Different optimizations are best done at
different locations
Furthermore, the best location to do an
optimization depends on the workload!
Goals
– Short response time
– High overall throughput
– Fairness (equal probability for all
blocks to be accessed in the same
time)
Tradeoff: Throughput vs. Fairness
Socialism vs. Capitalism?
Disk Scheduling
Several traditional algorithms
– First-Come-First-Serve (FCFS)
– Shortest Seek Time First (SSTF)
• Shortest Positioning Time First
(SPTF)
– SCAN
– C-SCAN
– LOOK
– C-LOOK
–…
First–Come–First–
Serve (FCFS)
FCFS serves the first arriving request first:
Long seeks
Short average response time
cylinder number
ming requests (in order of arrival): 1 5 10 15 20 25
12 14 2 7 21 8 24
time
Shortest Seek Time
First (SSTF)
SSTF serves closest request first:
short seek times
longer maximum seek times – may lead to starvation
cylinder number
1 5 10 15 20 25
ming requests (in order of arrival):
12 14 2 7 21 8 24
time
SCAN
SCAN moves head edge to edge and serves requests on the way
bi-directional
compromise between response time and seek time optimizati
cylinder number
1 5 10 15 20 25
ming requests (in order of arrival):
12 14 2 7 21 8 24
time
C–SCAN
Circular-SCAN moves head from edge to edge
serves requests on one way – uni-directional
improves response time (fairness)
cylinder number
1 5 10 15 20 25
ming requests (in order of arrival):
12 14 2 7 21 8 24
time
LOOK and C–LOOK
LOOK (C-LOOK) is a variation of SCAN (C-SCAN):
same schedule as SCAN
does not run to the edges
stops and returns at outer- and innermost request
increased efficiency
ming requests (in order of arrival):
cylinder number
1214 2 7 21 8 24 1 5 10 15 20 25
time
V–SCAN(R)
V-SCAN(R) combines SCAN (or LOOK) and SSTF
– define a R-sized unidirectional SCAN window,
i.e., C-SCAN, and use SSTF outside the window
– Example: V-SCAN(0.6)
• makes a C-SCAN (C-LOOK) window over 60 % of the
cylinders
uses SSTF for requests outside the window
V-SCAN(0.0) equivalent with SSTF
– V-SCAN(1.0) equivalent with SCAN
– V-SCAN(0.2) is supposed to be an appropriate
configuration
Shortest Positioning
Time First (SPTF)
Given the complete knowledge of the
actual mapping of data blocks onto the
media, the scheduler can choose the
request with the minimum positioning
delay (combined seek and rotational
latency)
SPTF, like SSTF suffers from poor
starvation resistance. To reduce
response time variance, priority can be
given to requests that have been in
pending queue for excessive periods of
time
Aged Shortest
Positioning Time
First (ASPTF)
ASPTF(W) adjusts each positioning
delay (Tpos) by subtracting a weighted
value corresponding to the amount of
time the request has been waiting for
service (Twait)
Teff = Tpos – (W*Twait)
For large values of W, ASPTF behaves
like FCFS
Scheduling in Modern
Disk Drives
Features of current disk drives
that affect traditional
scheduling algorithms
Host interface
Data layout
On-Board Cache
Ref: B.L. Worthington, Greg Ganger, N. Patt : Scheduling Algorithms for Modern
Disk Drives ACM Sigmetrics 1994
Host interface
Controller presents a request to the disk
drive in terms of the starting logical block
number and request size
Subsequent media access hidden from the
host
Scheduling entities outside of the drive
have little knowledge of overhead delays
Data Layout
Many systems assume sequentiality of LBN-to-
PBN mappings in seek reducing algorithms
Aggressive algorithms require highly
accurate knowledge of the data layout which is
typically hidden
Complexity of mappings increased by zoned
recording, track/cylinder skew and defect
management
On-Board Cache
Memory within disk drives has progressed from small
speed-matching buffers to megabytes of cache memory
Disk logic typically prefetches data into cache to
satisfy sequential read requests. This affects
scheduling in two ways:
Position of the head cannot be determined
easily
Requests that can be satisfied by cache could
be given higher priority
Scheduling by Logical
Block Number
• As expected,
FCFS quickly
saturates as
workload
increases
• SSTF provides
lower mean
response time
Scheduling by Logical
Block Number
FCFS has the
lowest
coefficient
for lighter
workloads
As FCFS begins
to saturate
and its
response time
variance
increases, C-
LOOK emerges
as a better
algorithm for
response time
Scheduling with Full
knowledge
As W
increases, the
average
response time
slowly grows,
though
variance drops
Scheduling with Full
Knowledge
Modern Disk
Scheduling
In modern drives, C-LOOK best exploits the
prefetching cache for workloads with significant
read sequentiality
SSTF and LOOK perform better for random
workloads
Powerful disk controllers use variants of
Shortest Positioning Time First (SPTF).
Freeblock
Scheduling
An approach to utilizing more of a
disk’s potential media bandwidth
Fill rotational latency periods with
useful media transfers for background
applications
It has been observed that 20-50% of a
never-idle disk’s bandwidth can often be
provided to background applications without
affecting foreground response times
Ref: Christopher R. Lumb, Jiri Schindler, Greg Ganger : “Towards Higher
Disk Head Utilization: Extracting Free Bandwidth From Busy Disk
Drives”, OSDI , 2000
Disk-intensive
background tasks
Disk Reorganization
File system cleaning
Backup
Prefetching
Write-back
Integrity Checking
RAID scrubbing
Virus detection
Index Reorganization
…
Free Bandwidth
Time required for a disk media access
Taccess = Tseek + Trotate + Ttransfer
Freeblock scheduling uses the Trotate
component
of disk access to transfer additional data
Instead of just waiting for desired sector
to arrive, this technique transfers the
intermediate sectors
Steps in Freeblock
Scheduling
Predict how much rotational latency
will occur before the next foreground media
transfer
Requires detailed knowledge of disk
attributes, including layout algorithms and
time dependent mechanical positioning
overheads
Squeeze additional media transfers into
that time
Get to the destination track in time
Anticipatory Disk
Scheduling
Reorder available disk requests for
performance by seek optimization,
proportional resource allocation, etc.
Any policy needs multiple outstanding
requests to make good decisions!
Ref: Sitaram Iyer, Peter Druschel : “Anticipatory scheduling : A disk
scheduling framework to overcome deceptive idleness in synchronous I/O”,
SOSP 2001
With enough
requests…
issued by process A issued by process B
tim
e
s ee
k
location on
disk
Throughput = 21 MB/s (IBM Deskstar d
With synchronous
I/O…
issued by process
too
A issued by process B
lat
e!
forced
!
schedu forced
le !
E.g., Throughput = 5 MB/s
Next
Deceptive idleness
Process A is about to issue next request.
but
Scheduler hastily assumes that process A has
no further requests!
Proportional
scheduler
Allocate disk serviceDeceptive idleness
in say 1:2 ratio: causes 1:1 allocation:
A B
A B
Next
Anticipatory
scheduling
Key idea: Sometimes wait for process
whose request was last serviced.
Keeps disk idle for short intervals.
But with informed decisions, this:
Improves throughput
Achieves desired proportions
Cost-benefit
analysis
Balance expected benefits of waiting
against cost of keeping disk idle.
Tradeoffs sensitive to scheduling
policy
e.g., 1. seek optimizing scheduler
2. proportional scheduler
Statistics
For each process, measure:
1. Expected median and 95percentile
thinktime
Number of
requests
Thinktim
Median 95percentilee
2. Expected positioning time
last nex
t
Cost-benefit analysis
for seek optimizing
scheduler
best := best available request chosen by
scheduler
next := expected forthcoming request from
process whose request was last serviced
Benefit =
best.positioning_time — next.positioning_time
Cost = next.median_thinktime
Waiting_duration =
(Benefit > Cost) ? next.95percentile_thinktime : 0
Proportional
scheduler
Costs and benefits are different.
e.g., proportional scheduler:
Wait for process whose request was last serviced,
1. if it has received less than its allocation, and
2. if it has think time below a threshold (e.g., 3ms)
Waiting_duration = next.95percentile_thinktime
Prefetch
Overlaps computation with I/O.
Side-effect:
avoids deceptive idleness!
Application-driven
Kernel-driven
Conclusion
Anticipatory scheduling:
overcomes deceptive idleness
achieves significant performance
improvement on real applications
achieves desired proportions
and is easy to implement!
Fairness : Evaluating
disk scheduling
algorithms
Storage system designers prefer to keep the
queue length at disks small regardless of the
load
When queuing threshold is reached at the disk,
the controller or the device driver queues the
requests until disk queue is processed
Low queuing threshold minimizes request
starvation at the disk level when unfair
scheduling algorithms are deployed
Ref: Alma Riska, Erik Riedel : “It’s not fair – evaluating efficient disk
scheduling” , MASCOTS 2003
Results
Queuing more requests at the disk provides
the scheduling algorithms more information
used for better disk resource utilization
Percentage of requests starved remains
small even if longer queues build up at
the disk
Overall request starvation is independent
from the queuing threshold at the disk
Storage subsytem
architecture
Queues at various levels
Outstanding requests
queued at disk and at
device driver in a single
disk system
And, at the disks and the
controller(s) in a
multiple disk system
Impact of queuing
thresholds
Average load of 16 outstanding requests in Average load of 64 outstanding requests in
system system
Response time
distribution
Higher the load the larger the gap between
the performances of different scheduling
algorithms
Fair and simple FCFS yields longest average
request response time
Best performance obtained when increasing
the queue threshold under SPTF
How about request starvation and variability
in the request response time ?
Response time
distribution
Tail of response time distribution with Tail of response time distribution with
average load of 16 outstanding requests and average load of 16 outstanding requests and
threshold of 8 threshold of 16
Observations ..
Majority of requests under FCFS exhibit long
response times, while seek-reducing algorithms
result in majority of short response times
More than 90% of requests under SPTF have shorter
response times than FCFS and only 1% exhibit upto
double the response times in FCFS
Amount of starvation in position-based scheduling
algorithms for both queuing thresholds is the same
relative to FCFS
Hence, queuing more requests improves disk
performance without introducing more request
starvation
Scheduling at
Device driver level
Depends on workload and filesystem layout
Eg, with SCAN, seek times to sectors in the
middle of the disks are shorter
OS could choose between algorithms based on
current queue
Likely to be expensive in CPU cycles
Queue changes as new requests arrive
SSTF or SCAN are reasonable defaults
Allow algorithm selection as part of OS tuning
FreeBSD: C-SCAN
Linux 2.2 :SCAN
Linux 2.6 : four different versions of elevator
algorithm
Multiple Locations
Positioning-based optimizations best done within the disk
Seek-based optimizations best done at device driver
Why do scheduling within FS?
Device and DD independent
Aware of buffer cache
Application isolation
Disk queue length crucial
Short queue results in degraded throughput
Locally good but globally bad schedules
Long queue results in unfairness
Non-work conservation can improve fairness and throughout!
Anticipatory scheduling
Achieving proportional fairness non-trivial
Solutions based on hierarchy of queues, anticipatory scheduling can
help
Request coalescing can result in great improvement in
throughput
FS and device driver are good places
Improve the sequentiality of the request stream seen by the disk
Free-block scheduling can improve throughput
Can view this as a “corrector” for the non work conserving nature of
disk
Additional slides
on free-block
scheduling
Illustration of two
freeblock scheduling
possibilities
Characteristics of
tasks
Low priority
Freeblock requests will only be served
opportunistically
Not appropriate for a set of equally
important requests
Large sets of desired blocks
Larger the set of disk locations desired,
higher the probability to find a free bandwidth
opportunity
Desired
Characteristics …
No particular order of access
Ordering requirements restrict the set of
requests that can be considered by
scheduler
Effectiveness of freeblock scheduling
directly related to number of outstanding
requests
Small working memory footprints
Need to buffer multiple blocks before
processing creates artificial ordering
requirements due to memory limitations
A Simple Interface
No call into the freeblock scheduler subsystem
waits for a disk access. Calls return immediately
Freeblock read requests do not specify memory
locations for read data. Completion callbacks
provide pointers to buffers owned by freeblock
scheduling subsystem
Applications
Scanning applications : tasks that scan large
portion of disk contents like report
generation, RAID scrubbing, virus detection,
tamper detection and backup
Internal storage optimization: reorganizing
stored data to improve performance, e.g placing
related data contigously, placing hot data near
the center of the disk, replicating data for
subsequent reads, ..etc
Prefetching and Prewriting: prewriting is early
writing out of dirty blocks under the
assumption they will not be overwritten or
deleted before writeback is necessary
Availability of
free bandwidth
Availability of potential free
bandwidth = Total bandwidth *
Fraction of time spent on rotational
latency
Results in the next few slides
obtained using Disksim
Default disk drive used is Quantum
Atlas 10k
Default workload consists of 10,000
foreground requests issued one at a
time with uniform distribution of
starting locations
Impact of disk
characteristics
Overall, about one third of each
disk’s head usage is on rotational
latency
Characteristics of
simulated disk drives
Impact of workload
characteristics
Impact of scheduling
algorithm
• C-LOOK and SSTF
reduce seek times
without affecting
transfer times and
rotational latencies
• SPTF tends to
decrease both
overhead components.
Figure shows
rotational latency
decreases to 22%
Feasibility
Freeblock scheduling relies heavily on
ability to accurately predict positioning
delays
Firmware of most disk drives now supports
SPTF which requires similar predictions
Freeblock scheduling resembles advanced
disk schedulers for environments with a
mixed workload of real-time and non-real-
time activities
Additional slides
on anticipatory
scheduling -
experimental
evaluation
Experiments
• FreeBSD-4.3 patch + kernel module
(1500 lines of C code)
• 7200 rpm IDE disk (IBM Deskstar)
• Also in the paper:
15000 rpm SCSI disk (Seagate Cheetah)
Microbenchmark
25
no prefetch
prefetch
Original
20
Anticipatory
Throughput (MB/s)
15
no prefetch
prefetch
10
no prefetch
prefetch
0
Sequential Alternate Random within file
Real workloads
What’s the impact on real applications
and benchmarks?
Andrew benchmark • Disk-intensive
Apache web server • Prefetching enabled
(large working set)
Database benchmark
Andrew filesystem
benchmark
2 (or more) concurrent clients
30
6
5
25
20 Original
15 Anticipatory
10
5
Execution
0 time (minutes)
mkdir cp stat scan gcc
-16% -5% -5% -54% +1.7%
Overall 8% performance improvement
Apache web server
4
3
• [Link] trace
2
• Large working set
• 48 web clients 1
Throughput (MB/s)
0
read mmap
+29% +71%
no
prefetch