0% found this document useful (0 votes)
11 views69 pages

Operating Systems Study Notes

The document contains notes on operating systems, detailing their role as intermediaries between users and hardware, and outlining the evolution of operating systems from the first generation to the present. It discusses system components, resource management, process management, and the introduction of threads, highlighting the complexities and trade-offs involved in OS design. The notes serve as a study resource for computer science students, though the author notes potential copyright issues and lack of warranty for accuracy.

Uploaded by

Ayush Meshram
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views69 pages

Operating Systems Study Notes

The document contains notes on operating systems, detailing their role as intermediaries between users and hardware, and outlining the evolution of operating systems from the first generation to the present. It discusses system components, resource management, process management, and the introduction of threads, highlighting the complexities and trade-offs involved in OS design. The notes serve as a study resource for computer science students, though the author notes potential copyright issues and lack of warranty for accuracy.

Uploaded by

Ayush Meshram
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Computer Science notes ⇒ Operating

Systems
This minisite contains notes taken by Chris Northwood whilst studying Computer Science at
the University of York between 2005-09 and the University of Sheffield 2009-10.

They are published here in case others find them useful, but I provide no warranty for their
accuracy, completeness or whether or not they are up-to-date.

The contents of this page have dubious copyright status, as great portions of some of my
revision notes are verbatim from the lecture slides, what the lecturer wrote on the board, or
what they said. Additionally, lots of the images have been captured from the lecture slides.

An operating system is a program that acts as an intermediary between a user of the PC


and the hardware.

System Components
A computer system can be abstracted into various components. Users (people, or
other machines and systems) interact with system and application programs (defines
the way in which system resources are used to solve the computing problems of the
users), which interact with the OS (controls and co-ordinates the use of hardware
among the various application programs for the user), which interacts with the
hardware (the most basic computing resources - CPU, memory, I/O, etc). When
developing an OS, all levels must be considered, not just the OS level.
For users, they interact with the hardware (monitor, keyboard, mouse, etc) to control
the system and application programs. The program needs an efficient interface to
interact with the OS to utilise the resources it needs to accomplish the tasks the user
wants to do and to also communicate with the user. The OS interface describes the
calls that an application program can make into the OS.

The OS is designed for ease of use, which is in terms of not having to deal with low
level hardware issues. This is more important than CPU usage - a user doesn't care if
the CPU is idle 99% of the time. The goal of OS design is to provide a useful,
simple and efficient system:

ease of use
minimise resource usage
maximise resource utilisation

However, abstraction away from the physical machine to provide ease of use can
lead to inefficiencies. There's a trade off required between the extra functionality of
the virtual machine and the overheads required to implement this.

History of Operating Systems


1st Generation (1945-1955)
See CAR for
These are systems which didn't have operating systems. Computer History
Users had complete access to the machine and the
interface was basically raw hardware. Programming was accomplished by
wiring up plugboards to control basic behaviour of the computer - the problems
to be solved were normally numerical calculations.

Operating systems and programming languages were virtually unheard of.

2nd Generation (1955-1965)

These systems had minimal operating systems and were typically mainframe
computers used to support commercial and scientific applications. Jobs and
programs were written using punch cards and the computers were normally
application specific.

The minimal OS was required to interact with I/O devices and a subroutine
library existed to manage I/O interaction. The library is loaded into the top of
memory and stays there. Application programs existed (compiler/card reader),
but were limited.

Optimisations could be made in reducing the setup time by batching similar


jobs. A small computer reads a number of jobs onto magnetic tape and the
magtape is run on the mainframe. A number of jobs is then done in succession
without wasting CPU time. These systems had a small permenantly resident
OS which had initial control and controlled automatic job sequencing.

These systems had a new memory layout and the processor had two modes -
Normal, where some instructions (e.g., HALT) are forbidden, which was used
by the batch programs, and Privileged, where all instructions are available, this
was the level the OS ran in. Some CPUs also implemented protection in
Normal mode to prevent the user program overriding the first section of
memory in which the OS existed. Normal mode also has an instruction to call
the OS (hand back control).
Batch systems were good for CPU bound scientific computing with minimal
I/O, but bad for data processing with high I/O as the CPU is often idle. The
speeds of the mechanical I/O devices is considerably slower than that of the
electronic computer and jobs often have to pause when performing I/O. In
some cases, the CPU was often idle for >80% of the time.

However, with a permenantly resident OS in the memory, the OS now becomes


an overhead, with less physical memory available for the application running at
the time. Programming discipline in the application is still required as the OS
can not protect against the application crashing the machine via bad I/O,
infinite loops, etc...

3rd Generation (1965-1980)

The introduction of ICs led to minicomputers, which tried to merge the


scientific computers with I/O computers and this introduced
multiprogramming. Disk drives allowed jobs and programs to be stored on
disk, allowing the computer to have access to many jobs. The CPU can now
load many jobs into memory by giving each one a memory partition and then
execute them by multiplexing between them.

In these OSs, the OS picks a job and then passes control to it. Eventually, the
job either finishes or waits for some task (e.g., I/O) to complete. In a non-
multiprogrammed system, the CPU would now be idle. In a multiprogrammed
system, the OS switches the CPU to another job.

Eventually, the operation the original job was waiting for completes, enabling
that job to be eligable to be waiting for CPU time. As long as there is at least
one job waiting to execute, the CPU is not idle.

When a job completes, the OS can load another job into that memory slot. This
technique is called spooling - Simultaneous Peripheral Operation On Line, and
is still used today for operations such as print queues.

Multiprogramming OSs are quite complex - they now handle quite complex
I/O devices (disks, etc) and swapping between applications when suspended on
I/O.

Multiprogramming works best when there are a small number of large jobs -
this reduces the overhead of the OS due to managing memory between less
jobs. This method also encourages a fairer overall use of the computer.

Multiprogramming systems led on to time-sharing/interactive systems. The


CPU here is multi-plexed between several jobs that are kept in memory and on
disk. The CPU is allocated to a job only if the job is in memory. When a job
waits on I/O, the OS switches the CPU to another job. Jobs can also be
switched in between memory and the disk. On-line communication between
the user and the system is provided - when the OS finishes the execution of one
command, it seeks the next "control statement" from the user's keyboard.

In time-share systems, the user has the illusion of using the computer
continuously and the user can now interact directly with the computer whilst it
is running. This encourages smaller jobs and more efficient use of the
computer. Additionally, many users can share the computer simultaneously,
submitting jobs/commands directly via keyboards/terminals, etc.

The first time-sharing system was CTSS, developed by MIT. This formed the
basis for MULTICS, the first commercial OS that had limited success. UNIX
was then developed by Bell Labs in the high-level language C, based on B.

4th Generation (1980-present)

The fourth generation of operating systems broguth about the first mass user
operating systems, as LSI and VLSI enabled low-cost computing, with mini
and desktop computers becoming common.

The age of mass computing is based around desktop computers. The desktop
PC is dedicated to a single user and the distinction between desktop and
minicomputers become increasingly blurred. Many minicomputers are now
merely highend desktop PCs, with advanced I/O capabilities and
multiprocessors.

The multiprogrammed OS (UNIX, VMS, etc) was developed further for mass
user computers, adding functionality like: massive file storage, fast
networking, complex application programs and the potential for
multiprocessor. Modern desktop systems have dedicated single user OSs (e.g.,
CP/M, DOS, MS-DOS, Mac OS, Windows) with complex I/O devices:
keyboards, mice, display screens, small printers, graphics tablets, etc, and
complex graphic applications, with user convenience and responsiveness
considered of high importance.

An increased blurring between desktop computers and minicomputers has led


to minicomputer OSs becoming available for desktop, e.g., Linux - based on
the BSD and SysV variants of UNIX.

Much of the current OS trends are based in the relative cost of computers and
people. In the beginning, computers were few with very expensive hardware, so the
people cost was cheap compared to the computer cost, therefore the goal was the
maximise hardware utilisation. Now we have an amazing number of computers with
cheap hardware and the human cost is expensive compared to the hardware cost.
The goal now is ease of use.
Current OS trends are towards specialist requirements - moving away from the
general purpose OS for specialist needs, or tuning existing general purpose OSs. The
introduction of real-time OSs, which require low overheads and high efficiency of
resource usage. Multi-processor machines are becoming more common and OSs
need to update to be able to handle the extra hardware. On the other hand, embedded
devices are now become very common, which have limited power, space and weight
requirements.

Role of the Operating System


Resource Sharing

The OS contains a set of algorithms that allocates resources to the programs


executed on behalf of the user. These resources include time, power, hardware,
etc...

Control Program

The control program controls the operation of the application programs to


prevent errors affecting other programs.

Provision of a Virtual Machine

This hides interfaces to I/O devices, filing systems, etc, and provides a
programming interface for applications.

Kernel

The kernel is the only program resident all the time (all other applications are
application programs).

The modern general purpose OS contains various components to accomplish these


roles, such as:

Process management
Main memory management
File management
I/O system management
Secondary storage management
Protection system
Command interpreter system

All but the last two in the above list can be consdered to be resource management

Processes
A program is an algorithm expressed in some notation. A process is the activity
of executing the program.

Early systems only allowed one program to be executed at a time. Later


systems allowed multiprogramming, where many programs are loaded into
memory at once and only one can be executed at any time. Programs are
termed jobs or processes when executing. A process is sequential and can only
do one thing at a time.

Multiple user processes run in seperate (protected) address space, and can't
directly access other memory spaces. Some user programs can share address
space (e.g., an interpreter), but only the code (which is read only) is shared.
The kernel can access all, however.

As the CPU only has one program counter, so the OS must use a system known
as process switching.

Process Creation

Processes are created by one of 4 principle events:

1. System initialisation (typically, this is the OS and system


programs)
2. By running a process (started by the kernel or user programs)
3. By user execution (normally with interacting with some kind of
shell)
4. Initiation of a batch job (typically only in batch systems, however)

When a new process is created, the OS must assign memory and create a
task/process control block (TCB/PCB) which holds various data about the
currently running process, such as:

Process state
Unique program ID
Program counter
CPU registers
Memory information (this, along with the program counter and
CPU registers forms the process context)
I/O status information (e.g., open files)

Process Hierarchies

When a parent creates a child process, the child process can create its own
child processes, and we get a hierarchy (or tree) of processes. UNIX terms
this as a process group, whereas Windows has no such concept, and all
processes are created equal.

Process State

In single processor systems, only one process can be in the running state
at any time. If all processes are waiting (or blocked), then the processor is
idle, until one of the processes become unblocked, or ready.

The states possible are:

new: The process is being created


running: Instructions are being executed
waiting: The process is waiting for some event to occur
ready: The process is waiting to be run
terminated: The process has finished execution
The switch between processes is called context switching - where the
context of the currently running program is saved in the process context
of the PCB and then the process context from the new PCB is loaded into
the CPU. The context switch time is overhead as the CPU does nothing
useful whilst switching. The time it takes for a switch is dependent on the
hardware support available.

Process Termination

There are various conditions that can terminate processes:

1. Normal exit (voluntary)


2. Exit on an error condition (voluntary)
3. Fatal error (involuntary)
4. Killed by another process (involuntary)

When a process is terminated, all resources used by a process is reclaimed


by the OS, with the PCB last (as the PCB contains information on where
the resources are to be reclaimed).

System Calls

System calls provide an interface between the running process and the
OS. They are available as an assembly level instruction, normally some
kind of software interrupt (INT 0x80 on Linux) which corresponds to a
system call to which the OS reacts appropriately.

Parameters are used in system calls, just as procedure calls. Normally


registers, tables or stacks are used, but in modern processors, tables are
the most popular. Another method is using instructions that force an
address space change without using an interrupt, but the effect is the
same.

System calls normally map onto the services provided by the OS.

POSIX is an IEEE standard based on the UNIX interface. It is not an OS,


merely a standard API which is very vague and covers almost all systems
calls you would ever have to make. Some aspects of most major OSs are
POSIX compliant, but there exists no complete implementation of the
POSIX standard. Linux is primarily POSIX compliant for the core calls.

Threads
Traditional (heavyweight) processes have a single control path through their code.
Many applications are concurrent, but do not want the overhead of multiple
processes and inter-process communication (e.g., a word processor has UI, reading
input, a background spell checker, etc).

Many OS's provide threads (sometimes called lightweight processes), a basic unit of
CPU use. Each is a seperate control path through the code, and threads within a
process are essentially independent. Threads can access all the address space by the
process, and threads have no protection against each other (therefore, you should
program threads safely).

Single and Multi-Threaded Processes

Each thread comprises of a thread ID, program counter, register set and a stack.
The thread shares with other threads belonging to a process, the code section,
the data section and OS resources such as open files. Each thread needs
context, but not as much as the process needs, as it inherits some of the context
of the process.
Threading has various benefits, such as responsiveness - other threads can continue
if one blocks (e.g., on I/O), resource benefits - there's only one copy of the code and
TCB, etc, economy - thread creation and management is less costly than that of
processes, and better utilisations of multi-processor architectures (threads can run in
parallel).

User Threads

Implemented by user level libraries, this provide threads to an application with


the the support of the underlying OS. Here, the application handles its own
threading, so the OS only sees one process. The application has a thread table,
used in the application for keeping track of the thread state and context, in a
similar way to how the kernel handles TCBs for processes. The state diagram
for a thread looks similar to the one for a process:
User threads are usually fast and have low management overheads (as a library,
the threads library is within the address space of the process, so calls to the
threads library functions are fast). Application can also define management
(i.e., scheduling) based on better information on the threads than a more
general policy.

This method does have disadvantages, however. When a thread calls I/O, the
whole process is blocked, as the OS sees the whole process as waiting for I/O.
When an application has threads that are likely to block often, then user threads
may not be the best option.

Kernel Threads

Kernel threads are implemented at the OS level, and therefore has higher
overheads (a thread control space/table is now needed in the TCB) and is
slower, due to syscalls. With kernel threads, the OS schedules all threads, so it
can schedule appropriately (e.g., when one thread blocks, only move that into
waiting instead of the whole process). Threads can also be scheduled over
multiple processes, so we have parallelism, instead of concurrency (which can
introduce its own set of problems).
Hybrid Threading

Both user and kernel threading has their advantages and disadvantages, so
using the hybrid methods give you the best of both worlds. Most real
implementations of threading are hybrid.

A common approach is to multiplex a number of user threads onto a number of


kernel threads; each kernel thread has some set of user threads that take turns
using the kernel thread. There is usually a maximum number of kernel threads
per user process.

Many-to-One

This is the same as normal user-level threading (only with management in


the OS); many user level threads are mapped to a single kernel thread. All
thread management happens in user-space, so this is efficient, but it does
suffer from the one thread blocking the whole process problem.
You could also implement a kernel thread as a process, which is used on
systems that don't support kernel threads.

One-to-One

All threads here are kernel threads - each user thread maps directly on to
one kernel thread. This has more concurrency than the many-to-one
model, as when one thread blocks another can run. This does have the
overhead of having to create a kernel thread for each user thread,
however.

Many-to-Many
This is the normal method of implementing threading. Here, the user
library controls all - which threads are in the kernel, which are not and
also controls swapping, etc. The library tries to control the threads so that
not all of the kernel threads end up blocked at the same time, by swapping
in and out waiting user threads to kernel threads.

This has the advantages of there being control over concurrency in the
application, and the application can allocate user threads to kernel threads
in a way to optimise the concurrency and ensure that at least one user
thread is runnable at any time.

This method is quite complex, however and there are overheads in both
the OS and the application. The user thread scheduling policy may also
conflict with the kernel thread scheduling policy, introducing
inefficiencies.

Threading Issues

fork() creates a seperate duplicate process and exec() replaces the entire
process with the program specified by its parameter. However, we need to
consider what happens in a multi-threaded process. exec() works in the same
manor - replacing the entire process including any threads (kernel threads are
reclaimed by the kernel), but if a thread calls fork(), should all threads be
duplicated, or is the new process single threaded?

Some UNIX systems work around this by having two versions of fork(), a
fork() that duplicates all threads and a fork that duplicates only the calling
thread, and usage of these two versions depends entirely on the application. If
exec() is called immediately after a fork(), then duplicating all threads is
not necessary, but if exec() is not called, all threads are copied (an OS
overhead is implied to copy all the kernel threads as well).

Thread Cancellation

Cancellation is termination of the thread before it is complete (e.g., if multiple


threads are searching a database and one finds an answer, the others can stop).

There are two types of cancellation: asynchronous and deferred (or


synchronous). With asynchronous, the calling thread can immediately
terminate the target thread, but in deferred cancellation, the target thread is
information that it is to be terminated by setting some status in the target
thread. When the target thread consults its status, it finds out it is to terminate
and performs some clear-up and terminates itself. This requires some co-
ordination between threads.

Asynchronous cancellation is useful if the target process does not respond,


otherwise it should not be used, as it can lead to inconsistencies.

Signal Handling

This is used in UNIX to notify a process an error has occured. It is normally


generated by an event and is delivered to the process. Once delivered, the
signal must be handled.

Synchronous signals are delivered to the same process that performs the
operation that caused the signal, e.g., divide by 0 errors, etc...

Aysnchronous signals are delivered to running processes and are generated by


something external to the process, e.g., a key press.

The signal can be handled by a default signal handler, or a user-defined signal


handler. User-defined overrides the default handler (in most cases).

In multi-threaded systems, we have to consider the case of to which thread


should the signal be delivered?

To the thread where the signal applies (how do you know this?)
To every thread (has a lot of overhead)
To certain threads (which ones?)
To a designated thread, which handles all signals (has the problem of
there being the overhead of another thread, and the thread may not be
scheduled in the best way to handle the signal)

CPU Scheduling
CPU scheduling refers to the assignment of the CPU amongst processes that are
ready. It selects from among the processes in memory that are ready to execute and
allocates the CPU to one of them. In terms of process state, the scheduler decides
which of the processes that are in the ready state should be allocated to the CPU and
enter the running state.

The CPU scheduling decisions may take place when a process:

1. switches from the running to the waiting state


2. switches from the running to the ready state
3. switches from the waiting to the ready state
4. terminates

A scheduling decision may also take place as the result of a hardware interrupt.

We call scheduling under conditions 1 and 4 nonpreemptive - once the program is


running, it only gives up the CPU voluntarily. Scheduling under conditions 2 and 3
is preemptive - that is, the running process can be preempted by an interrupt.
Problems may arise when a program is preempted whilst accessing shared data,
which gives us a need for mutual exclusion.

Dispatcher

The dispatcher module gives control of the CPU to the process selected by the
scheduler. This involves:

switching context
switching to user mode
jumping to the proper location in the user program to restart that
program

With the dispatcher, we have a new metric, dispatch latency, which is the time
it takes for the dispatcher to stop one process and start another running.
Dispatch latency is optimal when it is minimised.

The dispatcher is an example of a mechanism, which is how you should implement


scheduling. Policies decide what will be done, and the seperation of policies and
mechanisms is an important OS principle, allowing flexibility if policy decisions are
to be changed later.

Process Scheduling Queues

This is based upon the states in the process state transition graph. The OS
needs to find the TCBs of all processes efficiently, and needs to find a process
waiting on a particular event quickly when that event occurs. Specifically, the
OS needs to find the next process to dispatch (i.e., from the ready queue)
quickly on a context switch. You also need to be able to move processes
between queues efficiently.

A process is only in one queue at a time. For example, you may have a ready
queue, which is the set of all processes residing in main memory, ready to
execute, and also a number of waiting queues, one for each different I/O
device, which is the set of processes waiting on I/O from that device. Processes
move between queues as they change state. Pictorally, this may look like this:
With this model, we have two types of schedulers, the short-term scheduler (or
CPU scheduler), which selects which process should be executed next from the
ready queue and allocates CPU. It is invoked very frequently (every 100 or so
milliseconds), and must be very fast. Additionally, we have the long-term
scheduler (or job scheduler). This selects which new processes should be
brought into the ready queue, and is invoked very infrequently (in terms of
seconds and perhaps even minutes). The long-term scheduler controls the
degree of multi-programming (i.e., the number of processes in memory).
However, how do we decide which jobs should be brought in from main
memory?

Processes can be described as being either I/O-bound (spends more time doing
I/O than computations, and has many short CPU bursts) or CPU-bound (spends
more time doing computation, has a few very long CPU bursts). The long term
scheduler tries to maintain a balance between these two forms.

Some OSs also have a medium term scheduler, which swaps in/out partially
executed processes.

Scheduling Policy

When the OS needs to select another process to execute, the scheduling policy
decides which of the processes to assign to the CPU. Scheduling policy
effectively determines the order of the ready queue so that the next process to
run is at the head of the queue.
Scheduling policies are often defined on the CPU-I/O burst cycle. Most
process executions consist of a CPU execution and an I/O wait. The
characteristics of this cycle help determine policies. The relationship between
relative sizes of CPU and I/O bursts are very important. I/O bursts are very
long compared to CPU bursts in most applications. A histogram of CPU-burst
times gives us a typical CPU-burst graph, showing many short CPU bursts and
few long CPU bursts, which are the CPU bound processes.

Some of the goals of scheduling policies are:

CPU utilisation high (keeping the CPU as busy as possible)


Throughput high (the number of processes that complete their execution
per time unit)
Turnaround time low (the amount of time it takes to execute a particular
process)
Waiting time low (amount of time a process has been in the waiting
queue)
Response time low (amount of time it takes from when a request was
submitted until the first response is produced, not output - for time-
sharing environments)

Turnaround time alone is a bad goal to optimise towards, as overall processes


will take longer.

First-Come, First-Served Scheduling

In First-Come, First-Served (FCFS) scheduling, processes are scheduled


in the order that they arrive in the ready queue.

The average waiting time for FCFS is not optimal, however, as it suffers
from the convoy effect, where short processes are delayed behind long
processes.

Shortest-Job-First Scheduling

Shortest-Job-First (SJF) scheduling works by associating each process


with the length of its next CPU burst, and then use these lengths to
schedule the process with the shortest time next.

This has two schemes: non-preemptive (once the CPU is given to the
process, it cannot be preempted until it completes its CPU burst); and
preemptive (if a new process arrives with CPU burst length less than the
remaining time of the current executing process, preempt), which is
sometimes known as Shortest-Remaining-Time-First (SRTF) .
SJF is optimal for giving a minimum averasge waiting time for a given set
of processes.

Determining the length of the next CPU burst can only be done by
estimating the length. One way of accomplishing this is by using the
length of previous CPU bursts and exponential averaging with the
forumla τn + 1 = αtn + (1 - α)τn, where tn = the actual length of the nth
CPU burst, τn + 1 = predicted value for next CPU burst and 0 ≤ α ≤ 1.

Priority Scheduling

A priority number (integer) is associated with each process and the CPU
is allocated to the process with the highest priority (the smallest integer).
This can be either preemptive or non-preemptive. SJF is a priority
scheduling policy if the priority is predicted next CPU burst time.

This does have the problem of starvation, where low priority processes
may never execute, and the solution to this is aging, where as time
progresses, the priority of the process is increased.

Round Robin

Each process gets a small unit of CPU time (a time quanta), which is
usually about 10-100 ms. After this time has elapsed, the process is
preempted and added to the end of the ready queue. A smaller time quanta
means more context switches and therefore more overheads, so the time
quanta needs to be set to some appropriate value that is large with respect
to the size of the context switch, but not too high as to basically become a
FIFO policy. If there are n processes in the ready queue and the time
quantum is q, then the process gets 1/n of the CPU time in chunks of at
most q time units at once. No process waits more than (n - 1)q time units.

Average turnaround time does not necessarily improve as the quantum


increases, so quantum size is a trade-off between good turnaround and
low overheads.

Multi-level Scheduling

Here, the ready queue is partitioned into seperate queues, e.g., a


foreground queue (for interactive processes) and a background queue (for
batch processes). Each queue has its own scheduling algorithm, for
example the foreground one is a RR queue and the background one may
be FCFS.
Each queue must also be scheduled for CPU time, for example we could
have a fixed priority scheduling algorithm (serve all from foreground,
then from background), but this does have the possibilty of starvation, or
we could have a time-slice, where each queue gets a certain amount of
CPU time which it can schedule among its processes, i.e., 80% to
foreground and 20% to background.

Multi-level Feedback Queue

A process can move between the various queues in a multi-level


scheduling policy, and this is a way of implementing aging. The
multilevel-feedback-queue scheduler is defined by the following
parameters:

number of queues
scheduling algorithms for each queue
method used to determine when to upgrade a process
method used to determine when to downgrade a process
method used to determine which queue a process will enter when
that process needs service

Thread Scheduling

For user-level threads, any general purpose scheduling policy can be used
according to the best policy for that application. General user-level thread
implementations should allow the application sufficient freedom to define an
application-specific scheduling regime, i.e., allow the application to reorder the
thread scheduling queues at any time, with any policy.

Memory Management
Processes (including threads) need physical memory in order to execute. The OS
manages memory and allocates memory to processes from physical memory, and
virtual memory. Physical memory management enables the OS to safely share
physical memory among processes. This allows us to multiprogram the processor, to
time-share the system (overlap computation and I/O) and to prevent one process
from using another memory.

Memory management techniques must meet the following requirements:

Processes may require more memory than available physical memory


You may need to load processes to different locations each time they execute
You may need to swap all or part of a process to secondary storage during
execution (and then reload it at a different memory location, as above)
Memory protection between processes is required
Some simple memory management techniques include loading the process into
memory and switching between them with no protection between the processes
(DOS), or copying the entire process memory to secondary storage when it performs
I/O and then copy it back when it is ready to restart (early UNIX). This is clearly
inefficient.

Most modern OSs work by giving each process a part of memory that it is allowed
to access and check each memory access to ensure that the process stays within its
own memory.

Physical Memory Allocation

Main memory must accomodate both the OS and user processes. Usually, the
main memory is divided into a number of distinct partitions, such as the
resident OS, which is normally a single partition held in low memory, as this is
where CPUs place interrupt vectors. User processes normally consist of one or
more partitions, within memory above the OS. However, we need to consider
how we're going to divide up these user partitions.

The simplest scheme is to partition into regions with fixed boundaries - this is
called fixed partitioning and has two types, where the partitions are of equal
size, or where they are of unequal size. Any process whose size is less than or
equal to the partition size can be loaded into an available partition. If all the
partitions are full, the operating system can swap a process out of a partition.
However, if a process doesn't fit into a partition, it is the programmers
responsibility to make sure that the program can be split over multiple
partitions. With this, main memory use is inefficient. Any program, no matter
how small, occupies an entire partion. This is called internal fragmentation.

With unequal partitions, the problem is reduced, but not eliminated. Instead of
creating x partitions of equal size, partitions of varying sizes are created, so a
program can be loaded into the one most appropriate to its size.

Another method is dynamic partitioning, where the partitions are of variable


length and number and are created when the process asks for them - so, the
process is allocated exactly as much memory as it requires, eliminating internal
fragmentation. It does however leave holes in memory, as processes are
swapped in and out in different places, leaving small gaps between processes.
This is called external fragmentation and could lead to a situation where there
is enough free memory for a process to run, but it is fragmented across main
memory. The way to solve this is using compaction - rearranging the programs
in memory so they are contiguous and all free memory is in one block. This is
very, very slow.
There are different ways of implementing allocation of partitions from a list of
free holes, such as:

first-fit: allocate the first hole that is big enough


best-fit: allocate the smallest hole that is small enough; the entire list of
holes must be searched, unless it is ordered by size
next-fit: scan holes from the location of the last allocation and choose
the next available block that is large enough (can be implemented using
a circular linked list)

First-fit is usually the best and fastest, but analysis has shown that under first-
fit, fragmentation can cause up to half of the available memory to be lost in
holes (external fragmentation) or unused memory (internal fragmentation).
Best-fit has the smallest amount of external fragmentation. Next-fit can
produce external fragmentation similar to best or first, depending upon the next
block found. It does use holes at the end of memory more frequently than
others perhaps leading to the largest blocks of contiguous memory.

A common approach, especially in embedded systems is the buddy system.


Under the buddy system, the entire space is treated as a single block of 2U. If a
process requests memory of size s, such that 2U-1 < s ≤ 2U, the entire block is
allocated, otherwise the block is split into two equal buddies. This process
continues until the smallest block greater than or equal to s is generated.

At any time, the buddy system maintains a list of holes of each size 2i. A hole
on the i + 1 list can be split into 2 equal holes of size 2i in the i list.

Implementing Physical Memory Management

There are two main problems in implementing physical memory management -


how can processes be run in different areas of memory, given that they contain
lots of addresses, and how can the OS provide memory protection.

Memory addresses in a program must be translated into physical memory


addresses, so programs are created to be relocatable, such as the ELF binaries
for Linux. There are three points where binding of program address to
memories can occur:

compile time - if the memory location is known at compile time, static


absolute addresses can be compiled in; any starting address changes
must be recompiled in.
load time - here, the code generated is relocatable and the code can be
loaded at any address and the final binding occurs when the program is
loaded. If the memory locations change, the code just needs to be
reloaded.
execution time - the binding is delayed until runtime if the process can
be moved from one memory segment to another during execution. This
method does require hardware support.

To work round this, we have the concept of a logical address space that is
bound to a seperate physical address space, which is the key to memory
management. Logical addresses are the ones generated by the CPU, and
physical addresses are the ones seen by the memory unit. When using compile
or load-time address-binding schemes, the logical and physical addresses are
the same, but with the execution-time addres-binding scheme, the logical
(virtual) and physical addresses differ. This is accomplished using a memory
management unit.

The memory management unit (MMU) is a hardware device that maps a


logical address to a physical one, so the user program only deals with logical
addresses, and never sees the physical one.

The most simple form of MMU is the relocation register. The relocation
register holds the value of an offset and then adds the offset to all logical
addresses given to it to give the physical address. In this system, logical
addresses are in the range 0..max for all processes and physical addresses are in
the range R..R+max, where R is the value in the relocation register. The
process only needs to generate the logical addresses, and only sees the process
running in addresses 0..max. This principle of mapping is central to memory
management.

The relocation register can provide memory protection with the addition of a
limit register, which contains the allowed range of logical addresses - each
logical address must be less than the limit register. Hardware support is
obviously required for this.

Another way of thinking of this is not thinking of a program as one large chunk
of contiguous memory - a process may be broken up into pieces that do not
need to be loaded into main memory all the time. This increases opportunities
for multiprogramming as with many processes in main memory, it is more
likely a process will be in the ready state at any particular point. This is
achieved by programming techniques with minimal OS support.

Static vs. Dynamic Loading

So far, we have assumed that the entire program data and code must be
loaded into memory for process to execute - this is called static loading.
An alternative approach is called dynamic loading which works by not
loading routines until it is called. This gives a better memory-space
utilisation and unused routines are never loaded. This is useful when large
amounts of code are required to handle infrequently occuring cases. If
dynamic loading is implemented through program design, no special
support from the OS is required.

Dynamic loading enables a small part of the process to be loaded initially


- usually around the program entry point with part of the data and for a
greater degree of multiprogramming as only parts of the program are
needed at any time in physical memory. It is usually the responsibility of
the application to ensure that the appropriate parts of the application are
loaded at any time.

Some OSs do provide libraries to aid loading parts of a program.

Overlays

These are slightly more structured, and codes are put into phases that are
loaded over one another. Again, the motivation is to keep only code and
data in memory that is actually needed. When one phase completes, the
application arranges to overlay its memory with the next phase when
required. The total memory required is the same as the memory required
by the largest phase. Some OS support may be given to implement this.

This is useful when there is a limited amount of physical RAM, and some
secondary storage is available to hold code/data, e.g., in embedded
systems.

Shared Libraries

Shared libraries can be implemented in two ways - using static or


dynamic linking. In static linking, all libraries required by the program are
linked into the image (i.e., form part of the data/code of the program), so
the memory allocated must be big enough to hold the application and the
libraries.

With dynamic linking, the libraries required by the program are


dynamically linked at execution time (just-in-time), so the memory
allocated must be big enough to hold the application. Shared libraries are
allocated from a different area of memory.

Swapping
A process (or part of a process) can be swapped temporarily out of
memory to a backing store (e.g., disk) and then brought back into memory
for continued execution. This is how medium-level scheduling is used. If
you want to reload at a different address, execution time binding is
required.

The issue with swapping is the backing store - a fast disk large enough to
accomodate copies of all memory images for all users that must provide
direct access to these memory images is required. A major part of the
swap time overhead is the transfer time. The total transfer time is directly
proportional to the amount of memory swapped.

One of the more common swapping variants is called roll-out, roll-in and
it is used with priority-based scheduling algorithms. Lower-priority
processes are swapped out so higher-priority processes can be loaded and
executed.

Many different varieties of swapping can be found on many OSs, such as


UNIX, Linux, Windows, etc...

Paging

The problems with using physical memory allocation with unequal, fixed
or variable sized partitions are fragmentation and inefficiencies
(algorithms to track holes, and compaction). Paging and segmentation are
memory allocation techniques that largely eliminate these problems.

Many of these problems arise as memory allocation is of contiguous


blocks of memory to processes. Paging removes this restriction, as the
logical address space of a process can be noncontiguous. The physical
memory space is divided into fixed-size blocks called frames, and logical
memory is divided into blocks of the same size called pages. Pages are
then allocated and fit into frames.

To implement paging, we need to keep track of all free frames. To run a


program of size n pages, we need to find n free frames and load the
program into them. There is a problem with internal fragmentation if a
process size is not an exact number of frames. Only part of one frame is
lost to this fragmentation, however, and if we make the frame size quite
small, this isn't a significant problem.

With paging, the logical address generated by the CPU is divided into a
page number p and a page offset d. The page number is used as an index
into a page table which contains the base address of each page in physical
memory and the page offset is combined with the base address to define
the physical memory address that is sent to the memory unit.

Usually, page tables are a fixed length, often the same size as a frame. If a
process does not require all entries in the page table, the redundant entries
are set to be invalid. If a reference is made to an invalid page table entry,
the OS must handle a page fault and there is the potential for an erroneous
logical address to have been generated by the application process.

The page table is normally kept in main memory, so in this scheme every
data/instruction access requires two memory lookups, one for the page
table and then one for the actual instruction. The two memory access
problem can be solved using a special peice of hardware called a
translation look-aside buffer (TLB).

TLBs are implemented using associative memory, which allows for a


parallel search of page numbers in order to get the frame number. The
TLB is not a part of normal physical memory.
Memory protection in paging is implemented by associating a protection
bit with each frame. The valid-invalid bit is attached to each entry in the
page table, where "valid" indicates that the associated page is in a process'
legal address space and is thus a legal page, whereas "invalid" indicates
that the page is not in the legal address space of the process.

Modern systems allow processes to have large logical address spaces, so


the page table itself can become very large. Hierarchial page tables can be
used to break up the logical address space into multiple page tables.

In address spaces > 32 bits, hashed page tables have


become common. Here, the virtual page table is See hash tables
hashed into a page table. in ADS

Often, having one page table per process can be expensive in terms of
physical memory. The alternative is the inverted page table, where there is
one entry for each real page of memory and the entry consists of the
virtual address of the page stored in that real memory location, with
information about the process that owns that page. This decreases
memory usage, but increases the time needed to search the table when a
page reference occurs. Hash tables can be used to limit the search to one,
or at most a few, page-table entries.

Segmentation

Paging seperates the user's view of memory from actual physical memory.
A user thinks of the program as a collection of segments. A segment is a
logical unit, such as: the main program, procedures, functions, methods,
objects, local variables, global variables, common blocks, the stack, the
symbol table and arrays. Segmentation is a memory management scheme
that supports a user view of memory.

In segmentation, you have a logical address consisting of a segment


number and an offset and a segment table that maps two-dimensional
physical addresses. Each table entry has a base that contains the starting
physical address where segments reside in memory and the limit that
specifies the length of the segment.
With each entry in the segment table, a validation bit (one that indicates
whether or not it was an illegal statement) and read/write/execute
privileges are associated with the entry. Since segments can vary in
length, memory allocation is a dynamic storage-allocation problem, which
can lead to external fragmentation.

Segmented Paging

Both paging and segmentation have advantages and disadvantages and


most succesful processor families are moving towards a mixture of
segmentation and paging. Segmented paging usually uses a segment table,
a multi-level page table and an offset.

Virtual Memory
Virtual memory is the seperation of user logical memory from physical
memory. Only part of the program needs to be in memory for execution (i.e.,
only the part needed for current execution - follows from the principle of
locality, where programs and data references inside a process tend to cluster
and only a few pieces of a process will be needed in any period of time). The
major advantages of virtual memory is that the logical address space can be
much larger than the physical address space and more processes can be partly
in physical memory at any time.

Virtual memory can be much larger that physical memory, with n pages of
virtual memory mapped to m pages of physical memory and other pages on
secondary storage.

Virtual memory can be implemented assuming a virtual address space of


0..max, where max is usually a power of 2, e.g., could be 232 bytes, or 4 GiB,
in 32-bit systems. The process can be loaded anywhere in physical memory
and often the physical memory size is significantly smaller than the virtual
address space of an individual process.

Physical memory nowadays is very large, so virtual memory is only useful for
large data sets, and for compilation models.

Demand Paging

Demand paging brings a page into memory only when it is needed. This means
there is less I/O, less memory, a faster response time and more
multiprogramming.

A page is needed when there is a reference to it (i.e., an instruction/data


address). An invalid address will lead to an abort, and a not-in-memory request
will lead to a fetch to memory - this is called a page fault.

When the first reference to a page occurs, the page will not be in memory
under this system, so a page fault occurs. This is a non-maskable interrupts
(can't be covered up by the user or OS) and it is synchronous, you need to
know the running process that caused the interrupt. This allows for the OS to
bring the appropriate page into main memory and for it to detect invalid
memory references - the OS contains a range of valid addresses for a process.
The process often does not need the full range of virtual memory, hence code
size, data size and stack size can be used to check if a memory reference is
valid. This information is stored on a per process basis in the TCB.

When a page fault occurs, the OS looks at the internal table to decide if the
process is permitted to make the memory reference and if it is, it finds an
empty frame and swaps the page into that frame, updates the page table and the
internal OS tables and continues the instruction that caused the page fault - as
with other standard interrupts.

Pure demand paging is where you start executing a program with no pages
loaded into memory. This will trigger a page fault immediately, however.

Demand paging is more complex when a TLB is involved. When a virtual


address is given, the processor consults the TLB, and if a TLB hit occurs, the
real address is returned and main memory is consulted in the usual manner. If
the page table entry is not found in the TLB (a miss), the page number is used
to index the process page table, and then a check occurs to see if the page is in
main memory. If it's not, a page fault is issued and the non-TLB process as
above is followed, with the TLB being updated as well as the page table.

Page Replacement

Page replacement is needed when there are no free physical memory


frames.

When we want to load a page into memory, we try to find a free frame to
put the page into. If there is no free frame, then we use a page
replacement algorithm to select a victim frame, and then the victim is
copied back onto disk and the new frame is loaded into its place and the
page and frame tables are updated.

The aim for designing a page replacement algorithm is to give the lowest
page-fault rate. An algorithm is evaluated by running a particular string of
memory references and computing the number of page faults on that
string. An ideal page fault graph would look like:
Some algorithms, however give more page faults if the number of pages
is increased - this is called Belady's Anomaly.

The simplest page replacement algorithm is the first-in-first-out method,


which can be implemented using a circular linked list.

Another algorithm is the least recently used (LRU) algorithm, which


replaces the page that has not been referenced for the longest time. By the
principle of locality, this should be the page that is least likely to be
referenced in the future. Each page could be tagged with the time of last
reference, but this would require a great deal of overhead. LRU can be
easily implemented using a stack of page numbers; when a page is
referenced, it is moved or placed on the top of the stack, and when
looking for something for a replacement, the bottom of the stack is all that
you need.

The second chance algorithm uses a reference bit, which is initially set to
0 and then is updated to 1 when it is accessed. The second chance
algorithm uses a circular list of pages. When searching for a page to
replace, if a page has a reference bit one, it is set to 0 and the page left in
memory, and the next page in circular order is replaced subject to the
same rules.

We can also consider counting algorithms, where a counter is kept of the


number of references that have been made to each page. The least
frequently used (LFU) algorithm replaces the page with the smallest
count. The most frequently used (MFU) algorithm is based on the
algorithm that the page with the smallest count was probably just brought
in and has yet to be used.

The optimal algorithm replaces the page that will not be used for the
longest period of time. It is impossible to have perfect knowledge of
future events, but if you already know what's happening (as with a
reference string), you can use this algorithm to measure how well your
algorithm performs.

The OS may wish some frames to remain in physical memory frames, such as
parts of the OS itself, control structures, I/O buffers, etc..., which is why we
can use page and frame locking. When a frame is locked, it may not be
replaced. This usually requires a lock bit to be added with each frame, and is
usually supported in the OS, not hardware.

Each process requires a minimum number of pages. If indirect referencing is


used, each reference requires a page, so each level of references increases the
amount of pages required.

Frame Allocation

Again, there are different methods of allocating frames. If there are 100
frames and 5 processes, each process can be given 20 pages. This is called
equal allocation. A different way is called proportional allocation, where
allocation is given according to the size of the process: ai, the allocation
for process pi = si/S × m, where si = size of process pi, S = Σsi and m =
total number of frames.

A variation of the proportional allocation scheme is the priority allocation


scheme, where the proportions are computed based on priorities, rather
than size (a higher priority process will execute quicker if it has more
frames).

If a process pi develops a page fault you can either select a replacement


from one of its frames or a replacement from a process with a lower
priority number. This brings us to the question of global or local
replacement. Global replacement will select a replacement frame from the
set of all frames, so one process can take a frame from another. In priority
based allocation schemes, this may allow higher priority processes to
increase their allocation of frames by taking a frame from a low priority
process. Problems arise as the page fault behaviour of one process will
also depend on the behaviour of other processes.
In local replacement, each process selects from only its own set of
allocated frames, so the process' page fault behaviour is entirely
dependent upon itself, and not upon other processes. The process
execution time becomes more consistent across multiple executions.

Thrashing

If a process does not have enough frames allocated, the page fault rate can
get very high, so a lot of swapping back and forth between main memory
and secondary storage is done, and the result is thrashing, which is where
more work is done by the OS swapping and paging than is done by the
application in doing something useful. This has severe performance
implications.

A common form of thrashing is due to the OS monitoring utilisation and


many page faults occur, decreasing utilisation. As utilisation decreases,
the OS increases the degree of multiprogramming to try and increase it,
which in turn increases the page faults and decreases utilisation...

Thrashing occurs under global frame replacement more as frames can be


stolen from other (active) processes. Local frame replacement limits
thrashing to within individual processes. If a process starts thrashing, it
cannot steal frames from another process and cause it to start thrashing.

To prevent thrashing, a process must allocate a sufficient number of


frames. We can predict a sufficient amount of frames using the locality
model. The diagram below shows how memory accesses change over
execution of a program.
Thrashing will occur when the size of the locality > memory (pages)
allocated to the process.

The working set model assumes locality - it defines a per-process working


set window which contains the most recent page references for a process.
If a page is in active use, it will have been accessed in the working set
window, and will be in the working set. If a page is not being used, it will
drop from the working set sometime after its last reference. The working
set approximates the locality of a process.
Formally, we have Δ, the working set window width in time and WSSi,
the working set of process Pi is the total number of pages accessed in the
most recent Δ. The working set will vary with time, and if Δ is too small,
it will not cover the entire locality, and if Δ is too large, it'll cover several
localities. If Δ = ∞, it will cover the entire program. The most important
property is the total size of the working sets in the system. D = ΣWSSi. If
D > m (the total number of pages), thrashing will occur, and a process
must be suspended.

The working set can be implemented with an interval timer and a


reference bit. For example, if Δ = 10,000, each process in the OS has bits
to indicate whether it has been accessed in the last 5000, 10,000 or 15,000
cycles. On a page access, the 5000 bit is set, and a timer interrupt every
5000 cycles shifts all the bits along one, with the 5000 reference bit set to
0. If any of the bits in memory = 1, the page is in the working set.

This is not entirely accurate, as you cannot tell within an interval of 5000
when a page reference actually occured. You can improve this by
interrupting more frequently and increasing the number of reference bits,
but this has a greater overhead and reduces the time available to do useful
work.

The working set model is quite clumsy and expensive to implement, and
page-fault frequency is better. An "acceptable" page-fault rate is better,
and if the page fault rate travels outside of these bounds, then the process
loses a frame, and if the page-fault rate gets too high, the process gains a
frame.
Inter Process Communication
We need processes to co-operate to carry out particular jobs, e.g., one process
requests a service from another (such as requesting an I/O operation from the kernel,
as the kernel is still a process), pipelined processes working together (one process
can't proceed until the previous one has finished) and where multiple processes work
on the same task, for example one might read from disk to memory, and another
reads from memory to the audio device. This can be accomplished using inter
process communication (IPC).

IPC has numerous advantages, such as:

Information sharing (e.g., of shared files)


Computation speed-up (e.g., a program can be split across multiple
computers)
Modularity (for software engineering reasons)
Convenience (e.g., to edit, compile and print in parallel)

Independent processes cannot affect or be affected by the execution of another


process (all processes are in a seperate protected memory address space). Co-
operating processes can affect or be affected by the execution of another process, via
some form of IPC mechanism.

The OS must solve some of the problems that IPC causes, however. It must provide
mechanisms to control interaction betwen the processes and must order accesses to
shared resources by processes. The OS must also provide mechanisms by which
resources can wait until the resource is available.

A paradigm for co-operating processes is the producer-consumer problem. A


producer process produces information that is consumed by a consumer process.
There are two models which can be used here, one is the unbounded-buffer, which
places no practical limit on the size of the buffer, and the buffer is just expanded as
the producer fills it and the other is the bounded-buffer, which assumes there is a
fixed buffer size and the producer must wait for the consumer to remove data if the
buffer is full.

One way to implement the bounded-buffer solution is to use


a circular array/list, but this only works if the processes can See lecture notes
share memory. Another implementation that maximises the for full details
use of the buffer is to use an integer called a counter which is
then incremented when a producer writes to the buffer and decremented when a
consumer reads from it. These operations must be performed atomically though,
which isn't always the case if a context switch occurs during the increment and
decrement process.

We now have a race condition, which is the situation where several processes access
and manipulate shared data concurrently and the final value of the shared data
depends on which process finishes last, so the data may become inconsistent. To
prevent race conditions and to ensure consistent data, concurrent processes must be
synchronised. This requires mechanisms to ensure the orderly execution of co-
operating processes.

Critical Sections

Here, we can use critical sections. If we have n processes competing to use


some shared data, each process has a code segment called the critical section,
in which the shared data is accessed. The problem is to ensure that when one
process is executing in its critical section, no other process is allowed to
execute in its critical section. Solutions to this problem include:

Mutual exclusion (mutexs) - if a process is executing in its critical


section, then no other processes can execute in their critical sections
Progress - if no process is executing in its critical section and there
exists some program that wish to enter their critical section, then the
selection of the processes that will enter their critical section next can
not be postponed indefinately
Bounded waiting - a bound must exist on the number of times that other
processes are allowed to enter their critical sections after a process has
made a request to enter its critical section and before that request is
granted
One way to accomplish this is to use command such as
EnterCriticalSection() and LeaveCriticalSection().
Processes may share some common variables to synchronise their actions. For
example, we could have an integer value called turn, which if it matches the
value associated with that process, then the process can enter its critical section
and then increment the turn counter. This satisfies the mutual exclusion
requirement, but not progress.

A second algorithm is to use guard flags, which consists of an array of flags.


When a process is ready to enter its critical section, it sets its flag to true and
waits until all other processes have completed their critical sections and then
sets its flag to false. This satisfies mutual exclusion again, but not progress.

The final algorithm is a mixture of turn and flags, which waits for the turn of a
process that is waiting to enter its critical section to finish and then enter its
own. This meets all three requirements, and solves the critical section problem
for two processes.

However, we need to solve the problem for n processes, and one way of
implementing this is to use the Bakery (or Lamport) algorithm. Before entering
its critical section, a process receives a number, and the holder of the smallest
number enters its critical section. If processes i and j receive the same number,
then if i < j, then i is served first, else j is served first. The number scheme used
always generates numbers in an increasing order of enumeration.

As in many aspects of software, hardware can make programming easier.


Hardware exists that can solve the critical section problem. If an operation
TestAndSet existed, which could test and modify the contents of a word
atomically, mutual exclusion could be implemented. This does not satisfy the
bounded wait requirement, however.

The software algorithms discussed above for exclusion of processes from critical
sections are problematic as they rely on the continuous testing of variables (busy
waiting), which is inefficient and wasteful of CPU cycles. The protocols are
complex and unenforceable, relying on programmer discipline and co-operation.
The OS implements efficient IPC mechanisms instead.

Signals

POSIX/Linux signals are a dataless


communication either between processes, or We've already discussed
between the OS and a process. Signals can be signal handling in threads
thought of as software interrupts and require a in the threads section.
handler within a process. When the event that
causes the signal occurs, the signal is said to be generated and when the
appropriate action for the process in response to the signal is taken, the signal
is said to be delivered. In the interim, the signal is said to be pending. Signals
consist of a number of types, such as system defined (normally up to 31)
corresponding to a particular event, or user-defined (normall integers > 31).
Some of these signals include:

SIGINT (2) - interrupt generated from a terminal


SIGILL (4) - interrupt generated by an illegal instruction
SIGFPE (8) - floating point exception
SIGKILL (9) - kill a process
SIGSEGV (11) - segmentation violation
SIGALRM (14) - alarm clock timeout
SIGCHLD (17) - sent to parent when child exits

Synchronous signals occur at the same place, every time in the program (e.g.,
SIGFPE, SIGILL, SIGSEGV) and are caused by the program itself.
Asynchronous signals may occur at any time from a source other than the
process (another process e.g., SIGKILL, a terminal driver on a key press e.g.,
SIGINT on Ctrl+C, on an alarm clock, e.g., SIGALRM) and a process can
choose how to handle the specific signal.

A signal can be treated in the following ways, it can be ignored (except for
SIGKILL), the default action set by the OS can be executed (the default action
is decided by the signal used, such as terminating the process, suspending the
process, terminating the process and saving a copy in memory, continue a
suspended process or just ignore), or a specific signal-catching function can be
invoked.

There are essential system calls for signalling:

kill(pid, signal) - send a signal of value signal to the


process with ID pid.
sigaction(handler, signal) - installs a handler for a
specific signal (handlers are specific to each process)
sigprocmask(..., mask, ...) - signals can be blocked or
masked in much the same manner as hardware interrupts can be masked
by the OS
sigpending(mask) - signals that are raised whilst they are masked
out can be examined
sigsuspend(mask) - a process can wait until a signal arrives

Semaphores

Semaphores are "a non-negative integer, which apart from initialisation can
only be acted on by two standard, atomic, uninterruptible operations, WAIT
(which decrements the value of the semaphore) and SIGNAL (which
increments the value)" - E. Dijkstra

You must guarantee that a maximum of one process can execute a WAIT or
SIGNAL operation on the same semaphore at the same time (atomicity).

There can be two types of semaphore, a counting semaphore (the integer value
can range over an unrestricted domain) and a binary semaphore, where the
binary value can range only between 0 and 1. This can be simpler to
implement.

Semaphores can be used to implement mutual exclusion for the critical section
of n processes. The shared data is a binary-semaphore mutex which is WAITed
on when a program wants to enter the critical section and then SIGNALed
when the critical section completes.

Semaphores can also be used to implement synchronisation. For example, B


can only be executed in program j only once A has executed in program i.
When a semaphore is used to indicate a condition, it is called a condition
variable and synchronisation using a condition variable is called condition
synchronisation.

This can introduce two conditions, however: starvation, which is indefinite


blocking when a process may never be removed from the queue in which it is
suspended; and deadlock, where two or more processes are waiting indefinitely
for an event that can only be caused by only one of the waiting processes.

Counting semaphores can be used for resource management. If you consider a


resource with m allocatable elements, the counting semaphore is set to m and
then users can wait for a resource by waiting on the semaphore and then free
up a resource by signalling a semaphore.

The classic problems of synchronisation, such as the bounded-buffer problem,


the readers and writers problem and the dining-philosophers problem can be
solved using semaphores and mutexs.

OS level semaphores ensure mutual exclusion over a shared resource if and


only if the application is disciplined. For example, the programmer should only
access shared data if and only if it is holding the appropriate semaphores, and it
should issue a signal to release the semaphores when it finishes the critical
section.

Semaphore operations still require atomicity. If an operation is in the OS, it can


be accessed by a process using a system call. When it is in the OS, the call will
execute atomically, or at least seem atomic from the users point of view, if
interrupts are enabled. True atomicity can only be achieved with hardware
support, such as the TestAndSet instruction discussed earlier.

POSIX semaphores support two interfaces (Linux does not support these fully).
Unnamed semaphores are used within a process (or between related processes)
and are created by a call to sem_init(...) by a process. Operations
include sem_wait(...) for a wait and sem_post(...) for a signal.
sem_trywait(...) also exists, which works like wait if the semaphore has
a greater count than 0, but doesn't wait if the semaphore is 0, returning
immediately to the calling process. These semaphores are destroyed by a call to
sem_destroy(...), or if the creating process dies.

Named semaphores can be used between any process. The namespace is the
same as the filesystem (so the name of a semaphore is effectively a filename).
These are created using sem_create(...) and then can be manipulated as
above. sem_close(...) removes the link between a process and a
semaphore, and they can live on past the termination of the creating process, so
must be explictly destroyed using sem_destroy(...). Here, the OS
merely stores a list of processes waiting on a particular semaphore (i.e.,
processes that are blocked on a semaphore), but hardware support is still
required for atomicity.

Linux implements SysV semaphores (i.e., as it existed in System V UNIX),


which are not POSIX compliant. They are system wide resources, managed by
the OS, and any process can access a semaphore, which is stored in the
semaphore table of an OS. Allocation occurs via semget(...), which
creates a set (which may be of one) of semaphores or associate with an existing
set of semaphores. All operations on semaphores occurs using
semop(semaphore, op, ...) and semctl(...) allows you to
examine or change the value of a semaphore, e.g., to initialise it, as well as
telling you how many processes are waiting on the semaphore and allowing
you to destroy it.

Semaphores can also be used to provide mutual exclusion between threads,


such as provided by the Linux pthreads library which provides mutexes.
pthread_mutex_init() provides initialisation for a mutex,
pthread_mutex_lock() obtains a lock for a mutex and
pthread_mutex_unlock() will give up a lock. The semaphores between
pthreads are supported by the standard Linux kernel semaphores.

Conditional synchronisation can also be used, where a condition enables


threads to atomically block and test a condition under the protection of mutual
exclusion, until the condition is true. If the condition is false, the thread blocks
on the variable and automatically releases the mutex. If another thread changes
the condition, it may wake up waiting threads by signalling the associated
condition variable. Waiting threads will wake up and then reacquire the mutex
and reevaluate the condition. pthread_cond_init() will initialise a
condition variable, which must be a mutex initialised by
pthread_mutex_init(). pthread_cond_wait() and
pthread_cond_signal() take a condition variable and a mutex as
parameters.

Using the low-level IPC mechanisms mentioned above relies on programmer


discipline for the correct use of semaphores to wait and signal in the correct order at
the correct point in the algorithm. Additionally, it requires an agreement between
communicators (i.e., between the reader and the writer) on the precise protocol to be
used, and problems can be encountered if the writer assumes writers have priority
and readers assume that the readers have priority over the shared data. Sometimes,
hardware support is required to provide atomicity (TestAndSet, interrupt disabling,
etc).

High-level IPC provides a structured IPC less susceptible to programmer error. It


incorporates both the means of shared data control and the shared data itself, as
opposed to low level IPC, such as semaphores, where the shared data control is
distinct from the shared data (note, the POSIX implementation assumes that the
shared data in a file could be accessed from processes without waiting for the
appropriate semaphore).

Message Passing

Message passing is a mechanism for processes to communicate and


synchronise their actions. In a message system, processes communicate with
each other without resorting to shared variables. If P and Q wish to
communicate, they need to establish a communication link between them and
exchange messages via send/receive.

This does require OS support and a number of implementation issues must be


resolved - where are messages stored whilst they are in transit, how many
processes can join in with a communication, and is the message size fixed or
variable, is the link uni or bi-directional, synchronous or asynchronous, etc...

With message passing, processes must name each other explicitly, such as:
send(P, message), to send a message to process P and receive(Q,
message) to receive a message from process Q.

With this method, links are established automatically and a link is associated
with exactly one pair of communicating processes, and between each pair their
exists exactly one link. The link may be unidirectional, but it is usually
bidirectional.
A method of indirect communication involves using messages being directed
and received from mailboxes. Each mailbox has a unique ID and processes can
only communicate only if they share a mailbox. This introduces the problem of
deciding who gets a message if a mailbox is shared, however. The primitives in
this method are: send(A, message) and receive(A, message),
which sends a receives messages from mailbox A.

In this case, a link is only established if processes share a common mailbox,


and a link may be associated with many processes. Each pair of processes may
share several communication links and the links may be uni- or bi-directional.

Mailboxes can be owned by either the OS or a process. If it is OS owned (in


OS address space), the OS is independent and not attached to any process, so
operations and system calls are required for the process to create a new
mailbox, send and receive messages through the mailbox and to delete
mailboxes. If a mailbox is process owned (in the address space of a process)
the process with a mailbox receives messages and other processes send
messages to the mailbox. An operation or system call is still required to write
to the mailbox.

Message passing can be considered to either be blocking or non-blocking.


Blocking is considered synchronous; in a blocking send, the sending process is
blocked until the message is received by the receiver or a mailbox, and in a
blocking receive, the receiver is blocked until the message is available. Non-
blocking is considered asynchronous; in a non-blocking send, the sending
process sends the message and then resumes execution, and for a non-blocking
receiver, the receive call will either get a message or a NULL. The send and
receive primitives may be either blocking or non-blocking, depening on the
implementation.

The link (direct communication) or mailbox (indirect communication) will


contain messages that have been sent or not received. This is called a buffer,
and they can be implemented in more than one way:

Zero capacity - 0 messages are held, and the sender must wait for the
receiver
Bounded capacity - here there is a buffer with a finite length of n
messages and the sender must wait if the link is full
Unbounded capacity - infinite length, and the sender never waits

Client-Server Communication

When the processes are on different machines, network communication is


required. Here, three forms of IPC are available: sockets, remote
procedure calls (RPCs) and remote method invocation (Java). These will
be discussed in the NDS course.

In POSIX, message queues between processes are used. Messages are stored in
FIFO order, and the queue is set up so that either the process sends or receives
blocking (on the full or empty queue respectively), or a the process sends or
receives non-blocking. A process reading from an empty queue will block.

In POSIX, the process creates or connects to an existing one using


mq_open(name, ...), where name is a filename in the same namespace
as POSIX semaphores. Sending and receiving messages occur using
mq_send(...) and mq_receive(...), and process can close its
connection to a message queue by mq_close(...), however the message
queue will still exist after this call. To destroy a message queue,
mq_unlink(...) is used before mq_close(...), and the queue will be
destroyed after all processes attached to the queue have unlinked.

High-level Synchronisation Mechanisms

Semaphores provide a convenient and effective mechanism for process


synchronisation, but their incorrect use can result in problems difficult to
detect. To combat these problems, a number of high-level synchronisation
constructs have been proposed: critical regions and monitors.

These processes are not implemented at the OS level, but usually as part of the
programming language. There is no atomicity at the application level.

Critical regions are a high-level synchronisation construct. A shared variable v


of type T is declared as v : shared T. The variable v is only accessible
inside the statement region v when B do S, where B is a boolean
expression, which means that when statement S is being executed, no other
process can access variable v.

Regions referring to the same shared variable exclude each other in time (i.e.,
mutual exclusion is enforced). When a process tries to execute the region
statement, the boolean expression B is evaluated. If B is true, S is executed, but
if B is false, the process is delayed until B becomes true and no other process is
in the region associated with v.

Critical regions are implemented using semaphores. With the shared variable x,
the following are associated: semaphore mutex, first-delay,
second-delay; int first-count, second-count;. Mutually
exclusive access to the critical section is provided by mutex. If a process can
not enter the critical section because B is false, it initially waits on first-
delay and is moved to second-delay before it is allowed to re-evaluate B.
A track is kept on the number of processes waiting on first-delay and second-
delay by using first-count and second-count respectively.

This algorithm assumes a FIFO ordering in the queueing of processes for a


semaphore. For an arbitrary queueing discipline, a more complicated
implementation is required.

Monitors are high-level synchronisation constructs that allows the safe sharing
of an abstract data type among concurrent processes. The monitor encapsulates
shared data together with procedures that act upon that data. Only the
procedures can be accessed from outside the monitor; the data inside the
monitor can not be directly accessed from outside the monitor. Only one
process can be active inside a monitor at a time.
To allow a process to wait within a monitor, a condition variable must be
declared as: condition x, y;. The condition variable can only be used
with the operations wait and signal. [Link]() means that the process
invoking the operation is suspended until another process invokes
[Link](), which resumes exactly one suspended process. If no process is
suspended, then the signal operation has no effect.

A conditional-wait construct also exists: [Link](c), where c is an integer


expression evaluated when the wait operation is executed. The value is a
priority number stored within the name of a process that is suspended. When
[Link]() is executed, the process with the smallest associated priority
number is resumed next.

Two conditions need to be checked to establish the correctness of a system:

User processes must always make their calls on the monitor in a correct
sequence.
You must ensure that an uncooperative process does not ignore the
mutual-exclusion gateway provided by the monitor and try to access the
shared resource directly, without using the access protocols.

Deadlocks
Deadlocks occur when a set of blocked processes are each holding a resource and
waiting to acquire a resource held by another process in the set. Deadlocks can be
characterised by the bridge-crossing problem:

There is only one lane across the bridge. We can view each section of the bridge as a
resource, and if a deadlock occurs, it can be resolved if one car backs up (pre-empt
resources and rollback), but sometimes that will require the cars following it to be
backed up also. Starvation can also occur, if only cars from one direction are
allowed to pass.

Deadlock can arise if four conditions hold simultaneously in a situation:

1. Mutual exclusion: only one process at a time can use the resource
2. Hold and wait: a process holding at least one resource is waiting to acquire
additional resources held by other processes
3. No preemption: a resource can only be released voluntarily by the process
holding it, after the process has completed its task
4. Circular wait: there exists a set {P0, P1, ..., Pn} of waiting processes such that
P0 is waiting for a resource that is held by P1, which is waiting for a resource
that is held by P2, ..., Pn - 1 is waiting for a resource that is held by Pn, that is
waiting for a resource that is held by P0.

We could plot a resource allocation graph which consists of a set of vertices V and a
set of edges E, such that V is partitioned into two types, P = {P1, P2, ..., Pn}, the set
of all processes in the system and R = {R1, R2, ..., Rm}, the set of all resource types
in the system. Each process utilises a resource using: requests, a directed edge Pi →
Rj, which if it can not be assigned immediately, the requesting process is blocked;
and resource use, indicated by a directed edge Rj → Pi. That is, the process uses the
resource. When a resource is released, edges are removed.
The above resource allocation graph has no deadlocks, as there are no circularities.
P3 has all the resources it needs; when it frees R3 then P2 can proceed; when P2
releases R1, then P1 can proceed.
The above resource allocation graph does have a deadlock, however. Mutual
exclusion holds over all of the resources, a number of processes hold resources and
are waiting for others and no process wants to release a resource. P3 holds R3 and
wants R2; P2 holds R1 and R2 and wants R3 and P1 holds R2 and wants R1. This is a
circular wait, and we have all the conditions for deadlock.

If a graph contains a cycle, the following basic rules can be applied: if the graph
contains no cycles, no deadlock occurs; if the graph contains a cycle, if there is only
one instance per resource type, then deadlock occurs, otherwise if there are several
instances per resource type, there is the possibility of deadlock.

Handling Deadlocks
The free basic methods for handling deadlocks are prevention/avoidance
(ensuring that the system will never enter a deadlock state), detection and
recovery (allows the system to enter a deadlock state and then recover from it)
and the head-in-the-sand approach (ignore the problem and pretend that
deadlocks never occur - this is the approach used by most OS, including UNIX,
Linux and Windows).

To implement prevention, we need to ensure that one or more of the deadlock


conditions do not hold:

Mutual exclusion is not required for sharable resources, but it must hold
for non-sharable resources.
Hold and wait must guarantee that whenever a process requests a
resource, it does not hold any other resources. The process must request
and be allocated all of its resources before it begins execution, or the
process can only request resources when the process currently has none.
This does have the possibility of low resource utilisation and starvation.
No preemption - If a process that is holding some resources requests
another resource that can not be immediately allocated to it, then all
resources currently being held are released. Preempted resources are
added to the list of resources for which the process is waiting and the
process will only be restarted when it can regain its old resources, as
well as the new ones it is requesting.
Circular wait - impose a total ordering of all resource types and require
that each process requests resources in an increasing order of
enumeration.

Deadlock algorithms prevent deadlock, so deadlock will never occur if (at least
one of) the restrictions are enforced. But, there is a cost to this system. To
restrain how resource requests are made can have the potential for low device
utilisation and a reduced system throughput.

Alternatively, with some addition a priori information, deadlock can be avoided


on a per request basis. The simple method is that each process declares the
maximum number of resources of each type that it may need, or a dynamic
alternative is to dynamically examine the resource-allocation state to ensure
that there can never be a circular-wait condition. The resource allocation state
is defined by the number of available and allocated resources, and the
maximum demands of the process.

Systems States

When a process requests an available resource, the system must decide if


immediate allocation would leave the system in a safe state - a state where
there exists a sequence of all processes so that all processes can get their
required resources and complete. If the system is not in a safe state, it is in an
unsafe state, which means the possibility of deadlock exists (but is not
necessarily going to happen). The deadlock state is a subset of the unsafe
states.

Avoidance requires that the system will never enter an unsafe state, so an
algorithm must pick a safe sequence of events and processes for execution.
One such algorithm is the resource-allocation graph algorithm. A claim edge Pi
→ Rj indicates that process Pi may request resource Rj which is represented by
a dashed line. A claim edge is converted to a request edge when a process
requests a resource. When a resource is released by a process, the assignment
reconverts to a claim edge. Resources must be claimed a priori in the system.

File Systems
Files are an integral part of most computer systems. Input to applications can be by
means of a file, and output can be saved in a file for long-term storage. File systems
can store large amounts of data, and the storage is persistent. The file management
system is considered part of the OS.

The file managment system is a way a process may access files. The programmer
does not need to develop file management software, this is provided by the OS. The
objectives of a file management system are to meet the data management needs of
the user, to guarantee the data in the file is valid, to minimise the potential for lost or
destroyed data and to optimise performance. We can also break down what the user
requirements for data management are into:

being able to create, delete, read and change files


controlled access to other users' files
control accesses allowed to that users' files
restructure the users' files in a form appropriate to the problem
move data between files
backup and recover the users' files in case of damage
access and share the users' files using symbolic names

To enable this, the file management system typically provides functions to indentify
and locate a selected file, to organise files in a structure (often accomplished using
directories) which describes the locations of all files and their attributes, and to
provide mechanisms for protection. The file management systems provides
secondary storage management, that is, it allocates files to free disk blocks and
manages free storage for available disk blocks.

Files
A file is a contiguous, logical addres space (apart from sparse files), however,
they may not be stored as a contiguous file. Files may be of different types, but
from the perspective of the OS, files have no structure - all a file is is a
sequence of machine words and bytes. However, from the user or application
perspective, files may have a simple record structure made up of lines (of fixed
or variable length) or complex structures, which have special control characters
in the structure to make up files of various types.

Files also have attributes assigned to them:

Name - in a human readable form


Type - for systems that support different types
Location - pointer to a file location on the device
Size - current file size
Protection - e.g., read, write, execute bits
Time, date and user identification - used for protection, security and
usage monitoring

These file attributes are kept in the directory structure.

The OS only provides basic file operations on files, such as: create, write, read,
seek, delete and truncate, so more complex operations must be made up of
combinations of these operations, e.g., copy is a create, followed by a read
from a source file and a write to the new file.

File types, traditionally implemented using file extensions (in Microsoft


Windows, this is a legacy from DOS), enables the OS to invoke an appropriate
application for each file type. UNIX and its variants use a "magic number" at
the start of the file to declare its type.

File systems have criteria for file organisation, such as rapid access (especially
needed when accessing a single data item), ease of update (not always a concern,
however, e.g., CD-ROM), economy of storage (want to reduced wasted space,
although this is less of a concern in modern computing), simple maintenance and
reliability.

Sequential Access Files


In these types of files, access starts at the beginning of the file, and to read to
the middle of the file, you must start at the beginning and read the file until you
reach the position you want, and then rewind to get back to the beginning. This
method of file access is slow.

A fixed format is used for structuring data into records. Records are normally
the same length, and the fields in a record are in fixed order. One field (usually
the first) is a key, which uniquely identifies the record within that file. Usually,
records are stored in key order and batch updates are used to reduce the update
time (records are cached and then ordered before an update is done). This
method is designed for sequential access devices (tapes), but sequential files
can be implemented on random access devices, such as disks.

Additions to and deletions from the file can cause problems. The file is stored
in a sequential ordering of record keys, so insertion requires "moving" some of
the records on the storage medium. This is very expensive in time. Usually,
new records are placed in a new file, called the overflow, log or transaction file
and a periodic batch update occurs that merges the overflow file with the
original file. Alternatively, records are organised as a linked list. More
overhead in terms of storage is required, and additional processing is required
also. This only really works (on sequential storage devices) if the records
remain in key order. Deletion requests can also be stored in the log file.

To reduce slow, sequential access files, and to support random access storage
devices better, indexed sequential files can be used. The index provides the
lookup capability to get in the vicinity of a desired record faster. The index is
searched for a key field that is equal or less than the desired key value which
contains a pointer to the main file. The search then continues in the main file
by the location indicated by the pointer. The records are still stored in order,
and new records are added to an overflow file (a record in the main file
preceding the new record is updated to contain a pointer to the new record in
the overflow file). The overflow is then periodically merged with the main file
during a batch update.

Comparison: If a file contains 1,000,000 records, in a standard sequential file,


the average lookup case is 500,000 accesses to find a record in the sequential
file. If an index contains 1000 entries, it will take on average 500 accesses to
find the key, followed by 500 accesses in the main file. On average, it is now
1000 accesses.

Direct Access Files

Direct access files exploit the capability of random access devices (such as
disks) to access directly any address on the device. No sequential ordering of
data in the file is required, and this method can be used where rapid access is
required, such as in swapping or convential user files. It is not that useful for
backups or archives where only occasional access is required (these can
therefore be implemented on cheaper media, such as tapes).

Directories

Directories are used to organise files, and they themselves are files owned by
the OS. Directories contain information about files, such as name, attributes,
location, ownership, size, access/update times and protection information. They
provide a mapping between file names and the files themselves.

From a user perspective, they need to search for a file (by name or attribute),
create a file, delete a file, list a directory and rename a file, etc...

The directory must be organised to obtain efficiency (e.g., to locate a file


quickly), naming (so it is convenient for users, two users can have the same
name for different files and the same file can have several different names) and
grouping (a logical grouping of files by properties).

There are multiple ways of implementing directories. One such way is the
single-level directory, where a single directory exists for all users. However, if
all users share the same directory, they must ensure that all file names are
unique, and there is the grouping problem of little structure or organisation.

In a two-level directory setup, each user has their own directory, which allows
a user to have the same filename as another user, this allows for efficient
searching, but has the problem of there being no grouping capability between
users.

Another is the tree-structured directory, which allows for efficient searching


and grouping capability, without any sharing of files. In this model, we have
the concept of a current, or working, directory. There are issues with this
model, however. There are absolute and relative path names, and creating a
new file or directory is done in the current directory. Also, how do you delete a
directory? Most OSs require for you to explicitly state that you want to remove
all files in the directory, or that you do it beforehand.

Tree structure prohibits the sharing of files and directories, however. Acyclic
graphs remove this restriction. They are more complex than tree, and two
different names can refer to the same file (aliasing), so there can be multiple
path names for the same file. This does introduce a deletion problem, however
- you can not delete a file until all references to it are solved. This is difficult,
so we can solve it using backpointers (so we can delete all the pointers to a
file), have backpointers in a daisy-chain organisation, or by using an entry-
hold-count solution.
Acyclic graphs are complex to manage, however. You need to ensure there are
no cycles, which links can introduce. Merely having subdirectories and files
can not introduce cycles. Another way of guaranteeing no cycles is to only
allow links to files, not subdirectories, or by using a cycle detection algorithm
every time a new link is added to determine whether or not it is okay (this is
expensive).

A file has to be opened before it is of any use, and similarly, a file system must be
mounted before any files or directories are available.

Sharing of files on multi-user systems is desirable. Sharing can be done through a


protection scheme. A simple method is to ensure that only one user has write access
to a file at a time. A complex method is using a variation of semaphores
implemented by some file systems (including in POSIX) called file locks. On
distributed systems, files may be shared across a network.

Protection

Protection exists to make sure that files are not read or modified by
unauthorised persons. Protection mechanisms are provided by the OS to
safeguard information inside the computer. Objects and the rights that they
have in particular contexts are identified in the aim to prohibit processes from
accessing objects they are not authorised to access.

A domain is associated with a set of pairs <objects, rights>, where rights will
be a subset of operations that are allowed on the object, and a domain could be
a user or a process.

It is possible for an object to be in two domains at the same time with or


without the same rights. Different mechanisms are used for identifying
domains, in UNIX a combination of user ID and group ID is used.

Another method of providing protection is through access control lists and


capabilities. It is abstract model, where matrix rows correspond to domains,
and columns correspond to objects. The ACL defines columns - access rights
for an object in different domains, and capabilities define rows, which are
associated with each process and is a set of operations that are permitted.

An access matrix might look like this:

↓ → Access File Card


File 1 File 3 Printer
Capability List 2 Reader
neil read read
andy read print
DomainX read execute
DomainY read/write read/write

The capability list is a row of the matrix, and a capability then is like an
abstract data type, that is, it points to an object and specifies which operations
are allowed on it (it may also have a field 'type' indicating what type of object
it is). The list then specifies what capabilities are accessible in this domain.

In UNIX, the file owner (by default, the creator) is able to control what should
be done, and by whom. Types of accesses include read, write and execute. In
UNIX, this is implemented as a bitmask with three domains, owner, group and
other. Groups in this case are created by a sysadmin, not the user.

File systems reside on secondary storage and contains both programs and data. It
also contains swapped out pages as part of the virtual memory management system.
The main issues to be addressed here are how to store files (a contiguous logical
structure) on secondary storage (disk) and mapping high level file operations on to
low-level disk operations, whilst requiring efficient use of the disk and fast
read/write times.
Disks provide a series of blocks, usually of a fixed size, which are sometimes related
to the memory frame/page size, often as a multiple. The number of file blocks can
be loaded into one physical memory frame.

Disks are random access, that is, you can directly address any given block on the
disk, and it is simple to access any file sequentially or randomly. For I/O efficiency,
transfers between memory and disk are often performed in units of blocks.
For convenient access to disks, the OS imposes a logical file system, which defines
how files look to the user (i.e., operations, file attributes, directories, etc...).
Algorithms and data structures are needed to map the logical file structure onto the
physical disk. This file organisation maps logical file structure onto physical block
addresses. A basic file system issues generic commands to a device driver to read
and write blocks. Finally, the I/O control contains device drivers to translate basic
file system commands into commands for a specific disk.

Several structures are used to implement file systems, such as on disk structures
including the boot control block, the partition control block, the directory structure
to organise the files and a per file file control block (FCB) containing permissions,
dates (access, create, modify), owner, group, size and blocks. In memory structures
are also used, such as the partition table, the directory structure (with recently
accessed directories) and an open file table (both system wide, and per process).
The above diagram shows the necessary file system structures provided by the OS to
open a file (a) and reading a file (b).

Virtual File Systems

Virtual file systems exist so that a single interface can be provided by the OS to
many different types of file systems. Different file systems can be mounted on
different mount points within the same directory structure, and operations have
the same effect on all the different file systems.

This logical file system is called the Virtual File System, or VFS and it allows
the same system call interface (the API) to be used for different types of file
system. The API is to the VFS interface, rather than any specific type of file
system, and the VFS can distinguish between local and remote files.

Directory Implementation

The simplest implementation of directories is a linear list of file names with


pointers to the data blocks. This is time-consuming, however (e.g., a search is
required when a file is created to ensure uniqueness), as linear searches are
expensive, however this cost can be offset by keeping frequently used directory
information in a cache. Keeping an order on the names could reduce searches,
but it would make it more costly for insert/delete of names.

Another implementation is to use a hash table, which decreases directory


search time. There is a problem with collisions (situations where two file
names hash to the same location) in this approach, however. A method is
needed for knowing that a hash location has a number of corresponding file
names, and for determining which name to use. This type of hash table is called
a chained-overflow hash table.

File Allocation Methods

In contiguous allocation, each file occupies a set of contiguous blocks on the


disk. This is relatively simple to implement, as only a starting location (block
#) and length (number of blocks) are required. Random access is also relatively
easy. This is wasteful of space (dynamic storage-allocation problem) due to
fragmentation (see memory management earlier) and files can not easily grow.

A slightly more improved method is linked allocation, where each file is a


linked list of disk blocks, which may be scattered anywhere on the disk, and
where each block contains a pointer to the next block. Again, this is simple as
only the starting address is required and there is no wasted space in this free
space management system. However, there is no random access, as you have to
go down the linked list to find the appropriate block.
The file allocation table (FAT) system is a variant of linked allocation used in
MS-DOS and OS/2. A section at the beginning of the disk is dedicated to a
FAT, containing an entry for each disk block. The directory entry then only
needs the block number for the start of the file. The FAT then indexed by the
block number gives the next block, etc...

Linked allocation solves the problems of contiguous allocation, but it does not
support random access tables without a FAT. This can be achieved with an
index table - all the pointers are brought together into an index block, which
exists per file, where the ith entry points to the ith block of file, and this gives
us random access. With large files, however, large index tables are needed,
which is a disadvantage; multi-level indexing can be introduced to combat this,
however. Additionally, the index table is an overhead.

Free Space Management

Since disk space is limited, space needs to be reused from deleted files and free
space needs to be monitored. One way of implementing this is using a bit
vector which shows when a block is free or occupied. An alternative to this bit
vector is the linked list. A pointer is kept to the first free block, which contains
a pointer to the next free block, etc... This is inefficient, as to traverse the list
requires multiple disk block reads, however this is an uncommon operation, as
most operations require just one block. However, the linked list does not
require any extra free space, unlike the bit vector, although there is no easy
way to get contiguous space easily.

These methods can be modified using grouping, storing the addresses of the
next n free blocks in the first free block; next n addresses of free blocks in the
next free block, etc... - this is more efficient in the number of blocks that need
to be read; and counting, which takes advantage of the fact that free blocks
usually form together. The first free block contains the number of free blocks
immediately following (i.e., forming a contiguous free store), together with the
address of the next free block at the start of some contiguous free store.

POSIX and Linux File Systems

In these implementations, the logical file system provides simple file


operations, including: create, open, read, write, close, mount (mounting a file
system at a mount point), stat (returns information about a file), lseek
(repositions the current position in the file) and truncate (causes file to be
truncated to a specific size).

This assumes a structure where all files have distinct absolute paths, using a
tree hierarchy. It is possible to set up cycles via symbolic links
The VPS provides uniform access to all file systems, including special files
(e.g., /proc, /dev).

The file system organisation is a multi-level indexing scheme, with an i-node


(index-node) associated with each file, which contains attributes and disk
addresses of the files' blocks. There is no distinction between files and
directories other than having a different mode type field.

In an i-node, the first few disk addresses are stored (which for small files, will
be all the disk addresses) and there are links to further i-nodes which contain
further disk addresses. This may go up to further indirection, which gives us a
very flexible system at the expense of complexity and storage space.

For directories, each entry contains a file name and an i-node number. Here,
the i-node contains information about the time, size, type, ownership and disk
blocks contained in the i-node. If you imagine that the file system knows the
root directory and wants to look up /usr/neil/ops/exam, then from the root i-
node, the system locates the i-node number of the file /usr, and then looks for
the next component neil, and so on. Relative paths work in the same way,
where there is a known current directory i-node.

The Ext2 Linux general purpose filesystem implements this, with a mostly
static allocation of i-nodes. The disk is split up into block groups of one or
more disk cylinders (i.e., those physically together) and blocks are used for i-
nodes and file data blocks. Bitmaps are used for keeping data blocks and i-
nodes in the group. When allocating disk blocks, related information is
attempted to be kept together (i.e., i-nodes of all files in a directory in the same
block group) and new directories are created in block groups with the smallest
number of directories.

This scheme works well as long as there is > 10% free space.

With journalling file systems, files are stored in both disc and main memory (a
memory cache of disk blocks is used for efficiency), however we have to
consider the possibility of what happens when the system fails. On the disk,
file system data structure may be corrupted, but we must ensure there is no loss
of data (i.e., consistency is maintained). Journaling (or log-based transaction)
file systems attempt to maintain consistency.

For a file operation, updates to the disk structure are written to a log, and when
complete, the changes are considered committed. The updates for a file
operation form a transaction, and actual updates to disk structures are
performed so that if there is a crash, it is known how much of the transaction
has been performed (and which other transactions remain in the log). ReiserFS
under Linux is an example of a journalling file system.
I/O
Handling I/O efficiently is important to the user, and therefore to the OS. All user
interaction with the machine is done using I/O and most user invoked operations
(i.e., applications) will use I/O at some point.

Operating systems provide I/O device handling based on utilising low-level


hardware interfaces to devices (i.e., access device via special instructions). Device
drivers are provisioned here to handle the idiosyncrasies of individual devices,
programmed in terms of a low-level hardware interface and provide a standard high-
level application interface to devices, to enable their easy use by application
software.

Devices are addressed either using direct mapping,


where devices exist in a special I/O range and See ICS for information
special I/O instructions exist in the CPU to on computer organisation
implement this, or by memory mapping, where
devices appear as addresses in the normal memory range and normal processor
instructions are used to access these addresses.

Devices tend to be accessible in the same place for a particular architecture and each
address range maps to the I/O control registers for the deviced accessed by special
I/O instructions. Some devices have both direct and memory-mapped address space
(e.g., graphics cards).

I/O ports typically consist of a status register, which contains the bits read by the
host that indiciate device state (e.g., data ready for a ready), a control register which
enables the host to change the mode of a device (e.g., from full to half-duplex for
serial ports), a data-in register which is read by the host to get input and the data-out
register, which is written to by the host for data to be sent to the device. This
introduces the issue of how does the host know when a particular condition has
arisen in the status register.

Polling is one way to solve this problem, and is a host initiated approach. The device
status register is polled until the wanted condition occurs, but this way is inefficient
as there is a busy-wait cycle to wait for I/O from a device. The alternative solution is
using interrupts, which is initiated by both the host and the device. A CPU interrupt
request line is triggered by the device and the interrupt handler receives interrupts
and uses an interrupt vector to dispatch an interrupt to the correct handler based on
priority. Most interrupts can be maskable to ignore or delay some interrupts, and the
interrupt mechanism can be used for exceptions. There is no need for busy-waiting
in this model.
In the IA32 architecture, interrupts 0-31 are unmaskable, as some are exceptions,
and 32-255 are maskable and are used for devices.

DMA is used to avoid programmed I/O for large data


movement and is implemented using a DMA controller to For information on
bypass the CPU and transfer data directly between the I/O DMA, see ICS
device and memory.
Application Interfaces

There are many different hardware devices, but application processes want a
common interface, but devices can vary in many dimensions: character (deals
with bytes or words of data, e.g., keyboards, mice, serial ports, etc and
commands to deal with this include get and put - often libraries are
implemented to allow in-line editing), stream (a stream of bits) or block
(devices deal with blocks of data, such as disks, commands include read, write,
seek and most raw I/O filesystem access uses this) transfer; sequential or
random access; sharable or dedicated device; device speed; read-write, read-
only or write-only. Timer devices also exist, which provide the current time,
elapsed time, or just a standard timer.

I/O can be either blocking or non-blocking. That is, a process is suspended


until I/O is completed under a blocking scheme (this is easy to use and
understand, but is insufficient for some needs - although using threads blocking
can be programmed around), or under a non-blocking scheme, an I/O call
returns as much as available and returns quickly with a count of bytes read or
written. A final scheme is asynchronous, which is difficult to use and is a
scheme where a program runs while the I/O executes and the I/O subsystem
signals the process when the I/O is completed.
Usually, hardware devices are controlled in the same manner as files (create,
write, read, delete, etc...), so the filesystem namespace can be used for devices
(/dev in UNIX). The kernel implements I/O through a series of layers as below:

The I/O subsystem of the kernel implements I/O scheduling and some I/O
request ordering via a per-device queue, although some OSs do try to
implement fairness. Additionally, buffering is also implemented, where data is
stored in memory while it is transferred between devices (this copes with
device speed mismatch and transfer size mismatch and "copy semantics).

Buffering can be implemented using many different methods, such as caching,


where fast memory holds a copy of the data (always just a copy, and this is the
key to performance), or spooling, where output for a device is held (this is
useful is the device can only serve one request at a time, e.g., printing). Device
reservation can also be used, where exclusive access is provided to a device
and system calls for allocation and deallocation are used. However, deadlock
(discussed above) can be encountered here.

The kernel has data structures for things like keeping state info for I/O
components including open file tables, network connections and the character
device state, as well as many, many other data structures to track buffers,
memory allocation and "dirty" blocks. Some OSs use object-orrientated
methods and message passing to implement I/O.

Some OSs use a file based approach to kernel I/O structures. Devices are
named as files and accesses to devices are by file operations (with some
exceptions, e.g., seek in a character device may have no effect). In Linux, an
additional command ioctl() provides direct access to the device driver, perhaps
to set the device into a particular mode (e.g., changing speed or other
parameters of a serial port).

I/O Performance

I/O is a major factor in system performance, as it makes demands on the CPU


to execute the device driver or kernel I/O code and it forces context switches
due to interrupts. Data copying and network traffic are especially stressful.

Ways of improving performance include reducing the number of context


switches, data copying and interrupts by using large transfers, smart controllers
and polling. DMA can also be used and the CPU, memory, bus and I/O
performance need to be balanced for highest throughput.

This brings us to the question of where should functionality be implemented to


get best performance - application kernel or hardware level?

Common questions

Powered by AI

The main challenge with direct access files is efficiently organizing and retrieving data without requiring structurally ordered storage, which can lead to complexities in locating and updating records . These files leverage random access capabilities of storage devices to address this by enabling direct retrieval of data blocks regardless of their physical location . Algorithms and data structures, such as hash tables, provide efficient mappings from logical to physical storage . This organization allows direct access files to provide rapid data retrieval and update capabilities.

Sequential file access reads records in order, which can be inefficient, especially for large files requiring mid-file access, as it necessitates starting from the file's beginning each time . Indexed sequential access improves efficiency by adding an index used to quickly locate desired records, reducing lookup times drastically. For example, with 1,000,000 records, sequential access may take 500,000 accesses on average, whereas indexed sequential access reduces this to about 1000 accesses by first locating record pointers via an index . Hence, indexed sequential access is much more efficient for large datasets requiring frequent non-sequential access.

The VFS facilitates interoperability by providing a uniform API for accessing various file systems, regardless of their type or underlying structure . It abstracts the complexities of different file systems, allowing users and applications to interact with files without concerning themselves with physical details. VFS manages files mounted at different points within the file structure and distinguishes between local and remote files, enabling seamless operations across disparate systems . This interface enhances flexibility and reduces the need for specialized handling of different file types within applications.

The multilevel feedback queue allows processes to move between various queues based on specific scheduling policies . It provides the flexibility to adjust process priorities dynamically, a key advantage for accommodating different process requirements and improving overall system performance. The inclusion of aging helps address starvation by ensuring that lower-priority processes eventually receive CPU time, as they get promoted to higher priority queues over time . This balances workload distribution effectively across different types of processes.

Dynamic linking defers the resolution of external symbols until program execution, unlike static linking, where all needed libraries are linked at compile time . The main benefits of dynamic linking include reduced memory usage since libraries are loaded only as needed during execution, leading to shared code and data across different programs . This results in decreased disk space and memory requirements, as multiple programs can share the same copy of a library, enhancing system efficiency and flexibility.

Paging addresses memory fragmentation by dividing the physical memory into fixed-size blocks called frames and the logical memory into pages that fit into these frames . This eliminates the need for contiguous memory allocation, thus tackling issues of external fragmentation. However, paging still results in internal fragmentation when a process does not perfectly use up a complete number of frames, but this is minimal compared to problems arising from contiguous memory allocation . By ensuring processes are given noncontiguous blocks of memory, paging effectively resolves the issues of fragmentation and enhances memory utilization.

Memory management allocates and manages both physical and virtual memory to ensure safe sharing of memory among processes . It provides protections to prevent processes from accessing each other's memory spaces, using techniques such as partitioning memory and implementing paging to manage workload effectively. These techniques help in time-sharing the system and multiprogramming the processor without interfering with each process's allocated memory space . Techniques like virtual memory allow processes to utilize more memory than the physically available memory, ensuring efficiency.

Using a hash table in directory implementation offers the advantage of decreased search time due to its efficient lookup capabilities, making it suitable for quick access to files . However, it introduces challenges related to hash collisions, where multiple file names map to the same hash location. This necessitates additional handling mechanisms, such as chained-overflow hash tables, to resolve conflicts effectively, adding overhead to maintenance . While offering speed advantages, the complexity and potential increase in storage due to collision resolution can be seen as drawbacks depending on the specific use case requirements.

Swapping is necessary when processes exceed available physical memory, requiring temporary storage of inactive process data on a backing store such as a disk . It allows the system to manage memory usage dynamically, making room for higher-priority processes by moving lower-priority ones out temporarily, known as roll-out, roll-in . While swapping enables better resource allocation and CPU time sharing, its impact on system performance includes overhead related to transfer time, which is proportional to the amount of memory swapped . Thus, while necessary for resource management, efficient swapping mechanisms are crucial to mitigate potential performance degradation.

In UNIX systems, processes can create a hierarchy or tree structure called a process group, where a parent process can create child processes, and each child can further spawn its own children . In contrast, Windows does not implement a process hierarchy in the same way; all processes are created as equals, without forming a structured tree . This difference affects how processes are managed and terminated, especially concerning resource allocation and signal broadcasting within the systems.

You might also like