Overview
Multicore Programming
Multithreading Models
Thread Libraries
Implicit Threading
Threading Issues
Operating System Examples
To introduce the notion of a thread—a fundamental unit of CPU
utilization that forms the basis of multithreaded computer
systems
To discuss the APIs for the Pthreads, Windows, and Java thread
libraries
To explore several strategies that provide implicit threading
To examine issues related to multithreaded programming
To cover operating system support for threads in Windows and
Linux
Breaking task into smaller task can finish job quickly
Parallelization can improve performance of applications:
Each computing entity performs small part of job
Can compute in parallel if supported by hardware
GPUs can perform thousands of tasks in parallel
An application can be implemented with several threads
Threads are processing entities, similar to processes
Light weight processes
Basic unit of CPU utilization
Help in achieving parallelization
A traditional process has single thread of control
Light weight processes
Basic unit of CPU utilization
Help in achieving parallelization
If a process has multiple threads of control, it can perform more than
one task at a time.
Example
Web browser: One thread for display, another for retrieving data
Word processor: a thread for display, another thread to respond to user
input, another to perform spell check
Shares with other threads belonging to same process
Code section
Data section
Other OS resources, such as Open files and signals
Thread comprises of
Thread ID
Program counter
Register set
Stack
Responsiveness – may allow continued execution if part of
process is blocked, especially important for user interfaces
Resource Sharing – threads share resources of process, easier
than shared memory or message passing
Economy – Allocating memory and resources for process
creation is costly. Because threads share resources of the process
to which they belong, it is more economical to create and context
switch threads.
Scalability – process can take advantage of multiprocessor
architectures. Threads are running parallel in different
processors. A single threaded process can only run on one CPU,
no matter how many processors are available. Multithreading on
a multi-CPU machine increases concurrency
Multicore or multiprocessor systems putting pressure on
programmers, challenges include:
Dividing activities
Balance
Data splitting
Data dependency
Testing and debugging
Parallelism implies a system can perform more than one
task simultaneously
Concurrency supports more than one task making progress
Single processor / core, scheduler providing concurrency
Types of parallelism
Data parallelism – distributes subsets of the same data across multiple
cores, same operation on each
Task parallelism – distributing threads across cores, each thread
performing unique operation
As # of threads grows, so does architectural support for threading
CPUs have cores as well as hardware threads
Consider Oracle SPARC T4 with 8 cores, and 8 hardware threads per core
Concurrent execution on single-core system:
Parallelism on a multi-core system:
Identifies performance gains from adding additional cores to an application that
has both serial and parallel components
S is serial portion
N processing cores
That is, if application is 75% parallel / 25% serial, moving from 1 to 2 cores
results in speedup of 1.6 times
As N approaches infinity, speedup approaches 1 / S
Serial portion of an application has disproportionate effect on
performance gained by adding additional cores
But does the law take into account contemporary multicore systems?
User threads – Supported above the kernel and management done by
user-level threads library
Three primary thread libraries:
POSIX Pthreads
Windows threads
Java threads
Kernel threads – Supported and managed directly by the Kernel
Ultimately, there must be a relationship between user and kernel
threads to make the system to function
Examples – virtually all general purpose operating systems, including:
Windows
Solaris
Linux
Tru64 UNIX
Mac OS X
Three common ways to establishing the relainship:
Many-to-One
One-to-One
Many-to-Many
Many user-level threads mapped to single
kernel thread
Thread management is done by the thread
library in user space, so it is efficient.
Limitations:
One thread blocking causes all to block
Because only one thread can access the
kernel at a time, Multiple threads may not
run in parallel on multicore or
multiprocessor system
Few systems currently use this model
Examples:
Solaris Green Threads
GNU Portable Threads
Each user-level thread maps to kernel thread
More concurrency than many-to-one model by allowing another
thread to run when a thread makes a blocking system call
Also allows multiple threads to run in parallel on multiprocessors
Creating a user-level thread creates a kernel thread
Because the overhead of creating kernel threads can burden the
performance of an application, most implementations of this model
restricts the number of kernel threads supported by the system. (5
threads in 4 processors)
Examples
Windows
Linux
Solaris 9 and later
Multiplexes many user-level threads to smaller or equal number of
kernel threads.
The number of kernel threads may be specific to either a particular
application or a particular machine
Developers can create as many user threads as necessary and the
corresponding kernel threads can run in parallel on a multiprocessor.
Also when a thread performs a blocking system call the kernel can
schedule another thread for execution.
Solaris prior to version 9
Windows with the ThreadFiber package
Similar to M:M, except that it allows a user thread to be bound to
kernel thread
Examples
IRIX
HP-UX
Tru64 UNIX
Solaris 8 and earlier
Hyperthreaded systems allow their processor core resources to become
multiple logical processor for performance.
It enables the processor to execute two threads or sets if instructions, at
the same time. Since hyper-threading allows two streams to be executed
in parallel, it is almost like having two separate processors working
together
Thread library provides programmer with API for creating and
managing threads
Two primary ways of implementing
Library entirely in user space
Kernel-level library supported by the OS
May be provided either as user-level or kernel-level
A POSIX (Portable OS Interface) standard (IEEE 1003.1c) API for thread creation and
synchronization
It is an execution model that exists independently from a language and a parallel
execution model. It allows a program to control multiple different workflows that
overlap in time. Each flow of work is referred to as a thread.
Defines a standard interface and environment that can be used by an operating system
(OS) to provide access to POSIX-compliant applications. The standard also defines a
command interpreter (shell) and common utility programs. POSIX supports
application portability at the source code level so applications can be built to run
on any POSIX-compliant OS.
If the code is written using the POSIX API, it need not be rewritten for another POSIX
compliant system.
Even if the code is POSIX compliant, the CPU instructions in the compiled executable
are different across different CPU architectures.
API specifies behavior of the thread library, implementation is up to development of
the library
Common in UNIX operating systems (Solaris, Linux, Mac OS X)
Java threads are managed by the JVM
Typically implemented using the threads model provided by
underlying OS
Java threads may be created by:
Extending Thread class
Implementing the Runnable interface
Specific for Linux based System
Fork() : The fork() system call is used to create a separate
duplicate process. New process will have different process ID
Exec(): When an exec() system call is invoked, the program
specified in the parameter (Which will be another program) to
exec() will replace the entire process – including all threads
(process ID remains same)
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
int main()
{
printf(“Hello!\n PID=%d \n”,getpid());
return 0;
}
Compile: gcc 1.c
Run: ./[Link]
Hello!
PID:5523
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
int main()
{
fork();
printf(“Hello!\n PID=%d \n”,getpid());
return 0;
}
Compile: gcc 1.c
Run: ./[Link]
Hello!
PID:5523
Hello!
PID:5524
First fork()
#include <stdio.h>
P1 -> P2
#include <sys/types.h>
#include <unistd.h>
Second fork()
int main()
P1->P3
{
P2->P4
fork();
fork();
Third fork()
fork();
P1 -> P5
printf(“Hello!\n PID=%d \n”,getpid());
P2->P6
return 0;
P4->P7
}
P3->P8
Compile: gcc 1.c
Run: ./[Link]
Executed 8 times with different PID
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
int main()
{
printf(“ PID=%d \n”,getpid());
char *args[]={“Hello”, “Welcome”,NULL};
execv(“./3”,args);
printf(“Back to 2.c”);
return 0;
}
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
printf(“We are in 3.c \n”);
printf(“PID of 3.c = %d\n”,getpid());
return 0;
}
Compile: gcc 2.c -o out1
gcc 3.c –o out2
Run: ./out1
PID = 5523
We are in 3.c
PID of 3.c =5523 # i.e.. Content of 3.c is replaced in 2.c, process ID remains Same
# Back to 2.c statement from 2.c is not printed.
Semantics of fork() and exec() system calls change in a
multithreaded program
Thread cancellation of target thread
Asynchronous or deferred
Signal handling
Synchronous and asynchronous
Thread pools
Scheduler Activations
Semantics of fork() and exec() system calls change in a
multithreaded program
Issues:
If one thread in a program calls fork(), does the new process duplicate all
threads, or is the new process single-threaded
Solution:
Some UNIX systems have chosen to have two versions of fork(), one that
duplicates all threads in the process and another that duplicates only the
thread that invoked the fork() system call
But which version of fork() to use and when?
Depends on the application.
Then duplicating all threads is unnecessary, as the program
specified in the parameters to exec() will replace the process
Inthis instance, duplicating only the calling thread is
appropriate.
Then the separate process should duplicate all threads
Terminating or killing a thread before it has completed
If multiple threads are concurrently searching for the database and
one thread returns the result, the remaining threads might be
canceled
When a user presses a button on a web browser that stops a web page
from loading any further(all threads images, text, etc.) all threads
loading the page are canceled
Thread to be canceled is target thread
Cancellation of target threads may occur in two different scenarios:
Asynchronous cancellation: One thread immediately terminates the
target thread. Target thread has no power.
Deferred cancellation: The target thread periodically checks whether it
should terminate, allowing it an opportunity to terminate itself in an
ordinary fashion. Target thread has power
Pthread code to create
and cancel a thread:
In situations where
Resources have been allocated to a canceled thread
Often the OS will reclaim system resources from a canceled thread but some
cases will not reclaim all resources. Therefore, cancelling a thread
asynchronously may not free a necessary system-wide resource.
A thread is canceled while in the midst of updating data it is sharing with
other threads
Good way:
Differed cancellation allows thread to check whether it should be canceled
at a point when it can be canceled safely.
Invoking thread cancellation requests cancellation, but actual
cancellation depends on thread state
If thread has cancellation disabled, cancellation remains
pending until thread enables it
Default type is deferred
Cancellation only occurs when thread reaches cancellation point
I.e. pthread_testcancel()
Then cleanup handler is invoked
On Linux systems, thread cancellation is handled through
signals
n Signals are used in UNIX systems to notify a process that a
particular event has occurred.
n A signal handler is used to process signals
1. Signal is generated by particular event
2. Signal is delivered to a process
3. Signal is handled by one of two signal handlers:
1. default
2. user-defined
n Examples of synchronous signals include illegal memory access or
1 divide by 0. A signal is generated to the process on these event
n Every signal has default handler that kernel runs when handling
signal
l User-defined signal handler can override default
l For single-threaded, signal delivered to process
n Handling signal in single threaded program is straight
forward. Signals are always delivered to the process.
n Where should a signal be delivered for multi-threaded?
l Deliver the signal to the thread to which the signal applies
l Deliver the signal to every thread in the process
l Deliver the signal to certain threads in the process
l Assign a specific thread to receive all signals for the process
Growing in popularity as numbers of threads increase, program
correctness more difficult with explicit threads
Creation and management of threads done by compilers and run-
time libraries rather than programmers
Three methods explored
Thread Pools
OpenMP
Grand Central Dispatch
Other methods include Microsoft Threading Building Blocks (TBB),
[Link] package
Create a number of threads in a pool where they await work
Advantages:
Usually slightly faster to service a request with an existing thread than
create a new thread
Allows the number of threads in the application(s) to be bound to the
size of the pool
Separating task to be performed from mechanics of creating task
allows different strategies for running task
[Link] could be scheduled to run periodically
Windows API supports thread pools:
Set of compiler directives and an API for C,
C++, FORTRAN
Provides support for parallel programming
in shared-memory environments
Identifies parallel regions – blocks of code
that can run in parallel
#pragma omp parallel
Create as many threads as there are cores
#pragma omp parallel for
for(i=0;i<N;i++) {
c[i] = a[i] + b[i];
}
Run for loop in parallel
Apple technology for Mac OS X and iOS operating systems
Extensions to C, C++ languages, API, and run-time library
Allows identification of parallel sections
Manages most of the details of threading
Block is in “^{ }” - ˆ{ printf("I am a block"); }
Blocks placed in dispatch queue
Assigned to available thread in thread pool when removed from
queue
Two types of dispatch queues:
serial – blocks removed in FIFO order, queue is per process, called
main queue
Programmers can create additional serial queues within program
concurrent – removed in FIFO order but several may be removed
at a time
Three system wide queues with priorities low, default, high
Thread-local storage (TLS) allows each thread to have its own
copy of data
Useful when you do not have control over the thread creation
process (i.e., when using a thread pool)
Different from local variables
Local variables visible only during single function invocation
TLS visible across function invocations
Similar to static data
TLS is unique to each thread
Many systems implementing many to many or
two level model place an intermediate data
structure between user and kernel threads
To the user thread library, the LWP appears to be
a virtual processor on which the application can
schedule a user thread to run
Each LWP is attached to a kernel thread and it is
kernel threads that the OS schedules to run on
the physical processor.
If the kernel thread blocks(waiting for an I/O
operation to complete), the LWP blocks as well.
Up the chain the user level thread attached to the
LWP also blocks
One scheme for communication between user
thread library and kernel is known as scheduler
activation.
The kernel provides an application with a set of virtual processors
(LWPs), and the application can schedule user threads onto an
available virtual processor.
Moreover, the kernel must inform an application about certain events.
This procedure is known as an upcall.
Upcalls are handled by the thread library with an upcall handler, and
upcall handlers must run on a virtual processor.
One event that triggers an upcall occurs when an application thread
is about to block. In this scenario, the kernel makes an upcall to the
application informing it that a thread is about to block and identifying
the specific thread. Then the kernel allocates a new virtual processor
to the application.
The application runs an upcall handler on this new virtual processor,
that saves the state of the blocking thread and relinquishes / call off
the virtual processor on which the blocking thread is running.
Another thread that is eligible to run on the new virtual processor is
scheduled then by the upcall handler.
Whenever the event that the blocking thread was waiting for occurs,
the kernel makes another upcall to the thread library informing it that
the previously blocked thread is now eligible to run.
A virtual processor is also required for The upcall handler for this
event, and the kernel may allocate a new virtual processor or
preempt one of the user threads and then run the upcall handler on
its virtual processor.
The application schedules an eligible thread to run on an available
virtual processor, after marking the unblocked thread as eligible to
run.
Windows Threads
Linux Threads
Windows implements the Windows API – primary API for Win
98, Win NT,Win 2000, Win XP, and Win 7
Implements the one-to-one mapping, kernel-level
Each thread contains
A thread id
Register set representing state of processor
Separate user and kernel stacks for when thread runs in user mode
or kernel mode
Private data storage area used by run-time libraries and dynamic
link libraries (DLLs)
The register set, stacks, and private storage area are known as
the context of the thread
The primary data structures of a thread include:
ETHREAD (executive thread block) – includes pointer to process to
which thread belongs and to KTHREAD, in kernel space
KTHREAD (kernel thread block) – scheduling and synchronization
info, kernel-mode stack, pointer to TEB, in kernel space
TEB (thread environment block) – thread id, user-mode stack, thread-
local storage, in user space
Linux refers to them as tasks rather than threads
Thread creation is done through clone() system call
clone() allows a child task to share the address space of the
parent task (process)
Flags control behavior
struct task_struct points to process data structures
(shared or unique)