Threading and Parallel Programming Constructs

To be really proficient in threading, you need a grasp of the theory behind the different techniques


January 23, 2009
URL:http://www.drdobbs.com/threading-and-parallel-programming-const/212902219

Shameem Akhter, a platform architect at Intel, and Jason Roberts, a senior software engineer at Intel, are the authors of Multi-Core Programming: Increasing Performance through Software Multithreading on which this article is based. Copyright (c) 2008 Intel Corporation. All rights reserved.


This article describes the theory and practice of the principal parallel programming constructs that focus on threading, beginning with the fundamental concepts of synchronization, critical sections, and deadlock. It then discusses common issues associated with these concepts, as well as classical problems in the area of threading.

Synchronization

Synchronization is an enforcing mechanism used to impose constraints on the order of execution of threads. Synchronization controls the relative order of thread execution and resolves any conflict among threads that might produce unwanted behavior. In simple terms, synchronization is used to coordinate thread execution and manage shared data.

In an environment where messages are used for communicating between a sender and a receiver, synchronization is implicit, as a message must be sent before the message can be received. On the other hand, for a shared-memory based environment, threads have no implicit interdependency unless some constraints are imposed.

Two types of synchronization operations are widely used: mutual exclusion and condition synchronization. In the case of mutual exclusion, one thread blocks a critical section -- a section of code that contains shared data -- and one or more threads wait to get their turn to enter the section. This helps when two or more threads share the same memory space and run simultaneously. Mutual exclusion is controlled by a scheduler and depends on the granularity of the scheduler. Condition synchronization, on the other hand, blocks a thread until the system state satisfies some specific condition; it allows a thread to wait until that condition is reached. Figure 1 shows the generic representation of synchronization.

Figure 1: Generic Representation of Synchronization Block inside Source Code

While a number of techniques are available for synchronization, only a few methods are used by developers on a regular basis. The techniques used are also to some extent determined by the programming environment.

The scope of synchronization is broad. Proper synchronization orders the updates to data and provides an expected outcome. In Figure 2, shared data d can be accessed by threads Ti and Tj at times ti, tj, tk, and tl, where ti < tj < tk < tl; proper synchronization maintains the order of updates to d at these instances and considers the state of d as a synchronization function of time. This synchronization function, s, represents the behavior of a synchronized construct with respect to the execution time of a thread.

Figure 2: Shared Data Synchronization, Where Data d Is Protected by a Synchronization Operation

Figure 3 represents how synchronization operations are performed in an actual multi-threaded implementation in a generic form, and demonstrates the flow of threads. When m>=1, the creation timing for initial threads T1…Tm might not be the same. After block Bi as well as Bj, the number of threads could be different, which means m is not necessarily equal to n and n is not necessarily equal to p. For all operational environments, the values of m, n, and p are at least 1.

Figure 3: Operational Flow of Threads for an Application

Critical Sections

A "critical section" is a section of code where shared dependency variables reside -- variables on which multiple threads depend. Different synchronization primitives are used to keep critical sections safe. With the use of proper synchronization techniques, only one thread is allowed access to a critical section at any one instance. The major challenge of threaded programming is to implement critical sections in such a way that multiple threads perform mutually exclusive operations on critical sections and do not use critical sections simultaneously.

Critical sections can also be referred to as synchronization blocks. Depending upon the way critical sections are used, the size of a critical section is important. Minimize the size of critical sections when practical; larger critical-section code blocks should be split into multiple smaller blocks. This is especially important in code that is likely to experience significant thread contention. Each critical section has an entry and an exit point. A critical section can be represented as shown in Figure 4.


<Critical Section Entry,
 to keep other threads in waiting status>
...
Critical Section
...
<Critical Section Exit,
 allow other threads to enter critical section>

Figure 4: Implementing a Critical Section in Source Code
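
As a concrete illustration of the pattern in Figure 4, the following sketch (not from the original article) guards a shared counter with a POSIX threads mutex in C; the names counter and counter_lock are illustrative assumptions.

#include <pthread.h>
#include <stddef.h>

static long counter = 0;                         /* shared data */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&counter_lock);           /* critical section entry */
    counter++;                                   /* operate on shared data */
    pthread_mutex_unlock(&counter_lock);         /* critical section exit */
    return NULL;
}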

Deadlock

Deadlock occurs whenever a thread is blocked waiting on a resource held by another thread that will never become available. Depending on the circumstances, different kinds of deadlock can occur: self-deadlock, recursive deadlock, and lock-ordering deadlock. In most instances, deadlock means lock-ordering deadlock.

Self-deadlock is the condition in which a thread, Ti, wants to acquire a lock that is already owned by thread Ti itself. In Figure 5 (a), at time ta thread Ti owns lock li, which is going to be released at tc. However, there is a call at tb from Ti that requires li. The release time of li is td, where td can be either before or after tc. In this scenario, thread Ti is in a self-deadlock condition at tb. When the wakeup path of thread Ti resides in another thread, Tj, that condition is referred to as recursive deadlock, as shown in Figure 5 (b). Figure 5 (c) illustrates a lock-ordering deadlock, where thread Ti locks resource rj and waits for resource ri, which is locked by thread Tj; at the same time, thread Tj locks resource ri and waits for resource rj, which is locked by thread Ti. Here, both threads Ti and Tj are in deadlock at td, and w is the wait function for a lock.

Figure 5: Deadlock Scenarios

To represent the transition model of deadlock in an environment, consider representing atomic states by si and each thread of the system by Ti. Each thread can transition from one state to another by requesting a resource, acquiring a resource, or freeing the current resource. So the transition can be represented as shown in Figure 6, where ri denotes requesting a resource, ai acquiring a resource, and fi freeing the current resource.

Figure 6: Deadlock Scenario in a State Transition for a Thread

For any thread Ti, if the state transition of Ti becomes sd for all possible scenarios and remains blocked at sd, thread Ti would not have any way to transition from sd to any other state. That is why state sd is called the deadlock state for thread Ti.

Avoiding deadlock is one of the challenges of multi-threaded programming. There must not be any possibility of deadlock in an application. A lock-holding prevention mechanism or the creation of lock hierarchy can remove a deadlock scenario. One recommendation is to use only the appropriate number of locks when implementing synchronization.
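
One way to realize the lock-hierarchy idea is to give every lock a fixed rank and have every thread acquire locks in ascending rank order only. The following C sketch with POSIX threads is a minimal illustration of that convention; the names lockA and lockB are hypothetical.

#include <pthread.h>

/* The rank-1 lock is always acquired before the rank-2 lock. */
static pthread_mutex_t lockA = PTHREAD_MUTEX_INITIALIZER;  /* rank 1 */
static pthread_mutex_t lockB = PTHREAD_MUTEX_INITIALIZER;  /* rank 2 */

void update_both(void)          /* every thread follows the same order */
{
    pthread_mutex_lock(&lockA);
    pthread_mutex_lock(&lockB);
    /* ... operate on data protected by both locks ... */
    pthread_mutex_unlock(&lockB);
    pthread_mutex_unlock(&lockA);
}

Because no thread ever waits for a lower-ranked lock while holding a higher-ranked one, the circular wait shown in Figure 5 (c) cannot form.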

Synchronization Primitives

Synchronization is typically performed by three types of primitives: semaphores, locks, and condition variables. The use of these primitives depends on the application requirements. These synchronization primitives are implemented by atomic operations and use appropriate memory fences. A memory fence, sometimes called a memory barrier, is a processor-dependent operation that guarantees that threads see other threads' memory operations by maintaining reasonable order. To hide the granularity of these synchronization primitives, higher-level synchronization constructs are used, so that application developers need to concern themselves less with internal details.

Semaphores

Semaphores, the first set of software-oriented primitives to accomplish mutual exclusion of parallel process synchronization, were introduced by the well-known mathematician Edsger Dijkstra in his 1968 paper, The Structure of the "THE"-Multiprogramming System (Dijkstra 1968). Dijkstra illustrated that synchronization can be achieved by using only traditional machine instructions or hierarchical structure. He proposed that a semaphore can be represented by an integer, sem, and showed that a semaphore can be bounded by two basic atomic operations, P (proberen, which means test) and V (verhogen, which means increment). These atomic operations are also referred to as synchronizing primitives. Even though the details of Dijkstra's semaphore representation have evolved, the fundamental principle remains the same. Here, P represents the "potential delay" or "wait" and V represents the "barrier removal" or "release" of a thread. These two synchronizing primitives can be represented for a semaphore s as follows:


Thread "T" performs operation "P":
    P(s) --> atomic {sem = sem-1; temp = sem}
        if (temp < 0)
            {Thread T blocked and enlists on a
             waiting list for s}

Thread "T" performs operation "V":
    V(s) -->  atomic {sem = sem +1; temp = sem}
        if (temp < =0)
            {Release one thread from the waiting
             list for s}

where the semaphore value sem is initialized with the value 0 or 1 before the parallel processes get started. In Dijkstra's representation, T referred to processes. Threads are used here instead to be more precise and to remain consistent about the differences between threads and processes. The P operation blocks the calling thread when the decremented value of sem becomes negative, whereas the V operation, independent of the P operation, signals a blocked thread to allow it to resume. These P and V operations are "indivisible actions"; each executes as a single atomic step. A positive value of sem represents the number of threads that can proceed without blocking, and a negative value represents the number of blocked threads. When the sem value is zero, no thread is waiting; a thread that then needs to decrement gets blocked and places itself on the waiting list. When the value of sem is restricted to only 0 and 1, the semaphore is a binary semaphore.

To use a semaphore, you can think of it as a counter that supports two atomic operations. Implementations of semaphores vary. From a usability perspective, two kinds of semaphores exist: strong and weak. These differ in how they handle individual calls on P. A strong semaphore maintains a First-Come-First-Served (FCFS) model and guarantees service to threads that call P, avoiding starvation. A weak semaphore provides no guarantee of service to a particular thread, and a thread might starve. For example, POSIX semaphores can allow starvation; they are implemented differently from what Dijkstra defined and are considered weak semaphores (Reek 2002).
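
For reference, POSIX exposes Dijkstra's P and V operations as sem_wait and sem_post. The C sketch below, offered only as an illustration under these assumptions rather than code from the article, uses a binary semaphore to protect a shared value.

#include <semaphore.h>
#include <stddef.h>

static sem_t s;                  /* binary semaphore, acts like a lock */
static long shared_value = 0;

/* Before the threads start: sem_init(&s, 0, 1); */

void *worker(void *arg)
{
    (void)arg;
    sem_wait(&s);                /* P: decrement; block if count is 0 */
    shared_value++;              /* critical section */
    sem_post(&s);                /* V: increment; wake one waiter */
    return NULL;
}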

According to Dijkstra, the mutual exclusion of parallel threads using the P and V atomic operations can be represented as follows:


semaphore s
s.sem = 1
begin
    T: <non-critical section>
        P(s)
        <critical section>
        V(s)
        Goto T
end

Semaphores are largely of historical interest. They are the unstructured "goto" statements of multi-threaded programming. Most programming environments provide higher-level structured synchronization primitives. However, like the goto, semaphores are occasionally the best primitive to use. A typical use of a semaphore is protecting a shared resource of which at most n instances are allowed to exist simultaneously. The semaphore starts out with value n. A thread that needs to acquire an instance of the resource performs operation P. It releases the resource using operation V.

Let's examine how semaphores might be used for the producer-consumer problem and whether the problem can be resolved using semaphores. Producer-consumer is a classic synchronization problem, also known as the bounded-buffer problem. Here a producer function generates data to be placed in a shared buffer and a consumer function receives the data out of the buffer and operates on it, where both producer and consumer functions execute concurrently. Pseudo-code using a semaphore for the producer-consumer problem is shown in Figure 7.


semaphore s

void producer () {
    while (1) {
        <produce the next data>
        s->release()
    }
}
void consumer() {
    while (1) {
        s->wait()
        <consume the next data>
    }
}
Figure 7: Pseudo-code of Producer-Consumer Problem

Here neither producer nor consumer maintains any order. If the producer runs forever ahead of the consumer, the system would require infinite buffer capacity, which is not possible. That is why the buffer size needs to be bounded to handle this type of scenario and to make sure that, if the producer gets ahead of the consumer, the time allocated to the producer is restricted. This synchronization problem can be removed by adding one more semaphore to the solution shown in Figure 7. The added semaphore maintains the boundary of the buffer, as shown in Figure 8, where sEmpty and sFull enforce the buffer-capacity constraints for the operating threads.


semaphore sEmpty, sFull

void producer() {
    while (1) {
        sEmpty->wait()
        <produce the next data>
        sFull->release()
    }
}
void consumer() {
    while (1) {
        sFull->wait()
        <consume the next data>
        sEmpty->release()
    }
}

Figure 8: Dual Semaphores Solution for Producer-Consumer Problem
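
A fuller sketch of Figure 8 adds an actual bounded buffer and a mutex for the buffer indices. The following C version uses POSIX semaphores; the capacity N and the helper names are illustrative assumptions, not part of the original article.

#include <semaphore.h>
#include <pthread.h>

#define N 8                               /* illustrative buffer capacity */

static int buffer[N];
static int in = 0, out = 0;
static sem_t sEmpty, sFull;               /* counts of free and filled slots */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

/* Before starting threads: sem_init(&sEmpty, 0, N); sem_init(&sFull, 0, 0); */

void produce(int item)
{
    sem_wait(&sEmpty);                    /* wait for a free slot */
    pthread_mutex_lock(&m);
    buffer[in] = item;
    in = (in + 1) % N;
    pthread_mutex_unlock(&m);
    sem_post(&sFull);                     /* announce a filled slot */
}

int consume(void)
{
    int item;
    sem_wait(&sFull);                     /* wait for a filled slot */
    pthread_mutex_lock(&m);
    item = buffer[out];
    out = (out + 1) % N;
    pthread_mutex_unlock(&m);
    sem_post(&sEmpty);                    /* announce a free slot */
    return item;
}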

Instead of using two independent semaphores and having a constraint-based solution, the solution in Figure 8 can also be implemented using other synchronization primitives. The following sections discuss how to solve the producer-consumer problem using locks and condition variable primitives.

Locks

Locks are similar to semaphores except that a single thread handles a lock at one instance. Two basic atomic operations are performed on a lock: acquire, which waits until the lock is free and then takes ownership of it, and release, which gives up ownership so that another thread can acquire it.

At most one thread holds the lock at a time. A thread has to acquire a lock before using a shared resource; otherwise it waits until the lock becomes available. When one thread wants to access shared data, it first acquires the lock, exclusively performs operations on the shared data, and later releases the lock for other threads to use. The level of granularity can be either coarse or fine depending on the type of shared data that needs to be protected from threads. Coarse-grained locks have higher lock contention than fine-grained ones. To avoid issues with lock granularity, most processors support the Compare and Swap (CAS) operation, which provides a way to implement lock-free synchronization. The atomic CAS operation guarantees that shared data remains synchronized among threads. If you require the use of locks, it is recommended that you use the lock inside a critical section with a single entry and single exit, as shown in Figure 9.


{define all necessary locks}
<Start multithreading blocks>
...
<critical section start>

<acquire lock L>

.. operate on shared memory protected by lock L ..

<release lock L>
<critical section end>
...
<End multithreading blocks>

Figure 9: A Lock Used Inside a Critical Section
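
The CAS operation mentioned above can be written portably with C11 atomics. The sketch below, a minimal illustration rather than code from the article, retries a compare-and-swap until an increment succeeds without any lock.

#include <stdatomic.h>

static _Atomic long counter = 0;

void lock_free_increment(void)
{
    long expected = atomic_load(&counter);
    /* Retry until no other thread changed the value between the load
       and the compare-and-swap; on failure, 'expected' is reloaded. */
    while (!atomic_compare_exchange_weak(&counter, &expected, expected + 1)) {
        /* loop and try again */
    }
}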

From an implementation perspective, it is always safe to use explicit locks rather than relying on implicit locks. In general, a lock must not be held for long periods of time. Explicit locks are defined by the developer, whereas implicit locks come from the underlying framework being used; for example, database engines provide locks to maintain data consistency.

In the producer-consumer problem, if the consumer wants to consume shared data before the producer produces it, the consumer must wait. To use locks for the producer-consumer problem, the consumer must loop until the data is ready from the producer. The reason for looping is that a lock does not support any wait operation, whereas condition variables do.

Lock Types

An application can have different types of locks according to the constructs required to accomplish the task. You must avoid mixing lock types within a given task; for this reason, special attention is required when using any third-party library. If your application has a third-party dependency for a resource R, and the third party uses lock type L for R, then whenever you need a lock mechanism for R you must use lock type L rather than any other lock type. The following sections cover these locks and define their purposes.

Mutexes. The mutex is the simplest lock an implementation can use. Some texts use the mutex as the basis to describe locks in general. The release of a mutex does not depend on the release() operation only: a timer attribute can be added to a mutex, and if the timer expires before a release operation, the mutex releases the code block or shared memory to other threads. A try-finally clause can be used to make sure that the mutex gets released when an exception occurs. The use of a timer or try-finally clause helps to prevent a deadlock scenario.
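
POSIX threads offer the timer idea described above through pthread_mutex_timedlock, which gives up after an absolute deadline instead of blocking forever. The sketch below is an assumption-based illustration, not code from the article.

#include <pthread.h>
#include <time.h>
#include <errno.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

int try_for_one_second(void)
{
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 1;                       /* give up after about 1 second */

    if (pthread_mutex_timedlock(&m, &deadline) == ETIMEDOUT)
        return -1;                              /* lock not acquired */

    /* ... critical section ... */
    pthread_mutex_unlock(&m);
    return 0;
}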

Recursive Locks. Recursive locks are locks that may be repeatedly acquired by the thread that currently owns the lock without causing the thread to deadlock. No other thread may acquire a recursive lock until the owner releases it once for each time the owner acquired it. Thus, when using a recursive lock, be sure to balance acquire operations with release operations. The best way to do this is to lexically balance the operations around single-entry, single-exit blocks, as was shown for ordinary locks. The recursive lock is most useful inside a recursive function. In general, recursive locks are slower than non-recursive locks. An example of recursive lock use is shown in Figure 10.


Recursive_Lock L
void recursiveFunction (int count) {
    L->acquire()
    if (count > 0) {
        count = count - 1;
        recursiveFunction(count);
    }
    L->release();
}

Figure 10: An Example of Recursive Lock Use
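
In POSIX threads, the behavior shown in Figure 10 corresponds to a mutex created with the PTHREAD_MUTEX_RECURSIVE type. The C sketch below is illustrative; the initialization step is assumed to run once before any thread calls the recursive function.

#include <pthread.h>

static pthread_mutex_t rlock;

void init_recursive_lock(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&rlock, &attr);
    pthread_mutexattr_destroy(&attr);
}

void recursive_function(int count)
{
    pthread_mutex_lock(&rlock);        /* re-acquired by the owning thread */
    if (count > 0)
        recursive_function(count - 1);
    pthread_mutex_unlock(&rlock);      /* one release per acquire */
}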

Read-Write Locks. Read-Write locks are also called shared-exclusive or multiple-read/single-write locks or non-mutual exclusion semaphores. Read-write locks allow simultaneous read access to multiple threads but limit the write access to only one thread. This type of lock can be used efficiently for those instances where multiple threads need to read shared data simultaneously but do not necessarily need to perform a write operation. For lengthy shared data, it is sometimes better to break the data into smaller segments and operate multiple read-write locks on the dataset rather than having a data lock for a longer period of time.
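
POSIX threads provide this lock type directly as pthread_rwlock_t. The sketch below, an illustration with assumed names, lets many readers proceed concurrently while a writer gets exclusive access.

#include <pthread.h>

static pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
static int shared_table[100];

int read_entry(int i)
{
    int v;
    pthread_rwlock_rdlock(&rw);   /* many readers may hold this at once */
    v = shared_table[i];
    pthread_rwlock_unlock(&rw);
    return v;
}

void write_entry(int i, int v)
{
    pthread_rwlock_wrlock(&rw);   /* exclusive: excludes readers and writers */
    shared_table[i] = v;
    pthread_rwlock_unlock(&rw);
}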

Spin Locks. Spin locks are non-blocking locks owned by a thread. Waiting threads must "spin," that is, poll the state of a lock rather than get blocked. Spin locks are used mostly on multiprocessor systems, because while a thread spins on a single-core processor system, no processor resources are available to run the other thread that will release the lock. The appropriate condition for using spin locks is when the hold time of a lock is less than the time needed to block and wake up a thread. The change of control between threads involves context switching and updating thread data structures, which could require more instruction cycles than spinning. The spin time of spin locks should be limited to about 50 to 100 percent of a thread context switch (Kleiman 1996), and a spin lock should not be held during calls to other subsystems. Improper use of spin locks might cause thread starvation, so think carefully before using this locking mechanism. The thread-starvation problem of spin locks can be alleviated by using a queuing technique, in which every waiting thread spins on a separate local flag in memory, organized as a First-In, First-Out (FIFO) queue.
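
A minimal spin lock can be built from an atomic test-and-set flag, as in the C11 sketch below. This is only an illustration of the polling idea; production code would bound the spin time and fall back to blocking, as the paragraph above advises.

#include <stdatomic.h>

static atomic_flag spin = ATOMIC_FLAG_INIT;

void spin_acquire(void)
{
    /* Poll ("spin") until the flag was previously clear. */
    while (atomic_flag_test_and_set(&spin)) {
        /* busy-wait; a real implementation might pause or yield here */
    }
}

void spin_release(void)
{
    atomic_flag_clear(&spin);
}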

Condition Variables

Condition variables are also based on Dijkstra's semaphore semantics, with the exception that no stored value is associated with the operation. This means condition variables do not contain the actual condition to test; a shared data state is used instead to maintain the condition for threads. A thread waits or wakes up other cooperative threads until a condition is satisfied. Condition variables are preferable to locks when pooling threads requires some scheduling behavior among them. To operate on shared data, a condition variable C uses a lock L. Three basic atomic operations are performed on a condition variable C: wait, which releases the lock and blocks until the condition is signaled; signal, which wakes one waiting thread; and broadcast, which wakes all waiting threads.

To control a pool of threads, use of a signal function is recommended. The penalty for using a broadcast-based signaling function could be severe, and extra caution is needed before waking up all waiting threads. For some instances, however, broadcast signaling can be effective. As an example, a "write" lock might allow all "readers" to proceed at the same time by using a broadcast mechanism.

To show the use of a condition variable for a synchronization problem, the pseudo-code in Figure 11 solves the producer-consumer problem discussed earlier. A variable LC is used to maintain the association between condition variable C and an associated lock L.


Condition C;
Lock L;
Bool LC = false;

void producer() {
    while (1) {
        L ->acquire();
        // start critical section
        while (LC == true) {
            C ->wait(L);
        }
        // produce the next data
        LC = true;
        C ->signal(L);
        // end critical section
        L ->release();
    }
}

void consumer() {
    while (1) {
        L ->acquire();
        // start critical section
        while (LC == false) {
            C ->wait(L);
        }
        // consume the next data
        LC = false;
        // end critical section
        L ->release();
    }
}

Figure 11: Use of a Condition Variable for the Producer-Consumer Problem
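
Mapped onto POSIX threads, the pseudo-code of Figure 11 looks roughly like the sketch below. The flag data_ready stands in for LC; as an assumption beyond the figure, the consumer also signals so that a waiting producer is woken.

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  C = PTHREAD_COND_INITIALIZER;
static bool data_ready = false;          /* plays the role of LC */

void producer_step(void)
{
    pthread_mutex_lock(&L);
    while (data_ready)                   /* wait until the consumer is done */
        pthread_cond_wait(&C, &L);
    /* ... produce the next data ... */
    data_ready = true;
    pthread_cond_signal(&C);             /* wake a waiting consumer */
    pthread_mutex_unlock(&L);
}

void consumer_step(void)
{
    pthread_mutex_lock(&L);
    while (!data_ready)                  /* wait until the producer is done */
        pthread_cond_wait(&C, &L);
    /* ... consume the next data ... */
    data_ready = false;
    pthread_cond_signal(&C);             /* wake a waiting producer */
    pthread_mutex_unlock(&L);
}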

Monitors

For structured synchronization, a higher-level construct known as a monitor is introduced to simplify the use of condition variables and locks. The purpose of the monitor is to hide the complexity of primitive synchronization operations and remove the implementation details from application developers. The compiler for a language that supports monitors automatically inserts lock operations at the beginning and the end of each synchronization-aware routine. Most recent programming languages do not support monitors explicitly; rather, they expose lock and unlock operations to developers. The Java language supports explicit monitor objects along with synchronized blocks inside a method. In Java, the monitor is maintained by the "synchronized" construct, such as


synchronized (object) {
    <Critical Section>
}

where the "condition" primitives are used by wait(),notify(), or notifyAll() methods. Do not confuse this with the Monitor object in the Java SDK though. The Java Monitor object is used to perform resource management in Java Management Extension (JMX). Similarly, the monitor object in C# is used as lock construct.

Messages

The message is a special method of communication to transfer information or a signal from one domain to another. The definition of domain is different for different scenarios. For multi-threading environments, the domain is referred to as the boundary of a thread. The three M's of message passing are multi-granularity, multi-threading, and multitasking (Ang 1996). In general, the conceptual representations of messages get associated with processes rather than threads. From a message-sharing perspective, messages get shared using an intra-process, inter-process, or process-process approach, as shown in Figure 12.

Figure 12: Message Passing Model

Two threads that communicate with messages and reside in the same process use intra-process messaging. Two threads that communicate and reside in different processes use inter-process messaging. From the developer's perspective, the most common form of messaging is the process-process approach, when two processes communicate with each other rather than depending on the thread.

In general, messaging can be devised according to the memory model of the environment where the messaging operation takes place. Messaging for the shared memory model must be synchronous, whereas for the distributed model messaging can be asynchronous. These operations can also be viewed from a different angle: when there is nothing to do after sending the message and the sender has to wait for the reply, the operation needs to be synchronous, whereas if the sender does not need to wait for the reply in order to proceed, the operation can be asynchronous.

The generic form of message communication can be represented as follows:

Sender:

    <sender sends message to one or more recipients through structure>
    // Here, structure can be either queue or port
    <if shared environment>
        <wait for the acknowledgment>
    <else>
        <sender does the next possible operation>

Receiver:

    <might wait to get message from sender from appropriate structure>
    <receive message from appropriate structure and process>

The generic form of message passing gives the impression to developers that there must be some interface used to perform message passing. The most common interface is the Message Passing Interface (MPI). MPI is used as the medium of communication, as illustrated in Figure 13.

Figure 13: Basic MPI Communication Environment
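
A minimal MPI exchange in C, shown only as an illustration of the interface, sends one integer from rank 0 to rank 1 with matching MPI_Send and MPI_Recv calls.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                     /* message payload */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}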

To synchronize the operations of threads, semaphores, locks, and condition variables are used; these synchronization primitives convey status and access information. To communicate data, threads use messaging. In thread messaging, synchronization remains explicit, as there is an acknowledgment after a message is received. The acknowledgment avoids primitive synchronization errors such as deadlocks or race conditions. The basic operational concepts of messaging remain the same for all operational models. From an implementation point of view, the generic client-server model can be used for all messaging models.

Inside hardware, message processing occurs in relationship with the size of the message. Small messages are transferred between processor registers and if a message is too large for processor registers, caches get used. Larger messages require main memory. In the case of the largest messages, the system might use processor-external DMA, as shown in Figure 14.

Figure 14: System Components Associated with Size of Messages

Flow Control-based Concepts

In the parallel computing domain, some restraining mechanisms allow synchronization among multiple attributes or actions in a system. These are mainly applicable for shared-memory multiprocessor or multi-core environments. The following sections cover two of these concepts: fence and barrier.

Fence

The fence mechanism is implemented using instructions and in fact, most of the languages and systems refer to this mechanism as a fence instruction. On a shared memory multiprocessor or multi-core environment, a fence instruction ensures consistent memory operations. At execution time, the fence instruction guarantees completeness of all pre-fence memory operations and halts all post-fence memory operations until the completion of fence instruction cycles. This fence mechanism ensures proper memory mapping from software to hardware memory models, as shown in Figure 15. The semantics of the fence instruction depend on the architecture. The software memory model implicitly supports fence instructions. Using fence instructions explicitly could be error-prone and it is better to rely on compiler technologies. Due to the performance penalty of fence instructions, the number of memory fences needs to be optimized.

Figure 15: Fence Mechanism
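
In C11, an explicit fence is available as atomic_thread_fence. The sketch below illustrates a publish/subscribe pattern that fences make safe; it is only an illustration under these assumptions, and, as noted above, relying on compiler-generated fences or higher-level primitives is usually preferable.

#include <stdatomic.h>

static int payload;                      /* ordinary shared data */
static atomic_int ready = 0;             /* publication flag */

void publisher(void)
{
    payload = 123;                                   /* pre-fence write */
    atomic_thread_fence(memory_order_release);       /* completes before... */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

void subscriber(void)
{
    if (atomic_load_explicit(&ready, memory_order_relaxed) == 1) {
        atomic_thread_fence(memory_order_acquire);   /* ...anything after this */
        /* payload is now guaranteed to be seen as 123 */
    }
}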

Barrier

The barrier mechanism is a synchronization method by which threads in the same set keep collaborating with respect to a logical computational point in the control flow of operations. Through this method, a thread from an operational set has to wait for all other threads in that set to complete in order to be able to proceed to the next execution step. This method guarantees that no threads proceed beyond an execution logical point until all threads have arrived at that logical point. Barrier synchronization is one of the common operations for shared memory multiprocessor and multi-core environments. Due to the aspect of waiting for a barrier control point in the execution flow, the barrier synchronization wait function for ith thread can be represented as

Wbarrier(i) = f(Tbarrier(i), Rthread(i)), where Wbarrier is the wait time for a thread, Tbarrier is the number of threads that have arrived, and Rthread is the arrival rate of threads.

For performance consideration and to keep the wait time within a reasonable timing window before hitting a performance penalty, special consideration must be given to the granularity of tasks. Otherwise, the implementation might suffer significantly.
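
POSIX threads expose this construct as pthread_barrier_t. The sketch below is a minimal illustration: each of NUM_THREADS workers finishes a phase, waits at the barrier, and proceeds only when every thread has arrived.

#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4                    /* illustrative thread count */

static pthread_barrier_t barrier;

/* Before starting threads: pthread_barrier_init(&barrier, NULL, NUM_THREADS); */

void *worker(void *arg)
{
    (void)arg;
    /* ... phase 1 of the computation ... */
    pthread_barrier_wait(&barrier);      /* block until all threads arrive */
    /* ... phase 2 may safely use every thread's phase-1 results ... */
    return NULL;
}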

Implementation-dependent Threading Features

The functionalities and features of threads in different environments are very similar; however, the semantics could be different. That is why the conceptual representations of threads in Windows and Linux remain the same, even though the way some concepts are implemented could be different. Also, among the threading APIs of Win32, Win64, and POSIX threads (Pthreads), the semantics differ as well. Windows threading APIs are implemented and maintained by Microsoft and work on Windows only, whereas implementations of the Pthreads APIs allow developers to implement threads on multiple platforms. The IEEE defined only the Pthreads APIs and left the implementation to OS developers. Due to these implementation issues, not all features exist in every Pthreads API, and developers sometimes use Pthreads as a wrapper for their own thread implementations. There exists a native Linux Pthreads library, comparable to Windows native threads, known as the Native POSIX Thread Library (NPTL).

Consider the different mechanisms used to signal threads in Windows and in POSIX threads. Windows uses an event model to signal one or more threads that an event has occurred. However, no counterpart to Windows events is implemented in POSIX threads. Instead, condition variables are used for this purpose.

These differences are not necessarily limited to cross-library boundaries; there may be variations within a single library as well. For example, in the Windows Win32 API, Microsoft has implemented two versions of a mutex. The first version, simply referred to as a mutex, provides one method of synchronized access to a critical section of the code. The other mechanism, referred to as a CriticalSection, essentially does the same thing, with a completely different API. What's the difference?

The conventional mutex object in Windows is a kernel mechanism. As a result, it requires a user-mode to kernel-mode transition to work. This is an expensive operation, but it provides a more powerful synchronization mechanism that can be used across process boundaries. However, in many applications, synchronization is only needed within a single process. In that case, the ability for a mutex to work across process boundaries is unnecessary and leads to wasted overhead. To remove the overhead associated with the standard mutex, Microsoft implemented the CriticalSection, which provides a user-level locking mechanism. This eliminates any need to make a system call, resulting in much faster locking.
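
The contrast described above can be seen in the two Win32 sketches below, given only as illustrations: the kernel mutex is acquired through WaitForSingleObject, while the CRITICAL_SECTION stays in user mode in the uncontended case.

#include <windows.h>

/* Kernel mutex: usable across processes, but each wait is a system call. */
static HANDLE hMutex;            /* created earlier with CreateMutex(NULL, FALSE, NULL) */

void with_kernel_mutex(void)
{
    WaitForSingleObject(hMutex, INFINITE);
    /* ... critical section ... */
    ReleaseMutex(hMutex);
}

/* CRITICAL_SECTION: single process only, user-mode fast path. */
static CRITICAL_SECTION cs;      /* initialized earlier with InitializeCriticalSection(&cs) */

void with_critical_section(void)
{
    EnterCriticalSection(&cs);
    /* ... critical section ... */
    LeaveCriticalSection(&cs);
}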

Even though different threading libraries will have different ways of implementing the features described in this article, the key to being able to successfully develop multi-threaded applications is to understand the fundamental concepts behind these features.

Key Points

This article presented parallel programming constructs. To become proficient in threading techniques and face fewer issues during design and development of a threaded solution, an understanding of the theory behind different threading techniques is helpful. Here are some of the points you should remember:

- Synchronization coordinates thread execution and manages shared data; the two widely used forms are mutual exclusion and condition synchronization.
- Keep critical sections as small as practical, and give each one a single entry and a single exit point.
- Deadlock can be avoided through a lock-holding prevention mechanism or a lock hierarchy, and by using only the appropriate number of locks.
- Semaphores, locks, and condition variables are the basic synchronization primitives; choose the lock type (mutex, recursive, read-write, spin) that fits the task, and do not mix lock types for the same resource.
- Messages, fences, and barriers provide synchronization in message-passing and shared-memory environments.
- Threading features are implementation dependent: Windows threads and Pthreads expose similar concepts with different semantics.
