Basics of Lock-free Programming
One way to allow multi-task sharing of a referenced object is to use locks, where only one task at a time is granted access to a shared resource. Other tasks wait until the lock is released. A "resource" here is primarily a data object, but it can encompass other entities such as files, ports, specialized processors, data stores, user interfaces, and the like.
Lock-free approaches, on the other hand, are based on hardware synchronization primitives that are non-blocking, so that rather than waiting until a lock is available, tasks run optimistically on the assumption that there is no contention. When a synchronization point is reached, the code then checks for potential conflicts arising from concurrent updates of the shared data by other tasks. In the face of conflicts, the processing is repeated.
Lock-free approaches are similar to optimistic locking policies, where locks are only checked at the end of a transaction. Optimistic approaches to data sharing are most useful in low-contention environments. Such policies tend to scale well with large numbers of processors and short execution tasks, where the ability to introduce more concurrency without locking overhead and delays outweighs the overhead of transaction restarts due to contention.
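The optimistic read-compute-retry cycle described above can be sketched with standard C++ atomics. This is a minimal illustration, not a production pattern: each task reads the shared value, computes a new one, and attempts to publish it with a compare-and-swap, retrying only when another task got there first.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each task optimistically computes an update and publishes it with a
// compare-and-swap; on conflict, it retries with the freshly observed value.
int optimistic_counter_demo() {
    std::atomic<int> shared{0};
    auto worker = [&shared] {
        for (int i = 0; i < 1000; ++i) {
            int observed = shared.load(std::memory_order_relaxed);
            // compare_exchange_weak refreshes 'observed' on failure,
            // so the loop body simply retries until no conflict occurred.
            while (!shared.compare_exchange_weak(observed, observed + 1)) {
            }
        }
    };
    std::vector<std::thread> tasks;
    for (int t = 0; t < 4; ++t) tasks.emplace_back(worker);
    for (auto& t : tasks) t.join();
    return shared.load();
}
```

With four tasks each performing 1000 increments, every conflict is resolved by a retry rather than by blocking, so the final value is exactly 4000.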
When the overhead of reprocessing is not excessive, lock-free approaches offer other significant advantages in that they also avoid issues such as:
- Deadlock, which occurs when one task holding a resource requests a resource held by a second task that is in turn waiting for a resource held by the first task to be freed.
- Priority inversion, which occurs when high-priority lock requesters wait for lower-priority requesters.
- Stalls, which occur when tasks wait on tasks that are not running or may even have failed.
- Convoying, which occurs when multiple tasks wait in turn for a lock and then wind up running literally in lock step, and therefore serially.
In The Concurrency Revolution, Herb Sutter comments:
But, concurrent lock-free programming is extremely hard for programmers to understand and reason about. Most of the time, only systems and library writers should have to understand lock-free programming, although virtually everybody should be able to take advantage of the lock-free systems and libraries those people produce.
The approaches I outline here can be implemented with basic C++ language support for threads along with libraries using lock-free programming techniques, and thereby provide a considerable degree of simplicity for typical application developers.
A library implementation would provide the basics for the support described here. However, there are many variations that could be adapted to the specific needs of a variety of applications. Hence a "policy-based" implementation lets users of the concept adapt the implementation to specific goals and strategies.
In particular, adaptability and sample classes are needed for a Shared Resource Manager with policies for a Resource Access Wrapper, a Controller that provides the algorithms used by the Shared Resource Manager, and an Allocator with memory control strategies. Additional application-level policies can support such common access requirements as data location and transformations, security control, and error handling and reporting.
Basic Implementation Support
Basic support for the lock-free programming approaches suggested here involves:
- Environment support
- A Shared Resource Manager to maintain versions of current, in-use, and potentially updated instances of shared data.
- An Allocator to manage instances of application data as well as internal control elements.
- Synchronization Elements to report storage characteristics for data sharing.
- An application-level Object Accessor.
Environment Support for Lock-free Programming
Basic synchronization primitives require:
- Atomic operations involving some concept of an atomic test-and-set, where data is set only if the test is true, and where there can be no intervening operations to change the data between the test and the set. Typical machine operations are compare-and-swap (CAS) as well as simpler operations such as counter increment or decrement.
- Memory barriers that ensure all relevant data changes are visible to other machine processors that can observe results of a successful test and set. In systems with cache memory, this requires a "release" by the modifying processor to flush changed data to main memory, and an "acquire" by an observing processor, typically to invalidate cache lines so that they will be refetched on the next access.
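The release/acquire pairing above maps directly onto C++ memory orderings. The sketch below, a standard publication idiom rather than anything specific to this design, shows a producer writing plain data and then publishing it with a release store, while a consumer's acquire load guarantees it observes the payload written before the flag.

```cpp
#include <atomic>
#include <thread>

// The release store publishes 'payload'; the matching acquire load
// guarantees the consumer sees the write that preceded the store.
int release_acquire_demo() {
    int payload = 0;                 // ordinary (non-atomic) shared data
    std::atomic<bool> ready{false};  // synchronization flag
    int result = 0;

    std::thread producer([&] {
        payload = 42;                                  // write the data first
        ready.store(true, std::memory_order_release);  // then publish it
    });
    std::thread consumer([&] {
        // Spin until the flag is visible; acquire orders the later read.
        while (!ready.load(std::memory_order_acquire)) {
        }
        result = payload;  // guaranteed to observe 42
    });
    producer.join();
    consumer.join();
    return result;
}
```

Without the release/acquire pair (for example, with relaxed ordering on both sides), the consumer could legally observe the flag set but a stale payload.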
As multi-tasking becomes more common across systems with a variety of memory caches and physical types, it will be necessary to specify additional characteristics of shared object storage. For instance:
- Data that is accessible to multiple tasks running on processors on the same chip does not need to be flushed through main memory.
- Memory release and acquire operations do not necessarily have to flush or invalidate all memory currently cached by a task.
- Memory may be conceptually shared with remote processors through communication links.
The approaches I outline here provide for application specification of the precise data needed to be released and acquired. Not only does this allow more efficient operation on platforms that can exploit this information, but it also provides the specifications needed for reasoning about the correctness of the code.
Shared Resource Manager
The Shared Resource Manager maintains control of, and visibility into, resources that are in contention. It is where the essential support resides for the algorithms necessary for the abstractions discussed here.
Fundamental algorithms for lock-free approaches typically process a sorted list of atomic operations. If processing of this list is interrupted, then other tasks, on finding a conflict, cause one of the tasks to fail and retry, while the other task continues processing.
The typical shared resource is state data shared by separate tasks, but it can include other system data such as files, ports, data stores, devices, and the like.
Shared memory may have special allocation requirements based on performance considerations. These can be configured and reported by Synchronization Elements. For specialized memory systems, the Allocator can set and use appropriate memory options and operations. For cached memories, the Allocator can be concerned with such issues as locality of reference, cache line usage and aliasing, and cache look-aside translation buffers.
All instances of data versions that are current, shared or used exclusively need to be tracked. Only when they can no longer be referenced can they be freed and reallocated. This is critical, for instance, in data list structures to avoid an ABA problem.
A requirement for an Allocator, in some situations, is to ensure that memory is freed by the same task that allocated it.
One of the more difficult lock-free programming issues is the so-called "ABA problem." It typically occurs with list structures, where applications check a pointer on the assumption that if the pointer has not changed, then the referenced data is still the correct reference. To see how the problem can arise, suppose location A is referenced as the next element in the list. If the object at A is deleted from the list, the next pointer may be set to another object at location B and the object at A freed. Later still, the memory at A can be reallocated to a new object that could be inserted at A's previous list position. A typical test that the next element in the list is still the same one might check that the next list reference is still location A. This is valid only if location A cannot be freed and then reallocated while the reference exists.
Systems with garbage collection can sometimes avoid ABA difficulties since referenced data cannot be freed and reallocated. However, garbage collection by itself does not handle circular references, which occur in complex data structures that may need to be shared.
One approach is to maintain a change counter that is updated whenever the reference is updated. If both the pointer and the counter have not changed, then there has been no intervening update. In some environments, however, it is not possible to atomically update both the counter and the reference directly. For one approach to solving this, see Maged M. Michael's paper Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects, which presents a scheme that keeps track of all tasks referencing an object and prevents reuse of the location while any references still exist. A simpler alternative, as with other shared pointers, would be to maintain just a count of references. A list of references, though, has the advantage that cleanup is possible if a task fails.
All this suggests that there be mechanisms to configure objects with storage characteristics that can be determined at least at run time. Here I suggest the concept of "Synchronization Elements". Shared data objects can inherit from them and obtain the required interface properties. Synchronization Elements may be as simple as pointers or array indices. Typically they are the subject of CAS operations, but they can be used with other atomic operations.
Synchronization Elements provide an interface that returns a list of data areas, including the Synchronization Element itself and all data areas that can be referenced directly or indirectly through values in the Synchronization Element. This list can be configured at run time. The elements in the list provide information about their referenced storage area, including its address and length, and storage properties. These properties can be environment specific and may need to be configured with environment specific adaptation code.
This interface is used by internal routines to determine what areas of memory need to be released and acquired, or sent and received, from an updating task to receiving tasks. As cache sizes become larger, the ability for memory barriers to release and acquire only required storage can provide significant performance gains. Such support is already available on specialized processors. Also, in many cases, it should be possible for the task acquiring the data to determine its source and then exercise the appropriate protocols to access it.
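A Synchronization Element interface along these lines might look like the sketch below. All of the names (`SynchronizationElement`, `DataArea`, `shared_areas`) are hypothetical, chosen only to illustrate the idea of an object reporting the storage areas reachable through it.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical: describes one storage area that must be released/acquired.
struct DataArea {
    const void* address;
    std::size_t length;
};

// Hypothetical interface: a Synchronization Element reports the element
// itself plus all storage reachable directly or indirectly through it,
// so memory barriers can be limited to the storage actually shared.
class SynchronizationElement {
public:
    virtual ~SynchronizationElement() = default;
    virtual std::vector<DataArea> shared_areas() const = 0;
};

// A shared data object inherits to obtain the reporting interface.
struct SharedRecord : SynchronizationElement {
    int key = 0;
    std::vector<double> payload = std::vector<double>(16);

    std::vector<DataArea> shared_areas() const override {
        return {
            {this, sizeof(*this)},  // the element itself
            {payload.data(), payload.size() * sizeof(double)},  // indirect
        };
    }
};

// Count of areas a SharedRecord reports, for demonstration.
std::size_t shared_record_area_count() {
    SharedRecord record;
    return record.shared_areas().size();
}
```

Internal release/acquire routines would iterate over the returned list instead of flushing or invalidating an entire cache, which is the performance opportunity described above.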
The compiler also needs to know which data elements are not stable across synchronization points. In a C++ compiler, this could be effected with some extension of the concepts inherent in the volatile keyword. (An appropriate replacement might be called "synch".) Routines that establish a synchronization point could also carry such a keyword in their declarations. Such an explicit declaration of synchronization points requires fewer of them than in other proposals. Also, with care, these declarations would typically be isolated in a few low-level application functions that basically implement the application-specific shared data model.
At the application level, it is useful to tailor an Object Accessor that encapsulates a proxy for access to shared objects. This can simplify application interfaces, and it also ensures that appropriate access protocols are uniformly used. In some -- but not all -- cases, application code at higher levels can be largely data-sharing agnostic. In addition, an Accessor can be extended to meet particular application interface requirements, such as data transformations.
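An Object Accessor along these lines might be sketched as below. The class and method names are assumptions for illustration; the point is that the retry protocol lives inside the proxy, so application code stays data-sharing agnostic.

```cpp
#include <atomic>

// Hypothetical Object Accessor: a proxy that encapsulates the optimistic
// access protocol so callers never write retry loops themselves.
// T must be a type std::atomic supports (small, trivially copyable).
template <typename T>
class ObjectAccessor {
public:
    explicit ObjectAccessor(T initial) : state_(initial) {}

    // Read a consistent snapshot of the shared object.
    T read() const { return state_.load(std::memory_order_acquire); }

    // Apply a pure transformation optimistically, retrying on conflict.
    template <typename Fn>
    T update(Fn transform) {
        T observed = state_.load(std::memory_order_relaxed);
        T desired = transform(observed);
        while (!state_.compare_exchange_weak(
                observed, desired,
                std::memory_order_release, std::memory_order_relaxed)) {
            desired = transform(observed);  // recompute from the fresh value
        }
        return desired;
    }

private:
    std::atomic<T> state_;
};
```

Application code then calls `accessor.update(...)` with an ordinary function; extensions such as data transformations or access checks can be layered into `read` and `update` without touching callers.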