Lock-free Interprocess Communication

Interprocess communication is an essential component of modern software engineering. Often, lock-free IPC is accomplished via special processor commands. Here's a way around that.


June 15, 2006
URL:http://www.drdobbs.com/cpp/lock-free-interprocess-communication/189401457

Alexander is currently working as a Senior Software Engineer for Computer Associates, and can be contacted for the purposes of this article at [email protected].


Some time ago I became interested in finding a type of interprocess (or interthread) communication that would place minimal preconditions on the platform and would not require a set of special processor commands. This article describes lock-free interprocess communication that does not require any special processor commands; the only precondition for successful communication is knowledge of the processor word length.

Many recipes for lock-free interprocess communication require special atomic processor commands. In this article, I propose a communication type that requires only atomic writing of a processor word from the processor cache into main memory, and atomic reading of a processor word from main memory into a processor register or the processor cache. To the best of my knowledge, all platforms satisfy these requirements as long as the address in memory is properly aligned. Later in the article, I'll discuss performance issues related to cache thrashing.

Based on this communication mechanism, I implement a registry that allows updates while letting read performance scale roughly linearly with the number of processors. Some performance testing results are presented for single- as well as multi-processor systems. This testing shows that on some platforms, lock-free algorithms outperform standard techniques by a factor of 30 or more (>=3000%).

In this article, I'll present four algorithms. Algorithm one demonstrates the main idea. Algorithm two (the light pipe) is mostly a low-level communication primitive for passing data between two processes, but it is interesting to some extent in and of itself. Algorithm three takes the cache line length into account to increase the performance of algorithm two. Algorithm four is an implementation of a registry optimized for reads; it uses algorithm three (the cache line optimized light pipe) as its communication primitive.

Algorithm One

This algorithm demonstrates the main idea. Consider the simplest case, which involves two processes (or threads; for the purposes of this article, I'll use "processes" and "threads" interchangeably). I'll call them Writer and Reader. Writer transmits one non-zero processor word of data to Reader and gets confirmation that the data has been delivered successfully. This is done by writing the word and the confirmation into shared memory.

In our case, it is important that the shared memory used to transmit data between processes is aligned to the size of the processor word. This alignment guarantees that read/write operations on main memory will be atomic. Atomic here means that if the word initially holds a value V1 and one process writes another value V2, then any other process or thread, even without any synchronization, will read either the old value V1 or the new value V2.

The algorithm itself is the following:

  1. Initially the word in the shared memory has zero value.
  2. When Writer needs to transmit a non-zero processor word to Reader, it writes this word into shared memory.
  3. Reader continuously reads a processor word from the shared memory and compares it with zero. When it reads a non-zero value it means that data has arrived.
  4. Reader writes back to the same place in the shared memory a zero value to confirm that data has arrived.
  5. Writer reads the shared word until it reads zero. This means that Writer has received confirmation from Reader that the data was transmitted successfully.

This algorithm does not guarantee how fast data will be delivered from one process to another, or how fast confirmation will be delivered back. This is similar to sending a network packet from one computer to another and receiving a confirmation.
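
For concreteness, here is a minimal C++ sketch of this handshake. It assumes the shared word lives at a word-aligned address and is declared volatile so that the compiler re-reads memory on every access; the names (sharedWord, WriteWord, ReadWord) are illustrative only, and on a real platform appropriate memory barriers may also be needed.

    #include <cstddef>

    // Shared, word-aligned location, initially zero.
    volatile size_t* sharedWord;

    // Writer side: transmit one non-zero word and wait for confirmation.
    void WriteWord(size_t value)               // value must be non-zero
    {
    	*sharedWord = value;                   // step 2: publish the data
    	while (*sharedWord != 0) {             // step 5: wait until Reader zeroes it
    		/* spin or do something useful */
    	}
    }

    // Reader side: wait for a non-zero word, consume it, confirm with zero.
    size_t ReadWord()
    {
    	size_t value;
    	while ((value = *sharedWord) == 0) {   // step 3: poll for data
    		/* spin or do something useful */
    	}
    	*sharedWord = 0;                       // step 4: confirm delivery
    	return value;
    }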

Algorithm Two (Light Pipe)

This communication model also involves two processes: Reader and Writer. Writer writes data into shared memory, and the Reader process reads consistent data from shared memory. The shared memory is aligned to the processor word size and is initially filled with zeros. Data is written by the Writer process to shared memory word by word, starting from word number 0. The data must contain only non-zero words (later I'll show how to lift this restriction). Before writing a non-zero word to some address, the Writer process reads the word at this address and checks that its value is zero. If the word read is zero, the data can be written; otherwise the Writer process must wait. Suppose there are N words in the shared memory. After writing word number N-1, Writer starts again with word number 0. Here is some pseudocode for the Writer process:

int i = 0;
while( true ) {
	// Wait until Reader has consumed (zeroed) the current slot
	while( SharedMemory[i] != 0 ) {
		DoSomethingUseful();	// Or just wait or spin
	}
	SharedMemory[i] = GetDataToPassToAnotherProcess();	// Must be non-zero
	i = (i+1) % N;
}

The second process (Reader) reads words from shared memory starting from word number 0. When this process reads a zero-valued word from shared memory, it waits and then reads again and again until it reads a non-zero word. This is the signal that new data has arrived. After reading a non-zero word, the Reader process writes a zero-valued word in its place. Then it processes the non-zero word it just read and moves on to the next word. After reading word number N-1, the process starts again with word number 0. Here is some pseudocode for the Reader process:

int i = 0;
while( true ) {
	word theWord;
	// Poll until Writer has placed a non-zero word in the current slot
	while( (theWord = SharedMemory[i]) == 0 ) {
		DoSomethingUseful();	// Or just wait or spin
	}
	SharedMemory[i] = 0;	// Confirm receipt so Writer can reuse the slot
	ProcessDataReceivedFromTheFirstProcess(theWord);
	i = (i+1) % N;
}

In other words, processes are sending messages to each other that are one processor word long. The first process (Writer) sends non-zero information messages; the second process (Reader) reads them and writes zero-valued messages in their place as confirmation that non-zero messages are read. Since processor words are passed between different processor cache levels and main memory atomically, we can be sure that communication will be consistent.

Figures 1, 2 and 3 explain the algorithm and data migration between different processor caches.

Figure 1: Initial state (cache not considered).

Figure 2: Writer is writing non-zero words (shown in black) into shared memory, and Reader is reading non-zero words from shared memory and replacing them with zero-valued words (shown in white). Cache is not considered.

If the processes (or threads) are run on different processors, then the messages are delivered by cache synchronization hardware asynchronously and without any intervention. This can lead to higher performance.

Figure 3: Data migration between different cache memory on different processors. Some words have not been delivered from one cache to another.

Passing Zero Words

Let us get back to the question raised above about passing zero-valued words. To solve this problem it is possible to use coding: M arbitrary words are coded as M+1 non-zero words (provided M < 2^w - 1, where w is the word length in bits).

To code M arbitrary words into M+1 non-zero words, we add one extra word in front of the M words; it contains the number (from 1 to M) of the first zero-valued word, or a value greater than M if there is no zero-valued word among the M words. Then the first zero-valued word is replaced with the number of the next zero-valued word (or with a number greater than M if no further zero-valued word exists). The process repeats until all zero-valued words have been replaced.

If we need to pass a chunk of M words to another process, the overhead is only one extra word written into shared memory. Some processor overhead is required for coding and decoding, but it should be relatively small because it happens entirely within a single processor's cache (no interprocess communication is involved). See Figure 4.

Figure 4: Passing zero words. White is for zero-valued words, black is for non-zero valued words, red is for zero-valued words which were replaced with non-zero valued words, and green is for the additional header word which contains the number of the first zero-valued word.
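
As an illustration of this coding scheme, here is a small C++ sketch. The function names and the use of std::vector are my own, not taken from the article's implementation; the idea is simply that the zero-valued words are chained together by 1-based indices and the extra header word holds the index of the first one.

    #include <cstddef>
    #include <vector>

    typedef size_t word;

    // Encode M arbitrary words in place into M non-zero words plus one non-zero
    // header word. Each zero-valued word is replaced by the 1-based index of the
    // next zero-valued word (or M+1 if there is none); the returned header holds
    // the index of the first zero-valued word. Requires M < 2^w - 1.
    word EncodeZeroWords(std::vector<word>& data)
    {
    	const word M = data.size();
    	word header = M + 1;              // "no zero word" marker
    	word* prevLink = &header;         // place that will point to the next zero word
    	for (word i = 0; i < M; ++i) {
    		if (data[i] == 0) {
    			*prevLink = i + 1;        // link the previous zero word to this one
    			prevLink = &data[i];
    			data[i] = M + 1;          // provisional "last zero word" marker
    		}
    	}
    	return header;                    // always non-zero
    }

    // Restore the original words after the chunk has been received.
    void DecodeZeroWords(word header, std::vector<word>& data)
    {
    	const word M = data.size();
    	for (word pos = header; pos <= M; ) {
    		word next = data[pos - 1];    // index of the next zero-valued word
    		data[pos - 1] = 0;            // restore the original zero
    		pos = next;
    	}
    }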

Algorithm Three (Cache Line Optimized Light Pipe)

In the aforementioned mechanism there are a few performance problems. One of them is cache thrashing (false sharing), which can occur when Writer and Reader are writing to the same cache line (see [1] for more details about false sharing). In this case the cache synchronization mechanism has to pass the data within that cache line back and forth between processor caches several times. To avoid this behaviour, the above mechanism needs to be updated:

  1. Internal data of the Reader and Writer processes should be placed on different cache lines. If there are several levels of cache in the system, the longest cache line should be taken into account.
  2. Shared memory needs to be aligned to the cache line length. Again, if there are several levels of cache in the system, the longest cache line should be taken.
  3. The Reader algorithm needs to be updated:

    int cacheLineLengthInWords = 8;		// Platform dependent
    int i = 0;
    while( true ) {
    	word theWord;
    	while( (theWord = SharedMemory[i]) == 0 ) {
    		DoSomethingUseful();	// Or just wait or spin
    	}
    	// If it is the last word in the cache line
    	if( (i+1) % cacheLineLengthInWords == 0 ) {
    		// Clean up the whole cache line
    		for( int j = 0; j < cacheLineLengthInWords; ++j ) {
    			// It is important to write the words backwards
    			SharedMemory[i - j] = 0;
    		}
    	}
    	ProcessDataReceivedFromTheFirstProcess(theWord);
    	i = (i+1) % N;
    }
    

This update changes the behaviour of Reader so that it postpones writing zero-valued words back into the cache line until the whole line has been read. It deliberately writes the zero values backwards, from the end of the cache line, to avoid the situation where Writer starts writing into the cache line before Reader has completely cleared it with zero values.
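
Points 1 and 2 above are purely layout concerns. One possible way to express them (a sketch that assumes a 64-byte cache line and uses the C++11 alignas keyword for brevity; none of these names come from the original code) is:

    #include <cstddef>

    typedef size_t word;

    const size_t kCacheLineBytes = 64;                       // platform dependent
    const size_t kWordsPerLine   = kCacheLineBytes / sizeof(word);
    const size_t N               = 128 * kWordsPerLine;      // buffer length, a multiple of the line length

    // Point 1: each process keeps its private bookkeeping on its own cache line,
    // so Reader's and Writer's counters never share a line.
    struct alignas(kCacheLineBytes) PrivateState {
    	size_t index;                                        // current position in the ring
    };

    // Point 2: the shared buffer itself starts on a cache-line boundary and its
    // length is a multiple of the cache-line length.
    alignas(kCacheLineBytes) volatile word SharedMemory[N];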

Algorithm Four (Read-optimized Registry)

A common problem is implementing a "registry" that is optimized for reads but still allows updates. A standard implementation would use std::map and put locks (critical sections on Windows) around every modification of, and every read from, this map.

Another implementation, inspired by the above lock-free technique, combines the standard technique of using mutexes with the light pipes described above. The algorithm involves a main copy of the registry data (one instance of class Registry) and views (or subscribers) of this registry (many instances of class RegistryView), which hold local copies of the registry data. (See Figure 5.)

Figure 5: Multiple subscribers to an instance of Registry.

Light pipes (KeyValuePipe instances) are used to send updates from the main copy of the data to every local copy asynchronously. Every update is applied to the main copy first, and then it is coded and written to every pipe, which eventually leads to the update of every local copy.

Every subscriber, before reading from its local copy of data, checks for updates which come from the Registry. If some data is found in the pipe then this data is applied to the local copy and then the read operation is performed from the local updated copy of data.

Since updates are rare, the most common scenario is that no updates are found in the pipe. In this case the check amounts to just one read from the pipe's buffer and a comparison of the result with zero.
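
As an illustration, the read path of a subscriber might look roughly like the following sketch. The class and member names (RegistryView, KeyValuePipe, TryReceive, and so on) are hypothetical stand-ins, since the article does not show the actual interfaces; the point is only that a lookup first drains the pipe and then reads the local std::map.

    #include <map>
    #include <string>

    struct Update { bool reset; std::string key, value; };

    // Assumed interfaces: a pipe that hands back decoded updates one at a time,
    // and a registry that can produce a full snapshot under its mutex.
    class KeyValuePipe {
    public:
    	bool TryReceive(Update& out);        // returns false when the pipe is empty
    };

    class Registry {
    public:
    	std::map<std::string, std::string> SnapshotUnderMutex();
    };

    class RegistryView {
    public:
    	std::string Lookup(const std::string& key)
    	{
    		ApplyPendingUpdates();           // in the common case this is one pipe read
    		std::map<std::string, std::string>::const_iterator it = local_.find(key);
    		return it != local_.end() ? it->second : std::string();
    	}

    private:
    	void ApplyPendingUpdates()
    	{
    		Update u;
    		while (pipe_.TryReceive(u)) {
    			if (u.reset)                 // pipe overflowed: resynchronize from the main copy
    				local_ = registry_->SnapshotUnderMutex();
    			else
    				local_[u.key] = u.value;
    		}
    	}

    	KeyValuePipe pipe_;
    	Registry*    registry_;
    	std::map<std::string, std::string> local_;
    };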

The standard technique is used only in two cases:

  1. When a process needs to write data to the registry. Since writes are considered rare events, using mutexes should not decrease performance much.
  2. When there is a subscriber which reads from the registry less often than the registry is updated. This leads to a situation where the light pipe to this subscriber overflows with data. In this case, a Reset command is written to this pipe as the last command, to notify the subscriber that all pending changes should be ignored and that it must fetch the main copy of the data from the Registry and install it as its local copy. Using mutexes in this case will also not decrease overall performance much.
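
For these two cases, a hypothetical sketch of the write path shows where the mutex is actually taken. Again, Mutex, ScopedLock, TrySend and SendReset are illustrative names and not the article's classes; the structure simply follows the description above: update the main copy under the mutex, then push the coded update into every subscriber's pipe, falling back to a Reset command when a pipe is full.

    #include <map>
    #include <string>
    #include <vector>

    class Mutex { public: void Acquire(); void Release(); };
    class ScopedLock {
    public:
    	explicit ScopedLock(Mutex& m) : m_(m) { m_.Acquire(); }
    	~ScopedLock() { m_.Release(); }
    private:
    	Mutex& m_;
    };

    class KeyValuePipe {
    public:
    	bool TrySend(const std::string& key, const std::string& value);  // false if the pipe is full
    	void SendReset();                    // last command written to an overflowing pipe
    };

    class Registry {
    public:
    	void Set(const std::string& key, const std::string& value)
    	{
    		ScopedLock guard(mutex_);        // writes are rare, so taking the mutex is cheap
    		main_[key] = value;              // 1. update the main copy first
    		for (size_t i = 0; i < pipes_.size(); ++i) {
    			// 2. distribute the coded update; an overflowing subscriber gets a
    			//    Reset and will later fetch a snapshot under the mutex.
    			if (!pipes_[i]->TrySend(key, value))
    				pipes_[i]->SendReset();
    		}
    	}

    private:
    	Mutex mutex_;
    	std::map<std::string, std::string> main_;
    	std::vector<KeyValuePipe*> pipes_;
    };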

It should be clearly understood that the above mechanism lets every reader read updated data at almost the same speed as reads from a purely local copy, at the cost of some latency between updates of the main copy and updates of the local copies. It is normal for a local copy to lag slightly behind the main copy for a short period of time. On the other hand, all changes to the local copy are consistent, and the sequence of changes is guaranteed to be the same as for the main copy. In many cases this behaviour is sufficient. Examples include routing tables, which are read repeatedly and updated relatively rarely; for them, some latency and asynchrony in data updates do not play a vital role, but consistency is important.

Performance Testing

Algorithm 2 was compared with Algorithm 3 and with a standard version of the same algorithm. The test involves passing 10,000,000 integer values from one thread to another. The standard algorithm performs all required locking when it accesses shared resources; locking is performed for every integer passed.

                                           Standard      Algorithm 2     Algorithm 3 (Cache Line
                                           Algorithm     (Light Pipe)    Optimized Light Pipe)
Dell notebook, 1.6 GHz                     11.5 sec      0.18 sec        0.22 sec
Dell server, 2 processors with
hyper-threading (4 virtual processors)     243 sec       0.35 sec        0.07 sec

Table 1: Performance comparison of the standard algorithm, Algorithm 2 and Algorithm 3.

I would not consider this comparison entirely fair for typical applications, because we do not usually send information in such small chunks from one process to another, but for some applications (such as routers or switches) this technique may bring real benefits. For example, consider a few specialized processors that perform pipelined data processing (Figure 6).

Figure 6: Pipelined data processing.

Data can then be passed between these processors using the proposed mechanism. It should improve overall throughput under high load and probably reduce latency under low load.

Table 2 compares Algorithm 4 with a standard version of the same algorithm. The testing procedure in this case involves running 8 threads; each thread reads from the registry 1,024,000 times and updates the registry 1,000 times.

                                           Standard version    Algorithm 4 (Lock-free
                                           of Algorithm 4      Read-optimized Registry)
Dell notebook, 1.6 GHz                     3.4 sec             3.5 sec
Dell server, 2 processors, 3.2 GHz, with
hyper-threading (4 virtual processors)     37 sec              1.2 sec

Table 2: Performance comparison of the standard and lock-free versions of Algorithm 4.

The lock-free algorithm does not provide any benefit on single-processor systems, but on a multiprocessor system it outperformed the standard algorithm by roughly a factor of 30. Since multiprocessor systems are becoming mainstream, the described technique, or variations of it, may bring considerable benefits to applications that use it.

References:

  1. Chandler, Dean. "Reduce False Sharing in .NET", http://www.intel.com/cd/ids/developer/asmo-na/eng/193679.htm?page=1
  2. Alexandrescu, Andrei. "Lock-Free Data Structures", C/C++ Users Journal, October 2004.
