Practical Lock-Free Buffers

Tools
  • Print
  • Email
Lock-free systems enjoy freedom from deadlocks
Sarmad is a Senior Software Engineer at PalmChip Pakistan (Pvt.) Limited and an M.Phil of FAST-NU Pakistan. He can be contacted at m_sarmad_asgher@yahoo.com.


Lock-free programming frees you from system failure by tolerating individual process failures. How can a process failure cause system failure? When a process fails while holding a lock, other processes in the system requiring that lock have to wait indefinitely. On the other hand, lock-freedom (or "non-blocking behavior") guarantees that the system makes progress when some process takes few steps [1]. By definition, lock-free systems enjoy freedom from deadlocks. This property of lock-free systems can be handy in applications like API monitoring. Many malware analysis tools monitor Win32 APIs. These tools [2] work by injecting code into monitored process and use Interprocess communication (IPC) to log API calls [3]. If this IPC mechanism uses locks, then it can result in deadlock. In this case, the probability of deadlock is real because execution of monitored process can halt while some thread may be logging an API call. Scenarios which increase the likelihood of deadlock include:

  • The monitored process is killed by users.
  • The monitored process exits without waiting for all the threads to terminate.
  • A thread in the monitored process is killed by the process itself.

If lock-free data structures are used for IPC, then the system can tolerate process/thread halts. In this article, I present a multi-producer/single-consumer lock-free buffer which can be used for IPC. (The complete source code and related files are available here.) In this particular situation, the consumer is the monitoring tool which logs API calls, whereas producers are the monitored processes which produce API call records. To simplify matters, I assume consumer does not fail. This lock-free buffer is different from usual circular buffers commonly used. A circular buffer uses in and out pointers to track free and used items, whereas this lock-free buffer uses a special value to track free items. This special value is never used as data. This simplifies the problem because you don't have to worry about updating in and out pointers atomically. The declaration of the buffer is:

const int MAX_SIZE=255; int buffer[MAX_SIZE]={0};

This declares a buffer of integer items. You might be wondering how integer slots can store API call records. I build on this buffer of integers and construct a buffer which can hold strings. Returning to the integer buffer, an item with a zero value indicates a free slot. All the items in the buffer are initialized to zero. Insertion is straightforward. The insert operation iterates through the items to find an empty one, then tries to put an integer value using atomic primitive CAS. (For description of CAS, see the sidebar "CAS and ABA".) Note that writing lock-free code requires care to avoid potential pitfalls see [4] for details.

int Insert(int data) { int i=0; for(i=0;i<'MAX_SIZE;i++) { if(CAS(&buffer[i], data, 0)) return i; } return -1; }

There is another handy insert operation on the buffer which tries to put the integer value at a particular index:

BOOL InsertAt(int data,int index) { if(CAS(&buffer[index], data, 0)) return TRUE; return FALSE; }

Similarly, Delete operation iterates through the buffer to find an inserted item and tries to take it out using CAS:

int Delete(int *data) { int i=0; for(i=0;i<MAX_SIZE;i++) { *data=buffer[i]; if(*data!=0) { if(CAS(&buffer[i],0, *data)) return i; } } return -1; }

This implementation is lock-free because no process owns any slots in the buffer which means a process failure does not result in loss of any slots. As the buffer slots remain intact it is always possible to put an item if the buffer is not full. The ABA problem (see the sidebar titled "CAS and ABA") commonly faced by lock-free data structures miraculously does not cause any problems. Consider a scenario: First process reads the buffer slot to be empty, but is preempted by second process which inserts an item in that slot. This is followed by a third process which takes out that item. This in turn is followed by the resumption of the first process which finds the slot to be empty. But it does not matter if first process inserts an item in that slot. It shows that ABA problem is not an issue in this case.

CAS and ABA

CAS takes three arguments: the address of a memory location, a new value, and an expected value. If and only if the memory location holds the expected value, the new value is written to it, atomically. A Boolean return value indicates whether the write occurred:

CAS(address, new value, expected value)
    if(*address != expected value)
        return false;
    }
    else{
        *address = new value;
        return true;
    }

According to Maged Micheal in "Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects" (see "The ABA Problem"):

ABA problem occurs when a thread reads a value A from a shared location, and then other threads change the location to a different value, say B, and then back to A again. Later, when the original thread checks the location, e.g., using read or CAS, the comparison succeeds, and the thread erroneously proceeds under the assumption that the location has not changed since the thread read it earlier. As a result, the thread may corrupt the object or return a wrong result.

—SA

How you can build on this integer buffer to put strings? You can put individual characters one at a time into the integer buffer. The immediate question is how to separate a character of one string from a character of another string. A simple answer is to use unique Message Number for a string. The next question is how the consumer can link these characters to form the string again. This is tricky. A simple approach can be to put index of the character in the buffer along with message number and actual character. However, this approach has a serious drawback. It will limit the length of string which can be transferred through the buffer. The approach used to overcome this limitation is to find a free slot to put the first special character along with the relative offset of the second character in the lock-free buffer. This is followed by putting the second character at the announced index along with offset of the third character and so on. This allows unlimited size strings to be transferred through the buffer. See the TransferString function in Listing 1 for complete implementation.

volatile unsigned int msgCount=0;//Number of Unique Message Numbers In Use const int MAX_NEIGHBOURS=255;// consecutive characters max distance in buffer BOOL TransferString(char *str) { unsigned int msgNumber=0; int i=0; int data; int insertedAt=-1; int nextInsertAt=-1; //Following loop tries to get a unique message number to transfer string. do { msgNumber=msgCount; if(msgNumber+1>0xFFFF) { return FALSE; } } while(!CAS(&msgCount, msgNumber+1, msgNumber)); msgNumber=msgNumber+1;//Unique Message Number //pack the message number and start of string marker into data data=msgNumber; data=data<<16; data=data|255; //Following loop finds a free slot to transfer the start of string marker while(insertedAt==-1) insertedAt=FindFreeSlot(); //Following loop finds a free slot in close vicinity of the previous slot while(nextInsertAt==-1) nextInsertAt=FindFreeSlot(insertedAt,MAX_NEIGHBOURS); //pack the information about where the next character is expected. data=data|(nextInsertAt<<8); //Following loop inserts the packed data at previously decided index while(!InsertAt(data,insertedAt)); //Calculate the absolute location in the buffer for the next character insertedAt=(insertedAt+nextInsertAt)%MAX_SIZE; //Following loop transfers all the characters in the string except null character. for(i=0;str[i]!=0;i++) { //Pack and put the messange number the current character and the offset data=msgNumber; data=data<<16; data=data|str[i]; while(nextInsertAt==-1) nextInsertAt=FindFreeSlot(insertedAt,MAX_NEIGHBOURS); data=data|(nextInsertAt<<8); while(!InsertAt(data,insertedAt)); insertedAt=(insertedAt+nextInsertAt)%MAX_NEIGHBOURS; } //Pack the message number and character and offset 0 to communicate string end. data=msgNumber; data=data<<16; while(!InsertAt(data,insertedAt)); return TRUE; } //Find free slot at max "range" away from "startIndex " and return the offset. int FindFreeSlot(int startIndex,int range) { int i; //find free slot onwards from start index and return the relative offset if found for(i=startIndex+1;i<MAX_SIZE&&i<startIndex+range;i++) { if(buffer[i]==0) return i-startIndex; } if(startIndex+range>=MAX_SIZE) { //find free slot from 0 index onward but before startIndex and in range for(i=0;i<startIndex&&i<=(startIndex+range)%MAX_SIZE;i++) { if(buffer[i]==0) return i+(MAX_SIZE-startIndex); } } return -1; } //Find first free slot int FindFreeSlot() { int i; for(i=0;i<MAX_SIZE;i++) { if(buffer[i]==0) return i; } return -1; } //Maximum Number of threads in the system (this limitation is in demo version) const int THREAD_COUNT=4; //Maximum Number of characters in a string (this limitation is in demo version) const int MAX_STRING_SIZE=255; //tempBuffer array which holds the integers deleted from the integer buffer int tempBuffer[THREAD_COUNT*MAX_STRING_SIZE]={0}; //deleteIndexes array to hold index from which item was taken out of the integer buffer int deleteIndexes[THREAD_COUNT*MAX_STRING_SIZE]={0}; // tempBufferIndex points to next free slot in tempBuffer as well as deleteIndexed. int tempBufferIndex=0; // strings is a two dimensional array to hold string transferred by each thread char strings[THREAD_COUNT][MAX_STRING_SIZE]={0}; // stringIndexes [i] tells where to put the next character in strings[i] int stringIndexes[THREAD_COUNT]={0}; // nextAt[i] tells where the next character is expected to be inserted by thread i in the //integer buffer int nextAt[THREAD_COUNT]={0}; //This function tries to retrieve the strings transferred by many threads in an infinite loop void RetrieveString() { int data; int i=0; //Initialize the nextAt array to indicate no characters currently seen for(i=0;i<THREAD_COUNT;i++) nextAt[i]=-1; //Following lines of code deletes the first item from the buffer and stores it locally deleteIndexes[tempBufferIndex]=-1; while(deleteIndexes[tempBufferIndex]==-1) deleteIndexes[tempBufferIndex]=Delete(&data); tempBuffer[tempBufferIndex]=data; //Following loop goes on infinitely while(1) { deleteIndexes[tempBufferIndex+1]=-1; while(deleteIndexes[tempBufferIndex+1]==-1) { deleteIndexes[tempBufferIndex+1]=Delete(&data); //Try to rearrange already transferred characters RearrangeString(); } //put the deleted item from integer buffer into array local to consumer. tempBuffer[tempBufferIndex+1]=data; //advance the local buffer index tempBufferIndex++; } }

//This function tries to rearrange already transferred characters void RearrangeString() { int i; unsigned int msgNumber; int data; unsigned int charMask=255; int ch; for(i=0;i<=tempBufferIndex;i++) { data=tempBuffer[i]; if(data==0) continue; msgNumber=data>>16;//unpack the message number ch=data&charMask;//unpack the character if(ch==255) {//if it is start of string marker //unpack the next character offset. Calculate the absolute index at which next character is expected nextAt[msgNumber-1]=(deleteIndexes[i]+((data&0xFF00)>>8))%MAX_SIZE; //initialize the string index stringIndexes[msgNumber-1]=0; tempBuffer[i]=0; } else if(ch==0) {//if it is end of string marker //if it is the next character in the message if(nextAt[msgNumber-1]==deleteIndexes[i]) { //null terminate the string strings[msgNumber-1][stringIndexes[msgNumber-1]]=0; tempBuffer[i]=0;//remove item from tempBuffer printf("%s\n",strings[msgNumber-1]);//print it } } else { //if it is the next character in the message if(nextAt[msgNumber-1]==deleteIndexes[i]) { //form the string by storing the next character strings[msgNumber-1][stringIndexes[msgNumber-1]]=ch; stringIndexes[msgNumber-1]++;//increament index //unpack the next character offset and calculate absolute one nextAt[msgNumber-1]=(nextAt[msgNumber-1]+((data&0xFF00)>>8))%MAX_SIZE; tempBuffer[i]=0;//remove item from tempBuffer } } } }

Listing 1

This TransferString function implementation builds on integer buffer. This implementation assumes that integer size is 4 byte on the target platform. The 4 bytes of an integer item which is put into the buffer holds the following information (see Figure 1):

  • The lowest order byte call it byte number zero holds the character to be transferred.
  • The byte number 1 immediately next to byte number zero holds the relative offset of next character from the current one in the buffer.
  • Finally, the byte number 2 and 3 hold a 2-byte message number.

Figure 1: Composition of an item of the Integer buffer.

Again, there is a single consumer. This simplifies the problem because concurrency is not an issue. I am going to discuss a demo consumer implementation here. The consumer calls Delete function on the integer lock-free buffer in an infinite loop. The consumer remembers the index from which the item was taken out of the buffer as well as the integer data itself. In the infinite loop it calls RearrangeString function which loops through those remembered items and checks three conditions:

  1. Integer value contains the first special character to indicate start of string
  2. Integer value contains the last character to indicate the end of string
  3. Integer value is the next character of the string

In the first case, consumer calculates the index of the next character in the buffer and remembers it to check later on that an item with the same message number is really the next item in the sequence. In case 2 consumer prints out the string. In all three cases the consumer forgets this remembered item so that the same character is not repeated again. This approach works even in case when two nonconsecutive characters of the same string are inserted at the same index of the buffer because it is not possible to insert an item at an in use index. This means an item inserted first at the index is removed first. This result in RearrangeString function correctly linking the item inserted first at the same index. For a complete implementation of RearrangeString function, see Listing 1.

There are 0xFFFF possible message numbers which is very limiting for this implementation to be used for IPC in malware analysis tool. To make this implementation practical the same message number can be used again by the same thread in a process to communicate as many API calls as it makes. This implies that the resulting implementation can have 0xFFFF threads communicating API calls simultaneously. In other words, this lock-free buffer can tolerate failure of 0xFFFF threads. The system is not in dead-lock even if all message numbers are consumed. The new threads can continue their execution with out logging the API calls. No requirement of memory management is another advantage of this implementation. Although the demo implementation of consumer assumes maximum number of threads at compile time, however this technique has no such limitation. There can be any number of threads in the system.

Lock-freedom has the potential to make a difference. It invites you to think about new ideas and change the status-co in lock-free paradigm. You can start by thinking about ways to avoid the limitation on number of thread failures this implementation of lock-free buffer tolerates.

Acknowledgment

Thanks to Shafiq-ur-Rehman and Bilal Hashmi for guiding me on concurrency and lock-freedom related issues.

References

[1]M. Herlihy, "Wait-free synchronization," ACM Transactions on Programming Languages and Systems, vol. 11, no. 1, pp. 124"149, Jan. 1991.

[2]FireEye Anti-Malware and Anti-Botnet Security tool is an example. (http://www.fireeye.com/products/index.html).

[3]Details of API monitoring are out of scope of this article. See Detours for an example. (hhttp://research.microsoft.com/en-us/projects/detours/)

[4]H. Sutter. "Lock-Free Code: A False Sense of Security" (Dr. Dobb's, September 2008). (www.ddj.com/cpp/210600279)

[5]Michael, M.M. "Hazard pointers: Safe memory reclamation for lock-free objects," IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 8, Aug. 2004.

Looking For The Lost Packets: Part 2
Techniques for debugging multicore packet-processing systems
Looking For The Lost Packets: Part 1
Techniques for debugging multicore packet-processing systems
DSP Meets Wireless Communications
Parallel Pattern 4: Gather

Real World Parallelism Webinar Series
  • February 18, 2010
    Lock Contention, Using Intel Parallel Studio to Improve Performance
    Speaker: Vasanth Tovinkere, Software Engineer, Intel Corporation (Bio)

    Vasanth Tovinkere is a software engineer in the Developer Products Division (DPD) at Intel. His current role involves defining novel approaches to understanding and visualizing parallel performance and consulting with strategic customers to help them prepare and deliver code for the multicore world. Vasanth has been involved in the development of automatic semantic event detectors for digital sports technologies in Intel Labs. He also has been awarded three patents and has two patents pending.

    Abstract:
    Discover how easy it is to use the power of Microsoft Visual Studio and Intel Parallel Studio to find performance issues due to lock contention in threaded applications. This ensures that shipped applications can take better advantage of multicore processors. In this webcast, we provide live demonstrations that show how to identify lock contentions issues with Visual Studio and Intel Parallel Studio, an add-in to Visual Studio that helps developers create fast, reliable code on multicore processors.t.