The FifoEmbed Library

By Dan Muresan, October 01, 2004

When programming device drivers, network stacks, and similar systems, you often need a queue and storage management. FifoEmbed is a library that provides facilities for thread-safe basic queues, packet queues, and FIFO allocators.

October, 2004: The FifoEmbed Library

Dan Muresan is an engineer in the R&D group of Topex Public Switching in Bucharest, Romania. He can be contacted at muresan stanfordalumni.org.

When programming device drivers and network stacks, you usually need a queue and some form of storage management. In real-time systems, you'd like your queue and storage manager primitives to run in deterministic time. FifoEmbed, the C library I present here, can help. The complete source code for FifoEmbed is available at http://www.cuj.com/code/.

FifoEmbed provides three abstract data types (classes), all of which are implemented on top of circular arrays:

A basic queue that supports enqueue/dequeue operations, as well as something that I call "direct block access."
A packet queue implemented on top of the basic queue that additionally provides out-of-sequence insertion and packet resequencing.
A memory pool queue ("mpoolq") is an allocator optimized for first-in/first-out access patterns.

The three classes offer constant-time primitives and are inherently thread-safe, requiring (almost) no locks when used in a producer/consumer framework.

FifoEmbed is designed to be used concurrently by one producer and one consumer thread. A typical scenario is a TCP/IP or RTP (real-time protocol) implementation. The producer receives packets from the network, allocates storage out of an mpoolq, and inserts pointers in a packet queue. Packets may arrive out of order—the packet queue will place them into their proper slots. The consumer dequeues pointers from the packet queue, processes the data, and frees up storage.

All classes store data in circular arrays and use two pointers—a head and a tail—to keep track of where the actual queue resides. The two pointers start out at the beginning of the array and increment until they reach the array boundary, at which point they wrap around to the beginning. When the producer enqueues an element, the tail increments; when the consumer dequeues an element, the head increments. The queue is empty when the head and tail coincide, and full when the tail is one slot behind the head (the dummy slot convention). The slots between the head pointer and tail pointer form the active window (the actual queue), while the rest of the array is called the "inactive part;" see Figure 1.

For flexibility, I have defined the classes as pseudotemplates, basing the underlying arrays on any previously defined type. For example, the mpoolq is defined in the mpoolq.h header file as:

typedef struct {
  MPOOLQ_CELL_T *base, *end,
                *head, *tail;
  int size;  // == end - base
} MPOOLQ_T;

Users must define MPOOLQ_CELL_T (and optionally MPOOLQ_T) before including mpoolq.h:

#define MPOOLQ_CELL_T int
#include "mpoolq.h"

In real-time systems, lock-free algorithms are desirable because they avoid the problem of priority inversion. I designed FifoEmbed around the venerable single-reader/single-writer lock-free ring buffer algorithm, which is used in many software packages (audio drivers in particular). Locks are eliminated by partitioning primitives and class members into three categories and enforcing the following restrictions:

Producer primitives only modify the tail and the inactive part of the array.
Consumer primitives only modify the head and the active window.
Thread-unsafe primitives follow no rules and must be called from a critical section.

Access to the head and tail pointers is routed via a special union type that prevents the compiler from optimizing inappropriately (cache writes in registers or reorder stores):

union {
  volatile T * volatile w;
  volatile T * volatile const r;
  volatile T * const l;
} head, tail;

Here, T is the appropriate cell type. The syntax means that for the r and w fields, both the address and the pointed cells are volatile, while for l, only the pointed cells are volatile. Primitives always write via the w field and read via:

The l field for a pointer that they own (that is, the producer reading the tail); this lets the compiler cache the pointer between consecutive reads.
The r field for a pointer that they do not own.

On the hardware side, the lock-free implementation requires CPU/compiler support for sequentially coherent, atomic access to pointers and array slots. In particular, multiprocessor systems may present an inconsistent view of the memory in the absence of explicit synchronization instructions (memory barriers); for such targets, FifoEmbed requires extra porting effort.

In simple systems, in which both the consumer and producer are polled alternately from an endless main loop, there might be no need for synchronization. In most systems, however, the consumer and producer must sleep when the queue is empty/full and signal each other when changes occur. Listing 5, which presents the solution implemented in pthreads_framework.c, uses:

A POSIX semaphore, q_count, to signal incoming data to the consumer.
A flag (more_space), a POSIX condition variable (more_space_cv), and a mutex (more_space_lock), to signal the availability of space to the producer.

Listing 5 uses the basic queue. The pthread_cond_wait() operation takes a mutex and a condition variable and atomically unlocks the mutex and waits for the condition variable to be signaled; right before pthread_cond_wait() exits, it locks the mutex again [1].

Implementation

FifoEmbed is implemented in four C header files (see Table 1). Most primitives are defined as both macros (Listing 1) and inline functions; the inline function is often a trivial stub that references the corresponding macro. The partial listings show only one of the versions.

Function and macro names are prefixed with the name of the pertinent class (mpoolq_alloc() and pktq_ins_at(), for instance); in discussion, I often omit the prefix when the class is clear from context. I also use the following suffixes frequently:

_chk designates a primitive that checks and reports error conditions via its return code.
_ds stands for the dummy slot convention.

Instances of all FifoEmbed classes are initialized by passing the base address and the size of the underlying array to the appropriate init() primitive; you must allocate the array beforehand.

The size of the active window should be checked using the:

full(), avail(), and lavail() primitives from the producer thread. avail() counts the size of the inactive part of the array (excluding the dummy slot); lavail() counts the number of contiguous slots available starting at the tail pointer, without wraparound and without crossing the dummy slot (the mpoolq version is a little different).
empty(), used(), and lused() primitives from the consumer thread. used() counts the size of the active window, while lused() counts the number of contiguous slots available starting at the head pointer (without wraparound).

To minimize code duplication, all primitives related to circular arrays are defined as C macros in one header file, circ_arr.h. The macros fall into three categories:

Macros related to moving pointers around the circular array (increment-and-wrap, decrement-and-wrap, and so on).
Counting macros determine whether the circular array is full or empty (FULL(), EMPTY()), count how many slots are used (USED(), LUSED()), and determine how many are available (DS_AVAIL(), DS_LAVAIL()). Counting primitives in the three classes are all based on these macros.
Special versions of the aforementioned macros for the case when the size of the array is a power of two. These versions usually translate to more efficient code, but waste memory by forcing the developer to round up queue sizes to the next power of two. The basic queue can be made to prefer these macros by defining FIFO_P2_SIZE; other classes will not use them.

Basic Queues

Basic queues (Listing 2) support two access models: copying access and direct access. With copying access, elements are copied into and out of the queue using the ins() and read() primitive families.

With direct access, the producer and/or consumer access data in the circular array directly, and call the swallow() and drop() primitives to commit the changes when done. For example, here's how a producer can insert 10 elements into a queue q using direct access:

Call lavail() to check that 10 contiguous slots are indeed available.
Write to the first 10 slots of the inactive part of the array, q.tail [0] to q.tail [9].
Call fifo_swallow (&q, 10), which moves the tail pointer up by 10 positions, incorporating the new elements into the queue.

Similarly, to sum up the oldest 10 elements from q using direct access, the consumer would:

Call lused() to ensure that the 10 elements are available as a contiguous block.
Compute q.head [0] + ... + q.head [9].
Call fifo_drop (&q, 10), which increments the head pointer by 10, eliminating the 10 oldest slots from the queue.

An application using fixed-size blocks exclusively should make the queue size a multiple of the block size to ensure that all blocks can be enqueued and/or dequeued with direct access.

The direct-access primitives facilitate zero-copy networking implementations.

Packet Queues

Packet queues (Listing 3) are implemented as a basic queue holding pointers to packets; there is no special data structure. Basic queue primitives may still work, but there are also packet queue-specific primitives (their names start with the pktq_ prefix).

The packet queue can accommodate empty slots, marked by NULL pointers. This makes out-of-sequence insertion possible: A new primitive, ins_oo(), takes both an offset argument and a value argument. When packets arrive in sequence, the caller should supply an offset of 1. When early packets arrive, the offset should be two or more. When late packets arrive, the offset should be a negative number (see Table 2 and Figure 2).

The rules for converting sequence numbers to offsets follow from the way the insertion primitives modify the data structure. Insertion with a positive offset o amounts to inserting o-1 empty slots before performing the actual insertion (the tail pointer increments by o slots overall). The insertion primitives, striving to run in constant time, do not actually clear the o-1 newly incorporated slots, but rely on pktq_init() and pktq_read() to keep slots in the inactive part of the array empty at all times.

Insertion with a negative offset o does not change the tail pointer, but overwrites a slot within the active window (the slot o positions behind the tail). Such late insertions are not thread-safe: If they are ever used, you must protect all late insertions and all pktq_read() invocations with critical sections.

The ins_oo() primitive may fail if the target slot is already occupied (the current packet is a duplicate) or if the target slot is outside the active window (the offset is too large or too small). The offset is too small when the packet has arrived too late (its target slot has already been passed up by the consumer). The offset is too large when the insertion would force the tail pointer to cross the dummy slot. In all cases, supplied offsets must be less than the size of the array; the application may need extra logic to handle large gaps in the packet sequence.

Since ins_oo() may fail, applications may end up transferring unusable packets. To avoid this problem, there is a two-step alternative to ins_oo(): ins_valid() returns a pointer to the target slot if the insertion would succeed and NULL otherwise; ins_at() writes to a target slot directly. If ins_valid() fails, the packet can be skipped; if ins_valid() succeeds, the application can transfer the packet and call ins_at(). The two steps must be enclosed in a critical section.

Consumers can call either fifo_read() or, if out-of-sequence insertions are ever used, pktq_read(). The pktq version skips empty slots; thus, pktq_read() runs in linear time with respect to the largest number of contiguous empty slots in the active window. It also fills serviced slots with NULL.

Memory Pool Queues

A memory pool queue (mpoolq, Listing 4) is similar to a regular queue; the difference is that its elements are not slots, but contiguous blocks of slots from the circular array. The circular array is called the "pool;" its slots are called "cells." The user-defined cell type, MPOOLQ_ CELL_T, determines alignment, as well as the maximum possible size of a block. Blocks are aligned on a cell boundary and cover an integer number of cells. Size arguments and return values for the mpoolq primitives are specified in user units instead of cells, unless otherwise implied by the primitive's name; a user unit defaults to a byte (char) but can be redefined if required (MPOOLQ_USER_T).

The first cell of each block is a header storing the length of the block. Each block can be either in use or free (a hole). To distinguish between the two kinds, the header contains a positive number (counted in user units, excluding the header itself) for allocated blocks and a negative number (counted in cells, including the header) for holes; see Figure 3 (where cells are equal to user units for simplicity). MPOOLQ_CELL_T must be a signed type. The maximum size of a block is:

2 ^ (8 * sizeof (MPOOLQ_CELL_T) - 1)

The mpoolq combines the roles of a queue and a storage manager. The producer allocates consecutive blocks out of the pool by calling alloc() (the blocks are automatically enqueued). The consumer dequeues blocks with the help of two primitives: oldest() returns the address of the oldest block in the queue, while free() marks blocks as free and possibly dequeues them. If the sequence of free() calls matches the original sequence of alloc() calls, blocks are automatically dequeued. If blocks are freed out of sequence, dequeuing is delayed. When many recent blocks have been freed but an old block is still in use, the pool may appear full.

On the producer side, the alloc() primitive tries to return blocks starting at the tail pointer; if there isn't enough space there, the allocator tries to leave a hole at the end of the pool and wrap around (Figure 4). In both cases, the resulting block must not cross either the dummy slot or the array boundary.

This algorithm works well only when the requested size is less than half the size of the pool. A request for more may fail even when the pool is empty: If the head and the tail pointer are both at the half-point of the pool, the block cannot be allocated starting at the tail pointer (it would cross outside the pool); neither can the block be allocated at the base of the pool after creating a hole (it would cross the dummy slot). Nothing can be done, short of reinitializing the mpoolq to reset the head and tail pointers.

The mpoolq version of the lavail() primitive, lavail_cells(), returns the maximum number of cells available for allocation without creating a hole at the end of the pool—unless that number is zero. When the tail pointer is exactly one slot behind the end of the array, no block can be allocated without wraparound (the single slot is wasted on the header); rather than return zero, lavail_cells() forces the tail to wrap around by creating a hole of length zero and tries again. The result may still be zero if the pool is full.

On the consumer side, the free() primitive turns a used block into a hole by modifying its header; afterwards, it calls next_blk(), a primitive that sweeps away holes at the beginning of the queue (by moving the head pointer to the beginning of the first allocated block, if any). The other consumer-side primitives (oldest(), used_cells(), and empty()), call next_blk() before examining the active window. Thus, they return accurate results regardless of the structure of the active window.

In specialized applications, an mpoolq is better than a classic heap managed by malloc() and free() because of its deterministic behavior. When accesses occur in a FIFO pattern, alloc(), free(), and all the other primitives run in constant time with respect to the pool size and the number of allocated blocks. When blocks are freed out of order, only consumer-side primitives are impacted (they run in linear time with respect to the largest deviation from FIFO order). No fragmentation occurs after a large number of blocks have been allocated and freed, so performance does not degrade over time. Additionally, mpoolq.h exposes functions (not included in the partial listing) that query the size of a block, free parts of a block, and shrink the most recently allocated block.

References

[1] The Open Group, Single Unix Specification (http://www.opengroup.org/products/publications/catalog/un.htm).

1 2 3 4 5 6 7 8 9 10 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

The FifoEmbed Library

Implementation

Basic Queues

Packet Queues

Memory Pool Queues

References

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

The FifoEmbed Library

Implementation

Basic Queues

Packet Queues

Memory Pool Queues

References

Related Reading

News

Commentary

Slideshow

Video

Most Popular

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content