Parallel Pattern 9: Pack

The pack pattern takes as input a collection and a Boolean condition (or flag) for each element of that collection. It discards the elements of the input collection for which the condition is false and then packs together the surviving elements into a dense output collection, eliminating the gaps. The output of a pack may be empty, so a system supporting pack needs to be able to represent empty collections.
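In a serial setting pack is trivial; here is a minimal Python sketch (the function name `pack` is mine, not a standard API), showing that the output preserves input order and may be empty:

```python
def pack(values, flags):
    """Keep values[i] where flags[i] is true, preserving order."""
    return [v for v, f in zip(values, flags) if f]

print(pack(["a", "b", "c", "d"], [True, False, True, False]))  # ['a', 'c']
print(pack(["a", "b"], [False, False]))                        # []
```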

Frequently pack is combined with the "map" pattern. The map pattern applies a function to every element of a collection. Pack can be implemented as an "output mode" for a map, where a flag is used to retain or discard each output of the map. Pack can also be implemented as a separate operation, but fusing it with a map (either implicitly or explicitly) is useful because it avoids spending memory bandwidth on output elements that will be discarded. A diagram of the pack operation (fused with the output of a map) is shown in Figure 1:

Figure 1: The pack pattern. A Boolean condition indicates whether each element of a collection is to be retained or discarded. The retained elements are compressed into a continuous sequence without gaps, in the same order as the original collection.
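A fused map and pack might look like the following Python sketch (the names `map_pack` and `keep` are illustrative). The point is that discarded results are never written to the output at all:

```python
def map_pack(f, keep, xs):
    """Fused map + pack: compute f(x) and retain it only if keep(x).

    Fusing avoids materializing the discarded map outputs; here the
    predicate tests the input, but it could equally test the mapped value.
    """
    out = []
    for x in xs:
        if keep(x):
            out.append(f(x))
    return out

# Square only the even inputs; odd inputs produce no output.
print(map_pack(lambda x: x * x, lambda x: x % 2 == 0, [1, 2, 3, 4]))  # [4, 16]
```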

As an example of where this pattern is useful, consider the problem of collision detection. In collision detection we have to consider a large number of objects, a small fraction of which might intersect each other. The goal is to produce a list containing all the intersecting pairs of objects. Typically, we divide solutions to this problem into two phases, a broad phase and a narrow phase. In the broad phase, we generate a collection of potentially colliding pairs of objects. In the narrow phase, we test each potentially colliding pair to see if they actually do collide. Pack is useful in the narrow phase, where we apply an intersection operation to each potentially colliding pair. If the result is true, we output the pair, and if not, we discard it. Frequently there are many more potentially colliding pairs than true collisions. In this case, fusing the pack with the map performing the intersection tests is a major optimization, since it can save significant memory bandwidth compared to using separate operations. Collision detection is one form of set intersection, which in its more abstract form is also used for performing database joins and text searching.
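A sketch of the narrow phase in Python (the 1-D interval data and the `overlap` test are toy placeholders, not a real collision system): the exact intersection test is mapped over the candidate pairs, and the true collisions are packed into a dense list.

```python
def narrow_phase(candidate_pairs, intersects):
    """Map an exact intersection test over candidate pairs, then pack
    the pairs that actually collide into a dense output list."""
    return [pair for pair in candidate_pairs if intersects(*pair)]

# Toy example: 1-D intervals; two intervals collide if they overlap.
intervals = {0: (0, 2), 1: (1, 3), 2: (5, 6)}

def overlap(a, b):
    (a0, a1), (b0, b1) = intervals[a], intervals[b]
    return a0 < b1 and b0 < a1

print(narrow_phase([(0, 1), (0, 2), (1, 2)], overlap))  # [(0, 1)]
```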

Pack can actually be implemented with a scan (a pattern which computes all the partial sums of a collection) combined with a scatter (scatters are random writes, a very general but potentially unsafe pattern to be discussed later). Suppose we generate a collection F of flags, marking each position to be retained with a 1 and each position to be discarded with a 0:

```
F = [1,0,1,0,1,1,0,1,0]

```

We then compute the prefix-scan of this collection using an addition operation that treats the elements as integers rather than Boolean values. Note that we use the "prefix" (exclusive) version of the scan operation, where each output position is the sum of all values up to but not including the current value:

```
prefix(F) = [0,1,1,2,2,3,4,4,5]
```

Now we do a conditional scatter where we write every "marked" element into the address given by the corresponding position in the prefix-scan. Here is the input and the prefix-scan output again, but with the marked positions highlighted:

```
F         = [*1*, 0, *1*, 0, *1*, *1*, 0, *1*, 0]
prefix(F) = [*0*, 1, *1*, 2, *2*, *3*, 4, *4*, 5]
```
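Putting the two steps together, here is a Python sketch of pack via an exclusive scan followed by a conditional scatter (serial here, but both steps parallelize; the function names are mine):

```python
def exclusive_scan(flags):
    """Exclusive (prefix) scan: out[i] = sum of flags[0..i-1]."""
    out, total = [], 0
    for f in flags:
        out.append(total)
        total += f
    return out

def pack_scan_scatter(values, flags):
    """Pack via exclusive scan + conditional scatter: each marked
    element is written to the address given by the scan."""
    addr = exclusive_scan(flags)
    out = [None] * sum(flags)          # output size = number of 1 flags
    for i, f in enumerate(flags):
        if f:
            out[addr[i]] = values[i]   # conditional scatter
    return out

F = [1, 0, 1, 0, 1, 1, 0, 1, 0]
print(exclusive_scan(F))                  # [0, 1, 1, 2, 2, 3, 4, 4, 5]
print(pack_scan_scatter("abcdefghi", F))  # ['a', 'c', 'e', 'f', 'h']
```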

Pack is therefore not a "fundamental" pattern, since it can be implemented in terms of scan and scatter. However, it is still useful to provide pack as an explicit operation in a programming model, for two reasons. First, pack is deterministic, while a general scatter is not. Second, pack is more coherent than a general scatter, and there are various ways to optimize its implementation that do not directly use scan and scatter. For instance, we can break the input into blocks, do a serial pack (which is trivial) on each block, and then do a scan on the block counts to determine where to move each packed block in the final result. This results in larger and more coherent data movements than in a pack based directly on per-element scan and scatter. In fact, if we are really clever, we (or the compiler or the programming platform) may be able to optimize out the final block-scatter, and/or replace it with a gather into a following operation.
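The block-based optimization described above can be sketched in Python as follows (serial here; in a real implementation each block would be packed in parallel and each block move would be a wide, coherent copy):

```python
def blocked_pack(values, flags, block_size):
    """Blocked pack: serial pack within each block, then a scan over
    the block counts gives each packed block's output offset."""
    blocks, counts = [], []
    for start in range(0, len(values), block_size):
        packed = [v for v, f in zip(values[start:start + block_size],
                                    flags[start:start + block_size]) if f]
        blocks.append(packed)
        counts.append(len(packed))
    # Running (exclusive) scan of counts determines each block's offset.
    out = [None] * sum(counts)
    offset = 0
    for packed, n in zip(blocks, counts):
        out[offset:offset + n] = packed   # coherent block move
        offset += n
    return out

F = [1, 0, 1, 0, 1, 1, 0, 1, 0]
print(blocked_pack("abcdefghi", F, 3))  # ['a', 'c', 'e', 'f', 'h']
```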

There are at least two useful variants of the pack pattern. First, we do not have to discard the elements for which the Boolean condition fails. Instead, we can pack them into the upper portion of the output collection, after the retained elements. This "split" pattern is shown in Figure 2.

Figure 2: The split pattern. This is a variant of pack which does not discard elements, but instead packs them to the top or bottom of the output collection. The input order is otherwise retained.

In the split pattern, the order of the elements within each output segment is the same as their relative order in the input (in other words, it is a "stable" reordering), and the output is always exactly the same size as the input. Split can obviously be implemented by running pack twice and merging the results. However, a direct implementation can be more efficient. Like pack, this pattern is deterministic, and the implementation can also exploit its relatively high spatial coherence.
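A Python sketch of a stable split (expressed here simply as two packs concatenated; a direct parallel implementation would compute both sets of destination addresses from a single scan):

```python
def split(values, flags):
    """Stable split: true-flagged elements first, then false-flagged,
    each group keeping its original relative order."""
    kept = [v for v, f in zip(values, flags) if f]
    rest = [v for v, f in zip(values, flags) if not f]
    return kept + rest

print(split([1, 2, 3, 4, 5], [True, False, True, False, True]))
# [1, 3, 5, 2, 4]
```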

The split pattern can be used directly to implement a binary radix sort. The input data is split repeatedly based on the bits of the sort key, starting with the lowest-order bit. Because split is stable, this results in an ordered sequence after n passes, where n is the number of bits in the key. Higher-radix versions of split are possible; for instance, we might split into k different categories. This can be used to implement quicksort or radix sorts with fewer passes.
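A Python sketch of a binary least-significant-digit radix sort built from stable split passes (one pass per key bit):

```python
def radix_sort(keys, bits):
    """Binary LSD radix sort: one stable split per key bit, starting
    with the lowest-order bit. Stability makes each pass preserve the
    ordering established by the previous passes."""
    for b in range(bits):
        zeros = [k for k in keys if not (k >> b) & 1]
        ones = [k for k in keys if (k >> b) & 1]
        keys = zeros + ones   # stable split on bit b
    return keys

print(radix_sort([5, 3, 7, 0, 2, 6], 3))  # [0, 2, 3, 5, 6, 7]
```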

Interestingly, split and pack can be used to efficiently emulate SPMD control flow on SIMD processors (sometimes called "fibering") by "packet routing". Basically, the flowchart of the control flow of the function used in a map operation is interpreted as a dataflow graph. This also requires inverse pack and split operations that can restore data to its original position. However, this technique is complicated enough that I will save discussing it for a future post, where I will also discuss its combination with masking and tiling, another (more common) technique for implementing fibering.

There is another possible generalization of pack when it is used as an output mode for a map, the "expand" pattern. In the expand pattern each computational element of a map can output 0 or more values, not just 0 or 1 as in pack. This is shown in Figure 3.

Figure 3: The expand pattern. The expand pattern is a variant of pack in which every output position can contain a variable number of elements (including zero). The elements are packed together into a single sequence, retaining both the overall order and the order within each subgroup.

In the expand pattern, the output elements appear in the output collection in the order in which they were "emitted". Like pack, the output of an expand operation can be empty.

The expand pattern is actually used in "geometry shaders" in modern graphics APIs, where it can be applied to surface subdivision and procedural geometry generation. For example, suppose we want to implement displacement mapping. For every input triangle on an input surface we might want to generate a variable number of output triangles to represent the displaced surface. We might also want to cull parts of the surface that we know are not visible and output no triangles at all. This pattern can also occur in collision detection in the broad phase, where we may generate several potentially colliding pairs for each object in the input. It might also be used for decompression. Generally, expand can be used for "data amplification", where a small amount of input might expand into a large amount of output.
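A serial Python sketch of expand: each input element emits a variable-length (possibly empty) group of outputs, and the groups are packed into one dense sequence in emission order.

```python
def expand(f, xs):
    """Expand: each element of xs emits zero or more outputs via f;
    the emitted groups are concatenated, keeping both the overall
    order and the order within each group."""
    out = []
    for x in xs:
        out.extend(f(x))
    return out

# Each value n expands into n copies of itself (0 emits nothing).
print(expand(lambda n: [n] * n, [2, 0, 3, 1]))  # [2, 2, 3, 3, 3, 1]
```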

Expand can be challenging to implement since we do not know the total size of the output in advance. On the one hand, this provides a form of dynamic memory allocation, which can be useful. On the other hand, it means we cannot preallocate storage for the output. One solution is to put a bound on the total number of elements that can be generated from each position. This is done in implementations of geometry shaders, for example. While convenient for the programming system (or hardware) implementer, such a restriction can be troublesome for the application developer, who may not be able to determine such a bound in some applications.

Finally, for both split and expand, it may be reasonable to represent the results of these operations with nested collections. A nested collection stores a collection of variable-sized collections. In one dimension, we might have a "segmented array" that keeps track (typically using a separate boundary marking array internally) of boundaries between segments. The output of the split operation in Figure 2 would have two segments, one of length 5 and one of length 4, while the output of the expand operation in Figure 3 would have 4 segments of length 4, 1, 2, and 3.
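A minimal Python sketch of such a segmented representation: a flat data array plus an array of segment lengths (the data values below are placeholders, chosen only to match the segment lengths 4, 1, 2, and 3 mentioned above).

```python
def to_segmented(groups):
    """Represent a nested collection as a flat data array plus segment
    lengths; an exclusive scan of the lengths recovers segment starts."""
    data = [x for g in groups for x in g]
    lengths = [len(g) for g in groups]
    return data, lengths

# The expand output of Figure 3 as a segmented array (placeholder values):
data, lengths = to_segmented([[1, 2, 3, 4], [5], [6, 7], [8, 9, 10]])
print(data)     # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(lengths)  # [4, 1, 2, 3]
```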

To conclude, I have presented the "pack" pattern and two variants, "split" and "expand", that are useful for algorithm steps that have variable-rate output. Pack can be used to filter data as in collision detection, while split can be used to divide data into categories. Expand is a generalization of pack that supports any amount of output. This makes it challenging to implement but also very powerful, as it can be used for "data amplification" and also supports a safe form of dynamic memory allocation. All of these operations are often fused with "map". These patterns are special but common cases of scatter. However, unlike scatter, they are safe and deterministic.
