Channels ▼
RSS

Parallel Pattern 5: Stencil



Structured Patterns: An Overview

Parallel Pattern 1: Superscalar Sequences and Task Graphs

Parallel Pattern 2: Selection

Parallel Pattern 3: Map

Parallel Pattern 4: Gather

Parallel Pattern 5: Stencil

Parallel Pattern 6: Partition

Parallel Pattern 7: Reduce

Parallel Pattern 8: Scan

Parallel Pattern 9: Pack


A Gather pattern takes a collection of memory addresses (or equivalently, array indices) and reads them in parallel, resulting in an array of results. Gather is a very general pattern, but it is useful to look at some special cases, since the properties of these special cases can be exploited to improve performance. In this article, I discuss one such special case -- the Stencil pattern. A stencil is a gather in which all memory addresses used for reads are expressed as offsets relative to the location of a collection of outputs. Suppose we have an input array A, an output array B, and a set of integer offsets P. Then for every index k in the output array, we read from every input in A given by k + p for every p in P. Note that the index values in P may be multidimensional if the inputs and outputs are multidimensional. A stencil accesses a regular neighborhood of the input for each output.

For example, suppose A and B are two-dimensional and P is the set {(0,0),(-1,0),(1,0),(0,-1),(0,1)}. The for output location (2,2) in B, we access A[2,2], A[1,2], A[3,2], A[2,1], and A[2,3]. For other outputs in B, the stencil is shifted appropriately. Note that the stencil does not specify what operation is performed on the inputs to generate the outputs. Also, it is possible for the stencil to generate reads from A outside its valid index range. A full specification of a stencil pattern has to include how these boundary cases are handled. This can be done by using default values in place of reads from A, or by modifying the indicies so they fall in the valid inputs of A (for instance, by clamping them to a valid range, by wrapping them, or by reflecting them).

The Stencil pattern appears in a number of applications, but most notably in image processing and simulation. A wide variety of linear and non-linear image processing operations are specified using stencils, including linear convolution and non-linear noise reduction. Explicit solutions to partial differential equations, used in simulation and seismic reconstruction as well as in image processing, use iterative applications of stencil operations.

Efficient implementation of the Stencil pattern depends on a number of optimizations. A good direct implementation in terms of low-level operations is surprisingly complex, but can significantly outperform a naive implementation.

The first optimization is boundary handling. There is usually a large region of the output for which no offset input reads from an invalid location in the input. As this is the most common case (since the number of samples in the interior grows much faster than the number on the boundary), we should include an optimized implementation for the interior that avoid unnecessary checking for out-of-bounds accesses into the input. Even along the boundaries, it is better to generate a number of special-case passes rather than check for boundary handling inside the inner loop. There may be a large number of such cases based on the size of the stencil and the dimensionality of the data.

Second, in the interior we want to use sliding windows over a strip in order to exploit spatial coherency. Adjacent outputs from the stencil typically share a number of inputs. By assigning a "strip" of the input volume to a single core and then working along this strip, we can cache previous reads from within the strip in on-chip memory and avoid reading them again, thereby saving off-chip main memory bandwidth. This reduces the amount of available parallelism, so there is a trade-off here between parallelism and memory performance. However, with multidimensional stencil processing usually there is more than enough parallelism, so some can be sacrificed for more efficient use of memory (and less contention between cores for this typically limited resource). Adjacent strips may still make redundant memory reads. By making the strips relatively broad (again, resulting in a tradeoff between parallelism and memory performance) this "halo" effect can be reduced. A decision also needs to be made about the orientation of the strips. The optimal orientation can be depend on the layout of the data in memory as well as the shape of the stencil.

Finally, we want to vectorize the computation on processors with vector units. If the input consists of scalar data and is laid out contigously, stencil offsets tend to cause memory accesses that are not conveniently aligned with vectors, so swizzling (vector element rearranging) operations need to be inserted. Some of these can be vectorized and combined if the stencil and the strip processing can be treated as a whole and if the processor supports suitable whole-vector swizzling operations. Vectorization also interacts with strip processing and sliding windows. Normally we want to choose a strip width (or length, depending on the orientation of the strip) that is a multiple of the vector width. Both strip processing and vectorization lead to efficient processing of data that is a multiple of the strip width and the vector width. However, in real data the sizes may not be a convenient multiple of these widths, so usually some extra processing (without vectorization) needs to be done along an "outside" strip.

Hand implementing the above process using a low-level interface, needless to say, results in some very ugly and hard-to-maintain code. Also, some of the above optimizations are hardware-dependent (especially when it comes to proper vectorization). However, the basic "declarative" specification of the stencil is very simple and easy to understand. Therefore, doing the above optimizations for the Stencil pattern is a prime candidate for automation. It should be noted that built-in support for Stencil patterns is one of the key features of the 4.0 release of the Intel RapidMind platform. With this release, it's possible for developers to simply give the offsets of the stencil in a portable fashion and the platform will automatically create a suitable low-level implementation.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video