Faster Image Processing with OpenMP


Dr. Dobb's Journal, March 2004

Multithreading, multiprocessors, and higher performance

Henry and Bill work at the Intel Parallel Applications Center. They can be reached at [email protected] and bill.magro@intel.com, respectively.




Processing digital images typically involves several filtering steps, some of which are time-consuming. In this article, we use OpenMP-based tools to show how you can use multithreading to improve filter performance on multiprocessor systems and/or processors that support Hyper-Threading Technology. (Hyper-Threading is an Intel-developed technology that provides thread-level parallelism within a single physical processor, allowing multiple application threads to run simultaneously on that processor; see http://www.intel.com/technology/hyperthread/.)

The first step involves selecting a suitable filter. What constitutes a suitable filter? The filter should be compute intensive. There's no point wasting time threading a computation with instant response time. The potential speedup should merit the cost of thread creation in terms of both system and programmer time.

An important and time-consuming operation is the "radial blur," in which each pixel is replaced by the average of neighboring pixels along an arc. In digital image processing, blurring is used to convey motion, soften edges, give the illusion of distance, or focus the viewer's attention on nonblurred image regions. Radial blur conveys a rocking motion or movement around a point in the image. In Figure 1, the focus is on the lighthouse, which is also the blur center. The blur algorithm achieves this effect by averaging the pixels on an arc intersecting the target pixel (Figure 2). Calculating the positions of the neighboring pixels involves a fair amount of trigonometric computations because the pixels lie on an arc. Consequently, this algorithm is computationally expensive, and the amount of work per pixel increases as the blur angle increases.
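Listing One shows the actual plug-in code; as a quick orientation, here is a minimal, self-contained sketch of the per-pixel arc average for one 8-bit channel. The function and variable names are hypothetical and the row layout is simplified (no padding), so this is not the plug-in's code:

#include <math.h>
#include <stdint.h>

/* Average the source pixels lying on an arc through (px,py), centered on the
   blur center (cx,cy) and spanning blurAngle radians (nSteps > 1 samples). */
static uint8_t ArcAverage(const uint8_t *src, int width, int height,
                          int px, int py, float cx, float cy,
                          float blurAngle, int nSteps)
{
   float dx = (float)px - cx, dy = (float)py - cy;
   float theta  = blurAngle / (float)(nSteps - 1);   /* angular increment */
   float offset = blurAngle / 2.0f;                  /* center the sweep  */
   unsigned sum = 0, count = 0;
   int i;

   for (i = 0; i < nSteps; i++)
   {
      float a  = theta * i - offset;
      int   xx = (int)(cx + dx * cosf(a) - dy * sinf(a));
      int   yy = (int)(cy + dx * sinf(a) + dy * cosf(a));
      if (xx < 0 || xx >= width || yy < 0 || yy >= height)
         continue;                                   /* sample fell off the image */
      sum += src[yy * width + xx];
      count++;
   }
   return count ? (uint8_t)(sum / count) : src[py * width + px];
}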

Selecting a Threading Method

Having selected a suitable filter for multithreading, which threading method should you use? There are basically two choices—explicit threading or OpenMP. The OpenMP standard (http://www.openmp.org/) is a specification for portable shared-memory parallelism in Fortran, C, and C++. It defines a set of compiler directives covering work-sharing constructs, the data environment, and synchronization, along with a callable runtime library and accompanying environment variables. Explicit threading methods (POSIX and Win32 threads, for instance) are best suited to task-parallel problems, or those where each thread performs a different function. A computer game is a good example of a task-parallel system. The task of processing keyboard or joystick input could be assigned to one thread, audio processing to another, and graphics rendering to yet another. The threads could also be given different priorities. This generality is a key advantage of explicit threading methods, but it comes at a price.

Explicitly threading an application is often time consuming and error prone. It is an invasive process that can require significant modifications to existing code. First, concurrent tasks must be encapsulated in functions that can be mapped to threads. Second, POSIX and Win32 thread functions accept only a single argument, which typically means changing function prototypes and defining new data structures to bundle parameters. Third, applications that were not originally designed for concurrent execution can have serious thread-safety issues; any global or static variable, for example, is a potential target of a race condition. Finally, porting explicitly threaded code between operating systems can be difficult. The Win32 API is available only on Microsoft operating systems, and even then, supported features differ between versions of Windows. The same can be said of Pthreads on different flavors of UNIX.
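As a rough illustration of that boilerplate, here is a sketch of what mapping one piece of work onto a POSIX thread looks like; the names are hypothetical and not part of the filter:

#include <pthread.h>
#include <stddef.h>

typedef struct {                     /* extra parameters must be bundled,     */
   unsigned char *rows;              /* because the thread entry point        */
   int firstRow, lastRow;            /* accepts only a single void* argument  */
} BlurArgs;

static void *BlurRows(void *p)       /* signature required by pthread_create  */
{
   BlurArgs *args = (BlurArgs *)p;
   /* ... process rows args->firstRow through args->lastRow ... */
   return NULL;
}

void RunOneWorker(unsigned char *rows, int firstRow, int lastRow)
{
   BlurArgs args = { rows, firstRow, lastRow };
   pthread_t tid;
   pthread_create(&tid, NULL, BlurRows, &args);
   pthread_join(tid, NULL);          /* wait before the stack arguments vanish */
}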

Continuing the computer game example, graphics rendering is a data-parallel problem: the same computations can be applied independently to each frame in a video sequence. You could treat this like any other task-parallel problem, mapping the rendering function to multiple threads and manually dividing the frames among those threads, but it's better to use a method specifically designed to express data parallelism. That's where OpenMP comes into play. Image filters typically apply the same operation to each pixel, and the radial blur algorithm is no exception.

OpenMP defines a set of C preprocessor pragmas (or directives for Fortran), which you use to describe parallelism to the compiler. (OpenMP has a limited ability to express task parallelism as well.) OpenMP-compliant compilers are available for most operating systems, which makes OpenMP quite portable. Most important, however, OpenMP is compact and often nonintrusive. Threading existing serial code rarely requires significant code modifications. Compilers that do not understand OpenMP simply ignore the pragmas. For these reasons, we selected OpenMP as our threading method.
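To give a sense of how compact the directives are, here is a minimal sketch of an OpenMP loop; the function and data are illustrative and not part of the filter:

#include <omp.h>

void Scale(float *a, int n, float s)
{
   /* The work-sharing directive splits the loop iterations across the
      thread team; the loop index i is private to each thread by default. */
   #pragma omp parallel for
   for (int i = 0; i < n; i++)
      a[i] *= s;
}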

Threading the Radial Blur Algorithm

Once you've selected an image processing algorithm and appropriate threading method, it's time to implement the threading. Rather than develop our own image manipulation utilities (image file I/O, format converters, and so on), we decided to develop our multithreaded radial blur filter as an Adobe Photoshop plug-in. We adapted the GIMP's mblur_radial function and added it to an image filter template included in the Adobe Photoshop SDK (Listing One).

The resulting code, however, contains an induction variable. Induction variables often introduce dependencies into otherwise parallel loops. In Example 1(a) the OpenMP pragmas cause the loop to execute in parallel, spreading the iterations across a thread team, with each thread getting private copies of variables i and j. Giving each thread a private copy of j causes incorrect results when the loop is executed with more than one thread; see Table 1. If j is shared, however, multiple threads could update this variable simultaneously, risking data loss. Therefore, access to this variable must be synchronized, as in Example 1(b). Only one thread at a time may execute the structured block following the OpenMP critical pragma, thus avoiding a race condition on j. This solves the correctness problem, but serializes execution, defeating the purpose of threading. Loops containing induction variables are best restructured to avoid excessive synchronization.
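Examples 1(a) and 1(b) are not reproduced here, but the pattern they describe looks roughly like this (an illustrative reconstruction, not the article's original listing):

   /* (a) j is an induction variable: it carries a value from one iteration to
      the next. Making it private gives each thread its own uninitialized copy,
      so the results are wrong with more than one thread. */
   #pragma omp parallel for private(i, j)
   for (i = 0; i < n; i++)
   {
      j += stride;
      a[i] = b[j];
   }

   /* (b) Sharing j and guarding its update with a critical section avoids the
      race, but only one thread at a time can execute the guarded block, so the
      loop effectively runs serially. */
   #pragma omp parallel for
   for (i = 0; i < n; i++)
   {
      #pragma omp critical
      {
         j += stride;
         a[i] = b[j];
      }
   }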

Returning to the radial blur code, as the function loops over the pixels, a pointer keeps track of the current location in the output image; see Example 2(a). This pointer is an induction variable.

The outer loop over the image rows contains most of the work in the filter. Executing this loop in parallel should significantly improve performance. As with most image processing algorithms, the work on one pixel is independent of the work on other pixels. However, something has to be done about the induction variable before OpenMP can be added to the filter (the induction would still be an issue even if we had chosen explicit threading rather than OpenMP).

Before the loop can be executed in parallel, it is necessary to guarantee that each thread will always process a different set of pixels. Fortunately in this case, it is relatively easy to remove the induction by calculating the absolute offset from the origin instead of incrementing a placeholder; see Example 2(b). The offset is now a pure function of the loop indices. This makes the outer-loop iterations completely independent. They can be executed in any order without changing the final results. The revised pixel loop for the RadialBlur function is in Listing Two.
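Examples 2(a) and 2(b) are not reproduced here; in simplified form, the transformation looks like this (FilterPixel is a hypothetical stand-in for the per-pixel work):

   /* (a) With induction: the output pointer advances after every pixel, so each
      iteration depends on how many pixels were processed before it. */
   uint8 *out = pixel;
   for (pixelY = 0; pixelY < rectHeight; pixelY++)
   {
      for (pixelX = 0; pixelX < rectWidth; pixelX++)
      {
         *out = FilterPixel(pixelX, pixelY);
         out++;                                /* induction variable */
      }
      out += (dataRowBytes - rectWidth);       /* skip row padding   */
   }

   /* (b) Induction removed: the offset is a pure function of the loop indices,
      so every iteration is independent and the rows can be processed in any
      order, or in parallel. */
   for (pixelY = 0; pixelY < rectHeight; pixelY++)
   {
      for (pixelX = 0; pixelX < rectWidth; pixelX++)
      {
         pixelOffset = pixelY * dataRowBytes + pixelX;
         *(pixel + pixelOffset) = FilterPixel(pixelX, pixelY);
      }
   }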

Now it's time to begin threading the pixel calculations with OpenMP. Again, the outermost pixel loop contains the most work so this is where we'll put the OpenMP parallel region. Variables are shared among threads by default in OpenMP. This is satisfactory for many variables in the pixel loops. Those that are read but never written can be safely shared among threads (filterRect, maskPixel, for instance). It is also safe to share arrays that are indexed using the counter of the parallel loop. Provided pixelY is private, for example, threads will never write to the same pixel row (the index of a parallel loop is private by default in OpenMP). However, to guarantee correct results from parallel execution, threads must maintain private copies of some variables.

Analyzing and correctly classifying the variables in small parallel regions is not too difficult. Manually analyzing larger parallel regions with nested function calls, however, can be time consuming and error prone. Fortunately, the Intel Thread Checker (http://www.intel.com/software/products/threading/) automatically detects OpenMP errors. A parallel program must preserve data dependence constraints if it is to produce correct results. Violations of these constraints are called data races or storage conflicts. For example, consider this pseudocode:

   for (y = 0; y < Height; y++)
   {
      for (x = 0; x < Width; x++)
      {
         pixelOffset = f(x, y);
         ProcessPixel (pixelOffset);
      }
   }

The offset to the current pixel is a function of the loop indices x and y. If multiple threads execute this code, it is easy to interleave statements that demonstrate how storage conflicts lead to incorrect results; see Table 2. The variables x, y, and pixelOffset are shared between Threads 0 and 1. At T1, x and y are overwritten by Thread 1. As a result, both threads calculate the same offset and proceed to process the same pixel. Storage conflicts also occur on pixelOffset. Making these three variables private to each thread solves the problem.
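A sketch of the fix: declaring the offending variables private gives each thread its own copies (y, as the index of the parallel loop, is private by default):

   #pragma omp parallel for private(x, pixelOffset)
   for (y = 0; y < Height; y++)
   {
      for (x = 0; x < Width; x++)
      {
         pixelOffset = f(x, y);        /* each thread now works with its own */
         ProcessPixel (pixelOffset);   /* x, y, and pixelOffset              */
      }
   }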

To analyze the pixel loop for data dependencies, a single OpenMP parallel for pragma was placed before the outermost loop—no variables were explicitly classified as private. Next, the plug-in was compiled with the Intel C++ compiler using the -Qopenmp and -Qtcheck options. The former enables OpenMP, while the latter enables Thread Checker analysis. The resulting code was then run inside the Thread Checker, which reported storage conflicts on several variables in the pixel loop. Variables that are first written then read in each loop iteration (x, y, xr, yr, xx, yy, step, pixelOffset, r, arcPixelOffset, count, arc, for instance) must be private. Loop indices should also be private (pixelY, pixelX, i). These variables must be declared in the OpenMP private clause for the pixel loop to execute correctly in parallel (Listing Two).
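For example, on Windows the instrumented build step looks roughly like this (the source file name is hypothetical, and the plug-in's other SDK build settings are omitted):

   icl -Qopenmp -Qtcheck RadialBlurFilter.cpp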

Tuning the Filter

Once the pixel loop executes correctly in parallel, you can turn your attention to parallel efficiency. Ideally, the filter would execute twice as fast on two processors (or four times as fast on four processors, and so on). This rarely occurs in practice, however, because of hardware and system limitations. There is overhead associated with thread creation and synchronization. There is also system overhead associated with thread scheduling. At the hardware level, the memory bus must have sufficient bandwidth to supply data to multiple processors. The Intel Thread Profiler monitors specific runtime characteristics that limit parallel performance. Factors such as excessive thread synchronization, excessive parallel overhead (too many threads for too little work), and load imbalance adversely affect parallel performance.

The initial Thread Profiler profile of our image filter indicates good speedups from one to two threads, but a slight load imbalance appears when an image is processed with four threads (Figure 3). In this case, we used OpenMP static scheduling because of its low overhead. Static scheduling divides the loop iterations, in our case pixel rows, as evenly as possible among the threads. For 10 iterations and two threads, the first thread processes iterations 1-5, the second processes 6-10. This is effective if the amount of work per iteration is constant. When the entire image is processed, the amount of computation per pixel row is approximately constant. So why is there a slight load imbalance as the number of threads is increased to four? Example 3(a) (taken from Listing Two) shows one source of load imbalance in the pixel loop. An input pixel is skipped if it falls outside the image bounds. Thus, there is less work associated with pixels along the image perimeter. When an image is divided statically among four threads, there is less work in the top and bottom quadrants than the center quadrants. However, there is an even greater source of imbalance in our image filter.

Consider Example 3(b), also taken from Listing Two. When editing a digital image, it is common to operate on sections of an image. These sections often have irregular shapes. There is no work associated with pixels outside the selected region (governed by the maskPixel array). Remember that static scheduling works best when the amount of work per loop iteration is approximately constant. The amount of work per pixel row is far from constant if the image section is triangular, for example. This creates a severe load imbalance when the image rows are divided statically among threads.

For this reason, OpenMP provides alternative schedule methods. When the amount of work per iteration varies, dynamic scheduling is often best. In a dynamically scheduled loop, iterations are assigned to threads as they become available to do work. Unlike explicit threading, OpenMP scheduling changes are easily accomplished via a schedule clause (Listing Two). Dynamic scheduling incurs slightly higher system overhead than static scheduling, but in the case of our filter, its effect on parallel performance is beneficial.
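The scheduling change amounts to one clause on the work-sharing pragma; an optional chunk size trades scheduling overhead against balance. The chunk size below is illustrative, and the private clause is omitted for brevity (it is the same as in Listing Two):

   /* Static: rows are divided evenly among the threads up front (low overhead). */
   #pragma omp parallel for schedule(static)
   for (pixelY = 0; pixelY < rectHeight; pixelY++) { /* ... row work ... */ }

   /* Dynamic: each thread grabs the next chunk of rows as it finishes its
      current one, which balances the irregular work of a masked selection. */
   #pragma omp parallel for schedule(dynamic, 4)
   for (pixelY = 0; pixelY < rectHeight; pixelY++) { /* ... row work ... */ }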

Conclusion

While a simple image filter might only merit threading for very large images, radial blurring is sufficiently compute intensive that even small images benefit from parallel computing. In fact, the implementation shown here uses an approximation to simplify the calculations. Blurring can be made more accurate, and consequently more compute intensive, using barycentric coordinates (see the accompanying text box entitled "Barycentric Coordinates") but this exercise is left to the reader. Most image processing methods are data parallel, so OpenMP is a fast and easy route to parallelism. It normally requires little code modification and is easy to debug and tune with existing programming tools.

DDJ

Listing One

void RadialBlur(void* data, int32 dataRowBytes, void* mask, 
                                  int32 maskRowBytes, Rect* tileRect)
{
   uint8* pixel = (uint8*)data;   // Points to top left pixel of the image
   uint8* maskPixel = (uint8*)mask;
   uint16 rectHeight = (uint16)(tileRect->bottom - tileRect->top);
   uint16 rectWidth = (uint16)(tileRect->right - tileRect->left);
   // Variables for radial blur
   uint32 arc;       // Accumulate values on blur arc
   int n, step;      // Number of pixels on blur arc and increment
   int count;        // Number of arc pixels within rectangle
   float offset;     // Angular offset from blur pixel
   float theta;      // Angular increment of arc
   float angle;      // Angular sweep from rectangle center to blur pixel
   int R;            // Distance from rectangle center to rectangle corner
   int xr, yr, r;    // Distance from rectangle center to blur pixel
   int16 w, h;       // Coordinates of pixel at rectangle center
   float *ct, *st;   // Cosine and sine tables
   float xx, yy;     // Real coordinates of arc pixel
   int i;
   uint32 pixelOffset;
   const double PI = 3.141592654;
   Rect* filterRect = &gFilterRecord->filterRect;
   PSImagePlane plane;
   // Used to determine the addresses of pixels on blur arc
   plane.data = &gFilterRecord->inData;
   plane.bounds.top = filterRect->top;
   plane.bounds.bottom = filterRect->bottom;
   plane.bounds.left = filterRect->left;
   plane.bounds.right = filterRect->right;
   plane.rowBytes = gFilterRecord->outRowBytes;
   plane.colBytes = gFilterRecord->outColumnBytes;
   // Initialize radial blur variables
   // Find rectangle center (w,h)
   w = gFilterRecord->imageSize.h / 2;
   h = gFilterRecord->imageSize.v / 2;
   R = sqrt (w * w + h * h);
   angle = (float)gParams->angle / 180.0 * PI;   // Convert to radians
   n = 4 * angle * sqrt(R) + 2;
   theta = angle / ((float)(n - 1));
   // Setup cosine and sine tables
   if ((ct = (float *) malloc (n * sizeof(float))) == NULL)
   {
      *gResult = memFullErr;
      return;
   }
   if ((st = (float *) malloc (n * sizeof(float))) == NULL)
   {
      free(ct);   // Don't leak the cosine table if the sine table allocation fails
      *gResult = memFullErr;
      return;
   }
   offset = theta * (n - 1) / 2;
   for (i = 0; i < n; ++i)
   {
      // Build sine and cosine tables
      ct[i] = cos(theta * i - offset);
      st[i] = sin(theta * i - offset);
   }
   // Loop over pixels in the image
   for(uint16 pixelY = 0; pixelY < rectHeight; pixelY++)
   {
      for(uint16 pixelX = 0; pixelX < rectWidth; pixelX++)
      {
         bool leaveItAlone = false;
         if (maskPixel != NULL && !(*maskPixel) && !gParams->ignoreSelection)
            leaveItAlone = true;
         if (!leaveItAlone)
         {
            // Find (x,y) coordinates of pixels on arc
            xr = pixelX - w;
            yr = pixelY - h;
            r = sqrt (xr * xr + yr * yr);
            if (r == 0)
               step = 1;
            else if ((step = R / r) == 0)
               step = 1;
            else if (step > (n - 1))
               step = n - 1;
            for (i = 0, count = 0, arc = 0; i < n; i += step)
            {
               xx = w + (float)xr * ct[i] - (float)yr * st[i];
               yy = h + (float)xr * st[i] + (float)yr * ct[i];
               if ((yy >= filterRect->bottom) || (yy < filterRect->top) ||
                   (xx < filterRect->left) || (xx >= filterRect->right))
                       continue;   // (x,y) outside rectangle
               ++count;
               // Convert (x,y) coordinate to pixel address
               pixelOffset = ((int)xx * gFilterRecord->outColumnBytes) +
                             ((int)yy * gFilterRecord->outRowBytes);
               arc += *((uint8 *)gFilterRecord->inData + pixelOffset);
            }
            if (count != 0)
               *pixel = (uint8)(arc / count);
         }
         pixel++;
         if (maskPixel != NULL)
            maskPixel++;
      }
      pixel += (dataRowBytes - rectWidth);
      if (maskPixel != NULL)
         maskPixel += (maskRowBytes - rectWidth);
   }
   // Clean-up sine and cosine tables
   free(ct);
   free(st);
}


Listing Two

#pragma omp parallel for schedule(dynamic) \
            private(pixelY, pixelX, pixelOffset, arcPixelOffset, \
                    i, count, arc, xx, yy, step, x, y, xr, yr, r)
   for (pixelY = 0; pixelY < rectHeight; pixelY++)
   {
      for (pixelX = 0; pixelX < rectWidth; pixelX++)
      {
         // Find (x,y) coordinates of current pixel
         x = gFilterRecord->inRect.left + pixelX;
         y = gFilterRecord->inRect.top + pixelY;

         pixelOffset = (x * gFilterRecord->outColumnBytes) +
                       (y * gFilterRecord->outRowBytes);
         bool leaveItAlone = false;
         if (maskPixel != NULL && !gParams->ignoreSelection &&
             !(*(maskPixel + pixelOffset)))
                leaveItAlone = true;
         if (!leaveItAlone)
         {
            xr = x - w;
            yr = y - h;
            r = sqrt (xr * xr + yr * yr);
            if (r == 0)
               step = 1;
            else if ((step = R / r) == 0)
               step = 1;
            else if (step > (n - 1))
               step = n - 1;
            for (i = 0, count = 0, arc = 0; i < n; i += step)
            {
               xx = w + (float)xr * ct[i] - (float)yr * st[i];
               yy = h + (float)xr * st[i] + (float)yr * ct[i];
               if ((yy >= filterRect->bottom) || (yy < filterRect->top) ||
                   (xx < filterRect->left) || (xx >= filterRect->right))
                       continue;   // (x,y) outside rectangle
               ++count;
               // Convert (x,y) coordinate to pixel address
               arcPixelOffset = ((int)xx * gFilterRecord->outColumnBytes) +
                                ((int)yy * gFilterRecord->outRowBytes);
               arc += *((uint8 *)gFilterRecord->inData + arcPixelOffset);
            }
            if (count != 0)
               *(pixel + pixelOffset) = (uint8)(arc / count);
         }
      }
   }


