Array Building Blocks: A Flexible Parallel Programming Model for Multicore and Many-Core Architectures


Vector-Vector Product

Shown below are the C and Array Building Blocks implementations of the vector-vector product computation. One key difference to note is that the C/C++ code requires a loop to traverse each element of the src1 and src2 vectors, while the Array Building Blocks function performs the same operation across all of the data automatically.


// Vector-Vector Product: C/C++ Code
#include "cpp_code.h"   // assumed to define the array length n
int main(void)
{
    double src1[n], src2[n], dst[n];
    // Explicit loop over every element
    for (int i = 0; i != n; ++i) {
        dst[i] = src1[i] * src2[i];
    }
}

// Vector-Vector Product: Array Building Blocks Code
#include <arbb.hpp>
using namespace arbb;
void vectorProduct(dense<f64> src1, dense<f64> src2, dense<f64>& dst1) {
    dst1 = src1 * src2;   // element-wise multiply over the whole container
}

int main()
{
    double src1[n], src2[n], dst[n];
    dense<f64> Src1; bind(Src1, src1, n);   // wrap the existing C arrays in ArBB containers
    dense<f64> Src2; bind(Src2, src2, n);
    dense<f64> Dst1; bind(Dst1, dst, n);
    call(vectorProduct)(Src1, Src2, Dst1);  // invoke the ArBB function from C++ code
}

To express this in Array Building Blocks you would need to:

  • Include the <arbb.hpp> header file,
  • Create the required Array Building Blocks objects,
  • Express the computation over these objects, and
  • Use call to invoke an Array Building Blocks function from C++ code.

Computation of Pi by Integration

We will show a simple example that estimates the value of Pi by numerically evaluating an integral. Although this is a toy problem, the same structure appears in much more sophisticated applications.
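For reference, the calculus behind the estimate: the integral of 4/(1 + x^2) over x from 0 to 1 equals 4*arctan(1), which is exactly Pi. Each implementation below approximates this integral with a Riemann sum over NSTEPS sub-intervals of width step = 1/NSTEPS, so the accumulated sum multiplied by step converges to Pi.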

Shown below is a Pthreads version of this computation:


// Pthread Code for computation of Pi by integration
float global_pi = 0.0;
pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
int main()
{
   ...
   pthread_create (&tid[i], NULL, pi_calc, &t_num[i]);
   ...
   /* after joining all threads, scale the accumulated sum */
   global_pi *= 1.0 / NSTEPS;
}
void *pi_calc (void *num)
{
   ...
   myid = *(int *)num;
   float step = 1.0 / NSTEPS, my_pi = 0.0;
   start = (NSTEPS / NTHREADS) * myid;
   end = start + (NSTEPS / NTHREADS);
   for (i = start; i < end; ++i) {
      x = step * ((float)i - 0.5);
      my_pi += 4.0 / (1.0 + x * x);
   }
   pthread_mutex_lock (&global_lock);
   global_pi += my_pi;
   pthread_mutex_unlock (&global_lock);
}

The above Pthreads implementation requires the programmer to explicitly create and manage threads, and to ensure the correctness and scalability of the application by using the right synchronization APIs.

To ease thread-parallel programming, the OpenMP standard was defined and is supported by many compilers. Shown below are two implementations of this computation using OpenMP. The first uses a critical section; the second uses a reduction variable:


// Partial OpenMP code for computation of Pi by integration -- v1 
float step = 1.0 / NSTEPS, x, pi = 0.0; 
#pragma omp parallel for private(x) 
for (i = 0; i < NSTEPS; ++i) {
   x = step * (float(i) - 0.5);
   #pragma omp critical
   pi += 4.0 / (1.0 + x * x);
}
pi *= step;

// Partial OpenMP Code for computation of Pi by integration -- v2 
float step = 1.0 / NSTEPS, x, pi = 0.0; 
#pragma omp parallel for private(x) reduction(+:pi) 
for (i = 0; i < NSTEPS; ++i) {
    x = step * (float(i) - 0.5);
    pi += 4.0 / (1.0 + x * x);
}
pi *= step;

The above OpenMP code is simple, short, and easy to understand: the programmer had to add only one or two lines of OpenMP pragmas. However, the right constructs must still be chosen to ensure scalability. The second implementation is more scalable than the first because it uses the OpenMP reduction clause and avoids the need for a critical region (a lock). In neither case is there any guarantee that the resulting code will be vectorized. Also, if the critical region or the reduction clause were accidentally omitted, the OpenMP compiler would happily generate incorrect code containing a race condition.
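To make that last point concrete, here is a hypothetical variant (not taken from the original listings) in which the reduction clause has been dropped. It compiles cleanly, but every thread performs an unsynchronized read-modify-write on the shared variable pi, so the result varies from run to run:

// Hypothetical OpenMP variant with the reduction accidentally omitted (illustration only)
#include <stdio.h>
#define NSTEPS 100000
int main(void)
{
   float step = 1.0f / NSTEPS, pi = 0.0f;
   #pragma omp parallel for
   for (int i = 0; i < NSTEPS; ++i) {
      float x = step * ((float)i - 0.5f);
      pi += 4.0f / (1.0f + x * x);   // data race: unsynchronized update of shared 'pi'
   }
   pi *= step;
   printf("racy pi estimate = %f\n", pi);   // nondeterministic when run with multiple threads
   return 0;
}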

Now let us look at the Array Building Blocks implementation to understand the differences and the merits of this approach. We will use the vector style here, although the kernel style could also be used:


// Partial Array Building Blocks code for Pi computation
f32 step = 1.0f / NSTEPS;
dense<f32> I = static_cast<dense<f32> >(indices<i32>(0, NSTEPS, 1));
dense<f32> X = (I + 0.5f) * step;
dense<f32> partial = 4.0f / (X * X + 1.0f);
f32 pi = step * add_reduce(partial);

Note that this example also uses the indices construct to create a (virtual) array of sample positions corresponding to the values of the induction variable in the original loop.

The salient feature of writing code in Array Building Blocks is that the underlying runtime generates optimized code for the target platform (vectorized at the best width for the hardware), creates multiple threads to compute the partial Pi values in parallel, and finally performs an add reduction (the add_reduce built-in function). The programmer is relieved of having to write explicit threaded code and the associated synchronization needed to avoid data races.
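For readers who want to check the arithmetic without the Array Building Blocks toolkit, the following plain, sequential C++ sketch mirrors the same dataflow; the vectors I, X, and partial correspond to the containers above. It only illustrates the values being computed and says nothing about how the ArBB runtime vectorizes or threads the work:

// Sequential C++ sketch of the same Pi dataflow (illustration only, not ArBB code)
#include <cstdio>
#include <numeric>
#include <vector>

int main()
{
    const int NSTEPS = 100000;
    const float step = 1.0f / NSTEPS;

    // indices(0, NSTEPS, 1): the index sequence 0, 1, ..., NSTEPS-1
    std::vector<float> I(NSTEPS), X(NSTEPS), partial(NSTEPS);
    for (int i = 0; i < NSTEPS; ++i) I[i] = (float)i;

    // Element-wise: X = (I + 0.5f) * step; partial = 4.0f / (X * X + 1.0f)
    for (int i = 0; i < NSTEPS; ++i) {
        X[i] = (I[i] + 0.5f) * step;
        partial[i] = 4.0f / (X[i] * X[i] + 1.0f);
    }

    // add_reduce(partial): sum all elements, then scale by step
    float pi = step * std::accumulate(partial.begin(), partial.end(), 0.0f);
    std::printf("pi estimate = %f\n", pi);
    return 0;
}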

Complex Number Multiply

Let us now review an additional simple example that computes the element-wise multiplication of two arrays of complex numbers.
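As a quick reminder of the arithmetic involved, for two complex numbers a1 + i*b1 and a2 + i*b2 the product is

(a1 + i*b1) * (a2 + i*b2) = (a1*a2 - b1*b2) + i*(a1*b2 + b1*a2)

and the real and imaginary parts of that expression are exactly what the loop below stores into c3[i].a and c3[i].b.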


// C code for Complex Number Multiply
struct complex {
   float a;   // real part
   float b;   // imaginary part
};
...
for (i = 0; i < N; ++i)
{
   c3[i].a = c1[i].a * c2[i].a - c1[i].b * c2[i].b; //real
   c3[i].b = c1[i].a * c2[i].b + c1[i].b * c2[i].a; //imag
}
...

The above C code, when compiled with a recent C++ compiler, typically generates SSE2 code such as that shown below. Here we used the Intel C++ compiler v11.1 with the -xSSE2 flag for illustration.


// SSE2 code: Intel C++ v11.1 compiler for Complex Number Multiply loop ...
movups xmm6,[rsp+rax*8]
movups xmm1,[rsp+rax*8+0x10]
movups xmm4,[rsp+rax*8+0x3a0]
movups xmm2,[rsp+rax*8+0x3b0]
movaps xmm0,xmm6
movaps xmm5,xmm4
shufps xmm6,xmm1,0xdd
movaps xmm3,xmm6
shufps xmm0,xmm1,0x88
movaps xmm1,xmm0
shufps xmm5,xmm2,0x88
mulps xmm6,xmm5
mulps xmm1,xmm5
shufps xmm4,xmm2,0xdd
mulps xmm3,xmm4
mulps xmm0,xmm4
subps xmm1,xmm3
movaps xmm7,xmm1
addps xmm0,xmm6
unpckhps xmm1,xmm0
movups [rsp+rax*8+0x6d0],xmm1
unpcklps xmm7,xmm0
movups [rsp+rax*8+0x6c0],xmm7
add rax,0x4
  ...

The assembly generated for the same C code compiled for an AVX (256-bit vector instructions) target is shown below. For illustration, we used the Intel C++ compiler v11.1 with the -xAVX option. Intel AVX is a 256-bit instruction set extension to SSE (128-bit) and will be introduced in a future generation of Intel 64 processors built on 32nm process technology.


// AVX (256-bit) code illustration for the Complex Number Multiply loop ...
vmulps %ymm10, %ymm0, %ymm9
vmulps %ymm10, %ymm3, %ymm3
vmulps %ymm2, %ymm0, %ymm0
vsubps %ymm9, %ymm1, %ymm1
vaddps %ymm0, %ymm3, %ymm2
vunpcklps %ymm2, %ymm1, %ymm3
vunpckhps %ymm2, %ymm1, %ymm4
vperm2f128 $32, %ymm4, %ymm3, %ymm5
vmovups %ymm5, 1600(%rsp,%rax,8)
vperm2f128 $49, %ymm4, %ymm3, %ymm6
vmovups %ymm6, 1632(%rsp,%rax,8)
addq $16, %rax
 ...

In the above example of complex number multiplication, exploiting the underlying vector instructions (SSE or AVX) required static compiler options that fix the target ISA at build time. Array Building Blocks helps here by providing a JIT compiler and runtime that allow the application to execute using the best available ISA wherever feasible, ensuring both portability and performance across architectures.

Even though the SSE2 code shown above runs on a processor that supports AVX instructions, executing AVX code yields higher performance without having to recompile with AVX as the target. The same Array Building Blocks binary runs on both platforms: it automatically generates SSE code on SSE processors and AVX code on AVX processors.
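The dispatch itself is performed internally by the ArBB JIT. Purely to illustrate the general idea (this is not ArBB's actual mechanism), a hand-written program can achieve a similar effect by querying the processor at run time and branching to a suitable code path, for example with the GCC/Clang builtin __builtin_cpu_supports:

// Runtime ISA dispatch sketch (illustration only; not how ArBB is implemented)
#include <cstdio>

// Stand-ins for ISA-specific kernels. In a real build these would live in
// separate translation units compiled with -msse2 and -mavx respectively,
// so the compiler emits the matching vector instructions for each path.
static void multiply_sse_path(const float* a, const float* b, float* c, int n)
{ for (int i = 0; i < n; ++i) c[i] = a[i] * b[i]; }

static void multiply_avx_path(const float* a, const float* b, float* c, int n)
{ for (int i = 0; i < n; ++i) c[i] = a[i] * b[i]; }

// Choose a code path once, based on what the CPU reports at run time.
static void multiply_dispatch(const float* a, const float* b, float* c, int n)
{
    if (__builtin_cpu_supports("avx"))
        multiply_avx_path(a, b, c, n);   // 256-bit path on AVX-capable processors
    else
        multiply_sse_path(a, b, c, n);   // 128-bit path everywhere else
}

int main()
{
    float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8}, c[4];
    multiply_dispatch(a, b, c, 4);
    std::printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
    return 0;
}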

