### Vector-Vector Product

Shown below are the C and Array Building Blocks implementations of the vector-vector product computation. One key difference to note is that the C/C++ code requires a loop to traverse each element of the **src1** and **src2** vectors, while the Array Building Blocks functions perform the same operation across all the data automatically.

```cpp
// Vector-Vector Product: C/C++ Code
#include "cpp_code.h"

int main(void) {
    double src1[n], src2[n], dst[n];
    for (int i = 0; i != n; ++i) {
        dst[i] = src1[i] * src2[i];
    }
}
```

```cpp
// Vector-Vector Product: Array Building Blocks Code
#include <arbb.hpp>
using namespace arbb;

void vectorProduct(dense<f64> src1, dense<f64> src2, dense<f64>& dst1) {
    dst1 = src1 * src2;
}

int main() {
    double src1[n], src2[n], dst[n];
    dense<f64> Src1; bind(Src1, src1, n);
    dense<f64> Src2; bind(Src2, src2, n);
    dense<f64> Dst1; bind(Dst1, dst, n);
    call(vectorProduct)(Src1, Src2, Dst1);
}
```

To express this in Array Building Blocks you would need to:

- Include the `<arbb.hpp>` header file,
- Create the required Array Building Blocks objects,
- Express the computation over these objects, and
- Use **call** to invoke an Array Building Blocks function from C++ code.

### Computation of Pi by Integration

We will show a simple example that estimates the value of Pi by numerically computing the area under a curve, that is, by evaluating an integral. Although this is a toy problem, the same structure appears in much more sophisticated applications.

Shown below is a Pthreads version of this computation:

```c
// Pthread code for computation of Pi by integration
float global_pi = 0.0;
pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

int main() {
    ...
    pthread_create(&tid[i], NULL, pi_calc, &t_num[i]);
    ...
    global_pi *= step;  // scale once, after all threads have joined
}

void *pi_calc(void *num) {
    ...
    myid = *(int *)num;
    float step = 1.0 / NSTEPS, my_pi = 0.0;
    start = (NSTEPS / NTHREADS) * myid;
    end = start + (NSTEPS / NTHREADS);
    for (i = start; i < end; ++i) {
        x = step * ((float)i + 0.5);
        my_pi += 4.0 / (1.0 + x * x);
    }
    pthread_mutex_lock(&global_lock);   // serialize updates of the shared sum
    global_pi += my_pi;
    pthread_mutex_unlock(&global_lock);
}
```

The above Pthreads implementation requires the programmer to explicitly create and manage threads, and to ensure the correctness and scalability of the application by using the right synchronization APIs.

To ease thread parallel programming, the OpenMP standard was defined and supported in compilers. Shown below are two implementations of this computation using OpenMP. The first uses a critical section; the second uses a reduction variable:

```c
// Partial OpenMP code for computation of Pi by integration -- v1
float step = 1.0 / NSTEPS, x, pi = 0.0;
#pragma omp parallel for private(x)
for (i = 0; i < NSTEPS; ++i) {
    x = step * (float(i) + 0.5);
    #pragma omp critical
    pi += 4.0 / (1.0 + x * x);
}
pi *= step;
```

```c
// Partial OpenMP code for computation of Pi by integration -- v2
float step = 1.0 / NSTEPS, x, pi = 0.0;
#pragma omp parallel for private(x) reduction(+:pi)
for (i = 0; i < NSTEPS; ++i) {
    x = step * (float(i) + 0.5);
    pi += 4.0 / (1.0 + x * x);
}
pi *= step;
```

The above OpenMP code is simple, small, and easy to understand: the programmer had to add only one or two lines of OpenMP pragmas. However, it still requires that the right constructs be chosen to ensure scalability. The second implementation is more scalable than the first, since it uses the OpenMP reduction clause and avoids the need for a critical region (a lock). However, there is no guarantee in either case that the resulting code will be vectorized. Also, if the critical region or the reduction clause were accidentally omitted, the OpenMP compiler would happily generate incorrect code containing a race condition.

Now, let us look at the Array Building Blocks implementation to understand the differences and merits of using Array Building Blocks. We will use the vector style for this, although we could also use the kernel style:

```cpp
// Partial Array Building Blocks code for Pi computation
f32 step = 1.0f / NSTEPS;
dense<f32> I = static_cast<dense<f32>>(indices<i32>(0, NSTEPS, 1));
dense<f32> X = (I + 0.5f) * step;
dense<f32> partial = 4.0f / (X * X + 1.0f);
f32 pi = step * add_reduce(partial);
```

Note that this example also uses the **indices** construct to build a (virtual) array of sample positions corresponding to the values of the induction variable in the original loop.

The salient feature of writing code in Array Building Blocks is that the underlying runtime generates optimized code for the target platform (vectorized at the best available width), creates multiple threads to compute the partial Pi values in parallel, and finally performs an add reduction (the **add_reduce** built-in function). The programmer is relieved of writing explicit threaded code and the associated synchronization needed to avoid data races.

### Complex Number Multiply

Let us now review an additional simple example that computes the element-wise multiplication of two arrays of complex numbers.

```c
// C code for Complex Number Multiply
struct complex {
    float a;  // real part
    float b;  // imaginary part
};
...
for (i = 0; i < N; ++i) {
    c3[i].a = c1[i].a * c2[i].a - c1[i].b * c2[i].b; // real
    c3[i].b = c1[i].a * c2[i].b + c1[i].b * c2[i].a; // imag
}
...
```

The above C code, when compiled with one of the latest C++ compilers, would typically generate SSE2 code such as that shown below. Here, we used the Intel C++ compiler v11.1 with the **-xSSE2** flag for illustration.

```asm
// SSE2 code: Intel C++ v11.1 compiler for Complex Number Multiply loop
...
movups   xmm6,[rsp+rax*8]
movups   xmm1,[rsp+rax*8+0x10]
movups   xmm4,[rsp+rax*8+0x3a0]
movups   xmm2,[rsp+rax*8+0x3b0]
movaps   xmm0,xmm6
movaps   xmm5,xmm4
shufps   xmm6,xmm1,0xdd
movaps   xmm3,xmm6
shufps   xmm0,xmm1,0x88
movaps   xmm1,xmm0
shufps   xmm5,xmm2,0x88
mulps    xmm6,xmm5
mulps    xmm1,xmm5
shufps   xmm4,xmm2,0xdd
mulps    xmm3,xmm4
mulps    xmm0,xmm4
subps    xmm1,xmm3
movaps   xmm7,xmm1
addps    xmm0,xmm6
unpckhps xmm1,xmm0
movups   [rsp+rax*8+0x6d0],xmm1
unpcklps xmm7,xmm0
movups   [rsp+rax*8+0x6c0],xmm7
add      rax,0x4
...
```

The assembly generated for the same C code compiled for an AVX (256-bit vector instructions) target is shown below. For illustration, we used the Intel C++ compiler v11.1 with the **-xAVX** option. Intel AVX is a 256-bit instruction-set extension to SSE (128-bit) and will be introduced in a future generation of Intel 64 processors built on 32nm process technology.

```asm
// AVX (256-bit) code illustration for the Complex Number Multiply loop
...
vmulps     %ymm10, %ymm0, %ymm9
vmulps     %ymm10, %ymm3, %ymm3
vmulps     %ymm2, %ymm0, %ymm0
vsubps     %ymm9, %ymm1, %ymm1
vaddps     %ymm0, %ymm3, %ymm2
vunpcklps  %ymm2, %ymm1, %ymm3
vunpckhps  %ymm2, %ymm1, %ymm4
vperm2f128 $32, %ymm4, %ymm3, %ymm5
vmovups    %ymm5, 1600(%rsp,%rax,8)
vperm2f128 $49, %ymm4, %ymm3, %ymm6
vmovups    %ymm6, 1632(%rsp,%rax,8)
addq       $16, %rax
...
```

In the above example for multiplication of Complex Numbers, to exploit the underlying vector instructions (SSE or AVX), we had to use static compiler options. Array Building Blocks helps here by providing a JIT compiler and runtime that allows the application to be executed on the best ISA wherever feasible, thus ensuring portability and performance across architectures.

Even though the SSE2 code shown above would run on a processor that supports AVX instructions, executing AVX code yields higher performance, and with Array Building Blocks this comes without recompiling with AVX as the target. The same Array Building Blocks binary runs on both platforms: the runtime generates SSE code on SSE processors and AVX code on AVX processors, automatically.