# CEAN: C/C++ Extensions for Array Notations

CEAN (short for "C/C++ Extensions for Array Notations") is a relatively new programming syntax for C/C++ programmers. Available in Intel's Parallel Studio 2011 suite of tools, CEAN is a sequential programming syntax, not to be confused with Array Building Blocks, a parallelization syntax that is also available in Parallel Studio 2011.

When C/C++ programmers strive to eke out maximum performance from an application, they often incorporate SSE intrinsic functions into their code. SSE (Streaming SIMD Extensions) is a small-vector class of instructions. The SSE intrinsic functions provide an abstraction between the underlying SSE instructions of the processor and the compiler's ability to assign and manipulate registers (including SSE registers) on the system. The benefit of using SSE intrinsic functions is that you can attain maximum performance.

However, the trouble with SSE intrinsic functions is that you must take care in using them, and they require some knowledge of the processor architecture. Consequently, this programming feature tends to be used by experienced programmers.

Until recently, the processor architecture issues came down to whether the processor was 32-bit or 64-bit, since 64-bit processors have twice as many SSE registers as 32-bit processors. When you get into the more advanced SSE intrinsic functions, knowledge of the revision level of the SSE instruction set is also important.

With the introduction of the AVX instruction set (Intel Sandy Bridge), the width of the vector registers doubles from 128 bits to 256 bits and additional instructions are added. What this means for programmers is that you may need to write a separate optimized function for each combination of the following variables:

- type (double, float, char, short, int32)
- SSE present or not
- SSE version
- 32-bit or 64-bit
- AVX

Covering every combination of these variables means creating, and then maintaining, a large number of functions. The maintenance burden grows again each time new instructions are introduced into the architecture.
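To make the maintenance burden concrete, here is a minimal sketch of what just one of those functions (the `float` case) can look like once it has to cover AVX, SSE, and no-SIMD builds. The name `ScaleFloatsDispatch` is hypothetical, and the sketch assumes a compiler that predefines `__AVX__` and `__SSE__` (GCC, Clang, and the Intel compiler do; other compilers may use different macros and would take the scalar path here). It uses unaligned loads and stores to stay short, which is not what a tuned version would do.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>   // AVX: 8 floats per 256-bit register
#elif defined(__SSE__)
#include <xmmintrin.h>   // SSE: 4 floats per 128-bit register
#endif

// Hypothetical helper (not from the article): scales n floats in place,
// picking a code path at compile time from the compiler's predefined macros.
void ScaleFloatsDispatch(float* vec, std::size_t n, float scale)
{
    assert(vec != nullptr || n == 0);
    std::size_t i = 0;
#if defined(__AVX__)
    const __m256 mul8 = _mm256_set1_ps(scale);
    for (; i + 8 <= n; i += 8)   // unaligned load/store keeps the sketch simple
        _mm256_storeu_ps(vec + i,
                         _mm256_mul_ps(_mm256_loadu_ps(vec + i), mul8));
#elif defined(__SSE__)
    const __m128 mul4 = _mm_set1_ps(scale);
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(vec + i, _mm_mul_ps(_mm_loadu_ps(vec + i), mul4));
#endif
    for (; i < n; ++i)           // remainder, or everything when no SIMD path
        vec[i] *= scale;
}
```

Every additional element type repeats this whole pattern with different intrinsic names and different lane counts, which is where the function count multiplies.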

Consider, then, the prospect of writing code that reduces the need for SSE intrinsic functions, eliminates reworking your code as the SSE instructions evolve, and produces code that, in many cases, is as good as your handwritten SSE intrinsic functions. CEAN is a step in this direction.

Let's look at an example that uses SSE intrinsic functions within a simple function to scale a vector of floats.

    #include <stdint.h>     // intptr_t
    #include <xmmintrin.h>  // SSE intrinsics (__m128, _mm_*_ps)

    __declspec(noinline)
    void VectorScale(float* vec, int len, float scale)
    {
        // sanity check
        if(len <= 0)
            return; // nothing to do

        // for added performance we will use 16-byte aligned load/stores
        // force vec to be aligned at 16 byte boundary
        // performing individual scaling until aligned
    #pragma novector
        while((intptr_t)vec & 15)
        {
            *vec++ *= scale;
            if(--len == 0)
                return;
        }

        // here when vec is aligned to 16 byte boundary
        // and there is a residual vector remaining
        __m128 mul = _mm_load1_ps(&scale); // 4 floats
        __m128 temp;                       // 4 floats

        // get number of floats representing full SSE registers
        // 4 floats per SSE register
        int lenFullSSE = len & ~3;
        for(int i=0; i < lenFullSSE; i += 4)
        {
            temp = _mm_load_ps(&vec[i]);  // load aligned packed floats
            temp = _mm_mul_ps(temp, mul); // multiply
            _mm_store_ps(&vec[i], temp);
        }

        // finished vector using aligned load/store
        // check for residual data
        int residual = len & 3;
        if(residual)
        {
            vec += lenFullSSE;
    #pragma novector
            for(int i=0; i < residual; ++i)
                vec[i] *= scale;
        }
    }

Note that the inner loop can be written without **temp** by substituting each SSE intrinsic function call for the **temp** it would have filled. This produces a nested call:

    for(int i=0; i < lenFullSSE; i += 4)
    {
        _mm_store_ps(
            &vec[i],
            _mm_mul_ps(
                _mm_load_ps(&vec[i]),
                mul));
    }

While you can run performance benchmarks to evaluate the effects of using explicit temps or nested intrinsic function calls, I find it easier (and thus recommend) producing an .ASM listing file and looking at the emitted code.

Before balking at the idea of examining assembler code, consider that you do not need to understand the .ASM code beyond counting lines of code and, more importantly, counting the number of memory references. As a rule of thumb, the more memory references a loop contains, the slower it runs.

The two different programming styles of using SSE intrinsic functions (with **temp** and **nested**) happened to produce the same code for the inner loop:

    .B2.9:
            movaps    xmm2, XMMWORD PTR [esi]
            inc       edi
            mulps     xmm2, xmm1
            movaps    XMMWORD PTR [esi], xmm2
            add       esi, 16
            cmp       edi, eax
            jb        .B2.9

That will not always hold. In some cases, explicitly using a temporary will improve or degrade the performance of the code, which is why it pays to check the listing.

In the above loop, you have seven (7) instructions, two of which contain memory references. Memory references have the "… PTR [register] …" format, although some assembler dialects (GNU style, for example) may omit the PTR and show just "…[register]…".
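The counting can also be done mechanically. The following is a rough sketch, not a real assembler parser: the names `ListingCounts` and `CountListing` are made up for illustration, and the code assumes an Intel-style listing in which labels end with a colon, compiler annotations start with `;`, and this article's own comments start with `**`. It treats any bracketed operand as a memory reference except on `lea`, which only computes an address, and it does not try to recognize assembler directives such as PUBLIC or PROC.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

struct ListingCounts {
    int instructions = 0;  // lines that hold an instruction
    int memoryRefs = 0;    // instructions with a "[...]" memory operand
};

// Hypothetical helper: tallies instructions and memory references in a listing.
ListingCounts CountListing(const std::string& listing)
{
    ListingCounts counts;
    std::istringstream in(listing);
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t semi = line.find(';');       // drop annotations
        if (semi != std::string::npos) line.erase(semi);
        const std::size_t first = line.find_first_not_of(" \t\r");
        if (first == std::string::npos) continue;      // blank line
        const std::size_t last = line.find_last_not_of(" \t\r");
        const std::string body = line.substr(first, last - first + 1);
        if (body.back() == ':') continue;              // label
        if (body.compare(0, 2, "**") == 0) continue;   // article comment
        ++counts.instructions;
        const std::string mnemonic = body.substr(0, body.find_first_of(" \t"));
        // lea uses brackets but never touches memory
        if (mnemonic != "lea" && body.find('[') != std::string::npos)
            ++counts.memoryRefs;
    }
    assert(counts.memoryRefs <= counts.instructions);
    return counts;
}
```

Fed the seven-instruction inner loop shown earlier, this sketch reports seven instructions and two memory references, matching the hand count.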

Also, in the above code snippet, I removed compiler annotations from the .ASM file.

The compiler-produced .ASM file, inclusive of the annotations, looks something like this:

    $LN118:
                                    ; LOE eax edx ecx ebp esi
    .B3.3:                          ; Preds .B3.2
    $LN119:
            test      cl, 3                         ;51.2

The **$LN***nn***:** entries are line numbers, the "; …" text is compiler annotation, and the **.B***n.nn* entries are branch labels. Removing these compiler annotations makes for easier reading of this article; therefore I have omitted them. Added to the assembler listing file, for this article, are comments of the form:

**** my article comments here**

With my annotations, those of you who are not familiar with assembly code should be able to make some sense of it. The complete assembler code for this function is:

    PUBLIC [email protected]@[email protected]
    [email protected]@[email protected] PROC NEAR
    ; parameter 1(vec): 16 + esp
    ; parameter 2(len): 20 + esp
    ; parameter 3(scale): 24 + esp
    .B2.1:
    ** entry point of VectorScale
            push      esi
            push      edi
            push      ebx
    ** ebx = len
            mov       ebx, DWORD PTR [20+esp]
    ** (len <= 0)
            test      ebx, ebx
    ** edx = vec
            mov       edx, DWORD PTR [16+esp]
    ** jump to return if(len <= 0)
            jle       .B2.16
    .B2.2:
    ** while((intptr_t)vec & 15)
            test      dl, 15
            je        .B2.7
    .B2.3:
    ** xmm0 = scale (lowest float of xmm0)
            movss     xmm0, DWORD PTR [24+esp]
    .B2.4:
    ** xmm1 = *vec (lowest float of xmm1)
            movss     xmm1, DWORD PTR [edx]
    ** xmm1 *= scale
            mulss     xmm1, xmm0
    ** *vec = xmm1
            movss     DWORD PTR [edx], xmm1
    ** vec++
            add       edx, 4
    ** if(--len == 0) return
            dec       ebx
            je        .B2.16
    .B2.5:
    ** while((intptr_t)vec & 15)
            test      dl, 15
            jne       .B2.4
    ** here when vec at 16 byte boundary
    .B2.7:
    ** int lenFullSSE = len & ~3;
            mov       ecx, ebx
            and       ecx, -4
    ** xmm0 = scale (lowest float of xmm0)
            movss     xmm0, DWORD PTR [24+esp]
    ** duplicate lowest float of xmm0 to all floats of xmm0
            shufps    xmm0, xmm0, 0
    ** for(int i=0; i < lenFullSSE; i += 4)
    ** loading registers for use in body of loop
            lea       esi, DWORD PTR [3+ecx]
            mov       eax, esi
            sar       eax, 1
            shr       eax, 30
            add       eax, esi
            sar       eax, 2
            test      ecx, ecx
            jle       .B2.11
    .B2.8:
            xor       edi, edi
            mov       esi, edx
    .B2.9:
    ** temp = _mm_load_ps(&vec[i]); // load aligned packed floats
            movaps    xmm1, XMMWORD PTR [esi]
            inc       edi
    ** temp = _mm_mul_ps(temp, mul);// multiply
            mulps     xmm1, xmm0
    ** _mm_store_ps(&vec[i], temp);
            movaps    XMMWORD PTR [esi], xmm1
    ** esi holds &vec[i], bump to next 4 floats
            add       esi, 16
    ** i < lenFullSSE
            cmp       edi, eax
            jb        .B2.9
    .B2.11:
    ** int residual = len & 3;
            and       ebx, 3
    ** if(residual)
            je        .B2.16
    .B2.12:
    ** for(int i=0; i < residual; ++i)
            jle       .B2.16
    .B2.13:
    ** xmm0 = scale (one float)
            movss     xmm0, DWORD PTR [24+esp]
    ** vec += lenFullSSE;
            lea       eax, DWORD PTR [edx+ecx*4]
            xor       esi, esi
    .B2.14:
    ** temp = vec[i] (one float)
            movss     xmm1, DWORD PTR [eax]
    ** ++i
            inc       esi
    ** temp *= scale
            mulss     xmm1, xmm0
    ** vec[i] = temp
            movss     DWORD PTR [eax], xmm1
    ** advance register holding &vec[i]
            add       eax, 4
    ** for(int i=0; i < residual; ++i)
            cmp       esi, ebx
            jl        .B2.14
    .B2.16:
    ** return
            pop       ebx
            pop       edi
            pop       esi
            ret

This function, using the SSE intrinsic functions, produces fairly compact and near-optimal code.

As for permutations of this function by architecture: the function uses only a few registers, so 32-bit versus 64-bit differences are not a concern. That leaves the variable type (**double**, **float**, **char**, **short**, **int32**) and **AVX**. Assuming you code for the Pentium 4 and later, the system will have at least SSE. This means you will have to write 10 similar functions (five if you omit **AVX**).
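To show how little carries over between those permutations, here is a hedged sketch of the **double** sibling of VectorScale. The name `VectorScaleDouble` is hypothetical, and the code assumes SSE2 (detected through the `__SSE2__` macro that GCC, Clang, and the Intel compiler predefine) with a plain scalar fallback otherwise. The structure matches the float version, but the intrinsic names, the register type, the lane count, and the masks all change.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>   // SSE2: packed double-precision intrinsics
#endif

// Hypothetical double-precision sibling of VectorScale (illustrative name).
void VectorScaleDouble(double* vec, int len, double scale)
{
    if (len <= 0)
        return;                               // nothing to do
    assert(vec != nullptr);
#if defined(__SSE2__)
    // scale one element at a time until vec sits on a 16-byte boundary
    while ((std::uintptr_t)vec & 15) {
        *vec++ *= scale;
        if (--len == 0)
            return;
    }
    const __m128d mul = _mm_set1_pd(scale);   // 2 doubles per SSE register
    const int lenFullSSE = len & ~1;          // float version: len & ~3
    for (int i = 0; i < lenFullSSE; i += 2)   // float version: i += 4
        _mm_store_pd(&vec[i], _mm_mul_pd(_mm_load_pd(&vec[i]), mul));
    vec += lenFullSSE;
    len -= lenFullSSE;
#endif
    for (int i = 0; i < len; ++i)             // residual (or all, without SSE2)
        vec[i] *= scale;
}
```

The integer types need yet another set of intrinsics, and an AVX build changes the register width again, so each variant has to be written and tested separately.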

Now let's look at the same function written using CEAN syntax:

    __declspec(noinline)
    void VectorScaleCEAN(float* vec, int len, float scale)
    {
        vec[0:len] *= scale;
    }

This function is reduced to one statement. In fact, you would most likely eliminate the function and use the statement inline. The following two lines are equivalent:

    // ...
    VectorScaleCEAN(X, lenX, 2.0); // use function
    X[0:lenX] *= 2.0;              // perform inline
    // ...
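For readers without a CEAN-capable compiler, the meaning of the array section can be spelled out as an ordinary loop. This is only a portable stand-in written for this explanation: the function name `ScaleSection` is hypothetical, and the sketch relies on the section's second field being a length (a count of elements), not an end index.

```cpp
#include <cassert>

// Portable stand-in for the CEAN statement   vec[start:length] *= scale;
// The name ScaleSection is hypothetical; CEAN itself needs no function.
void ScaleSection(float* vec, int start, int length, float scale)
{
    assert(start >= 0);
    // The second field of a section is an element count, so the
    // elements touched are vec[start] .. vec[start + length - 1].
    for (int i = 0; i < length; ++i)
        vec[start + i] *= scale;
}
```

With this helper, `ScaleSection(X, 0, lenX, 2.0f)` touches the same elements as `X[0:lenX] *= 2.0;`. The difference is that the CEAN form states the whole-section operation directly, leaving the compiler free to choose how to vectorize it.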

Examining the assembler code generated for the function, which includes my annotations (**) for this article, we find:

    PUBLIC [email protected]@[email protected]
    [email protected]@[email protected] PROC NEAR
    ; parameter 1(vec): eax
    ; parameter 2(len): edx
    ; parameter 3(scale): 20 + esp
    .B3.1:
            mov       eax, DWORD PTR [4+esp]
            mov       edx, DWORD PTR [8+esp]
    PUBLIC [email protected]@[email protected]
    [email protected]@[email protected]::
            push      edi
            push      ebx
    ** tests for valid len
            test      edx, edx
            jle       .B3.17
    .B3.2:
    ** test for vec aligned at 16 byte boundary
            mov       ecx, eax
            movss     xmm1, DWORD PTR [20+esp]
            and       ecx, 15
            je        .B3.5
    ** test for vec aligned at 4 byte boundary
    ** not performed in our VectorScale due to knowledge about program
    .B3.3:
            test      cl, 3
            jne       .B3.18
    .B3.4:
    ** preamble of loop to run vec up to 16 byte aligned boundary
            neg       ecx
            add       ecx, 16
            shr       ecx, 2
    .B3.5:
            lea       ebx, DWORD PTR [8+ecx]
            cmp       edx, ebx
            jl        .B3.18
    .B3.6:
            mov       edi, edx
            sub       edi, ecx
            and       edi, 7
            neg       edi
            add       edi, edx
            test      ecx, ecx
            jbe       .B3.10
    .B3.7:
            xor       ebx, ebx
    .B3.8:
    ** loop to progress one float at a time till aligned
            movss     xmm0, DWORD PTR [eax+ebx*4]
            mulss     xmm0, xmm1
            movss     DWORD PTR [eax+ebx*4], xmm0
            inc       ebx
            cmp       ebx, ecx
            jb        .B3.8
    .B3.10:
    ** load xmm0 (all 4 floats) with scale
            movaps    xmm0, xmm1
            shufps    xmm0, xmm0, 0
    .B3.11:
    ** loop to run through 16 byte aligned portion of vec
    ** *** note additional optimization to double up moves
            movaps    xmm2, XMMWORD PTR [eax+ecx*4]
            movaps    xmm3, XMMWORD PTR [16+eax+ecx*4]
            mulps     xmm2, xmm0
            mulps     xmm3, xmm0
            movaps    XMMWORD PTR [eax+ecx*4], xmm2
            movaps    XMMWORD PTR [16+eax+ecx*4], xmm3
            add       ecx, 8
            cmp       ecx, edi
            jb        .B3.11
    .B3.13:
    ** test for residual
            cmp       edi, edx
            jae       .B3.17
    .B3.15:
    ** loop for residual data
            movss     xmm0, DWORD PTR [eax+edi*4]
            mulss     xmm0, xmm1
            movss     DWORD PTR [eax+edi*4], xmm0
            inc       edi
            cmp       edi, edx
            jb        .B3.15
    .B3.17:
    ** return
            pop       ebx
            pop       edi
            ret
    ** extra snip of code for use above
    .B3.18:
            xor       edi, edi
            jmp       .B3.13

The code generated for the CEAN statement performs the same steps we coded by hand:

- Sanity check on length of vector
- When necessary, individual float scaling until aligned at 16-byte boundary
- Additional test for vec not at multiple of 4 bytes
- 16-byte aligned scaling, with the additional optimization (marked *** in the listing) of doubling up the load/multiply/store
- Scaling of the residual floats

The CEAN code should outperform our handwritten code, because its aligned loop processes eight floats per iteration where ours processes four (meaning we could have done better had we spent more time at it). The beauty of the CEAN approach is that we do not have to write additional code for use with **AVX**. The compiler will do this automatically.

Additional notes for when producing your test code as .ASM listing for optimizations include:

- Use Release Build
- Turn off IPO (Inter-Procedural Optimization)
- Add **__declspec(noinline)**
- Do not use literals or a variable where the compiler optimization code will know the values.
- Consider using **#pragma nounroll** in places.
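As an illustration of the third and fourth notes, here is a small sketch of a test driver. Everything in it is assumed for the example: the `NOINLINE` macro, the `ScaleKernel` and `RunScaleKernel` names, and the use of `volatile` objects as one possible way to keep the optimizer from learning the argument values. The macro maps to `__declspec(noinline)` on Microsoft-compatible compilers and to the GCC/Clang attribute elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#else
#define NOINLINE __attribute__((noinline))   // GCC/Clang spelling
#endif

// Kernel under study: kept out of line so it appears as its own
// procedure in the .ASM listing.
NOINLINE void ScaleKernel(float* vec, int len, float scale)
{
    for (int i = 0; i < len; ++i)
        vec[i] *= scale;
}

// Hypothetical driver: len and scale pass through volatile objects, so the
// optimizer cannot treat them as known constants at the call site.
float RunScaleKernel(int lenHint, float scaleHint)
{
    volatile int opaqueLen = lenHint;
    volatile float opaqueScale = scaleHint;
    const int len = opaqueLen;
    assert(len >= 0);
    std::vector<float> data(static_cast<std::size_t>(len), 1.0f);
    ScaleKernel(data.data(), len, opaqueScale);
    return len > 0 ? data[len - 1] : 0.0f;   // use the result so it is kept
}
```

Returning a value computed from the data gives the compiler a reason to keep the kernel call, which matters once the optimizer is free to discard work whose results are never read.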

### Caveats

CEAN is relatively new to the language syntax. As with any new feature, expect some issues with regard to code generation. Issues that are identified and reported back to the product support department of your compiler vendor will be addressed and corrected in the incremental updates. The more users who experiment with CEAN and report issues, the faster the CEAN capability matures into a solid feature.

While CEAN will not eliminate every circumstance that calls for intrinsic functions, it will eliminate many of them and produce code that is more portable. I suggest that you give CEAN serious consideration and provide feedback to the development teams working on it to make this extension available to us overworked programmers.