CEAN: C/C++ Extensions for Array Notations
CEAN (short for "C++ Extension for Array Notation") is a relatively new programming syntax for C/C++ programmers. Available in Intel's Parallel Studio 2011 suite of tools, CEAN is a sequential programming syntax that's not to be confused with a parallization syntax called Array Building Blocks, also available in Parallel Studio 2011.
When C/C++ programmers strive to eek out maximum performance of an application they often incorporate SSE intrinsic functions into their code. SSE represents the Small Vector class of instructions. These SSE intrinsic functions provide an abstraction between the underlying SSE instructions of the processor and the compilers ability to assign and manipulate registers (including SSE registers) on the system. The benefit of using SSE intrinsic functions is you can attain maximum performance.
However, the trouble with using SSE intrinsic functions is that you must take care in using them and they require some knowledge about the processor architecture. Consequently, this programming feature tends to be used by the experienced programmers.
Until recently, the processor architecture issues related to whether the processor was 32-bit or 64-bit, where 64-bit processors have twice the number of SSE registers as does the 32-bit processors. When you get into more advanced features of the SSE intrinsic functions, knowledge of the revision level of the SSE instruction set is also important.
With the introduction of the AVX instruction set (Intel Sandy Bridge), the width of the SSE registers is doubled and additional instructions have been added. What this means for programmers is that you may need to write a permutation of optimized functions based on the following variables:
- type (double, float, char, short, int32)
- SSE present or not
- SSE version
- 32-bit or 64-bit
- AVX
This permutation of architectures requires the creation -- and maintenance -- of a large number of functions. And this requires future maintenance as new instructions are introduced into the architecture.
Consider then the prospect of being able to write code that reduces the requirements of using SSE intrinsic functions, eliminates reworking your code as SSE instructions evolve, and produces as optimal code that, in many cases, is as good as your handwritten SSE intrinsic functions. CEAN is a step in this direction.
Let's take a look at an example of use of SSE intrinsic functions within a simple function to scale a vector of floats.
__declspec(noinline)
void VectorScale(float* vec, int len, float scale)
{
// sanity check
if(len <= 0)
return; // nothing to do
// for added performance we will use 16-byte aligned load/stores
// force vec to be aligned at 16 byte boundary
// performing individual scaling until aligned
#pragma novector
while((intptr_t)vec & 15)
{
*vec++ *= scale;
if(--len == 0)
return;
}
// here when vec is aligned to 16 byte boundary
// and there is a residual vector remaining
__m128 mul = _mm_load1_ps(&scale); // 4 floats
__m128 temp; // 4 floats
// get number of floats representing full SSE registers
// 4 floats per SSE register
int lenFullSSE = len & ~3;
for(int i=0; i < lenFullSSE; i += 4)
{
temp = _mm_load_ps(&vec[i]); // load aligned packed doubles
temp = _mm_mul_ps(temp, mul);// multiply
_mm_store_ps(&vec[i], temp);
}
// finished vector using aligned load/store
// check for residual data
int residual = len & 3;
if(residual)
{
vec += lenFullSSE;
#pragma novector
for(int i=0; i < residual; ++i)
vec[i] *= scale;
}
}
Note that the inner loop can be written without the temp by enclosing the SSE intrinsic function call in place of the temp. This produces a nested call:
for(int i=0; i < lenFullSSE; i += 4)
{
_mm_store_ps(
&vec[i],
_mm_mul_ps(
_mm_load_ps(&vec[i]), mul));
}
While you can run performance benchmarks to evaluate the effects of using explicit temps or nested intrinsic function calls, I find it easier (and thus recommend) producing an .ASM listing file and looking at the emitted code.
Before balking at the idea of examining assembler code, consider that you do not need to understand the .ASM code beyond counting lines of code and, more importantly, counting the number of memory references. It is generally the case that the performance is inversely proportional to the number of memory references.
The two different programming styles of using SSE intrinsic functions (with temp and nested) happened to produce the same code for the inner loop:
.B2.9:
movaps xmm2, XMMWORD PTR [esi]
inc edi
mulps xmm2, xmm1
movaps XMMWORD PTR [esi], xmm2
add esi, 16
cmp edi, eax
jb .B2.9
In some cases, explicitly using a temporary will improve or degrade the performance of the code.
In the above loop, you have seven (7) instructions with two of these instructions containing memory references. Memory references have the "… PTR [register] …" format. But depending on the assembler dialect (GNU style) may omit the PTR and have "…[register]…".
Also, in the above code snippet, I removed compiler annotations from the .ASM file.
The compiler-produced .ASM file inclusive of the annotations look something like this:
$LN118:
; LOE eax edx ecx ebp esi
.B3.3: ; Preds .B3.2
$LN119:
test cl, 3 ;51.2
The $LNnn: are line numbers, "; test …" are compiler annotations, and .Bn.nn are branch labels. Removing these compiler annotations make for easier reading of this article; therefore I have omitted them. Added to the assembler listing file, for this article are:
** my article comments here
With my annotations, those of you that are not familiar with assembly code should be able to make some sense of it. The complete assembler code for this function is:
PUBLIC ?VectorScale@@YAXPAMHM@Z
?VectorScale@@YAXPAMHM@Z PROC NEAR
; parameter 1(vec): 16 + esp
; parameter 2(len): 20 + esp
; parameter 3(scale): 24 + esp
.B2.1:
** entry point of VectorScale
push esi
push edi
push ebx
** ebx = len
mov ebx, DWORD PTR [20+esp]
** (len <= 0)
test ebx, ebx
** edx = vec
mov edx, DWORD PTR [16+esp]
** jump to return if(len <= 0)
jle .B2.16
.B2.2:
** while((intptr_t)vec & 15)
test dl, 15
je .B2.7
.B2.3:
** xmm0 = scale (lowest float of xmm0)
movss xmm0, DWORD PTR [24+esp]
.B2.4:
** xmm1 = *vec (lowest float of xmm0)
movss xmm1, DWORD PTR [edx]
** xmm1 *= scale
mulss xmm1, xmm0
** *vec = xmm1
movss DWORD PTR [edx], xmm1
** vec++
add edx, 4
** if(--len == 0) return
dec ebx
je .B2.16
.B2.5:
** while((intptr_t)vec & 15)
test dl, 15
jne .B2.4
** here when vec at 16 byte boundary
.B2.7:
** int lenFullSSE = len & ~3;
mov ecx, ebx
and ecx, -4
** xmm0 = scale (lowest float of xmm0)
movss xmm0, DWORD PTR [24+esp]
** duplicate lowest float of xmm0 to all floats of xmm0
shufps xmm0, xmm0, 0
** for(int i=0; i < lenFullSSE; i += 4)
** loading registers for use in body of loop
lea esi, DWORD PTR [3+ecx]
mov eax, esi
sar eax, 1
shr eax, 30
add eax, esi
sar eax, 2
test ecx, ecx
jle .B2.11
.B2.8:
xor edi, edi
mov esi, edx
.B2.9:
** temp = _mm_load_ps(&vec[i]); // load aligned packed doubles
movaps xmm1, XMMWORD PTR [esi]
inc edi
** temp = _mm_mul_ps(temp, mul);// multiply
mulps xmm1, xmm0
** _mm_store_ps(&vec[i], temp);
movaps XMMWORD PTR [esi], xmm1
** esi holds &vec[i], bump to next 4 floats
add esi, 16
** i < lenFullSSE
cmp edi, eax
jb .B2.9
.B2.11:
** int residual = len & 3;
and ebx, 3
** if(residual)
je .B2.16
.B2.12:
** for(int i=0; i < residual; ++i)
jle .B2.16
.B2.13:
** xmm0 = scale (one float)
movss xmm0, DWORD PTR [24+esp]
** vec += lenFullSSE;
lea eax, DWORD PTR [edx+ecx*4]
xor esi, esi
.B2.14:
** temp = vec[i] (one float)
movss xmm1, DWORD PTR [eax]
** ++i
inc esi
** temp *= scale
mulss xmm1, xmm0
** vec[i] = temp
movss DWORD PTR [eax], xmm1
** advance register holding &vec[i]
add eax, 4
** for(int i=0; i < residual; ++i)
cmp esi, ebx
jl .B2.14
.B2.16:
** return
pop ebx
pop edi
pop esi
ret
This function, using the SSE intrinsic functions, produces fairly compact and near optimal code.
As for permutations of this function for architecture, this function makes use of a few number of registers; therefore 32-bit or 64-bit architecture differences are not of concern. This leaves variable type (double, float, char, short, int32) and AVX. Assuming you code for P4 and later, the system will have at least SSE. What this means is you will have to write 10 similar functions (five if you omit AVX).
Now let's look at the same function written using CEAN syntax:
__declspec(noinline)
void VectorScaleCEAN(double* vec, int len, double scale)
{
vec[0:len] *= scale;
}
This function is reduced to one statement. In fact, you would most likely eliminate the function and use the statement inline. The following two lines are equivilent:
// ... VectorScaleCEAN(X, lenX, 2.0); // use function X[0:lenX] *= 2.0; // perform inline // ...
When examining the assembler code generated for the function, which includes my annotations (**) for this article we find:
PUBLIC ?VectorScaleCEAN@@YAXPAMHM@Z
?VectorScaleCEAN@@YAXPAMHM@Z PROC NEAR
; parameter 1(vec): eax
; parameter 2(len): edx
; parameter 3(scale): 20 + esp
.B3.1:
mov eax, DWORD PTR [4+esp]
mov edx, DWORD PTR [8+esp]
PUBLIC ?VectorScaleCEAN.@@YAXPAMHM@Z
?VectorScaleCEAN.@@YAXPAMHM@Z::
push edi
push ebx
** tests for valid len
test edx, edx
jle .B3.17
.B3.2:
** test for vec aligned at 16 byte boundary
mov ecx, eax
movss xmm1, DWORD PTR [20+esp]
and ecx, 15
je .B3.5
** test for vec aligned at 4 byte boundary
** not performed in our VectorScale due to knowledge about program
.B3.3:
test cl, 3
jne .B3.18
.B3.4:
** preamble of loop to run vec up to 16 byte aligned boundary
neg ecx
add ecx, 16
shr ecx, 2
.B3.5:
lea ebx, DWORD PTR [8+ecx]
cmp edx, ebx
jl .B3.18
.B3.6:
mov edi, edx
sub edi, ecx
and edi, 7
neg edi
add edi, edx
test ecx, ecx
jbe .B3.10
.B3.7:
xor ebx, ebx
.B3.8:
** loop to progress one float at a time till aligned
movss xmm0, DWORD PTR [eax+ebx*4]
mulss xmm0, xmm1
movss DWORD PTR [eax+ebx*4], xmm0
inc ebx
cmp ebx, ecx
jb .B3.8
.B3.10:
** load xmm0 (all 4 floats) with scale
movaps xmm0, xmm1
shufps xmm0, xmm0, 0
.B3.11:
** loop to run through 16 byte aligned portion of vec
** *** note additional optimization to double up moves
movaps xmm2, XMMWORD PTR [eax+ecx*4]
movaps xmm3, XMMWORD PTR [16+eax+ecx*4]
mulps xmm2, xmm0
mulps xmm3, xmm0
movaps XMMWORD PTR [eax+ecx*4], xmm2
movaps XMMWORD PTR [16+eax+ecx*4], xmm3
add ecx, 8
cmp ecx, edi
jb .B3.11
.B3.13:
** test for residual
cmp edi, edx
jae .B3.17
.B3.15:
** loop for residual data
movss xmm0, DWORD PTR [eax+edi*4]
mulss xmm0, xmm1
movss DWORD PTR [eax+edi*4], xmm0
inc edi
cmp edi, edx
jb .B3.15
.B3.17:
** return
pop ebx
pop edi
ret
** extra snip of code for use above
.B3.18:
xor edi, edi
jmp .B3.13
The CEAN code performed the same tests we did by hand performed:
- Sanity check on length of vector
- When necessary, individual float scaling until aligned at 16-byte boundary
- Additional test for vec not at multiple of 4 bytes
- 16-byte aligned scaling
- *** with additional optimization of doubling up the load/multiply/store
- The residual floats scaling
The CEAN code should out-perform our handwritten code (meaning we could have done better had we spent more time at it). The beauty of the CEAN approach is we do not have to write additional code for use with AVX. The compiler will do this automatically.
Additional notes for when producing your test code as .ASM listing for optimizations include:
- Use Release Build
- Turn off IPO (Inter-Procedural Optimization)
- Add __declspec(noinline)
- Do not use literals or a variable where the compiler optimization code will know the values.
- Consider using #pragma nounroll in places.
Caveats
CEAN is relatively new to the language syntax. As with any new feature, expect some issues with regard to code generation. The issues are identified and reported back to the product support department of your compiler vendor will be addressed and corrected in the incremental updates. The more users experimenting with CEAN, and reporting issues, the faster the CEAN capability matures into a solid feature.
While CEAN will not eliminate all the circumstances for using intrinsic functions, it will eliminate many such instances of their use. And produce code that is more portable. I suggest that you give CEAN serious consideration and provide feedback to the development teams working on it to make this extension available to us overworked programmers.

