
Jim Dempsey

Dr. Dobb's Bloggers

CEAN: C/C++ Extensions for Array Notations

October 25, 2010

CEAN (short for "C/C++ Extensions for Array Notations") is a relatively new programming syntax for C/C++ programmers. Available in Intel's Parallel Studio 2011 suite of tools, CEAN is a sequential programming syntax that is not to be confused with the parallelization framework called Array Building Blocks, also available in Parallel Studio 2011.

When C/C++ programmers strive to eke out maximum performance from an application, they often incorporate SSE intrinsic functions into their code. SSE (Streaming SIMD Extensions) is the processor's small-vector class of instructions. These SSE intrinsic functions provide an abstraction between the underlying SSE instructions of the processor and the compiler's ability to assign and manipulate registers (including SSE registers) on the system. The benefit of using SSE intrinsic functions is that you can attain maximum performance.

However, the trouble with using SSE intrinsic functions is that you must take care in using them, and they require some knowledge of the processor architecture. Consequently, this programming feature tends to be used by experienced programmers.

Until recently, the processor architecture issues came down to whether the processor was running 32-bit or 64-bit code, where a 64-bit process has twice the number of SSE registers available to a 32-bit process. When you get into the more advanced features of the SSE intrinsic functions, knowledge of the revision level of the SSE instruction set is also important.

With the introduction of the AVX instruction set (Intel Sandy Bridge), the width of the SSE registers is doubled and additional instructions have been added. What this means for programmers is that you may need to write a permutation of optimized functions based on the following variables:

  • type (double, float, char, short, int32)
  • SSE present or not
  • SSE version
  • 32-bit or 64-bit
  • AVX

This permutation of architectures requires the creation -- and ongoing maintenance -- of a large number of functions, a burden that only grows as new instructions are introduced into the architecture; the dispatch sketch below illustrates the shape of the problem.
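To make the burden concrete, here is a minimal sketch of the hand-rolled runtime dispatch such a scheme typically requires. The per-ISA function names and the HasSSE/HasAVX feature-test helpers are hypothetical, for illustration only; each variant selected below is a separate function you must write and keep in sync:


	// hypothetical per-ISA variants -- each one hand-written and maintained
	void VectorScale_plain(float* vec, int len, float scale);
	void VectorScale_SSE(float* vec, int len, float scale);
	void VectorScale_AVX(float* vec, int len, float scale);

	// hypothetical CPUID-based feature tests
	bool HasSSE();
	bool HasAVX();

	void VectorScale(float* vec, int len, float scale)
	{
		if(HasAVX())
			VectorScale_AVX(vec, len, scale);	// widest vectors first
		else if(HasSSE())
			VectorScale_SSE(vec, len, scale);
		else
			VectorScale_plain(vec, len, scale);	// scalar fallback
	}

And that is only the ISA axis; multiply it by the element types listed above.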

Consider, then, the prospect of being able to write code that reduces the need for SSE intrinsic functions, eliminates reworking your code as SSE instructions evolve, and produces optimized code that, in many cases, is as good as your handwritten SSE intrinsics. CEAN is a step in this direction.

Let's take a look at an example that uses SSE intrinsic functions in a simple function to scale a vector of floats.


#include <xmmintrin.h>	// SSE intrinsics
#include <stdint.h>	// intptr_t

__declspec(noinline)
void VectorScale(float* vec, int len, float scale)
{
	// sanity check
	if(len <= 0)
		return;		// nothing to do
	// for added performance we will use 16-byte aligned load/stores
	// force vec to be aligned at 16 byte boundary
	// performing individual scaling until aligned
	#pragma novector
	while((intptr_t)vec & 15)
	{
		*vec++ *= scale;
		if(--len == 0)
			return;
	}
	// here when vec is aligned to 16 byte boundary
	// and there is a residual vector remaining
	__m128	mul = _mm_load1_ps(&scale);	// broadcast scale to 4 floats
	__m128	temp;	// 4 floats
	// get number of floats representing full SSE registers
	// 4 floats per SSE register
	int	lenFullSSE = len & ~3;
	for(int i=0; i < lenFullSSE; i += 4)
	{
		temp = _mm_load_ps(&vec[i]); // load aligned packed floats
		temp = _mm_mul_ps(temp, mul);// multiply
		_mm_store_ps(&vec[i], temp); // store aligned packed floats
	}
	// finished vector using aligned load/store
	// check for residual data
	int	residual = len & 3;
	if(residual)
	{
		vec += lenFullSSE;
		#pragma novector
		for(int i=0; i < residual; ++i)
			vec[i] *= scale;
	}
}

Note that the inner loop can be written without the temp by nesting the SSE intrinsic function calls in its place:


	for(int i=0; i < lenFullSSE; i += 4)
	{
		_mm_store_ps(
			&vec[i],
			_mm_mul_ps(
				_mm_load_ps(&vec[i]), mul));
	}

While you can run performance benchmarks to evaluate the effects of using explicit temps versus nested intrinsic function calls, I find it easier (and thus recommend) producing an .ASM listing file and looking at the emitted code.

Before balking at the idea of examining assembler code, consider that you do not need to understand the .ASM code beyond counting the lines of code and, more importantly, counting the number of memory references. It is generally the case that performance is inversely proportional to the number of memory references.

The two different programming styles of using SSE intrinsic functions (with temp and nested) happened to produce the same code for the inner loop:


.B2.9:
        movaps    xmm2, XMMWORD PTR [esi]
        inc       edi
        mulps     xmm2, xmm1
        movaps    XMMWORD PTR [esi], xmm2
        add       esi, 16
        cmp       edi, eax
        jb        .B2.9

In some cases, however, explicitly using a temporary will improve (or degrade) the performance of the code.

In the above loop, you have seven (7) instructions, two of which contain memory references. Memory references have the "… PTR [register] …" format, though depending on the assembler dialect (GNU style, for example) the PTR may be omitted, leaving "…[register]…".

Also, in the above code snippet, I removed the compiler annotations from the .ASM file. The compiler-produced .ASM file, inclusive of the annotations, looks something like this:


$LN118:
                                ; LOE eax edx ecx ebp esi
.B3.3:                          ; Preds .B3.2
$LN119:
        test      cl, 3                                         ;51.2

The $LNnnn: labels are line numbers, the ; comments (; LOE …, ; Preds …, and the trailing ;nn.n source references) are compiler annotations, and the .Bn.n labels are branch labels. Removing these compiler annotations makes for easier reading in this article, so I have omitted them. What I have added to the assembler listing for this article are comments of the form:

** my article comments here

With my annotations, those of you who are not familiar with assembly code should be able to make some sense of it. The complete assembler code for this function is:


	PUBLIC ?VectorScale@@YAXPAMHM@Z
?VectorScale@@YAXPAMHM@Z	PROC NEAR 
; parameter 1(vec): 16 + esp
; parameter 2(len): 20 + esp
; parameter 3(scale): 24 + esp
.B2.1:
** entry point of VectorScale
        push      esi
        push      edi
        push      ebx
** ebx = len
        mov       ebx, DWORD PTR [20+esp]
** (len <= 0)
        test      ebx, ebx
** edx = vec
        mov       edx, DWORD PTR [16+esp]
** jump to return if(len <= 0)
        jle       .B2.16
.B2.2:
**	while((intptr_t)vec & 15)
        test      dl, 15
        je        .B2.7
.B2.3:
** xmm0 = scale (lowest float of xmm0)
        movss     xmm0, DWORD PTR [24+esp]
.B2.4:
** xmm1 = *vec (lowest float of xmm1)
        movss     xmm1, DWORD PTR [edx]
** xmm1 *= scale
        mulss     xmm1, xmm0
** *vec = xmm1
        movss     DWORD PTR [edx], xmm1
** vec++
        add       edx, 4
** if(--len == 0) return
        dec       ebx
        je        .B2.16
.B2.5:
**	while((intptr_t)vec & 15)
        test      dl, 15
        jne       .B2.4
** here when vec at 16 byte boundary
.B2.7:
** int lenFullSSE = len & ~3;
        mov       ecx, ebx
        and       ecx, -4
** xmm0 = scale (lowest float of xmm0)
        movss     xmm0, DWORD PTR [24+esp]
** duplicate lowest float of xmm0 to all floats of xmm0
        shufps    xmm0, xmm0, 0
** for(int i=0; i < lenFullSSE; i += 4)
** loading registers for use in body of loop

        lea       esi, DWORD PTR [3+ecx]
        mov       eax, esi
        sar       eax, 1
        shr       eax, 30
        add       eax, esi
        sar       eax, 2
        test      ecx, ecx
        jle       .B2.11
.B2.8:
        xor       edi, edi
        mov       esi, edx
.B2.9:
** temp = _mm_load_ps(&vec[i]); // load aligned packed floats
        movaps    xmm1, XMMWORD PTR [esi]
        inc       edi
** temp = _mm_mul_ps(temp, mul);// multiply
        mulps     xmm1, xmm0
** _mm_store_ps(&vec[i], temp);
        movaps    XMMWORD PTR [esi], xmm1
** esi holds &vec[i], bump to next 4 floats
        add       esi, 16
** i < lenFullSSE
        cmp       edi, eax
        jb        .B2.9
.B2.11:
** int residual = len & 3;
        and       ebx, 3
** if(residual)
        je        .B2.16
.B2.12:
** for(int i=0; i < residual; ++i)
        jle       .B2.16
.B2.13:
** xmm0 = scale (one float)
        movss     xmm0, DWORD PTR [24+esp]
** vec += lenFullSSE;
        lea       eax, DWORD PTR [edx+ecx*4]
        xor       esi, esi
.B2.14:
** temp = vec[i] (one float)
        movss     xmm1, DWORD PTR [eax]
** ++i
        inc       esi
** temp *= scale
        mulss     xmm1, xmm0
** vec[i] = temp
        movss     DWORD PTR [eax], xmm1
** advance register holding &vec[i]
        add       eax, 4
** for(int i=0; i < residual; ++i)
        cmp       esi, ebx
        jl        .B2.14
.B2.16:
** return
        pop       ebx
        pop       edi
        pop       esi
        ret

This function, using the SSE intrinsic functions, produces fairly compact and near optimal code.

As for permutations of this function across architectures: this function makes use of only a few registers, so the 32-bit versus 64-bit architecture differences are not of concern. This leaves the variable type (double, float, char, short, int32) and AVX. Assuming you code for the P4 and later, the system will have at least SSE. What this means is you will have to write 10 similar functions (five if you omit AVX).
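For a sense of what the AVX member of that permutation would look like, here is a minimal sketch of the aligned inner kernel of the float variant, using the AVX intrinsics from <immintrin.h>. This is my illustration, not code from the article; it assumes vec is already 32-byte aligned and len is a multiple of 8 (the alignment run-up and residuals would be handled much as in the SSE version):


	#include <immintrin.h>	// AVX intrinsics

	__declspec(noinline)
	void VectorScaleAVXKernel(float* vec, int len, float scale)
	{
		__m256	mul = _mm256_set1_ps(scale);	// broadcast scale to 8 floats
		for(int i=0; i < len; i += 8)
		{
			__m256 temp = _mm256_load_ps(&vec[i]);	// load 8 aligned floats
			temp = _mm256_mul_ps(temp, mul);	// multiply
			_mm256_store_ps(&vec[i], temp);		// store 8 aligned floats
		}
	}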

Now let's look at the same function written using CEAN syntax:


__declspec(noinline)
void VectorScaleCEAN(float* vec, int len, float scale)
{
	vec[0:len] *= scale;
}

This function is reduced to one statement. In fact, you would most likely eliminate the function and use the statement inline. The following two lines are equivalent:


// ...
VectorScaleCEAN(X, lenX, 2.0f); // use function
X[0:lenX] *= 2.0f;              // perform inline
// ...
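For reference, the section notation generalizes beyond whole-vector operations: a section is written array[start:length], or array[start:length:stride] when a stride is wanted. The following lines are my own illustrations, not from the original listing:


	float X[100];
	X[0:100] = 1.0f;	// assign all 100 elements
	X[10:20] *= 2.0f;	// scale the 20 elements starting at index 10
	X[0:50:2] *= 2.0f;	// scale the 50 even-indexed elements (stride 2)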

Examining the assembler code generated for this function (again with my ** annotations for this article), we find:


       PUBLIC ?VectorScaleCEAN@@YAXPAMHM@Z
?VectorScaleCEAN@@YAXPAMHM@Z	PROC NEAR 
; parameter 1(vec): eax
; parameter 2(len): edx
; parameter 3(scale): 20 + esp
.B3.1:
        mov       eax, DWORD PTR [4+esp]
        mov       edx, DWORD PTR [8+esp]
	PUBLIC ?VectorScaleCEAN.@@YAXPAMHM@Z
?VectorScaleCEAN.@@YAXPAMHM@Z::
        push      edi
        push      ebx

** tests for valid len

        test      edx, edx
        jle       .B3.17
.B3.2:

** test for vec aligned at 16 byte boundary

        mov       ecx, eax
        movss     xmm1, DWORD PTR [20+esp]
        and       ecx, 15
        je        .B3.5
** test for vec aligned at 4 byte boundary
** not performed in our VectorScale due to knowledge about program
.B3.3:
        test      cl, 3
        jne       .B3.18
.B3.4:
** preamble of loop to run vec up to 16 byte aligned boundary
        neg       ecx
        add       ecx, 16
        shr       ecx, 2
.B3.5:
        lea       ebx, DWORD PTR [8+ecx]
        cmp       edx, ebx
        jl        .B3.18
.B3.6:
        mov       edi, edx
        sub       edi, ecx
        and       edi, 7
        neg       edi
        add       edi, edx
        test      ecx, ecx
        jbe       .B3.10
.B3.7:
        xor       ebx, ebx
.B3.8:
** loop to progress one float at a time till aligned
        movss     xmm0, DWORD PTR [eax+ebx*4]
        mulss     xmm0, xmm1
        movss     DWORD PTR [eax+ebx*4], xmm0
        inc       ebx
        cmp       ebx, ecx
        jb        .B3.8
.B3.10:
** load xmm0 (all 4 floats) with scale
        movaps    xmm0, xmm1
        shufps    xmm0, xmm0, 0
.B3.11:
** loop to run through 16 byte aligned portion of vec
** *** note additional optimization to double up moves
        movaps    xmm2, XMMWORD PTR [eax+ecx*4]
        movaps    xmm3, XMMWORD PTR [16+eax+ecx*4]
        mulps     xmm2, xmm0
        mulps     xmm3, xmm0
        movaps    XMMWORD PTR [eax+ecx*4], xmm2
        movaps    XMMWORD PTR [16+eax+ecx*4], xmm3
        add       ecx, 8
        cmp       ecx, edi
        jb        .B3.11
.B3.13:
** test for residual
        cmp       edi, edx
        jae       .B3.17
.B3.15:
** loop for residual data
        movss     xmm0, DWORD PTR [eax+edi*4]
        mulss     xmm0, xmm1
        movss     DWORD PTR [eax+edi*4], xmm0
        inc       edi
        cmp       edi, edx
        jb        .B3.15
.B3.17:
** return
        pop       ebx
        pop       edi
        ret
** extra snip of code for use above
.B3.18:
        xor       edi, edi
        jmp       .B3.13

The CEAN code performs the same steps our hand-written code performed:

  • Sanity check on length of vector
  • When necessary, individual float scaling until aligned at 16-byte boundary
  • Additional test for vec not at multiple of 4 bytes
  • 16-byte aligned scaling
  • *** with additional optimization of doubling up the load/multiply/store
  • The residual floats scaling

The CEAN code should outperform our handwritten code (meaning we could have done better had we spent more time at it). The beauty of the CEAN approach is that we do not have to write additional code for use with AVX; the compiler will do this automatically.
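For example, with the Intel compiler you would simply recompile for the new target. A switch along the lines of /QaxAVX (hedged; the exact spelling varies by compiler version, so check your documentation) asks the compiler to generate an auto-dispatched AVX code path alongside the default path:


	icl /O2 /QaxAVX VectorScaleCEAN.cpp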

Additional notes for producing an .ASM listing of your test code for optimization study (see the example command line after this list):

  • Use Release Build
  • Turn off IPO (Inter-Procedural Optimization)
  • Add __declspec(noinline)
  • Do not use literals or variables whose values the compiler's optimizer can determine at compile time.
  • Consider using #pragma nounroll in places.
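As a hedged example (verify the switches against your compiler's documentation; these are from memory), the Intel compiler on Windows can produce a source-annotated listing with something like:


	icl /O2 /FAs /FaVectorScale.asm VectorScale.cpp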

Caveats

CEAN is relatively new to the language syntax. As with any new feature, expect some issues with regard to code generation. Issues that are identified and reported to your compiler vendor's product support department will be addressed and corrected in the incremental updates. The more users experimenting with CEAN, and reporting issues, the faster the CEAN capability matures into a solid feature.

While CEAN will not eliminate all the circumstances for using intrinsic functions, it will eliminate many such instances of their use, and it produces code that is more portable. I suggest that you give CEAN serious consideration and provide feedback to the development teams working on it to make this extension available to us overworked programmers.
