Amazing Performance Gains Using SSE Intrinsics
I'm in the middle of writing up some case studies based on interviews with users of Intel Parallel Studio. As part of the exercise I set myself the goal of duplicating every technique the project engineers used.
Developers in two of the first three case studies used SSE2 (Streaming SIMD Extensions 2) intrinsics to speed up their code. I must admit I had never really used intrinsics before (apart from inserting a few prefetch instructions), and I thought they were a bit 'out-of-fashion'. I was amazed that a couple of hours of work was enough to get the code running nearly 20 times faster.
The code that I wrote does a lot of array manipulation, iterating through an array millions of times to test for values. By rearranging the code to use an array of 128-bit SSE values (__m128i), each holding four 32-bit integers, rather than an array of plain integers, I could test four values per comparison, and that is where the dramatic increase in speed came from.
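To make the rearrangement concrete, here is a minimal sketch of the idea, not the code from the case study. The function names (pack_blocks, contains_scalar, contains_sse2) are hypothetical, and it assumes the array length is a multiple of four. It also uses _mm_set1_epi32, _mm_loadu_si128 and _mm_movemask_epi8, which are standard SSE2 intrinsics but my choice for this illustration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Scalar baseline: one comparison per loop iteration. */
int contains_scalar(const int *values, size_t count, int target)
{
    for (size_t i = 0; i < count; ++i) {
        if (values[i] == target)
            return 1;
    }
    return 0;
}

/* Hypothetical helper: copy plain integers into an array of 128-bit values.
   Assumes count is a multiple of 4 (four 32-bit integers per __m128i). */
void pack_blocks(const int *values, size_t count, __m128i *blocks)
{
    assert(count % 4 == 0);
    for (size_t i = 0; i < count / 4; ++i)
        blocks[i] = _mm_loadu_si128((const __m128i *)(values + 4 * i));
}

/* SSE2 version: one compare tests four integers at once. */
int contains_sse2(const __m128i *blocks, size_t block_count, int target)
{
    const __m128i needle = _mm_set1_epi32(target);  /* target in all four lanes */
    for (size_t i = 0; i < block_count; ++i) {
        /* Each matching lane becomes all ones, each other lane all zeros. */
        __m128i eq = _mm_cmpeq_epi32(blocks[i], needle);
        /* A non-zero byte mask means at least one lane matched. */
        if (_mm_movemask_epi8(eq) != 0)
            return 1;
    }
    return 0;
}
```

The loop in contains_sse2 runs a quarter as many iterations as the scalar one. The rest of any speed-up depends on the surrounding code, so treat this as an illustration of the data layout rather than a benchmark.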
Hard Work
Getting familiar with the different intrinsics is hard work. Most of the time I spent rewriting this code went on reading the description of each intrinsic in the compiler manual. Even now I'm not sure my code makes the most effective use of the intrinsics.
The method I followed was to:
- Write the original code
- Use Intel Parallel Amplifier to check for hotspots (in my case the hotspot was a function testing if the array held a certain value)
- Rewrite the hotspot code using SSE intrinsics
- Re-run Parallel Amplifier
In the code I used the following data type and intrinsics:
__m128i MyArray[MY_MAX_NUM]; // array of 128-bit values
_mm_and_si128( ...) // AND
_mm_setzero_si128() // init to zero
_mm_cmpeq_epi32(...) // IS EQUAL?
_mm_storeu_si128(...) // COPY SSE2 result into (unaligned) memory
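As a rough illustration of how these intrinsics fit together, the sketch below counts how many 32-bit values in an array of __m128i have none of the bits in a mask set. This is an assumed example, not the case-study code: the function name count_lanes_without_bits and its parameters are hypothetical, and _mm_set1_epi32 is an extra SSE2 intrinsic not in the list above.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical example: count the 32-bit lanes where (value & bit_mask) == 0. */
size_t count_lanes_without_bits(const __m128i *blocks, size_t block_count,
                                int bit_mask)
{
    assert(blocks != NULL || block_count == 0);

    const __m128i mask = _mm_set1_epi32(bit_mask);  /* mask in all four lanes */
    const __m128i zero = _mm_setzero_si128();       /* init to zero */
    size_t matches = 0;

    for (size_t i = 0; i < block_count; ++i) {
        __m128i masked  = _mm_and_si128(blocks[i], mask);   /* AND */
        __m128i is_zero = _mm_cmpeq_epi32(masked, zero);    /* IS EQUAL? */

        /* Copy the four per-lane results into ordinary memory. */
        int lanes[4];
        _mm_storeu_si128((__m128i *)lanes, is_zero);

        /* A lane is all ones (non-zero) where the comparison was true. */
        for (int lane = 0; lane < 4; ++lane) {
            if (lanes[lane] != 0)
                ++matches;
        }
    }
    return matches;
}
```

Storing the result and looping over the four lanes keeps the sketch to the intrinsics listed above. A tighter version might reduce the comparison result in registers instead, which is the kind of choice I'm still not sure I've got right.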
A more complete description of these intrinsics can be found in the Parallel Studio help. These intrinsics are also supported by the Microsoft compiler. You can find some introductory material on SSE intrinsics here.
Using SSE intrinsics now goes to near the top of my list of things you should consider when optimising code.

