Stephen Blair-Chappell is a Technical Consulting Engineer at Intel, and has worked in the Intel Compiler Lab for the last 10 years.
The Intel compiler has a feature that can make some applications run much faster -- auto-vectorisation. With a flick of the switch, some code can be sped up significantly. A number of times I have seen programs run much faster just by using this option without changing a line of code. With vectorisation, the compiler uses a set of advanced Single Instruction Multiple Data (SIMD) instructions, which ornate most modern CPUs.
Love at First Sight
"I've got to have the latest Intel compiler; it just doubled the speed of my code". So wrote a developer recently having enabled the auto-vectoriser in the compiler. Over the last 18 months I've enjoyed being at the "coal-face," working with developers helping them to optimise their code. Even as recently as this last week I worked with a company whose application speeded up by a factor of ten -- that is a 1000% speedup. No wonder I'm head-over-heels. These experiences make me feel good, make the developer look good in front of his manager, and also promotes the product that helps pay my mortgage.
Now I know that some of my colleagues will be jumping up and down at this point saying "Stephen you've got to lower people's expectations -- tell them some code can never be vectorised." Glass half-empty comes to mind at this point.
So How Does It Work?
Silicon manufacturers have enhanced the CPUs they produce in each new generation, adding extra on-chip extensions to the core -- one area of interest being support for maths and floating-point instructions. In the days if the Intel 386, a maths co-processor the Intel 387 was one such extension. The latest generation of Intel processors include support for MMX, SIMD extensions SSE, SSE2, SSE3, SSSE3, and SSE4.
SIMD instructions are capable of doing the same calculation on multiple data. So, for example, it is possible to perform four floating-point operations in one instruction; see Figure 1.
To take advantage of these new instructions the C\C++ programmer could insert intrinsic instructions direct into their source code, but this is hard work and would make the source code non-portable.
Luckily, some clever people at the Intel Compiler Labs looked at ways of automatically using these SIMD instructions and came up with a technique known as "auto-vectorisation."
When auto-vectorisation is enabled, the compiler looks for opportunities where several traditional calculations could be replaced by a single SIMD instruction. A typical example of this would be a loop containing a floating-point operation. The compiler can effectively reduce the loop count by a factor of 4 by replacing the floating-point instruction with a SIMD instruction.
Putting It In Words
There are times when applications may not be suitable candidates for vectorisation. Sometimes code which potentially could be vectorised is not done so because of dependencies or other code anomalies. When a piece of code doesn't vectorise, the reporting feature of the vectoriser can help one understand why not. There are three levels of report, the last level being the most verbose, giving both reasons for failure and success; see Figure 2.
I must admit that most times I've used vectorisation I've hardly ever modified any code, apart from perhaps inserting the odd #pragma ivdep to tell the compiler to ignore a particular loop dependency. The compiler's vectorisation reports can be a great aid to helping the developer shoe-horn vectorisation into more awkward code.
The vectoriser will try to work out if there is sufficient work to be done before it transforms some code. You can see in Figure 3 that the compiler didn't vectorise the first code snippet, however when the loop counter limit in increased to 100 the vectoriser decides that there is sufficient work to warrant transformation.
Keeping the Family Happy
So what happens when I run my latest vectorised code on an old PC that doesn't support the newer SIMD instructions? Simple -- it crashes, or at least it will crash unless I also use another clever option that the Intel compiler brings.
The Generate Alternative Code Paths compiler option, tells the compiler to generate an alternate bypass alongside the code section that has been vectorised. More than one bypass can be added, so you could have support for several specific CPUs along with a generic version of the vectorised code that will run on all processors. At the start of the vectorised code the CPUID of the machine you're running on is determined, based on that value the most appropriate code route is then taken. The compiler options for the alternate paths are in Figure 4. There is a small increase in code size because of the duplicated code paths. Checking the CPUID also introduces a few extra instructions that have to be executed but the impact on the runtime is negligible.
Why Not Give It a Go!
Let's be truthful, I work for Intel in the Compiler Labs -- so you'd expect me to plug the Intel products. Even so, why not let the compiler speak for itself?
You can download a free evaluation version from www.intel.com/software. It's a plug-and-play replacement for the Microsoft and GCC compiler and fairly easy to integrate into an existing project. Nice thing is that during the evaluation period there's free online support and access to a user forum.
While you're trying out vectorisation on you code, you could also experiment with other optimisation options such as inter procedural optimisation, profile guided optimisation and auto-parallelism all of which are well documented in the online help.
This month's Dr. Dobb's Journal
- Hear Ray Ozzie, Founder of Talko at Enterprise Connect - Enterprise Connect Orlando
- Learn Newest Developments in Video at Enterprise Connect - Enterprise Connect Orlando
- Attend the Collaboration Track at Interop Las Vegas - Interop Las Vegas 2016
- Interop Las Vegas Cloud Connect Track - Interop Las Vegas 2016
- Come to Interop Las Vegas, May 2 - 6, 2016 - Interop Las Vegas 2016
- The Role of the WAN in Your Hybrid Cloud
- Stop Malware, Stop Breaches? How to Add Values Through Malware Analysis
- Securosis Analyst Report: Security and Privacy on the Encrypted Network
- Overview: Cloud Operations Platfom for AWS
- Coding to standards and quality: supply-chain application development
Most Recent Premium Content
- November - Mobile Development
- August - Web Development
- May - Testing
- February - Languages
- Open Source
- Windows and .NET programming
- The Design of Messaging Middleware and 10 Tips from Tech Writers
- Parallel Array Operations in Java 8 and Android on x86: Java Native Interface and the Android Native Development Kit
- January - Mobile Development
- February - Parallel Programming
- March - Windows Programming
- April - Programming Languages
- May - Web Development
- June - Database Development
- July - Testing
- August - Debugging and Defect Management
- September - Version Control
- October - DevOps
- November- Really Big Data
- December - Design
- January - C & C++
- February - Parallel Programming
- March - Microsoft Technologies
- April - Mobile Development
- May - Database Programming
- June - Web Development
- July - Security
- August - ALM & Development Tools
- September - Cloud & Web Development
- October - JVM Languages
- November - Testing
- December - DevOps
Dr. Dobb's Journal
Dr. Dobb's Tech Digest