Intel AVX2 Will Bring Integer Instructions with 256-bit SIMD Numeric Processing Capabilities
A week ago, Intel released public details on its next generation of the x86 architecture. The forthcoming microarchitecture, codenamed "Haswell," will introduce Intel AVX2, a new SIMD instruction set that extends Intel AVX. The instruction set details are already available, but you will have to wait until 2013 to use them, when the first members of the Haswell microprocessor family become available.
The second-generation Intel Core processor family, codenamed "Sandy Bridge," introduced Intel Advanced Vector Extensions (AVX) in 2011. Intel AVX is a 256-bit instruction set extension to Intel SSE that requires explicit operating system support.
Linux kernel version 2.6.30 or higher, Windows 7 Service Pack 1, and Windows Server 2008 R2 Service Pack 1 added the necessary state management to support Intel AVX. Because Intel AVX2 is also a 256-bit instruction set extension, operating system support shouldn't be a problem. Windows 7 developers had to wait for Service Pack 1 to take full advantage of Intel AVX, but Intel AVX2 won't require additional state management changes. Thus, if an operating system already supports Intel AVX, it will provide complete access to Intel AVX2.
Intel AVX2 instructions will follow the same programming model introduced by the Intel AVX instructions. One of the most interesting enhancements is the promotion of most Intel AVX 128-bit integer SIMD instructions to 256 bits. Intel AVX brought 256-bit floating-point SIMD instructions, but it didn't include 256-bit integer SIMD instructions. Intel AVX2 will allow you to use the 256-bit-wide YMM registers with integer data types.
For example, the PABSD instruction was part of the Supplemental Streaming SIMD Extensions 3 (SSSE3) introduced with the Intel Core 2 architecture. The PABSD mnemonic means packed absolute value for double-word. This assembly instruction takes a 128-bit source operand that contains four 32-bit signed integers. It produces a 128-bit result that contains the absolute value of each of the four 32-bit signed integers, packed in the same positions.
You can calculate the absolute values for four 32-bit signed integers with a single call to the PABSD instruction. If you have to calculate the absolute values for 1,000 32-bit signed integers, you can do it with 250 calls to this instruction instead of using one instruction for each 32-bit signed integer. Thus, you can achieve significant speedups. However, because the data must be packed into a register before calling the SIMD instruction and the output unpacked afterward, the extra code adds overhead, and it is important to measure it.
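To make the arithmetic concrete, here is a minimal C sketch of the four-at-a-time approach. The helper name `abs_i32_by_four` is hypothetical, and I'm assuming a compiler that defines `__SSSE3__` when SSSE3 code generation is enabled (GCC and Clang do with `-mssse3`). In that case, the loop uses the `_mm_abs_epi32` intrinsic, which corresponds to PABSD. Otherwise, a scalar stand-in keeps the sketch portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSSE3__)
#include <tmmintrin.h> /* SSSE3 intrinsics, including _mm_abs_epi32 */
#endif

/* Scalar absolute value. Like PABSD, it leaves INT32_MIN unchanged
   on the usual two's complement compilers. */
static int32_t abs_lane(int32_t x)
{
    uint32_t u = (uint32_t)x;
    return (int32_t)(x < 0 ? 0u - u : u);
}

/* Hypothetical helper: writes |src[i]| to dst[i] for every i in [0, n).
   Returns how many 4-element steps it took (one PABSD each when the
   SSSE3 path is compiled in). */
size_t abs_i32_by_four(const int32_t *src, int32_t *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    size_t steps = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4, ++steps) {
#if defined(__SSSE3__)
        /* Pack four integers into a 128-bit register, process, unpack. */
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_abs_epi32(v));
#else
        /* Portable stand-in for the four lanes. */
        for (size_t lane = 0; lane < 4; ++lane)
            dst[i + lane] = abs_lane(src[i + lane]);
#endif
    }
    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; ++i)
        dst[i] = abs_lane(src[i]);
    return steps;
}
```

For an array of 1,000 elements, the function returns 250. The unaligned load and store in the loop are the packing and unpacking overhead mentioned above.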
Intel AVX introduced the VPABSD instruction, which promoted PABSD to AVX, but didn't double the number of integers that can be processed at the same time. If you have to calculate the absolute values for 1,000 32-bit signed integers, you still need 250 calls to the AVX VPABSD instruction.
Intel AVX2 will promote the VPABSD instruction to 256 bits by allowing it to work with the 256-bit YMM registers. Thus, with an AVX2 VPABSD instruction that uses a YMM register, you will be able to double the number of integers that can be processed at the same time. If you have to calculate the absolute values for 1,000 32-bit signed integers, you can do it with 125 calls to the AVX2 VPABSD instruction that works with the YMM register. In addition, if you run SIMD instructions on multiple cores, you can reduce the number of calls that each core has to execute. In fact, I've already explained the advantages of running as many SIMD instructions in parallel as available physical cores in my previous post, High-Level Programming Languages Should Improve Support for SIMD Instructions.
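Here is a sketch of what the 256-bit version might look like in C. The helper name `abs_i32_by_eight` is hypothetical. I'm assuming the `_mm256_abs_epi32` intrinsic from `immintrin.h` for the 256-bit VPABSD, and a compiler that defines `__AVX2__` when AVX2 code generation is enabled. Without that, the code falls back to a scalar loop with the same eight-element grouping so you can still check the step count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX2__)
#include <immintrin.h> /* AVX2 intrinsics, including _mm256_abs_epi32 */
#endif

/* Scalar absolute value used for the fallback path and the tail. */
static int32_t abs_one(int32_t x)
{
    uint32_t u = (uint32_t)x;
    return (int32_t)(x < 0 ? 0u - u : u);
}

/* Hypothetical helper: writes |src[i]| to dst[i] for every i in [0, n).
   Returns how many 8-element steps it took (one 256-bit VPABSD each
   when the AVX2 path is compiled in). */
size_t abs_i32_by_eight(const int32_t *src, int32_t *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    size_t steps = 0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8, ++steps) {
#if defined(__AVX2__)
        /* Eight 32-bit integers fill one YMM register. */
        __m256i v = _mm256_loadu_si256((const __m256i *)(src + i));
        _mm256_storeu_si256((__m256i *)(dst + i), _mm256_abs_epi32(v));
#else
        /* Portable stand-in for the eight lanes. */
        for (size_t lane = 0; lane < 8; ++lane)
            dst[i + lane] = abs_one(src[i + lane]);
#endif
    }
    /* Leftover elements when n is not a multiple of 8. */
    for (; i < n; ++i)
        dst[i] = abs_one(src[i]);
    return steps;
}
```

With 1,000 elements, the function returns 125, half the number of steps needed with 128-bit registers. The loop body has the same shape as a 128-bit version would; only the register type, the intrinsic names, and the stride change.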
Because you won't have to change the programming model, you will be able to achieve impressive speedups by making minor changes to code that uses Intel AVX 128-bit integer SIMD instructions. However, remember that Intel AVX2 won't be available until 2013.
Intel AVX2 also provides enhancements in other areas, such as:
- Specific instructions to fetch non-contiguous data elements from memory.
- Instructions to simplify permute operations on data elements.
- Vector shift instructions with variable-shift count per data element.
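To give a rough idea of what these three kinds of operations compute, here are scalar reference models in C for eight 32-bit lanes. The function names are hypothetical, and the models are plain loops, not the real instructions. I'm assuming they mirror the per-lane behavior of the simplest unmasked dword forms, such as VPGATHERDD, VPERMD, and VPSLLVD; check the programming reference for the exact operand rules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 8 }; /* eight 32-bit elements fit in one 256-bit register */

/* Gather model: each lane fetches its own, possibly non-contiguous,
   element from memory. The caller must keep every index in bounds. */
void gather_model(const int32_t *base, const int32_t idx[LANES],
                  int32_t out[LANES])
{
    assert(base != NULL);
    for (int lane = 0; lane < LANES; ++lane)
        out[lane] = base[idx[lane]];
}

/* Permute model: each output lane picks any source lane. Only the low
   three bits of the selector are used. out must not alias src. */
void permute_model(const int32_t src[LANES], const uint32_t sel[LANES],
                   int32_t out[LANES])
{
    for (int lane = 0; lane < LANES; ++lane)
        out[lane] = src[sel[lane] & 7u];
}

/* Variable shift model: each lane is shifted left by its own count.
   Counts above 31 yield zero in this model. */
void shift_left_variable_model(const uint32_t src[LANES],
                               const uint32_t count[LANES],
                               uint32_t out[LANES])
{
    for (int lane = 0; lane < LANES; ++lane)
        out[lane] = count[lane] > 31u ? 0u : src[lane] << count[lane];
}
```

Before AVX2, each of these loops typically required several instructions or scalar code per element. The new instructions aim to handle all the lanes in a single call.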
You can download the full Intel AVX and AVX2 Programming Reference here. The PDF document is titled “Intel Advanced Vector Extensions,” but it has been updated with the forthcoming Intel AVX2 instruction set.