### Working with Hardware-Dependent Vector Types

When you work with any of the three fixed-size vectors (`Vector2f`, `Vector3f`, or `Vector4f`), the generated SIMD instructions work with 128-bit SIMD registers, which allow you to perform operations on four packed single-precision floating-point numbers. Many modern CPUs provide SIMD instructions capable of working with 256-bit SIMD registers. If you use the necessary instructions, you can perform the same operations on eight packed single-precision floating-point numbers; that is, twice the amount of data in a single operation.

One of the main advantages of working with fixed-size vectors is that you can easily replace your existing code with code that generates SIMD instructions. The main tradeoff is that the generated SIMD instructions might not take full advantage of the capabilities offered by the underlying hardware. If you need an algorithm that adapts to the best SIMD instructions the underlying hardware offers, you want a way to target 128-bit, 256-bit, and even 512-bit registers from a single piece of C# source code. This way, RyuJIT will generate the most appropriate SIMD instructions available on each CPU on which the code executes.

As you already know, these kinds of optimizations usually add overhead and aren't the best option for every case. Thus, you must be careful when working with hardware-dependent vector types. Sometimes, the overhead added to pack and unpack data types kills the performance improvements achieved with the use of more optimal SIMD instructions. As with any code optimization, it is necessary to measure speed improvements.

The `Microsoft.Bcl.Simd` NuGet package provides the `System.Numerics.Vector<T>` struct, a vector that encapsulates a specific number of values of type `T`. The `Vector<T>` struct abstracts a SIMD register, and the optimal number of elements is `sizeof(SIMD register) / sizeof(T)`. For example, if `sizeof(SIMD register)` is equal to 256 bits and `T` is `float` (`sizeof(float)` is 32 bits), the number of elements is going to be 256 / 32 = 8 elements. `T` can be any of the following numerical types:

- `double`
- `int`
- `long`
- `float`

The JIT compiler determines the optimal number of elements, defined as `sizeof(SIMD register) / sizeof(T)`, and makes it available in the static `Length` property. As I explained in the previous article, RyuJIT CTP4 is capable of emitting SIMD instructions included in the Streaming SIMD Extensions 2 (SSE2) instruction set, so the maximum SIMD register it can work with is 128 bits. However, the final version of this JIT compiler should be able to support the more powerful Advanced Vector Extensions (AVX) instruction set, and the maximum SIMD register size should be 256 bits on capable hardware; that is, when you execute the application with RyuJIT on CPUs that provide AVX.

With SSE2 and its 128-bit SIMD registers, the values of the static `Length` property for the supported numerical types are:

- `Vector<double>.Length: 2`
- `Vector<int>.Length: 4`
- `Vector<long>.Length: 2`
- `Vector<float>.Length: 4`
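You can verify the element counts on your own hardware by reading the static `Length` property. A minimal sketch, assuming a project that references the `Microsoft.Bcl.Simd` package (the output depends on the SIMD register size the JIT compiler targets, so no fixed results are shown):

```csharp
using System;
using System.Numerics;

class LengthDemo
{
    static void Main()
    {
        // Each value is sizeof(SIMD register) / sizeof(T)
        // for the register size the JIT compiler targets.
        Console.WriteLine("Vector<double>.Length: {0}", Vector<double>.Length);
        Console.WriteLine("Vector<int>.Length:    {0}", Vector<int>.Length);
        Console.WriteLine("Vector<long>.Length:   {0}", Vector<long>.Length);
        Console.WriteLine("Vector<float>.Length:  {0}", Vector<float>.Length);
    }
}
```

Running this under RyuJIT CTP4 with SSE2 should print the values listed above; on a future AVX-capable JIT, the counts would double.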

The following lines show an example of using `Vector<float>` to compute square roots of packed single-precision floating-point values, considering the maximum number of values that the type can handle based on the underlying hardware. The code copies from an input array of `float`s, computes the square roots, and then copies the results to an output array of `float`s.

```csharp
// valuesIn has 16 float elements
var valuesIn = new float[] { 4f, 16f, 36f, 64f, 9f, 81f, 49f, 25f,
                             100f, 121f, 144f, 16f, 36f, 4f, 9f, 81f };
var valuesOut = new float[valuesIn.Length];
// Vector<float>.Length is equal to 4 when RyuJIT produces SSE2 instructions
for (int i = 0; i < valuesIn.Length; i += Vector<float>.Length)
{
    // Each vector works with the 4 float values from i to i + 3
    // when RyuJIT produces SSE2 instructions
    var vectorIn = new Vector<float>(valuesIn, i);
    var vectorOut = VectorMath.SquareRoot(vectorIn);
    vectorOut.CopyTo(valuesOut, i);
}
```

The code defines a `valuesIn` array with 16 `float` elements. To simplify the sample code, I've chosen a number of elements that divides evenly by the possible values of `Vector<float>.Length`. With SSE2 support, `Vector<float>.Length` is 4; with future AVX support, `Vector<float>.Length` will have a value of 8. Obviously, in real-life algorithms, you might have numbers of elements that don't divide evenly by the value of the `Length` static property, and you will need to handle that case. The `valuesOut` array has the same number of elements defined in `valuesIn`.
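One common way to handle an array whose length is not an exact multiple of `Vector<float>.Length` is to process as many full vectors as possible and fall back to scalar code for the remaining elements. A minimal sketch of that pattern, reusing the `VectorMath.SquareRoot` method from the article's sample (the method name `SquareRoots` is my own, for illustration):

```csharp
using System;
using System.Numerics;

static class RemainderDemo
{
    // Computes square roots of valuesIn into valuesOut, for any array length.
    static void SquareRoots(float[] valuesIn, float[] valuesOut)
    {
        int i = 0;
        // Index of the first element that doesn't start a full vector.
        int lastBlock = valuesIn.Length - (valuesIn.Length % Vector<float>.Length);
        // Vectorized loop over the full blocks.
        for (; i < lastBlock; i += Vector<float>.Length)
        {
            var vectorIn = new Vector<float>(valuesIn, i);
            VectorMath.SquareRoot(vectorIn).CopyTo(valuesOut, i);
        }
        // Scalar fallback for the tail that doesn't fill a whole vector.
        for (; i < valuesIn.Length; i++)
        {
            valuesOut[i] = (float)Math.Sqrt(valuesIn[i]);
        }
    }
}
```

Because the tail is at most `Vector<float>.Length - 1` elements, the scalar fallback adds only a small constant cost regardless of the array size.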

A `for` loop takes slices of the `valuesIn` array of `float`s based on the value of the `Vector<float>.Length` static property, then generates a new `Vector<float>` named `vectorIn` with the appropriate number of elements from the `valuesIn` array. With SSE2 support, `vectorIn` will have 4 `float` values. With AVX support, `vectorIn` will have 8 `float` values. The code uses the `Vector<float>` constructor to create a vector from an array, copying the elements from `i` (the specified index parameter) up to `i + Vector<float>.Length - 1`.

Then, the code calls the `VectorMath.SquareRoot` method to generate a new `Vector<float>` named `vectorOut` with the computed square roots of the `float` elements included in `vectorIn`. As I explained earlier, with SSE2 support, `VectorMath.SquareRoot` generates the `SQRTPS` instruction (Compute Square Roots of Packed Single-Precision Floating-Point Values), equivalent to the `_mm_sqrt_ps` intrinsic. Thus, with a single SSE2 SIMD instruction, the code computes the square roots of four `float` elements.

Next, the code calls the `vectorOut.CopyTo` method to copy the elements of the vector into the `valuesOut` destination array, starting from the `i` index. As I explained, this method also uses SIMD instructions.

Of course, the code adds significant overhead. However, the sample is a simple demo of how you can generate code that will be optimized to take advantage of more-powerful SIMD instructions and larger SIMD registers. When RyuJIT adds AVX support, the code can calculate twice as many square roots in a single instruction without any changes in the C# code. (But don't forget that I've simplified the code that distributes the data from the array to the vectors.)
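Because the packing and unpacking overhead can outweigh the gains, the only way to know whether the vectorized version actually wins is to measure it. A minimal measurement sketch with `Stopwatch`; the array size and iteration count are arbitrary choices for illustration (4,096 divides evenly by both 4 and 8, so no tail handling is needed here):

```csharp
using System;
using System.Diagnostics;
using System.Numerics;

class SqrtBenchmark
{
    static void Main()
    {
        var valuesIn = new float[4096];
        var valuesOut = new float[valuesIn.Length];
        var random = new Random(42);
        for (int i = 0; i < valuesIn.Length; i++)
        {
            valuesIn[i] = (float)random.NextDouble() * 1000f;
        }

        const int iterations = 10000;

        // Scalar baseline: one Math.Sqrt call per element.
        var sw = Stopwatch.StartNew();
        for (int n = 0; n < iterations; n++)
        {
            for (int i = 0; i < valuesIn.Length; i++)
            {
                valuesOut[i] = (float)Math.Sqrt(valuesIn[i]);
            }
        }
        sw.Stop();
        Console.WriteLine("Scalar:     {0} ms", sw.ElapsedMilliseconds);

        // Vectorized version: one SQRTPS per Vector<float>.Length elements.
        sw.Restart();
        for (int n = 0; n < iterations; n++)
        {
            for (int i = 0; i < valuesIn.Length; i += Vector<float>.Length)
            {
                var vectorIn = new Vector<float>(valuesIn, i);
                VectorMath.SquareRoot(vectorIn).CopyTo(valuesOut, i);
            }
        }
        sw.Stop();
        Console.WriteLine("Vectorized: {0} ms", sw.ElapsedMilliseconds);
    }
}
```

The measured ratio depends on the CPU and the JIT version, so run the comparison on the hardware you actually target before committing to the vectorized code.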

### Conclusion

While hardware-dependent vector types add code overhead, they are very interesting when you need to optimize algorithms that would benefit from larger SIMD register sizes in the future. Your investment in optimizing the code might pay higher dividends in the final JIT compiler version with AVX support.

You probably won't want to rewrite all your algorithms to use SIMD instructions, but you will definitely find it interesting to consider working with the new vector types, operators, and methods that generate SIMD instructions, because they offer performance improvements that you could not achieve in the past with the .NET Framework. Don't forget that the documentation for `Microsoft.Bcl.Simd` is not updated with the latest improvements included in RyuJIT CTP4, and you will have to wait until the final JIT compiler and NuGet packages are updated to get serious speed improvements.

*Gastón Hillar is a senior contributing editor at Dr. Dobb's.*