SIMD-Enabled Vector Types with C#


Working with Hardware-Dependent Vector Types

When you work with any of the three fixed-size vectors (Vector2f, Vector3f, or Vector4f), the SIMD instructions work with 128-bit SIMD registers, which allow you to perform operations on four packed single-precision floating-point numbers. Many modern CPUs provide SIMD instructions capable of working with 256-bit SIMD registers. If you use the necessary instructions, you'll be able to perform the same operations on eight packed single-precision floating-point numbers; that is, twice the amount of data in a single operation.

One of the main advantages of working with fixed-size vectors is that you can easily replace your existing code with code that generates SIMD instructions. The main tradeoff is that the generated SIMD instructions might not take full advantage of the capabilities offered by the underlying hardware. If you need an algorithm to generate better SIMD instructions based on the underlying hardware, you will want to target 128-bit, 256-bit, and even 512-bit registers from a single piece of C# source code. This way, RyuJIT will generate the most appropriate SIMD instructions available on each CPU on which the code is executed.

As you already know, these kinds of optimizations usually add overhead and aren't the best option for every case. Thus, you must be careful when working with hardware-dependent vector types. Sometimes, the overhead added to pack and unpack data types kills the performance improvements achieved with the use of more optimal SIMD instructions. As with any code optimization, it is necessary to measure speed improvements.
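One way to take such measurements is a small timing helper built on System.Diagnostics.Stopwatch. The following is a minimal sketch, not a rigorous benchmark harness: the Timing class, the AverageMilliseconds method, and the scalar square-root baseline are hypothetical names and workloads chosen for illustration. A SIMD version of the same work would be passed to the helper in the same way and the two averages compared.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Runs the action once so the JIT compiles it, then returns the
    // average time per call in milliseconds over the given iterations.
    public static double AverageMilliseconds(Action action, int iterations)
    {
        action(); // warm-up call, excluded from the measurement
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        stopwatch.Stop();
        return stopwatch.Elapsed.TotalMilliseconds / iterations;
    }

    public static void Main()
    {
        // Hypothetical workload: one million floats.
        var data = new float[1000000];
        for (int i = 0; i < data.Length; i++)
        {
            data[i] = i;
        }
        var output = new float[data.Length];

        // Scalar baseline; a SIMD candidate would be timed the same way.
        double scalarMs = AverageMilliseconds(() =>
        {
            for (int i = 0; i < data.Length; i++)
            {
                output[i] = (float)Math.Sqrt(data[i]);
            }
        }, 100);

        Console.WriteLine("Scalar baseline: {0:F3} ms per pass", scalarMs);
    }
}
```

Run the comparison in a release build and outside the debugger, because the JIT may not apply its optimizations otherwise.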

The Microsoft.Bcl.Simd NuGet package provides the System.Numerics.Vector<T> struct, a vector that encapsulates a hardware-dependent number of values of type T. The Vector<T> struct abstracts a SIMD register, and the optimal number of elements is sizeof(SIMD register) / sizeof(T). For example, if the SIMD register is 256 bits wide and T is float (32 bits), the number of elements is 256 / 32 = 8. T can be any of the following numerical types:

  • double
  • int
  • long
  • float
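The arithmetic behind sizeof(SIMD register) / sizeof(T) can be checked with a few lines of plain C#. This is only an illustrative sketch: the LaneCount class and the ElementsPerRegister method are hypothetical names, and the code does not touch any SIMD API. It simply divides a register width by the size of each supported element type.

```csharp
using System;

public static class LaneCount
{
    // Number of elements of the given size (in bytes) that fit in a
    // register of the given width (in bits).
    public static int ElementsPerRegister(int registerBits, int elementBytes)
    {
        return registerBits / (elementBytes * 8);
    }

    public static void Main()
    {
        int[] registerWidths = { 128, 256, 512 };
        foreach (int bits in registerWidths)
        {
            Console.WriteLine(
                "{0}-bit register: double={1}, int={2}, long={3}, float={4}",
                bits,
                ElementsPerRegister(bits, sizeof(double)),
                ElementsPerRegister(bits, sizeof(int)),
                ElementsPerRegister(bits, sizeof(long)),
                ElementsPerRegister(bits, sizeof(float)));
        }
    }
}
```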

The JIT compiler determines the optimal number of elements, defined as sizeof(SIMD register) / sizeof(T), and makes it available in the static Length property. As I explained in the previous article, RyuJIT CTP4 is capable of emitting SIMD instructions included in the Streaming SIMD Extensions 2 (SSE2) instruction set, so the largest SIMD register it can work with is 128 bits wide. However, the final version of this JIT compiler should be able to support the more powerful Advanced Vector Extensions (AVX) instruction set, and the maximum SIMD register size should be 256 bits on capable hardware; that is, when you execute the application with RyuJIT on CPUs that provide AVX.

With SSE2 and its 128-bit SIMD registers, the values of the Length static property for the supported numerical types are:

  • Vector<double>.Length: 2
  • Vector<int>.Length: 4
  • Vector<long>.Length: 2
  • Vector<float>.Length: 4
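If you want to see the values the JIT picks on a given machine, you can print them at run time. The sketch below assumes the System.Numerics API as it later shipped with .NET, where the static property is named Count rather than Length and Vector.IsHardwareAccelerated reports whether SIMD code is generated. With the CTP package described here you would read Length instead. The VectorWidths class and RegisterBits method are hypothetical names.

```csharp
using System;
using System.Numerics;

public static class VectorWidths
{
    // Register width in bits implied by the element count the JIT reports.
    public static int RegisterBits()
    {
        return Vector<float>.Count * sizeof(float) * 8;
    }

    public static void Main()
    {
        Console.WriteLine("Hardware accelerated: " + Vector.IsHardwareAccelerated);
        Console.WriteLine("Vector<double> elements: " + Vector<double>.Count);
        Console.WriteLine("Vector<int> elements: " + Vector<int>.Count);
        Console.WriteLine("Vector<long> elements: " + Vector<long>.Count);
        Console.WriteLine("Vector<float> elements: " + Vector<float>.Count);
        Console.WriteLine("Implied register width: " + RegisterBits() + " bits");
    }
}
```

The printed counts depend on the CPU and the JIT in use, so the same binary can report different values on different machines.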

The following lines show an example of using Vector<float> to compute the square roots of packed single-precision floating-point values, taking into account the maximum number of values that the type can handle based on the underlying hardware. The code copies from an input array of float, computes the square roots, and then copies the results to an output array of float.

// Requires the Microsoft.Bcl.Simd NuGet package
using System.Numerics;

// valuesIn has 16 float elements
var valuesIn = new float[] {4f, 16f, 36f, 64f, 9f, 81f, 49f, 25f, 100f, 121f, 144f, 16f, 36f, 4f, 9f, 81f};
var valuesOut = new float[valuesIn.Length];
// Vector<float>.Length is equal to 4 when RyuJIT produces SSE2 instructions
for (int i = 0; i < valuesIn.Length; i += Vector<float>.Length) {
    // Each vector works with 4 float values, from i to i + 3, when RyuJIT produces SSE2 instructions
    var vectorIn = new Vector<float>(valuesIn, i);
    // Compute the square roots of all the elements in the vector
    var vectorOut = VectorMath.SquareRoot(vectorIn);
    // Copy the results into valuesOut, starting at index i
    vectorOut.CopyTo(valuesOut, i);
}

The code defines a valuesIn array with 16 float elements. To simplify the sample code, I've chosen a number of elements that is evenly divisible by the possible values of Vector<float>.Length. With SSE2 support, Vector<float>.Length is 4; and with future AVX support, Vector<float>.Length will have a value of 8. Obviously, in real-life algorithms, the number of elements might not be evenly divisible by the value of the Length static property, and you will need to handle the leftover elements. The valuesOut array has the same number of elements as valuesIn.

A for loop takes slices of the valuesIn array of float based on the value of the Vector<float>.Length static property, then generates a new Vector<float> named vectorIn with the appropriate number of elements from the valuesIn array. With SSE2 support, vectorIn will have 4 float values. With AVX support, vectorIn will have 8 float values. The code uses the Vector<float> constructor to create a vector from an array, copying the elements from index i (the specified index parameter) up to, but not including, index i + Vector<float>.Length.

Then, the code calls the VectorMath.SquareRoot method to generate a new Vector<float> named vectorOut with the computed square roots of the float elements included in vectorIn. As I explained earlier, with SSE2 support, the VectorMath.SquareRoot method generates the SQRTPS instruction (Compute Square Roots of Packed Single-Precision Floating-Point Values), equivalent to the _mm_sqrt_ps intrinsic. Thus, with a single SIMD instruction, the code computes the square roots of four float elements.

Next, the code calls the vectorOut.CopyTo method to copy the elements of the vector into the valuesOut array, starting at index i. As I explained, the method uses SIMD instructions.

Of course, the code adds significant overhead. However, the sample is a simple demo of how you can generate code that will be optimized to take advantage of more-powerful SIMD instructions and larger SIMD registers. When RyuJIT adds AVX support, the code can calculate twice as many square roots in a single instruction without any changes in the C# code. (But don't forget that I've simplified the code that distributes the data from the array to the vectors.)
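When the number of elements is not a multiple of the vector length, one common approach is to process full-width slices with vectors and finish the leftover elements with a scalar loop. The following is a minimal sketch of that idea, not part of the original sample. It assumes the System.Numerics API as it later shipped with .NET (Vector<float>.Count and Vector.SquareRoot) instead of the Length property and VectorMath.SquareRoot method from the CTP package. The SquareRoots class and Compute method are hypothetical names.

```csharp
using System;
using System.Numerics;

public static class SquareRoots
{
    // Computes the square root of every element, using Vector<float> for
    // full-width slices and a scalar loop for any leftover elements.
    public static float[] Compute(float[] valuesIn)
    {
        var valuesOut = new float[valuesIn.Length];
        int width = Vector<float>.Count;
        int i = 0;

        // Vectorized part: runs only while a full slice remains.
        for (; i <= valuesIn.Length - width; i += width)
        {
            var vectorIn = new Vector<float>(valuesIn, i);
            Vector.SquareRoot(vectorIn).CopyTo(valuesOut, i);
        }

        // Scalar tail: fewer than `width` elements are left.
        for (; i < valuesIn.Length; i++)
        {
            valuesOut[i] = (float)Math.Sqrt(valuesIn[i]);
        }

        return valuesOut;
    }

    public static void Main()
    {
        // 7 elements: not a multiple of 4 or 8.
        var result = Compute(new float[] { 4f, 16f, 36f, 64f, 9f, 81f, 49f });
        Console.WriteLine(string.Join(", ", result));
    }
}
```

The scalar tail keeps the vector constructor from reading past the end of the array, at the cost of a few non-SIMD iterations per call.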

Conclusion

While hardware-dependent vector types add code overhead, they are very interesting when you need to optimize algorithms that would benefit from larger SIMD register sizes in the future. Your investment in optimizing the code might have higher dividends in the final JIT compiler version with AVX support.

You probably won't want to rewrite all your algorithms to use SIMD instructions, but you will definitely find it interesting to consider working with the new vector types, operators, and methods that generate SIMD instructions because they offer performance improvements that you could not achieve in the past with .NET Framework. Don't forget that the documentation for Microsoft.Bcl.Simd is not updated with the latest improvements included in RyuJIT CTP4, and you will have to wait until the final JIT compiler and NuGet packages are updated to get serious speed improvements.


Gastón Hillar is a senior contributing editor at Dr. Dobb's.

Related Article

64-bit SIMD Code from C#

