Microsoft's C++ AMP Unveiled


Tiled C++ AMP Version

To reduce the number of RAM accesses the parallel algorithm performs, we would like to share input data between threads, rather than letting each thread independently load the same data from global memory. C++ AMP makes this possible through "thread tiling" and "tile static memory." A tile is both a unit of memory sharing and a unit of synchronization between threads: threads in a tile can synchronize using tile barriers, and they can also employ atomic operations on tile static memory.

Our strategy for tiling is to load tile_size worth of input at a time. This is done in tandem by all threads in the tile, with each thread loading one element. After completing this collaborative load, each thread determines whether any of the loaded data needs to be factored into its local moving-average calculation. The code follows:

	void sma_amp_tiled(int series_length, const float *series, float *moving_average, int window)
	{
	  array_view<const float>        arr_in(series_length, series);
	  array_view<writeonly<float>>   arr_out(series_length, moving_average);
	  static const int tile_size = 512;
	  int number_of_groups = (series_length + tile_size - 1) / tile_size;
	  int total_threads = number_of_groups * tile_size;
	  parallel_for_each(
	    grid<1>(extent<1>(total_threads)).tile<tile_size>(),
	    [=](tiled_index<tile_size> tiled_idx) restrict(direct3d) {
	      int gid = tiled_idx.global[0];
	      int lid = tiled_idx.local[0];

	      int tile_min_thread = tiled_idx.tile_origin[0];
	      int tile_max_thread = tile_min_thread + tile_size - 1;

	      int tile_min_ref = tile_min_thread - (window - 1) < 0 ? 0 : tile_min_thread - (window - 1);
	      int tile_max_ref = tile_max_thread >= arr_in.extent[0] ? arr_in.extent[0] - 1 : tile_max_thread;

	      int my_min_ref = gid - (window - 1) < tile_min_ref ? tile_min_ref : gid - (window - 1);
	      int my_max_ref = gid > tile_max_ref ? tile_max_ref : gid;

	      float sum = 0.0f;
	      for (int i = tile_min_ref; i <= tile_max_ref; i += tile_size) {
	        tile_static float tile[tile_size];

	        // Collaborative load: each thread fetches one element. Guard the read,
	        // since padding can push i + lid past the end of the input.
	        if (i + lid < arr_in.extent[0])
	          tile[lid] = arr_in[i + lid];
	        tiled_idx.barrier.wait();

	        int ith_min_ref = i < my_min_ref ? my_min_ref : i;
	        int ith_max_ref = i + (tile_size - 1) > my_max_ref ? my_max_ref : i + (tile_size - 1);

	        for (int j = ith_min_ref; j <= ith_max_ref; j++)
	          sum += tile[j - i];
	        tiled_idx.barrier.wait();
	      }
	      if (gid >= (window - 1) && gid < series_length)
	        arr_out[gid] = sum / window;
	    });
	}

Explaining the algorithm in minute detail falls outside the scope of this short introduction to C++ AMP, but let's focus on a few common patterns that appear in the code. The launch code, from the tile_size declaration through the call to parallel_for_each, takes the number of elements in the problem and launches a tiled computation over this index range. To do that, we select a tile size, pad the number of threads up to a multiple of the tile size, and then launch a kernel over a tiled_grid. The parameter to the lambda is correspondingly a tiled_index, which provides information about the thread's location both within its local tile and within the global computation.

The block of index math that follows, from the gid and lid declarations through my_max_ref, figures out which pieces of data the thread tile will load collaboratively and which indices within this range each specific thread is interested in.

The main loop iterates over the index set for the tile, loading memory from the array_view into the variable tile, which is annotated with the tile_static storage class; this tells the system to make tile a variable that is accessible by all threads within a given tile. After storing the values it read into tile, and before proceeding to read the values produced by other threads, each thread must reach a barrier and wait for its peers, which the code expresses as tiled_idx.barrier.wait().

Finally, the results are stored by the if statement at the end of the lambda. We must check whether the current thread corresponds to an element in the output: since we padded the grid over which the parallel_for_each was launched, some threads may not correspond to any output element.

The tiled version is naturally much more complicated than the simple parallel version, but the rewards are significant, too; for many inputs we have tested, it is twice as fast.

Additional optimizations are possible, both in the high-level algorithm and tactically in the way we code using C++ AMP. These optimizations will be covered in future articles. For many problems, however, the speedup offered by a simple parallel algorithm is by itself compelling enough (as is the case here). So the advice as always will be: Invest in optimization only as much as your requirements call for.

And Now What?

This article has attempted to give you a taste of C++ AMP, but there is a long list of features that we simply did not have space to cover; we will discuss them in a future article on Dr. Dobb's. That list includes our math and atomics libraries, the Direct3D-interop interface and HLSL-intrinsic support (for game developers), a CPU fallback for when there is no other capable accelerator on the system, and upcoming Visual Studio support for debugging, profiling, and the rest of the market-leading IDE experience, and more.

We hope that you have found this short introduction to C++ AMP compelling, and we look forward to hearing how massive parallelism is going to transform your problem domain! Microsoft is hard at work bringing heterogeneous computing to the mainstream through a technology that offers performance, productivity, and portability, with a future-proof design that abstracts away a hardware landscape that is still in motion.


Daniel Moth is a Principal Program Manager in Microsoft's Developer Division. Yossi Levanoni is a Principal Architect in Microsoft's Developer Division.

