Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Design

Thread Parallelism Using Cilk Notation for C/C++



Getting top performance out of modern processors requires both SIMD and thread parallelism. As I discussed in SIMD Parallism using Array Notation for C/C++, fork-join and SIMD parallelism can be combined to solve a computational problem:

  • fork-join parallelism can specify which subproblems can be solved in parallel by different hardware threads.
  • SIMD parallelism can be applied to each subproblem, which helps the compiler exploit SIMD instructions.

In fork-join parallelism, control flow forks into separate flows, and later the flows join back together. Recursive fork-join can be particularly effective. For example, 10 levels of two-way splits creates 1024-way potential parallelism. 1024 separate hardware threads would be inefficient. But systems like Cilk Plus (and Threaded Building Blocks) are careful to turn potential parallelism into actual parallelism only when necessary. Indeed with these systems you should specify as much fork-join parallelism as you can, and let the system determine when, where, and how much to exploit.

An important point is that the subproblems should fit in cache, otherwise multiple threads can easily run up against memory bandwidth limitations. Sometimes it is hard to determine what a cache-sized subproblem is. When in doubt, go as small as practical. The subproblems are too small if the overhead of scheduling a chunk becomes significant compared to the work for solving the subproblem serially. Because Cilk Plus is tightly integrated into the compiler, and has highly structured parallelism, it tends to have lower chopping overhead than systems like TBB that are less structured and supplied as libraries. So you can often afford to chop into finer subproblems than you might with a system such as TBB.

A particularly effective form of chopping is exhibited by cache oblivious algorithms. These optimize for all levels of cache that might exist (including virtual memory!), oddly enough without knowing anything about the levels. See the link for more information.

Quick Introduction to Cilk Notation

There are two ways to express fork-join parallelism in Cilk:

  • cilk_spawn / cilk_sync
  • cilk_for

For example, the statement cilk_spawn f(); asynchronously calls a function f. Here asynchronous means that thethe caller keeps going without waiting for the callee to return. This is a neat way to express parallelism because most of what you learned about subroutine calls still applies. The actual arguments are evaluated, bound to formal arguments, and the callee's body is invoked. The only thing that changes is that the caller continues to execute after the callee is entered, if there is a hardware thread available to continue execution of the caller.

Eventually a caller must wait on its callees. To do so, the caller invokes cilk_sync. For example, the following code lets f(x), g(y), and h(z) run in parallel if there are sufficient hardware resources. It waits for them to complete before printing their results.

a = cilk_spawn f(x);
b = cilk_spawn g(y);
c = h(z);
cilk_sync;
cout << a << b << c << "\n";

Figure 1 shows the execution order.

[Click image to view at full size]
Figure 1

The last call is not spawned as a matter of good Cilk style. It could be spawned, but doing so is considered poor style, because there is no other work to do between that call and the cilk_sync. Doing nothing in parallel with doing something is pointless overhead.

The semantics of cilk_sync is that it waits for all cilk_spawn calls that were issued by the current Cilk block. A Cilk block is the body of a function, or the body of a cilk_for loop. Complex control flow is allowed between cilk_spawn and cilk_sync. For example, this code:

  

for( list::iterator i=x.begin(); i!=x.end(); ++x )
    cilk_spawn f(*i);
cilk_sync;

walks a linked list x sequentially and issues calls f(*i) that run concurrently. Execution waits for all of them to complete at the cilk_sync.

There is always an implicit cilk_sync when a routine returns. Hence a callee cannot accidentally leave "dangling tasks" running after it returns. This feature makes it much easier to reason about Cilk parallelism than it is to reason about raw threads.

The callee can be function object (functor) too. Since a C++1x lambda expression returns a function object, you can also spawn a block of work like this:

cilk_spawn [&]{for( int i=0; i<5; ++i ) cout<<"quack ";} ();
other_work(); // Do other work while quacking.
cilk_sync;

In the example, the lambda expression [&]{...} returns a function object. The cilk_spawn...() causes it to execute asynchronously. Do not forget the trailing parentheses, or the compiler will balk.

The notation cilk_for is like a C/C++ for, except that iterations run in parallel if resources permit. For example:

cilk_for( vector::iterator i=x.begin(); i!=x.end(); ++x )
    f(*i);

allows, but does not mandate, each loop iteration to run in parallel. A cilk_for loop is usually more efficient than a serial for loop wrapped around cilk_spawn requests. The reason is that a cilk_for can distribute the work across processors in parallel, and much more efficiently, than the combination of for and cilk_spawn. However, a cilk_for is more limited than a plain for. The iteration variable must be of an integral type or a random access iterator, and the cilk_for must fit a certain pattern, which permits the number of iterations to be computed before the loop iterations commence.

Another key feature of Intel Cilk Plus are hyperobjects. They deal the "join" part of fork-join parallelism when there is data to be joined, such as in reductions. Hyperobjects are a topic for another day.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.