Ct: C for Throughput Computing

, December 23, 2008

Using Ct, developers can provide a parallel-ready version of common data types that C/C++ programmers already use

One of the main challenges in scaling multi-core for the future is that of migrating programming tools, build environments, and millions of lines of existing code to new parallel programming models or compilers. To help this transition, Intel researchers are developing "Ct," or C/C++ for Throughput Computing.

The Motivation for Ct

Increased software-exposed parallelism in microprocessors and co-processors (like GPUs) is driving development of applications and programming models that scale to utilize many threads and vector instructions. This need is multiplied as these "throughput" architectures will increasingly require software hiding of the exposed micro-architectural latencies through the use of additional threads. Collection-oriented (or data-oriented) parallelism will comprise a large majority of parallelized and vectorized compute cycles in future programs. In a future architecture supporting hundreds to thousands of hardware threads and thousands of software threads, thread creation will often be driven by small and large data structures representing vectors, hierarchies, hash tables, etc. Truly heterogeneous task parallelism will almost certainly exist in most applications to some extent. Having a flexible, parallel collection-oriented building block for user-defined classes will also be essential. Using Ct, developers can provide a parallel-ready version of common data types that C/C++ programmers already use.

How Does It Work?

In principal, Ct works with any standard C++ compiler because it is a standards-compliant C++ library (with a lot of runtime behind the scenes). When one initializes the Ct library, one loads a runtime that includes the compiler, threading runtime, memory manager -- essentially all the components one needs for threaded and/or vectorized code generation. Ct code is dynamically compiled, so the runtime tries to aggregate as many smaller tasks or data parallel work quanta so that it can minimize threading overhead and control the granularity according to run-time conditions. With the Ct dynamic engine, one gets precise inter-procedural traces to compile, which is extremely useful in the highly modularized and indirect world of object-oriented programming.

Collections in Ct

One of the justified criticisms of data-parallel programming models is that they focus on the very restrictive (but performance friendly) vector as the primitive parallel collection. Defining parallel operations over vectors is useful when a data type fits nicely into a flat, contiguous segment of memory. For dense linear algebra, for example, this is a convenient abstraction. What if the collection is a sparse matrix, a tree, a graph, or a set of value-key associations? With Ct, the programmer can do this using a TVEC, or a "throughput vector".

A TVEC generally represents parallel collections of data, including flat linear collections (what is usually meant by "vector"), matrices (higher-dimensional linear data structures), sparse collections, and indexed (or associative) collection types. Between these "shapes" of TVECs, one can easily build most of the interesting data types mentioned earlier. Like the typical STL container, TVECs are polymorphic in their base type.


    TVEC<F64> ADoubleVec;

Attributes like length and, more generally, shape are dynamically determined. The programmer that wants more control can declare a TVEC with a particular shape. The declaration here looks very much like a reference type in C++. ADoubleVec refers to an immutable value allocated on Ct's heap. TVECs are managed (i.e. garbage-collected). If one assigns a new value to a TVEC variable, one is not overwriting the old value (though the compiler may do this). This is because another computation may need this old value, especially as the runtime compiler is aggressively reordering and parallelizing computations. In this respect, TVEC values are immutable.

Here are some examples of what a collection in a TVEC might look like.

A flat vector

 
     A = [a, b, c, d, e] --> A[2] == c

A 2D matrix:

     B = [[a, b, c], [d, e, f], [g, h, i]] -->  B[0, 2] == c

An irregular, nested vector (like segmented vectors in Nesl):

     C = [[a], [], [b, c, d], [e, f]] -->  C[2, 1] == c

An indexed vector:

     D = [(0 ->  a), (2 ->  b), (8 ->  c), (_ ->  d)] -->  D[8] == c    
     // The _ denotes a default value for the empty holes, e.g. D[5] == d

The base types of TVECs can be any of the usual value types, and will soon include tuples and records (structs). Ct also supports operators on a full range of scalar value types.

Operators in Ct

All operators on TVECs are implicitly parallel. All operators in Ct are uniformly defined over TVECs of all shape. Though the implementations of various shapes may be more complex than others, this is a detail handled by the runtime compiler. There are several broad categories of Ct operators, three of which we will cover here: element-wise, collective communication, and permutation.

Element-wise operators include the familiar arithmetic operators programmers would expect, like the ability to add two vectors using the "+" operator. These include binary and unary arithmetic operators, transcendental operators, and so on. For TVECs of all shapes, these operators behave sensibly and are overloaded for TVEC-scalar, scalar-TVEC, and scalar-scalar variants. Here are some examples using pretty printed values instead of TVEC variables:


    [1, 2, 3, 4] + [-1, 2, -3, 4] = [0, 4, 0, 8]

Things get a little more interesting with collective communication operations. These operators include reductions and scans (or, as they are otherwise known, prefix-sums). A reduction simply adds a binary operator over an entire collection, essentially it's innermost dimension. For flat vectors, this means one reduces the vector to a single value. For example, the following is the result of a sumReduce:


    sumReduce([1, 2, 3, 4]) = [10]
    sumReduce([[1, 2], [3, 4, 5]]) = [3, 12]
    sumReduce([(1 -> 1), (2 ->  1), (1->  2)]) = [(1-> 3), (2-> 1)]

Note that while the semantics are consistent and sensible, the way reductions are performed depends heavily on the shape of the vector. The programmer does not have to worry about this (though the runtime does). They can program generically against a consistent set of reduction and scan operators. Scans are similar to reductions, but do not reduce dimensionality, instead retaining the partial results in sequence order. This is quite useful for sorting operations.

Permutations include operations that reorder values in a TVEC. We support the general forward and backward permutations, i.e. scatter and gather. These would be expressed A[Indices] = C and A = B[Indices], respectively. However, there are a select number of highly useful and very efficient permutations that are somewhat more constrained. These include pack, unpack, and shift/rotate. Pack simply takes two similarly shaped TVECs, a mask vector which must have a Boolean or compatible base type and a value TVEC of any base type, and eliminates the elements in value TVEC whose corresponding flag values are false. Shift (and rotate) shift elements up, down, right, left, etc. This is very useful in signal or image processing functions.


    Pack([a, b, c, d, e, f], [0, 1, 1, 0, 1, 0]) = [b, c, e]
    ShiftRight([a, b, c, d, e, f], [1], i) = [i, a, b, c, d, e]
    RotateLeft([a, b, c, d, e, f], [2]) = [c, d, e, f, a, b]

User-defined Operators in Ct

Ct supports a complete scalar set of types for a couple of reasons. First, operations involving both TVECs and scalars are useful. Second, programmers should be able to define their own operations or function to apply across TVECs or to use as combining operators in reductions and scans. (For functional programming folks, think map, fold, and foldlist.) This flavor of data parallel programming can be referred to as "kernel" style, as opposed to the "vector" flavor also supported in Ct. Both are useful in differing circumstances. For linear algebra, we find the "vector" style easier to express algorithms, whereas for some signal and image processing algorithms, we find the "kernel" style easier. The underlying machinery is the same, but having both styles is an expressive convenience to the programmer.

Instead of expressing the following color conversion as:


    Y = YRCoeff*R + YGCoeff*G + YBCoeff*B;
    U = URCoeff*R + UGCoeff*G + UBCoeff*B;
    V = VRCoeff*R + VGCoeff*G + VBCoeff*B;

The programmer can write:

    Y = map(RGBtoY, R, G, B);
    U = map(RGBtoU, R, G, B);
    V = map(RGBtoV, R, G, B);

Where the programmer has defined:


    TElt

This is a basic example. Things get interesting when the programmer introduces some control flow (If-Then-Else's, For loops, While loops) or wants to touch neighboring pixels, say, for a 3x3 filter.


    TElt<F32> threebythreefun(TElt<F32> arg, F32 w0, F32 w1, F32 w2, F32 w3, F32 w4) {
    return w0*arg + w1*arg[-1][0] + w2*arg[0][-1] + w3*arg[1][0] + w4*arg[0][1];
    }
    
    map(threebythreefun, some2DGrid, smooth1, smooth2, smooth3, ... [ etc.]);

The alternative would have been to shift this image around, logically creating 4 additional copies. Though the compiler would have optimized these copies away, it's a little misleading and not transparent enough in terms of the programmer's intent. One can even express partial orders on the applications of these kernels (think deblocking filters for H.264):


    TElt<I32> errordiffuse(TElt<I32> pixel) {
    return someexpression(pixel,RESULT[-1][0],RESULT[-1][-1],RESULT[0][-1]);
    };

    ditheredcolors = map(errordiffuse, filteredcolors);

With reduction and scans, things get more interesting. For circuit folks, carry look-ahead adders are just specialized scans.

Tasks in Ct

The Ct API is a declarative way of specifying parallel tasks that are dependent on each other. So, collections of Ct operations on TVECs become collections of task dependence graphs. This still might be too constraining for those who would like to get at tasking abstractions directly. To this end, Ct introduces two abstractions for task parallelism.

The basic tasking unit in the Ct runtime is a future, an idea that goes back to MultiLisp. A future is basically a parallel-ready thunk (a function and arguments) that may or may not be executed immediately (and asynchronously), but whose execution must precede the reading of any value it generates (the runtime guarantees this). So, futures are exposed at the API level. Simply take any Ct functions and invoke it using the "spawn" function in the Ct namespace. The runtime makes sure any dependencies are satisfied so that the corrected order of execution is respected while parallelizing the call. The value a Ct future returns, in essence, is the only "effect" that is visible. Ct also incorporates a structured task parallel abstraction called hierarchical, synchronous tasks (HSTs). The basic idea is that tasks can execute in parallel, but their "effects" to be isolated from each other. Some of these effects are useful, but we want them combined in some predictable (deterministic) way. This can be viewed as a generalization of bulk synchronous processes. With HSTs, Ct easily support various other forms of structured parallelism, including pipeline and fork-join patterns (all with deterministic behaviors).

Execution Semantics of Ct

Ct is a dynamically compiled via a language runtime through library initialization. This is done so that one can expose these higher-level abstractions to heavy compiler optimizations, especially as one migrates code between platforms. This means that the Ct API calls used in a program may execute out of order and asynchronously with respect to the rest of one's C/C++ code. The only Ct API calls that are guaranteed to be "synchronous" with the C/C++ code are those that move data back and forth between TVECs and native C/C++ memory. The semantics of Ct allow this aggressive reordering by the compiler and runtime, though there will be some effort required if one is heavily intermixing C/C++ code and Ct API calls to optimize one's code. For functional programmers, the evaluation of Ct API calls can be viewed as "lenient" (Ct is strict and purely functional). For web programmers, the execution model will look similar to Python in that one "executes" all the code at least once, including the object declarations (which are bits of code that create and initialize the objects), then one uses the objects and their methods. In Ct, the analogy would be that the C/C++ sequencing of Ct API calls is the first pass through, then the runtime's code cache is used to avoid recompilation.

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Ct: C for Throughput Computing

The Motivation for Ct

How Does It Work?

Collections in Ct

Operators in Ct

User-defined Operators in Ct

Tasks in Ct

Execution Semantics of Ct

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Ct: C for Throughput Computing

The Motivation for Ct

How Does It Work?

Collections in Ct

Operators in Ct

User-defined Operators in Ct

Tasks in Ct

Execution Semantics of Ct

Related Reading

News

Commentary

Slideshow

Video

Most Popular

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content