
Clay Breshears

Dr. Dobb's Bloggers

Is SPMD a Relevant Threading Model Anymore?

November 03, 2011

You might be acquainted with Flynn's Taxonomy of computer architecture. Even so, here's a quick refresher: Flynn's Taxonomy describes a machine's execution model in terms of instruction streams and data streams. Each of these has two choices: Single (one) or Multiple. In this way, a strictly serial processor would be classified as "Single Instruction, Single Data" (SISD). A multiple processor configuration that could be programmed to simultaneously execute different programs on different sets of data would be "Multiple Instructions, Multiple Data" (MIMD).

In a similar fashion, the term "Single Program, Multiple Data" (SPMD) was coined for a programming model. This methodology arose from distributed-memory programming, where a single program is written with the assumption that each instance of it will run on a different processor and work on a separate data set. The separate data sets can be truly independent — like separate data files — or portions of a monolithic data set divided among the multiple execution streams or processes.

In either case, each process needs to know which data set to use. The easiest way to accomplish this is to have each process know its place within the set of processes: each process determines the number of processes involved in the computation and its own unique number (rank) within that set. Armed with that information, plus the size or number of the data sets, each process can compute which data sets (or which portion of a single large data set) to use.

In programs written using MPI, the functions MPI_Comm_size() and MPI_Comm_rank() return the number of executing processes and the rank of the calling process, respectively. Besides letting a process compute which data to work on, these two functions are necessary for sharing data: the rank of a process serves as its "address" for sending and receiving messages. If a large data set is to be used, a "producer" process can parcel out pieces to the "consumer" processes. This is an obvious place where explicit SPMD coding is used.

What about programming with threads in a shared-memory environment? Even in the case where there is the need to divide a large data set among the threads, does each thread need to know its rank within the larger team of threads? Recent programming libraries have answered "No" to that question.

Take for example OpenMP. Computation on a large data set is typically done within the body of a loop. Thus, in order to divide a large data set among a set of threads, you can simply divide up the iterations of the loop that operates on the data set. This is done via a simple OpenMP pragma prefixing the loop. As an example, here's a simple loop that executes some operation on each element of a two-dimensional array:

#pragma omp parallel for
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < M; ++j)
      a[i][j] = do_work(a[i][j]);   // some operation per element (do_work is illustrative)

The OpenMP runtime will note the number of threads in the current team (either created at this parallel region, if it is the first one executed, or woken up from sitting idle), divide up the iterations of the outer loop, and distribute them to members of the thread team. All of this is done without programmer or user intervention. Plus, it all still works correctly if the code runs on a different system, if a different number of threads is used in subsequent executions, or even if the value of N has changed by the time the loop comes around to be executed again.

The OpenMP API does have functions similar to MPI_Comm_size() and MPI_Comm_rank(), so the programmer can divide up the loop iterations as desired. If you program this way, every thread in the team must be both a "producer" (computing the portion of the loop iterations it will use) and a "consumer" (executing the loop body). Here is one way to write this explicit division of the outer loop iterations:

#pragma omp parallel
{ int start, end;
  int numThreads = omp_get_num_threads();  // size of team
  int myId = omp_get_thread_num();         // rank within team
  start = (N / numThreads) * myId;
  end   = (N / numThreads) * (myId + 1);
  if (myId == numThreads - 1)              // last thread takes any leftover iterations
    end = N;
  for (int i = start; i < end; ++i)        // each thread runs only its own range
    for (int j = 0; j < M; ++j)
      a[i][j] = do_work(a[i][j]);          // do_work is illustrative
}

Now, if I had an axe to grind, I could note that this code has over three times the number of lines of the first example. Even though that's not the point of this post, it does illustrate that doing an explicit division of labor within threaded code adds some (unnecessary) complexity to the program. The more complexity added, the more chance there is for error, too. For example, if N is a prime number (and therefore not evenly divisible by numThreads), and the conditional that extends the last thread's range to N were omitted, some iterations would never be assigned to a thread.

Consider more recent threaded programming models like Cilk, Intel's Threading Building Blocks, and Microsoft's Task Parallel Library. The default usage in each of these models is for the programmer to identify where independent computations (tasks) lie within the code, and to let the runtime divide up the work and assign tasks to threads. Whether the identified tasks are loop iterations or explicit blocks of code, by not requiring the programmer to know about threads at all, these libraries have simplified coding and reduced the chance for errors. Learning parallel programming is much easier when algorithms need not be designed around the programmer doling out tasks to threads, and scalability of the executable is more likely when the number and composition of tasks is not fixed at compile time.

I haven't even mentioned data parallel programming on GPUs and other coprocessors. Data is shuttled over to the device and simply used in the computations. There is neither a concept of nor a mechanism for explicitly assigning blocks of iterations to a thread.

So, is SPMD programming to become a thing of the past like slide rules, flowchart templates, and the Arithmetic IF? Not completely. SPMD is still the dominant model for distributed-memory computation or any other message-passing model. However, even though threads remain the shared-memory execution mechanism for the foreseeable future, I can't imagine that functions to know a thread's rank within a team will be included in future parallel programming libraries.

Is that a good thing or a bad thing? Mostly good, I'd say. This will make teaching parallel programming easier and coding a little less error prone. However, I can think of one instance of a parallel algorithm that would likely be more scalable if I could control which threads do what parts of the computation, rather than having to rely on the threading library runtime to make its best guess (which will be explored in a future post). Do you have a favorite example where SPMD is not needed, but would benefit the execution of the algorithm?
