C/C++

Data-Flow Multitasking

By Rabindra P. Kar, November 01, 1989

The beauty of data-flow architectures, as Robin explains here, is that they let you harness the power of multiple processors to process a stream of data in a sequential algorithm.

NOV89: DATA-FLOW MULTITASKING

DATA-FLOW MULTITASKING

A bridge between single and multiple processor systems

Rabindra P. Kar

Robin is a senior engineer with the Intel Systems Group and author of the Rhealstone real-time benchmark proposal. He can be reached at 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497.

In recent years, it has become apparent that computer designs based on the traditional Von Neumann computer architecture (a single processor stepping sequentially through a linear memory) are reaching their performance limits. Performance continues to rise as hardware clock rates get faster and features such as pipelining and cache memory improve. However, the realization is dawning that to achieve quantum performance leaps with existing technology, some form of multiple processor architecture must be used.

In the 1980s, we have seen the appearance of many multiple processor computer designs, but the number of these computers in actual use is tiny compared with traditional, single CPU systems. There are several reasons for this. Multiprocessor systems are expensive. They cost more because they use more hardware, and do not enjoy the economies of scale that come with high-volume production and sales. Furthermore, applications and/or systems programmers must deal with a vastly more complex hardware environment, making the entire development cycle difficult. Programmers must identify the sections of an application that can be tackled in parallel, and then devise programs that will take advantage of the hardware. Little performance gain can be achieved for applications that are sequential in nature, unless data-flow concepts are used in software design.

While multiprocessing systems are expensive and relatively uncommon, multitasking has been with us for decades. With the notable exception of DOS, most operating systems in common use today (including Unix, IBM's OS/ MVS, OS/2, iRMX and VRTX in the real-time world) offer multitasking. That brings up an interesting question. Can applications programs be written on multitasking systems today that will be easy to port to the multiprocessing systems of tomorrow? Is this possible even for applications that are sequential in nature?

This article is about data-flow multitasking, a software concept that may help bridge the gap between multitasking and multiprocessing capability.

Data-Flow Concepts

Data-flow architectures have been studied for at least three decades, but they have been overshadowed by the now dominant centralized processing, control-flow type architectures identified with Von Neumann. In a data-flow machine, storage (memory) is not viewed as a separate, passive area, but as a link between data processing operations (instructions). Data processing operations are enabled (executable) when input data is available to them, and when their previous output has been read and disposed of. This concept is similar to "just-in-time" assembly lines pioneered by major Japanese manufacturers.

The easiest explanation of data-flow concepts is by a simple example -- a program that computes the roots of a quadratic equation [ax + bx + c = 0]. Figure 1 shows a flowchart for this program in classic sequential style. Each step in the program cannot be executed until the result(s) of the previous operation are known. The traditional approach to executing this program is to have the computer's CPU step through each operation sequentially (read, compute DET, compute NUM+/-, compute RES) and then loop back to process the next set of data. The input data, and all intermediate results, are stored in a part of the computer's memory. This is referred to as the "control-flow" approach.

Now, let's modify the flowchart in Figure 1. Instead of thinking of each operation as a subroutine or function that a single CPU must execute sequentially, imagine a CPU (or hard-wired logic) dedicated to each operation. Each CPU (or logic module) can be called an operation processor (OP). Think of memory as a set of mailboxes connecting the OPs as shown in Figure 2.

Each OP is either executing its specific operation, or it is in "wait" mode. In wait mode, it constantly looks for new data in the mailbox preceding the operation (its "input mailbox"). The OP also checks that whatever result it placed in the mailbox following the operation (its "output mailbox") has been read by the next OP in line. If these two conditions are met, it reads the input mailbox data, executes the operation, puts the result into the output mailbox, and goes back to wait mode. This is called the "data-flow" approach -- each CPU is dedicated to one operation, and data "flows" from one CPU to another.

In a true data-flow machine, each operation is a basic machine instruction, such as multiplication or addition. Because we are trying to superimpose a macroscopic data-flow strategy on a control-flow machine, however, the computations within each operation (calculating DET or NUM, for example) are still executed with a control-flow approach. It is only the relationship between the major operations in our flowchart that simulate a data-flow architecture.

Multiple Tasks Simulate Multiple CPUs

The data-flow approach is promising because it allows the power of multiple processors to be harnessed to process a stream of data in a sequential algorithm. The problem today with this approach is that most programmers only have access to single CPU (Von Neumann) architectures, whether they are VAX terminals, workstations, or humble PC/AT clones. Yet much of today's software may need to be ported to powerful, multiprocessing machines within the next decade.

Can application programs using data-flow concepts be written today? Yes, if the multitasking abilities of today's operating systems are used. Each OP in the above example can be effectively simulated by a separate task that reads data from a "mailbox," executes the operation, and writes the result to the next mailbox. Most multitasking operating systems offer features such as pipes, queues, and others for intertask communication. The programmer can set these up to perform the mailbox function. If the OS allows tasks to share memory/data segments, simply assigning some global variables as "mailboxes" is an effective, high-performance way of passing data between tasks. This technique is shown in the following section through a programming example. The programs are based on the flowcharts in Figure 1 and Figure 2.

A Programming Example

Listing One, page 84, shows quad_eq.c, a C program that calculates roots of a quadratic equation using the "normal" control-flow approach. Input data (100 sets of coefficients) was "created" in this example program simply for convenience. In actuality the data would be read from some I/O device.

In a single CPU environment, quad_eq.c is a simple and efficient program. Because this is the environment that 99 percent of today's computer science majors learned about programming, it also seems like the natural way. However, if a multi-processor system became available, there is no obvious way to port this program using parallel computation to speed it up.

Listing Two, page 84, shows dflow.c, a multitasking version of quad_eq.c, which runs under Intel's real-time multitasking operating system, iRMX II. Each computational function in quad_eq.c has been turned into a computational task in dflow.c. For example, the numerator( ) function is replaced by num_task( ). Global variables act as mailboxes to pass data between tasks. Essentially, dflow.c implements the flowchart in Figure 2, with computational tasks substituting for OPs.

In quad_eq.c, main( ) controls the flow of the program -- the sequence in which each function (subroutine) is called. In dflow.c, main( ) tells the OS which tasks should be created. After that, it gets out of the way (lowers its own priority) while the computational tasks (det_task( ), num_task( ), and res_task( )) process all the input data. The computational tasks delete themselves when the data is exhausted (when the "a" coefficient of the next set of data is 0.0), returning control to main( ), which promptly exits.

Global variables det, num, index1, index2, and array y[] constitute the mailbox information passed from one task to the next. In a data-flow environment, each operation is performed only when there is new data in a function's input mailbox, and only if the result of that function's previous operation has been read from its output mailbox. The variables det and num are used for this synchronization (num_task( ) will do a computation only when it sees a non-zero value in det, and zero in num). After the variable det is read it is set to zero, so det_task( ) will know that its output mailbox was read.

Each task calls rq$sleep (0, &status) if the time is not right to process new data. The first parameter of rq$sleep is the sleep-time. A sleep-time of zero tells the operating system that the running task is willing to give up the CPU in favor of some other equal priority task that needs it.

The program dflow.c is a data-flow implementation of a typical control-flow program. It is a multitasking solution to a straightforward computation. It is conceptually simple to port a multitasking solution to a multiprocessor environment, and reap the enormous performance benefits of concurrent computation.

Multitasking Versus Single-Task Performance

While data-flow multitasking is a bridge to high performance in a multi-processor environment, it may lower performance slightly in a single-CPU system. There are two main reasons for this. The first is that CPU time spent calling and returning from a function or subroutine is usually less than the time required for a task switch. For this reason, quad_eq.c will execute faster than dflow.c on most single CPU multitasking systems. Secondly, intertask communication mailboxes using global variables may not be feasible because of OS restrictions, or the necessity to keep the software more modular and easier to maintain. Using other communication features, such as pipes, may impose a performance penalty.

In some situations, though, data-flow multitasking can be used to speed up a program, even on a single CPU system. A good example is when the program must talk to relatively slow I/O devices. Multitasking can be used to overlap inherent I/O delays. While one task is waiting for a disk write to complete, another task may write to the console or read from a keyboard.

Listing Three, page 84, shows res_task2( ), code for a task that could be added to dflow.c. This code calculates the root of a quadratic. It is identical to res_task( ) code. Except that it prints its output to a file rather than the console. Note that det_task( ) and num_task( ) involve no interaction with any slow I/O device, the CPU probably spends most of its time waiting for the print operation to complete. With two or more tasks printing the results on different output devices, I/O delays can be overlapped, and the program is speeded up.

Multiple Processors

In general, if a computation has been divided into n number of operations, at least n tasks must be spawned to implement data-flow multitasking. When ported to a multiprocessor system, each task will be concurrently executed by a separate processor, thus utilizing n processors. The mailboxes can be global variables in memory on a common bus (tightly coupled multiprocessing), or implemented by message passing using a wide variety of interprocessor connection schemes.

If there are more processors available, you can utilize more than two processors per operation and share common input and output mailboxes. Assuming that interprocessor communication overhead is much smaller than the execution time for any operation, the effects of using multiple processors varies. If the processor subsystems execute each of the n operations in the same amount of time, no performance advantage can be gained by using multiple processors per operation.

In most cases, though, the execution time for each of the n operations will be unequal. A rule of thumb is that the optimum number of processors per operation is proportional to the execution time of each operation. If the program involves three operations and operation 1 takes t milliseconds, operation 2 takes 2t and operation 3 takes 3t, then optimal performance is achieved by using one processor to execute operation 1, two processors for operation 2, and three for operation 3.

When multiple processors are executing the same operation concurrently, data integrity in the mailboxes becomes a tricky problem. Test and set operations on mailbox data must be made atomic (uninterruptible) by hardware or software means. For example, if there are two processors dedicated to calculating the numerator in Listing Two, each processor must be able to read the value of det or num and set the value in complete privacy. The other processor must not be able to access the data between the read and set actions.

Relevance of Real-Time Operating Systems

There are several reasons why a real-time operating system (iRMX II) was used as a foundation for the data-flow multitasking example in this article. 1. Many real-time data acquisition and analysis applications are pushing the limits of a single-CPU solution. Application designers are looking at real-time multiprocessing to provide high performance at a reasonable cost. Data-flow multitasking is the first software step towards that goal.

2. iRMX II is part of the family of real-time OSs. The newer members of this family are oriented towards distributed multiprocessing. This is the very environment that data-flow multitasking programs should be ported to in order to use their built-in concurrency.

3. A critical technical advantage of a real-time OS is determinism -- the fact that task switching happens on predictable events or system calls, and not according to some application-invisible time-slicing algorithm. This is extremely important when there is more than one task accessing the same input and/or output mailbox.

Both res_task( ) and res_task2( ) in Listings Two and Three, read their input data from the global variable num (their input mailbox). Each one sets num to 0.0 after reading it, to tell num_task( ) that the mailbox has been read. The C code that does this is:

     res = num / (2 * y[index2].a);
     num = 0.0;

These C statements translate into several machine instructions each. If the OS implements multitasking by a round-robin time-slicing algorithm, a task switch could occur anytime between the machine instruction that reads num from memory and the one that sets it to 0.0. If res_task( ) were so interrupted, and res_task2( ) was activated before res_task( ) resumed, both tasks would read the same data and compute the same result twice. A similar problem was described in the previous section regarding the use of two processors per operation.

In iRMX II, task switching between equal priority tasks will not occur until a system call is made. Thus, res_task( ) and res_task2() will not be switched in or out except at the point where they call rq$sleep. Most real-time kernels and OSs behave similarly. Some non-real-time OSs do allow you to stop task-switching between certain boundaries ("critical sections" in OS/2, for instance), but you must use this feature explicitly. If an OS offers multitasking without this capability, only one task should be spawned per operation.

Summary

Data-flow multitasking is a promising solution to the challenges software faces over the next decade. It involves looking at a sequential program through new eyes. Each major operation, function, or subroutine is viewed as a separate task. The activation of tasks is synchronized by the flow of processed data from one task to another. When ported to a multi-processor environment, each task can be concurrently executed by one or more processors.

The use of data-flow multitasking may impose some performance penalty in a computation-intensive single-CPU environment. This can be more than offset if multitasking is used to overlap the slow I/O operations executed by most programs. Porting a program to a multi-CPU environment almost always brings a huge performance boost.

Data-flow multitasking is particularly relevant for real-time applications. The need for high performance at moderate cost is much more pressing in the real-time world than it is in batch or interactive computing. Real-time multitasking operating systems and/or kernels also provide a high level of determinism, which precludes race conditions associated with multiple task access data mailboxes.

_DATE-FLOW MULTITASKING_ by Rabindra Kar

[LISTING ONE]

<a name="0226_000c">

/*  quad_eq.c -
 *  Calculate roots of quadratic equation:  ax  + bx + c = 0
 */

#include "stdio.h"
#include "rmx.h"
#include "math.h"

#define NUM_VALS 100

unsigned status;
struct coeffs {double a, b, c;} y[NUM_VALS+1];
double det;
double num;
double res;

/* Calculate determinant */
determinant(i)
unsigned i;
{
    det = y[i].b * y[i].b - 4 * y[i].a * y[i].c;
}

/* Calculate numerator */
numerator(i)
unsigned i;
{
    num = pow(det,0.5) - y[i].b;
}

/* Calculate, print result */
result(i)
unsigned i;
{
    res = num / (2 * y[i].a);
    printf("Coeffs = %.1f,%.1f,%.1f  Root = %7.2f\n",
                            y[i].a,y[i].b,y[i].c,res);
}

main()
{
  unsigned i;
  /* Create input data. Done for convenience. Could have read the data
   * from any input device.
   */
  for (i = 1; i < NUM_VALS; i++)
    {y[i].a = (double)i; y[i].c = (double)i; y[i].b = 3*(double)i;}

  /* Compute the roots */

  for (i = 1; i < NUM_VALS; i++)
  {
    determinant(i);
    numerator(i);
    result(i);
  }
  printf("\n");
}




<a name="0226_000d"><a name="0226_000d">
<a name="0226_000e">

[LISTING TWO]

<a name="0226_000e">

/**************************************************************************\
 *  dflow.c -
 *    Compute the roots of a quadratic equation using multitasking
 *    and data flow concepts.
 *  Operating System: iRMX II. Compiler: iC-286 V3.2
\**************************************************************************/

#include "stdio.h"
#include "rmx.h"
#include "math.h"

#define NUM_VALS 100

unsigned status, task_t;
FILE *fp;

/* "union" used to decompose a pointer into segment:offset */
typedef struct {unsigned offset; unsigned sel;} ptr_s;
union { unsigned *pointer; ptr_s ptr; } ptr_u;

/* Input data area */
struct coeffs {double a, b, c;} y[NUM_VALS+1];

/* "DET" mailbox -
 *      Output from main task; input to num_task
 */
double det = 0.0;
unsigned index1 = 0;

/* "NUM" mailbox -
 *      Output from num_task; input to res_task
 */
double num = 0.0;
unsigned index2 = 0;

/************************ Computational tasks *************************/

det_task()
{    /* Calculate determinant */
  unsigned i;
  for (i = 1; y[i].a != 0.0; i++)
  {
    while (det != 0.0) rq$sleep(0,&status);
    index1++;
    det = y[i].b * y[i].b - 4 * y[i].a * y[i].c;
  }
}

num_task()
{    /* Calculate numerator */
  while (y[index1].a != 0.0)
  {
    while ((det == 0.0) || (num != 0.0)) rq$sleep(0,&status);
    index2++;
    num = pow(det, 0.5) - y[index1].b;
    det = 0.0;
  }
  rq$delete$task(0,&status);    /* Delete self */
}

res_task()
{    /* Calculate result, print on console */
  double res;
  while (y[index2].a != 0.0)
  {
    while (num == 0.0) rq$sleep(0,&status);
    res = num / (2 * y[index2].a);
    num = 0.0;
    printf("Coeffs = %.1f,%.1f,%.1f  Root = %7.2f\n",
                            y[index2].a, y[index2].b, y[index2].c, res);
  }
  printf("\n");
  rq$delete$task(0,&status);    /* Delete self */
}

/****************************** Main Program ******************************/

main()
{
  unsigned i;
  unsigned short pri;

  /* Create input data. Done for convenience. Could have read the data
   * from any input device.
   */
  for (i = 1; i < NUM_VALS; i++)
    {y[i].a = (double)i; y[i].c = (double)i; y[i].b = 3*(double)i;}

  y[NUM_VALS].a = 0.0;   /* 0.0 indicates end of input data */

  /* Place a pointer to any variable in union "ptr_u", so the data segment
     of this program becomes known.
   */
  ptr_u.pointer = &status;

  /* Find the priority level that iRMX accords the main task */
  pri = rq$get$priority (NULL, &status);

  /*
   * Spawn the operation tasks. All have the same priority as main.
   */
  task_t = rq$create$task (pri, (long)det_task, ptr_u.ptr.sel,
                     0L, 1024, 1, &status);
  if (status != 0) printf("det_task create error = %u\n",status);

  task_t = rq$create$task (pri, (long)num_task, ptr_u.ptr.sel,
                     0L, 1024, 1, &status);
  if (status != 0) printf("num_task create error = %u\n",status);

  task_t = rq$create$task (pri, (long)res_task, ptr_u.ptr.sel,
                     0L, 1024, 1, &status);
  if (status != 0) printf("res_task create error = %u\n",status);

  /*
   * Lower main task's priority so main() will be idled by OS while other
   * tasks do computations. Will return here when input data is exhausted.
   */
  rq$set$priority (NULL, (pri+1), &status);

  /*********************************************************************/

  rq$sleep(10, &status);     /* Let other tasks complete */
  printf("\n  Done! \n");

}




<a name="0226_000f"><a name="0226_000f">
<a name="0226_0010">

[LISTING THREE]

<a name="0226_0010">


/*
 * This code is similar to "res_task()" in dflow.c, except
 * that the computed result is output to a file instead of the (default)
 * console screen.
 *     This code could be added to dflow.c along with two statements in
 * main() to: 1. create a "res_task2()" task, and 2. open a file for
 * output with fp as its file pointer. By so doing, the I/O delay involved
 * in waiting for console output could be overlapped with file output delays,
 * thus speeding up the program. Of course, the results would then be
 * distributed between the console and the output file.
 */

res_task2()
{    /* Calculate result, print to file */
  double res;

  while (y[index2].a != 0.0)
  {
    while (num == 0.0) rq$sleep(0,&status);
    res = num / (2 * y[index2].a);
    num = 0.0;
    fprintf(fp, "Coeffs = %.1f,%.1f,%.1f  root = %7.2f\n",
                            y[index2].a, y[index2].b, y[index2].c, res);
  }
  rq$delete$task(0,&status);    /* Delete self */
}

Previous 2 3 4 5 6 7

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

C/C++

Data-Flow Multitasking

DATA-FLOW MULTITASKING

A bridge between single and multiple processor systems

Rabindra P. Kar

Data-Flow Concepts

Multiple Tasks Simulate Multiple CPUs

A Programming Example

Multitasking Versus Single-Task Performance

Multiple Processors

Relevance of Real-Time Operating Systems

Summary

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

C/C++ Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

C/C++

Data-Flow Multitasking

DATA-FLOW MULTITASKING

Rabindra P. Kar

Related Reading

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

C/C++ Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content