Channels ▼

Jim Dempsey

Dr. Dobb's Bloggers

Two-Stage Input Parallel Pipeline: Part 1

January 22, 2010

Welcome to the Quick Threads blog. Today we will look at using the QuickThread threading toolkit to write a two-stage input pipeline to boost the performance of the input side of your applications. QuickThread is available from QuickThread Programming.As an example program, we will use the Black-Scholes benchmark included in the PARSEC benchmark suite from Princeton University http://parsec.cs.princeton.edu/. This program uses the Black-Scholes analytical method for calculating European Options and is often used for measuring the floating point computational efficiency for compilers and processor architectures. The typical use of this benchmark program is:

a) Read-in option data consisting of 10,000,000 options in ASCII text format and convert into internal data structures. Input first reads and converts from text lines to an Array Of Structures then converts the Array Of Structures into a Structure Of Arrays.

b) Repeat 100 times: for each option, pass option data through the Black-Scholes function computing strike price and saving it in an output array.

c) Convert the 10,000,000 price information results back to text and write to output file.

For the typical benchmark, only the run time for step b) is reported. Step b) is called the Region Of Interest (ROI) for those familiar to the PARSEC benchmark programs.

When your interest is in measuring the computational performance of a compiler or processor architecture, this benchmark provides a reasonable measure of the floating point performance.

However, when your interest is in actually using applications that approach of the nature of this benchmark program, your interest will be better served by looking at the total application run time. Principally the run time for step a), plus the run time for one iteration of step b), plus the run time for step c.

On a Dell R610 with 2x Intel Xeon 5570 processors, with one iteration for step b), we have as time in seconds:

Serial: a) 33.361, b) 0.859, c) 8.394, total 42.614 OpenMP: a) 33.636, b) 0.107, c) 8.439, total 42.181 TBB: a) 33.402, b) 0.108, c) 9.317, total 42.826 Win32 Threads: a) 41.888, b) 0.089, c) 8.421, total 41.888

We find that even for the Serial run, 98% of the application run time is spent in the read-in and write-out phases. For the best threaded version (Win32 threads) the threading portion for step b) runs 9.65x faster than serial run with a similar ratio for the larger 100 iteration run of b). However, note that this amounts to less than 2% faster for the total application.

Clearly, for applications of this nature, the focus of optimization should be on the input and output side of the application.

We will address this issue using two parallel programming techniques: Using a two-stage parallel pipeline for the input side, and a two-stage output technique that does not use the parallel pipeline.

Using these techniques, plus parallel code for step b) and QuickThread we obtain:

QuickThread: a) 4.393, b) 0.088, c) 1.565, total 6.047

With these two techniques we see an overall boost in application performance of 692.7% (almost 7x the performance of the best alternate programming model).

To be fair, I must state that the other programming techniques were never focused on reducing the run time of the input and the output parts of the benchmark. It will be left as an exercise for the reader to attempt to make similar improvements for the other threading models.

When looking at the input stage a) we find the two-stage parallel pipeline was 7.8x faster than the serial input. Let's look at the code to see what is involved in writing a two-stage pipeline using QuickThread.

First look at the serial input code (abridged for this article, error tests other statements removed):

<font face="Courier New">

    file = fopen(inputFile, "r");
    rv = fscanf(file, "%i", &numOptions); // read number of options

    // allocate Array Of Structures storage
    ...

    // read-in from ASCII file
    for ( i = 0; i < numOptions; ++i )
    {
        rv = fscanf(file, "%f %f %f %f %f %f %c %f %f",
                    &data[i].s, &data[i].strike,
                    &data[i].r, &data[i].divq,
                    &data[i].v, &data[i].t,
                    &data[i].OptionType, &data[i].divs,
                    &data[i].DGrefval);
    }
    rv = fclose(file); // close input file

    // allocate Structure of Arrays memory
    ...

    // convert from AOS to SOA
    for (i=0; i < numOptions; i++)
    {
        otype[i]      = (data[i].OptionType == 'P') ? 1 : 0;
        sptprice[i]   = data[i].s;
        strike[i]     = data[i].strike;
        rate[i]       = data[i].r;
        volatility[i] = data[i].v;    
        otime[i]      = data[i].t;
    }
    // input done
     

At first glance you will likely say to yourself "there is hardly any code here". What code is in there to parallelize?

You know that you cannot read the sequential file in parallel thus leaving the "copy from AOS to SOA" to parallelize. This loop is a memory bandwidth limited loop so just what code is there to parallel?

The answer here is there is a lot of computation going on in the convert ASCII input line to AOS as performed with:

<font face="Courier New">

        rv = fscanf(file, "%f %f %f %f %f %f %c %f %f",
                    &data[i].s, &data[i].strike,
                    &data[i].r, &data[i].divq,
                    &data[i].v, &data[i].t,
                    &data[i].OptionType, &data[i].divs,
                    &data[i].DGrefval);

But that code is reading from the input file. Just how do you go about parallelizing a sequential operation?

The answer to this is you separate the read-in the text of the file, from the conversion of the text from ASCII to the internal storage format. Thus making a two step process: a) read in text lines, and b) convert text lines to internal storage format.

You will be required to write some additional code to accomplish this. Fortunately, QuickThread does the hard work for you.

Abbreviated code follows. Error tests, initialization, and some functionality removed. Full code (links there to) will be provided in a full feature article on ddj.com/go-parallel.

<font face="Courier New">

// Define your parallel pipeline I/O context object
class MyPipelineIoContext : public qt::PipelineIoContext
{
public:
    ... // i/o context data here (e.g. FILE *inputStream; )
    MyPipelineIoContext(); // ctor
    ~MyPipelineIoContext(); // dtor
    // your arbitrary init function for the I/O context
    int Init(char *_inputFile, int _arrayCount);
}; // class MyPipelineIoContext : public qt::PipelineIoContext

// Define your pipeline buffer token
class MyPipelineBuffer : public qt::PipelineBuffer
{
public:
    // N.B. PARSEC Black-Scholes original source
    // came with option data stored three ways:
    //	a) externally as text lines
    //	b) internally as an Array Of Structures
    //	c) internally rearranged as Structure Of Arrays (for SIMD)
    // A later optimization can be made by eliminating the AOS
    // and reading directly into SOA
    
    // Option data as text lines for read-in (prior to conversion)
    OptionDataAsText*	inputLines; // Allocate to pio->arrayCount
    
    // Option data as Array Of Structures
    OptionData * data; // Allocate to pio->arrayCount

    int numOptionsInBuffer;
    int baseOptionNumberInBuffer;
    int arrayCount;	// allocation count of elements in arrays

    // The ctor of the pipeline buffer is called from
    // from within the startup code of the pipeline
    // The (your) pipeline buffer may have fixed size buffers
    // (allocated by way of the QuickThread NUMA aware allocator)
    // and/or dynamically allocated buffers allocated by your ctor.
    // Here we use buffers allocated by your ctor.
    // (allocated by way of the QuickThread NUMA aware allocator)
    MyPipelineBuffer(MyPipelineIoContext* pio);

    // dtor frees up buffers
    ~MyPipelineBuffer();

    // Convert from input text lines to Array Of Structures
    // almost identical code to your serial code
    bool ConvertTextLinesToAOS()
    {
        for ( int i = 0; i < numOptionsInBuffer; ++i )
        {
            int rv = sscanf(inputLines[i].line,
                        "%f %f %f %f %f %f %c %f %f",
                        &data[i].s, &data[i].strike,
                        &data[i].r, &data[i].divq,
                        &data[i].v, &data[i].t,
                        &data[i].OptionType, &data[i].divs,
                        &data[i].DGrefval);
        }
        return true;
    }

    // Convert AOS to SOA
    // almost identical code to your serial code
    // excepting for partial sources coming from within
    // your MyPipelineBuffer
    void ConvertAOStoSOA();
}; // class MyPipelineBuffer : public qt::PipelineBuffer

// task for reading buffer
void PipelineReadBuffer(MyPipelineIoContext* io, MyPipelineBuffer* b)
{
    ... sanity checks and variable initialization code here
    // PipelineReadBuffer performs as little work as possible
    // in this case this is reduce to reading the appropriate
    // number of lines into the line-by-line input buffers
    for ( int i = 0; i < b->numOptionsInBuffer; ++i )
    {
        if(fgets(b->inputLines[i].line,
                 MAX_INPUT_LINE_SIZE,
                 io->inputStream)
            == NULL)
        {
            ... error code
            return;
        }
    }
} // void PipelineReadBuffer

// task to process buffer
void PipelineProcessBuffer(MyPipelineBuffer* b)
{
	// Convert from input text lines to Array Of Structures
	if(!b->ConvertTextLinesToAOS())
      {
            b->Status = ExitError$;
		return;	// error
      }

	// Convert AOS to SOA
	b->ConvertAOStoSOA();
} // void PipelineProcessBuffer(MyPipelineBuffer* b)

int main(...
{
    ... initialization

    // QuickThread thread pool initialization
    qtInit  qtInit(nThreads, 2);// # compute threads, # I/O threads

    // instantiate a QuickThread pipeline I/O context
    MyPipelineIoContext	pio;

    // Initialize the I/O context and return the number of options
    numOptions = pio.Init(inputFile, bufferSizeInElements);
    ... allocate buffers based on numOptions

    // instantiate QuickThread pipeline object
    qtPipeline<MyPipelineIoContext, MyPipelineBuffer>    pipeline;

    // add pipes (2 for two-stage pipeline)
    pipeline.addPipe(PipelineReadBuffer);
    pipeline.addPipe(PipelineProcessBuffer);

    // start the pipeline
    pipeline.run(&pio);

    // (run other code here if you want prior to waiting)
    pipeline.WaitTillDone(); // wait for pipeline to finish

    ... input data loaded
    ... remainder of program here
}

The amount of coding effort will vary from application to application. In the Black-Scholes application approximately 3 pages of code were added. And a significant portion of this code was a copy and paste from an earlier parallel pipelined program. As you get familiar with programming using this technique figure on one day's worth of effort (including testing) to install a parallel pipeline into your code.

My next article will cover a parallel two stage output technique that does not use a pipeline.

Jim Dempsey http://www.quickthreadprogramming.com

Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 


Video