Two-Stage Input Parallel Pipeline: Part 1
Welcome to the Quick Threads blog. Today we will look at using the QuickThread threading toolkit to write a two-stage input pipeline to boost the performance of the input side of your applications. QuickThread is available from QuickThread Programming.
As an example program, we will use the Black-Scholes benchmark included in the PARSEC benchmark suite from Princeton University http://parsec.cs.princeton.edu/. This program uses the Black-Scholes analytical method for calculating European Options and is often used for measuring the floating point computational efficiency for compilers and processor architectures. The typical use of this benchmark program is:
a) Read-in option data consisting of 10,000,000 options in ASCII text format and convert into internal data structures. Input first reads and converts from text lines to an Array Of Structures then converts the Array Of Structures into a Structure Of Arrays.
b) Repeat 100 times: for each option, pass option data through the Black-Scholes function computing strike price and saving it in an output array.
c) Convert the 10,000,000 price information results back to text and write to output file.
For the typical benchmark, only the run time for step b) is reported. Step b) is called the Region Of Interest (ROI) for those familiar to the PARSEC benchmark programs.
When your interest is in measuring the computational performance of a compiler or processor architecture, this benchmark provides a reasonable measure of the floating point performance.
However, when your interest is in actually using applications that approach of the nature of this benchmark program, your interest will be better served by looking at the total application run time. Principally the run time for step a), plus the run time for one iteration of step b), plus the run time for step c.
On a Dell R610 with 2x Intel Xeon 5570 processors, with one iteration for step b), we have as time in seconds:
Serial: a) 33.361, b) 0.859, c) 8.394, total 42.614
OpenMP: a) 33.636, b) 0.107, c) 8.439, total 42.181
TBB: a) 33.402, b) 0.108, c) 9.317, total 42.826
Win32 Threads: a) 41.888, b) 0.089, c) 8.421, total 41.888
We find that even for the Serial run, 98% of the application run time is spent in the read-in and write-out phases. For the best threaded version (Win32 threads) the threading portion for step b) runs 9.65x faster than serial run with a similar ratio for the larger 100 iteration run of b). However, note that this amounts to less than 2% faster for the total application.
Clearly, for applications of this nature, the focus of optimization should be on the input and output side of the application.
We will address this issue using two parallel programming techniques: Using a two-stage parallel pipeline for the input side, and a two-stage output technique that does not use the parallel pipeline.
Using these techniques, plus parallel code for step b) and QuickThread we obtain:
QuickThread: a) 4.393, b) 0.088, c) 1.565, total 6.047
With these two techniques we see an overall boost in application performance of 692.7% (almost 7x the performance of the best alternate programming model).
To be fair, I must state that the other programming techniques were never focused on reducing the run time of the input and the output parts of the benchmark. It will be left as an exercise for the reader to attempt to make similar improvements for the other threading models.
When looking at the input stage a) we find the two-stage parallel pipeline was 7.8x faster than the serial input. Let’s look at the code to see what is involved in writing a two-stage pipeline using QuickThread.
First look at the serial input code (abridged for this article, error tests other statements removed):
file = fopen(inputFile, "r");
rv = fscanf(file, "%i", &numOptions); // read number of options// allocate Array Of Structures storage
...// read-in from ASCII file
for ( i = 0; i < numOptions; ++i )
{
rv = fscanf(file, "%f %f %f %f %f %f %c %f %f",
&data[i].s, &data[i].strike,
&data[i].r, &data[i].divq,
&data[i].v, &data[i].t,
&data[i].OptionType, &data[i].divs,
&data[i].DGrefval);
}
rv = fclose(file); // close input file// allocate Structure of Arrays memory
...// convert from AOS to SOA
for (i=0; i < numOptions; i++)
{
otype[i] = (data[i].OptionType == 'P') ? 1 : 0;
sptprice[i] = data[i].s;
strike[i] = data[i].strike;
rate[i] = data[i].r;
volatility[i] = data[i].v;
otime[i] = data[i].t;
}
// input done
At first glance you will likely say to yourself "there is hardly any code here". What code is in there to parallelize?
You know that you cannot read the sequential file in parallel thus leaving the "copy from AOS to SOA" to parallelize. This loop is a memory bandwidth limited loop so just what code is there to parallel?
The answer here is there is a lot of computation going on in the convert ASCII input line to AOS as performed with:
rv = fscanf(file, "%f %f %f %f %f %f %c %f %f",
&data[i].s, &data[i].strike,
&data[i].r, &data[i].divq,
&data[i].v, &data[i].t,
&data[i].OptionType, &data[i].divs,
&data[i].DGrefval);
But that code is reading from the input file. Just how do you go about parallelizing a sequential operation?
The answer to this is you separate the read-in the text of the file, from the conversion of the text from ASCII to the internal storage format. Thus making a two step process: a) read in text lines, and b) convert text lines to internal storage format.
You will be required to write some additional code to accomplish this. Fortunately, QuickThread does the hard work for you.
Abbreviated code follows. Error tests, initialization, and some functionality removed. Full code (links there to) will be provided in a full feature article on ddj.com/go-parallel.
// Define your parallel pipeline I/O context object
class MyPipelineIoContext : public qt::PipelineIoContext
{
public:
... // i/o context data here (e.g. FILE *inputStream; )
MyPipelineIoContext(); // ctor
~MyPipelineIoContext(); // dtor
// your arbitrary init function for the I/O context
int Init(char *_inputFile, int _arrayCount);
}; // class MyPipelineIoContext : public qt::PipelineIoContext// Define your pipeline buffer token
class MyPipelineBuffer : public qt::PipelineBuffer
{
public:
// N.B. PARSEC Black-Scholes original source
// came with option data stored three ways:
// a) externally as text lines
// b) internally as an Array Of Structures
// c) internally rearranged as Structure Of Arrays (for SIMD)
// A later optimization can be made by eliminating the AOS
// and reading directly into SOA
// Option data as text lines for read-in (prior to conversion)
OptionDataAsText* inputLines; // Allocate to pio->arrayCount
// Option data as Array Of Structures
OptionData * data; // Allocate to pio->arrayCountint numOptionsInBuffer;
int baseOptionNumberInBuffer;
int arrayCount; // allocation count of elements in arrays// The ctor of the pipeline buffer is called from
// from within the startup code of the pipeline
// The (your) pipeline buffer may have fixed size buffers
// (allocated by way of the QuickThread NUMA aware allocator)
// and/or dynamically allocated buffers allocated by your ctor.
// Here we use buffers allocated by your ctor.
// (allocated by way of the QuickThread NUMA aware allocator)
MyPipelineBuffer(MyPipelineIoContext* pio);// dtor frees up buffers
~MyPipelineBuffer();// Convert from input text lines to Array Of Structures
// almost identical code to your serial code
bool ConvertTextLinesToAOS()
{
for ( int i = 0; i < numOptionsInBuffer; ++i )
{
int rv = sscanf(inputLines[i].line,
"%f %f %f %f %f %f %c %f %f",
&data[i].s, &data[i].strike,
&data[i].r, &data[i].divq,
&data[i].v, &data[i].t,
&data[i].OptionType, &data[i].divs,
&data[i].DGrefval);
}
return true;
}// Convert AOS to SOA
// almost identical code to your serial code
// excepting for partial sources coming from within
// your MyPipelineBuffer
void ConvertAOStoSOA();
}; // class MyPipelineBuffer : public qt::PipelineBuffer// task for reading buffer
void PipelineReadBuffer(MyPipelineIoContext* io, MyPipelineBuffer* b)
{
... sanity checks and variable initialization code here
// PipelineReadBuffer performs as little work as possible
// in this case this is reduce to reading the appropriate
// number of lines into the line-by-line input buffers
for ( int i = 0; i < b->numOptionsInBuffer; ++i )
{
if(fgets(b->inputLines[i].line,
MAX_INPUT_LINE_SIZE,
io->inputStream)
== NULL)
{
... error code
return;
}
}
} // void PipelineReadBuffer// task to process buffer
void PipelineProcessBuffer(MyPipelineBuffer* b)
{
// Convert from input text lines to Array Of Structures
if(!b->ConvertTextLinesToAOS())
{
b->Status = ExitError$;
return; // error
}// Convert AOS to SOA
b->ConvertAOStoSOA();
} // void PipelineProcessBuffer(MyPipelineBuffer* b)int main(...
{
... initialization// QuickThread thread pool initialization
qtInit qtInit(nThreads, 2);// # compute threads, # I/O threads// instantiate a QuickThread pipeline I/O context
MyPipelineIoContext pio;// Initialize the I/O context and return the number of options
numOptions = pio.Init(inputFile, bufferSizeInElements);
... allocate buffers based on numOptions// instantiate QuickThread pipeline object
qtPipelinepipeline; // add pipes (2 for two-stage pipeline)
pipeline.addPipe(PipelineReadBuffer);
pipeline.addPipe(PipelineProcessBuffer);// start the pipeline
pipeline.run(&pio);// (run other code here if you want prior to waiting)
pipeline.WaitTillDone(); // wait for pipeline to finish... input data loaded
... remainder of program here
}
The amount of coding effort will vary from application to application. In the Black-Scholes application approximately 3 pages of code were added. And a significant portion of this code was a copy and paste from an earlier parallel pipelined program. As you get familiar with programming using this technique figure on one day’s worth of effort (including testing) to install a parallel pipeline into your code.
My next article will cover a parallel two stage output technique that does not use a pipeline.
Jim Dempsey
http://www.quickthreadprogramming.com
Designing Parallel Algorithms: Part 3
Agglomeration: Moving from the abstract toward the concrete
Advisor Origins: Part 2
Putting variables to workPredicting and Measuring Parallel Performance
Structured Patterns: An Overview
- Intel Parallel Studio; Download the free eval today!
- Parallelism Breakthrough Video Series; Watch and learn more about Intel® Parallel Studio
- 2009 Intel Software Webinar Series; View On-Demand webinars
- Coding for Multi-core Processes; Intel® Compiler Pro eBook
- Performance Through Parallelism; Intel® Tuning for Vista eBook
- Intel® Software Network; Connect with developers and Intel engineers
-
December 15, 2009
How to Use Intel® Parallel Studio to Streamline Code Development in a Multicore Environment
Speaker: Matt Dunbar, Director for Performance Technology, SIMULIA (Bio)Matt Dunbar is the director for performance technology at SIMULIA. Since joining the company in 1993, he has worked on parallelization of the Abaqus suite of products, initially for shared memory architectures and more recently for distributed memory architectures. Dunbar has also been intimately involved in selecting both the hardware and software tools used in the development of the Abaqus product line.
Abstract:
Resolve elusive, costly multithreading errors quickly and efficiently with Intel® Parallel Studio. While many coding problems that lead to bugs in software applications are typically straightforward logic errors, errors in managing memory and in multithreading code can sometimes take weeks to months to diagnose and fix. Matt Dunbar explores how and why taking advantage of multicore processors through multithreaded code is critical for compute-intensive applications. While spotlighting his work on SIMULIA's Abaqus finite element solver, Dunbar addresses the need for multicore execution and shares his experiences using Intel Parallel Studio to streamline code development in a multicore environment.


