Loop Splitting Under Windows NT

Charles presents a C++ class that lets you use multiple threads to implement "loop splitting"that is, dividing processing of a loop over multiple processorsunder Windows NT.

July 01, 1996
URL:http://www.drdobbs.com/windows/loop-splitting-under-windows-nt/184409918

July 1996: Loop Splitting Under Windows NT

Loop Splitting Under Windows NT

Presenting the CParallelProcess class

Charles Letner

Charles is a member of the department of biochemistry at Wright State University. He can be contacted at [email protected].

As the need for faster code execution increases, so does the importance of parallel processing. One approach to parallel processing is that of "loop splitting"-dividing processing of a loop over multiple processors. As Figure 1 illustrates, this is accomplished in the following manner: One processor executes the instructions for a single iteration through the loop while another executes the instructions for a separate iteration through the loop until all available processors are utilized. When the first processor finishes its pass though the loop, it starts on another iteration of the loop. In an optimal situation, it is possible to decrease the loop-processing time by a factor approximately equal to the number of processors available to execute the loop. One requirement of loop splitting is that each pass through the loop not rely on the results of any other pass, because there is no guarantee of the order in which iterations of the loop will be executed.

In this article, I'll present a C++ class that uses multiple threads to implement loop splitting under Windows NT. For each processor available to the loop, a thread will be created. Each thread will be assigned a set of loop iterations to process. Because NT doesn't provide a mechanism for assigning a thread to a processor, there is no guarantee that multiple processors will be used. Instead, the NT scheduler determines the need for multiple processing based on job priorities and load levels on the machine running the code. Though this could be considered a drawback of parallel processing under NT, you'll see that the scheduler does handle the sample applications efficiently.

The CParallelProcess Class

CParallelProcess (see Listings One and Two) is the base class for a derived class that specifies the data and member functions. This allows the parallel-processing class to be reused in multiple projects. By hiding the implementation of loop splitting, the class can be ported to other parallel architectures without modifying the actual code defining the derived-class data and member functions.

As Figure 2 illustrates, CParallelProcess's two most important functions are CallParallelLoop and ParallelLoop. Also important is the pure virtual function LoopFunction. The base class also includes functions to manipulate the number of processors available to the loop. A critical-section object (m_CSforProcessors) and event object (m_CounterEvent) are instantiated in this class. These are used to protect the data determining the number of processors to act on the loop and the data initializing the indexes for each thread of the loop.

The definition of CallParallelLoop is void CallParallelLoop(unsigned int m_First, unsigned int m_Last);. The m_First and m_Last parameters specify the number of iterations through the loop and are analogous to the first two expressions of a for statement: m_First specifies the first index of the loop, and m_Last specifies the last. m_CSforProcessors, which is entered after initialization of local variables in this function, protects the values of m_Processors, the variable that specifies the number of processors available to the loop. It is conceivable that another thread could call SetNumberProcessors while CallParallelLoop is executing, possibly creating problems for the executing loop.

Next, arrays for thread handles and thread IDs are allocated. Each thread has separate indexes. If two threads are created, one thread will go through the even indexes and the other through the odd indexes. The indexes for each thread are determined by the values of the member variables m_Start, m_Stop, and m_Increment. These variables are used to initialize the local variables used in ParallelLoop's for statement. The m_CounterEvent event is entered to protect the values of these variables. It is exited in the ParallelLoop function after the local variables of ParallelLoop are initialized. At this point, the m_Start, m_Stop, and m_Increment variables are available, so another thread can be created.

Once this initialization work is complete, a loop to create the threads is entered. This loop creates as many threads as there are processors available to the loop. The threads are created with a call to CreateThread, defined in the Win32 API. The third and fourth parameters are ParallelLoop and this. ParallelLoop is a pointer to the function that will index through the loop for each thread. The this pointer is required so that the thread knows to use the instance of CParallelProcess that executed CallParallelLoop. After this loop is finished executing, the main thread is put to sleep by the call to WaitForMultipleObjects. The first parameter tells the number of threads created, and the second parameter identifies the threads.

The main thread waits at this statement until all the created threads are finished. The last step is to release all handles and delete the arrays for the handles and thread IDs. Finally, the critical section for m_Processors is left.

In the call to CreateProcess, the pointer to ParallelLoop was specified. This function is declared as static DWORD WINAPI ParallelLoop(LPVOID lpvParameters);. The form of this declaration is specified by the Win32 API definition of CreateThread. The function must take a parameter of type LPVOID and return a DWORD. The LPVOID parameter provides the means for accessing the this pointer specified as the fourth parameter to CreateThread.

The first job of ParallelLoop is to declare a pointer to the instance that created the thread. This is accomplished by the first line of code in ParallelLoop. This class is meant to be used as a base class for a derived class, so there is no way to know the derived class name beforehand. To circumvent this, the derived class header file is required to contain a preprocessor definition for dDERIVED_CLASS_NAME. Additionally, dDERIVED_CLASS_HEADER should be defined as the name of the header file of the derived class, which is used in the statement #include dDERIVED_CLASS_HEADER at the beginning of the CParallelProcess header file. With this statement, the compiler can resolve the declarations required for this function. Although the ISO C standard specifies that the preprocessor allow this statement, Microsoft Visual C++ 2.2 does not follow the convention. If that's the variety of C++ you're using, the #include statement of CParallelProcess.h must be changed by hand. With these definitions, the pointer lpObject is instantiated and assigned the this pointer that has been cast to the derived class name. Consequently, ParallelProcess has access to all members of the derived class. Next, the values of first, last, and step are initialized. The critical section to protect m_Start, m_Step, and m_Increment is released after initialization of these variables and the loop is entered. The initialization of the counters controls how the loop functions. These initializations are, in turn, controlled by the number of processors (m_Processors) and the parameters (m_First and m_Last) passed to CallParallelLoop. The for statement of ParallelLoop contains index < last+1. last is defined as the last index of the loop. By allowing the for loop to run while index is less than this value, the need to create a separate value of last for each thread is avoided. The for loop is executed and LoopFunction is called with the index from the loop. LoopFunction is a pure virtual function in CParallelProcess, and it must be defined in the derived class. The purpose is to provide a stub in which the derived class can specify the functionality for each pass through the loop.

Sample Implementations

A sample implementation of CParallelProcess called "sample1.cpp," which simulates a long loop, is available electronically. In fact, all it does is enter a loop that chews up processor time. Also available electronically, sample2.cpp is a more practical use of the CParallelProcess class that demonstrates how you can create multiple parallel loops with different functionality.

The file CParallelCounter.h (Listing Three) derives a class CParallelCounter from CParallelProcess. The first function is a constructor that takes a single parameter and passes it on to the constructor for CParallelProcess. This parameter determines the number of threads and, as such, the possible number of processors that will be used to execute the parallel loop. Recall from CParallelProcess that the pure virtual function LoopFunction must be overridden. This function determines the action that the loop takes with each pass through the loop. The definition of LoopFunction in CParallelCounter takes a single variable. This parameter is not used in this implementation but is included only to get the override of CParallelProcess's virtual function correct. The body of the function is nothing more than a for loop that repeats a division operation, simulating a long set of instructions inside the parallel loop.

Sample1.cpp instantiates two objects of CParallelCounter. The first, scalar_object, passed the value 1 to the constructor. This is passed from CParallelCounter's constructor to CParallelProcess's constructor, which, in turn, sets the variable m_Processors to 1. In this way, the object is instantiated so that only a single thread will be created to process the loop. A second instance, parallel_object, is passed the value 2, which will create two threads and allow two processors to process the loop. After all the objects are created, the object scalar_object calls CallParallelLoop. Two parameters are passed, indicating the starting and final indexes that the loop should go through. The equivalent calls are made for parallel_object. Note that variables and functions are included to determine the maximum clock time required to complete the loop processing. Using Visual C++ 2.2, set for a release compile (and optimized for Pentiums), the scalar loop requires ten seconds, while the parallel loop requires only five seconds on dual 100-MHz Pentiums. The code was statically linked with libcmt.lib.

Sample2.cpp is more interesting. In this code, the derived class initializes an array and then sums its values. There are two potential candidates for parallel processing: array initialization and sum-of-array calculation. A problem arises in specifying two separate but different functions for parallel processing. The solution is to use pointers to functions.

In the data section of CMatrixSum (Listing Four) is the declaration of a pointer to a member function of CMatrixSum that takes an unsigned int as a parameter: void (CMatrixSum::*fp)(unsigned int);. The function LoopFunction is declared as before. However, note that it calls the function pointed to by the pointer fp and passes it the variable value. value indicates to the function the current index of the loop. The this pointer is used to resolve the call to the function pointer. With this code in place, the only thing that remains is to declare the processing that occurs in the loop. First, initialize the array in the class constructor using InitalizeArray. This function sets the pointer fp to CMatrixSum::Initialize, which sets the value of the array element (indicated by the parameter index) to 1.0. Next, CallParallelLoop is called to start the loop processing. The other function needed is SumArray. This function assigns the function CMatrixSum::Sum to the pointer fp. Sum takes the values at each array location and adds them to obtain a sum of all the values in the array. A critical section is declared that protects the value of m_Sum. It is conceivable that two threads could be calculating the sum at the same time. If this occurs, the wrong result could be obtained. One thread could obtain the value of m_Sum and do the addition, while a second thread could be finishing up its addition phase and reassigning a new value to m_Sum. The first thread would not see this update to m_Sum and would update m_Sum without consideration of the value from the second thread. The use of the critical section here solves the problem of inaccurate results due to multiple threads manipulating m_Sum, but it slows down the loop. This results from the possibility of a thread having to wait at the critical section for another thread to finish with m_Sum. Another way to handle this is to declare an instance of m_Sum for each thread that will be processing the loop. Each thread can now update its m_Sum variable whenever it needs to without fear of corrupting another thread's result. Finally, the m_Sum from each thread will have to be summed to obtain the result for the whole array.

This implementation of CParallelProcess leapfrogs through the data. Each thread is indexed by the number of threads processing the loop. In the case of two threads, each thread processes every other index. It would be possible to have one thread process the first half of the indexes and the second process the second half. This may offer a speed advantage in that cache could be utilized more efficiently.

I/O should not be included in the loop; this would slow down processing due to threads having to wait for access to the I/O device. (For example, two threads cannot write to the screen simultaneously.)

For a short loop, it is probably not beneficial to implement loop splitting; the overhead of creating the parallel loop could very well outweigh any benefit. On the other hand, a loop that takes a significant amount of time is a candidate for loop splitting. Remember that to qualify for splitting, each iteration of the loop must not depend on a previous iteration of the loop, since there is no guarantee on the order of the passes though the loop.

Conclusion

The class and sample programs I've presented here demonstrate that parallel processing is feasible under NT, allowing us to begin running more computationally intense simulations on NT-based machines. The class also allows for the investigation of parallel code on machines of lower cost than many of the parallel machines currently available.

Listing One

// File: CParallelProcess.h -- Programmer: Charles Letner

/* This file contains the definiton of a class that implements loop splitting.There is a pure virtual function void LoopFunciton(unsigned int index) that must be defined in a derived class. This derived function contains the code 
for a single pass through the loop. The parameter is the index of the pass through the loop, ie/ 0->max index. Note that there is no garantee that lower indexs will be processed before a higher index. The derived header should include this file. Before the inclusion of this file the derived class header file must contain the two definitions: #define dDERIVED_CLASS_NAME <class name> (this is required for type casts that will be made in the function ParallelLoop) and #define dDERIVED_CLASS_HEADER "<class header file>" (which defines the 
name of the header file for the dervied class. This is needed so that CParallelProcess.cpp can get the definition of the derived class. Again this is required due to the pure virtual function ParallelLoop.)Though the ISO standard supports this second use of a preprocessor definiton MSVC++ version 2.1 does not. To compile with MSVC++ you will have to manually enter the derived class header in CParallelProcess.cpp.
*/

#ifndef cal_PARALLELPROCESS
#define cal_PARALLELPROCESS

#define dDEFAULT_PROCESSORS 2

#include <windows.h>

class CParallelProcess
{   // Memeber functions
    public:
        // Constructors and destructors
        CParallelProcess()
            {   m_Processors = dDEFAULT_PROCESSORS;
                m_CounterEvent = CreateEvent(NULL, FALSE, TRUE, NULL);
                InitializeCriticalSection(&m_CSforProcessors);
            };
        CParallelProcess(unsigned short int number_processors)
            {   m_Processors = number_processors;
                m_CounterEvent = CreateEvent(NULL, FALSE, TRUE, NULL);
                InitializeCriticalSection(&m_CSforProcessors);
            };
        ~CParallelProcess()
            {   CloseHandle(m_CounterEvent);
                DeleteCriticalSection(&m_CSforProcessors);
            };
        // This is the function that sets up the threads and executes
        // the parallel code.  It calls LoopFunction.
        void CallParallelLoop(unsigned int m_First, unsigned int m_Last);

        // Member functions to obtain access to the variable m_Processors.
        // These functions allow an application to change the number of 
        // processors that are available to a loop.
        unsigned int ReturnNumberProcessors(void)             {   return(m_Processors); };
        void SetNumberProcessors(unsigned int NumberProcessors)
            {   EnterCriticalSection(&m_CSforProcessors);
                m_Processors = NumberProcessors;
                LeaveCriticalSection(&m_CSforProcessors);
           };
    protected:
        // This is a pure virtual function.  It is required that derived
        // classes contain LoopFunction function to provide functionality
        // to the parallel loop.
        virtual void LoopFunction(unsigned int index)=0;
    private:
        // This is the function that each thread use to execute
        // the loop processing.
        static DWORD WINAPI ParallelLoop(LPVOID lpvParameters);
    // Member variables
    protected:
        // Variable defines the number of process to split loop over. This is 
        // private to allow for protection of this value from modification by 
        // another thread while CallParallelLoop is using this value;
        unsigned short int m_Processors;
        // These values are use to initalize the For loop in ParallelLoop.
        // Private to allow protection from modification by other threads.
        unsigned int m_Start;
        unsigned int m_Increment;
        unsigned int m_Stop;
        // The critical section and event objects to control accesd to 
        // m_Start/m_Increment/m_Stop and m_Processor
        HANDLE m_CounterEvent;
        CRITICAL_SECTION m_CSforProcessors;
};  // End class definition
#endif

Listing Two


// File: CParallelProcess.cpp -- Programmer: Charles Letner
// Choose one of these include depending on ISO compliency of your compiler. 
// MVC++ 2.1 does not support preprocessor defintions in #include statements. 

//#include dDERIVED_CLASS_HEADER
#include "CParallelCounter.h"

void CParallelProcess::CallParallelLoop(unsigned int first, unsigned int last)
{   HANDLE* lphThreads;
    DWORD* lpdThreadID;
    UINT process, index;

    // Enter a critical section to protect the value of m_Processors
    // from modification while the loop is executing.  Create
    // the array of handles and thread ID's for use in loop splitting.
    EnterCriticalSection(&m_CSforProcessors);
    lphThreads = new HANDLE[m_Processors];
    lpdThreadID = new DWORD[m_Processors]; 
    // Loop to create the threads that implement loop splitting.  One
    // thread is created for each processor available to execute the loop.
    for(process=first; process < m_Processors; process++)
    {   // Enter a Event to protect the values that initialize the
        // for loops of each thread.
        WaitForSingleObject(m_CounterEvent, INFINITE);  
        m_Start = process;
        m_Stop = last;
        m_Increment = m_Processors;
        lphThreads[process] = CreateThread(NULL, 0, ParallelLoop, this, 0, &lpdThreadID[process]);
    };  // End for loop

    // Put the main thread to sleep until all threads have exited and
    // the loop processing is complete
    WaitForMultipleObjects(m_Processors, lphThreads, TRUE, INFINITE);

    // Loop to close the handles to the threads
    for(index=0; index < m_Processors; index++)
        CloseHandle(lphThreads[index]);
    // Clean up the memory allocated for the thread handles
    // and arrays.  Leave the critical section that protects m_Processors.
    delete []lphThreads;
    delete []lpdThreadID;
    LeaveCriticalSection(&m_CSforProcessors);
}  // End CallParallelLoop

DWORD WINAPI CParallelProcess::ParallelLoop(LPVOID lpvParameters)
{   // Create a pointer to the object that called ParallelLoop by
    // use of the this pointer passed as the fourth parameter to
    // Create process in CallParallelLoop
    dDERIVED_CLASS_NAME *lpObject = (dDERIVED_CLASS_NAME *)lpvParameters;   
    unsigned int index, first, last, step;
    DWORD dwResult = 0;
    // Initialize the variables that determine the indexing of
    // the for loop.  Set the event to allow other
    // threads access to m_Start, m_Stop, and m_Increment.
    first = lpObject->m_Start;
    last = lpObject->m_Stop;
    step = lpObject->m_Increment;
    SetEvent(lpObject->m_CounterEvent);

    // Enter the for loop for the thread. Call LoopFunction to process data.  
    // The variable index contains the index of iteration through the loop.   
   for(index = first;  index < last+1; index = index + step )
        lpObject->LoopFunction(index);
    return(dwResult);
}  // End ParallelLoop

Listing Three


// File: CParallelCounter.h -- Programmer: Charles Letner

#define dDERIVED_CLASS_NAME CParallelCounter
// Determines the number of time that the loop to simulate long processing
// will run. See comments of sample1.cpp for information on timing obtained. 
#define dITERATIONS 10000

// Include the definiton of CParallelProcess
#include "CParallelProcess.h"

// Declare CParallelCounter to inherit CParallelProcess as public.  This
// is done so that CallParallelLoop is accesible in main.  
class CParallelCounter : public CParallelProcess
{   public:
       CParallelCounter(unsigned int processors) : CParallelProcess(processors)
            { /* Empty constructor */ };
        // function that implements functionality for each iteration of loop.
        void LoopFunction(unsigned int iteration)
            {   unsigned int index;
                double value;
                for(index=0; index < dITERATIONS; index++)
                    value = (double) 300.0 / (double) 1.5;
            };
}; // End CParallelCounter definition

Listing Four


// File: CMatrixSum.h -- Programmer: Charles Letner

#ifndef cal_CMATRIX_SUM
#define cal_CMATRIX_SUM
#define dDERIVED_CLASS_NAME CMatrixSum

#define dNUMBER_ELEMENTS 10000

#include <iostream.h>
#include <windows.h>

#include "CParallelProcess.h"

class CMatrixSum : protected CParallelProcess
{   public:
          CMatrixSum(unsigned int number_processors) : 
                                           CParallelProcess(number_processors)
            { InitializeArray();            
              m_Sum = (double) 0.0;
              InitializeCriticalSection(&m_CSforSum);
            };
        ~CMatrixSum()
            {   DeleteCriticalSection(&m_CSforSum);
            };
        // Two functions that can be called to process the array with
        // multipule processors
        void SumArray(void)
            {   // Initialize the pointer fp to point to Sum
                // and call CallParallelLoop.  The second parameter
                // passed to CallParallelLoop is the maximum allowed.
                fp = &CMatrixSum::Sum;
                CallParallelLoop(0, dNUMBER_ELEMENTS-1);  
            };
        void InitializeArray(void)
            {   // Initialize the pointer fp to point to Initialize
                // and call CallParallelLoop.  The second parameter
                // passed to CallParallelLoop is the maximum index
                // allowed.
                fp = &CMatrixSum::Initialize;
                CallParallelLoop(0, dNUMBER_ELEMENTS-1);
            };
        // A function to print out the sum of the elements in array
        void PrintSum(void)
            {   cout << "The sum of the array is: "<< m_Sum << endl;
            };
        /* ------  Section of definition needed for parallel processing ---- */
        // The function called by CParallelPrcoess to process a single index
       void LoopFunction(unsigned int value)
            {   // Call the function pointed to by the pointer fp and pass
                // it the value.
                (this->*fp)(value);
            };
        // The two functions that perform the tasks for each pass through the
        // parallel loop
        void Initialize(unsigned int index)
        {   m_Array[index] = (double) 1.0;
        };
        void Sum(unsigned int index)
        {   // Envoke a critical section so that two threads can not be
            // accessin m_Sum at the same time.
            EnterCriticalSection(&m_CSforSum);
            m_Sum = m_Sum + m_Array[index]; 
            LeaveCriticalSection(&m_CSforSum);
        }; 
    private:
        double m_Array[dNUMBER_ELEMENTS];
        double m_Sum;
        // The declaration of a pointer to a member function of CMatrixSum
        // that takes an unsigned int as a parameter.
        void (CMatrixSum::*fp)(unsigned int);
        // The critical section to protect the variable m_Sum
        CRITICAL_SECTION m_CSforSum;
}; // End CMatrixSum definiton
#endif