Parallel

Optimizing Software for Multicore Processors

By Edwin Verplanke, May 07, 2007

With the potential for real performance gains, multicore processors present the challenge of deciding how to validate and optimize code.

Decompose Code

VolPack performs the same operation on each pixel, and this characteristic, combined with the relatively simple data structure, provides for a straightforward parallelization strategy. We used POSIX threads to divide the New_Pixel_Loop into four subloops, each running on one of four cores.

In our implementation, Core 1 first executes serial code to load the image, which precedes the dataflow in Figure 2, while the other three cores are idle. Next, Core 1 uses POSIX threads to spawn three threads for cores 2-4. Our software experts took two days to inspect the code, parallelize the image-rendering code, and test the workload balance.
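The strategy described above, one contiguous subloop per core, can be sketched in a few lines of C. This is a minimal illustration and not VolPack code: `slice_args`, `process_slice`, and `process_pixels_parallel` are hypothetical names, the per-pixel operation is a stand-in (inverting an 8-bit value), and four threads are assumed to match the four cores.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4   /* assumption: one thread per core, four cores */

/* Hypothetical per-thread work description (not a VolPack type). */
typedef struct {
    unsigned char *pixels;  /* shared image buffer */
    size_t begin;           /* first pixel index for this thread */
    size_t end;             /* one past the last pixel index */
} slice_args;

/* Stand-in for the per-pixel operation: invert an 8-bit value. */
static void *process_slice(void *p)
{
    slice_args *a = p;
    for (size_t i = a->begin; i < a->end; i++)
        a->pixels[i] = (unsigned char)(255 - a->pixels[i]);
    return NULL;
}

/* Split [0, count) into NUM_THREADS contiguous slices and process them. */
void process_pixels_parallel(unsigned char *pixels, size_t count)
{
    pthread_t threads[NUM_THREADS];
    slice_args args[NUM_THREADS];
    int spawned[NUM_THREADS] = {0};

    assert(pixels != NULL);
    for (int t = 0; t < NUM_THREADS; t++) {
        args[t].pixels = pixels;
        args[t].begin = count * (size_t)t / NUM_THREADS;
        args[t].end = count * (size_t)(t + 1) / NUM_THREADS;
    }
    /* The calling thread keeps slice 0; helpers take the others. */
    for (int t = 1; t < NUM_THREADS; t++)
        spawned[t] = (pthread_create(&threads[t], NULL,
                                     process_slice, &args[t]) == 0);
    process_slice(&args[0]);
    for (int t = 1; t < NUM_THREADS; t++) {
        if (spawned[t])
            pthread_join(threads[t], NULL);
        else
            process_slice(&args[t]);  /* fallback: run the slice serially */
    }
}
```

Because the slices do not overlap, the threads need no locking while they work; the only synchronization is the final join.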

Create the Threads. Listing One creates the worker threads using pthread_create. The loop starts at 1 because the calling thread acts as thread 0, so NUM_THREADS - 1 new threads are created and the total is one thread per core. Each thread executes the same function, vp_thread_task_loop. Listing One uses a variable NUM_THREADS, which corresponds to the number of cores in the system. Because the number of cores is not hard coded, the code can easily be ported to run on systems with any number of cores.

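/* Excerpt from the caller; the enclosing condition is not shown. */
{
        vp_begin_threads();
        vp_threads_begun = 1;
}

/* Create one worker thread for each core beyond the first. */
void vp_begin_threads()
{
        int i;
        int mask = 0xf;         /* unused in this excerpt */

        vp_pthreads_data = vp_create_thread_data();
        if (vp_pthreads_data == NULL)
        {
                printf("Unable to allocate memory for threads.\n");
                return;
        }

        /* Thread 0 is the calling thread, so creation starts at 1. */
        for (i = 1; i < NUM_THREADS; i++)
        {
                vp_pthreads_args[i].data = vp_pthreads_data;
                vp_pthreads_args[i].id = i;
                if (pthread_create(&(vp_threads[i]), NULL, vp_thread_task_loop,
                                   &(vp_pthreads_args[i])) != 0)
                {
                        printf("Unable to create thread %d.\n", i);
                        return;
                }
        }
}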

Listing One
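Listing One does not show how NUM_THREADS is set. If it is a compile-time constant, porting to a different core count means rebuilding. One alternative is to query the core count at run time, sketched below. This is an assumption about how such a port could look and not the authors' code: `detect_num_threads`, `spawn_and_join`, and `worker_arg` are hypothetical names, and `sysconf(_SC_NPROCESSORS_ONLN)` is a widely available extension (glibc, the BSDs, macOS), not part of the base POSIX specification.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical argument block: each worker records its id in *slot. */
typedef struct { int id; int *slot; } worker_arg;

static void *worker(void *p)
{
    worker_arg *a = p;
    *a->slot = a->id;
    return NULL;
}

/* Number of online cores, or 1 if the query fails. */
int detect_num_threads(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n < 1 ? 1 : (int)n;
}

/* Create nthreads - 1 workers (the caller counts as thread 0), wait for
   them, and return how many were created. Returns 0 if allocation fails. */
int spawn_and_join(int nthreads, int *out)
{
    assert(nthreads >= 1 && out != NULL);
    pthread_t *threads = malloc((size_t)nthreads * sizeof *threads);
    worker_arg *args = malloc((size_t)nthreads * sizeof *args);
    char *ok = calloc((size_t)nthreads, 1);
    int created = 0;

    if (threads && args && ok) {
        out[0] = 0;  /* the calling thread acts as thread 0 */
        for (int i = 1; i < nthreads; i++) {
            args[i].id = i;
            args[i].slot = &out[i];
            if (pthread_create(&threads[i], NULL, worker, &args[i]) == 0) {
                ok[i] = 1;
                created++;
            }
        }
        for (int i = 1; i < nthreads; i++)
            if (ok[i])
                pthread_join(threads[i], NULL);
    }
    free(threads);
    free(args);
    free(ok);
    return created;
}
```

A caller would typically pass `detect_num_threads()` as `nthreads`. Checking the return value of `pthread_create` matters here because thread creation can fail when the system is short on resources.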

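/* Excerpt: i, vpc, kstart, kinc, the counts, the strides, and
   composite_func are defined by the enclosing function. */
vp_thread_big_loop_args loop_args[NUM_THREADS];
int num = (kcount) >> THREAD_SHIFT;     /* k iterations per thread */
int extras = (kcount) & THREAD_MASK;    /* leftover k iterations */
int cur_num = kstart;

/* Fill in the fields that are the same for every thread. */
loop_args[0].vpc = vpc;
loop_args[0].kstart = kstart;
loop_args[0].kinc = kinc;
loop_args[0].icount = icount;
loop_args[0].jcount = jcount;
loop_args[0].kcount = kcount;
loop_args[0].istride = istride;
loop_args[0].jstride = jstride;
loop_args[0].kstride = kstride;
loop_args[0].composite_func = composite_func;

/* Copy the shared fields, then give each thread its own k range. */
for (i = 0; i < NUM_THREADS; i++)
{
        loop_args[i] = loop_args[0];
        loop_args[i].kmystart = cur_num;
        loop_args[i].id = i;
        cur_num += (num * kincr);           /* kincr as printed; the field above is kinc */
        cur_num += ((i < extras) * kincr);  /* the first 'extras' threads take one more */
        loop_args[i].kstop = cur_num;
}

/* Hand each worker thread its arguments and wake it. */
vp_pthreads_data->completed_threads = 0;
for (i = 1; i < NUM_THREADS; i++)
{
        vp_pthreads_data->inputs[i] = &(loop_args[i]);
        vp_pthreads_data->task_number = LOOP_TASK;
        pthread_cond_signal(vp_pthreads_data->task_cond[i]);
}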

Listing Two
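The range arithmetic in Listing Two is easier to follow with concrete numbers. The sketch below isolates it in a function. The values of THREAD_SHIFT and THREAD_MASK are assumptions (the article does not show them): with four threads, a shift of 2 and a mask of 3 make the shift a division by four and the mask a remainder. `k_range` and `split_slices` are hypothetical names.

```c
#include <assert.h>

#define THREAD_SHIFT 2                     /* assumption: log2 of the thread count */
#define NUM_THREADS  (1 << THREAD_SHIFT)   /* 4 */
#define THREAD_MASK  (NUM_THREADS - 1)     /* 0x3 */

/* Hypothetical result type: the half-open k range given to one thread. */
typedef struct { int kmystart; int kstop; } k_range;

/* Divide kcount iterations, starting at kstart and stepping by kinc,
   among NUM_THREADS threads. The first (kcount % NUM_THREADS) threads
   each take one extra iteration, so the load differs by at most one. */
void split_slices(int kstart, int kcount, int kinc, k_range out[NUM_THREADS])
{
    int num = kcount >> THREAD_SHIFT;    /* kcount / NUM_THREADS */
    int extras = kcount & THREAD_MASK;   /* kcount % NUM_THREADS */
    int cur = kstart;

    assert(kcount >= 0);
    for (int i = 0; i < NUM_THREADS; i++) {
        out[i].kmystart = cur;
        cur += num * kinc;
        cur += (i < extras) * kinc;      /* comparison yields 0 or 1 */
        out[i].kstop = cur;
    }
}
```

For kstart = 0, kcount = 10, and kinc = 1, num is 2 and extras is 2, so the four ranges are [0, 3), [3, 6), [6, 8), and [8, 10). The same code handles back-to-front traversal when kinc is -1.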

Start the Threads. After the threads have been created, the main thread on the first core (id 0 in the listings) parallelizes Amide at the New_Pixel_Loop to run on four cores. Listing Two illustrates how this is accomplished. Each core is assigned its starting index in the loop_args array using the statement:

loop_args[i].kmystart = cur_num;

Each core receives one quarter of the array indexes, which it uses to process its share of the image volume, through its loop_args[i] entry. To start the worker threads, the main thread first gives each one a pointer to its arguments with:

vp_pthreads_data->inputs[i] = &(loop_args[i]);

Next, the main thread selects the LOOP_TASK code for each core to execute:

vp_pthreads_data->task_number = LOOP_TASK;

Finally, each core is released to begin processing using:

pthread_cond_signal(vp_pthreads_data->task_cond[i]);
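Listing Two shows only the dispatching side. The worker side (vp_thread_task_loop) does not appear in the article, so the following is a sketch of a common pattern and not the authors' implementation: each worker sleeps on its own condition variable until a task is posted, runs it, and reports completion. Apart from LOOP_TASK, every name here is hypothetical, and doubling an integer stands in for the rendering subloop. Error handling is kept minimal.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_WORKERS 3                 /* assumption: cores 2-4 */
enum { NO_TASK, LOOP_TASK, QUIT_TASK };

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t done_cond = PTHREAD_COND_INITIALIZER;
static pthread_cond_t task_cond[NUM_WORKERS];
static pthread_t threads[NUM_WORKERS];
static int ids[NUM_WORKERS], task[NUM_WORKERS];
static int input[NUM_WORKERS], output[NUM_WORKERS];
static int completed;

static void *task_loop(void *arg)
{
    int id = *(int *)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (task[id] == NO_TASK)   /* loop guards against spurious wakeups */
            pthread_cond_wait(&task_cond[id], &lock);
        int t = task[id];
        task[id] = NO_TASK;
        pthread_mutex_unlock(&lock);
        if (t == QUIT_TASK)
            return NULL;
        output[id] = input[id] * 2;   /* stand-in for the rendering subloop */
        pthread_mutex_lock(&lock);
        completed++;
        pthread_cond_signal(&done_cond);
        pthread_mutex_unlock(&lock);
    }
}

/* Post a task to one worker and wake it. */
static void post(int id, int t)
{
    pthread_mutex_lock(&lock);
    task[id] = t;
    pthread_cond_signal(&task_cond[id]);
    pthread_mutex_unlock(&lock);
}

/* Start the workers. Returns 0 on success, -1 on the first failure. */
int pool_start(void)
{
    for (int i = 0; i < NUM_WORKERS; i++) {
        ids[i] = i;
        task[i] = NO_TASK;
        if (pthread_cond_init(&task_cond[i], NULL) != 0 ||
            pthread_create(&threads[i], NULL, task_loop, &ids[i]) != 0)
            return -1;
    }
    return 0;
}

/* Dispatch one LOOP_TASK per worker and block until all have finished. */
void pool_run(const int in[NUM_WORKERS], int out[NUM_WORKERS])
{
    pthread_mutex_lock(&lock);
    completed = 0;
    pthread_mutex_unlock(&lock);
    for (int i = 0; i < NUM_WORKERS; i++) {
        input[i] = in[i];
        post(i, LOOP_TASK);
    }
    pthread_mutex_lock(&lock);
    while (completed < NUM_WORKERS)
        pthread_cond_wait(&done_cond, &lock);
    pthread_mutex_unlock(&lock);
    for (int i = 0; i < NUM_WORKERS; i++)
        out[i] = output[i];
}

/* Ask every worker to exit, then join and clean up. */
void pool_stop(void)
{
    for (int i = 0; i < NUM_WORKERS; i++)
        post(i, QUIT_TASK);
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_join(threads[i], NULL);
        pthread_cond_destroy(&task_cond[i]);
    }
}
```

Keeping the threads alive between tasks avoids paying the thread-creation cost on every rendered frame. The task flag is always read and written under the mutex, which is what keeps a signal sent before the worker starts waiting from being lost.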
