Decompose Code
VolPack performs the same operation on each pixel, and this characteristic, combined with the relatively simple data structure, provides for a straightforward parallelization strategy. We used POSIX threads to divide the New_Pixel_Loop into four subloops, each running on one of four cores.
In our implementation, Core 1 first executes serial code to load the image, which precedes the dataflow in Figure 2, while the other three cores are idle. Next, Core 1 uses POSIX threads to spawn three threads for cores 2-4. Our software experts took two days to inspect the code, parallelize the image-rendering code, and test the workload balance.
Create the Threads. Listing One creates the worker threads using pthread_create, one for each core other than the one already running the main thread (the loop starts at index 1). Each thread executes the same function, vp_thread_task_loop. The thread count comes from the variable NUM_THREADS, which corresponds to the number of cores in the system. Because the number of cores is not hard coded, the code can easily be ported to run on systems with any number of cores.
Listing One

    {
        vp_begin_threads();
        vp_threads_begun = 1;
    }

    void vp_begin_threads()
    {
        int i;
        int mask = 0xf;

        vp_pthreads_data = vp_create_thread_data();
        if (vp_pthreads_data == NULL) {
            printf("Unable to allocate memory for threads.\n");
            return;
        }
        for (i = 1; i < NUM_THREADS; i++) {
            vp_pthreads_args[i].data = vp_pthreads_data;
            vp_pthreads_args[i].id = i;
            pthread_create(&(vp_threads[i]), NULL, vp_thread_task_loop,
                           &(vp_pthreads_args[i]));
        }
    }
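The listing above takes the thread count from NUM_THREADS. As a rough illustration of the portability point, the sketch below sizes the thread pool at run time instead. It is not VolPack code: worker_arg, worker_main, core_count, and run_workers are hypothetical names, and _SC_NPROCESSORS_ONLN is a widely available sysconf extension (Linux, macOS, the BSDs) rather than something POSIX requires.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical per-thread argument record (not VolPack's actual type). */
typedef struct {
    int id;        /* worker index; index 0 is the calling thread */
    int *visited;  /* shared array; each worker marks only its own slot */
} worker_arg;

/* Hypothetical stand-in for vp_thread_task_loop: records that it ran. */
static void *worker_main(void *p)
{
    worker_arg *arg = p;
    arg->visited[arg->id] = 1;
    return NULL;
}

/* Number of online cores, falling back to 1 if the query fails. */
static int core_count(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? (int)n : 1;
}

/* Spawn n-1 workers (the caller acts as thread 0), join them, and
 * return how many threads, including the caller, did their share. */
static int run_workers(int n)
{
    assert(n >= 1);
    pthread_t *threads = calloc((size_t)n, sizeof *threads);
    worker_arg *args = calloc((size_t)n, sizeof *args);
    int *visited = calloc((size_t)n, sizeof *visited);
    int created = 0, ran = 0;

    if (threads && args && visited) {
        visited[0] = 1;                      /* the caller's own share */
        for (int i = 1; i < n; i++) {
            args[i].id = i;
            args[i].visited = visited;
            if (pthread_create(&threads[i], NULL, worker_main, &args[i]) != 0)
                break;                       /* stop at the first failure */
            created = i;                     /* index of last live worker */
        }
        for (int i = 1; i <= created; i++)
            pthread_join(threads[i], NULL);
        for (int i = 0; i < n; i++)
            ran += visited[i];
    }
    free(threads);
    free(args);
    free(visited);
    return ran;
}
```

A caller would typically write run_workers(core_count()). Unlike the listing, this sketch checks the return value of pthread_create and joins every thread it started.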
Listing Two

    vp_thread_big_loop_args loop_args[NUM_THREADS];
    int num = (kcount)>>THREAD_SHIFT;
    int extras = (kcount)&THREAD_MASK;
    int cur_num = kstart;

    loop_args[0].vpc = vpc;
    loop_args[0].kstart = kstart;
    loop_args[0].kinc = kinc;
    loop_args[0].icount = icount;
    loop_args[0].jcount = jcount;
    loop_args[0].kcount = kcount;
    loop_args[0].istride = istride;
    loop_args[0].jstride = jstride;
    loop_args[0].kstride = kstride;
    loop_args[0].composite_func = composite_func;

    for (i = 0; i < NUM_THREADS; i++) {
        loop_args[i] = loop_args[0];
        loop_args[i].kmystart = cur_num;
        loop_args[i].id = i;
        cur_num += (num * kincr);
        cur_num += ((i < extras) * kincr);
        loop_args[i].kstop = cur_num;
    }

    vp_pthreads_data->completed_threads = 0;
    for (i = 1; i < NUM_THREADS; i++) {
        vp_pthreads_data->inputs[i] = &(loop_args[i]);
        vp_pthreads_data->task_number = LOOP_TASK;
        pthread_cond_signal(vp_pthreads_data->task_cond[i]);
    }
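The slice arithmetic in the listing above (num, extras, cur_num) can be looked at in isolation. The sketch below is a minimal version under two assumptions: that THREAD_SHIFT and THREAD_MASK are log2(NUM_THREADS) and NUM_THREADS - 1, so the shift and mask amount to division and remainder, and that each thread gets a contiguous run of slices. slice_range and split_slices are hypothetical names, not part of VolPack.

```c
#include <assert.h>

/* Hypothetical range record; the field names echo the listing, but this
 * is not VolPack's vp_thread_big_loop_args. */
typedef struct {
    int kmystart;  /* first slice index for this thread */
    int kstop;     /* one step past this thread's last slice */
} slice_range;

/* Split kcount slices, starting at kstart and stepping by kinc, into
 * nthreads contiguous ranges. The first (kcount % nthreads) threads
 * each take one extra slice, so range sizes differ by at most one. */
static void split_slices(int kstart, int kcount, int kinc,
                         int nthreads, slice_range *out)
{
    assert(nthreads > 0 && kcount >= 0);
    int num = kcount / nthreads;     /* slices every thread gets */
    int extras = kcount % nthreads;  /* leftover slices to hand out */
    int cur = kstart;

    for (int i = 0; i < nthreads; i++) {
        out[i].kmystart = cur;
        cur += (num + (i < extras)) * kinc;
        out[i].kstop = cur;
    }
}
```

With 10 slices and four threads, for example, the ranges come out as 3, 3, 2, and 2 slices, which keeps the workload close to balanced even when the slice count is not a multiple of the thread count.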
Start the Threads. After the threads have been created, the main thread (index 0 in the listings) parallelizes Amide at the New_Pixel_Loop to run on four cores. Listing Two illustrates how this is accomplished. Each core is assigned its starting index in the loop_args array using the statement:
loop_args[i].kmystart = cur_num;
Roughly one quarter of the array indexes is passed to each core through loop_args[i], so that each core processes its own part of the image volume. The four cores are started by first storing a pointer to each core's arguments with:
vp_pthreads_data->inputs[i] = &(loop_args[i]);
Next, each core is told to execute the LOOP_TASK code by setting:
vp_pthreads_data->task_number = LOOP_TASK;
Finally, each core is released to begin processing using:
pthread_cond_signal(vp_pthreads_data->task_cond[i]);
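The article does not show the worker side of this handshake, so the following is only a sketch of how a function such as vp_thread_task_loop might wait for its signal. It uses one worker, one mutex, and two condition variables. worker_slot, task_loop, dispatch_and_wait, NO_TASK, and QUIT_TASK are hypothetical names, and summing an array stands in for the real rendering work.

```c
#include <assert.h>
#include <pthread.h>

enum { NO_TASK = 0, LOOP_TASK = 1, QUIT_TASK = 2 };  /* hypothetical codes */

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t task_cond;  /* signaled when task_number changes */
    pthread_cond_t done_cond;  /* signaled when a task has finished */
    int task_number;
    const int *input;          /* data for the current task */
    int input_len;
    long result;
    int completed;
} worker_slot;

/* Worker: sleep until a task is posted, run it, report back, repeat. */
static void *task_loop(void *p)
{
    worker_slot *s = p;
    pthread_mutex_lock(&s->lock);
    for (;;) {
        while (s->task_number == NO_TASK)   /* guards against spurious wakeups */
            pthread_cond_wait(&s->task_cond, &s->lock);
        if (s->task_number == QUIT_TASK)
            break;
        const int *data = s->input;
        int len = s->input_len;
        pthread_mutex_unlock(&s->lock);     /* do the work without the lock */
        long sum = 0;
        for (int i = 0; i < len; i++)
            sum += data[i];
        pthread_mutex_lock(&s->lock);
        s->result = sum;
        s->task_number = NO_TASK;
        s->completed = 1;
        pthread_cond_signal(&s->done_cond);
    }
    pthread_mutex_unlock(&s->lock);
    return NULL;
}

/* Main side: post one LOOP_TASK, wait for it, then shut the worker down.
 * Returns the worker's result, or -1 if the thread could not be created. */
static long dispatch_and_wait(const int *data, int len)
{
    worker_slot s;
    pthread_t t;
    long result;

    pthread_mutex_init(&s.lock, NULL);
    pthread_cond_init(&s.task_cond, NULL);
    pthread_cond_init(&s.done_cond, NULL);
    s.task_number = NO_TASK;
    s.input = data;
    s.input_len = len;
    s.result = 0;
    s.completed = 0;

    if (pthread_create(&t, NULL, task_loop, &s) != 0) {
        result = -1;
    } else {
        pthread_mutex_lock(&s.lock);
        s.task_number = LOOP_TASK;          /* set the task, then signal */
        pthread_cond_signal(&s.task_cond);
        while (!s.completed)
            pthread_cond_wait(&s.done_cond, &s.lock);
        result = s.result;
        s.task_number = QUIT_TASK;
        pthread_cond_signal(&s.task_cond);
        pthread_mutex_unlock(&s.lock);
        pthread_join(t, NULL);
    }
    pthread_cond_destroy(&s.done_cond);
    pthread_cond_destroy(&s.task_cond);
    pthread_mutex_destroy(&s.lock);
    return result;
}
```

The worker rechecks task_number in a loop around pthread_cond_wait, which is the usual way to stay correct if the signal arrives before the worker starts waiting or if the wait returns spuriously. The article's listing keeps one condition variable per thread in task_cond[i]; this sketch collapses that to a single worker for brevity.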