Introduction to Parallel Computing: Part 2


Parallel Examples

Array Processing

  • This example demonstrates calculations on 2-dimensional array elements, where the computation on each array element is independent of the other array elements.

  • The serial program calculates one element at a time in sequential order.

  • Serial code could be of the form:

    
    do j = 1,n
    do i = 1,n
      a(i,j) = fcn(i,j)
    end do
    end do
    
    

  • The calculation on each element is independent of the others, which leads to an embarrassingly parallel situation.

  • The problem should be computationally intensive.

Array Processing
Parallel Solution 1

  • Array elements are distributed so that each processor owns a portion of the array (a subarray).

  • Independent calculation of array elements ensures there is no need for communication between tasks.

  • The distribution scheme is therefore chosen by other criteria, e.g. unit stride (stride of 1) through the subarrays, which maximizes cache/memory usage.

  • Since it is desirable to have unit stride through the subarrays, the choice of distribution scheme depends on the programming language's array storage order: column blocks for Fortran (column-major), row blocks for C (row-major).

  • After the array is distributed, each task executes the portion of the loop corresponding to the data it owns. For example, with Fortran block distribution:

    
    do j = mystart, myend
    do i = 1,n
      a(i,j) = fcn(i,j)
    end do
    end do
    
    

  • Notice that only the outer loop variables differ from the serial solution. (A C analogue with row distribution is sketched in the next bullet.)
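
  • For C, which stores arrays in row-major order, the analogous block distribution gives each task a band of rows so that the inner loop strides through contiguous memory. A minimal sketch (myfirstrow, mylastrow and fcn are illustrative names, not part of the original example):

    /* Hypothetical C analogue: rows are distributed because C is row-major,
       which keeps unit stride in the inner loop. myfirstrow/mylastrow are
       the rows this task owns (assumed to be set by the distribution step). */
    for (i = myfirstrow; i <= mylastrow; i++)
      for (j = 0; j < n; j++)
        a[i][j] = fcn(i, j);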

One Possible Solution:

  • Implement as SPMD model.

  • Master process initializes array, sends info to worker processes and receives results.

  • Worker process receives info, performs its share of computation and sends results to master.

  • Using the Fortran storage scheme, perform block distribution of the array.

  • Pseudo code solution; the master/worker logic and the sends/receives are the changes required for parallelism:

    
    find out if I am MASTER or WORKER
       
    if I am MASTER
       
      initialize the array
      send each WORKER info on part of array it owns
      send each WORKER its portion of initial array
       
      receive from each WORKER results 
       
    else if I am WORKER
      receive from MASTER info on part of array I own
      receive from MASTER my portion of initial array
    
      # calculate my portion of array
      do j = my first column, my last column
      do i = 1,n
        a(i,j) = fcn(i,j)
      end do
      end do
    
      send MASTER results 
    
    endif
    

Example: MPI Program in C


/******************************************************************************
* FILE: mpi_array.c
* DESCRIPTION: 
*   MPI Example - Array Assignment - C Version
*   This program demonstrates a simple data decomposition. The master task
*   first initializes an array and then distributes an equal portion of that
*   array to the other tasks. After the other tasks receive their portion
*   of the array, they perform an addition operation to each array element.
*   They also maintain a sum for their portion of the array. The master task 
*   does likewise with its portion of the array. As each of the non-master
*   tasks finish, they send their updated portion of the array to the master.
*   An MPI collective communication call is used to collect the sums 
*   maintained by each task.  Finally, the master task displays selected 
*   parts of the final array and the global sum of all array elements. 
*   NOTE: the number of MPI tasks must be evenly divisible by 4.
* AUTHOR: Blaise Barney
* LAST REVISED: 04/13/05
****************************************************************************/
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#define  ARRAYSIZE	16000000
#define  MASTER		0

float  data[ARRAYSIZE];

int main (int argc, char *argv[])
{
int   numtasks, taskid, rc, dest, offset, i, j, tag1,
      tag2, source, chunksize; 
float mysum, sum;
float update(int myoffset, int chunk, int myid);
MPI_Status status;

/***** Initializations *****/
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
if (numtasks % 4 != 0) {
   printf("Quitting. Number of MPI tasks must be divisible by 4.\n");
   MPI_Abort(MPI_COMM_WORLD, 1);
   exit(0);
   }
MPI_Comm_rank(MPI_COMM_WORLD,&taskid);
printf ("MPI task %d has started...\n", taskid);
chunksize = (ARRAYSIZE / numtasks);
tag2 = 1;
tag1 = 2;

/***** Master task only ******/
if (taskid == MASTER){

  /* Initialize the array */
  sum = 0;
  for(i=0; i<ARRAYSIZE; i++) {
    data[i] = i * 1.0;
    sum = sum + data[i];
    }
  printf("Initialized array sum = %e\n",sum);

  /* Send each task its portion of the array - master keeps 1st part */
  offset = chunksize;
  for (dest=1; dest<numtasks; dest++) {
    MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
    MPI_Send(&data[offset], chunksize, MPI_FLOAT, dest, tag2, MPI_COMM_WORLD);
    printf("Sent %d elements to task %d offset= %d\n",chunksize,dest,offset);
    offset = offset + chunksize;
    }

  /* Master does its part of the work */
  offset = 0;
  mysum = update(offset, chunksize, taskid);

  /* Wait to receive results from each task */
  for (i=1; i<numtasks; i++) {
    source = i;
    MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);
    MPI_Recv(&data[offset], chunksize, MPI_FLOAT, source, tag2,
      MPI_COMM_WORLD, &status);
    }

  /* Get final sum and print sample results */  
  MPI_Reduce(&mysum, &sum, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);
  printf("Sample results: \n");
  offset = 0;
  for (i=0; i<numtasks; i++) {
    for (j=0; j<5; j++)
      printf("  %e",data[offset+j]);
    printf("\n");
    offset = offset + chunksize;
    }
  printf("*** Final sum= %e ***\n",sum);
  }  /* end of master section */



/***** Non-master tasks only *****/

if (taskid > MASTER) {

  /* Receive my portion of array from the master task */
  source = MASTER;
  MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);
  MPI_Recv(&data[offset], chunksize, MPI_FLOAT, source, tag2, 
    MPI_COMM_WORLD, &status);

  mysum = update(offset, chunksize, taskid);

  /* Send my results back to the master task */
  dest = MASTER;
  MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
  MPI_Send(&data[offset], chunksize, MPI_FLOAT, MASTER, tag2, MPI_COMM_WORLD);

  MPI_Reduce(&mysum, &sum, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);

  } /* end of non-master */


MPI_Finalize();

}   /* end of main */


float update(int myoffset, int chunk, int myid) {
  int i; 
  float mysum;
  /* Perform addition to each of my array elements and keep my sum */
  mysum = 0;
  for(i=myoffset; i < myoffset + chunk; i++) {
    data[i] = data[i] + i * 1.0;
    mysum = mysum + data[i];
    }
  printf("Task %d mysum = %e\n",myid,mysum);
  return(mysum);
  }

Example: MPI Program in Fortran


C *****************************************************************************
C FILE: mpi_array.f
C DESCRIPTION:
C   MPI Example - Array Assignment - Fortran Version
C   This program demonstrates a simple data decomposition. The master task 
C   first initializes an array and then distributes an equal portion of that
C   array to the other tasks. After the other tasks receive their portion
C   of the array, they perform an addition operation to each array element.
C   They also maintain a sum for their portion of the array. The master task
C   does likewise with its portion of the array. As each of the non-master
C   tasks finish, they send their updated portion of the array to the master.
C   An MPI collective communication call is used to collect the sums
C   maintained by each task.  Finally, the master task displays selected
C   parts of the final array and the global sum of all array elements.
C   NOTE: the number of MPI tasks must be evenly divisible by 4.
C AUTHOR: Blaise Barney
C LAST REVISED: 01/24/09
C **************************************************************************

      program array 
      include 'mpif.h'

      integer   ARRAYSIZE, MASTER
      parameter (ARRAYSIZE = 16000000)
      parameter (MASTER = 0)

      integer  numtasks, taskid, ierr, dest, offset, i, tag1,
     &         tag2, source, chunksize
      real*4   mysum, sum, data(ARRAYSIZE)
      integer  status(MPI_STATUS_SIZE)
      common   /a/ data

C ***** Initializations *****
      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)
      i = MOD(numtasks, 4)
      if (i .ne. 0) then
        call MPI_Abort(MPI_COMM_WORLD,ierr)
        stop
      end if
      call MPI_COMM_RANK(MPI_COMM_WORLD, taskid, ierr)
      write(*,*)'MPI task',taskid,'has started...'
      chunksize = (ARRAYSIZE / numtasks)
      tag2 = 1
      tag1 = 2

C***** Master task only ******
      if (taskid .eq. MASTER) then

C       Initialize the array
        sum = 0.0
        do i=1, ARRAYSIZE 
          data(i) = i * 1.0
          sum = sum + data(i)
        end do
        write(*,20) sum

C       Send each task its portion of the array - master keeps 1st part
        offset = chunksize + 1
        do dest=1, numtasks-1
          call MPI_SEND(offset, 1, MPI_INTEGER, dest, tag1, 
     &      MPI_COMM_WORLD, ierr)
          call MPI_SEND(data(offset), chunksize, MPI_REAL, dest, 
     &      tag2, MPI_COMM_WORLD, ierr)
          write(*,*) 'Sent',chunksize,'elements to task',dest,
     &      'offset=',offset
          offset = offset + chunksize
        end do

C       Master does its part of the work
        offset = 1
        call update(offset, chunksize, taskid, mysum)

C       Wait to receive results from each task
        do i=1, numtasks-1
          source = i
          call MPI_RECV(offset, 1, MPI_INTEGER, source, tag1,
     &      MPI_COMM_WORLD, status, ierr)
          call MPI_RECV(data(offset), chunksize, MPI_REAL, 
     &      source, tag2, MPI_COMM_WORLD, status, ierr)
        end do 

C       Get final sum and print sample results
        call MPI_Reduce(mysum, sum, 1, MPI_REAL, MPI_SUM, MASTER,
     &    MPI_COMM_WORLD, ierr)
        print *, 'Sample results:'
        offset = 1
        do i=1, numtasks
          write (*,30) data(offset:offset+4)
          offset = offset + chunksize
        end do
        write(*,40) sum

      end if


C***** Non-master tasks only *****

      if (taskid .gt. MASTER) then

C       Receive my portion of array from the master task
        call MPI_RECV(offset, 1, MPI_INTEGER, MASTER, tag1,
     &    MPI_COMM_WORLD, status, ierr)
        call MPI_RECV(data(offset), chunksize, MPI_REAL, MASTER,
     &    tag2, MPI_COMM_WORLD, status, ierr)

        call update(offset, chunksize, taskid, mysum)

C       Send my results back to the master
        call MPI_SEND(offset, 1, MPI_INTEGER, MASTER, tag1,
     &    MPI_COMM_WORLD, ierr)
        call MPI_SEND(data(offset), chunksize, MPI_REAL, MASTER,
     &    tag2, MPI_COMM_WORLD, ierr)

        call MPI_Reduce(mysum, sum, 1, MPI_REAL, MPI_SUM, MASTER,
     &    MPI_COMM_WORLD, ierr)

      endif


      call MPI_FINALIZE(ierr)

  20  format('Initialized array sum = ',E12.6)
  30  format(5E14.6) 
  40  format('*** Final sum= ',E12.6,' ***')

      end


      subroutine update(myoffset, chunksize, myid, mysum)
        integer   ARRAYSIZE, myoffset, chunksize, myid, i
        parameter (ARRAYSIZE = 16000000)
        real*4 mysum, data(ARRAYSIZE)
        common /a/ data
C       Perform addition to each of my array elements and keep my sum
        mysum = 0 
        do i=myoffset, myoffset + chunksize-1
          data(i) = data(i) + i * 1.0 
          mysum = mysum + data(i)
        end do
        write(*,50) myid,mysum
  50    format('Task',I4,' mysum = ',E12.6) 
      end subroutine update

Array Processing
Parallel Solution 2: Pool of Tasks

  • The previous array solution demonstrated static load balancing:

    • Each task has a fixed amount of work to do
    • There may be significant idle time for faster or more lightly loaded processors - the slowest task determines overall performance.

  • Static load balancing is not usually a major concern if all tasks are performing the same amount of work on identical machines.

  • If you have a load balance problem (some tasks work faster than others), you may benefit from using a "pool of tasks" scheme.

Pool of Tasks Scheme:

  • Two processes are employed

    Master Process:

    • Holds pool of tasks for worker processes to do
    • Sends worker a task when requested
    • Collects results from workers

    Worker Process: repeatedly does the following

    • Gets task from master process
    • Performs computation
    • Sends results to master

  • Worker processes do not know before runtime which portion of the array they will handle or how many tasks they will perform.

  • Dynamic load balancing occurs at run time: the faster tasks will get more work to do.

  • Pseudo code solution; the job distribution loop and the master/worker logic are the changes for parallelism:

    
    find out if I am MASTER or WORKER
    
    if I am MASTER
    
      do until no more jobs
        send to WORKER next job
        receive results from WORKER
      end do
    
      tell WORKER no more jobs
    
    else if I am WORKER
    
      do until no more jobs
        receive from MASTER next job
    
        calculate array element: a(i,j) = fcn(i,j)
    
    
        send results to MASTER
      end do
    
    endif
    
    

Discussion:

  • In the above pool of tasks example, each job was the calculation of an individual array element, so the computation to communication ratio is very fine grained.

  • Fine grained solutions incur more communication overhead in order to reduce task idle time.

  • A better solution might be to distribute more work with each job. The "right" amount of work is problem dependent. A minimal MPI sketch of the pool of tasks scheme follows.
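
  • Below is a minimal MPI sketch in C of the pool of tasks scheme, illustrative rather than taken from the tutorial: each job is just an integer index, and do_job() is an assumed placeholder for the real computation on that piece of the array. A worker asks for work by sending back its latest result; the master answers with either the next job or a stop tag:

    /* Hedged pool-of-tasks sketch (illustrative only). do_job() and the
       meaning of a "job" index are assumed placeholders; combining the
       workers' results is omitted for brevity. */
    #include "mpi.h"

    #define WORKTAG 1
    #define STOPTAG 2

    int do_job(int job);                     /* assumed: the real computation */

    void run_pool(int njobs)
    {
      int numtasks, taskid, job, result;
      MPI_Status status;

      MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
      MPI_Comm_rank(MPI_COMM_WORLD, &taskid);

      if (taskid == 0) {                     /* MASTER: holds the pool of jobs */
        int workers = numtasks - 1;
        job = 0;
        while (workers > 0) {
          /* any message from a worker is a request for more work */
          MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                   MPI_COMM_WORLD, &status);
          if (job < njobs) {                 /* hand out the next job          */
            MPI_Send(&job, 1, MPI_INT, status.MPI_SOURCE, WORKTAG,
                     MPI_COMM_WORLD);
            job++;
          } else {                           /* no more jobs: tell it to stop  */
            MPI_Send(&job, 1, MPI_INT, status.MPI_SOURCE, STOPTAG,
                     MPI_COMM_WORLD);
            workers--;
          }
        }
      } else {                               /* WORKER: loop until told to stop */
        result = 0;                          /* dummy first "result" = request  */
        MPI_Send(&result, 1, MPI_INT, 0, WORKTAG, MPI_COMM_WORLD);
        while (1) {
          MPI_Recv(&job, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
          if (status.MPI_TAG == STOPTAG) break;
          result = do_job(job);              /* perform computation             */
          MPI_Send(&result, 1, MPI_INT, 0, WORKTAG, MPI_COMM_WORLD);
        }
      }
    }

  • Because the master replies to whichever worker reports in first (identified by status.MPI_SOURCE), faster workers automatically receive more jobs - the dynamic load balancing described above.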

PI Calculation

  • The value of PI can be calculated in a number of ways. Consider the following method of approximating PI:
    1. Inscribe a circle in a square
    2. Randomly generate points in the square
    3. Determine the number of points in the square that are also in the circle
    4. Let r be the number of points in the circle divided by the number of points in the square
    5. PI ~ 4 r, since the ratio of the circle's area to the square's area is PI/4
    6. Note that the more points generated, the better the approximation

  • Serial pseudo code for this procedure:

    
    npoints = 10000
    circle_count = 0
    
    do j = 1,npoints
      generate 2 random numbers between 0 and 1
      xcoordinate = random1 ; ycoordinate = random2
      if (xcoordinate, ycoordinate) inside circle
      then circle_count = circle_count + 1
    end do
    
    PI = 4.0*circle_count/npoints
    
    

  • Note that most of the time in running this program would be spent executing the loop

  • Leads to an embarrassingly parallel solution
    • Computationally intensive
    • Minimal communication
    • Minimal I/O

PI Calculation
Parallel Solution

  • Parallel strategy: break the loop into portions that can be executed by the tasks.

  • For the task of approximating PI:
    • Each task executes its portion of the loop a number of times.
    • Each task can do its work without requiring any information from the other tasks (there are no data dependencies).
    • Uses the SPMD model. One task acts as master and collects the results.

  • Pseudo code solution; the division of the loop among tasks and the collection of circle_counts are the changes for parallelism (an MPI version in C is sketched after this block):

    
    
    npoints = 10000
    circle_count = 0
    
    p = number of tasks
    num = npoints/p
    
    find out if I am MASTER or WORKER 
    
    do j = 1,num 
      generate 2 random numbers between 0 and 1
      xcoordinate = random1 ; ycoordinate = random2
      if (xcoordinate, ycoordinate) inside circle
      then circle_count = circle_count + 1
    end do
    
    if I am MASTER
    
      receive from WORKERS their circle_counts
      compute PI (use MASTER and WORKER calculations)
    
    else if I am WORKER
    
      send to MASTER circle_count
    
    endif
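
  • A minimal MPI version of this pseudo code in C might look like the following. It is an illustrative sketch rather than the tutorial's own example: it uses MPI_Reduce to sum the circle counts instead of explicit sends to the master, and srand()/rand() stand in for a proper parallel random number generator:

    /* Hedged MPI sketch of the Monte Carlo PI calculation (not the
       tutorial's code). Every task, including the master, executes its
       portion of the loop; MPI_Reduce collects the counts on task 0. */
    #include "mpi.h"
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
      long   npoints = 10000000, num, i, mycount = 0, totalcount = 0;
      double x, y, pi;
      int    numtasks, taskid;

      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
      MPI_Comm_rank(MPI_COMM_WORLD, &taskid);

      num = npoints / numtasks;          /* each task executes its portion   */
      srand(taskid + 1);                 /* different random stream per task */

      for (i = 0; i < num; i++) {
        x = (double)rand() / RAND_MAX;   /* random point in the unit square  */
        y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)        /* inside the circle?               */
          mycount++;
      }

      /* sum all circle counts on the master task */
      MPI_Reduce(&mycount, &totalcount, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

      if (taskid == 0) {
        pi = 4.0 * (double)totalcount / (double)(num * numtasks);
        printf("Estimated PI = %f\n", pi);
      }

      MPI_Finalize();
      return 0;
    }

  • MPI_Reduce plays the same role here as in the array example above: it collects the partial counts from all tasks, including the master's own, in a single collective call.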
    
    

Simple Heat Equation

  • Most problems in parallel computing require communication among the tasks. A number of common problems require communication with "neighbor" tasks.

  • The heat equation describes the temperature change over time, given initial temperature distribution and boundary conditions.

  • A finite differencing scheme is employed to solve the heat equation numerically on a square region.

  • The initial temperature is zero on the boundaries and high in the middle.

  • The boundary temperature is held at zero.

  • For the fully explicit problem, a time stepping algorithm is used. The elements of a 2-dimensional array represent the temperature at points on the square.

  • The calculation of an element is dependent upon neighbor element values.

  • A serial program would contain code like:

    
    do iy = 2, ny - 1
    do ix = 2, nx - 1
      u2(ix, iy) = u1(ix, iy) +
          cx * (u1(ix+1,iy) + u1(ix-1,iy) - 2.*u1(ix,iy)) +
          cy * (u1(ix,iy+1) + u1(ix,iy-1) - 2.*u1(ix,iy))
    end do
    end do
    
    

Simple Heat Equation
Parallel Solution 1

  • Implement as an SPMD model

  • The entire array is partitioned and distributed as subarrays to all tasks. Each task owns a portion of the total array.

  • Determine data dependencies

  • Master process sends initial info to workers, checks for convergence and collects results

  • Worker process calculates solution, communicating as necessary with neighbor processes

  • Pseudo code solution; the border exchange and the master/worker logic are the changes for parallelism (a C sketch of the blocking border exchange appears after this block):

    
    find out if I am MASTER or WORKER
    
    if I am MASTER
      initialize array
      send each WORKER starting info and subarray
       
      do until all WORKERS converge
        gather from all WORKERS convergence data
        broadcast to all WORKERS convergence signal
      end do
    
      receive results from each WORKER
    
    else if I am WORKER
      receive from MASTER starting info and subarray
    
      do until solution converged
        update time
        send neighbors my border info
        receive from neighbors their border info
     
        update my portion of solution array
         
        determine if my solution has converged
          send MASTER convergence data
          receive from MASTER convergence signal
      end do
     
      send MASTER results
          
    endif
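
  • To make the worker's time step concrete, here is a hedged C sketch of one blocking iteration, assuming a 1-D decomposition by rows: u1/u2 are this task's slab with one ghost row above and below, above/below are the neighbor ranks (MPI_PROC_NULL at a physical boundary, where the ghost row is assumed to hold the fixed boundary temperature), and NX, cx, cy mirror the serial code. None of these names come from the tutorial's own example, and the convergence test is omitted:

    /* Hedged sketch: one blocking time step for a row-decomposed heat equation.
       Rows 1..myrows are owned by this task; rows 0 and myrows+1 are ghost rows. */
    #include "mpi.h"

    #define NX 100                      /* assumed grid width */

    void step_blocking(float u1[][NX], float u2[][NX], int myrows,
                       int above, int below, float cx, float cy)
    {
      int ix, iy;
      MPI_Status status;

      /* send my first row up; receive the ghost row below my last row */
      MPI_Sendrecv(&u1[1][0],        NX, MPI_FLOAT, above, 0,
                   &u1[myrows+1][0], NX, MPI_FLOAT, below, 0,
                   MPI_COMM_WORLD, &status);
      /* send my last row down; receive the ghost row above my first row */
      MPI_Sendrecv(&u1[myrows][0],   NX, MPI_FLOAT, below, 1,
                   &u1[0][0],        NX, MPI_FLOAT, above, 1,
                   MPI_COMM_WORLD, &status);

      /* update my portion of the solution array (same stencil as the serial code) */
      for (iy = 1; iy <= myrows; iy++)
        for (ix = 1; ix < NX - 1; ix++)
          u2[iy][ix] = u1[iy][ix]
            + cx * (u1[iy][ix+1] + u1[iy][ix-1] - 2.0f * u1[iy][ix])
            + cy * (u1[iy+1][ix] + u1[iy-1][ix] - 2.0f * u1[iy][ix]);
    }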
    
    

Simple Heat Equation
Parallel Solution 2: Overlapping Communication and Computation

  • In the previous solution, it was assumed that blocking communications were used by the worker tasks. Blocking communications wait for the communication process to complete before continuing to the next program instruction.

  • In the previous solution, neighbor tasks communicated border data, then each process updated its portion of the array.

  • Computing times can often be reduced by using non-blocking communication. Non-blocking communications allow work to be performed while communication is in progress.

  • Each task could update the interior of its part of the solution array while the communication of border data is occurring, and update its border after communication has completed.

  • Pseudo code for the second solution; the changes are the non-blocking sends/receives and the split interior/border update (a C sketch follows this block):

    
    find out if I am MASTER or WORKER
     
    if I am MASTER
      initialize array
      send each WORKER starting info and subarray
        
      do until all WORKERS converge
        gather from all WORKERS convergence data
        broadcast to all WORKERS convergence signal
      end do
     
      receive results from each WORKER
     
    else if I am WORKER
      receive from MASTER starting info and subarray
     
      do until solution converged
        update time
        
        non-blocking send neighbors my border info
        non-blocking receive neighbors border info
    
        update interior of my portion of solution array
        wait for non-blocking communication complete
        update border of my portion of solution array
        
        determine if my solution has converged
          send MASTER convergence data
          receive from MASTER convergence signal
      end do
      
      send MASTER results
           
    endif
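
  • Under the same assumed row layout as the blocking sketch above (reusing its include and NX definition), one non-blocking time step: the border exchange is posted first, the interior rows that need no neighbor data are updated while the messages are in flight, and the two border rows are updated after the wait:

    /* Hedged sketch: overlap communication and computation with
       MPI_Isend/MPI_Irecv; same assumed layout as step_blocking above. */
    void step_nonblocking(float u1[][NX], float u2[][NX], int myrows,
                          int above, int below, float cx, float cy)
    {
      int ix, iy, b, border[2];
      MPI_Request reqs[4];

      /* start the border exchange (non-blocking) */
      MPI_Irecv(&u1[0][0],        NX, MPI_FLOAT, above, 0, MPI_COMM_WORLD, &reqs[0]);
      MPI_Irecv(&u1[myrows+1][0], NX, MPI_FLOAT, below, 1, MPI_COMM_WORLD, &reqs[1]);
      MPI_Isend(&u1[myrows][0],   NX, MPI_FLOAT, below, 0, MPI_COMM_WORLD, &reqs[2]);
      MPI_Isend(&u1[1][0],        NX, MPI_FLOAT, above, 1, MPI_COMM_WORLD, &reqs[3]);

      /* update the interior of my portion while communication is in progress */
      for (iy = 2; iy <= myrows - 1; iy++)
        for (ix = 1; ix < NX - 1; ix++)
          u2[iy][ix] = u1[iy][ix]
            + cx * (u1[iy][ix+1] + u1[iy][ix-1] - 2.0f * u1[iy][ix])
            + cy * (u1[iy+1][ix] + u1[iy-1][ix] - 2.0f * u1[iy][ix]);

      /* wait for the exchange to complete, then update my two border rows */
      MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
      border[0] = 1;  border[1] = myrows;
      for (b = 0; b < 2; b++) {
        iy = border[b];
        for (ix = 1; ix < NX - 1; ix++)
          u2[iy][ix] = u1[iy][ix]
            + cx * (u1[iy][ix+1] + u1[iy][ix-1] - 2.0f * u1[iy][ix])
            + cy * (u1[iy+1][ix] + u1[iy-1][ix] - 2.0f * u1[iy][ix]);
      }
    }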
    
    

1-D Wave Equation

  • In this example, the amplitude along a uniform, vibrating string is calculated after a specified amount of time has elapsed.

  • The calculation involves:

    • the amplitude on the y axis
    • i as the position index along the x axis
    • node points imposed along the string
    • update of the amplitude at discrete time steps.

  • The equation to be solved is the one-dimensional wave equation:
    
        A(i,t+1) = (2.0 * A(i,t)) - A(i,t-1) 
            + (c * (A(i-1,t) - (2.0 * A(i,t)) + A(i+1,t))) 
        
    where c is a constant

  • Note that amplitude will depend on previous timesteps (t, t-1) and neighboring points (i-1, i+1). Data dependence will mean that a parallel solution will involve communications.

1-D Wave Equation
Parallel Solution

  • Implement as an SPMD model

  • The entire amplitude array is partitioned and distributed as subarrays to all tasks. Each task owns a portion of the total array.

  • Load balancing: all points require equal work, so the points should be divided equally

  • A block decomposition would partition the work into as many chunks as there are tasks, so that each task owns mostly contiguous data points.

  • Communication need only occur on data borders; the larger the block size, the less communication is required.

  • Pseudo code solution (a C sketch of the per-step exchange and update follows this block):

    
    find out number of tasks and task identities
    
    #Identify left and right neighbors
    left_neighbor = mytaskid - 1
    right_neighbor = mytaskid + 1
    if mytaskid = first then left_neighbor = last
    if mytaskid = last then right_neighbor = first
    
    find out if I am MASTER or WORKER
    if I am MASTER
      initialize array
      send each WORKER starting info and subarray
    else if I am WORKER
      receive starting info and subarray from MASTER
    endif
    
    #Update values for each point along string
    #In this example the master participates in calculations
    do t = 1, nsteps
      send left endpoint to left neighbor
      receive left endpoint from right neighbor
      send right endpoint to right neighbor
      receive right endpoint from left neighbor
    
    #Update points along line
      do i = 1, npoints
        newval(i) = (2.0 * values(i)) - oldval(i) 
          + (sqtau * (values(i-1) - (2.0 * values(i)) + values(i+1))) 
      end do
    
    end do
    
    #Collect results and write to file
    if I am MASTER
      receive results from each WORKER
      write results to file
    else if I am WORKER
      send results to MASTER
    endif
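
  • A hedged C sketch of the per-timestep exchange and update, assuming each task holds npoints values with one ghost cell at each end (indices 0 and npoints+1), left/right are the ring neighbors identified as in the pseudo code, and oldval/values/newval and sqtau are set up elsewhere; these names are illustrative, not the tutorial's code:

    /* Hedged sketch: exchange endpoints with ring neighbors, then apply the
       same update formula as the pseudo code. */
    #include "mpi.h"

    void wave_step(double *oldval, double *values, double *newval,
                   int npoints, int left, int right, double sqtau)
    {
      int i;
      MPI_Status status;

      /* send my left endpoint to the left neighbor; receive the right
         neighbor's left endpoint into my right ghost cell */
      MPI_Sendrecv(&values[1],         1, MPI_DOUBLE, left,  0,
                   &values[npoints+1], 1, MPI_DOUBLE, right, 0,
                   MPI_COMM_WORLD, &status);
      /* send my right endpoint to the right neighbor; receive the left
         neighbor's right endpoint into my left ghost cell */
      MPI_Sendrecv(&values[npoints],   1, MPI_DOUBLE, right, 1,
                   &values[0],         1, MPI_DOUBLE, left,  1,
                   MPI_COMM_WORLD, &status);

      /* update my points along the line */
      for (i = 1; i <= npoints; i++)
        newval[i] = 2.0 * values[i] - oldval[i]
                  + sqtau * (values[i-1] - 2.0 * values[i] + values[i+1]);
    }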
    
    

References and More Information

  • Author: Blaise Barney, Livermore Computing.
  • A search on the WWW for "parallel programming" will yield a wide variety of information.
  • These materials were primarily developed from the following sources, some of which are no longer maintained or available.
    • Tutorials located in the Maui High Performance Computing Center's "SP Parallel Programming Workshop".
    • Tutorials located at the Cornell Theory Center's "Education and Training" web page.
    • "Designing and Building Parallel Programs". Ian Foster.
