Introduction to Parallel Computing: Part 2

Designing and implementing parallel programs


February 04, 2009
URL:http://www.drdobbs.com/introduction-to-parallel-computing-part/213001517


This two-part article covers the basics of parallel computing. Part 1 began with an overview, including concepts and terminology associated with parallel computing. The topics of parallel memory architectures and programming models were then explored. This installment presents a discussion of a number of issues related to designing parallel programs, concluding with several examples of how to parallelize simple serial programs. Provided courtesy of Lawrence Livermore National Laboratory.

Designing Parallel Programs

Automatic vs. Manual Parallelization

Understand the Problem and the Program

Partitioning

Domain Decomposition:

Functional Decomposition:

Communications

Who Needs Communications?

Factors to Consider:

Synchronization

Types of Synchronization:

Data Dependencies

Definition:

Examples:

How to Handle Data Dependencies:

Load Balancing

How to Achieve Load Balance:

Granularity

Computation / Communication Ratio:

Fine-grain Parallelism:

  • Relatively small amounts of computational work are done between communication events

  • Low computation to communication ratio

  • Facilitates load balancing

  • Implies high communication overhead and less opportunity for performance enhancement

  • If granularity is too fine it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation.

Coarse-grain Parallelism:

  • Relatively large amounts of computational work are done between communication/synchronization events

  • High computation to communication ratio

  • Implies more opportunity for performance increase

  • Harder to load balance efficiently

Which is Best?

  • The most efficient granularity is dependent on the algorithm and the hardware environment in which it runs.

  • In most cases the overhead associated with communications and synchronization is high relative to execution speed so it is advantageous to have coarse granularity.

  • Fine-grain parallelism can help reduce overheads due to load imbalance.

I/O

The Bad News:

The Good News:

Limits and Costs of Parallel Programming

Amdahl's Law:

Complexity:

Portability:

Resource Requirements:

Scalability:

Performance Analysis and Tuning

Parallel Examples

Array Processing

Array Processing
Parallel Solution 1

One Possible Solution:

Example: MPI Program in C


/******************************************************************************
* FILE: mpi_array.c
* DESCRIPTION: 
*   MPI Example - Array Assignment - C Version
*   This program demonstrates a simple data decomposition. The master task
*   first initializes an array and then distributes an equal portion of that
*   array to the other tasks. After the other tasks receive their portion
*   of the array, they perform an addition operation to each array element.
*   They also maintain a sum for their portion of the array. The master task 
*   does likewise with its portion of the array. As each of the non-master
*   tasks finish, they send their updated portion of the array to the master.
*   An MPI collective communication call is used to collect the sums 
*   maintained by each task.  Finally, the master task displays selected 
*   parts of the final array and the global sum of all array elements. 
*   NOTE: the number of MPI tasks must be evenly divisible by 4.
* AUTHOR: Blaise Barney
* LAST REVISED: 04/13/05
****************************************************************************/
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#define  ARRAYSIZE	16000000
#define  MASTER		0

float  data[ARRAYSIZE];

int main (int argc, char *argv[])
{
int   numtasks, taskid, rc, dest, offset, i, j, tag1,
      tag2, source, chunksize; 
float mysum, sum;
float update(int myoffset, int chunk, int myid);
MPI_Status status;

/***** Initializations *****/
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
if (numtasks % 4 != 0) {
   printf("Quitting. Number of MPI tasks must be divisible by 4.\n");
   rc = 1;
   MPI_Abort(MPI_COMM_WORLD, rc);
   exit(0);
   }
MPI_Comm_rank(MPI_COMM_WORLD,&taskid);
printf ("MPI task %d has started...\n", taskid);
chunksize = (ARRAYSIZE / numtasks);
tag2 = 1;
tag1 = 2;

/***** Master task only ******/
if (taskid == MASTER){

  /* Initialize the array */
  sum = 0;
  for(i=0; i<ARRAYSIZE; i++) {
    data[i] = i * 1.0;
    sum = sum + data[i];
    }
  printf("Initialized array sum = %e\n", sum);
  /* Send each task its portion of the array - master keeps 1st part */
  offset = chunksize;
  for (dest=1; dest<numtasks; dest++) {
    MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
    MPI_Send(&data[offset], chunksize, MPI_FLOAT, dest, tag2, MPI_COMM_WORLD);
    printf("Sent %d elements to task %d offset= %d\n", chunksize, dest, offset);
    offset = offset + chunksize;
    }
  /* Master does its part of the work */
  offset = 0;
  mysum = update(offset, chunksize, taskid);

  /* Wait to receive results from each task */
  for (i=1; i<numtasks; i++) {
    source = i;
    MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);
    MPI_Recv(&data[offset], chunksize, MPI_FLOAT, source, tag2,
      MPI_COMM_WORLD, &status);
    }
  /* Get final sum and print sample results */  
  MPI_Reduce(&mysum, &sum, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);
  printf("Sample results: \n");
  offset = 0;
  for (i=0; i<numtasks; i++) {
    for (j=0; j<5; j++)
      printf("  %e",data[offset+j]);
    printf("\n");
    offset = offset + chunksize;
    }
  printf("*** Final sum= %e ***\n", sum);
  }  /* end of master section */



/***** Non-master tasks only *****/

if (taskid > MASTER) {

  /* Receive my portion of array from the master task */
  source = MASTER;
  MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);
  MPI_Recv(&data[offset], chunksize, MPI_FLOAT, source, tag2, 
    MPI_COMM_WORLD, &status);

  mysum = update(offset, chunksize, taskid);

  /* Send my results back to the master task */
  dest = MASTER;
  MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
  MPI_Send(&data[offset], chunksize, MPI_FLOAT, MASTER, tag2, MPI_COMM_WORLD);

  MPI_Reduce(&mysum, &sum, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);

  } /* end of non-master */


MPI_Finalize();

}   /* end of main */


float update(int myoffset, int chunk, int myid) {
  int i; 
  float mysum;
  /* Perform addition to each of my array elements and keep my sum */
  mysum = 0;
  for(i=myoffset; i < myoffset + chunk; i++) {
    data[i] = data[i] + i * 1.0;
    mysum = mysum + data[i];
    }
  printf("Task %d mysum = %e\n",myid,mysum);
  return(mysum);
  }

Example: MPI Program in Fortran


C *****************************************************************************
C FILE: mpi_array.f
C DESCRIPTION:
C   MPI Example - Array Assignment - Fortran Version
C   This program demonstrates a simple data decomposition. The master task 
C   first initializes an array and then distributes an equal portion that
C   array to the other tasks. After the other tasks receive their portion
C   of the array, they perform an addition operation to each array element.
C   They also maintain a sum for their portion of the array. The master task
C   does likewise with its portion of the array. As each of the non-master
C   tasks finish, they send their updated portion of the array to the master.
C   An MPI collective communication call is used to collect the sums
C   maintained by each task.  Finally, the master task displays selected
C   parts of the final array and the global sum of all array elements.
C   NOTE: the number of MPI tasks must be evenly divisible by 4.
C AUTHOR: Blaise Barney
C LAST REVISED: 01/24/09
C **************************************************************************

      program array 
      include 'mpif.h'

      integer   ARRAYSIZE, MASTER
      parameter (ARRAYSIZE = 16000000)
      parameter (MASTER = 0)

      integer  numtasks, taskid, ierr, dest, offset, i, tag1,
     &         tag2, source, chunksize
      real*4   mysum, sum, data(ARRAYSIZE)
      integer  status(MPI_STATUS_SIZE)
      common   /a/ data

C ***** Initializations *****
      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)
      i = MOD(numtasks, 4)
      if (i .ne. 0) then
        print *, 'Quitting. Tasks must be divisible by 4.'
        call MPI_ABORT(MPI_COMM_WORLD, 1, ierr)
        stop
      end if
      call MPI_COMM_RANK(MPI_COMM_WORLD, taskid, ierr)
      write(*,*)'MPI task',taskid,'has started...'
      chunksize = (ARRAYSIZE / numtasks)
      tag2 = 1
      tag1 = 2

C***** Master task only ******
      if (taskid .eq. MASTER) then

C       Initialize the array
        sum = 0.0
        do i=1, ARRAYSIZE 
          data(i) = i * 1.0
          sum = sum + data(i)
        end do
        write(*,20) sum

C       Send each task its portion of the array - master keeps 1st part
        offset = chunksize + 1
        do dest=1, numtasks-1
          call MPI_SEND(offset, 1, MPI_INTEGER, dest, tag1, 
     &      MPI_COMM_WORLD, ierr)
          call MPI_SEND(data(offset), chunksize, MPI_REAL, dest, 
     &      tag2, MPI_COMM_WORLD, ierr)
          write(*,*) 'Sent',chunksize,'elements to task',dest,
     &      'offset=',offset
          offset = offset + chunksize
        end do

C       Master does its part of the work
        offset = 1
        call update(offset, chunksize, taskid, mysum)

C       Wait to receive results from each task
        do i=1, numtasks-1
          source = i
          call MPI_RECV(offset, 1, MPI_INTEGER, source, tag1,
     &      MPI_COMM_WORLD, status, ierr)
          call MPI_RECV(data(offset), chunksize, MPI_REAL, 
     &      source, tag2, MPI_COMM_WORLD, status, ierr)
        end do 

C       Get final sum and print sample results
        call MPI_Reduce(mysum, sum, 1, MPI_REAL, MPI_SUM, MASTER,
     &    MPI_COMM_WORLD, ierr)
        print *, 'Sample results:'
        offset = 1
        do i=1, numtasks
          write (*,30) data(offset:offset+4)
          offset = offset + chunksize
        end do
        write(*,40) sum

      end if


C***** Non-master tasks only *****

      if (taskid .gt. MASTER) then

C       Receive my portion of array from the master task
        call MPI_RECV(offset, 1, MPI_INTEGER, MASTER, tag1,
     &    MPI_COMM_WORLD, status, ierr)
        call MPI_RECV(data(offset), chunksize, MPI_REAL, MASTER,
     &    tag2, MPI_COMM_WORLD, status, ierr)

        call update(offset, chunksize, taskid, mysum)

C       Send my results back to the master
        call MPI_SEND(offset, 1, MPI_INTEGER, MASTER, tag1,
     &    MPI_COMM_WORLD, ierr)
        call MPI_SEND(data(offset), chunksize, MPI_REAL, MASTER,
     &    tag2, MPI_COMM_WORLD, ierr)

        call MPI_Reduce(mysum, sum, 1, MPI_REAL, MPI_SUM, MASTER,
     &    MPI_COMM_WORLD, ierr)

      endif


      call MPI_FINALIZE(ierr)

  20  format('Initialized array sum = ',E12.6)
  30  format(5E14.6) 
  40  format('*** Final sum= ',E12.6,' ***')

      end


      subroutine update(myoffset, chunksize, myid, mysum)
        integer   ARRAYSIZE, myoffset, chunksize, myid, i
        parameter (ARRAYSIZE = 16000000)
        real*4 mysum, data(ARRAYSIZE)
        common /a/ data
C       Perform addition to each of my array elements and keep my sum
        mysum = 0 
        do i=myoffset, myoffset + chunksize-1
          data(i) = data(i) + i * 1.0 
          mysum = mysum + data(i)
        end do
        write(*,50) myid,mysum
  50    format('Task',I4,' mysum = ',E12.6) 
      end subroutine update

Array Processing
Parallel Solution 2: Pool of Tasks

Pool of Tasks Scheme:

Discussion:

PI Calculation

PI Calculation
Parallel Solution

Simple Heat Equation

Simple Heat Equation
Parallel Solution 1

Simple Heat Equation
Parallel Solution 2: Overlapping Communication and Computation

1-D Wave Equation

1-D Wave Equation
Parallel Solution


References and More Information
