The emerging class of small, relatively inexpensive multiple-processor workstation computers provides a degree of raw computing power that seriously challenges the computing power offered by minisupercomputers. The availability of the appropriate software tools -- language extensions, parallelizing compilers, and parallel environment debuggers and profilers -- makes it possible to exploit this multiprocessor architecture in a straightforward manner.
This article explores the parallelization scheme that Silicon Graphics has implemented in its Power Series two-processor and four-processor workstations, along with the accompanying software development tools. I use a test program to take you through this exploration. First, I'll introduce the syntax, highlighting coding problems and solutions. Next, I'll describe the actual implementation of the parallelization scheme. Finally, I'll show the bottom line for high-performance computing, and confirm that performance gains are possible with the SGI scheme if one is mindful of its limitations.
My exploration strategy centers on the DO loop. If the iterations of a loop are independent of each other, the loop can be divided among n available processors and run roughly n times faster. The separately executing processes are known as "threads." This parallelization scheme is not intended for use with complex, independently executing routines (although a separate set of Unix tools exists to accomplish that).
The goal here is to accelerate the speed of more common forms of code. For instance, consider the two DO loops in Figure 1. The scalar DO loop is transformed into a parallel loop by the inclusion of a parallelization directive immediately before the DO statement.
<b>Scalar:</b>

      do i = 1, n
        a(i) = b(i) * c(i)
      enddo

<b>Parallel:</b>

c$doacross local(i), share(a, b, c, n)
      do i = 1, n
        a(i) = b(i) * c(i)
      enddo
The program is parallelized after compilation with the -mp option. The parallelization directive affects only the DO loop that immediately follows that directive. The directive resembles a Fortran comment, and is ignored by the compiler when the -mp option is not used. More importantly, the directive is ignored by other platforms that do not have multiple processors.
The syntax allows LOCAL, SHARE, and LASTLOCAL declarations to restrict the scope of variables and improve program performance (see sidebar 1, page 74). LOCAL variables are private to each thread. SHARE variables are common to all executing threads. LASTLOCAL variables are treated like LOCAL variables, except that the value held by the thread that finishes last is made available to the code past the parallelized loop.
A variety of Fortran statements, including subroutine calls, function calls, and intrinsics, are allowed inside parallelized loops. The only restriction is that the statements must have no side effects: they must depend only on their passed arguments, they cannot modify common data, and they cannot use static data. All of the intrinsic functions supplied in the library meet this requirement and can be used safely.
The one construct that is forbidden is branching outside the DO loop. Typically, this construct is a logical test that causes premature termination of the loop. It doesn't make sense to use a conditional exit, which may behave differently for each thread in a parallel environment.
Additionally, a number of functions unique to the parallel environment are available to help an executing program work more efficiently with multiple processors. The straightforward syntax, which retains the algorithm in its original form, along with transparency to other platforms, makes implementation easy. (This is especially useful for programmers who do not wish to master the details of systems-level programming.) The syntax is conservative in the sense that the loop code itself needs no restructuring; the addition of the directive (and, optionally, the scope declarations) is the only requirement. Retaining the basic algorithm maintains clarity and aids the transformation of scalar code.
The parallel implementation restructures the code during compilation into separate slaved processes for the parallelized loop (see sidebar 2, page 76). The parallelized program is under software control during execution, and can create as many processes as are appropriate for the machine on which the code is executing. (The number of processes can range from 1, for a single processor system, to 4.) The sensing and adjustment is automatic so that separate versions of a program need not be created for a multiplatform environment. This feature offers a great deal of flexibility to programmers who use networked NFS systems, in which a single executable can go to any target machine. In practice, this works well, although parallelized programs that run on single processor systems execute more slowly due to the system overhead.
The primary limitation of a parallelized program is "data dependency," in which one of several concurrently executing threads alters data that can be read by another executing thread, producing nonsensical results. Many common coding situations cannot be parallelized with this scheme unless the dependency is first recognized and a solution applied.
Consider, for example, the DO loop in Example 1. Successful parallelization requires each iteration of the loop to be data-independent of every other iteration. The loop can then be broken into n separate parallel processes, or even left scalar, without affecting the results: as long as the loop body runs for all values of the index, regardless of the order or grouping of iterations, the outcome is the same.
Example 1: A typical DO loop

      do i = 1, n
        a(i) = x * b(i)
      enddo
Data dependency occurs when a variable is modified (or could be modified) by one executing thread, while at the same time, other executing threads may be reading or modifying the same variable. The value of the variable becomes unpredictable, causing the code in the loop to produce results that vary from the scalar and also vary from run to run.
In the DO loop in Example 2, the array variable arr references a value that is not current with the index. The variable is manipulated and then assigned to arr(i). Other threads can write to arr(i-1), and arr(i) can be referenced by other threads. The iterations are not truly independent. When the DO loop is parallelized, the parallel version cannot guarantee the same result that is guaranteed by use of the scalar version of the loop. This type of data dependence is known as "recurrence."
Example 2: A DO loop in which the array variable references a value that is not current with the index
      do i = 2, n
        arr(i) = b(i) - arr(i-1)
      enddo
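To make the recurrence concrete, here is a small Python sketch (an illustration only, not part of the article's Fortran) showing that the result of Example 2's loop depends on the order in which its iterations execute -- which is exactly why threads running the iterations concurrently cannot reproduce the scalar result:

```python
# Demonstrate the recurrence in Example 2: running the iterations of
# arr(i) = b(i) - arr(i-1) out of order changes the result, so the
# iterations are not independent.

def run_order(b, order):
    arr = [0.0] * len(b)     # arr holds 0.0 before the loop
    for i in order:          # visit the loop indices in a given order
        arr[i] = b[i] - arr[i - 1]
    return arr

b = [float(i * i % 7) for i in range(10)]
in_order  = run_order(b, range(1, 10))      # scalar execution order
reversed_ = run_order(b, range(9, 0, -1))   # a different order
assert in_order != reversed_   # order of execution changes the answer
```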
Data dependency inhibits parallelization and prevents a program from achieving the performance inherent in the machine. The solution is first to recognize that the dependency exists, and then to modify the algorithm. A simple modification isn't always possible, and not all algorithms can be parallelized. In short, the ability to break data dependency is algorithm-specific; most algorithms must be approached as special cases.
Many different types of dependency can prevent parallelization. A number of common situations have data dependencies that can be generalized. In addition, other situations have data dependencies that are not obvious. Once again, the key is recognition of the existence of data dependency, especially when existing scalar code is adapted to the parallel environment. When parallelization is not possible, the solution usually involves major restructuring of the algorithm.
The data dependency case discussed in the previous section is an example of recurrence. Modified values recur during the execution cycle of the threads that are associated with a parallelized loop, causing unpredictable results. There is no general way to parallelize loops in which recurrence occurs. One possible solution is to rewrite the algorithm. If the recurrence occurs in the inner loop of nested loops, a solution is to parallelize the outer loop.
To see how this works, consider the parallelization shown in Figure 2. If x is used only inside the loop and is assigned before it is used, then x can be declared a LOCAL variable and the loop can be parallelized. If the value of x is not assigned first, then x becomes data dependent upon the value of x from other iterations, and the loop cannot be parallelized simply by declaring x to be LOCAL.
<b>Local Variables:</b>

      do i = 1, n
        x = a(i)**3
        b(i) = x * b(i)
      enddo
The variable x is dependent. Without an explicit declaration as LOCAL, this loop cannot be parallelized.
c$doacross local(i, x)
      do i = 1, n
        x = a(i)**3
        b(i) = x * b(i)
      enddo
Figure 3 presents another example of data dependence, in which the value of indx is passed between iterations. Although indx is used inside the loop, its value is read before it is assigned, so indx cannot be a local variable. Dependency can also be created by the manner in which indx is determined, although in this particular case its determination is independent.
<b>Scalar:</b>

      indx = 0
      do i = 1, n
        indx = indx + i
        a(i) = b(indx)
      enddo

<b>Parallel:</b>

c$doacross local(i, indx)
      do i = 1, n
        indx = (i * (i + 1))/2
        a(i) = b(indx)
      enddo
If the variable can be determined uniquely for each iteration, so that a value for the variable is not passed from iteration to iteration, then the loop can be parallelized. The value of indx in Figure 3 is determined only from the loop index i. In order to arrive at this solution, the original algorithm was rewritten and is no longer straightforward. This solution, then, is highly algorithm-dependent.
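The equivalence of the two forms in Figure 3 is easy to check. This Python sketch (purely illustrative, not the article's Fortran) confirms that the closed form i(i+1)/2 reproduces the running sum of the scalar loop:

```python
# Verify that the closed form indx = i*(i+1)/2 used in the parallel
# version of Figure 3 matches the running sum of the scalar version.

def scalar_indices(n):
    # scalar loop: indx accumulates the loop index i from one
    # iteration to the next (the loop-carried value)
    indx = 0
    out = []
    for i in range(1, n + 1):
        indx = indx + i
        out.append(indx)
    return out

def parallel_indices(n):
    # parallel form: indx depends only on i, so each iteration
    # can compute its own value independently
    return [(i * (i + 1)) // 2 for i in range(1, n + 1)]

assert scalar_indices(100) == parallel_indices(100)
```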
The next example, indirect indexing, is shown in Figure 4. The problem lies in the last line of the loop: the value of iix depends upon the values stored in the ixoffset and indexx arrays, and cannot be guaranteed to be independent from iteration to iteration. This problem differs from the loop-carried value problem because the value of iix is looked up rather than calculated. Unless the contents of indexx and ixoffset are known not to introduce dependency, iix must be considered dependent.
<b>Scalar:</b>

      do i = 1, n
        ix = indexx(i)
        iix = ixoffset(ix)
        total(iix) = total(iix) + delta
      enddo

<b>Parallel:</b>

c$doacross local(ix, i)
      do i = 1, n
        ix = indexx(i)
        iix(i) = ixoffset(ix)
      enddo
      do i = 1, n
        total(iix(i)) = total(iix(i)) + delta
      enddo
Indirect indexing is also a great source of run-time bugs that are nearly impossible to find. The solution is to remove the data-dependent element and place it in a nonparallelized loop. Computing the array index for the second loop is now the only task of the first loop; the index is stored in array elements indexed by i, which is common to both loops. Half of the original loop is now parallelized, but performance may be an issue if n is small. Rewriting the loop, or even leaving the original loop nonparallelized, should be considered if performance suffers in the two-loop case. Note that data dependency is still present in the second loop.
"Sum reduction" (Figure 5) is a common procedure that is highly data-dependent. The implementation of sum reduction is similar to the implementation of the loop-carried variable case. In the case of sum reduction, however, the value of total is presumably used outside of the loop (while indx was only used inside of the loop).
<b>Scalar:</b>

      total = 0.0
      do i = 1, n
        total = total + a(i)
      enddo

<b>Parallel:</b>

c$doacross local(i, k)
      do k = 1, 4              ! assume no more than 4 processors
        sub_total(k) = 0.0
        do i = k, n, 4
          sub_total(k) = sub_total(k) + a(i)
        enddo
      enddo
      total = 0.0
      do i = 1, 4
        total = total + sub_total(i)
      enddo
The solution to the data dependency of sum reduction is similar to the solution used for the indirect indexing case: Subtotals are determined, permitting the loop to be parallelized, and the loop is followed by a grand summation loop. The original loop is converted into the inner loop of a nested pair of loops. Parallelization then takes place on the outer loop, which then divides the inner loop into the maximum number of parallel processes.
Unlike the example of indirect indexing, which contains two loops of roughly equal size, the bulk of the original loop in this case is parallelized and followed by a small grand summation loop. In order to prevent cache and performance problems, the threads interleave the array and are not handled as a set of contiguous blocks. The solution to sum reduction can also be made more efficient and more general. An example of this approach is shown in TEST.F (Listing One, page 122).
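The structure of the rewritten reduction can be sketched in a few lines of Python (an illustration of the interleaving idea, not the Fortran itself): each "thread" sums an interleaved slice of the array, and a small scalar loop combines the subtotals.

```python
# Simulate the sum-reduction rewrite of Figure 5: each of 4 "threads"
# sums an interleaved slice of a(), and a small grand-summation loop
# combines the subtotals. The result matches the straight scalar sum.

def scalar_sum(a):
    total = 0.0
    for x in a:
        total += x
    return total

def parallel_sum(a, nthreads=4):
    sub_total = [0.0] * nthreads
    for k in range(nthreads):                 # this loop would be parallelized
        for i in range(k, len(a), nthreads):  # interleaved: do i = k, n, 4
            sub_total[k] += a[i]
    return sum(sub_total)                     # grand summation loop

a = [float(i) for i in range(1, 101)]
assert parallel_sum(a) == scalar_sum(a)
```

(The values here are small integers stored as floats, so both sums are exact and compare equal; with general floating-point data the two orderings can differ in the last bits.)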
A number of issues can affect the performance of parallel schemes. Foremost among these issues are overhead, caching, and load balancing.
The CPU overhead is approximately 100 cycles for loop parallelization. The loop must iterate more than about 50 times in order to recover that overhead. Otherwise, small parallelized loops will run slower than the speed of scalar loops.
Two solutions can be employed: enclosing the maximum amount of work in the parallelized loop, and using multiple versions of a loop. The parallelization process can accommodate a wide range of normal Fortran functionality, including IF ... THEN ... ELSE ... ENDIF blocks, other loops, function calls, subroutine calls, and even GOTOs. If data dependency does not introduce restrictions, including as much functionality in the loop as possible cuts down on overhead and keeps the largest number of threads in action. This also means that outer loops should be examined before inner loops, so that the greatest amount of code is parallelized.
The other solution to the problem of CPU overhead is to write both parallel and scalar versions of a loop, and place them in an IF (index .gt. threshold) THEN (do parallel) ELSE (do scalar) ENDIF block. This is a good approach for maximizing program performance when the loop count varies across the threshold value of roughly 50 iterations.
These multi-processor machines have exceptionally large data caches. If programs are not written with the cache in mind, performance can suffer significantly. In practical terms, when a multidimensional array is referenced inside nested loops, the leftmost index of the array must be varied fastest by the innermost loop so that memory is addressed as contiguously as possible. (Fortran stores arrays in column order, so the arrays must be referenced in what always seems to me to be a backwards order. I will say more about cache effects later.)
The parallelization process handles load balancing by dividing the target DO loop into equal pieces, based upon the number of available threads (usually the same number as the number of available processors). If the loop contains IF blocks, or variable-sized inner loops (as is the case in the next example), then the execution time for each thread can be random. As a consequence, if the work of the parallelized loop is not performed equally, time is wasted as completed threads wait for the slower threads to finish. The execution of the code that follows the loop will not occur until all of the threads are completed. The load is not balanced, and the penalty is increased execution time due to the idle processors.
The code in Example 3 is a particularly vicious example of load imbalance. Assume that the outer loop is parallelized. The duration of the execution of the inner loop varies according to the value of the index of the outer loop. Although the outer loop's index is divided into ranges of equal size, the threads associated with the ranges that have larger values of the index will take longer to complete.
Example 3: An example of load imbalance

      do i = 1, n
        do j = 1, i
          a(j, i) = a(j, i) * xmult
        enddo
      enddo
As Example 4 shows, a solution is to use index i to break the outer loop into interleaving blocks, such that each thread has a range of values of i. No thread has only the large values of i and the resulting longer execution time. The result is that all threads execute in about the same amount of time. The load is balanced. The only concern is that the process of interleaving can disrupt contiguous memory blocks, which can potentially degrade cache performance.
Example 4: Load balancing

      num_threads = mp_numthreads()
c$doacross local(i, j, k)
      do k = 1, num_threads
        do i = k, n, num_threads
          do j = 1, i
            a(j, i) = a(j, i) * xmult
          enddo
        enddo
      enddo
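A quick Python sketch (illustrative only) shows why the interleaving in Example 4 balances the load: it counts the inner-loop iterations that each of four hypothetical threads would perform under a blocked division of the outer index versus the interleaved division.

```python
# Count inner-loop iterations per thread for Example 3's triangular
# loop (do j = 1, i), dividing i = 1..n among p threads two ways.

n, p = 1000, 4

# blocked: thread k gets a contiguous quarter of the index range,
# so the last thread gets all the largest (slowest) values of i
blocked = [sum(range(k * n // p + 1, (k + 1) * n // p + 1))
           for k in range(p)]

# interleaved: thread k takes i = k+1, k+1+p, k+1+2p, ... (Example 4)
interleaved = [sum(range(k + 1, n + 1, p)) for k in range(p)]

# both divisions do the same total work ...
assert sum(blocked) == sum(interleaved) == n * (n + 1) // 2
# ... but blocked work is badly skewed, interleaved work nearly uniform
assert max(blocked) - min(blocked) > 100 * (max(interleaved) - min(interleaved))
```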
The real test of the efficiency of the SGI parallelization scheme is with the clock. The goal of the scheme is faster execution, which allows larger and more complex problems to be addressed.
A short test program was developed that reflects the types of tasks I would like to speed up in my own work. It fills and manipulates large, multidimensional arrays using nested DO loops, and generates a grand total via sum reduction. The nested loops allow me to examine alternative parallelization strategies. The grand total number serves as a behavior reference for different cases.
Three cases were tested: A purely scalar case (SCA), an inner loops parallelized case (IN), and an outer loop parallelized case (OUT). To test each of the cases, the appropriate * comments were removed in order to unmask the desired compiler directives.
Each program was compiled at optimization level 2. This level is primarily loop optimization, and currently is the maximum level of optimization that is permitted for the -mp option. I have found that programs compiled at optimization level 1 typically perform about 40 percent slower than programs that are compiled at level 2, making it essential to compile at the higher optimization level.
Both a two-headed 4D/120s system and a four-headed 4D/240s system were evaluated. The various configurations are listed in Table 1. Instead of listing aggregate power, I prefer to list the power per processor. This gives an honest view of these machines and reflects nominal scalar performance. The processing power is relative to the old standard of scientific computing -- the VAX 11/780 -- as being 1 mips by definition. These 13 and 20 mips numbers are real!
Table 1: System configurations

model                            4D/120s       4D/240s
processors                       2             4
processor type                   MIPS R2000    MIPS R3000
clock speed (MHz)                16            25
processing power (mips)          2x13          4x20
FPU                              2xR2010       4xR3010
Mflops (double precision)        2x1.5         4x3.0
data cache (Kbytes/processor)    64            64
RAM (Mbytes, nonstandard)        32            64
I have broken down the timing results into the categories of User, System, and Total (wall clock). The times were determined by using the Unix time utility, and were determined on machines free of competing tasks. The speedup is relative to scalar, and reflects the number of processors that contribute to the job. The times between the two-processor and four-processor systems are not comparable because the CPU and clock rates are very different for each system.
All versions of TEST.F are well behaved. Each version returns a value of 0.518e+8 for the grand total, which indicates correct operation in both scalar and parallel modes. The value of IFLAG that is returned is effectively random, ranging between 0 and 5 (except in the IN case, where it returns an enormous integer that reflects the increased cycling being measured, as it should). The random nature of IFLAG, which is declared LASTLOCAL, suggests that caution must be used in the interpretation and use of LASTLOCAL values.
The speedup, as shown in Tables 2 and 3, is impressive. On both machines, the difference between the IN and OUT tests likely reflects load balancing, but is small regardless. This suggests that parallelization of the most time-consuming code, rather than of the largest quantity of code, is the real issue. Outer loops that enclose highly varied code can be far more difficult to parallelize than inner loops, because of data dependency. The small difference between IN and OUT suggests that no significant performance penalty need exist if time-consuming inner loops are preferentially parallelized. The results also suggest that no generalization is safe, and that a variety of coding possibilities should be examined.
Table 2: Timing results (seconds), 4D/120s, two processors

case     user    sys    total    speedup
----------------------------------------
SCA      40.4    2.1    42.5     1.00
IN       21.0    0.8    21.8     1.95
OUT      21.8    1.0    22.8     1.86
Table 3: Timing results (seconds), 4D/240s, four processors

case     user    sys    total    speedup
----------------------------------------
SCA      25.6    1.0    26.6     1.00
IN        7.1    0.2     7.3     3.64
OUT       6.4    0.5     6.9     3.86
While experimenting with TEST.F, I uncovered a performance issue that is largely unfamiliar from programming on PC-class machines: data cache management. Although the issue is common to both scalar and parallel programming environments, the observed effects of cache mismanagement, and their cure, are important enough to highlight separately because they affect performance. The cure is the proper use of array indexing.
Intentional cache mismanagement leads to significant performance degradation. Management in this context consists of the maximization of the use of existing data in the cache, and the minimization of data movement into the cache from regular memory. The amount of the reuse of existing cache data is commonly referred to as the "hit rate". Better performance is obtained from optimizing the hit rate.
In practical terms, the hit rate can be increased by addressing memory contiguously. Memory references of this sort typically consist of array variables in which contiguous blocks of values can be moved into the cache with less overhead than is required in order to move single values into the cache multiple times. Keep in mind that in Fortran, multidimensional arrays are stored in column order; the leftmost index represents contiguous memory addresses. Therefore, nested DO loops must vary the leftmost index of a two-dimensional array fastest, so the references are to contiguous memory locations.
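The arithmetic behind this rule can be sketched in Python (an illustration; the array dimensions are borrowed from TEST.F). For a Fortran array a(K,J,I), the linear offset of element (k,j,i) shows that stepping the leftmost index yields stride-1 accesses, while stepping the rightmost index jumps K*J elements at a time:

```python
# Fortran stores a(k,j,i) in column-major order: the linear offset is
# (k-1) + (j-1)*K + (i-1)*K*J for an array declared a(K,J,I).

K, J, I = 10, 250, 250   # dimensions as in TEST.F

def offset(k, j, i):
    # zero-based memory offset of element a(k,j,i)
    return (k - 1) + (j - 1) * K + (i - 1) * K * J

# varying the leftmost index fastest gives stride-1 (contiguous) accesses ...
assert offset(2, 7, 3) - offset(1, 7, 3) == 1
# ... while varying the rightmost index fastest gives stride K*J
assert offset(1, 7, 4) - offset(1, 7, 3) == K * J
```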
Cache mismanagement was tested by reversing the array index variables, and by reversing the array size declarations. The array sizes guarantee that the arrays cannot be wholly contained in the cache and that the movement of data into the cache is handled via the movement of individual array elements.
The reason why cache management is a concern is shown in Table 4. Note that the speedup for parallelization is less, due to increased system overhead, while the slowdown relative to the speed observed for proper array indexing is substantial. This translates to an additional day added to the time required to perform a three-day simulation, which is a significant penalty. The effects of cache mismanagement may be inconsequential for tasks that have small memory requirements. In the case of large memory requirements, the penalty for improper use of the cache can be considerable.
Table 4: Timing results (seconds) with reversed, cache-hostile array indexing

case     user    sys    total    speedup    slowdown
----------------------------------------------------
SCA      45.7    6.9    52.6     1.00       1.24
OUT      22.8    7.0    29.7     1.77       1.31
Will the application of these tools to real programs translate to significant performance gains, as shown by TEST.F? This will very likely depend upon the individual program. In the case of one program that is important to my own work, I have created significant speed enhancements by the parallelization of key routines found by profiling. An additional increase in speed is also possible by paying attention to load balancing and cache management.
Number crunching has never been cheap. Not everyone has access to a supercomputer, and not every problem can justify the cost. Al Cameron, who writes the "Engineering/Science" column for MIPS magazine, advocates the use of workstations dedicated to running one problem around the clock for long periods of time as a low-cost alternative to supercomputing. I share his opinion, but this solution is clearly acceptable only if you can afford to wait. Machines such as the Power Series from Silicon Graphics provide the raw power needed for numerically intensive computing, plus the software tools needed to extract that power. With properly parallelized programs, the wait is considerably shorter.
I would like to thank Nancy Marx and Josh Blumert, both of Silicon Graphics, for their helpful discussions. I would especially like to thank the authors of the "SGI Parallelization Documentation" for an exceptionally clear and useful manual. All documentation should be written this way. Many of the examples (somewhat simplified) were taken directly from the documentation, with permission from Silicon Graphics.
Each directive begins in column 1 and behaves as a comment unless multiprocessing is enabled during compilation (the -mp option). Code containing directives is generally portable, except to platforms that use a similar directive scheme, such as the Cray.
Variables referenced in the scope arguments are by name only; declaration is done in the usual manner at the top of the program. Default scope declaration is SHARE for all undeclared variables and for variables appearing in COMMON blocks, except the DO loop index, which is LASTLOCAL. Variables can only appear in one scope argument.
C$DOACROSS [LOCAL(list...), SHARE(list...), LASTLOCAL(list...)]
This is the principal directive used to select a loop for parallelization. It immediately precedes the selected loop and takes any of three optional scope arguments. C$DOACROSS directives do not nest; nested directives are ignored.
LOCAL(list ... ) are all variables local to each separately executing thread. All LOCAL variables must be explicitly declared, and all data types, including arrays, can be of type LOCAL.
SHARE(list ... ) are variables common to all executing threads and the remainder of the program. SHARE variables are typically arrays; nonarray variables should be read-only to avoid data dependency.
LASTLOCAL(list ... ) are local variables whose value from the last-executing thread is returned for use past the parallelized block. The value returned is not always the value you might assume, and great caution must be used, especially if the load is not balanced evenly between threads.
C$& -- This is treated as a continuation line for lengthy parallelization directives.
C$ -- This is used to selectively include lines or blocks of code during parallelization. Typically, this is used to mask lines that are parallel specific or call routines from the parallel library. -- B.B.
Upon encountering a C$DOACROSS statement, the compiler creates a new subroutine that represents the loop and its contained functionality, and then replaces the loop with a call to the library routine MP_MULTI. The new subroutine is named by prepending an underscore to the name of the routine that originally contained the loop, and then appending an underscore and a number. The number starts at 1 and is incremented for each C$DOACROSS contained in the original routine. The new subroutine treats variables declared LOCAL as local variables. Variables declared SHARE are resolved by reference back to the original routine. Any additional variables that are required are local.
The new routine is called with four arguments: The starting value of the index, the maximum number of executions of the loop, the increment value of the index, and a unique process ID number. The remainder of the routine consists of the parallelized DO loop. The DO loop is restructured (see Listing Two, page 123) in order to permit multiple copies of the new routine running in parallel to work with independent sections of the DO loop. The code of the new subroutine is written once and the portion of the DO loop that is manipulated is controlled by the calling routine, MP_MULTI. This permits a great deal more run-time flexibility than is available when hard-coded routines are used.
At run time, the parallelization is effected by a master routine, which handles all of the scalar code. At the point in the code where the DO loop would have been executed, the master routine passes the name of the new process to MP_MULTI, along with the starting value of the DO loop index, the number of cycles, and the index increment. MP_MULTI divides the loop according to the number of available processors, and then invokes the new process that number of times, each invocation running as a slave routine. Additional routines manage interprocess signalling and synchronization. The slave subprocesses need not run in synchronization with each other. When the last slave process completes, control returns to the master routine at the point past the loop.
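The exact division MP_MULTI performs is internal to the library, but the bookkeeping can be sketched in Python (a hypothetical illustration; split_loop and run are invented names, not SGI routines): a dispatcher hands each slave its own (start, ntrip, incr) triple, and the slices together cover the original index sequence exactly once.

```python
# Hypothetical sketch of how a dispatcher like MP_MULTI might split a
# DO loop's (start, ntrip, incr) description among p slave processes:
# each slave gets its own start and trip count, keeping the increment.

def split_loop(start, ntrip, incr, p):
    base, extra = divmod(ntrip, p)       # spread leftover trips evenly
    slices, done = [], 0
    for thread in range(p):
        trips = base + (1 if thread < extra else 0)
        slices.append((start + done * incr, trips, incr))
        done += trips
    return slices

def run(start, ntrip, incr):
    # mimic the restructured loop of Listing Two: ntrip executions,
    # stepping the index by incr each time
    i, out = start, []
    for _ in range(ntrip):
        out.append(i)
        i += incr
    return out

# the slave slices together visit exactly the original index sequence
full = run(1, 103, 2)
pieces = [run(s, t, inc) for (s, t, inc) in split_loop(1, 103, 2, 4)]
assert [i for piece in pieces for i in piece] == full
```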
This parallelization scheme is very flexible. Parallelized code can run on a single processor system without change, while dynamically adapting to the number of available heads. At the other extreme, the number of threads can be limited dynamically by setting the environment variable NUM_THREADS to a number less than the maximum number of available heads.
_OPTIMIZING IN A PARALLEL ENVIRONMENT_ by Barr E. Bauer
Copyright © 1989, Dr. Dobb's Journal
[LISTING ONE]

      program test
*
* purpose is to test SGI parallelization scheme for loop selection,
* numerically-intensive calculations, and total reduction. See text
* for details.
*
      parameter (MAXFIRST=250, MAXSECOND=250, MAXTHIRD=10)
      real*8 a(MAXTHIRD,MAXSECOND,MAXFIRST)
      real*8 b(MAXTHIRD,MAXSECOND,MAXFIRST)
      real*8 sub_total(MAXFIRST), partial_total(4)
      real*8 d(MAXTHIRD), c, tmp           ! local variables
      real*8 dist(MAXSECOND,MAXFIRST)
      real*8 grand_total                   ! test for proper operation
      logical parallel                     ! selects 2-version loops
      integer*4 iflag                      ! used to show LASTLOCAL value
      data parallel /.false./
      data sub_total, iflag /MAXFIRST*0.0, 0/
*
* outer loop: contains both interior loops
*
* C$doacross local(k,j,i,tmp,d,c), share(a,b,sub_total,dist),
* C$&        lastlocal(iflag)
      do i = 1, MAXFIRST
*
* first inner loop: fills arrays a and b
*
* C$doacross local(j,k,c), share(i,a,b)
        do j = 1, MAXSECOND
          do k = 1, MAXTHIRD
            a(k,j,i) = dsqrt(dfloat(i*j*k))
            c = 1.0 - a(k,j,i)
            if (c .le. 0.0 .and. i .lt. j*k) then
              c = -c
            endif
            c = c**2
            b(k,j,i) = 32*(dcos(c)**5)*dsin(c)-
     1                 32*(dcos(c)**3)*dsin(c)+
     2                 6*dcos(c)*dsin(c)
          enddo
        enddo
*
* second inner loop: determines distance and starts summation
*
* c$doacross local(j,k,d,tmp), share(i,a,b,dist,sub_total),
* c$&        lastlocal(iflag)
        do j = 1, MAXSECOND
          tmp = 0.0
          do k = 1, MAXTHIRD
            d(k) = a(k,j,i) - b(k,j,i)
          enddo
          do k = 1, MAXTHIRD
            tmp = tmp + d(k)**2
          enddo
          dist(j,i) = dsqrt(tmp)
          if (dist(j,i) .le. 0.1) iflag = iflag + 1
          sub_total(j) = sub_total(j) + dist(j,i)
        enddo
      enddo
*
* the next section is an example of sum reduction optimized to the
* parallel environment and the use of a more efficient 2-loop summation
* if the -mp option is active, parallel is set to .true., which then
* selects the parallel version
*
C$    parallel = .true.
      grand_total = 0.0
      if (parallel) then                   ! parallel version
C$      num_threads = mp_numthreads()
        ichunk = (MAXFIRST + (num_threads - 1))/num_threads
C$doacross local(k,j),
C$&        share(num_threads,partial_total,sub_total,ichunk)
        do k = 1, num_threads              ! this loop is parallelized
          partial_total(k) = 0.0
          do j = k*ichunk - ichunk + 1, min(k*ichunk,MAXFIRST)
            partial_total(k) = partial_total(k) + sub_total(j)
          enddo
        enddo
        do j = 1, num_threads              ! smaller loop handled as scalar
          grand_total = grand_total + partial_total(j)
        enddo
      else                                 ! the scalar version
        do j = 1, MAXFIRST
          grand_total = grand_total + sub_total(j)
        enddo
      endif
*
      if (parallel) then
C$      write (*,10) grand_total, num_threads
C$      write (*,20) iflag
      else
        write (*,30) grand_total
        write (*,40) iflag
      endif
*
C$10  format(1x,'grand total = ',g10.3,' threads = ',i4)
C$20  format(1x,'parallel iflag = ',i10)
 30   format(1x,'grand total = ',g10.3)
 40   format(1x,'scalar iflag = ',i10)
      end
[LISTING TWO]
      subroutine example(a, b, c, n)
      real*4 a(n), b(n), c(n)
c$doacross local(i, x)
      do i = 1, n
        x = a(i) * b(i)
        c(i) = x**2
      enddo
      end

(the loop is transformed to)

      subroutine _example_1(
     1          _local_start,          ! index starting value
     2          _local_ntrip,          ! number of loop executions
     3          _incr,                 ! index increment
     4          _my_threadno)          ! unique process ID number
      integer*4 _local_start, _local_ntrip, _incr, _my_threadno
      integer*4 i                      ! declared local
      real*4 x                         ! declared local
      integer*4 _tmp                   ! created local
      i = _local_start
      do _tmp = 1, _local_ntrip
        x = a(i) * b(i)
        c(i) = x**2
        i = i + _incr
      enddo
      end