Parallel Data Management Patterns
Data access and management patterns organize access to data but do not operate on the values themselves. Many combinations of specific data-access and computational patterns are common and may be considered patterns in their own right. This is because for efficient implementation it is often imperative for a data-access pattern to be fused with a specific parallel computational pattern.
Given a collection of indices and an indexable collection, a gather generates an output collection by reading from all the locations given by the indices in parallel.
A random read is a serial pattern but when used from within a map it becomes a collective gather. In addition, a gather might be supported by an explicit collective operation.
The search pattern is like gather, but retrieves data from a "database" collection based on matching against content. Parallelism is achieved by searching for all keys in parallel, or by searching in different parts of the database in parallel.
In parallel algorithms, we often want to divide the input into a number of pieces and then operate on each piece in parallel.
There are several possible variants of subdivision. The partition of a collection divides it into a nested collection of non-overlapping regions of the same size. The segmentation of a collection divides it into a segmented collection of non-overlapping regions of possibly different sizes. The tiling of a collection creates a collection of possibly overlapping references to regions within the larger collection.
A useful extension of map (which can also be seen as a regular variant of tiling followed by map) is the neighborhood stencil. In this pattern, regular spatial neighborhoods in an input array are operated on rather than only single elements. This is known as a "finite convolution" in signal processing, but this pattern also occurs in many simulation (PDE solvers) and matrix computations.
Some attention has to be paid to neighborhoods that extend past the edge of the array. Such accesses should be transformed (for example, by wrapping or clamping the index) so it maps to a well-defined value.
Implementing this pattern efficiently using low-level operations is surprisingly complicated, which argues for its inclusion as a basic pattern. It is useful to generate separate versions of the task for the interior of the input array and the boundaries. Also, a sliding window over partially overlapping regions of the input array may be useful.
However, for portability these optimizations can and should take place in the language implementation itself, since they vary by hardware target. Also, while induction variable analysis can and should be used to identify the stencil pattern whenever possible, the interface should also allow a straightforward and direct specification of the stencil.
Scatter writes data to a random location (given by an integer index) in a collection, overwriting any data that may already be stored at that location. Several varieties of scatter can be identified, depending on how collisions (parallel writes to the same location) are resolved.
A priority scatter allows writes into a collection of locations in an existing collection given a collection of data. Collisions (duplicate writes to the same location) are resolved using a deterministic rule.
The priority scatter operation is useful because the serial ordering of loop bodies can be used to generate the disambiguating rule. Loops with scatter operations (as long as they are not also loop-carried dependencies) can then be safely converted into priority scatters.
There are a number of ways to implement priority scatter efficiently. If it can be proven that no collisions are possible with other threads, then it can be implemented using ordinary serial semantics. For example, if the output of a map is a partition, it is only possible for each invocation of a function to scatter into its own output partition. Another case is when output data is allocated dynamically by a thread.
Atomic scatter is a non-deterministic pattern (the only one considered in this paper) that only guarantees that one result will survive in an output location when a collision occurs. Implementing atomic scatter is still potentially expensive if locking is necessary to preserve atomic writes.
A permutation scatter is a scatter that is only guaranteed to work correctly if there are no collisions. It can be typically be implemented efficiently in terms of an unsafe scatter. However, this means that it may produce incorrect results if incorrectly used with a set of write locations that do contain duplicate address values, so a debug mode should be provided that checks for such incorrect usage.
A merge scatter uses an associative operator to combine elements when a collision occurs. This rule can also be used to combine scattered values with existing data in an array. An example of this is the computation of a histogram.
The pack pattern is used to eliminate wasted space in a sparse collection and to handle variable-rate output from a map. From within map, each function activation is allowed to either keep or discard its outputs. The survivors are then packed together into a single collection. A variant of this is the expand pattern that can output zero or more values. A standalone pack collective operation is not as useful as one that can be fused with map, since the latter does not need to allocate memory for data that is to be discarded.
Deterministic parallel programs can be built from the bottom up by composing deterministic parallel patterns of computation and data access. However, an implementation of a programming model based on these patterns must not only support a sufficiently wide variety of patterns, it also needs to be able to control the granularity of their execution, expanding and fusing them as needed.
[Aldinucci 2007] M. Aldinucci and M. Danelutto, Skeleton-based parallel programming: Functional and parallel semantics in a single shot, Comput. Lang. Syst. Struct., 33(3-4), 2007, pp. 179-192.
[Asanovic 2006] K. Asanovic et al, The Landscape of Parallel Computing Research: A View from Berkeley, EECS Department, University of California, Berkeley EECS-2006-183, 2006.
[Blelloch 1993] G. E. Blelloch, J. C. Hardwick, S. Chatterjee, J. Sipelstein and M. Zagha, Implementation of a portable nested data-parallel language, PPOPP '93: Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming, 1993, pp. 102-111
[Blelloch 1996] G. E. Blelloch, Programming parallel algorithms, Commun. ACM, 39(3), 1996, pp. 85-97.
[Blelloch 1990] G. E. Blelloch, Vector models for data-parallel computing, MIT Press, 1990.
[Bosch 1998] J. Bosch. Design patterns as language constructs. Journal of Object-Oriented Programming, 11(2):18-32, 1998.
[Bromling 2002] S. Bromling, S. MacDonald, J. Anvik, J. Schaefer, D. Szafron, K. Tan, Pattern-based parallel programming, Proceedings of the International Conference on Parallel Programming (ICPP'2002), August 2002, Vancouver Canada, pp. 257-265.
[Buck 2007] I. Buck, GPU Computing: Programming a Massively Parallel Processor, Proceedings of the International Symposium on Code Generation and Optimization, IEEE Computer Society, March 11-14, 2007.
[Cole 1989] M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation, Pitman/MIT Press, 1989.
[Cole 2004] M. Cole, Bringing skeletons out of the closet: a pragmatic manifesto for skeletal parallel programming, Parallel Computing, 30(3), pp. 389-406, March 2004 [Czarnecki 2000] K. Czarnecki and U. Eisenecker, Generative Programming: Methods, Tools, and Applications, 2000, ACM Press/Addison-Wesley.
[Darlington 1995] J. Darlington, Y. Guo, H. W. To and J. Yang, Parallel skeletons for structured composition, SIGPLAN Not., 30(8), 1995, pp.19-28.
[Dijkstra 1968] E. Dijkstra, Go To Statement Considered Harmful, Communications of the ACM, 11 (3), March 1968, 147-148.
[Dorta 2006] A. Dorta, P. López and F. de Sande, Basic skeletons in 11c, Parallel Computing, 32(7), pp. 491-506, September 2006.
[Gamma 1994] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1994.
[Gorlatch 1999] S. Gorlatch, C. Wedler and C. Lengauer, Optimization Rules for Programming with Collective Operations, Proceedings of the 13th International Symposium on Parallel Processing and the 10th Symposium on Parallel and Distributed Processing, pp. 492-499, April 12-16, 1999.
[Herrington 2003] J. Herrington, Code Generation in Action, 2003, Manning Publications.
[Kessler 2004] C. Kessler, A practical access to the theory of parallel algorithms, ACM SIGCSE Bulletin, 36(1), March 2004
[Klusik 2000] U. Klusik, R. Loogen, S. Priebe, F. Rubio, Implementation Skeletons in Eden: Low-Effort Parallel Programming, Selected Papers from the 12th International Workshop on Implementation of Functional Languages, pp.71-88, September 2000.
[Lamport 1974] L. Lamport, The Parallel Execution of DO Loops, Communications of the ACM 17(2), February 1974, pp. 83-93.
[Lee 1996] P. Lee and M. Leone, Optimizing ML with run-time code generation, Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation, p.137-148, May 1996.
[Lee 2006] E. A. Lee, The Problem with Threads, IEEE Computer, 39(5), May 2006, pp. 33-42.
[MacDonald 2002] S. MacDonald, J. Anvik, S. Bromling, D. Szafron, J. Schaeffer and K. Tan. From patterns to frameworks to parallel programs, Parallel Computing, 28(12);1663-1683, 2002.
[Massingill 1999] M. Massingill, T. Mattson, and B. Sanders. A pattern language for parallel application programs. Technical Report CISE TR 99-022, University of Florida, 1999.
[Mattson 2004] T. Mattson, B. Sanders, B. Massingill, Patterns for Parallel Programming, 2004, Pearson Education.
[McCool 2002] M. McCool, Z. Qin, and T. Popa, Shader metaprogramming, HWWS '02: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, 2002, pp. 57-68.
[McCool 2004] M. McCool, S. Du Toit, T. Popa, B. Chan and K. Moule, Shader algebra, ACM Trans. Graph., 23(3), 2004, pp. 787-795.
[McCool 2004] M. McCool and S. Du Toit, Metaprogramming GPUs with Sh, 2004, AK Peters.
[McCool 2006] M. McCool. Data-Parallel Programming on the Cell BE and the GPU Using the RapidMind Development Platform. GSPx Multicore Applications Conference, 9 pages, 2006.
[Pelagatti 1998] S. Pelagatti, Structured development of parallel programs, Taylor & Francis, Inc., Bristol, PA, 1998.
[Owens 2005] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krueger, A. E. Lefohn, and T. J. Purcell, A survey of general-purpose computation on graphics hardware, Eurographics 2005, State of Art Report.
[Schmidt 2000] D. Schmidt, M. Stal, H. Rohnert, and F. Buschmann. Pattern-Oriented Software Architecture: Patterns for Concurrent and Networked Objects, volume 2. Wiley & Sons, 2000.
[Serot 2002] J. Sérot, D. Ginhac, Skeletons for parallel image processing: an overview of the SKIPPER project, Parallel Computing, 28(12), pp.1685-1708, December 2002.
[Siu 1996] S. Siu, M. De Simone, D. Goswami, and A. Singh. Design patterns for parallel programming. Proceedings of the 1996 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'96), pp. 230-240, 1996.
[Skillicorn 1998] D. B. Skillicorn and D. Talia, Models and languages for parallel computation, ACM Comput. Surv., 30(2), 1998, pp.123-169.
[Tan 2003] K. Tan, D. Szafron, J. Schaeffer, J. Anvik and S. MacDonald, Using generative design patterns to generate parallel code for a distributed memory environment, PPoPP '03: Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming, 2003, pp 203-215.
[Veldhuizen 1999] T. L. Veldhuizen, C++ templates as partial evaluation, In ACM SIGPLAN Workshop on Partial Evaluation and Semantics-Based Program Manipulation, 1999.