### Case Study: Content-based Image Retrieval

Experience has shown that an easy way to understand patterns and how they are used is to follow an example. In this section we describe a problem and its parallelization by using patterns from OPL. In doing so we describe a subset of the patterns and give some indication of the way we make transitions between layers in the pattern language.

In particular, to understand how OPL can help software architecture, we use a content-based image retrieval (CBIR) application as an example. From this example (drawn from [14]), we show how structural and computational patterns can be used to describe the CBIR application and how the lower-layer patterns can be used to parallelize an exemplar component of the CBIR application.

In Figure 2 we see the major elements of our CBIR application as well as its data flow. The key elements of the application are the feature-extractor, trainer, and classifier components. Given a set of new images, the feature extractor collects features of the images. Given the features of the new images, chosen examples, and some classified new images from user feedback, the trainer trains the parameters needed by the classifier. Given those parameters, the classifier classifies the new images based on their features. The user can classify some of the resulting images and feed that information back to the trainer repeatedly in order to increase the accuracy of the classifier. This top-level organization of CBIR is best represented by the Pipe-and-Filter structural pattern: the feature extractor, trainer, and classifier are filters (computational elements) connected by pipes (data communication channels). Data flows through the succession of filters, which do not share state and take input only from their input pipe(s); each filter performs the appropriate computation on those data and passes the output to the next filter(s) via its output pipe(s). The choice of the Pipe-and-Filter pattern to describe the top-level structure of CBIR is not unusual; many applications are naturally described by Pipe-and-Filter at the top level.
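The top-level Pipe-and-Filter organization can be sketched in a few lines of Python, with generators standing in for the pipes. This is a toy sketch, not the CBIR code: the trainer filter is omitted, and the histogram features and threshold rule are invented placeholders. Its only point is that each filter takes input solely from its input pipe and shares no state with the others.

```python
def feature_extractor(images):
    # Filter 1: map each image to a feature vector.
    # Toy feature: a 4-bin histogram of pixel values 0-3.
    for img in images:
        yield [img.count(v) for v in range(4)]

def classifier(features, params):
    # Filter 2: label each feature vector using trained parameters.
    # Toy rule: threshold on the first histogram bin.
    for f in features:
        yield 1 if f[0] >= params["threshold"] else -1

# Compose the pipeline: generators act as the pipes, so data streams
# through the filters without any shared mutable state.
images = [[0, 0, 1, 2], [3, 3, 3, 1], [0, 1, 2, 3]]
params = {"threshold": 2}          # would come from the trainer filter
labels = list(classifier(feature_extractor(images), params))
```

Because each stage is a generator, images stream through one at a time, which is exactly the filter behavior the pattern prescribes.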

In our approach we architect software by using patterns in a hierarchical fashion. Each filter within the CBIR application contains a complex set of computations. We can parallelize these filters using patterns from OPL. Consider, for example, the classifier filter. There are many approaches to classification, but in our CBIR application we use a support-vector machine (SVM) classifier. SVM is widely used for classification in image recognition, bioinformatics, and text processing. The SVM classifier evaluates the function:

f(**z**) = sgn( Σ_i α_i y_i Φ(**x**_i, **z**) + b )

where **x**_i is the i-th support vector, **z** is the query vector, Φ is the kernel function, α_i is the weight, y_i ∈ {-1, 1} is the label attached to support vector **x**_i, b is a bias parameter, and **sgn** is the sign function. To evaluate the function quickly, we observe that the kernel functions operate on the products and norms of **x**_i and **z**. We can compute the products between a set of query vectors and the support vectors as a single BLAS level-3 operation with higher throughput. Therefore, we compute the products and norms first, use the results to compute the kernel values, and sum up the weighted kernel values. We architect the SVM classifier as in Figure 3. The basic structure of the classifier filter is itself a simple Pipe-and-Filter structure with two filters: the first filter takes the test data and the support vectors and calculates the dot products between the test data and each support vector; this dot-product computation is naturally performed by using the Dense-Linear-Algebra computational pattern. The second filter takes the resulting dot products, computes the kernel values, sums up all the kernel values, and scales the final results if necessary. The structural pattern associated with these computations is Map-Reduce (see the Map-Reduce sidebar).
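The decision function and the two-filter structure of Figure 3 can be sketched in plain Python. This is an illustrative sketch rather than the article's implementation: the RBF kernel, the parameter names, and the tiny data set are our assumptions, and a production version would batch the dot products of all queries into a single BLAS level-3 call as described above.

```python
import math

def svm_classify(queries, sv, alpha, y, b, gamma):
    """Toy SVM classifier: filter 1 computes dot products and norms,
    filter 2 maps them to kernel values and reduces to sgn(sum + b)."""
    labels = []
    for z in queries:
        # Filter 1: dot products between the query and every support
        # vector (a tuned version does this for all queries at once
        # as one BLAS level-3 matrix product).
        dots = [sum(a * c for a, c in zip(x, z)) for x in sv]
        nz = sum(c * c for c in z)
        # Filter 2, map phase: RBF kernel from products and norms,
        # Phi(x, z) = exp(-gamma * (||x||^2 - 2 x.z + ||z||^2)).
        kvals = [math.exp(-gamma * (sum(c * c for c in x) - 2 * d + nz))
                 for x, d in zip(sv, dots)]
        # Filter 2, reduce phase: weighted sum of kernel values, then sign.
        s = sum(a_i * y_i * k for a_i, y_i, k in zip(alpha, y, kvals)) + b
        labels.append(1 if s >= 0 else -1)
    return labels

# Two support vectors with opposite labels, queried at their own positions.
labels = svm_classify([[1, 0], [0, 1]], sv=[[1, 0], [0, 1]],
                      alpha=[1.0, 1.0], y=[1, -1], b=0.0, gamma=1.0)
```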

In a similar way the feature-extractor and trainer filters of the CBIR application can be decomposed. With that elaboration we would consider the "high-level" architecture of the CBIR application complete. In general, to construct a high-level architecture of an application, we decompose the application hierarchically by using the structural and computational patterns of OPL.

Constructing the high-level architecture of an application is essential: this effort not only improves the viability of the software but also eases communication about its organization. However, there is still much work to be done before we have a working software application. To perform this work we move from the top layers of OPL (structural and computational patterns) down into the lower layers (Concurrent Algorithmic Strategy patterns and below). To illustrate this process we provide additional detail on the SVM classifier filter.

**Structural Pattern**: Map-Reduce

Solution: A solution is structured in two phases: (1) a map phase where items from an "input data set" are mapped onto a "generated data set" and (2) a reduction phase where the generated data set is reduced or otherwise summarized to generate the final result. It is easy to exploit concurrency in the map phase, since the map functions are applied independently for each item in the input data set. The reduction phase, however, requires synchronization to safely combine partial solutions into the final result.
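A minimal runnable sketch of the two phases (our own generic example, not taken from OPL): the map phase runs with no synchronization because each call is independent, while the reduce phase combines the generated data set into one result.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_reduce(items, map_fn, reduce_fn, init):
    # Map phase: map_fn is applied independently to each input item,
    # so the calls can safely run concurrently.
    with ThreadPoolExecutor() as pool:
        generated = list(pool.map(map_fn, items))
    # Reduce phase: summarize the generated data set into the final
    # result; a parallel reduction would need synchronization here.
    return reduce(reduce_fn, generated, init)

# Example: sum of squares of 0..4.
total = map_reduce(range(5), lambda x: x * x, lambda a, b: a + b, 0)
```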

**Concurrent Algorithmic Strategy Patterns**

After identifying the structural and computational patterns in the SVM classifier, we need to find appropriate strategies to parallelize the computation. In the Map-Reduce pattern the same computation is mapped to different non-overlapping partitions of the state set; the results of these computations are then gathered, or reduced. If we are interested in arriving at a parallel implementation of this computation, we define the Map-Reduce structure in terms of a Concurrent Algorithmic Strategy. The natural choices of Algorithmic Strategy are the Data-Parallelism and Geometric-Decomposition patterns. By using the Data-Parallelism pattern we can compute the kernel value of each dot product in parallel (see the Data-Parallelism sidebar). Alternatively, by using the Geometric-Decomposition pattern (see the Geometric-Decomposition sidebar) we can divide the dot products into regular chunks of data, apply the kernel computation locally on each chunk, and then apply a global reduce to compute the summation over all chunks for the final results. We are interested in designs that can utilize large numbers of cores. Since the solution based on the Data-Parallelism pattern exposes more concurrent tasks (due to the large number of dot products) than the more coarse-grained geometric-decomposition solution, we choose the Data-Parallelism pattern for implementing the Map-Reduce computation.

**Algorithmic Strategy Pattern**: Geometric-Decomposition

**Solution**: An algorithm is organized by (1) dividing the key data structures within a problem into regular chunks and (2) updating each chunk in parallel. Typically, communication occurs at chunk boundaries, so an algorithm breaks down into three components: (1) exchange boundary data, (2) update the interior of each chunk, and (3) update the boundary regions. The size of the chunks is dictated by the properties of the memory hierarchy, chosen to maximize reuse of data from local memory/cache.
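As a concrete (and hedged) illustration of the three components, here is a sequential sketch of a 1-D three-point averaging stencil decomposed into regular chunks; the clamped boundaries and chunk-size rule are our assumptions, and in a parallel version each chunk's update would run on its own processing element.

```python
def smooth_chunked(data, nchunks):
    """Three-point averaging stencil via geometric decomposition."""
    n = len(data)
    size = n // nchunks
    out = [0.0] * n
    for c in range(nchunks):          # each chunk is independent work
        lo = c * size
        hi = (c + 1) * size if c < nchunks - 1 else n
        # (1) exchange boundary data: one halo cell from each neighbour
        #     (clamped at the ends of the array).
        left = data[lo - 1] if lo > 0 else data[lo]
        right = data[hi] if hi < n else data[hi - 1]
        halo = [left] + data[lo:hi] + [right]
        # (2) + (3) update interior and boundary points of the chunk.
        for i in range(lo, hi):
            j = i - lo + 1            # position inside the halo copy
            out[i] = (halo[j - 1] + halo[j] + halo[j + 1]) / 3.0
    return out
```

Decomposing into chunks does not change the answer; it only restructures the work so that chunks can proceed concurrently between boundary exchanges.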

The use of the Data-Parallelism algorithmic strategy to parallelize the Map-Reduce computation is shown in the pseudocode of the kernel-value calculation and the summation. These computations can be summarized as shown in Figure 4. Lines 1 to 4 are the computation of the kernel value for each dot product, which is the map phase. Lines 5 to 13 are the summation over all kernel values, which is the reduce phase. Function **NeedReduce** checks whether element "i" is a candidate for the reduction operation. If so, the **ComputeOffset** function calculates the offset between element "i" and another element. Finally, the **Reduce** function conducts the reduction operation on elements "i" and "i+offset".
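The article does not define **NeedReduce** and **ComputeOffset**; one plausible instantiation (an assumption on our part) is a binary-tree reduction schedule, sketched here sequentially with 0-based indices:

```python
def need_reduce(i, level):
    # Assumed binary-tree schedule: at level r, every 2^r-th element
    # (0-based) absorbs one partner.
    return i % (2 ** level) == 0

def compute_offset(i, level):
    # Distance to the partner element at this level.
    return 2 ** (level - 1)

def tree_reduce(values):
    """Sum `values` (length a power of two) with the Listing 1 shape."""
    local = list(values)              # LocalValue after the map phase
    n = len(local)
    for level in range(1, n.bit_length()):
        for i in range(n):            # data-parallel in a real version
            if need_reduce(i, level):
                off = compute_offset(i, level)
                local[i] = local[i] + local[i + off]
        # a Barrier() separates the levels in the parallel version
    return local[0]
```

Each level halves the number of active elements, so the reduction finishes in log2(n) synchronized steps.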

**Implementation Strategy Patterns**

To implement the data parallelism strategy from the Map-Reduce pseudo-code, we need to find the best Implementation Strategy Pattern. Looking at the patterns in OPL, both the Strict-Data-Parallel and Loop-Parallel patterns are applicable.

**Implementation Strategy Pattern**: Strict-Data-Parallel

**Solution**: Implement a data parallel algorithm as a single stream of instructions applied concurrently to the elements of a data set. Updates to each element are either independent, or they involve well-defined collective operations such as reductions or prefix scans.

Whether we choose the Strict-Data-Parallel or the Loop-Parallel pattern in the implementation layer, we can use the SIMD pattern to realize the execution. For example, we can apply SIMD on line 2 in Listing 1 to calculate the kernel value of each dot product in parallel. The same concept can be used on line 7 in Listing 1 to conduct the checking procedure in parallel. Moreover, in order to synchronize the computations on different processing elements on lines 4 and 12 in Listing 1, we can use the barrier construct described within the Collective-Synchronization pattern.

```
function ComputeMapReduce(DotProdAndNorm, Result) {
 1    for i <- 1 to n {
 2        LocalValue[i] <- ComputeKernelValue(DotProdAndNorm[i]);
 3    }
 4    Barrier();
 5    for reduceLevel <- 1 to MaxReduceLevel {
 6        for i <- 1 to n {
 7            if (NeedReduce(i, reduceLevel)) {
 8                offset <- ComputeOffset(i, reduceLevel);
 9                LocalValue[i] <- Reduce(LocalValue[i], LocalValue[i+offset]);
10            }
11        }
12        Barrier();
13    }
14 }
```
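As a rough stand-in for applying the SIMD pattern to line 2 of Listing 1, a vectorized NumPy expression computes every kernel value under one logical instruction stream; the RBF-style kernel shape and the gamma value are assumptions made for illustration.

```python
import numpy as np

def compute_kernel_values(dot_products, gamma=0.5):
    # One vectorized expression replaces the scalar loop over i:
    # the hardware can evaluate the exponentials SIMD-wise.
    d = np.asarray(dot_products, dtype=float)
    return np.exp(-gamma * d)

kvals = compute_kernel_values([0.0, 2.0])
total = float(kvals.sum())   # the reduction of lines 5-13 in one call
```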

In summary, the computation of the SVM classifier can be viewed as a composition of the Pipe-and-Filter, Dense-Linear-Algebra, and Map-Reduce patterns. To parallelize the Map-Reduce computation, we used the Data-Parallelism pattern. To implement the Data-Parallelism Algorithmic Strategy, both the Strict-Data-Parallel and Loop-Parallel patterns are applicable. We chose the Strict-Data-Parallel pattern, since it seemed the more natural choice given that we wanted to expose large amounts of concurrency for use on many-core chips with large numbers of cores. It is important to appreciate, however, that this is a matter of style, and a quality design could have been produced by using the Loop-Parallel pattern as well. To map the Strict-Data-Parallel pattern onto a platform for execution, we chose the SIMD pattern. While we did not show the details of all the patterns used, along the way we used the Shared-Data pattern to define the synchronization protocols for the reduction and the Collective-Synchronization pattern to describe the barrier construct. These functions (reduction and barrier) are commonly provided as part of a parallel programming environment; hence, while programmers need to be aware of these constructs and what they provide, it is rare that they will need to explore their implementation in any detail.

### Other Patterns

OPL is not complete. Currently OPL is restricted to those parts of the design process associated with architecting and implementing applications that target parallel processors. There are countless additional patterns that software development teams utilize. Probably the best-known example is the set of design patterns used in object-oriented design [5]. We made no attempt to include these in OPL. An interesting framework that supports common patterns in parallel object-oriented design is Threading Building Blocks (TBB) [15].

OPL focuses on patterns that are ultimately expressed in software. These patterns do not, however, address methodological patterns that experienced parallel programmers use when designing or optimizing parallel software. The following are some examples of important classes of methodological patterns.

- Finding Concurrency patterns [7]. These patterns capture the process that experienced parallel programmers use when exploiting the concurrency available in a problem. While these patterns were developed before our set of Computational patterns was identified, they appear to be useful when moving from the Computational patterns category of our hierarchy to the Parallel Algorithmic Strategy category. For example, applying these patterns would help to indicate when geometric decomposition is chosen over data parallelism as a dense linear algebra problem moves toward implementation.
- Parallel Programming "Best Practices" patterns. These describe a broad range of patterns that we are actively mining as we examine the detailed work of creating highly efficient parallel implementations. These patterns appear to be useful when moving from the Implementation Strategy patterns to the Concurrent Execution patterns. For example, we are finding common patterns associated with optimizing software to maximize data locality.

There is a growing community of programmers and researchers involved in the creation of OPL. The current status of OPL, including the most recent updates of patterns, can be found at http://parlab.eecs.berkeley.edu/wiki/patterns/patterns. This website also has links to details on our shorter monthly patterns workshop as well as our longer, two-day, formal patterns workshop. We welcome your participation.

### Summary, Conclusions, and Future Work

We believe that the key to addressing the challenge of writing software is to architect the software. In particular, we believe that the key to addressing the new challenge of programming multi-core and many-core processors is to carefully architect the parallel software. We can define a systematic methodology for software architecture in terms of design patterns and a pattern language. Toward this end we have taken on the ambitious project of creating a comprehensive pattern language that stretches all the way from the initial software architecture of an application down to the lowest-level details of software implementation.

OPL is a work in progress. We have defined the layers in OPL, listed the patterns at each layer, and written text for many of the patterns. Details are available online [4]. On the one hand, much work remains to be done. On the other hand, we feel confident that our structural patterns capture the critical ways of composing software, and our computational patterns capture the key underlying computations. Similarly, as we move down through the pattern language, we feel that the patterns at each layer do a good job of addressing most of the key problems for which they are intended. The current state of the textual descriptions of the patterns in OPL is somewhat nascent. We need to finish writing the text for some of the patterns and have them carefully reviewed by experts in parallel applications programming. We also need to continue mining patterns from existing parallel software to identify patterns that may be missing from our language. Nevertheless, last year's effort spent in mining five applications netted (only) three new patterns for OPL. This shows that while OPL is not fully complete, it is not, with the caveats described earlier, dramatically deficient.

Complementing the efforts to mine existing parallel applications for patterns is the process of architecting new applications by using OPL. We are currently using OPL to architect and implement a number of applications in areas such as machine learning, computer vision, computational finance, health, physical modeling, and games. During this process we are watching carefully to identify where OPL helps us and where OPL does not offer patterns to guide the kind of design decisions we must make. For example, mapping a number of computer-vision applications to new generations of many-core architectures helped identify the importance of a family of data layout patterns.

The scope of the OPL project is ambitious. It stretches across the full range of activities in architecting a complex application. It has been suggested that we have taken on too large a task; that it is not possible to define the complete software design process in terms of a single design pattern language. However, after many years of hard work, nobody has been able to solve the parallel programming problem with specialized parallel programming languages or tools that automate the parallel programming process. We believe a different approach is required, one that emphasizes how people think about algorithms and design software. This is precisely the approach supported by design patterns, and based on our results so far, we believe that patterns and a pattern language may indeed be the key to finally resolving the parallel programming problem.

While this claim may seem grandiose, we have an even greater aim for our work. We believe that our efforts to identify the core computational and structural patterns for parallel programming have led us to begin to identify the core computational elements (computational patterns, analogous to atoms) and means of assembling them (structural patterns, analogous to molecular bonding) of all electronic systems. If this is true, then these patterns not only serve as a means to assist software design but can also be used to architect a curriculum for a true discipline of computer science.

### References

[1] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, and K. Yelick. "A View of the Parallel Computing Landscape." Communications of the ACM, volume 52, pages 56-67, 2009.

[2] B. Chapman, G. Jost, and R. van der Pas. Using OpenMP: Portable Shared Memory Parallel Programming. MIT Press, Cambridge, Massachusetts, 2008.

[3] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message Passing Interface. 2nd edition, MIT Press, Cambridge, Massachusetts, 1999.

[4] http://parlab.eecs.berkeley.edu/wiki/patterns/patterns

[5] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object Oriented Software. Addison-Wesley, Reading, Massachusetts, 1994.

[6] C. Alexander, S. Ishikawa, and M. Silverstein. A Pattern Language: Towns, Buildings, Construction. Oxford University Press, New York, New York, 1977.

[7] T. G. Mattson, B. A. Sanders, and B. L. Massingill. Patterns for Parallel Programming. Addison-Wesley, Boston, Massachusetts, 2004.

[8] D. Garlan and M. Shaw. "An introduction to software architecture." Carnegie Mellon University Software Engineering Institute Report CMU SEI-94-TR-21, Pittsburgh, Pennsylvania, 1994.

[9] M. Shaw and D. Garlan. Software Architecture: Perspectives on an Emerging Discipline. Prentice Hall, Upper Saddle River, New Jersey, 1995.

[10] K. Asanovic, et al. "The landscape of parallel computing research: A view from Berkeley." EECS Department, University of California, Berkeley, Technical Report UCB/EECS-2006-183, 2006.

[11] F. Buschmann, R. Meunier, H. Rohnert, P. Sommerlad, and M. Stal. Pattern-Oriented Software Architecture - A System of Patterns. Wiley, Hoboken, New Jersey, 1996.

[12] J. Dean and S. Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." In Proceedings of OSDI '04: 6th Symposium on Operating System Design and Implementation. San Francisco, CA, December 2004.

[13] L. G. Valiant. "A Bridging Model for Parallel Computation." Communications of the ACM, volume 33, pages 103-111, 1990.

[14] B. Catanzaro, B. Su, N. Sundaram, Y. Lee, M. Murphy, and K. Keutzer. "Efficient, High-Quality Image Contour Detection." IEEE International Conference on Computer Vision (ICCV09), pages 2381-2388, Kyoto, Japan, 2009.

[15] J. Reinders. Intel Threading Building Blocks. O'Reilly Media, Sebastopol, California, 2007.

### Acknowledgments

The evolution of OPL has been strongly influenced by the collaborative environment provided by Berkeley's Par Lab. The development of the language has been positively impacted by students and visitors in two years of graduate seminars focused on OPL: Hugo Andrade, Chris Batten, Eric Battenberg, Hovig Bayandorian, Dai Bui, Bryan Catanzaro, Jike Chong, Enylton Coelho, Katya Gonina, Yunsup Lee, Mark Murphy, Heidi Pan, Kaushik Ravindran, Sayak Ray, Erich Strohmaier, Bor-yiing Su, Narayanan Sundaram, Guogiang Wang, and Youngmin Yi. The development of OPL has also received a boost from Par Lab faculty -- particularly Krste Asanovic, Jim Demmel, and David Patterson. Monthly pattern workshops in 2009 also helped to shape the language. Special thanks to veteran workshop moderator Ralph Johnson as well as to Jeff Anderson-Lee, Joel Jones, Terry Ligocki, Sam Williams, and members of the Silicon Valley Patterns Group.

*This article and more on similar subjects may be found in the Intel Technology Journal, December 2009 Edition, "Addressing the Challenges of Tera-scale Computing". More information can be found at http://intel.com/technology/itj.*