Christopher Diggins is a freelance programmer and consultant. He can be contacted at [email protected].
Parallel computing used to be the specialty domain of supercomputers. These days, however, even computers aimed at the home market have at least two processor cores. Four-core machines are already widely available, with affordable six-core processors on the horizon. As if that's not enough, some hard-core gamers have dual quad-core processors installed, while companies like Intel are prototyping 80-core processors.
But for decision makers in the software industry, the most pressing question is how this shift toward parallelism in hardware affects them. To address this question, we turned to Cilk Arts -- a company co-founded by legendary computer scientist Charles E. Leiserson -- which recently interviewed more than 150 companies regarding their challenges and priorities in supporting multicore platforms.
Asked about the top three reasons motivating companies to move to multicore, Cilk Arts' Ilya Mirman reports that the company repeatedly heard three key themes:
Application Performance. Achieving good performance in a concurrent application on multicore hardware is not as simple as adding a bunch of threads to an application. In the performance-critical sections of the software, all of the cores have to be kept busy. This is especially challenging because the number of cores cannot be known ahead of time. Performance has to be as good as possible whether there are 1, 2, or even 16 or more cores available. Other factors affecting the performance of concurrent software are efficient management of synchronization mechanisms (e.g., locks) and efficient distribution of work across the cores. Locks are a widely used mechanism for ensuring that resources and memory can be shared between threads without corruption. Inadequate use of locks can lead to race conditions, while overuse leads to poor performance. To maximize usage of the cores, work has to be distributed among them dynamically as each core completes its various tasks.
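To make the idea of dynamic distribution concrete, here is a minimal sketch in standard C++. It is only an illustration: the function name square_all and the chunk size are assumptions made for this example, not part of any product discussed in the interviews. The number of workers is taken from the machine at run time, and each worker repeatedly claims the next unprocessed chunk from a shared atomic counter, so a core that finishes early simply picks up more work.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: squares every element of `input` using however many
// hardware threads the machine reports. Work is handed out dynamically:
// each worker claims the next unprocessed chunk from a shared atomic counter.
std::vector<long long> square_all(const std::vector<int>& input,
                                  std::size_t chunk_size = 1024) {
    std::vector<long long> output(input.size());
    if (chunk_size == 0) chunk_size = 1;

    // hardware_concurrency() may return 0 when the count cannot be determined.
    const unsigned workers = std::max(1u, std::thread::hardware_concurrency());

    std::atomic<std::size_t> next{0};  // start index of the next unclaimed chunk

    auto work = [&] {
        for (;;) {
            const std::size_t begin = next.fetch_add(chunk_size);
            if (begin >= input.size()) return;  // nothing left to claim
            const std::size_t end = std::min(begin + chunk_size, input.size());
            for (std::size_t i = begin; i < end; ++i)
                output[i] = static_cast<long long>(input[i]) * input[i];
        }
    };

    std::vector<std::thread> pool;
    for (unsigned w = 1; w < workers; ++w) pool.emplace_back(work);
    work();  // the calling thread participates as well
    for (auto& t : pool) t.join();
    return output;
}
```

The same code runs unchanged on 1, 2, or 16 cores. Only the number of workers competing for chunks changes, and no lock is needed because each chunk is written by exactly one thread.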
According to a Rogue Wave survey, performance requirements are the primary reason organizations shift to multi-core hardware:
- 58% of respondents said that an increase in performance has been the reason behind their organizations shifting existing applications to multi-core hardware.
- 92% said that their business applications have high-performance requirements; of those that have high-performance apps, 69% said that their business applications have requirements to support high throughput.
- 82% said that performance requirements for their organizations' apps are on the rise.
Development time is a direct consequence of the fundamental complexity of concurrent software development. Managing this increased complexity takes more time for everyone involved in software development: designers, developers, and testers. Closely related to development time is cost. Concurrent software is more expensive to write not only because it takes longer, but also because it costs more per programmer: it requires specialized knowledge and training.
Software reliability in a concurrent system is an especially thorny problem. When concurrent software runs on a serial (non-parallel) computer, the parallelism is typically emulated using a timesharing technique. When the same software runs on a parallel computer, new possibilities for race conditions arise because instructions can be executed simultaneously. According to Mirman, "these pernicious bugs are notoriously hard to find. You can run regression tests in the lab for days without a failure only to discover that your software crashes in the field with regularity. If you're going to multicore-enable your application, you need a reliable way to find and eliminate race conditions. To avoid race conditions access to shared resources and memory must be synchronized."
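As an illustration of what synchronizing access means in practice, consider a counter shared by several threads. The sketch below uses only the standard C++ library, and the names SharedCounter and count_in_parallel are hypothetical. Without the lock, two threads could read the same old value and both write back old value plus one, losing an update; that is exactly the kind of race that may never show up in the lab.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared resource: a counter updated by several threads.
class SharedCounter {
public:
    void increment() {
        std::lock_guard<std::mutex> guard(mutex_);  // released at scope exit
        ++value_;  // read-modify-write, performed by one thread at a time
    }
    long value() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return value_;
    }

private:
    mutable std::mutex mutex_;
    long value_ = 0;
};

// Starts `threads` threads that each increment the counter `per_thread` times
// and returns the final count once all of them have finished.
long count_in_parallel(int threads, int per_thread) {
    SharedCounter counter;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&counter, per_thread] {
            for (int i = 0; i < per_thread; ++i) counter.increment();
        });
    for (auto& th : pool) th.join();
    return counter.value();
}
```

With the lock in place, count_in_parallel(4, 10000) always yields 40000. The price is that every increment now contends for the same mutex, which is the performance cost of overusing locks.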
Most companies, when confronted with the prospect of migrating their software base to multicore hardware, first try manually managing native threads using thread pools and other techniques, an approach Leiserson calls Do It Yourself (DIY) Multithreading. According to Mirman, this is the least fruitful approach for writing concurrent software because it takes too long, is too expensive, and generally produces less reliable software.
There are solutions to the problem of writing scalable and reliable concurrent software for multicore platforms that don't require a lot of retraining. Many of these are aimed specifically at C++ developers, for whom performance tends to be a more immediate concern and the challenges of concurrent software development are more acute.
Referred to as "concurrency platforms" by Charles Leiserson in his article The Case for a Concurrency Platform and elsewhere, these solutions are either library-based solutions or minor language extensions. They all provide new abstractions for expressing the inherent parallelism in software. The programmer still has to add these abstractions to the code, but the platform solves the problem of dividing up the work efficiently among the cores, in effect load balancing it. All the programmer has to do is point out where the opportunities for parallelism exist.
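The flavor of "pointing out the parallelism" can be sketched with nothing but the standard library. In the example below, std::async stands in for whatever spawn construct a real concurrency platform provides; the function name parallel_sum and the cutoff value are assumptions made for illustration. The programmer states only that the two halves of the range are independent. Note that std::async with the async launch policy starts a thread per call, which is far heavier than the scheduling a concurrency platform would do, so this shows the programming model rather than the performance.

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Sums data[begin, end). The two halves of the range are independent, so the
// left half is marked as something that may run in parallel with the right.
long long parallel_sum(const std::vector<int>& data, std::size_t begin,
                       std::size_t end, std::size_t cutoff = 1 << 15) {
    if (cutoff == 0) cutoff = 1;  // guard against endless splitting
    if (end - begin <= cutoff) {
        // Small ranges are summed serially.
        return std::accumulate(data.begin() + static_cast<std::ptrdiff_t>(begin),
                               data.begin() + static_cast<std::ptrdiff_t>(end),
                               0LL);
    }
    const std::size_t mid = begin + (end - begin) / 2;

    // "Spawn" the left half ...
    auto left = std::async(std::launch::async, [&data, begin, mid, cutoff] {
        return parallel_sum(data, begin, mid, cutoff);
    });
    // ... continue with the right half on the current thread ...
    const long long right = parallel_sum(data, mid, end, cutoff);
    // ... and "sync" by waiting for the spawned half.
    return left.get() + right;
}
```

Nothing in the code mentions how many cores exist or which thread runs which half. That decision is left to the runtime, which is the division of labor a concurrency platform aims for.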
Many of these solutions are based on the principle of "work-stealing". Cilk Arts co-founders Matteo Frigo and Leiserson introduced work-stealing techniques in an award-winning paper, The Implementation of the Cilk-5 Multithreaded Language. This research on work-stealing formed the basis of the Cilk++ product and influenced other projects such as Intel's Threading Building Blocks.
We asked Mirman to tell us a bit about work-stealing in Cilk++. He explained that the Cilk++ Runtime System (RTS) enables a Cilk++ program to dynamically and automatically exploit an arbitrary number of available processor cores. With sufficient parallelism and memory bandwidth, the RTS delivers near-perfect linear speed-up as the number of cores increases. Mirman went on to describe how Cilk++ uses a decentralized scheduling algorithm to distribute work efficiently.
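The general idea behind decentralized, work-stealing scheduling can be shown with a deliberately simplified sketch. This is not the Cilk++ runtime and makes no claim about its internals; the names WorkerQueue and run_with_stealing are hypothetical, and a plain mutex per queue replaces the carefully engineered deques that production schedulers use. Each worker has its own queue and takes its newest task from one end; a worker that runs dry steals the oldest task from the other end of someone else's queue.

```cpp
#include <atomic>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

using Task = std::function<void()>;

// One queue per worker. A mutex keeps the sketch simple.
struct WorkerQueue {
    std::mutex mutex;
    std::deque<Task> tasks;

    bool pop_own(Task& out) {  // the owner takes its newest task (back)
        std::lock_guard<std::mutex> guard(mutex);
        if (tasks.empty()) return false;
        out = std::move(tasks.back());
        tasks.pop_back();
        return true;
    }
    bool steal(Task& out) {  // a thief takes the oldest task (front)
        std::lock_guard<std::mutex> guard(mutex);
        if (tasks.empty()) return false;
        out = std::move(tasks.front());
        tasks.pop_front();
        return true;
    }
};

// Runs all tasks on `workers` threads. Every task starts on queue 0, so the
// other workers obtain work only by stealing. Returns tasks run per worker.
std::vector<std::size_t> run_with_stealing(std::vector<Task> tasks,
                                           std::size_t workers) {
    if (workers == 0) workers = 1;
    std::vector<WorkerQueue> queues(workers);
    std::atomic<std::size_t> remaining{tasks.size()};
    for (auto& t : tasks) queues[0].tasks.push_back(std::move(t));

    std::vector<std::size_t> executed(workers, 0);
    auto worker = [&](std::size_t self) {
        while (remaining.load() > 0) {
            Task task;
            bool found = queues[self].pop_own(task);
            // Own queue is empty: try the other workers' queues in turn.
            for (std::size_t k = 1; !found && k < workers; ++k)
                found = queues[(self + k) % workers].steal(task);
            if (!found) {  // nothing to steal right now
                std::this_thread::yield();
                continue;
            }
            task();
            ++executed[self];  // each worker writes only its own slot
            remaining.fetch_sub(1);
        }
    };

    std::vector<std::thread> pool;
    for (std::size_t w = 1; w < workers; ++w) pool.emplace_back(worker, w);
    worker(0);  // the calling thread acts as worker 0
    for (auto& t : pool) t.join();
    return executed;
}
```

There is no central dispatcher here. Idle workers go looking for work themselves, so the load balances without the busy worker having to do anything. How the tasks end up split among workers varies from run to run, but the per-worker counts always add up to the number of tasks submitted.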
Some of the other concurrency platforms aimed at the C++ audience migrating to multicore hardware are:
- Intel's Threading Building Blocks (TBB) offers a complete approach to expressing parallelism in a C++ program. It is a library that helps you take advantage of multicore processor performance without having to be a threading expert. Threading Building Blocks is not just a threads-replacement library. It represents a higher-level, task-based parallelism that abstracts platform details and threading mechanisms for performance and scalability.
- The OpenMP API supports multi-platform shared-memory parallel programming in C/C++ and Fortran. OpenMP is a portable, scalable model with a simple and flexible interface for developing parallel applications on platforms from the desktop to the supercomputer.
- The RapidMind Development Platform is a framework for expressing data-parallel computations from within C++ and executing them efficiently on multicore processors. (See RapidMind: C++ Meets Multicore by Stefanus Du Toit and Michael McCool.)
- The Task Parallel Library is a managed code library for conveniently expressing potential parallelism in existing sequential code, where the exposed parallel tasks will be run concurrently on all available processors.
- Cilk++ from Cilk Arts simplifies the task of parallelizing code. Specialized keywords are introduced into what would otherwise be a compilable C program. The keywords indicate the functions that can be parallelized and the work units that comprise those functions. The runtime system schedules the work units among the available processing elements, using a "work stealing" paradigm.
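To give a feel for how little code such platforms ask for, here is a small sketch using OpenMP, one of the platforms listed above. The function name dot is hypothetical. A single directive marks the loop as parallelizable, and the reduction clause gives each thread a private partial sum that is combined when the loop ends. The sketch assumes a compiler with OpenMP support enabled (for example, the -fopenmp option in GCC); without it the directive is ignored and the loop simply runs serially.

```cpp
#include <cstddef>
#include <vector>

// Dot product of two vectors, using the shorter length if they differ.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    const long n = static_cast<long>(a.size() < b.size() ? a.size() : b.size());
    double sum = 0.0;

    // The only parallel-specific line: iterations are independent except for
    // the accumulation into `sum`, which OpenMP handles as a reduction.
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i)
        sum += a[static_cast<std::size_t>(i)] * b[static_cast<std::size_t>(i)];

    return sum;
}
```

Because the partial sums may be combined in a different order from run to run, floating-point results can differ in the last bits between serial and parallel execution. This is worth knowing before comparing outputs in a regression test.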
Achieving scalable performance in the face of hardware parallelism, without sacrificing reliability or significantly increasing development costs, is an issue of managing complexity. Concurrency platforms can alleviate this complexity by providing a level of abstraction for expressing parallelism that removes the burden of manual thread management.
Some changes to the software development pipeline will still be needed when developing concurrent software for parallel hardware. Here it is best to take a lead from the high-performance computing industry, which has refined its processes for developing concurrent software over several decades. Here is how the different phases of software development are affected by concurrent software development:
- Design. Because concurrent software is more complex, careful design is more important than ever. Interactions between modules need to be reduced. Synchronization bottlenecks need to be identified and avoided as early as possible, to avoid spending developer time on them later.
- Development. Concurrency platforms are needed so that programmers can express algorithmic parallelism without worrying about the details of distributing work across the available cores.
- Quality assurance. This is where the biggest investments and changes may need to be made for concurrent software. The QA team needs more time and resources to test a far greater number of hardware configurations. Beyond testing, new tools and processes need to be put in place for static and dynamic analysis, as well as for profiling and debugging.
- Support. It can be expected that, despite a company's best efforts, there are going to be more problems after deployment. Field support engineers can help QA by working closely with beta users.
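The design advice about synchronization bottlenecks can be made concrete with a small sketch. It uses standard C++ only, and the names Histogram and byte_histogram are hypothetical. Instead of having every thread lock one shared table for each update, each thread fills a private table for its own slice of the input, and the tables are merged once at the end. The hot loop then contains no synchronization at all.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

using Histogram = std::array<std::size_t, 256>;

// Counts how often each byte value occurs, using `workers` threads.
Histogram byte_histogram(const std::vector<std::uint8_t>& data,
                         std::size_t workers) {
    workers = std::max<std::size_t>(1, workers);
    std::vector<Histogram> partial(workers, Histogram{});  // one per thread
    const std::size_t slice = (data.size() + workers - 1) / workers;

    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            const std::size_t begin = std::min(w * slice, data.size());
            const std::size_t end = std::min(begin + slice, data.size());
            // No lock needed: this thread is the only writer of partial[w].
            for (std::size_t i = begin; i < end; ++i) ++partial[w][data[i]];
        });
    }
    for (auto& t : pool) t.join();

    // Merge step: the only point where results from different threads meet.
    Histogram total{};
    for (const auto& h : partial)
        for (std::size_t b = 0; b < total.size(); ++b) total[b] += h[b];
    return total;
}
```

Decisions like this are much cheaper to make at design time than to retrofit after profiling reveals that threads spend most of their time waiting on a single lock.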
So while there are steps we can take to manage the complexity of concurrent software in a parallel world, it is still going to take some investment in training and new tools. Careful investment in the correct tools and a re-examination of the software pipeline will go a long way toward mitigating these costs.
Thanks to Ilya Mirman and Cilk Arts for sharing data from their interviews, and to Kris Unger for providing feedback and suggestions on the article.