But what if the application can't be scaled by simply increasing the number of independent runs of identical programs? What if you want to double the speed of your single application every 18 to 24 months in accordance with the rights and privileges afforded by the prior single-threaded performance aspect of Moore's Law? And you want to do this without having to become a wizard at parallelizing your program, or even worse, rewriting your application? If you're busy nodding your head up and down, you'll agree with this statement: Exploiting parallelism within a socket should be the responsibility of the compilers, tools, schedulers, operating systems, and so on. It's okay if parallelism across sockets remains "hard," but single socket stuff should just happen automatically.
Unfortunately, the industry does not yet have the technology to take advantage of multicore processors automatically. It is up to the software developer to develop an understanding of parallelism and rewrite the application accordingly.
The Solution: A Collaborative Approach
With these issues in mind, many industry specialists are devising technologies to make the parallelization task easier and more productive. However, technology is not enough. The entire industry must work together to address "the multicore menace." Areas to be addressed include:
- Advancements in programming languages and programming styles from academia and industry.
- Compilers, tools, and runtimes to support these languages and styles. Parallelism across sockets, as well as among cores, will require support (but, as discussed above, this may be "hard to do"). Furthermore, support will be required for optimization of memory bus bandwidth utilization.
- Operating systems and system schedulers (e.g. SLURM) that account for multicore topologies, and take compiler- and application-supplied hints to properly lay out a parallel application on a socket/server/cluster.
- Algorithms that require reconsideration for their applicability to multicore programming. In particular, transactional memory systems incur a high cost for hot elements in a data structure; for example, tree balancing algorithms that frequently update the root of a tree are inherently less scalable than those that don't.
- Business models that recognize multicore realities. Application licensing costs sometimes scale with thread count or with socket count, and incredibly, sometimes at a fractional power of thread count. The industry must quickly converge on a pricing model that is fair and understandable.
- Education of developers and users concerning the technologies being developed for multicore processors.
This list is a good start, but not all-inclusive. It is imperative that the industry weigh the pros and cons of each technology, identify the preferences of developers and users, and address those preferences. The industry must quickly provide effective solutions to parallelize applications, and it is vital that developers and users are involved in defining those solutions.
Along these lines, a number of developments are under way. Some examples:
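The point in the list above about hot elements can be made concrete with a small, deterministic simulation. This is only a sketch under stated assumptions: the function `run_optimistic`, the round-based commit model, and the `width` parameter are hypothetical illustrations, not the behavior of any particular transactional memory system. Each transaction is reduced to the set of locations it writes; within a round, a transaction aborts and retries if it overlaps one that has already committed.

```python
def run_optimistic(write_sets, width):
    """Simulate optimistic transactions; return (rounds, aborts).

    Assumption: up to `width` pending transactions attempt to commit per
    round, and a transaction aborts if its write set overlaps the write
    set of one that already committed in the same round.
    """
    pending = list(write_sets)
    rounds = aborts = 0
    while pending:
        rounds += 1
        batch, pending = pending[:width], pending[width:]
        touched = set()   # locations written by commits in this round
        retry = []
        for ws in batch:
            if touched & ws:
                aborts += 1          # conflict: retry in a later round
                retry.append(ws)
            else:
                touched |= ws        # commit
        pending = retry + pending
    return rounds, aborts


# Every update also rewrites the root (a "hot" element).
hot = [{"root", f"leaf{i}"} for i in range(8)]
# Updates touch only their own leaf.
cool = [{f"leaf{i}"} for i in range(8)]

print(run_optimistic(hot, width=4))   # (8, 18)
print(run_optimistic(cool, width=4))  # (2, 0)
```

With four transactions attempted per round, the workload that always touches the root serializes completely and wastes 18 attempts, while the disjoint workload finishes in two rounds with no aborts.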
- Hewlett-Packard has collected a number of products into a Multicore Toolkit providing developers with "one-stop shopping" for products and information. HP also provides HP-MPI, HP-UPC (a Unified Parallel C implementation), and other products useful for multicore development. Additionally, HP has been involved in improving open source support; for example, enhancing SLURM to support multicore resources.
- Acumem SlowSpotter samples execution of a live application to obtain information about memory bus and cache utilization characteristics. The data is subsequently analyzed, giving the user direct relationships between memory access patterns and analyzed code. Suggestions are provided, based on the results of an analysis, as to how performance improvements can be achieved.
- Cilk Arts is working on Cilk++, which simplifies the task of parallelizing code. A small set of specialized keywords is introduced into what would otherwise be a compilable C++ program. The keywords indicate the functions that can be parallelized and the work units that comprise those functions. The runtime system schedules the work units among the available processing elements, using a "work stealing" paradigm.
- RapidMind represents a different technique for specifying parallelizable code. An application writer specifies a parallelizable kernel function of an algorithm. This kernel is then applied over a set of data, e.g. an array calculation. The runtime system then relies on a Just In Time (JIT) compiler to build the parallelizable code, taking into account the available processing elements; for example, a multicore CPU, a graphics processor, or perhaps some other exotic ASIC. If this technique maps well to an algorithm, it can not only distribute the work over available CPU cores in a transparent manner, but also allow the use of a graphics card as an application accelerator.
- Intel Ct takes a somewhat similar approach, but is more explicitly vectorized. Using a C++ template to describe a Throughput VECtor (TVEC), the programmer defines vectorized operations. The compiler and runtime system divide the work into units and then distribute the work units across the available processing elements. Intel has also announced plans for the Intel Parallel Studio, a set of resources targeted at parallel program development to be integrated with Microsoft's Visual Studio.
- Microsoft, in collaboration with Intel, is working on the Task Parallel Library (TPL) to allow expression of parallelism with normal function calls. The resultant code will automatically utilize the available cores. TPL is another implementation that relies on "work stealing" to dynamically distribute the load among the available threads.
- Rogue Wave, attacking the problem from another angle, provides software that supports a Service Oriented Architecture. Rather than parallelizing within a program, the developer can use Hydra to create a service grid and execute a number of concurrent processes, each handling some element of the overall application.
- eXludus brings a system resource manager to the table, particularly useful for time-sharing systems. Scheduling of processes is dynamically adjusted based on resource utilization, facilitating optimal use of the overall system and associated throughput improvements.
- Visual Numerics provides enhanced mathematics libraries that allow for parallelized operation within the library calls, supporting OpenMP and MPI. The level of parallelism can thus be adjusted using environment variables.
- Transactional memory (TM) has become a popular topic in academic and industry circles. TM improves on traditional mutual exclusion locks by removing the cost of uncontested locks, at the price of greatly increasing the cost of conflicts. TM also has some nice properties that make things easier for large applications composed of independent components.
- The DARPA High Productivity Computing System (HPCS) effort has yielded three proposed language designs: Sun's Fortress, IBM's X10, and Cray's Chapel. These languages go beyond scaling within a single socket, and are intended to allow programmers to be more productive when programming petascale systems (comprising thousands of sockets).
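Several of the entries above rely on "work stealing" to balance load. The following is a minimal single-threaded simulation of the idea, not the scheduler of Cilk++ or TPL. The function name `simulate_work_stealing`, the unit-cost tasks, and the "steal from the longest queue" victim policy are assumptions made for illustration (real runtimes typically choose victims differently and run workers concurrently). Each worker takes the newest task from its own queue; an idle worker steals the oldest task from another worker.

```python
from collections import deque


def simulate_work_stealing(initial_queues):
    """Run unit-cost tasks to completion.

    Returns (ticks, steals, executed), where executed[w] lists the tasks
    run by worker w. Workers act one after another within a tick, which
    is a simplification of truly concurrent execution.
    """
    queues = [deque(q) for q in initial_queues]
    executed = [[] for _ in queues]
    ticks = steals = 0
    while any(queues):
        ticks += 1
        for worker, own in enumerate(queues):
            if own:
                task = own.pop()               # owner takes its newest task
            else:
                # Hypothetical policy: pick the longest queue as the victim.
                victim = max(queues, key=len)
                if not victim:
                    continue                   # nothing left to steal
                task = victim.popleft()        # thief takes the oldest task
                steals += 1
            executed[worker].append(task)
    return ticks, steals, executed


# All eight tasks start on worker 0; three other workers start idle.
ticks, steals, executed = simulate_work_stealing([[1, 2, 3, 4, 5, 6, 7, 8], [], [], []])
print(ticks, steals)  # 2 6
print(executed)       # [[8, 7], [1, 4], [2, 5], [3, 6]]
```

Without stealing, worker 0 would need eight ticks to drain its queue; with stealing, the four workers finish in two.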
The memory latency aspect mentioned earlier may work to the advantage of the multicore designers for a while, because an increase in the number of cores will come with an increase in memory bus bandwidth requirements. A fundamental change in the memory subsystem will eventually be required to keep up with core counts. One such technology might be stacked memory -- a technique that places memory on a die directly above or below the CPU. Each core would have access to the memory directly attached to it, and bandwidth could scale with core counts. Another possible choice is to replace the copper connections to memory with optical links. A move away from the traditional globally clocked bus architecture is also possible, replacing it with, for example, a self-timed communications fabric. A melding of techniques may be necessary to give the right combination of price, power consumption, capacity, and bandwidth.
Although it makes sense in 2008 to assume a modest number of sockets will have access to a coherent shared memory, there may come a time when this is no longer true. For example, even the cores within a single socket may be partitioned into multiple coherency domains. This is game-changing: applications written today to be "multicore aware" quite commonly assume coherent memory access across all cores. When this is no longer true, other techniques (such as message passing) are required to take advantage of all the cores. Thus, even single-socket applications might look more like the cluster-aware distributed memory codes currently popular in the supercomputing community.
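To show what the message-passing style looks like in miniature, here is a sketch in which workers never touch a shared data structure: each receives copies of data through its own inbox and reports a private partial result through an outbox. The names `worker` and `message_passing_sum`, the chunk size, and the use of threads with `queue.Queue` as stand-ins for cores in separate coherency domains are all assumptions for illustration; a real distributed memory code would more likely use something like MPI.

```python
import queue
import threading


def worker(inbox, outbox):
    """Keep a private running total; communicate only through messages."""
    total = 0
    while True:
        msg = inbox.get()
        if msg is None:          # sentinel: no more work for this worker
            break
        total += sum(msg)
    outbox.put(total)


def message_passing_sum(data, n_workers=4, chunk=3):
    """Sum `data` by sending chunks to workers and collecting partial sums."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(inbox, results))
               for inbox in inboxes]
    for t in threads:
        t.start()
    # Deal out chunks round-robin; each slice is a copy, so nothing is shared.
    for i in range(0, len(data), chunk):
        inboxes[(i // chunk) % n_workers].put(data[i:i + chunk])
    for inbox in inboxes:
        inbox.put(None)
    for t in threads:
        t.join()
    return sum(results.get() for _ in range(n_workers))


print(message_passing_sum(list(range(100))))  # 4950
```

The design choice to notice is that correctness does not depend on coherent access to a common array: every piece of state a worker reads or writes is either private or arrived in a message.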
As stated, though multicore techniques have made processor design more practical, they have complicated the lives of application developers, tool writers, and so on. Therefore, the entire computer ecosystem (academic and commercial) must respond to this challenge, and quickly develop a set of technologies to allow developers and users to take advantage of the potential performance improvements afforded by new platforms. Though this is well underway, the pain may last for years while application developers gain an understanding of how to use these emerging technologies. They must adjust to the relatively constant clock rates of newly developed processors and take advantage of the explosion of available computing resources in the form of cores. The hardware is way ahead of everyone else at this point, and catching up will be very, very difficult.