University of Illinois computer science professor Josep Torrellas has demonstrated that easing a programmer's burden in parallel computing does not compromise system performance or increase the complexity of hardware implementation. In "The Bulk Multicore Architecture for Improved Programmability," published in Communications of the ACM, Torrellas details his Bulk Multicore Architecture and calls for a change in the way multicore architectures are designed.
"While the computer science and engineering community has frequently focused on advancing the technology for parallel processing, this time around the stakes are truly high," says Torrellas. "There is no other obvious route to higher computing performance than through parallelism."
Torrellas calls for breakthroughs in all layers of the computing stack, including languages, programming models, compilation and runtime software, programming and debugging tools, and hardware architectures.
Torrellas designed his Bulk Multicore Architecture system specifically to address the complexity of parallel programming. He proposes using the hardware architecture to relieve programmers (and runtime systems) of the burden of managing data sharing in parallel environments, as well as providing new hardware-supported mechanisms to minimize programming errors.
The system eliminates one of the traditional tenets of processor architecture, namely the need to commit instructions in order, providing the architectural state of the processor after each instruction.
In the Bulk Multicore Architecture, the default execution mode of a processor is to commit chunks of instructions at a time. Torrellas explains, "Such a chunked mode of execution and commit is a hardware-only mechanism, invisible to the software running on the processor. Moreover, its purpose is not to parallelize a thread, but to improve programmability and performance." Because the mechanism is invisible to software, it places no restrictions on the programmer's choice of programming model, language, or runtime system.
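The idea of committing a chunk at a time can be illustrated with a small software toy. This is only a sketch of the concept, not the hardware mechanism Torrellas describes: the class `ChunkedCore`, the fixed `chunk_size`, and the dictionary standing in for shared memory are all assumptions made for illustration. Stores are buffered while a chunk is open and become visible to the rest of the system all at once when the chunk commits.

```python
class ChunkedCore:
    """Toy model (hypothetical): stores become visible one chunk at a time."""

    def __init__(self, shared_memory, chunk_size):
        self.shared_memory = shared_memory  # dict standing in for shared state
        self.chunk_size = chunk_size        # stores per chunk (assumed fixed)
        self.pending = {}                   # buffered stores of the open chunk
        self.executed = 0                   # stores executed in the open chunk
        self.commits = 0                    # number of chunks committed so far

    def store(self, address, value):
        # Buffer the store instead of writing shared memory directly.
        self.pending[address] = value
        self.executed += 1
        if self.executed == self.chunk_size:
            self.commit()

    def commit(self):
        # All buffered stores of the chunk become visible together.
        self.shared_memory.update(self.pending)
        self.pending.clear()
        self.executed = 0
        self.commits += 1


memory = {}
core = ChunkedCore(memory, chunk_size=3)
core.store("a", 1)
core.store("b", 2)
visible_mid_chunk = dict(memory)     # snapshot taken while the chunk is open
core.store("c", 3)                   # third store closes and commits the chunk
visible_after_commit = dict(memory)  # snapshot taken after the commit
```

In this toy, an observer of `memory` never sees a partially completed chunk: the snapshot taken mid-chunk is empty, and all three stores appear together after the commit.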
Importantly, Torrellas demonstrates that these programmability advantages do not come at the expense of performance. Furthermore, he explains that Bulk Multicore reduces not only the complexity of parallel programming but also hardware complexity in multiprocessor environments. In fact, the system requires simpler processor hardware than current machines.
The idea of making parallel computing simple is at the core of the Illinois Universal Parallel Computing Research Center's research agenda. UPCRC Illinois is a joint research effort of the Illinois department of computer science and the Coordinated Science Laboratory, with funding from corporate partners Microsoft and Intel. Torrellas and his team plan to expand their work on Bulk Multicore in several ways. The team will be examining the scalability of the chunk commit model, as well as how the model can enable efficient support for new program-development and debugging tools, aggressive autotuners and compilers, and even novel programming models.
On a related note, Torrellas, along with Brian Greskamp and Ulya R. Karpuzcu, recently won the Best Paper Award at the International Symposium on Microarchitecture (MICRO) for their work entitled "The BubbleWrap Many-Core: Popping Cores for Sequential Acceleration," which discussed promising new methods for pushing back the power wall for multicore computing architectures.
In their paper, the team proposes to push back the many-core power wall with a new scheme called Dynamic Voltage Scaling for Aging Management (DVSAM). The team's system manages processor aging to attain higher performance or lower power consumption.
To make use of this new scheme, the team developed BubbleWrap, a novel many-core architecture that identifies the most power-efficient set of cores in a variation-affected chip -- the largest set that can be simultaneously powered on. BubbleWrap then designates those cores as Throughput cores dedicated to parallel-section execution. The rest of the cores are designated as Expendable and are dedicated to accelerating sequential sections. BubbleWrap attains maximum sequential acceleration by sacrificing Expendable cores one at a time, running each at an elevated supply voltage for a significantly shorter service life, until it completely wears out and is discarded.
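The split into Throughput and Expendable cores can be sketched in a few lines. This is a simplified illustration, not the selection procedure from the paper: the function `partition_cores`, the per-core power figures, and the power budget are hypothetical, and the sketch assumes that "most power-efficient" simply means lowest power draw. Under that assumption, taking cores in ascending order of power until the budget is exhausted yields the largest set that can be powered on together.

```python
def partition_cores(core_power_watts, power_budget_watts):
    """Split cores into (throughput, expendable) lists.

    Hypothetical sketch: greedily admit the lowest-power cores until the
    budget would be exceeded; every remaining core is treated as Expendable.
    """
    ranked = sorted(core_power_watts.items(), key=lambda item: item[1])
    throughput = []
    used = 0.0
    for core_id, power in ranked:
        if used + power > power_budget_watts:
            break  # this core and all costlier ones do not fit
        throughput.append(core_id)
        used += power
    # Cores that did not fit in the budget are left for sequential sections.
    expendable = [core_id for core_id, _ in ranked[len(throughput):]]
    return throughput, expendable


# Made-up per-core power figures for a variation-affected four-core chip.
core_power = {"c0": 2.0, "c1": 3.5, "c2": 2.5, "c3": 4.0}
throughput_cores, expendable_cores = partition_cores(core_power, power_budget_watts=8.0)
```

With these made-up numbers, three cores fit within the 8 W budget and the costliest core is left as Expendable.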
The team also demonstrated significant performance increases. In simulated 32-core chips, BubbleWrap provides substantial gains over a plain chip with the same power envelope. On average, the most aggressive design runs fully sequential applications at a 16% higher frequency and fully parallel ones with a 30% higher throughput.