Parallel Impetus to Transform Computer Science
David Bader, Executive Director of High-Performance Computing at Georgia Tech, talks about parallel algorithms, hardware architectures for parallel programming, and the need for teaching parallelism as the norm in computer science.

Dr. Dobb's: Are you on one side or the other with regard to shared memory vs. message passing?
David Bader: I have interest in applications that span both message passing and shared memory. There is conventional wisdom on each that, to me, really depends on how a person first learns to approach parallel programming and also their particular application class.
There are generalizations made, for instance, that message passing is "easier" than shared memory, or vice versa. You have to look at the particular problem.
I design algorithms for certain classes of problems and also do practical parallel programming, as well as teaching at Georgia Tech.
DB: The NSF awards peer-reviewed, highly competitive projects. Throughout my career, from post-doc through faculty member, I've been successful at winning NSF awards.
Some of our work has been on new types of parallel programming for hybrid compute systems, for shared memory, and now for multicore.
DD: What is the definition of "hybrid" in this context?
DB: For instance, we looked at systems which incorporated both message passing and shared memory nodes. We were doing this nearly 15 years ago.
DD: Was this special architecture or just software?
DB: At the time, it was clusters of shared memory nodes with multiple sockets. That was our foray into focusing on shared memory, because we realized it would do well on a hybrid system. You needed to have a good shared memory algorithm and implementation.
So we developed practical parallel algorithms for symmetric multiprocessors (SMP). As we moved to multicore, there's been a natural evolution to shrink that work down on multicore to things like multiple cores on a single socket as well as multiple sockets on a board.
DD: There was not an automatic telescoping down of the model onto multicore?
DB: As we moved from multiple sockets for SMPs to multicore ... we'd had years of experience with multicore. It's not new; IBM and other companies were producing multicore processors long before Intel.
DD: The 390 architecture ...
DB: Yes, and IBM also had Power processors in their supercomputers. We've been using multicore and understanding the issues with cores on a single chip as well as cores across sockets, all within a single system image.
DD: So you worked with multicore RISC when they were building the IBM SP?
DB: That's correct. We've worked with the SP-1, the SP-2 and others. We've had multichip modules, and each module had multiple sockets, each socket had multiple cores. The pSeries servers at IBM today are an evolution of that.
In going from multiple sockets down to multicore there are similarities but also important differences in what is shared in the architecture and how you tune your application.
DD: Obviously you can do better when you don't have to jump so far to communicate.
DB: It goes beyond that. If I look at energy-efficient algorithms, most of our power goes to driving pins on a chip. The more we can keep local, the more energy-efficient our solution. Power is a driving cost and a constraint for a machine room. Having the most power-efficient high-performance computer is a primary concern, as opposed to raw performance alone.
DD: So would such algorithms be implicit in the processor architecture, or be found in the microcode, or the firmware?
DB: No, I'm talking about user-level algorithms.
DD: User-level algorithms tuned to save electrical power in parallel computation?
DB: Right. For instance, since, as I've said, the majority of power goes to driving pins on a chip, we can look at algorithms and their tradeoffs and choose the ones that move data less and thus give a more power-efficient solution.
DD: Under any architecture and for many reasons you do better if you move data less.
DB: The problem we face today is that we haven't given our programmers the tools to understand data movement and locality. Since the 1960s, we have taught students that a compute operation, a multiply, an addition, is expensive, but all memory accesses come for free. That was true through the mid-1990s, but as memory access started to become expensive, we introduced caches, prefetching, and other techniques to maintain the illusion that memory is fast and close.
We're still not teaching students at most schools in most programs that the expensive part is the data movement.
DD: Even cache access is painful?
DB: The cache gives the illusion that we have a very large memory running at processor speed. As our processors have gotten faster, memory has gotten faster, but not at the same rate.
As we go to multicore, we have added pressure on the cache. Beyond two to four cores sharing a cache, we find that they start to stomp on each other.
DD: Cache architecture wasn't really designed for multicore.
DB: Correct, nor was it designed for today, when processor speeds are so much faster than memory speeds. Yet we haven't taught our students and our programmers to exploit locality, which is needed to get good performance from cache.
And then there are applications which don't perform well with cache. For instance, if you have a business application which just jumps around in memory ...
DD: ... doing lookups ...
DB: ... you end up pulling in a full cache line even though you only need a byte or a word.
DD: So the caches slow things down.
DB: Yeah. I sense that I'm preaching to the choir.
DD: You're saying things I always suspected were true but didn't have the impetus to research.
DB: Most people don't understand this. The high-performance computing community has been a niche of the top-notch people in computer science. If you're dealing with their problems, linear algebra and dense matrices, this often isn't an issue, since you stream through memory as you do big dense matrix operations.
DD: In order.
DB: Or even reformulated to have very good cache locality, versus most client applications, business applications, which don't have such a regular memory access pattern.
DD: The sort of thing where you jump around in records randomly.
DB: Or where internal storage is linked lists, pointer-based data structures.
DD: As Tim Mattson asked, "We know how to hire experts. How do we get a million regular programmers around the world to routinely write parallel code?"
DB: In my experience over a number of years, both from my own learning and from the students I've taught, we have to redo our education system.
At Georgia Tech, we created a new department that we call Computational Science and Engineering which handles multicore, parallel programming and high performance computing. We're changing our curriculum to make all algorithms in programming classes parallel at the start. We feel that we need to teach parallelism as a first topic, not an add-on later.