Joining us today is Ryan Schneider, CTO and co-founder of Acceleware, a company that provides tools for using graphics-processing units (GPUs) to address general-purpose computing tasks.
DDJ: Ryan, what is "GPU computing"?
RS: To Acceleware, GPU Computing means using Graphics Processing Units (GPUs) to perform the calculations traditionally performed by Central Processing Units (CPUs) to solve the math required for data processing and scientific simulation ("Technical Computing" and/or a lot of the problems being solved in HPC today). By virtue of what GPUs have been optimized to do historically (many integer and/or floating-point operations, in parallel = fast, high quality frame rates in a game, with amazing new features/lighting/shading/whatever, every year) they are now very well suited -- much more so than alternatives -- for addressing computationally-intensive, parallel problems.
While this new approach to computing has its roots in GPGPU (General Purpose GPU), at least early on as we followed this area, most applications were focused on enhancing the visual image quality (High Dynamic Range Lighting) or putting something cool on the screen (Mark Harris/Ian Buck's 2D Navier-Stokes, fluid-flow model). One of the key things to point out is that, with our definition of GPU Computing, we do not display anything to the screen. The perfect analogue to how we view GPUs and GPU Computing is the math/floating-point/multiplication co-processors that were attached to the 386/486 processors. When certain mathematical functions were required and a co-processor was present, instructions were shipped off to the co-processor, which arrived at the answer much, much sooner than its parent would have. Now, it is possible to migrate entire kernels and large-scale algorithms onto these GPU "co-processors". Again, to reiterate the above point, there is now a class of applications that will "compute" for minutes to hours at a time on GPUs without ever displaying a single thing.
There is one more subtlety that I'd like to point out: In Acceleware's current implementation of GPU Computing, this is really a heterogeneous computing environment. We still work within the boundaries of an operating system and definitely leverage the capabilities of CPUs both as infrastructure as well as in cases where the CPU is the best processor for the job. Some people would like to view GPUs/GPU Computing as some kind of "replacement" for CPUs, but we disagree and view them as complementary. All of our solutions contain multi-core CPUs and "multi-core" GPUs, and we maximize both technologies to achieve commercially-relevant performance. GPUs will likely never be good at running an operating system. On the other hand, there are some computing tasks at which GPUs will always excel compared to CPUs. With this symbiotic nature, we actually look forward to improvements on the CPU side because that can allow us to achieve higher performance from the existing GPUs.
DDJ: What types of applications are suitable for GPU Computing?
RS: There are really a few answers to this. We believe that most problems that are running on HPC clusters (to differentiate from "business"/database/web servers) today could be suitable targets for GPU Computing. Similarly, HPC applications that run for hours to days on multi-core systems and clusters due to either memory bandwidth or computational constraints could benefit from GPUs. The last ingredient is that the application/algorithm needs to be parallel-izable (Sorry, not a word... :)). One of the main difficulties that programmers and ISVs face today is that they were lulled by the Moore's law performance gains of single-threaded computing in the past. With multi-core as the answer to scale performance within a given heat/power envelope, ISVs are now facing massive overhauls of legacy code. In order to be successful, it would probably be "better" if companies started their algorithmic development from scratch, with parallelism in mind... and then they would need to find people who were well-schooled and understood the parallel computing paradigm. We believe these kinds of people/skills are scarce and Acceleware is hoarding them.
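[Editor's sketch] To make the "parallel-izable" requirement concrete, the following minimal Python sketch contrasts work that splits cleanly across workers with work that does not. It is an illustration only and does not come from the interview: the function names, the thread pool, and the sample data are all assumptions chosen for the example, and Python threads stand in for GPU threads purely to show the structure of the problem, not to demonstrate a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def scale_and_offset(x):
    # Each output depends only on its own input, so the elements can be
    # processed in any order, or all at once.
    return 2.0 * x + 1.0


def parallel_map(values, workers=4):
    # Hypothetical helper: hands independent elements to a pool of workers.
    # pool.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scale_and_offset, values))


def running_sum(values):
    # Loop-carried dependency: step i needs the result of step i - 1, so
    # this loop cannot simply be split across workers as written.
    total = 0.0
    out = []
    for v in values:
        total += v
        out.append(total)
    return out


data = [0.0, 1.0, 2.0, 3.0]
mapped = parallel_map(data)   # independent per-element work
sums = running_sum(data)      # sequential as written
```

The first pattern maps naturally onto many cores; the second has to be restructured (for example, as a parallel scan) before extra cores can help, which is the kind of algorithmic rework the answer above alludes to.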
DDJ: Can you give us an example of a unique GPU computing application/system?
RS: "Unique" makes it tricky to answer. Acceleware has been solving problems uniquely using GPUs, in several verticals. In the beginning, we used FPGAs and then GPUs to tackle a finite-difference method for Maxwell's equations, which model electromagnetic propagation. Now, we are the de facto standard for using GPUs to solve electromagnetic problems using finite-difference methods. It doesn't sound that sexy/impacting but the top six cell-phone manufacturers use our technology today to design better cell phones. And Boston Scientific, Philips, etc. are employing our technology to model MRI machines, etc. There were a lot of ISVs in this space before Acceleware, some for more than 15 years; this is why I had trouble with "unique". They were running on traditional Intel/AMD machines and transitioning to multi-core. We provided GPU-aware libraries that allowed them to offer order-of-magnitude performance increases to their end users. It was like night and day.
Acceleware was also the first company to ship a production/commercial product that used 4 GPUs attached to a single workstation -- our Quad or Q30, which was released in June 2007. The first market in which Quad was employed was electromagnetics, although you can think of it as a "4-core" system. Acceleware was also first to launch a commercial product for Kirchhoff Time Migration (the seismic processing market's workhorse) using GPUs.
One of the big things about GPU Computing and the whole accelerator space is that many companies and chip vendors are currently espousing the benefits of this technology. People also show crazy speedups for one part of an algorithm and research groups are definitely engaged. The difference for Acceleware is that we have real industrial/enterprise end-users who are solving real problems, and seeing a speedup even with I/O and the whole workflow running on a heterogeneous system. On top of that, Acceleware has been serving those enterprises for nearly three years.
Again, I am not sure quite how to answer your question... if by unique, you are asking: Is this technology really useful? Then Acceleware can definitely follow up with ALL kinds of examples -- large pharma companies who used to have to wait a week to reconstruct images from their million dollar scanner and now it's all done in half an hour... and so on.
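[Editor's sketch] For readers unfamiliar with the finite-difference approach to Maxwell's equations mentioned in this answer, here is a minimal one-dimensional finite-difference time-domain (FDTD) sketch in Python. It is a textbook-style illustration under stated assumptions, not Acceleware's implementation: the function name, the normalized field units, the Courant number of 0.5, and the Gaussian source parameters are all hypothetical choices made for the example.

```python
import math


def fdtd_1d(n_cells=200, n_steps=100, courant=0.5):
    """Hypothetical 1D FDTD sketch in normalized units (free space)."""
    ez = [0.0] * n_cells   # electric field samples
    hy = [0.0] * n_cells   # magnetic field samples, staggered half a cell
    src = n_cells // 2     # assumed source position: middle of the grid

    for t in range(n_steps):
        # Update E from the spatial difference of H. Within this sweep,
        # each cell k reads only H values, so all cells are independent.
        for k in range(1, n_cells):
            ez[k] += courant * (hy[k - 1] - hy[k])

        # Additive ("soft") Gaussian pulse; parameters chosen arbitrarily.
        ez[src] += math.exp(-0.5 * ((t - 30.0) / 8.0) ** 2)

        # Update H from the spatial difference of E; again independent per cell.
        for k in range(n_cells - 1):
            hy[k] += courant * (ez[k] - ez[k + 1])

    return ez


field = fdtd_1d()
```

Within each sweep, every cell is updated from its neighbors' values in the other field only, so the inner loops have no dependency between cells. That structure is why this class of method is a natural fit for hardware that performs many floating-point operations in parallel.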
DDJ: It seems like a system running, say, an NVIDIA GPU and a many-core CPU could get pretty complicated to program for. If so, what's a developer to do?
RS: Hide. Look for a different job. Day trade on the stock market... Personally, I find that the fetal position helps. :) In all seriousness though, this is a nasty problem. Your question really describes a heterogeneous system, and most ISVs etc. are probably having enough trouble squeezing decent performance/multiples out of a quad-core CPU without adding another, different beast into the mix.
While we're on that train of thought, I'll throw a few more wrenches into the machinery. Accelerators and GPUs also have their own drivers and nuances. (What about backwards-compatibility as the technology progresses?) Speaking from experience, Acceleware has invested huge efforts into maintaining compatibility for our GPU-aware software libraries across evolving device drivers, operating systems (Windows vs. Linux vs. Vista), and computer systems (to the point that we certify very few). As the heterogeneous computing paradigm evolves, everyone is pushing the boundaries of what motherboards, drivers, and operating systems were envisioned to do. To get good performance, you need to get close to the hardware, but in doing so, you get closer to Pandora's box too. Most ISVs stayed away from "hardware" for a reason; Acceleware manages these complexities so that end users can solve their problems -- the "how" is less important to them.
Back to answering your question though: Numerous companies are trying to solve this problem: RapidMind and formerly Peakstream, for example, but I don't recall anything that deals with heterogeneous processors right now. Similarly, there is a lot of focus on individual processors and technologies but what about when you want to move to clustering? There is a ton of value to solving any of HPC's varying needs -- i.e., a lot of people are investing a lot of money to solve problems via computation -- so I think there will always be more than one way to skin the cat.
It's always possible for developers to make use of what Acceleware provides. (So I guess the answer is: Leave the nastiness to someone who is equipped to take it on.) Acceleware abstracts the hardware complexity AS WELL AS the changing hardware landscape from the programmer, such that they don't have to worry about these details. We do this by providing very high-level libraries for very specific verticals. We also reserve the right to add/remove new technologies (be that multi-core, multi-GPUs, Intel's Larrabee, AMD's Fusion, etc.) in the future, below the API.
One final comment: Languages, tools, etc. are mainly for developers to express themselves and their design. Tools definitely help, but I think it still comes back to the developer. The best possible things a developer could do are to either 1) become skilled in "parallelism" at both the algorithmic and implementation levels or 2) partner with companies, like Acceleware, that have those skills. Companies and their developers have more than enough problems to solve without needing or having the luxury to become experts in parallel/HPC programming -- which also seems to be a moving target.
DDJ: Is concurrency now in the mainstream of software development?
RS: Depends on the definitions of "mainstream" and "software development"... Is concurrency probably the biggest problem facing any developer who is measured by "performance" and/or runtime? Yes!
Are the expectations built by CPU vendor marketing teams (Quad-core = 4X) and the legacy expectations of Moore's law + single thread hard to achieve in today's parallel world? Yes!
Is concurrency well understood and/or being dealt with well by ISVs and programmers? Doubtful.
Is it here on the hardware side? Heck yeah!
Current "parallelism" is the tip of a Titanic-sinking iceberg. 16-core, 32-core, 80-core are in the near future... Heterogeneity is a given too...
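[Editor's sketch] One standard way to see why "Quad-core = 4X" is hard to achieve is Amdahl's law, which the interview does not name but which fits the point being made: if only a fraction p of a program's runtime can be parallelized, the speedup on n cores is at most 1 / ((1 - p) + p / n). The Python sketch below evaluates that formula. The function name and the assumed 90% parallel fraction are illustrative choices, not figures from the interview.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup under Amdahl's law."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part runs at the original speed; only the parallel part
    # is divided among the cores.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Assumed example: 90% of the runtime is parallelizable.
for cores in (4, 16, 80):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
```

Under that assumed 90% figure, 4 cores yield roughly a 3.1x speedup, 16 cores about 6.4x, and 80 cores about 9x, so the serial remainder quickly dominates as core counts grow.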
DDJ: Systems like those developed by Acceleware -- GPUs, clusters, etc. -- demand tools and techniques (algorithms) that are based on multithreading, concurrency, parallelization, or whatever term is used. The tools are coming along, right? What about the algorithms, which, it would seem, also need to be parallelized? Thoughts?
RS: Easy problems would have been solved already. Ultimately there is no easy solution to developing parallel systems. The tools and techniques for developing clusters have been around for a while but the algorithmic level remains a challenge. Acceleware is well equipped to take on that challenge. Acceleware has proven its success in algorithms and that's why we work with partners such as CST, Synopsys, Agilent, and SPEAG. We have also hired parallel programming experts in the electromagnetic, seismic and imaging verticals and we are confident that this strategy will deliver the best possible solution for the end user.
DDJ: Is there a web site that readers can go to for more information on these topics?
RS: Of course, there's our company's web site: www.acceleware.com. A few others include: www.gpgpu.org, Intel/AMD's resources for multi-core training, David Kirk's course notes on parallelism, and NVIDIA Tesla.