December 19, 2008
Q&A with a TBB Junkie
Meet Dmitriy V'jukov -- a high-performance developer who is an assiduous observer of Intel Threading Building Blocks
Developing "lock-free, wait-free, obstruction-free, atomic-free synchronization algorithms and data structures" is Dmitriy V'jukov's hobby. Based on his frequent postings, he's a "brown belt" ninja contributor on the Intel Software Network Forum, and one of the site's newest bloggers. Dmitriy is a high-performance computer systems developer who is an assiduous observer of Intel Threading Building Blocks (TBB) and the adoption of parallelism by developers around the world. Go Parallel invited V'jukov to share his opinions about TBB, the Microsoft Task Parallel Library, other tools to support concurrency and the proposed Intel Parallel Studio.
Q: What is your software development background?
A: I hold a master's degree in computer science from Moscow State Technical University. I have five years of experience as a C/C++ software development engineer, focused mainly on client/server systems and network servers. In my spare time, I deal with synchronization algorithms, programming models for multi-core, and multi-threading verification tools.
Q: How long have you been using TBB and for what purpose?
A: I am quite aware of things happening around and inside TBB, but frankly I have not used TBB "in production." I have studied the user interfaces and implementation of TBB in detail. I've developed a library for unit-testing/formal verification of synchronization algorithms (or small pieces of multi-threaded code). It's called Relacy Race Detector. I have had some preliminary conversations with TBB developers with regard to its usage in the development of TBB. I am going to provide a free license for TBB developers. I had an analogous conversation with IBM's Paul McKenney (he works on high-end Intel platforms and Linux technology) with regard to its usage in the development of the Linux kernel. But I'm not sure whether Relacy Race Detector itself will be interesting to the general public, because it's targeted mostly at experts who develop very low-level and complicated algorithms.
Q: What difficulties do you see developers having with TBB?
A: In forums and discussion groups I see that developers face three kinds of problems with TBB algorithms:
While this advice is applicable to a single-threaded environment too, in a task-based model it's harder to tell whether, for example, access will be in stride or not. Once again, higher-level abstractions are less prone to the problem.
Q: How much are these problems with parallel programming vs. problems with TBB in particular?
A: These problems are related to parallel programming in general, and they apply equally to all other parallel programming libraries: OpenMP, Task Parallel Library, Cilk, etc.
Q: When you discuss granularity size, are you talking about the general parallel programming issue of task size, or referring to the problematic TBB 1.0 requirement to pick an explicit grain size (which was fixed in TBB 2.0 with the auto_partitioner)?
A: I am talking about the general parallel programming issue of task size.
Q: What's your biggest challenge in concurrent programming?
A: My biggest challenge in concurrent programming is debugging. Things like non-determinism, asynchrony, the absence of a total order of events and the distribution of state sometimes put the debugging of concurrent systems beyond the human brain's strength. Every "little" error in source code can take several days or even weeks to fix. And that's the best-case scenario. In the worst case, you don't know that there is an error until you get a call from an enraged customer. And the customer can't say under what circumstances it happens. This is a field where I am looking forward to strong tool support of all kinds: static analysis, dynamic analysis, post-mortem analysis, advanced IDE support. I have developed some in-house tools for my own purposes. But not every developer is able to build a comprehensive toolset by hand.
Q: Have you used Intel Thread Checker, VTune or Thread Profiler?
A: Yes, I've used them to a certain extent. They're invaluable tools in one's toolbox. It's difficult to add anything else.
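The general task-size issue raised in the granularity answer above can be illustrated with a minimal sketch in standard C++ (deliberately not TBB). The function name `chunked_sum` and its `grain` parameter are hypothetical, introduced only for this illustration, and the sketch launches one `std::async` task per chunk, which is far simpler than a work-stealing scheduler. The point is only the trade-off: a very small `grain` creates many tiny tasks whose overhead dominates, while a very large `grain` leaves too few tasks to keep all cores busy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical helper: sums `data` by splitting it into chunks of at most
// `grain` elements and summing each chunk in its own asynchronous task.
long long chunked_sum(const std::vector<int>& data, std::size_t grain) {
    assert(grain > 0 && "grain size must be positive");

    std::vector<std::future<long long>> parts;
    for (std::size_t begin = 0; begin < data.size(); begin += grain) {
        const std::size_t end = std::min(begin + grain, data.size());
        // One task per chunk: the smaller the grain, the more tasks are created.
        parts.push_back(std::async(std::launch::async, [&data, begin, end] {
            return std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
        }));
    }

    // Combine the partial sums once every task has finished.
    long long total = 0;
    for (auto& part : parts) {
        total += part.get();
    }
    return total;
}
```

Under these assumptions, calling `chunked_sum(values, 250)` on a 1,000-element vector creates four tasks, whereas a grain of 1 would create a thousand of them for the same result.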
Q: Have you used the Intel parallelizing compiler or any of the other tools on Intel's WhatIf.intel.com site?
A: There are a number of really interesting projects on the WhatIf site. In particular, I have evaluated the C++ STM Compiler. While I don't think that transactional memory implemented purely in software is viable, because of high overheads and high centralization, transactional memory itself is definitely a very promising programming model.
Q: You have been following the C++0x standard. What would you like to see in terms of parallelism there? Do you think the implementations are better or worse than those in TBB?
A: Everything I want to see in C++0x is already there. The ISO committee has carried out a tremendous amount of work with respect to multi-threading and parallelism support. There is a multi-threaded memory model which defines atomicity, visibility and ordering guarantees in the presence of multiple threads, as well as an atomics library and basic primitives (thread, mutex, etc.).
Q: Are you aware of the Microsoft Task Parallel Library? Have you compared the algorithms there to those of TBB?
A: Yes, I am aware of the TPL (as well as the Microsoft Parallel Pattern Library, Cilk++, Java Fork/Join). They are all basically the same in the main (apart from the fact that TPL and Java Fork/Join are for managed code, of course): the same task-based programming, the same work-stealing scheduler; they even use the same names for concepts. TPL includes something called CDS (Concurrent Data Structures), which includes things like ConcurrentStack, ConcurrentQueue, SpinLock, WriteOnce, etc. TBB also includes some similar things: tbb::concurrent_queue, tbb::concurrent_vector, tbb::spin_mutex. So I can't think of any crucial difference.
Q: Is it possible that the performance of one implementation vs. another might be significantly different, or that one might scale better than another?
A: [I assume] scalability must be the same, i.e., linear, provided that the user doesn't make any mistakes. Frankly, I haven't measured the performance of the TPL. But I think that it must be roughly equal to that of TBB. I see no fundamental reasons for a substantial performance difference. "Quality of implementation" can result in some limited performance differences, but it's too early to talk about the quality of implementation of TPL because it is still in CTP (Community Technology Preview) status.
Q: Have you seen any improvement among other developers in comprehension or application of thread-safe coding principles, or has it stayed the same over the past year or two?
A: There is definitely some improvement, but it is not as big as one would like to see. I think that the primary work on education is still ahead of us. It's not possible to educate the whole community in two years, although there are certainly early adopters.
Q: Do you have any sense of the relative ease of understanding of threading for managed code/byte code vs. threading for C++? Will C++ developers grasp concurrency better than Java developers?
A: That's a difficult question. It reminds me of those holy-war questions like "Who creates better applications, C++ or Java developers?" But I will try to answer from my point of view. First of all, it depends more on the particular developer. And if one says that Java developers will grasp multi-threading better, this doesn't mean that every Java developer will grasp multi-threading better than every C++ developer. I hope everybody understands this, but this is something I have to say anyway. I think we can consider three levels of multi-threading.
To summarize, I think that basic and lower-intermediate multi-threading will be grasped equally by C++ and Java developers. And in the field of upper-intermediate and expert multi-threading there will be a prevalence of native developers.
Q: What most interests you about the Parallel Studio? Have you seen anything like it in the market?
A: I'm most interested in Parallel Inspector. As I said, debugging, verification and localization of errors are the hardest part of multi-threading for me. I look forward to what Intel will offer us. Also, I would not mind a better performance analyzer for multi-core (i.e., Parallel Amplifier), one that will be much more aware of multi-threading/multi-core than "plain old profilers" and that will be able to detect things like false sharing and point directly to the problematic variables in source code. Have I seen anything like it in the market? It's difficult to say, because I don't know what exactly Parallel Studio will offer. As for Parallel Advisor, I can't remember anything similar. Though, as I said, it will depend on what Parallel Advisor will offer us. As for Parallel Composer, well, it seems that Cilk/Cilk++ Composer (which was created back in 1994) is very close to Parallel Composer. As for Parallel Inspector, Microsoft Visual Studio offers some basic support for debugging multi-threaded code. Also, some products exist for static or dynamic verification of multi-threaded code (Valgrind, CHESS, etc.). As for Parallel Amplifier, I think it's a kind of profiler, and we have worked with profilers for decades. Everything will depend on what exactly Intel will offer, how intelligent it will be, how integrated it will be, and how aware of multi-core it will be.
Q: Developers can now sign up for the Parallel Studio beta program. Have you done that?
A: Yes, of course!
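The false sharing mentioned in the Parallel Amplifier answer above arises when threads write to distinct variables that happen to sit on the same cache line. One common mitigation is to align and pad per-thread data, sketched here in standard C++. The names `PaddedCounter`, `kCacheLine`, `kThreads` and `count_with_padding` are hypothetical, and the 64-byte line size is an assumption that holds on many x86 processors but should be checked for the target hardware.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines; adjust for the target hardware.
constexpr std::size_t kCacheLine = 64;
constexpr unsigned kThreads = 4;

// Each counter is aligned to, and therefore padded out to, its own cache
// line, so writes by one thread do not invalidate a neighbour's line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
};
static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each counter should occupy exactly one assumed cache line");

// Hypothetical helper: every thread increments only its own counter, then
// the per-thread counts are added up after all threads have joined.
long count_with_padding(long increments_per_thread) {
    assert(increments_per_thread >= 0);

    PaddedCounter counters[kThreads];
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < kThreads; ++t) {
        workers.emplace_back([&counters, t, increments_per_thread] {
            for (long i = 0; i < increments_per_thread; ++i) {
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }

    long total = 0;
    for (const auto& counter : counters) {
        total += counter.value.load(std::memory_order_relaxed);
    }
    return total;
}
```

Removing the `alignas` would leave the result unchanged but pack the four counters into one or two cache lines, which is the kind of layout a multi-core-aware profiler would be expected to flag.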