December 19, 2008
Q&A with a TBB Junkie

Meet Dmitriy V'jukov -- a high-performance developer who is an assiduous observer of Intel Threading Building Blocks

Developing "lock-free, wait-free, obstruction-free, atomic-free synchronization algorithms and data structures" is Dmitriy V'jukov's hobby. Based on his frequent postings, he's a "brown belt" ninja contributor on the Intel Software Network Forum, and one of the site's newest bloggers. Dmitriy is a high-performance computer systems developer who is an assiduous observer of Intel Threading Building Blocks (TBB) and the adoption of parallelism by developers around the world. Go Parallel invited V'jukov to share his opinions about TBB, the Microsoft Task Parallel Library, other tools to support concurrency and the proposed Intel Parallel Studio.

Q: What is your software development background?

A: I hold a master's degree in computer science from Moscow State Technical University. I have five years of experience as a C/C++ software development engineer, focused mainly on client/server systems and network servers. In my spare time, I work on synchronization algorithms, programming models for multi-core, and verification tools for multi-threaded code.

Q: How long have you been using TBB and for what purpose?

A: I am well aware of what is happening around and inside TBB, but frankly I have not used TBB "in production." I have studied the user interfaces and implementation of TBB in detail. I've also developed a library for unit-testing/formal verification of synchronization algorithms (or small pieces of multi-threaded code). It's called Relacy Race Detector.

I have had some preliminary conversations with TBB developers about using Relacy Race Detector in the development of TBB. I am going to provide a free license for TBB developers. I had an analogous conversation with IBM's Paul McKenney (he works on high-end Intel platforms and Linux technology) about using it in the development of the Linux kernel.

But I'm not sure whether Relacy Race Detector itself will be interesting to the general public, because it's targeted mostly at experts who develop very low-level and complicated algorithms.

Q: What difficulties do you see developers having with TBB?

A: In forums and discussion groups I see that developers face three kinds of problems with TBB algorithms:

  • Task granularity size. In order to achieve good performance, task granularity must be carefully chosen. Tasks that are too fine-grained will lead to high overheads. And tasks that are too coarse-grained will lead to bad scalability due to lack of "parallel slack."
  • Excessive sharing. In order to achieve good scalability, each thread must work mainly with private data. Having each thread update some global variable (or variables) on each iteration will turn scalability from linear positive to super-linear negative. Task-based programming is especially prone to this problem. Higher-level abstractions (tbb::parallel_reduce, tbb::parallel_scan) incorporate more intelligence to overcome it. This strongly suggests that developers should use the highest-level abstractions possible.
  • Locality. Though the modern computer memory subsystem is still called RAM (random-access memory), it is now a complicated, distributed, heterogeneous, hierarchical system. Fortunately, there are very simple tips on how to use it efficiently: first, prefer stride access; second, use all the data loaded into the cache; and third, reuse the data in the cache while it's still there.

While this advice applies to a single-threaded environment too, in a task-based model it's harder to tell whether, for example, access will be in stride or not. Once again, higher-level abstractions are less prone to the problem.
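The "excessive sharing" point can be illustrated with a minimal sketch in standard C++ (deliberately not TBB). The function names `sum_shared` and `sum_private` are hypothetical and chosen only for this illustration. The first version has every thread update one shared atomic on every iteration; the second lets each thread accumulate into a private local variable over a contiguous chunk and publish a single result at the end, which is roughly the idea a reduction abstraction packages up.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Anti-pattern: every iteration of every thread updates the same shared
// variable, so all threads contend for a single cache line.
std::uint64_t sum_shared(const std::vector<std::uint32_t>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::atomic<std::uint64_t> total{0};
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                total.fetch_add(data[i], std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}

// Private accumulation: each thread walks its own contiguous chunk, sums
// into a local variable, and writes to shared storage exactly once.
std::uint64_t sum_private(const std::vector<std::uint32_t>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<std::uint64_t> partial(num_threads, 0);
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            std::uint64_t local = 0;  // thread-private accumulator
            for (std::size_t i = begin; i < end; ++i)
                local += data[i];
            partial[t] = local;  // single shared write per thread
        });
    }
    for (auto& w : workers) w.join();
    // Combine the per-thread results serially after all threads finish.
    return std::accumulate(partial.begin(), partial.end(), std::uint64_t{0});
}
```

Both functions return the same sum; only the second keeps the hot loop free of shared writes. How large the performance gap is depends on the hardware and the thread count, so it should be measured rather than assumed.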

Q: How much are these problems with parallel programming vs. problems with TBB in particular?

A: These problems are related to parallel programming in general, and in particular to all other parallel programming libraries: OpenMP, Task Parallel Library, Cilk, etc.

Q: When you discuss granularity size, are you talking about the general parallel programming issue of task size, or referring to the problematic TBB 1.0 requirement to pick an explicit grain size (which was fixed in TBB 2.0 with the auto_partitioner)?

A: I am talking about the general parallel programming issue of task size.
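As a rough illustration of the task-size trade-off, here is a sketch in standard C++ that splits a range recursively until a piece is no larger than a grain size and then processes that piece serially. The names `parallel_sum` and `leaf_tasks` are hypothetical, and `std::async` is used only to keep the example self-contained: it starts a thread per split rather than scheduling tasks on a work-stealing pool, so it is not how TBB executes tasks.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Sum v[begin, end). Ranges larger than `grain` are split in half; the left
// half runs asynchronously while the current thread handles the right half.
long long parallel_sum(const std::vector<int>& v, std::size_t begin,
                       std::size_t end, std::size_t grain) {
    assert(grain > 0 && begin <= end && end <= v.size());
    if (end - begin <= grain)  // small enough: do the work serially
        return std::accumulate(v.begin() + begin, v.begin() + end, 0LL);
    const std::size_t mid = begin + (end - begin) / 2;
    auto left = std::async(std::launch::async, [&v, begin, mid, grain] {
        return parallel_sum(v, begin, mid, grain);
    });
    const long long right = parallel_sum(v, mid, end, grain);
    return left.get() + right;
}

// Number of serial leaf pieces the splitting rule above produces for a
// range of n elements. More leaves mean more parallel slack, but also
// more per-task overhead.
std::size_t leaf_tasks(std::size_t n, std::size_t grain) {
    assert(grain > 0);
    if (n <= grain) return 1;
    return leaf_tasks(n / 2, grain) + leaf_tasks(n - n / 2, grain);
}
```

With 65,536 elements, a grain of 4,096 yields 16 leaf pieces, while a grain of 1 would yield 65,536 pieces whose bookkeeping would dwarf the additions themselves. A grain equal to the whole range yields a single piece and no parallelism at all.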

Q: What's your biggest challenge in concurrent programming?

A: My biggest challenge in concurrent programming is debugging. Things like non-determinism, asynchronism, the absence of a total order of events and the distribution of state sometimes put debugging of concurrent systems beyond the capacity of the human brain. Every "little" error in source code can take several days or even weeks to fix. And that's the best-case scenario. In the worst case, you don't know that there is an error until you get a call from an enraged customer. And the customer can't say under what circumstances it happens.

This is a field where I am looking forward to strong tool support of all kinds: static analysis, dynamic analysis, post-mortem analysis, advanced IDE support. I have developed some in-house tools for my own purposes. But not every developer is able to build a comprehensive toolset on their own.

Q: Have you used Intel Thread Checker, VTune or Thread Profiler?

A: Yes, I've used them to a certain extent. They're invaluable tools in one's toolbox. It's difficult to add anything else.

Q: Have you used the Intel parallelizing compiler or any of the other tools on Intel's WhatIf.intel.com site?

A: There are a number of really interesting projects on the WhatIf site. Particularly, I have evaluated the C++ STM Compiler. While I don't think that transactional memory implemented purely in software is viable, because of high overheads and high centralization, transactional memory itself is definitely a very promising programming model.

Q: You have been following the C++0x standard. What would you like to see in terms of parallelism there? Do you think the implementations are better or worse than those in TBB?

A: Everything I want to see in C++0x is already there. The ISO committee has carried out a tremendous amount of work with respect to multi-threading and parallelism support. There is a multi-threaded memory model which defines atomicity, visibility and ordering guarantees in the presence of multiple threads, as well as an atomics library and basic primitives (thread, mutex etc).
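To make the visibility and ordering guarantees concrete, here is a minimal sketch using the facilities as they were later standardized in C++11 (`std::thread`, `std::atomic`, `std::memory_order`). The function name `publish_and_read` is hypothetical. A producer writes ordinary data and then sets a flag with release ordering; a consumer that observes the flag with acquire ordering is guaranteed to see the data as well.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Pass `value` from one thread to another through a plain int that is
// published by a release store and consumed after an acquire load.
int publish_and_read(int value) {
    int payload = 0;                 // ordinary, non-atomic data
    std::atomic<bool> ready{false};  // flag that publishes the payload
    int seen = -1;

    std::thread producer([&] {
        payload = value;                               // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. publish it
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire))  // 3. wait for the flag
            std::this_thread::yield();
        // 4. The acquire load that read `true` synchronizes with the release
        //    store, so the write in step 1 is visible here.
        assert(payload == value);
        seen = payload;
    });

    producer.join();
    consumer.join();
    return seen;
}
```

If both operations on the flag used `std::memory_order_relaxed` instead, the read of `payload` would be a data race and the guarantee would no longer hold.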

Q: Are you aware of the Microsoft Task Parallel Library? Have you compared the algorithms there to those of TBB?

A: Yes, I am aware of the TPL (as well as the Microsoft Parallel Pattern Library, Cilk++ and Java Fork/Join). They are all basically the same in their essentials (apart from the fact that TPL and Java Fork/Join are for managed code, of course): the same task-based programming, the same work-stealing scheduler. They even use the same names for concepts.

TPL includes something called CDS (Concurrent Data Structures), which includes things like ConcurrentStack, ConcurrentQueue, SpinLock, WriteOnce, etc. TBB also includes some similar things: tbb::concurrent_queue, tbb::concurrent_vector, tbb::spin_mutex.

So I can't think of any crucial difference.

Q: Is it possible that the performance of one implementation vs. another might be significantly different, or that one might scale better than another?

A: [I assume] scalability must be the same, i.e., linear, provided that the user doesn't make any mistakes. Frankly, I haven't measured the performance of the TPL. But I think that it must be roughly equal to that of TBB. I see no fundamental reasons for a substantial performance difference. "Quality of implementation" can result in some limited performance differences, but it's too early to talk about the quality of implementation of TPL because it is still in CTP (Community Technology Preview) status.

Q: Have you seen any improvement among other developers in comprehension or application of thread-safe coding principles, or has it stayed the same over the past year or two?

A: There is definitely some improvement, but it is not as big as one would like to see. I think that the primary work on education is still ahead of us. It's not possible to educate the whole community in two years, although there are certainly early adopters.

Q: Do you have any sense of the relative ease of understanding of threading for managed code/byte code vs. threading for C++? Will C++ developers grasp concurrency better than Java developers?

A: That's a difficult question. It reminds me of those holy-war questions like "Who creates better applications, C++ or Java developers?" But I will try to answer from my point of view.

First of all, it depends more on the particular developer. And if one says that Java developers will grasp multi-threading better, this doesn't mean that every Java developer will grasp multi-threading better than every C++ developer. I hope everybody understands this, but this is something I have to say anyway.

I think we can consider three levels of multi-threading.

  • Basic multi-threading: things like "Here is how you can start a thread" or "You must protect the data accessed by more than one thread simultaneously with mutexes". On this level, it's irrelevant what language one uses. The concepts are really very simple. Yes, Java has built-in support for threads and mutexes, but I'm aware of no case where the absence of that support was a show-stopper. And there are libraries like Boost and ACE (btw, ACE appeared long before Java and was way more portable than Java).
  • Intermediate multi-threading: things like how to apply hierarchical or fine-grained locking, how to efficiently use OpenMP or java.util.concurrent, how to detect and eliminate false sharing. Well... it seems that this "level" is also mostly "language-independent", although I think that C++ developers have more means and a greater proclivity for such low-level things.
  • Expert multi-threading: things like partitioning of data between processors, algorithms based on relaxed atomic operations, extremely precise data layouts based on cache-line size etc. This level will be occupied mostly by native C/C++ developers, just because Java developers don't have the means to work on this level. Yet this level is the only way to conquer many-core for some types of applications, and it accounts for performance differences of several orders of magnitude.
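The intermediate and expert items on false sharing, relaxed atomics and cache-line-aware layout can be sketched in a few lines of standard C++. Everything named here is an assumption made for illustration: the `PaddedCounter` type, the `count_events` function, the `kMaxThreads` limit, and above all the 64-byte cache-line size, which is common on x86 but is not guaranteed on other hardware.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size
constexpr unsigned kMaxThreads = 8;     // arbitrary limit for this sketch

// Aligning each counter to a cache-line boundary also pads it to a full
// line, so two threads never write to the same line (no false sharing).
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};
static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each counter should occupy exactly one cache line");

// Each thread increments only its own counter, using relaxed atomics
// because no other data is being published through the counter.
std::uint64_t count_events(unsigned num_threads, std::uint64_t per_thread) {
    assert(num_threads > 0 && num_threads <= kMaxThreads);
    PaddedCounter counters[kMaxThreads];  // stack array keeps the alignment
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counters, t, per_thread] {
            for (std::uint64_t i = 0; i < per_thread; ++i)
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();  // join makes all increments visible

    std::uint64_t total = 0;
    for (unsigned t = 0; t < num_threads; ++t)
        total += counters[t].value.load(std::memory_order_relaxed);
    return total;
}
```

Removing the `alignas` would leave the result unchanged but would pack the counters into adjacent bytes, where increments from different threads would keep invalidating each other's cache lines. That is the kind of layout detail this level of multi-threading is concerned with.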

To summarize, I think that basic and lower-intermediate multi-threading will be grasped equally by C++ and Java developers. And in the field of upper-intermediate and expert multi-threading there will be a prevalence of native developers.

Q: What most interests you about the Parallel Studio? Have you seen anything like it in the market?

A: I'm most interested in Parallel Inspector. As I said, debugging, verification and localization of errors is the hardest part of multi-threading for me. I look forward to seeing what Intel will offer us. I would also welcome a better performance analyzer for multi-core (i.e. Parallel Amplifier), one that is much more aware of multi-threading/multi-core than "plain old profilers" and is able to detect things like false sharing and point directly to the problematic variables in source code.

Have I seen anything like it in the market? It's difficult to say, because I don't know what exactly Parallel Studio will offer. As for Parallel Advisor, I can't remember anything similar. Though, as I said, it will depend on what Parallel Advisor will offer to us.

As for Parallel Composer, well, it seems that Cilk/Cilk++ Composer (which was created back in 1994) is very close to Parallel Composer.

As for Parallel Inspector, Microsoft Visual Studio offers some basic support for debugging multi-threaded code. Also, some products exist for static or dynamic verification of multi-threaded code (Valgrind, CHESS, etc.).

As for Parallel Amplifier, I think it's a kind of profiler, and we have worked with profilers for decades.

Everything will depend on what exactly Intel will offer, how intelligent it will be, how integrated it will be, and how aware of multi-core it will be.

Q: Developers can now sign up for the Parallel Studio beta program. Have you done that?

A: Yes, of course!
