James Reinders: Today I have Steve Teixeira, Product Unit Manager for Parallel Developer Tools at Microsoft, and we thought we'd talk a little bit about parallelism. Steve, we're both very enthusiastic about parallelism and encouraging developers to take advantage of the concurrency in multicore. What thoughts do you have now that we have Visual Studio 2010 out there for developers to use?
Steve Teixeira: I'm really excited about it. We've been working on this for a long time. We had a team squirreled away, led by David Callahan as kind of the architect, trying to figure out this parallelism thing and trying to figure out how we can get this shipping as a mainstream part of Visual Studio.
I'll be honest, even a couple of years ago we had people who were thinking, Yeah, it's kind of this propeller-head, tinfoil-hat, science-fiction thing that they're working on. But over time, as we got out in front of developers, talked to them, and explained the hardware reality -- sequential performance near plateauing, so you've got to go parallel to get the most performance out of your software -- that message really resonated with developers, who were genuinely concerned about it, and we started to see pull from customers. Then whatever nonbelievers we had internally started looking at the customers and going, Wow, this is something people are really excited about in the next version of Visual Studio. So it's been a long time coming, and -- other than, say, threads or thread pools or some really coarse-grained version of parallel computing -- it's the first version of Visual Studio that really has all this stuff baked in as a first-class citizen.
Reinders: I think the things that are baked in, the things that we can present with Visual Studio and with Parallel Studio, look pretty simple, and some people might wonder what took so long. But you and I both know, working with the teams, how much complexity is underneath. You want that complexity underneath so that it really works, but you want to present an easy interface. One of those complexities that I'm excited about, that we've talked about, is the Concurrency Runtime. Can you tell us a little bit about the Concurrency Runtime and how you think about it?
Teixeira: If we had just shipped the Concurrency Runtime, we probably could have shipped a year ago. But it's building all that stuff that makes it easy, on top of the hard stuff, that makes it take a while to surface a really reasonable programming model.
So the Concurrency Runtime, first of all, is a runtime that includes a couple of pieces. One piece abstracts away the details of the underlying hardware, so developers don't necessarily have to be concerned about the bare metal. Then there is a scheduler -- or any number of schedulers -- that can live on top of that, and those two components together are called the Concurrency Runtime. Now, on top of that we can build a variety of programming models: at Microsoft, of course, we did PPL and we did asynchronous agents; Intel is building TBB, and perhaps OpenMP, on top of the Concurrency Runtime; and we can invite other companies that have their favorite languages to also build on top of it.
But the core idea is this notion of a task abstraction. The idea is to get developers out of the business of thinking in threads. I like to think of it this way: threads force the developer to think like a computer. You have to think in terms of execution flow on the hardware when you're thinking about threads: this is how my program is going to run through this particular piece of code. What tasks allow you to do is think in bite-size chunks of work in your program, so you can literally divide your program up into bite-size chunks of work and schedule those chunks as tasks.
Reinders: The more the better.
Teixeira: Exactly, the more the better. We want to lower the cost of these tasks, because right now threads are really expensive to create, and you don't want to just spin up any number of threads. You have to be pretty smart about it: I've got this many procs, the system is this busy, so I want to spin up this many threads.
The idea with tasks is, within reason, you should just be able to create these things like candy and spin them up and use them and then they go away. We're thinking it's close to an order of magnitude less expensive to create a task versus creating a thread.
Reinders: An order of magnitude?
Teixeira: Only about an order of magnitude. I haven't seen the latest timings, but I think you need a chunk of work on the order of maybe 10,000 cycles to amortize the cost of creating the task across all of that work. That's actually a fairly small chunk of work at computer speeds.
So as long as you can divide up your work in that way -- and you can do that manually, or through something like a Parallel_for loop that says: take this For loop, spin it off, and create a bunch of tasks behind the scenes -- the Concurrency Runtime under the hood will do a whole bunch of magic in terms of scheduling to make sure those tasks are executed in the most efficient way possible on the hardware. As a developer, it's nice to know that that's happening for me, but I don't really have to care exactly what's happening, except that it's going fast and it's being efficient.
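The pattern described here -- chop a loop's iteration space into chunks and hand each chunk to a pool as a task -- can be sketched in a few lines. This is an illustrative Python sketch, not the actual PPL API; the names `parallel_for`, `run_chunk`, and `chunks` are ours:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_for(start, stop, body, chunks=8):
    """Run body(i) for every i in [start, stop), split across `chunks` tasks."""
    def run_chunk(lo, hi):
        for i in range(lo, hi):
            body(i)
    n = stop - start
    bounds = [(start + k * n // chunks, start + (k + 1) * n // chunks)
              for k in range(chunks)]
    with ThreadPoolExecutor() as pool:
        for f in [pool.submit(run_chunk, lo, hi) for lo, hi in bounds]:
            f.result()  # wait, and surface any exception raised inside a task

# Each iteration writes its own slot, so the tasks never touch shared state.
squares = [0] * 100
parallel_for(0, 100, lambda i: squares.__setitem__(i, i * i))
```

The developer only says "run this body over this range"; how many worker threads actually execute the chunks is the pool's business, which is the division of labor the runtime is meant to provide.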
Reinders: One of the messages I hear you saying, which I hope is getting understood by every developer because it's really important, is this idea of programming in tasks instead of threads -- the thread mindset being: I've got a quad core, let's divide up into four pieces. That's just not the right thing for the applications we're trying to build. I like to say that if an application is asking how many cores there are, something is wrong. It should just be creating lots of tasks. As for the Concurrency Runtime, I like to think about what the world looks like in the future if we don't have it. What problem is it solving? What disaster are we avoiding? It's something we often call oversubscription, right?
Teixeira: That's right. To give you some insight into how my twisted mind works, whenever this topic comes up, I think about that Star Trek episode of Kirk versus the Gorn, that lizard man, where they're fighting.
Reinders: He makes saltpeter.
Teixeira: That's right. Thank you. So Kirk and the Gorn are fighting, and that's my mental image for what happens if you don't have something like the Concurrency Runtime: you've got all these applications on the system, all asking, How many cores are on the system? And each one assumes it owns all of them. Then you end up with this nasty oversubscription problem, where the amount of work you're trying to do on the system is disproportionate to the amount of execution resources available.
So there are two problems there. One is: don't do that. That's really bad. I think a task-based programming model helps quite a bit, but it doesn't alleviate the responsibility of developers, and even of people running software, to understand that there's a bunch of other stuff still happening on the system. Tasks help, but they don't necessarily get rid of the problem of oversubscription. You could still have that kind of problem; it's just that you'll probably spend less of the oversubscribed effort spinning up and shutting down threads and that sort of thing.
Reinders: I know some developers will use the Concurrency Runtime directly, but I think the most important use is from Intel's OpenMP, Microsoft's TPL and PPL, and Intel Threading Building Blocks. Then we've got things built on top of these that people don't think about: our Math Kernel Library uses Threading Building Blocks, which now will use the Concurrency Runtime. Before that, someone makes a math library call while they also happen to be using TBB or Microsoft's OpenMP. Every one of these models creates a thread pool, or thinks about it, and it's not very hard to write a program that uses a couple of models. Okay, on a dual or quad core, you've only got a thread pool of two or four. But we're going to see machines -- 8, 16, 32 cores -- very quickly here, and just casually calling into three or four models, each of which creates its own thread pool of 32, leads to oversubscription really fast. It makes no sense. The Concurrency Runtime solves that problem by being an agent that every one of these models calls into and says: I want some threads, I want a thread pool. Then the thread pool can be shared and coordinated, and we don't end up with ten different thread pools of 32, hopefully.
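The "one shared scheduler per process" idea can be sketched as two unrelated libraries routing their work through a single, fixed-size pool instead of each creating its own. This is a hedged Python illustration; `library_a_sum` and `library_b_count_even` are made-up stand-ins for independently written parallel libraries, not real APIs:

```python
from concurrent.futures import ThreadPoolExecutor

# One process-wide pool, sized once for the machine. Every "library" below
# submits tasks here rather than spinning up its own full-size thread pool.
SHARED_POOL = ThreadPoolExecutor(max_workers=4)

def library_a_sum(xs):
    # imagine this lives in a math library
    return SHARED_POOL.submit(sum, xs).result()

def library_b_count_even(xs):
    # imagine this lives in a second, unrelated library
    return SHARED_POOL.submit(lambda: sum(1 for x in xs if x % 2 == 0)).result()

total = library_a_sum(range(1000))
evens = library_b_count_even(range(1000))
```

Because both calls draw on the same four workers, composing the two libraries never multiplies the thread count -- which is exactly the coordination the Concurrency Runtime provides underneath PPL, TBB, and friends.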
Teixeira: Yeah. The point you bring up, James, is a really good one. It's not just about separate applications spinning up their own thread pools. Even within one application, if I'm using libraries A, B, and C and they're all parallelized, they could each be parallelized with their own little thread pools, and we actually see this. We talk to customers who are using some third-party library they don't necessarily have the source code for, and you bring this stuff up in the profiler, you launch the app, and suddenly 80 threads are going wild.
Reinders: What is this library that I was using? I didn't know it created all those threads.
Teixeira: Totally. Hence the idea of having a base layer that's really going to manage the concurrency across the whole application, and that everybody can count on being there, being available.
One of the things we did to create that assurance for developers is put the Concurrency Runtime in the Microsoft C runtime. So if you're building an application on Windows, whether you're strictly using the Visual C++ compiler and libraries or you're also mixing in the Intel compiler, either way you can link with the Concurrency Runtime. It's there, it's available. You don't have to do anything weird like download this, get that. If you're building a C++ app, it's just there for you.
Reinders: So on top of the Concurrency Runtime we see models like PPL, TPL, and Intel Threading Building Blocks (TBB) coming in, all of them espousing tasks, and all of them, when you look underneath, using task-stealing algorithms. How important is task stealing? How should I be thinking about it?
Teixeira: Task stealing is a really cool concept. One of the issues you run into -- let's take a simple case like a Parallel_for loop. Say I've got this For loop that iterates N times, and I've got a quad-core machine. A naïve way to parallelize that: we check the number of processors, we see we've got four, we take this For loop and divide it into four chunks, and we give each thread one quarter of the workload, and they all run. In fact, many applications still parallelize that way today.
Reinders: I call that 'static scheduling'.
Teixeira: Statically scheduled across the processors.
Reinders: Everybody gets a quarter. Hope it takes the same amount of time.
Teixeira: Then here is the canonical problem that blows up a static scheduling algorithm: calculating prime numbers. It's the kind of workload where, close to zero, it's very little work to see whether a number is prime, and as you head toward infinity it becomes really quite a lot of work to figure out whether or not a number is prime.
So a static scheduling algorithm will end up with an imbalanced workload. One quarter of the workload will finish really soon; the next quarter will take a little longer, and a little longer, and a little longer. The problem is that the time it takes you to solve the problem -- your time to solution -- is as long as the longest piece of work. What tasks and work stealing allow you to do is break that work up into even finer-grained pieces. So not just four; maybe it's broken into 8 or 16 or 32 chunks of work, and when one of those four workers runs out of work, it can steal some tasks from a neighbor. So let me draw you a little picture and show you how that works. Let's say I've got my application, and my application has some task or thread that's running and executing. We'll call that a little button -- it looks like a little power button -- and I've got this global queue here, and this global queue is going to store a bunch of tasks that I create, so my thread can spin up Task 1, Task 2, Task 3, Task 4.
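The prime-number imbalance can be made concrete with a crude cost model: count the trial divisions needed to classify each number, then total that cost for each of four statically scheduled quarters of the range. This is an illustrative Python sketch (the function names and the choice of N are ours):

```python
def trial_divisions(n):
    """Trial divisions performed before we can classify n (a rough cost model)."""
    count = 0
    d = 2
    while d * d <= n:
        count += 1
        if n % d == 0:
            break  # found a factor, n is composite, stop early
        d += 1
    return count

N = 40_000
quarter = N // 4
# Static scheduling: worker k gets the contiguous quarter [k*quarter, (k+1)*quarter).
chunk_cost = [sum(trial_divisions(n) for n in range(k * quarter, (k + 1) * quarter))
              for k in range(4)]
```

The later quarters carry substantially more work than the first, so the worker that drew the last quarter finishes long after the others -- exactly the staggered finish described above, and the reason finer-grained tasks plus stealing beat a static four-way split.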
Reinders: One of these is running and the rest are sitting in the queue waiting.
Teixeira: In this case, these are all sitting in the queue right now; I've got a Parallel_for and it's spun all these things up. What the underlying runtime will do, underneath the sheets, is actually create a thread pool -- you don't really have to know that as a developer, but that's what's happening -- and the thread pool will figure out the right number of threads to create to handle the workload present in the queue.
So let's say I actually need four of these things, so I'm going to create four buttons; each of these is a thread that's running. We'll call them Worker 1, Worker 2, Worker 3, and Worker 4. Worker 1 says, I see the global queue has an item; I'm going to grab that and start executing it. Each of these ends up with its own local queue as well. I'll tell you why in just a second.
So they end up with their own local queues. Let's say Worker 1 pulls off Task 1 and starts executing it, but Task 1 now starts spinning off its own tasks. Let's say it's a nested Parallel_for loop. Those will end up in a local queue, a queue that's local to Worker 1. We'll call these 5, 6, and 7, and we'll throw 8 in for good measure. So these end up on a local queue.
One of the advantages of this is that the global queue is managed by a kind of rocket-science, lock-free algorithm. You want as little contention on it as possible, and one way to avoid contention is to say: Okay, if I spin up some work locally, I'm just going to put it on my local queue, and then when I'm done executing T1 I can pop off T5, then T6, then T7 and T8, in that order.
One of the advantages of working in this direction is that you get some cache affinity. The logic is: if I spun up these tasks one after the other, they're probably located near each other, either from a code standpoint or even from a memory standpoint, so I'm going to get great cache behavior. Now, obviously, these other workers are also looking at the global queue, so in this model Worker 2 might pop off Task 2 and Worker 3 might pop off Task 3, and so on.
One of the cool advantages work stealing gives you: let's say this is my prime number calculation, and Worker 1 is running the far end of that prime-number range, the piece that takes a long time. When Workers 2 through 4 finish, their task queues are empty, and let's say the global queue is empty too. What Worker 2 will do is look at its neighbor and go, Hey, neighbor, have you got any work I can take off your hands? And it will see, yeah, there's all this work here. It will actually start stealing from the top, so T8 will disappear from Worker 1's local queue and reappear in Worker 2's local queue.
Notice how it pops it from the top. The thinking is: I don't want to mess with the good cache mojo that Worker 1 might have rolling, so I'm going to steal some stuff from the top, figuring I'll cause the least perturbation to my neighbor.
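The two-ended queue being drawn here can be sketched as a toy data structure: the owner consumes its spawned tasks in order at one end, while an idle neighbor steals from the opposite end, disturbing the owner as little as possible. This follows the picture as described in the conversation; real schedulers use lock-free structures, and this single-threaded Python sketch does not attempt that:

```python
from collections import deque

class WorkerQueue:
    """Toy per-worker task queue: owner works one end, thieves take the other."""
    def __init__(self):
        self._tasks = deque()

    def push(self, task):
        self._tasks.append(task)        # newly spawned work goes on the end

    def take_local(self):
        # the owner continues with its tasks in spawn order (cache-warm work)
        return self._tasks.popleft() if self._tasks else None

    def steal(self):
        # an idle neighbor takes from the far end, away from the owner
        return self._tasks.pop() if self._tasks else None

worker1 = WorkerQueue()
for t in ["T5", "T6", "T7", "T8"]:   # tasks spawned by T1's nested loop
    worker1.push(t)

next_local = worker1.take_local()    # Worker 1 proceeds with T5
stolen = worker1.steal()             # an idle Worker 2 steals T8
```

The point of the two ends is exactly the "least perturbation" argument: the thief and the owner rarely compete for the same task, and the owner keeps the tasks most likely to be hot in its cache.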
Reinders: Caches matter a lot.
Teixeira: Exactly. So in this way, you can imagine all of these workers pulling from the global queue, working off their own queues, and taking from their neighbors when the neighbors have work and they're empty. In that prime number example, instead of a very staggered finish and a very imbalanced workload, you end up with all four workers finishing at around the same time, and your time to solution ends up being much shorter.
Reinders: You've stressed the imbalance that can be in the calculations. I do that when I describe this as well, and inevitably I'll get people who tell me, No, my problem is not imbalanced. But if Worker 3 is sharing a core with one of our favorite things, like a virus checker, the same thing happens -- the same imbalance. That worker thread won't be getting its share of the work done, even though we spread the work around for very good reasons. The stealing algorithms work just as well when the interference comes from outside your application.
Teixeira: That's a great point.
Reinders: That turns out to be important because, if you ignore it, the time necessary to run the application, at least in that phase, is equal to whoever does their work the slowest, even if it's only because a virus checker kicked in on core number three or something.
Teixeira: That's right. Or even internal to the application, you could end up in a situation where one worker starts false sharing for some reason, and the work just takes a lot longer on two particular workers. So you can end up in that situation even within one application, and even when you have calculations that ostensibly should take about the same amount of time. They may or may not.
Reinders: The task-stealing algorithms come out of research at MIT -- the Cilk research -- and that's become, I think, the way to build this sort of scheduler. But this is complexity that you and I enjoy talking about, and it's under the hood. We've given developers abstractions on top: we've got Intel TBB, and we've got the Parallel Patterns Library (PPL) in Visual Studio 2010, which give this to C and C++ programmers. But Microsoft also took it a step further and has this task stealing available on .NET, in the TPL.
Teixeira: The TPL. That's right.
Reinders: What is unique about TPL? What should we know about TPL? It has the same thing under the hood. Is there anything different?
Teixeira: Yes. Just to rewind a tiny bit, one of the points you bring up is a good one, which is that there are a lot of different entry points to this technology. Some people may want to program directly to the Concurrency Runtime, and that's totally cool. Some people may want to use the Parallel Patterns Library. Some people may want to use TBB. Some people may want to continue to use OpenMP, which people have used successfully for many years.
We have a couple of other programming models in native code. We also have asynchronous agents: the idea of having agents that are, from a code standpoint, disconnected from one another. They don't share any state, so you get around all these locking problems by having them communicate only via messaging. At its core, that leverages the functionality in the Concurrency Runtime as well.
Reinders: Is that Axum? Is that the name for it?
Teixeira: Axum is actually a variation on that theme that focuses on managed code. The asynchronous agents are shipping in the native code library in this release. So, yeah, we have some incubations, technology that we've made available to the public via our DevLabs download site.
Reinders: So we're getting this basic infrastructure -- Concurrency Runtime, task stealing -- and then we can have a proliferation of models on top and we can mix and match them.
Teixeira: That's the idea.
Reinders: That's really the great thing about having the infrastructure done right.
Teixeira: We've talked a lot about the native code side. You brought up the Task Parallel Library (TPL), which is on the managed code side; within .NET, TPL is very much like PPL. It's a task-based programming model. You can imperatively create tasks by saying: Task, create; here is your lambda function to go execute. Or you can more declaratively state that tasks should be created by doing a Parallel_for or Parallel_foreach, and then underneath the hood tasks start getting spun up, although from a developer standpoint all you said was Parallel_for.
Also on the managed code side, we've got another interesting technology, 'parallel LINQ', or PLINQ. LINQ is this interesting programming model that takes some SQL-like elements and brings them into .NET languages, so you can start doing SELECT statements and ORDER BYs and various SQL kinds of things that feel very natural to SQL people, but they appear as a native-looking part of the managed code programming models.
What we did in this release is take that a little step further: you can add '.AsParallel' to any of those queries, and if it's over local data -- in other words, it's not a query going off to a database; databases are really good at parallelizing already -- it will use the available cores on your machine to basically lower your time to solution. It's almost like magic. I had this LINQ query, I added 'AsParallel', and suddenly it's 3.9 times faster on my four-way machine.
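The AsParallel idea -- partition an in-memory sequence, run the selector on each partition concurrently, and stitch the results back in input order -- can be sketched outside .NET. This is a hedged Python illustration of the semantics only (the name `as_parallel_select` is ours); in CPython, threads won't actually speed up a CPU-bound selector because of the GIL, so this shows the shape of the transformation, not the 3.9x speedup:

```python
from concurrent.futures import ThreadPoolExecutor

def as_parallel_select(data, selector, workers=4):
    """Apply selector to every element, partitioned across workers, order kept."""
    data = list(data)
    n = len(data)
    bounds = [(k * n // workers, (k + 1) * n // workers) for k in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields per-partition results in submission order,
        # so flattening them preserves the original element order.
        parts = list(pool.map(lambda b: [selector(x) for x in data[b[0]:b[1]]],
                              bounds))
    return [y for part in parts for y in part]

squares = as_parallel_select(range(10), lambda x: x * x)
```

From the caller's point of view, the only change from a sequential select is routing through `as_parallel_select` -- the one-token change is the whole appeal of the '.AsParallel' style.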
Reinders: I know. In your demos, you go in, you open up the code, and you add this little 'AsParallel'. It looks so simple, and then you run it and it runs faster. But you've been seeing people get very creative with the uses of this. You were telling me that you've been seeing things you weren't expecting, because it's a pretty general mechanism.
Teixeira: That's right. It works on any IEnumerable dataset, and so we're starting to see people go, Wow, I could use this for everything. Anytime I have a list of stuff, I can just add '.AsParallel' to it and work with that list much more quickly. So we're kind of delighted. I always think of Guy Kawasaki's book "Rules for Revolutionaries," where he talks about building your technology such that people can surprise you by using it in ways you never intended. This is a case of that: people have fun using the stuff in ways we didn't necessarily think would be mainstream. Sometimes the workloads that turn out to be mainstream aren't the ones you initially optimized for, but the trick is to embrace those things as you revise the software and make sure they become focuses of how you evolve it.
Reinders: I'm very excited about this. We've got great infrastructure. We've got some great models together coming out: Visual Studio and Parallel Studio together. What a great environment. But what do we do to help people when it doesn't go right? I know it probably never happens to a developer that we have to debug anything. But does concurrency make debugging more difficult? Or, are we getting some things done there that we should talk about?
Teixeira: Yeah, man, it makes it way more difficult. Take just the simple cases we've talked about already, like parallel LINQ expressions or Parallel_for loops. We talked about the bright, sunny, unicorns-and-rainbows side of it. There is this other side: what if I'm reading and writing a global variable within this For loop or this LINQ expression? Or what if I have some loop-carried dependency, where the value of one iteration depends on the value of a previous iteration? Suddenly my loop doesn't decompose in a parallel manner very effectively. The problem is that the programming models don't tell you that, so you just get it wrong, you get the wrong answer, and you go, Gosh, why did I get the wrong answer? So we're bringing a couple of technologies to bear here. One, on the Visual Studio side, is new debugging and profiling support to help you figure out where the problems are and help you optimize performance.
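The shared-global failure mode described here has a standard cure: give each task its own partial result and combine them once at the end, so no two tasks ever write the same state. A hedged Python sketch (the name `parallel_sum` is ours, not a library API); the broken variant it replaces would have every task doing `total += x` on one shared variable, a read-modify-write race no compiler warns about:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(xs, workers=4):
    """Race-free parallel reduction: per-task partials, one final combine."""
    xs = list(xs)
    n = len(xs)
    bounds = [(k * n // workers, (k + 1) * n // workers) for k in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # each task reduces only its own chunk: no shared writes, no race
        partials = list(pool.map(lambda b: sum(xs[b[0]:b[1]]), bounds))
    return sum(partials)  # single-threaded combine at the very end

total = parallel_sum(range(100_000))
```

The race detectors discussed next exist precisely because the buggy version usually produces a plausible-looking wrong answer rather than a crash, which is what makes it so hard to spot from source alone.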
One of the things I'm very excited about on the Parallel Studio side of things is the ability to test for correctness in various ways. So when I have these race conditions that may not be obvious to me by looking at the source code, I actually run them through the Parallel Studio tools and go, I've got a race here and now I need to go fix it.
Yeah, it's very, very difficult for developers to get this right, even super-advanced developers. This isn't one of those tale-of-two-cities situations where we have the developers who are awesome and can cope with concurrency and those who aren't and can't. This is a problem all developers share. Even awesome developers get concurrent algorithms wrong. It's way too hard. Herb Sutter likes to call it a 'hard for geniuses' problem, which I think is a great description.
Reinders: Absolutely. But we've been missing some of the things we're used to in sequential programming: being able to see what's really happening on the hardware, which is what the profiling tools give you -- Visual Studio has its views, and Parallel Studio has its views. By 'views' I mean: how many ways can you show me what's going on? So together we've got all these great things. Oh, that's what's really happening on the machine. Those 'ahas' are great. And then being able to pinpoint an error and say, This source line is questionable, as opposed to: Guess what? Your variables have the wrong values. Good luck figuring out how that happened.
We're used to that in sequential programming; over the years we've built up tools to help us with all these idiosyncrasies, and now we're seeing the emergence of tools in Visual Studio and Parallel Studio that show us what's really going on, even though there is concurrency, and that pinpoint an error, even when it's caused by concurrency. I look at it as a maturing of the tools. I don't care what a genius you are: if you don't have tools to help you with those problems, you're lost. Now, with the combination of Visual Studio and Parallel Studio, I'll go out on a limb. I think right now, without a doubt, Windows has the most support for concurrency -- for writing a concurrent program and debugging it. I see the richest environment there because of this work, and it's very exciting, I think, for developers looking to add concurrency that we've got it all together now.
Teixeira: You certainly won't get any argument from me; I totally agree. One of the big problems we've seen, for example, in the debugging world is that debugging tools haven't evolved with concurrency in mind until more recently, with what we're doing in this set of releases. To my mind there have always been two big holes in debugging. One is the ability to debug at the same level of abstraction at which the developer writes the code. Too often we let developers use some high-level programming model -- OpenMP or whatever -- to express their parallelism, and then when there is a problem and it's time to debug, now we're in threads: I'm looking at threads, I'm looking at call stacks, and I'm looking at various variables inside stack frames within those call stacks.
Reinders: Good luck with that.
Teixeira: Yeah, and I'm like, I said OMP Parallel_for, and suddenly I've got all these threads. There is this big black box in between, obviously, and I don't know how to reconcile these two pieces anymore. So one of the big efforts in Visual Studio 2010 is to surface the programming models as first-class citizens, so that if I program with tasks, the IDE actually shows me: these are the tasks I created; here is where they are in source code. I can double-click on them. I can see which ones are running and which ones the scheduler has queued up and waiting to run. I can even see if I've got two tasks deadlocked on one another. So, really rich tooling.
The other debugging problem I've seen is the ability to cope with information overload and scale. As we add greater and greater levels of parallelism -- it's great if I have this list box of threads; that scales really well to a dozen threads, maybe a dozen and a half. But when I start getting 32 or 64, or I go to a GPU and get gosh knows how many threads, I can't just look at a list and understand it anymore. I need ways to break it down, categorize it, sort it, and build parent-child hierarchies on these things. So it's tools that enable you to cope with the vast scale that parallelism presents.
Reinders: And it's not going to get easier.
Teixeira: No, we're just scratching the surface.
Reinders: We're going to keep adding more and more threads.
Teixeira: Absolutely. So these are the two areas of focus: the idea of surfacing programming models as a first class citizen and the idea of helping developers cope with scale of parallelism.
Reinders: I know we have a lot of problems to tackle in the future on the tooling side for developers, but I really do think we've hit a sweet spot now, where the tooling has reached the maturity that anybody writing any application should be looking at how to take advantage of multicore, and the tooling is there. How would you recommend people get started?
Teixeira: Pretty easy to get started. You can go to the Microsoft Web site and download a copy of Visual Studio 2010 trial.
Reinders: You can do that for free now?
Teixeira: Do that for free now.
Reinders: Then you go visit the Intel site.
Teixeira: Get a copy of Parallel Studio trial.
Reinders: And pick up a Parallel Studio trial and off you go.
Teixeira: It's really easy. One of the areas where I think they work really well together is using Parallel Studio for correctness analysis -- deadlocks and things like that, and also memory safety. On the Visual Studio side, we talked about debugging; one thing we didn't talk about is performance analysis, which is another piece we filled in, and it actually works in a complementary way with the Intel VTune tool as well.
One of the things we discovered is that it gets really hard when you start throwing all these numbers at developers with concurrency. You've got all these threads generating all this data -- maybe it's hardware tick data or perf counters, whatever -- and it's data overload. So we're drawing pictures. We went Crayola for the profiler, and it's kind of a tea-leaf-reading type of thing. What we discovered is that we can paint developers a visualization of their concurrency: what are all my threads doing over time? When are they executing? When are they waiting at a lock? When are they put to sleep in the pool? When is some other thing taking their time quantum away to run some other workload on the system?
We found that by looking at these pictures, developers can go: I recognize that -- my system is oversubscribed. Or: I recognize that -- I've got a load imbalance. Or: I recognize that -- I've got a lock convoy; all my threads are convoying up at one lock. You can get these just by looking at a picture, which we found was kind of a discovery process. We knew there was something there, and as we kept iterating on it throughout the release, we found it's really a cool tea-leaf-reading exercise.
Reinders: Absolutely. You try to draw a simple picture that intuitively makes sense, and you go, I see it. I do think I've seen you do some demos showing some of those. It's not just the idea that you want to do that; it's when you actually find the pictures that make sense and that matter. That's what I call these different views, and I think Visual Studio has some wonderful views that do that. We've got some we're very proud of, like our lock-and-waits analysis, which hunts for the thing you really want: the highly contended locks. You may have hundreds of locks in a program, but which ones are really causing the problem? It's amazing: if you draw the right picture, you see them. It's that lock, in conjunction with all the other threads waiting, and nothing happening. I don't like big pauses in an application where no concurrency is happening. It is interesting, when you get the right views, how simple the problem looks, as opposed to having 32 debugger windows open.
Teixeira: You don't know where to start, yeah.
Reinders: And you have to advance each one. It's much more effective.
Teixeira: I think the key in both tools is actionability: I can not only see this picture, but I can double-click on something of interest to me, go back to the source code, and then edit the source code that's related to some of the pictures I'm seeing. That 'twowayness' of the tool helps with the actionability of the information, so it's not just a neat essay on the performance of my program; it's actually something I can take action on to improve my program.
One of the demos I saw you give once was this idea of incorporating memory safety into your analysis workflow. This is not a new thing for developers, but one of the things that is new is this parallelism pivot to memory safety, and I wonder if you can talk a little bit about that.
Reinders: This is one of the 'ahas', if you will, that we've had as we've watched developers adding concurrency to their applications. They found that one of their critical tools for memory analysis -- looking for leaks and different memory problems -- wasn't up to the task, and that alone was a problem. If you have a memory leak in a concurrent program, the headache of debugging it will drive you nuts. It's one of those: What do memory allocations have to do with parallelism? Actually, there are some things you want to do there, too, but just analyzing for those bugs and having a tool that can do that matters. So it's a very critical part of Intel Parallel Studio to be able to do memory leak detection and the various other memory error detections, but to do it even in a concurrent program, and to do it correctly. When I say 'even in a concurrent program', the reason this doesn't work in other tools trying to do it is that they don't scale. They've made assumptions based on the fact that the programs they're testing had one thread, or maybe two because of some concurrency in a library. But when you expand and you've got four, eight, 16 threads going, and each one is doing some memory allocation, this is a critical thing.
I think developers have for a long time thought that one of the best practices in development, at least in the late stages, is to make sure that you don't have memory leaks, because programs eventually become unstable if you do. So now what we're saying is that you need to continue that best practice even though you have concurrency in your application, and Parallel Studio gives you a very easy-to-use tool to do that, one that's very fast and works in a concurrent environment.
That is, in my view, a very important thing to have in tools and a very important reason to have Parallel Studio, but it's kind of revisiting things that we've been doing and making sure they're ready for this concurrent universe. We also have new best practices to develop for the future. We're adding parallelism, which is new to most of us. Do you have some favorite best practices that we should keep in mind? Say I'm going to go run off and add some parallelism to my code. What should I be keeping in mind?
Teixeira: Taking a sequential application, or even an inefficiently parallelized application, and updating it to take advantage of the latest parallel hardware is a pretty tough task.
A couple of things I always talk to developers about in terms of doing this. One is: don't rush out and just parallelize the world; you're probably doing the wrong thing. You want to do some forethought. Recognize that parallelism adds complexity. Even with all the stuff we're doing to make it easier, it's still more complex than the sequential thing. You get the advantage that it goes fast; you get the disadvantage that it's a little bit more complex, and you have that complexity to manage. So be thoughtful about where you parallelize. That's kind of my main point on that.
Then, as you're thinking about parallelizing, a lot of times it helps to have a canonical sequential version of the algorithm you want to write that you can compare to, so you can say, I wrote this sequentially first and I know it goes this fast. That way, when you hopefully optimize it by parallelizing it, you actually optimize it and don't pessimize it, because that can happen if you parallelize something incorrectly. Let's say you put a lock in the wrong place; suddenly it's actually slower than the sequential version. So being able to do that A-B comparison with the sequential implementation is pretty important.
Reinders: I'll take it even further. I think you should, if at all possible -- and I think it generally is possible -- avoid creating a program that will only work when it runs in parallel. Make sure that if it's running on a single thread, the program will still work. There are clever things you can do when writing a program that will make it run only in parallel; just avoid them. You were advocating this for performance comparisons, but falling back to a single-threaded mode for debugging and such can be so valuable that you don't want to lose that ability.
Teixeira: Absolutely, it's a great point. I would also say, along the lines of performance: parallelism is all about getting performance, and I often find that folks too often don't have goals. What goal are you trying to reach with parallelism? It takes ten seconds now. Do you want it to take five? Do you want it to take eight? How do you know when you're done? Being able either to have a target or to raise the flag on the hill and declare victory -- those are really important.
Reinders: Shouldn't I just assume it will run four times as fast?
Teixeira: Absolutely, yeah, you should always assume that, at least four times as fast.
Reinders: But I know what you're saying. I'll ask, How much do you want on a quad core? Is two times good enough? That's a very puzzling thing for people. Sometimes two times, maybe, is plenty. In fact, I think that's quite good for a lot of applications, and you should be happy: your program is running twice as fast as it ever did before. But have you thought about how much you need or expect?
Teixeira: Testing scale is an important aspect of that, too. So as I throw a dual core at it, or a quad core, I can see how fast it gets, and then when I go 8, 16, 32, what does my scalability curve look like? When does it start to taper due to data or locks?
Reinders: Or tanking.
Teixeira: Okay, that can happen as well.
Reinders: That's an important reason to do some testing beyond the platforms you expect to be running on, because you can make an error where it looks lovely at four, eight, and 16, and then 32 is suddenly slower than eight.
Teixeira: That's right.
Reinders: You don't want that and it's usually something quite debuggable that can be fixed if it's tanking. The leveling off is harder; that's not a bug.
Teixeira: That's right. You can also discover bugs. When you crank up the scale, it's like, I didn't see that even at eight cores, but suddenly at 16 cores I can see all these latent bugs that were in my system, ones that were very hard to observe when there was less parallelism in the system.
Reinders: Better to find them testing in our own lab than at the customer site.
Teixeira: That's right. That's a very important aspect.
Reinders: Any other best practices to share, or thoughts, as people go rushing off to throw parallelism in?
Teixeira: I think we've given a good overview. I would say be safe as you parallelize, and think in terms of tasks rather than in terms of threads and machine execution flow. If you do those things, I think folks will see a lot of success.
Reinders: Absolutely. You were really emphasizing the idea of having a plan. You said it in different ways, and I want to emphasize that, because as we've studied people who have added parallelism, or attempted to, the number one reason for failure that we see is that people didn't see ahead to the problems they were introducing. I'd really like to stress the idea of finding a way to prototype before you make a big investment in recoding. We have a project at Intel we call Parallel Advisor that we're working to make part of our product lineup this year. We have a free download on Whatif.Intel.com, like your incubation projects. This is an incubation project.
What is really amazing about it is that it's a tool to help with the advice I was giving, which is that you've got to find a way to prototype. The tool isn't magic; it's not about finding and telling me where to add the parallelism. But once you have the notion of where you think you want the parallelism, what we're doing is building a tool that will let you annotate the code, and then we will tell you implications of it faster than you probably could discover with existing tools. We'll say, "Did you consider your... You just told us to make something global, but you've created a race condition if you write it that way. Did you notice that?" Or, "Did you notice there is a global that is going to be highly contended?" Things like that, to steer you.
What I see developers do, though, without something to help them with that, is just code it up. I've been amazed at how far into coding some things can get before someone realizes, Oh, my gosh, there is a global variable involved here and this will never work.
But if you spend a month developing code that you're very proud of and then realize that there was a flaw in your thinking, something you didn't think of, well, so much for that. That's a failed attempt at parallelism, and then you go do something else for the next year and take a stab at parallelism again.
I would say take a deep breath, and hopefully our Parallel Advisor can help with this. Make sure you have some prototyping strategy to get a feeling for whether you're going to be in the right ballpark when you actually finish coding everything up.
Teixeira: That's right. That's a great point.
Reinders: Obviously, I'm talking beyond simply adding a Parallel_for loop, because parallelism is more than just finding all the loops and adding Parallel_for.
Teixeira: Parallelism is interesting in that it's one of these fields where you can have two correct chunks of code that are parallel, and then when you put them together, the sum of the two is no longer correct. That's a somewhat unique aspect of parallelism. These are often still things we don't teach our computer science students. In the schools many of us came up through, there is this parallel thing and you kind of learn it on your own. Maybe folks who went and did a PhD in parallel computing learned it properly. But in the majority of undergraduate computer science programs, even today, you don't get a strong foundation in how to do parallel computing. So trial-and-error, experimentation, and learning on the job are important.