Much of my programming life these last 15 years has explored fine-grained parallelism; that is, the use of parallel threads within the same program. I coauthored one of the first books on the topic for Intel Press, in which I carefully explained the traditional techniques of mutual exclusion. I even drilled into the then-novel concept of OpenMP as a way of simplifying the burdensome work of getting concurrency right. Since then, I've explored actors. And I'm currently toying with the concept of channels in Google's Go language. Had I the time and the inclination, I could have strolled into Cilk, Intel's OpenMP-like syntax for concurrency, or thrown myself into numerous other concurrency options.
The chief motivation of all these approaches is to make concurrency easier so that programmers will adopt it and make better use of the processor hardware. This view conditioned my thinking for a long time, and I've written about it in detail, as has Herb Sutter, and our blogger, the much-admired Clay Breshears. But I've come to the conclusion that the effort is a quixotic campaign that will never attain its intended results. Despite built-in primitives for parallel work in Scala, Erlang, Go, and increasingly C++, the only developers using these options are on the server side. No one is threading on the client side unless they are writing games, work for an ISV, or are doing scientific work. Of course, there are exceptions, but their numbers are few, and looking at programmer forums, I see zero evidence that their numbers are about to surge. I see even less evidence that the only thing holding them back is the wait for the many-core era to begin. To respond to Herb Sutter's contention that the free lunch (of single-threaded programming) is over, I'd have to say no one is going hungry doing exactly what they were doing before.
If you put aside server software for the moment, the only folks who are doing anything with parallelism on the client are those who can offload what is called "embarrassingly parallel" data processing. That is, doing repeated computations on arrays or matrices, where each result depends only on its own input. And they're not using OpenMP or mutual exclusion. They're wrapping the data in a little packet and sending it to the GPU for processing, using CUDA, OpenCL, OpenACC, or whatever.
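To make "embarrassingly parallel" concrete, here is a minimal sketch in Python. It is not GPU code and it is not how CUDA or OpenCL work internally; it only shows the shape of the problem. The function names (`scale_chunk`, `split`, `parallel_scale`) and the choice of a thread pool are my own assumptions for illustration. The point is that each chunk can be processed without knowing anything about the others, so no locks are needed.

```python
from concurrent.futures import ThreadPoolExecutor


def scale_chunk(chunk, factor):
    # Each output element depends only on the matching input element,
    # so chunks can be processed in any order with no shared state.
    return [x * factor for x in chunk]


def split(data, parts):
    # Cut data into at most `parts` contiguous slices (parts must be >= 1).
    size = max(1, -(-len(data) // parts))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_scale(data, factor, workers=4):
    # Hypothetical helper: hand each slice to a worker, then stitch the
    # results back together. pool.map returns results in input order.
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda c: scale_chunk(c, factor), chunks)
        return [y for part in results for y in part]
```

In CPython, a thread pool like this will not speed up pure-Python arithmetic, so treat it as a picture of the structure rather than a performance technique. The same independence between elements is what lets this kind of work be shipped off to a GPU wholesale.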
And what about all those cores that will go to waste? The ones that Herb and others (including me previously) asserted users would insist on in order to get top-of-the-line performance for their apps? What was I smoking? Most cores today sit around doing almost nothing on most PCs. And here's something you never hear anyone say anymore: "Boy, my laptop is so slow." It ain't happening. Give me 8 cores instead of the 4 I have and they will do the same amount of nothing unless Windows 8 or Ubuntu 14 sops up the spare cycles. If tool vendors and other ISVs provide more threading, I might gain some minimal advantage. If they don't, I'll still be fine. Except for games and some build cycles, I'm almost never waiting because the CPU has maxed out. Same thing on my tablet. CPU speed trumps thread count all day long. Faster processors? Great. But 2 cores versus 4 cores at the same processor speed? I can barely tell the difference.
Coarse-grained parallelism, however, is about to get significantly more interesting. "Coarse-grained" refers to process-level parallelism; that is, running separate processes on separate processors. Right now, we have the x86 standard fare of 8-core processors, with threats that Intel's MIC initiative will scale cores into the 20s and 30s and more on a single piece of silicon. What this will mean is that suddenly we can run many instances of the same program on one machine. Servers in particular will like this. Load balancers on Web servers will now moderate among numerous individual servers running on the same piece of silicon. The idea of single-chip clusters has some fascinating possibilities that I'll examine in the future.
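For readers who want to see what "many instances of the same program on one machine" looks like, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the one-line `WORKER` script stands in for any ordinary single-threaded program, and `run_instances` is a hypothetical helper, not part of any real tool. Each instance is a full operating-system process with its own address space, and the OS decides which core it lands on.

```python
import subprocess
import sys

# Stand-in for "the program": a single-threaded script that sums a range
# given on its command line. It contains no threads and no locks.
WORKER = (
    "import sys; "
    "lo, hi = int(sys.argv[1]), int(sys.argv[2]); "
    "print(sum(range(lo, hi)))"
)


def run_instances(ranges):
    # Start one independent OS process per (lo, hi) pair. All of them are
    # launched before any result is collected, so they run concurrently.
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER, str(lo), str(hi)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for lo, hi in ranges
    ]
    results = []
    for proc in procs:
        out, _ = proc.communicate()  # wait for this instance to finish
        if proc.returncode != 0:
            raise RuntimeError("worker exited with code %d" % proc.returncode)
        results.append(int(out))
    return results


if __name__ == "__main__":
    # Four instances of the same program, each working on its own slice.
    print(run_instances([(0, 25), (25, 50), (50, 75), (75, 100)]))
```

Notice that the worker itself stays plain sequential code. All the parallelism comes from running more copies of it, which is the kind of scaling that extra cores can absorb without anyone writing threaded code.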
But the shiny object that has my attention at present consists of low-voltage ARM-type chips running on tiny inexpensive systems that can be stacked together to do all kinds of interesting things for a fraction of the power my Intel Xeon uses. Think of a PC that's just a console that dials in to a range of small machines, each hosting its own app. So, for example, one ARM chip runs my browser. I can load it up with all kinds of things without slowing down the system I am actually working on. Moreover, malware attacks just that one VM. The stack of systems over there? That's my gaming platform. And over there, those machines are my build system with two machines that host my IDE and debugger. My multimedia runs on that little guy over there. And there I host my blog.
Each system will become a collection of smaller systems, rather like a blade server, except made up of tiny machines that sip power and deliver exactly what I need, all using separate processors and separate process spaces, so no one system interferes with the performance of any other. Life is good. I'll be gobbling up cores like popcorn and still not writing parallel code. Why? Because instead of scaling up (more threads in one process), I've scaled out (more processes).