Until a few years ago, the processor hardware community translated Moore's Law of transistor density directly into single-threaded performance gains as a result of increasing clock frequencies. Lately, this translation has been hampered by the effects of clock frequency on power consumption and heat generation. The new reality is that per-thread performance is essentially static, and an increase in performance is delivered by an increase in the number of available processor cores per socket. This greatly simplifies a processor developer's job in two ways: First, it avoids radically increasing per-socket power consumption; and second, it holds the complexity of a single core somewhat static over time. Thus, processor design remains a tractable discipline.
However, nothing in life comes for free. Just as RISC stood for "Relegate the Important Stuff to the Compiler," increasing the number of threads per socket places enormous stress on almost the entire computing industry, which must now parallelize applications to reap the performance increases. Those affected include the system manufacturers, the compiler writers (who are used to being the punching bags), the operating system designers, the application writers, and finally (and inevitably) the users. The single-threaded application typically does not benefit from multicore hardware; it may actually suffer degraded performance. And the great majority of applications on the market today are single-threaded in design. For years, clock frequency increases allowed developers to postpone the transition to multiprocessor environments (and thus the parallelization of their applications), but that free ride is now over.
In addition to the difficulties of parallelizing an application, there is a limit to the performance gain that a parallelized version of an application can achieve. This limit is defined by a variant of "Amdahl's Law," which states that an application's speedup is bounded by its sequential fraction: no matter how many cores are applied, the program can run no faster than the aggregate of its non-parallelizable components. An analysis of the application may indicate that parallelization at the code level will not yield the sought-after performance gains, and that some other approach (for instance, running multiple copies of a program to increase throughput) will be more effective. Furthermore, on the hardware front, increasing core counts will put pressure on communication bandwidth, particularly for memory accesses. Maintaining the traditional clocked bus architectures will require increased bus clock frequencies, again pushing on the very power-dissipation issues that drove the move to multicore architectures in the first place. Widening the busses involved may provide some temporary relief, but that approach can only be viewed as a patch.
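The bound above can be made concrete with the usual statement of Amdahl's Law, speedup = 1/((1 - p) + p/N), where p is the parallelizable fraction of the work and N is the core count. A minimal sketch (the figures chosen here are illustrative, not taken from the text):

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup for a program whose parallelizable
    fraction is parallel_fraction, run across the given core count."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# Even with 90% of the work parallelized, 16 cores yield only ~6.4x,
# and no core count can ever exceed 1 / 0.1 = 10x.
print(amdahl_speedup(0.90, 16))     # 6.4
print(amdahl_speedup(0.90, 1024))   # ~9.9
```

The sequential 10 percent dominates quickly, which is exactly why an analysis may point toward running multiple program copies instead of parallelizing one.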
Despite these challenges, the component manufacturers (Intel, AMD, etc.) are gambling that system and software developers will pick up the ball and move toward parallelizing their applications quickly enough to keep up with the core count arms race.
Okay, It's Not All Doom and Gloom
Luckily, some applications are inherently scalable and can accommodate exponential growth in data by applying additional computing resources. For example, digital content creators generally assign one frame of a movie to a single thread. It really doesn't matter how long that thread takes to complete, as long as the number of thread completions per unit of time scales appropriately. Other examples include some High Performance Computing (HPC) applications (often developed using the Message Passing Interface (MPI) or OpenMP programming models) and applications developed using Google's MapReduce library (see also Apache Hadoop).
Furthermore, a nice property of the transition from clock frequency increases to multicore processors is that memory latency is no longer a worsening problem. Those of us who remember the utter joy of upgrading from 026 to 029 keypunches (the 10 or so of you who are now in danger of dropping your dentures!) also remember when memory systems ran at the same speed as the CPU: a word of memory was one clock tick away. As processor speeds increased, memory latency failed to keep pace, to the point where, at 3GHz (a 0.333ns cycle) and approximately 80ns latency, a word of memory is approximately 240 ticks away from the CPU.
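The arithmetic behind that 240-tick figure is a one-liner, and worth sanity-checking:

```python
# At 3 GHz, one clock tick is 1/3 ns; an ~80 ns memory access
# therefore costs on the order of 240 ticks.
clock_hz = 3.0e9
tick_ns = 1e9 / clock_hz        # ~0.333 ns per tick
latency_ns = 80.0
ticks = latency_ns / tick_ns
print(round(ticks))             # 240
```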
A particularly helpful side effect of keeping a constant CPU frequency is that the relative time (in clock ticks) to access memory no longer increases with new generations of processors.
Many applications aren't able to take full advantage of current CPU clock speeds because their performance is dictated by memory latency (for example, programs that spend most of their time chasing linked lists). If many copies of these programs are run in parallel, latency can be overcome: the cores collectively can issue enough outstanding loads to "hide" the underlying memory system latency. In fact, early multicore systems were designed with such workloads in mind.
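To make the pointer-chasing pattern concrete, here is an illustrative sketch (in Python, so it demonstrates the dependency structure rather than real memory timings): each step of the walk needs the previous step's result, so a single copy forms one serialized chain, while independent copies give the machine multiple chains to overlap.

```python
from multiprocessing import Pool
import random

def chase(args):
    """Walk a randomly linked list of n nodes; each step is a
    dependent "load" that cannot begin until the previous one finishes."""
    seed, n = args
    nxt = list(range(n))
    random.Random(seed).shuffle(nxt)   # scramble the links
    i, steps = 0, 0
    for _ in range(n):
        i = nxt[i]                     # serialized dependency chain
        steps += 1
    return steps

if __name__ == "__main__":
    # Four independent walkers: collectively they keep more "loads"
    # in flight than any single chain could on its own.
    with Pool(4) as p:
        print(sum(p.map(chase, [(s, 100_000) for s in range(4)])))  # 400000
```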
Also, keeping a relatively stable clock speed has had a positive effect on CPU power consumption. Calculations by Intel for 65nm CPUs showed that increasing the core frequency by a factor of 1.3 would double CPU power consumption. Doubling the core count and retarding the clock by 10 percent, in contrast, would boost performance by a factor of 1.8 with no increase in CPU power consumption.
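A first-order model makes the frequency figure plausible (this is the textbook dynamic-power approximation, not Intel's published methodology): dynamic power is roughly C·V²·f, and because supply voltage tends to be scaled along with frequency, power grows roughly as f³. Under those stated assumptions:

```python
# Crude dynamic-power model: P ~ f^3 per core (V scales with f),
# and performance ~ f per core. Both are modeling assumptions.
def relative_power(freq_scale, cores=1):
    return cores * freq_scale ** 3

def relative_perf(freq_scale, cores=1):
    return cores * freq_scale

print(relative_power(1.3))           # ~2.2: a 1.3x clock roughly doubles power
print(relative_perf(0.9, cores=2))   # 1.8: two cores at 0.9x clock
print(relative_power(0.9, cores=2))  # ~1.46 under this model; Intel's
                                     # break-even figure implies additional
                                     # voltage reduction beyond what f^3 captures
```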