The Future of Computing

Worries about runaway power consumption may replace concerns about speed on the next generation of CPUs


July 17, 2006
URL:http://www.drdobbs.com/parallel/the-future-of-computing/190400539

Max Fomitchev is an Assistant Professor of Computer Science at Penn State. He has a Ph.D. in computer engineering from the Moscow Institute of Electronic Engineering. He is the author of the "Software and Web Development" blog at TechSearch.com and of the book Enterprise Application Development with Visual C++ 2005.


Life goes in circles, or in spirals to be more precise. Thus, to get a glimpse of the future, you should perhaps look to the past. Computing has become an inseparable component of our lives and is therefore subject to the same law of cycles and spirals. So let's start in the 1940s, at the dawn of electronic computing, when ENIAC was the pinnacle of scientific engineering.

Initially occupying whole buildings, then scaling down to individual rooms and towering boxes, the computers of the 1950s, '60s, and '70s were essentially mainframes: large and powerful, special-purpose, and accessible to few. This centralized approach to computing changed dramatically in the late '70s and early '80s with the introduction of microcomputers such as the Apple II and the IBM PC. All of a sudden computers ceased to be shared resources built for a particular purpose and instead became personal tools for general-purpose tasks, and throughout the '80s and '90s they began to occupy our desktops, bedrooms, and closets.

This trend of decentralized computing met a subtle reversal in the '90s, when the Internet and World Wide Web provided a means for integrating decentralized computational resources into a unified client-server environment. In reality, what seems like 60 years of technological advancement represents a full evolutionary cycle: We started with shared computational resources occupying rooms of equipment and, through a brief desktop detour, arrived at a shared computational resource model built on the backbone of the intranet/Internet. There is an unquestionable numerical difference between what we had 30 years ago and what we have now, in the sense that computers are now used by a much larger population and for a far wider range of tasks. There is a characteristic difference too: We kept our desktop PCs, and we use them as more than mere terminals. In fact, computing did not just come full circle; it completed a loop of the spiral: We have not really come back to good old centralized computing but rather arrived at a distributed computing model. Although the bulk of the work may be done by centralized resources such as servers providing computational services, our desktop PCs and client workstations independently handle a multitude of tasks.

Our equipment rooms are also different from what we had decades ago: Instead of one large computer, modern data centers are filled with hundreds and thousands of servers, both rack-mount and blade. This evolutionary change parallels the evolution of electronic components: Discrete elements were gradually replaced with integrated circuits, just as individual mainframes are now being replaced with racks of blade servers. Extrapolating the parallel further, we may expect even more tightly integrated "microblades" to arrive in the near future, once we master computer integration to the same degree that we have mastered integrated circuits.

Evolutionary Changes

Internally, computers are undergoing cyclical evolutionary changes as well: CPUs gradually evolved from spending tens, if not hundreds, of cycles on an individual instruction to just one cycle per instruction (scalar architecture). Then the introduction of additional execution units allowed CPUs to process several instructions per clock cycle, exploiting instruction-level parallelism (superscalar architecture). Later, several CPUs were crammed onto a motherboard (multi-processor architecture). Now several cores are fused together in a single multi-core package, and several such multi-core chips can be installed on a single motherboard. So in the end a data center is filled with clusters of stacks of server blades, with each blade sporting one or more multi-core superscalar CPUs. So we have at least five different levels of integration, with four of them typically found on desktop PCs. But why do we need this complexity, and how did it come into being?

As computer clock speeds increased from kilohertz to gigahertz, so did our imagination and understanding of what could be done with this computational power to serve our needs; for example, providing entertainment (at home) and boosting productivity (the practical reason for computers in the workplace). Originally, computers were designed to serve a clearly defined special purpose and therefore were meant to perform a specific, single task. When computer power grew beyond immediate needs, multi-tasking was invented to give multiple users access to the spare computational resources. But when hardware costs were reduced to the consumer level, personal computers came out, and they quite naturally were designed to be single-tasking.

The first mass-produced personal computers were quite slow, and the need to increase the performance of microcomputer processors was justified at first: We wanted good response from desktop applications, and occasionally we wanted to play arcade games. The 4.77 MHz of the PC XT was not always good enough for the purpose. But as CPU power grew to meet the specific tasks we wanted our PCs to perform, it became more than was needed for general tasks such as text editing or spreadsheets. That extra power, just as in the case of the old mainframes, led to the adoption of multi-tasking operating systems on desktop and personal computers. We had extra power and we wanted to do something with it.

Ironically, the mass adoption of multi-tasking operating systems on PCs (Microsoft Windows, for instance) coincided with the introduction of the graphical user interface. Thus formerly more or less satisfactory CPU performance became vastly inadequate and spurred a race to increase CPU performance to compensate for inefficient software. The whole transition from single-threaded, text-based DOS programs to graphical, multithreaded Windows resulted in unprecedented bloating of software code and a general system slowdown due to lagging graphics and disk I/O performance. Erroneous programming paradigms such as dynamically loaded libraries, dynamic memory allocation, shared components, inefficient object-oriented programming, and multi-layered libraries also greatly contributed to the slowdown. All these inefficiencies instantly justified the need for further CPU performance increases. Now we needed faster computers just to run our operating systems and the new versions of old software burdened with graphical user interfaces.
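To see why per-call dynamic allocation in particular is costly, consider the following illustrative microbenchmark. It is a sketch of my own, not anything from the products discussed here; the 64-byte buffer size and the iteration count are arbitrary assumptions. It times a loop that allocates and frees a small buffer on every pass against the same loop reusing one preallocated buffer.

/* Illustrative sketch only: compares per-call heap allocation with a
 * reused, preallocated buffer. Sizes and counts are arbitrary. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define ITERATIONS 10000000L
#define MSG_SIZE   64

int main(void)
{
    volatile unsigned char sink = 0;   /* keeps the work from being optimized away */
    clock_t start;

    /* Variant 1: allocate and free a small buffer on every "call". */
    start = clock();
    for (long i = 0; i < ITERATIONS; i++) {
        unsigned char *msg = malloc(MSG_SIZE);
        memset(msg, (int)(i & 0xFF), MSG_SIZE);
        sink ^= msg[0];
        free(msg);
    }
    printf("malloc/free per call: %.2f s\n",
           (double)(clock() - start) / CLOCKS_PER_SEC);

    /* Variant 2: reuse one buffer allocated up front. */
    unsigned char *buf = malloc(MSG_SIZE);
    start = clock();
    for (long i = 0; i < ITERATIONS; i++) {
        memset(buf, (int)(i & 0xFF), MSG_SIZE);
        sink ^= buf[0];
    }
    printf("preallocated buffer:  %.2f s\n",
           (double)(clock() - start) / CLOCKS_PER_SEC);
    free(buf);

    return 0;
}

On most systems the first loop spends much of its time inside the allocator rather than doing useful work, and that is exactly the kind of overhead that multi-layered libraries multiply.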

Thus, paradoxically, the desktop computers of the '80s initiated a major leap in Wirth's Law, which states that software is getting slower faster than computers are getting faster. Perhaps the first loop of the Wirth's Law spiral was objective: Early CGA and EGA hardware and CPU performance of 12-16 MHz were barely enough for running programs with complicated graphical interfaces. However, the further unraveling of Wirth's Law was completely subjective, in the sense that the subsequent slowdown of software resulted from our attempts to boost programmer productivity by employing various "coding techniques" that promised simplicity at the cost of efficiency.

Software Performance Versus Developer Productivity

Indeed, the first mainframes were programmed directly in machine code and, a bit later, in assembler. Severe memory limitations, the simplicity of the original instruction sets (the PDP-11 is a classic example), and the relative simplicity of the programming tasks at hand resulted in highly efficient, if not bare-bones, code that, unfortunately, was difficult to write. When, in the late 1950s, computers became fast enough to relieve some of the coding burden from the shoulders of programmers, high-level languages such as Fortran and Algol were developed, later followed by C and Ada. While sacrificing code efficiency big time, these high-level languages allowed us to write code faster and thus extract more productivity gains from computers.

As time passed, we kept sacrificing software performance in favor of developer productivity gains, first by adopting object-oriented languages and more recently by settling on garbage-collected memory, runtime-interpreted languages, and 'managed' execution. It is these "developer productivity" gains that kept the pressure on hardware developers to come up with faster and faster processors. So one may say that part of the reason why we ended up with gigahertz-fast CPUs was "dumb" (lazy, uneducated, expensive -- pick your favorite epithet) developers. Even now, a major OS release (such as the upcoming Windows Vista) seems to be the main reason for computer upgrades, because new software runs slower doing the same tasks as the older software it replaces. Of course, the new software usually does much more than the old. So an objective reason for faster computers is the higher expectations and expanded feature set of the new software. After all, there are some mission-critical applications that really demand performance. Database applications on the server side and games on the desktop side are good examples of apps that objectively drive hardware development towards faster performance.

Still, if you look at your desktop OS now, it is unbelievably bloated. For instance, when running Windows XP under normal circumstances you will easily count 50 processes, 500 threads, and only about 5-10 percent CPU utilization. This is what I get as I type this article in Word 2003 on an Athlon XP 3200 under Windows XP. Thus a CPU 10 times less powerful would have satisfied my requirements for browsing and typing equally well... Yet computers in general, and CPUs in particular, keep getting faster and faster, driven both by developer productivity needs and by the requirements of mission-critical applications.
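If you want to reproduce that count on your own machine, the short sketch below (assuming a Windows system, which is what the observation above refers to) walks the Tool Help process snapshot and tallies processes and their threads.

/* Counts running processes and their threads on Windows using the
 * Tool Help snapshot API. */
#include <windows.h>
#include <tlhelp32.h>
#include <stdio.h>

int main(void)
{
    HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0);
    if (snap == INVALID_HANDLE_VALUE)
        return 1;

    PROCESSENTRY32 pe;
    pe.dwSize = sizeof(pe);

    unsigned long processes = 0, threads = 0;
    if (Process32First(snap, &pe)) {
        do {
            processes++;
            threads += pe.cntThreads;   /* per-process thread count */
        } while (Process32Next(snap, &pe));
    }
    CloseHandle(snap);

    printf("%lu processes, %lu threads\n", processes, threads);
    return 0;
}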

Incredibly, a new factor has kicked in that is threatening to curb raw clock speed increases -- runaway power consumption. It is not unusual for a modern CPU to dissipate in excess of 100 Watts, which in the case of data centers translates into tens of millions of dollars in direct power and cooling costs. So on one hand we have a habit of (but rarely a need for) higher performance, and on the other hand we have a looming fossil fuel crisis, global warming, and rising energy prices. Shall we finally stop racing the clock speed?
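To get a feel for how a 100-Watt-class CPU multiplies into a bill of that size, here is a back-of-envelope sketch. The server count, per-server wattage, cooling overhead factor, and electricity price are assumptions picked purely for illustration, not figures from this article.

/* Back-of-envelope data-center power bill. All inputs are illustrative
 * assumptions: 50,000 servers at ~300 W each, a cooling/distribution
 * overhead factor of 2.0, and electricity at $0.08 per kWh. */
#include <stdio.h>

int main(void)
{
    const double servers        = 50000.0;
    const double watts_per_srv  = 300.0;        /* average draw per server, W */
    const double overhead       = 2.0;          /* cooling + power distribution */
    const double price_per_kwh  = 0.08;         /* USD */
    const double hours_per_year = 24.0 * 365.0;

    double it_load_kw  = servers * watts_per_srv / 1000.0;
    double total_kw    = it_load_kw * overhead;
    double annual_cost = total_kw * hours_per_year * price_per_kwh;

    printf("IT load:     %.0f kW\n", it_load_kw);
    printf("Total load:  %.0f kW\n", total_kw);
    printf("Annual cost: $%.1f million\n", annual_cost / 1e6);
    return 0;
}

With these inputs the bill lands around $21 million a year, and every Watt shaved off the CPU is paid for twice -- once at the socket and once more in cooling.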

Apparently, the trend toward higher clock speeds has already been reversed, as AMD's cooler yet efficient Athlon processor managed to win sizeable market share from Intel's hotter, higher-clocked Pentium 4. So how are we going to keep up with performance demands without liberal increases in CPU clock frequency?

Maintaining Performance

Well, there are many ways to maintain performance. The first one -- exploitation of instruction-level parallelism -- resulted in the creation of the superscalar processors we see today. Theoretically, any modern CPU, whether from Intel, AMD, IBM, or Sun, can process and retire multiple instructions per cycle thanks to multiple parallel internal execution units. Funny enough, instruction-level parallelism does not yet allow sustained performance of substantially more than one instruction per cycle (IPC) on general benchmarks, due to memory latency and branch misprediction penalties that stall even the fastest CPUs more than half the time (source: Intel). Only highly optimized tests or special-purpose code is capable of the 3x to 5x performance boost warranted by multiple execution units. Practical gains from architectural improvements in cache coherency or branch prediction amount to a mere 5 percent in general. Long, multi-stage execution pipelines developed to achieve higher clock speeds, combined with inadequate memory performance, created a situation in which the CPU can process data faster than the data can be supplied. So the trend toward higher clock speeds has already reversed in favor of shorter pipelines and better memory throughput. The best example of pipeline shortening is the UltraSparc T1 processor with its six-stage pipeline, as opposed to the 31-stage pipeline of later Pentium 4 models (the Athlon XP has a 10-stage pipeline, and Intel's new "Woodcrest" server chip has only 14). Extrapolating the trend, it is reasonable to expect CPU frequency to remain roughly the same while CPU performance increases through pipeline shortening and an emphasis on memory subsystem improvements.
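The memory latency point is easy to demonstrate. The illustrative sketch below (the array size and timing method are arbitrary choices of mine, not anything prescribed above) first chases pointers through a randomly shuffled array, so that every load depends on the previous one and stalls on main memory, and then sums the same number of elements sequentially, where the execution units stay fed. The first loop typically runs many times slower despite executing a comparable number of instructions.

/* Latency-bound pointer chase vs. throughput-bound sequential scan. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N ((size_t)1 << 23)   /* 8M elements, far larger than the caches */

int main(void)
{
    size_t *next = malloc(N * sizeof(size_t));
    long   *data = malloc(N * sizeof(long));
    if (!next || !data) return 1;

    for (size_t i = 0; i < N; i++) { next[i] = i; data[i] = (long)i; }

    /* Sattolo's algorithm: a single cycle, so the chase visits every
       element exactly once and the prefetcher cannot help. */
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (((size_t)rand() << 15) ^ (size_t)rand()) % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    clock_t start = clock();
    size_t p = 0;
    for (size_t i = 0; i < N; i++) p = next[p];       /* latency-bound */
    double chase = (double)(clock() - start) / CLOCKS_PER_SEC;

    start = clock();
    long sum = 0;
    for (size_t i = 0; i < N; i++) sum += data[i];    /* throughput-bound */
    double scan = (double)(clock() - start) / CLOCKS_PER_SEC;

    printf("pointer chase: %.2f s   sequential sum: %.2f s (p=%zu, sum=%ld)\n",
           chase, scan, p, sum);
    free(next);
    free(data);
    return 0;
}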

Still, there is a hard limit to instruction-level parallelism, which makes it difficult in practice to keep the individual execution units inside a CPU busy. Thus, to improve CPU efficiency, two alternative approaches are currently being pursued. One approach is super-threading (or Hyper-Threading, to use Intel's term), which allows a CPU to process several threads in parallel, switching from one thread to another when a stall occurs. UltraSparc T1 takes this approach to the extreme by executing four threads on each core (32 threads on an 8-core chip), switching threads in round-robin fashion and whenever a stall occurs. While super-threading certainly boosts the performance of multi-threaded applications, speculative threading is being pursued to improve the performance of critical single-threaded applications. Intel is heavily involved in speculative threading research and offers its Mitosis technology, which, with the help of compilers, designates the threads most suitable for speculative execution. AMD is developing similar technology, although the company is more tight-lipped about it. Still, many rumors are circulating about AMD's clandestine "inverse hyper-threading" technology, allegedly capable of uniting two individual CPU cores into a single super-core that would crunch single-threaded applications with a considerable performance boost. Yet the only piece of evidence of AMD's involvement with speculative threading that has surfaced so far is the infamous U.S. patent #6,574,725, which looks like hardware support for speculative threading in the vein of Intel's Mitosis. So with clock-speed increases effectively curbed by power consumption concerns, the performance gains on upcoming CPUs will most likely come from super-threading (server chips) and speculative threading (desktop chips).

There is another approach to boosting instruction-level parallelism, which has been pursued on and off by various commercial and government entities: very long instruction word (VLIW), or explicitly parallel instruction computing (EPIC). The first successful application of the VLIW concept can be traced back to the early 1980s, when a group of Russian engineers led by Boris Babayan (who is now an Intel fellow) developed a series of Elbrus supercomputers produced as part of the anti-ballistic missile defense system deployed around Moscow. The massive performance gains warranted by proper application of the VLIW concept allowed the Elbrus machines to overcome manufacturing and technological limitations and serve their purpose beautifully. Remember, though, that these were special-purpose computers running hand-optimized code.

VLIW

Commercial applications of the VLIW concept in the U.S. were less successful: Multiflow Computer went down in 1990, and Intel's EPIC/Itanium adventure of the late '90s to the present has proved far from successful. The reason for VLIW's failure on general-purpose computers is the lack of compilers, cross-compilers, and automatic code optimization techniques. Intel is still heavily involved in honing EPIC compilers for Itanium (with Babayan's current team and Intel's Israeli office heavily involved). Yet the state of the technology is such that current VLIW/EPIC compilers are not yet good enough for general purposes, and therefore the theoretically possible performance gains are almost never achieved (VLIW processors can execute as many as 32 instructions in parallel if a compiler can find and schedule that many). A more recent attempt by Transmeta was also unsuccessful, and for the same reason, although its newer Efficeon CPU looks more promising than the flopped Crusoe. Still, with the Itanium disappointment tarnishing commercial VLIW prospects perhaps permanently, we are unlikely to see more general-purpose VLIW computers, but instead are likely to see them in niche markets, employed to solve a very limited set of special-purpose tasks.

Quite a different alternative to VLIW that is already sprouting profusely is the multi-core CPU. Both Intel and AMD have been shipping dual-core chips for quite some time now, with quad-core chips promised in 2007. Sun is already shipping 8-core UltraSparc T1 chips, while Rapport Inc. and IBM have announced the development of Kilocore technology, which allows combining as many as 1,024 8-bit processors with a PowerPC core on a single low-cost chip. Extrapolating current trends, we are likely to see a further profusion of multi-core CPUs from all leading manufacturers, especially for the server markets. Chances are that as the number of on-chip cores grows, the cores themselves will become simpler and less deeply pipelined (much as UltraSparc T1 does already). We are also likely to see some dedicated, co-processor-like cores suitable for performing SIMD/multimedia instructions, while other cores may be deprived of such capacity in favor of improved energy efficiency and an increased overall number of cores.
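As a minimal sketch of the kind of workload that benefits from extra cores (POSIX threads assumed here; the array size is an arbitrary choice), the program below splits a summation across two worker threads. On a dual-core chip the two halves proceed concurrently, which is precisely the thread-level parallelism that multi-core and super-threaded designs are built to exploit; in practice memory bandwidth, not core count, often caps the speedup.

/* Parallel summation across two worker threads (compile with -pthread). */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define N 20000000L

typedef struct { const int *data; long begin, end; long long sum; } chunk_t;

static void *partial_sum(void *arg)
{
    chunk_t *c = arg;
    long long s = 0;
    for (long i = c->begin; i < c->end; i++) s += c->data[i];
    c->sum = s;
    return NULL;
}

int main(void)
{
    int *data = malloc(N * sizeof(int));
    if (!data) return 1;
    for (long i = 0; i < N; i++) data[i] = 1;

    pthread_t t[2];
    chunk_t c[2] = {
        { data, 0,     N / 2, 0 },   /* first half  */
        { data, N / 2, N,     0 }    /* second half */
    };
    for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, partial_sum, &c[i]);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);

    printf("total = %lld\n", c[0].sum + c[1].sum);
    free(data);
    return 0;
}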

Perhaps the most noteworthy point is that we are unlikely to see dramatic single-threaded performance improvements unless a way to increase frequency is found that does not result in a marked increase in power consumption (for example, new manufacturing technology in the vein of IBM's recent report of experimental SiGe chips running at 350 GHz at room temperature and at 500 GHz when chilled by liquid helium).

And the truth is that there is no compelling need for further raw CPU speed increases.

What is amazing is that for a long time we have been using only a handful of CPU models under the aegis of general-purpose computing. Furthermore, we assumed that a better CPU makes a better computer, which is no longer so. What seems more important now is overall system design rather than just CPU design, and we are likely to see more system and CPU specialization (and more models) targeting different application areas.

Emerging Processor Lines

We already see three major lines of processors targeting the mobile, desktop, and server markets. This trend is likely to continue and result in the appearance of even more processor lines optimized not only for various segments but for various applications or intended uses as well. For instance, for application servers we may see Intel and AMD delivering massively multi-core CPUs with good integer capacity and dedicated encryption/decryption hardware in the vein of Sun's UltraSparc, while in the mobile market we may see stripped-down, extra-low-power CPUs that ensure very long battery life, perhaps with finer frequency scaling than the coarse-grained scaling AMD's PowerNow! technology provides now. There certainly seems to be room for lower-performance CPUs in ultra-mobile computers, since most of them are used for reading, browsing, and other simple tasks that do not require much CPU power (specific tasks such as multimedia encoding/decoding and 3D graphics are already partially offloaded to dedicated hardware and are likely to be even more confined to specialized chips in the future).

So, focusing on the mobile CPU market, it is clear that power efficiency -- not only of the CPU but of the entire system -- is likely to be much more important than raw processor speed. After all, most mobile users are not likely to exploit potential CPU performance to the fullest extent unless we throw really bad code at them. The near-commodity pricing of computational power today means that consumers can afford to buy more and more specialized hardware better suited to a particular purpose, thus fulfilling Bill Gates' vision of computers in every pocket. This is in fact already happening, as we all grab iPods, cell phones, PDAs, and BlackBerry devices to complement our laptops and desktop PCs. No more one-size-fits-all. That is the most certain prediction one can make about future CPUs: We shall see more and more specialized models, and not necessarily more powerful ones. Thus, as far as the mobile market is concerned, we might see CPUs with more finely grained frequency control that responds to idle time, variable-rotation-rate hard drives, and CPUs possibly stripped of some advanced features, such as enhanced multimedia processing instructions, in favor of dedicated hardware performing those tasks.
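For a concrete view of what such idle-driven frequency control looks like from the software side, the sketch below reads the frequency scaling settings that a Linux kernel exposes through its cpufreq interface under /sys (the choice of Linux is my assumption for illustration; the article does not prescribe a platform). On a laptop with an on-demand style governor, the reported frequency drops whenever the machine goes idle.

/* Prints the current cpufreq governor and frequency limits for CPU 0,
 * assuming a Linux system with the cpufreq subsystem under /sys. */
#include <stdio.h>

static void print_file(const char *label, const char *path)
{
    char buf[128];
    FILE *f = fopen(path, "r");
    if (f && fgets(buf, sizeof(buf), f))
        printf("%s: %s", label, buf);    /* sysfs values end with a newline */
    if (f) fclose(f);
}

int main(void)
{
    print_file("governor     ",
               "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor");
    print_file("current (kHz)",
               "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");
    print_file("min (kHz)    ",
               "/sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq");
    print_file("max (kHz)    ",
               "/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq");
    return 0;
}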

In fact, AMD is already taking steps toward such specialization with its upcoming 4x4 platform and an open specification enabling third-party co-processor design. In the long term it makes little sense to burden the CPU with DVD playback or SSL encryption. These and similar tasks should, and in time will, be handled completely by dedicated hardware that is far more efficient (power- and performance-wise) than the CPU. A further variety of co-processors will allow enhanced physics and environmental effects for gaming enthusiasts and improved performance for scientific and multimedia applications. Thus the role of the CPU is likely to diminish with time, leaving little reason for further clock-speed improvement.

Frankly, the role of the CPU as a jack-of-all-trades started to wane with the advent of GPUs. 3D graphics was once the most compelling reason to boost CPU power. Now PCs typically have a dedicated processor (or two, in the case of AMD's 4x4 platform) that is far better suited for the task. Similarly, most music/multimedia hardware relies on its own expansion boards outfitted with custom logic or DSP processors (take ProTools or Creamware products, for example). In time we are likely to end up with a motherboard design containing numerous specialized chips or co-processors, each designed with a single task in mind. So in this respect we are back to the single-purpose computing we started with, although this return is merely a new loop in the spiral.

Ironically, the return to special-purpose computing further relaxes the requirements for higher processor performance: Special-purpose code is usually better optimized and thus can perform equally well on much slower CPUs. In reality, most hand-held devices are powered by CPUs of a few hundred MHz that are capable of providing an experience similar (save for the small screen and tight keypad) to what we get from our gigahertz-fast desktop PCs. Similarly, specially designed DSPs are far better at MPEG playback or sound processing than a general-purpose CPU doing the same work at multiple gigahertz.

In other words, what is likely to happen is that CPU frequency increases will become very modest in the near future. As hardware manufacturers compete for markets, we are likely to see less and less general-purpose hardware and more and more specialized hardware for various purposes. Perhaps in 10 years today's Athlon and Xeon CPUs will seem like dinosaurs -- hot, big, and less than bright -- with the role of the CPU in the computer reduced from do-it-all-yourself to coordinate-the-work-of-others.

Conclusion

There is another compelling reason to believe that big, hot, and insanely fast CPUs will die out through natural selection. As people become more and more aware of "green" concepts and conscious of power consumption, our eyes will finally open to the extremely bloated code that our GHz-rated CPUs execute at the same effective rate as the MHz-rated processors in specialized devices. The proper question to ask will be "How much power does your software require?", where power means electricity, with the implication of its high cost. Indeed, that would mean that slow and bloated software is expensive software, for it requires the CPU to run at full blast. To make this point clearer, think of a datacenter with a thousand blade servers, each server sporting several CPUs and hard disks. The bloated and slow software we have today implies that the operating cost of the datacenter is high, for it needs a thousand blade servers, a thousand terabyte disks, and gigabytes upon gigabytes of memory, with cooling and power costs of $10 million a year. Now what if we were to optimize our software to reduce RAM, disk, and CPU performance requirements by an order of magnitude (which is easily achieved if we scrap interpreted and otherwise "managed" code with its inefficient memory management model and multi-layered libraries, and invest in compiler and optimizer development) and reduce the number of servers ten times? Or instead replace huge blade servers with gigahertz CPUs with compact, pocket-size microblades outfitted with megahertz-rated CPUs, a few megabytes of RAM, and a microdrive?

Needless to say, there is ample room for software optimization that has been ignored for decades, since increases in CPU performance allowed us to neglect it. Yet now the situation with energy resources is such that slow and bloated software means higher costs, both directly in the electric power required by the CPU to run it and indirectly in the power consumed by RAM, enormous hard drives, and cumulative cooling. Furthermore, the recent tendency to aggregate multiple software components on a shared computational resource (that is, a server) under the control of a multitasking OS should be reversed in favor of completely isolated software components running on low-power dedicated hardware. Thus, if we do begin optimizing our code, we are likely to see blade server racks replaced with microblade server racks, where each microblade performs a dedicated task and consumes less power, and where the total number of microblades is much greater than the number of the initial "macro" blades.

Indeed, such complete isolation of the software components (database instances, web applications, network services, and the like) that are currently squeezed together on the same server should greatly improve system robustness, thanks to the possibility of real-time component hot-swap or upgrade, and should completely eliminate the software installation, deployment, and patch conflicts that plague large servers today.

When and if that happens depends on two factors: energy costs and code optimization efficiency. The former drives the latter. Therefore, a further increase in energy prices is likely to result in a gradual reduction of the role of the CPU in the computer system, more optimized code, and a return towards a single-processor, single-task, special-purpose computing paradigm. On the other hand, this vision may never materialize if a technological breakthrough occurs on the manufacturing side that allows further CPU speed increases without increased energy dissipation (quantum computing, advances in superconductors, photonics, and so on). However, one thing is clear -- the role of raw CPU performance is definitely waning, and if a radical new technology fails to materialize quickly, we will be compelled to write more efficient code for reasons of power consumption and cost.
