The Software Optimization Cookbook
Richard Gerber
280 pages + CD-ROM
Intel Press
$49.95
ISBN: 0971288712
www.intel.com/intelpress/swoptcookbook/
The Software Optimization Cookbook, by Richard Gerber, is a deep and insightful look into how the implementation details of your microprocessor interact with C programming techniques. Specifically, the optimizations address algorithms and how they interact with memory access, branching, SIMD instructions, multiple threads, and floating-point calculations. As you might expect, many of the optimizations are CPU specific and as such are tuned to the Pentium 4 architecture. In fact, a better title for this book might have been Optimizing Pentium 4 in C. With the Pentium 4 gaining market share (according to News.com, Pentium 4 unit shipments more than doubled during the fourth quarter of 2001 to about 15 million), now is the time to tune your app for performance on this architecture. Since this is an Intel Press book, it of course fails to mention AMD at all.
It's my guess that a lot of programmers know little more about their CPU than its clock speed and would be hard pressed to describe the mechanics of the "L2" cache, for example. As such, the early chapters provide a thorough description of caches, pipelines, branch prediction, and other components of modern microprocessors. Until you understand how cache lines are organized and sized, you may be unintentionally writing code that is less than cache friendly.
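To make the cache-line point concrete, here is a minimal sketch (my own illustration, not an example from the book) that sums the same two-dimensional grid in two orders. The function names and the flat row-major layout are assumptions chosen for the example. Both functions return the same total; they differ only in how they walk memory.

```c
#include <assert.h>
#include <stddef.h>

/* The grid is stored row by row in one flat array:
   element (r, c) lives at grid[r * cols + c]. */

/* Cache-friendly: the inner loop walks adjacent addresses, so every
   element of a fetched cache line is used before the next line is needed. */
long sum_row_major(const int *grid, size_t rows, size_t cols)
{
    assert(grid != NULL);
    long total = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            total += grid[r * cols + c];
    return total;
}

/* Less cache-friendly: the inner loop jumps cols * sizeof(int) bytes
   per step, so on a large grid most accesses land on a different
   cache line than the previous one. */
long sum_col_major(const int *grid, size_t rows, size_t cols)
{
    assert(grid != NULL);
    long total = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            total += grid[r * cols + c];
    return total;
}
```

On a grid much larger than the cache, the row-major version is usually the faster of the two, but the size of the gap depends on the CPU, the cache sizes, and the compiler, so it is worth measuring rather than assuming.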
Since few C/C++ programmers concern themselves with assembly language details on a regular basis, Gerber always supplies examples in C where possible. However, readers who have taken an introductory college-level course on assembly language will get the most out of the later chapters in this book. I quickly realized how much has changed since I wrote an 80286 CPU simulator in grad school some 15 years ago! With fewer developers coding in assembly each year, there are extremely few current instruction set reference books on the shelves. Fortunately, the accompanying CD-ROM contains the "IA-32 Intel Architecture Software Developer's Manual Volume 2: Instruction Set Reference," a 984-page reference book.
The Performance Issues section forms the core of the book and surveys how algorithms, branching, memory, loops, slow operations, and floating point impact performance. It then continues with new techniques to try such as SIMD, processor-specific optimizations, and parallel programming. Though I knew that multiple pipelines were critical to high-performance CPUs, I was surprised to learn that a little bit of loop unrolling could greatly increase instruction parallelism and thus avoid idling pipelines. In another case, Gerber demonstrates how trying to use branches to avoid doing more work can actually reduce overall performance because the load of mispredicted branches it produces outweighs the savings in reduced calculations. In both cases, the optimizations seemed counter-intuitive because what looks efficient in C may in fact be a hindrance to the CPU.
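Both effects can be sketched in a few lines of C. The code below is my own illustration under stated assumptions, not code from the book: the function names, the unroll factor of four, and the threshold test are all hypothetical. The first pair shows unrolling with independent accumulators, which gives the CPU several additions it can overlap instead of one long dependency chain. The second pair shows trading a data-dependent branch for a little extra arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: each addition must wait for the previous one to finish. */
double sum_simple(const double *data, size_t n)
{
    assert(data != NULL);
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

/* Unrolled by four with separate accumulators: the four additions in
   the loop body do not depend on each other, so they can be in flight
   at the same time. Floating-point rounding may differ slightly from
   sum_simple because the additions are grouped differently. */
double sum_unrolled4(const double *data, size_t n)
{
    assert(data != NULL);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    double total = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)          /* handle the 0 to 3 leftover elements */
        total += data[i];
    return total;
}

/* Branchy: skips the addition for small values, but on unpredictable
   data the branch itself may be mispredicted often. */
long sum_above_branchy(const int *v, size_t n, int threshold)
{
    assert(v != NULL);
    long total = 0;
    for (size_t i = 0; i < n; i++)
        if (v[i] >= threshold)
            total += v[i];
    return total;
}

/* Branch-free: always does the multiply and add. The comparison
   evaluates to 0 or 1, so values below the threshold contribute 0. */
long sum_above_branchless(const int *v, size_t n, int threshold)
{
    assert(v != NULL);
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += (long)v[i] * (v[i] >= threshold);
    return total;
}
```

Whether either rewrite actually wins depends on the data, the processor, and the compiler, which may already unroll loops or emit conditional moves on its own. Treat these as candidates to measure with a profiler, not guaranteed speedups.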
It's impossible to tune your code or even to gauge which part of your code is in need of help without a performance analyzer. The most basic of these tools is a "profiler" such as GNU gprof or Compuware's TrueTime. A profiler simply reports what percentage of CPU time is spent in each function, which identifies the "hotspots." To understand why a particular function may be using up so much CPU, you need to drill deeper with a tool such as Intel's VTune. VTune can report on more than 100 specific event counters indicating potential problems such as mispredicted branches, misaligned data accesses, cache load misses, idle time, floating-point operations retired, resource stalls, and dozens more. You might be wary of the text as a sales pitch for VTune, but the tool does provide a unique and comprehensive view of CPU events. In the book, VTune is used mainly to demonstrate particular bottlenecks, such as how to spot L1 cache misses. Since the VTune screenshots are all black and white, I found the shades of gray sometimes hard to discriminate.
The CD-ROM actually includes the entire three-volume set of IA-32 Software Developer's Manuals as PDFs (totaling a whopping 2000 pages). These manuals are also available as a free download from developer.intel.com. Also included are two other manuals devoted to optimizing Pentium 4 code (another 600 pages), which could be considered the theoretical basis for the practical advice Gerber provides. These latter books may mostly be of interest to compiler writers or device driver specialists. Last, the CD-ROM includes a coupon for $50 off the Intel C++ compiler or VTune, valid until the end of 2002. Since all this occupies a mere 22 MB, I was actually hoping to find an evaluation copy of VTune on the CD-ROM. Fortunately, a 30-day free trial edition (34 MB) is also available for download at developer.intel.com.
I would recommend The Software Optimization Cookbook to any C/C++ developer who wants to improve performance of an existing application or, better yet, to design a new app with performance foremost in mind. Only a little "assembly" is required. Developers working in interpreted virtual machine environments such as Java, VB, C#, or .NET won't benefit much from this book since their world squarely faces the virtual machine rather than the CPU.
Victor R. Volkman received a BS in Computer Science from Michigan Technological University. He has been a frequent contributor to Windows Developer Magazine since 1990. He is the author of C/C++ Treasure Chest (CMP Books, 1998), which includes 300 products on CD-ROM. He can be reached through http://www.HAL9K.com/.