Daniele Paolo Scarpazza is a member of the Multicore Computing Department at the IBM T. J. Watson Research Center. He is the author of articles and scholarly papers on parallel algorithms and performance optimization on emerging multicore architectures.
Larrabee, the new multicore processor that Intel will release in the coming months, is going to catch most programmers unprepared. For one thing, it is difficult to estimate its potential on the basis of the very few details available. And developers of non-numerical applications, traditionally a secondary audience for GPGPU architects, are going to face an even steeper programming challenge.
In this article, I consider the example of regular-expression matching, arguably one of the hardest non-numerical workloads to parallelize, and map it onto LRBni, Larrabee's instruction set. I show that a good mapping of this code onto LRBni is possible, but only at the cost of a radical data-parallel redesign. Indeed, the data-parallel techniques that I learned on earlier SIMD (Single Instruction, Multiple Data) architectures become even more effective on Larrabee. Because of Larrabee's wide SIMD nature, developers who don't adopt a similar data-parallel approach risk leaving most of the processor idle, and thus up to 93% of its compute power on the table.
Unfortunately, at this point, nobody (outside Intel) can estimate how fast Larrabee will run the code I provide. With such complex instructions, potentially translated into dozens of serialized micro-instructions, and involving multiple potential cache misses, it will be a challenge for Intel's engineers to match the throughput and power efficiency of more streamlined RISC architectures featuring higher clock rates and simpler instructions.
Why Regular Expressions?
Regular expressions (RE) are those patterns that you feed to grep on the command line when searching log files for messages matching a given template, the ones that you provide to flex when designing the tokenizer for your newly invented programming language, the ones that tell network intrusion detection systems like Snort what malicious traffic looks like, etc.
You might wonder why, among many possible workloads I could examine, I chose regular expression matching. Here are five good reasons:
- RE matching is the core of common and emerging applications like search-engine indexing, XML processing, application-level intrusion detection, and language analytics. In all of these, it is an important (if not the most important) consumer of execution time and hardware resources.
- RE matching relies on finite-state machines, a class of workloads notoriously hard to parallelize. In his famous presentation -- almost a manifesto -- on parallel computing research, David Patterson mentions finite-state machines (FSMs) as the #1 of his "dwarfs", i.e., workloads that deserve attention and effort.
- Larrabee is a SIMD architecture. On a SIMD architecture, each register contains multiple operands, and a SIMD instruction can process all those operands in parallel. We are used to 128-bit registers, divided into four 32-bit lanes. Larrabee has 512-bit registers, divided into 16 lanes. Compilers traditionally don't do a very good job of translating irregular, non-numerical code into SIMD code. If you want good code, chances are you have to write SIMD code yourself, by hand. If you don't, you might leave up to 75% of the performance on the table on a contemporary 128-bit SIMD architecture. On Larrabee, the potential loss is even higher: you might be using only 1/16 of the compute power (i.e., wasting 93.75%).
- Presenting a realistic application in depth in this short space (and within the short attention span of a busy programmer) is a challenge. RE matching boils down to a handful of instructions, making this possible.
- RE matching is not your usual GPU workload. GPU architects catered primarily to the gaming and computer-graphics communities. More recently, the scientific and supercomputing communities have gained some attention by showing that GPGPUs can accelerate dense, numerical algorithms from domains such as fluid dynamics, protein folding, and acoustics. But "the rest of us" -- developers of non-numerical applications -- are still at the door, waiting. I use this example to show a way to exploit GPGPUs to carry out jobs previously thought "hard" to parallelize. There is no reason why, in the near future, most desktop applications (and not just games or graphics codes) should not offload their heavyweight kernels to GPGPUs.
The main message of this article is that the programming challenges that LRBni brings about are not trivial, but they are also not new, and they can be attacked successfully with data-parallel techniques that have proved effective in the past. I propose an approach based on the complete elimination of control flow from the compute kernels (the hot spots of your application) and its transformation into data flow.
Thinking in terms of branch-less data flows is the most natural way to take advantage of SIMD. In the past, I applied this approach to the same RE matching workload I present here, while designing a SIMD-based tokenizer kernel, intended for use in a search-engine indexer, on the IBM Cell Broadband Engine processor. That kernel achieved 99.3% cycle utilization; it is the fastest software tokenizer available on any architecture, and it delivers, with some restrictions, a level of throughput comparable to dedicated hardware.
Much Ado About Something?
Larrabee is surrounded by hype, mystery, and controversy. More than a year has passed since Larrabee was first announced to the world in the SIGGRAPH 2008 paper Larrabee: A Many-Core x86 Architecture for Visual Computing, and still nobody outside Intel has seen silicon or, for that matter, a cycle-accurate simulator. Nevertheless, the anticipation keeps growing. While Intel keeps parceling out details, and competitors like nVidia react with nervous anxiety and criticism, most developers are left in the dark, wondering whether it makes sense to commit resources and attention to the new platform, in the hope of accelerating most applications, or whether Larrabee will just work well with regular, numeric, array-based workloads (like gaming graphics and physics) and won't be a true game-changer for everyone else.
A stream of press releases, rumors, and blog posts has created more confusion than clarity. Besides, we developers don't much enjoy rummaging through the news; we are much happier with final specifications, reliable benchmarks, and stable tools to play with. So, in this sea of dubious, fragmented, and sometimes contradictory information, two events stand out. The first is Michael Abrash's article A First Look at the Larrabee New Instructions (LRBni) in Dr. Dobb's, which finally detailed the new LRBni instructions. That article is, I bet, on its way to becoming the most cited non-peer-reviewed publication among academic, peer-reviewed papers of 2009-2010 in the high-performance computing field. The second is the release, last June, of Intel's C++ Larrabee Prototype Library, a simple header file that -- all the no-warranty disclaimers factored in -- provides the actual prototypes of the intrinsics associated with the new instructions, along with reference semantics detailed in plain C code. This prototype library allows you to write, today, code that will run without major changes on the real hardware. The promise is that some day a compiler will read these intrinsics and emit the corresponding LRBni instructions. This makes the library a convenient prototyping tool, especially if you have limited experience writing SIMD code.
Unfortunately, the library has a major weakness: it won't give you the slightest hint about the performance of the code you write. If you work in high-performance computing, you might be accustomed to profiling and simulating your code, estimating its latency and throughput, and measuring indicators like cache-miss ratios and Clocks Per Instruction (CPI). This library allows none of that.
This is, alas, a big question mark. With instructions of unprecedented complexity, which may decompose into hundreds of micro-instructions and entail up to 16 memory references, each taking as many as 800 clock cycles to complete in the case of a cache miss, estimating the performance of LRBni code on paper is a major challenge.