Programming the Cell: Rethinking Bottom-Up
The Cell is so different from traditional processors that you can't get the best out of it unless your code explicitly takes advantage of its hardware. The Cell contains nine processors: one Power Processing Element (PPE), a 64-bit PowerPC-like processor mainly used to run the operating system and load the tasks, and eight worker elements called "Synergistic Processing Elements" (SPEs).
SPEs have 128-bit-wide registers and Single Instruction Multiple Data (SIMD) instructions. Because their branch prediction is rudimentary, jumps are expensive and your code should avoid them, possibly replacing them with speculative execution. Additionally, SPEs cannot access RAM directly and have no cache. Instead, each has a small 256-KB memory called the "local store" (LS), and transfers data to/from main memory with asynchronous DMA transfers. By cleverly scheduling DMA transfers in your code (with, for instance, double buffering), you can hide part or all of the transfer latency. See our article "Programming the Cell" (www.ddj.com/dept/64bit/197801624).
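To make the double-buffering idea concrete, here is a minimal C sketch (not actual SPE code): a plain memcpy stands in for the asynchronous MFC DMA transfer, and the chunk size, buffer names, and the scan/process functions are all hypothetical. The point is the schedule: the fetch of chunk N+1 is issued before the computation on chunk N begins, so on real hardware the two would overlap.

```c
#include <string.h>

#define CHUNK 16384  /* bytes per transfer; an illustrative size, not tuned */

/* Stand-in for an asynchronous DMA get; on a real SPE this would issue
   an MFC transfer and return immediately instead of copying inline. */
static void dma_get(char *ls_buf, const char *ram, size_t n) {
    memcpy(ls_buf, ram, n);
}

/* Example per-chunk computation: sum the bytes. */
static long process(const char *buf, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (unsigned char)buf[i];
    return sum;
}

/* Walk a large buffer in main memory using two local-store buffers:
   while one is being processed, the next transfer is already in flight. */
long scan(const char *ram, size_t total) {
    static char ls[2][CHUNK];              /* the two local-store buffers */
    int cur = 0;
    size_t done = 0;
    size_t n = total < CHUNK ? total : CHUNK;
    dma_get(ls[cur], ram, n);              /* prefetch the first chunk */
    long sum = 0;
    while (done < total) {
        size_t next_off = done + n, next_n = 0;
        if (next_off < total) {            /* start fetching the next chunk */
            next_n = total - next_off;
            if (next_n > CHUNK) next_n = CHUNK;
            dma_get(ls[cur ^ 1], ram + next_off, next_n);
        }
        sum += process(ls[cur], n);        /* compute on the current chunk */
        done += n;
        n = next_n;
        cur ^= 1;                          /* swap the two buffers */
    }
    return sum;
}
```

On a real SPE the `process()` call would be preceded by a wait on the tag group of the *current* buffer's transfer, not the next one's; that wait is what double buffering amortizes.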
These hardware features revolutionize how you write code. Programmers have been told since Programming 101 to develop code in a top-down fashion. Now on the Cell, we are forced to look down to the bottom and make both ends meet. Hardware constraints strongly influence the algorithm.
Searching Strings: One Problem, Too Many Solutions
Whether you are scanning files for viruses, performing full-text searches on a large collection of articles, or detecting intrusions on network traffic, you're doing string searching. Given a text and a dictionary (a set of patterns), your task is to find all occurrences of any of the patterns in the text.
String searching is well-trodden ground, with many proposed algorithms. We employ Aho-Corasick, which is famous for appearing in the original "fgrep" UNIX utility. Despite its age (it was presented in 1975), this algorithm is simple and appears more suitable to the Cell than others (Knuth-Morris-Pratt, Boyer-Moore, Commentz-Walter, Wu-Manber, and the like). In fact, Aho-Corasick can be expressed as a branchless data flow that best exploits the SIMD capabilities of the SPEs. The later algorithms, by contrast, have more complex control flows, which suffer from frequent and large branch misprediction penalties when run on an SPE.
The variant of Aho-Corasick we employ is based on a Deterministic Finite Automaton (DFA), and is fast for two reasons:
- The DFA is based on a keyword tree (a trie). A trie has a path of labeled edges for each pattern in the dictionary. To process an input character, you simply choose the outgoing edge labeled with that character and jump to the node at its far end.
- In a DFA, no special rollback actions are required in case of mismatches. A non-matching input character triggers a simple state transition, just like any other character. For more information, see "The Aho-Corasick Algorithm" (en.wikipedia.org/wiki/Aho-Corasick_algorithm).
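The two properties above can be sketched in plain C (this is an illustration, not the SIMD-optimized SPE code). `build()` constructs the trie and then, with a breadth-first pass over failure links, fills in every missing edge so that the result is a full DFA; `search()` then advances with one branchless table lookup per input character, whether it matches or not. The table sizes and the 128-symbol alphabet are illustrative choices.

```c
#include <string.h>

#define ALPHA 128       /* illustrative: 7-bit ASCII alphabet */
#define MAXSTATES 256   /* illustrative cap on trie size */

/* nxt[s][c] is the next state on character c; out[s] is a bitmask of
   which patterns end at (or are suffixes of the path to) state s. */
static int nxt[MAXSTATES][ALPHA];
static unsigned out[MAXSTATES];

/* Build the Aho-Corasick DFA for npat patterns (npat <= 32). */
void build(const char **pats, int npat) {
    memset(nxt, 0, sizeof nxt);
    memset(out, 0, sizeof out);
    int nstates = 1;                        /* state 0 is the root */

    /* Phase 1: the keyword tree (trie); 0 marks a missing edge. */
    for (int i = 0; i < npat; i++) {
        int s = 0;
        for (const char *p = pats[i]; *p; p++) {
            int c = (unsigned char)*p;
            if (nxt[s][c] == 0)
                nxt[s][c] = nstates++;
            s = nxt[s][c];
        }
        out[s] |= 1u << i;
    }

    /* Phase 2: BFS over the trie, computing failure links and
       replacing every missing edge with the state the failure
       function would eventually reach -- yielding a full DFA. */
    int fail[MAXSTATES], queue[MAXSTATES], head = 0, tail = 0;
    for (int c = 0; c < ALPHA; c++) {
        int s = nxt[0][c];
        if (s) { fail[s] = 0; queue[tail++] = s; }
        /* missing edges from the root already self-loop to 0 */
    }
    while (head < tail) {
        int s = queue[head++];
        out[s] |= out[fail[s]];             /* inherit suffix matches */
        for (int c = 0; c < ALPHA; c++) {
            int t = nxt[s][c];
            if (t) {
                fail[t] = nxt[fail[s]][c];
                queue[tail++] = t;
            } else {
                nxt[s][c] = nxt[fail[s]][c];  /* no rollback needed later */
            }
        }
    }
}

/* Scan text; return a bitmask of all patterns that occur in it.
   Note the loop body: one table lookup per character, no branches
   beyond the loop condition itself. */
unsigned search(const char *text) {
    unsigned found = 0;
    int s = 0;
    for (const char *p = text; *p; p++) {
        s = nxt[s][(unsigned char)*p];
        found |= out[s];
    }
    return found;
}
```

Because phase 2 precomputes every transition, a mismatch at search time costs exactly the same as a match: one load from the transition table. That uniform, branch-free inner loop is what makes this variant a good fit for the SPEs.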