Why Aho-Corasick and the Cell Make a Good Match
The DFA-based Aho-Corasick algorithm suits the Cell very well. Processors with SIMD instructions (like the SPE) offer abundant data-level parallelism, which you can exploit by running multiple DFAs together. Moreover, the many (128) registers available allow extensive loop unrolling and code replication.
A single instance of a DFA is not sufficient to satisfy this hunger for parallelism. The most natural way to parallelize the algorithm (and thus multiply throughput) is to employ multiple instances of the automaton. All the instances act as the same automaton (that is, they share the same state transition table), but each is fed a distinct chunk of the input text; see Figure 1. The chunks partially overlap; otherwise, occurrences of a pattern spanning a chunk boundary would be missed.
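For illustration, the chunk partitioning can be sketched as follows. The helper name and the choice of overlap (one byte less than the longest pattern, so any occurrence crossing a boundary falls entirely within one chunk) are our own assumptions, not details from the article:

```c
#include <stddef.h>

/* Illustrative helper: compute the start offset and length of the i-th of
 * n_chunks chunks, overlapping each neighbor by (max_pattern_len - 1)
 * bytes so that no cross-boundary occurrence is lost. */
static void chunk_bounds(size_t text_len, int n_chunks, int i,
                         size_t max_pattern_len,
                         size_t *start, size_t *len)
{
    size_t base = text_len / n_chunks;      /* nominal chunk size        */
    size_t overlap = max_pattern_len - 1;   /* text shared with neighbor */
    *start = (size_t)i * base;
    *len = (i == n_chunks - 1) ? text_len - *start : base + overlap;
    if (*start + *len > text_len)           /* clamp at end of text      */
        *len = text_len - *start;
}
```

With this scheme, a pattern of length at most `max_pattern_len` that straddles the boundary between chunks i and i+1 is still fully contained in chunk i.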
Because a DFA performs exactly one state transition per consumed input character, multiple DFAs are naturally synchronized: they start and finish processing a given quantity of text at the same time. Consequently, they can share a single program control structure. This is a great advantage: we can fuse multiple DFAs into single, long basic blocks, which are crucial for performance on SPEs because they mitigate branch overhead and give the compiler enough elbow room during instruction scheduling. The result is fewer stalls, hidden load/store latencies, better exploitation of the dual pipelines, and lower Cycles Per Instruction (CPI) values.
Our Fast Solution
The smaller the dictionary, the smaller the corresponding State Transition Table (STT). If the STT is small enough, we can keep it entirely in the local store rather than in main memory. This speeds up the algorithm because main memory accesses are slow. Our fast implementation is based on the simple idea of keeping the STT in the local store.
As we anticipated, a single automaton poorly utilizes the SPE hardware. Its implementation uses only four registers out of the 127 available, and no SIMD instructions. Moreover, an SPE has two pipelines and could issue two instructions at the same time if they do not conflict, but this implementation does not exploit this feature. As a result, the CPI is high (2.6) and throughput is low (1.35 Gbps).
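The baseline kernel is essentially one table lookup per input character. A minimal scalar sketch follows; the row-major table layout and the high-bit match-flag convention are illustrative assumptions, not the article's actual encoding:

```c
#include <stdint.h>
#include <stddef.h>

#define ALPHABET 32  /* compacted 32-character alphabet, as in the article */

/* One scalar DFA pass: exactly one table lookup per input byte.
 * stt[state * ALPHABET + symbol] holds the next state; as an assumed
 * convention, states with the high bit set signal a pattern match. */
static uint32_t run_dfa(const uint32_t *stt, const uint8_t *text,
                        size_t len, uint32_t state, size_t *matches)
{
    for (size_t i = 0; i < len; i++) {
        state = stt[(state & 0x7fffffffu) * ALPHABET + text[i]];
        if (state & 0x80000000u)   /* match flag (illustrative) */
            (*matches)++;
    }
    return state;
}
```

Only a handful of live values (state, input pointer, counters) are needed, which is why this version occupies just four registers and leaves the SIMD units idle.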
But if we fuse 16 automata together, things improve considerably, because with SIMD instructions our code can process data from four automata at the same time. This implementation makes better use of the registers (40 out of 127) and the two pipelines (43.8 percent of the issues are dual), reaching a CPI of 0.67.
This is a big improvement, but the code still stalls 7.4 percent of the cycles. To get rid of these stalls, we unroll the code manually, creating even longer basic blocks. This gives the compiler more freedom to reschedule instructions and fill the stalls. We found experimentally that the best results occur when we unroll the main loop three times. The resulting code uses almost all the registers (124 out of 127) without spilling, has a high dual-issue rate, and exhibits no stalls. This yields a remarkable throughput of 5.11 Gbps.
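In scalar C, the fused loop body looks roughly like the sketch below: several automata share one induction variable and one branch, and their independent lookups can be scheduled side by side. The real SPE kernel goes further, packing lookups for 16 automata into SIMD instructions and unrolling this loop three times; the table layout here is again an assumption:

```c
#include <stdint.h>
#include <stddef.h>

#define ALPHABET 32

/* Four automata fused into one loop body. The four transitions are
 * independent, so the compiler can interleave their loads and adds
 * across the two pipelines instead of waiting on a single chain. */
static void run_dfa4(const uint32_t *stt,
                     const uint8_t *c0, const uint8_t *c1,
                     const uint8_t *c2, const uint8_t *c3,
                     size_t len, uint32_t s[4])
{
    for (size_t i = 0; i < len; i++) {
        s[0] = stt[s[0] * ALPHABET + c0[i]];
        s[1] = stt[s[1] * ALPHABET + c1[i]];
        s[2] = stt[s[2] * ALPHABET + c2[i]];
        s[3] = stt[s[3] * ALPHABET + c3[i]];
    }
}
```

Because every automaton consumes exactly one character per iteration, no per-automaton control flow is needed; one loop counter and one branch serve all of them.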
Loading text from main memory does not get in the way: we adopt a double-buffering scheme that loads the next chunk while the current one is being processed. This completely hides the load latency, because transfers happen asynchronously and take less time than processing: loading a 16-KB text chunk takes 5.94 microseconds, whereas processing it takes 25.64 microseconds.
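The double-buffering control flow can be sketched as follows. On the SPE the fetch would be an asynchronous DMA (an `mfc_get` followed by a tag-group wait); in this portable sketch `memcpy` stands in for the transfer, and the byte sum stands in for the DFA pass:

```c
#include <string.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK 16384  /* 16-KB text chunks, as in the article */

/* Double-buffering skeleton (illustrative): while one buffer is being
 * processed, the next chunk is fetched into the other buffer. */
static uint64_t scan_text(const uint8_t *text, size_t len)
{
    static uint8_t buf[2][CHUNK];
    uint64_t sum = 0;
    int cur = 0;
    size_t got = len < CHUNK ? len : CHUNK;
    memcpy(buf[cur], text, got);                 /* fetch first chunk    */
    for (size_t off = 0; off < len; ) {
        size_t next_off = off + got;
        size_t next_len = 0;
        if (next_off < len) {                    /* start next "DMA"     */
            next_len = len - next_off < CHUNK ? len - next_off : CHUNK;
            memcpy(buf[cur ^ 1], text + next_off, next_len);
        }
        /* "process" the current chunk: a byte sum stands in for the
         * DFA pass; on the SPE this overlaps with the pending DMA. */
        for (size_t i = 0; i < got; i++)
            sum += buf[cur][i];
        cur ^= 1;                                /* swap buffers         */
        off = next_off;
        got = next_len;
    }
    return sum;
}
```

Since processing a chunk (25.64 microseconds) takes longer than fetching one (5.94 microseconds), the transfer always completes before the buffers swap, and the SPE never waits for data.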
All this happens on a single SPE. To unleash all of the Cell's power, the remaining seven SPEs can be put to work, too. We can do this in two ways:
- When they operate in parallel, all the SPEs have identical STTs, but they process distinct chunks of the input text. This multiplies the aggregate throughput by 8, to roughly 40 Gbps.
- Alternatively, we can put the SPEs "in series." The SPEs are all given the same chunk of input text, but they employ distinct STTs: each SPE's STT encodes a portion of the overall dictionary. This multiplies the available dictionary capacity by a factor of 8.
Thanks to the regularity of the DFA and the absence of variable-latency operations, the throughput values reported above are independent of the input. This is highly desirable in security applications, because it means immunity to content-based attacks.
The main drawback is that the STT must be small enough to fit in the local store. At 190 to 214 KB in size, an STT can hold between 1520 and 1712 states, assuming an alphabet of 32 characters. Such an STT can encode a dictionary of approximately 150 strings of average length 10 (more, if many patterns share common prefixes).