Parallel

Fast String Search on Multicore Processors

By Daniele Paolo Scarpazza, Oreste Villa and and Fabrizio Petrini, March 13, 2008

Multicores are the future, and you need to start mapping basic algorithms onto this new hardware.

Why Aho-Corasick and the Cell Make a Good Match

The DFA-based Aho-Corasick suits the Cell very well. In fact, processors with SIMD instructions (like the SPE) offer lots of data-level parallelism that you can exploit by running multiple DFAs together. Moreover, the many (128) registers available allow extensive loop unrolling and code replication.

A single instance of a DFA is not sufficient to satisfy this hunger for parallelism. The most natural approach to parallelize the algorithm (thus multiplying throughput) is to adopt multiple instances of the automaton. All the instances act as the same automaton (that is, they share the same state transition table), but they will be provided with distinct chunks of the input text; see Figure 1. The chunks are partially overlapped; otherwise, occurrences of a pattern that appear across a boundary would not be matched.

[Click image to view at full size]

Figure 1: Input text.

Because a DFA does exactly one state transition per consumed input character, multiple DFAs are naturally synchronized—they start/finish processing a given quantity of text at the same time. Consequently, they can share a single program control structure. This is a great advantage. We can fuse multiple DFAs in single, long basic blocks, which are crucial for performance on SPEs—they mitigate branch overhead and give the compiler enough elbow room during instruction scheduling. This results in reduced stalls, hiding of load/store latencies, better dual pipeline exploitation, and lower clock Cycles Per Instruction (CPI) values.

Our Fast Solution

The smaller the dictionary, the smaller the corresponding State Transition Table (STT). If the STT is small enough, we can keep it entirely in the local store rather than in main memory. This speeds up the algorithm because main memory accesses are slow. Our fast implementation is based on the simple idea of keeping the STT in the local store.

As we anticipated, a single automaton poorly utilizes the SPE hardware. Its implementation uses only four registers out of the 127 available, and no SIMD instructions. Moreover, an SPE has two pipelines and could issue two instructions at the same time if they do not conflict, but this implementation does not exploit this feature. As a result, the CPI is high (2.6) and throughput is low (1.35 Gbps).

But if we fuse together 16 automata, things improve a lot, because with SIMD instructions our code can process data from four automata at the same time. This implementation better exploits the registers (40 out of 127) and the two pipelines (43.8 percent of the issues are dual), reaching a CPI of 0.67.

This is a big improvement, but the code still stalls 7.4 percent of the cycles. To get rid of these stalls, we unroll the code manually, creating longer basic blocks. This gives the compiler more freedom to reschedule instructions and fill the stalls. We found experimentally that the best conditions happen when we unroll the main loop three times. The resulting code uses almost all the registers (124 out of 127) without spilling, it has a high dual-issue rate, and no stalls. This leads to a remarkable throughput of 5.11 Gbps.

Loading text from memory is not in the way: We adopt a double buffering scheme to load the next chunk while the current one is processed. This completely hides the load latency, because transfers happen asynchronously and take less time than processing. In fact, loading a 16-KB text chunk takes 5.94 microseconds, whereas processing it takes 25.64 microseconds.

All this happens on a single SPE. To unleash all of the Cell's power, the remaining seven SPEs can be put to work, too. We can do this in two ways:

When they operate in parallel, all the SPEs have identical STTs, but they process distinct chunks of input text. This multiplies the combined throughput by 8, which is 40 Gbps.
Alternatively, we can put SPEs "in series." The SPEs are all given the same chunk of input text, but they employ distinct STTs. Each SPE has an STT that represents a portion of the initial dictionary. This way, the available dictionary capacity is multiplied by a factor of 8.

Thanks to the regularity of the DFA and absence of variable-latency operations, the throughput values reported above are independent from the input. This is highly desired in security applications, because it means immunity to content-based attacks.

The main drawback is that the STT must be small enough to fit the local store. Ranging between 190-214 KB in size, each STT will contain between 1520 and 1712 states, assuming an alphabet of 32 characters. Such an STT may encode dictionaries of approximately 150 strings, of average length 10 (more, if many prefixes are common).

Previous 1 2 3 4 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Parallel

Fast String Search on Multicore Processors

Why Aho-Corasick and the Cell Make a Good Match

Our Fast Solution

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Parallel Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Parallel

Fast String Search on Multicore Processors

Why Aho-Corasick and the Cell Make a Good Match

Our Fast Solution

Related Reading

News

Commentary

Slideshow

Video

Most Popular

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Parallel Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content