Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Parallel

Fast String Search on Multicore Processors


Why Aho-Corasick and the Cell Make a Good Match

The DFA-based Aho-Corasick suits the Cell very well. In fact, processors with SIMD instructions (like the SPE) offer lots of data-level parallelism that you can exploit by running multiple DFAs together. Moreover, the many (128) registers available allow extensive loop unrolling and code replication.

A single instance of a DFA is not sufficient to satisfy this hunger for parallelism. The most natural approach to parallelize the algorithm (thus multiplying throughput) is to adopt multiple instances of the automaton. All the instances act as the same automaton (that is, they share the same state transition table), but they will be provided with distinct chunks of the input text; see Figure 1. The chunks are partially overlapped; otherwise, occurrences of a pattern that appear across a boundary would not be matched.

[Click image to view at full size]

Figure 1: Input text.

Because a DFA does exactly one state transition per consumed input character, multiple DFAs are naturally synchronized—they start/finish processing a given quantity of text at the same time. Consequently, they can share a single program control structure. This is a great advantage. We can fuse multiple DFAs in single, long basic blocks, which are crucial for performance on SPEs—they mitigate branch overhead and give the compiler enough elbow room during instruction scheduling. This results in reduced stalls, hiding of load/store latencies, better dual pipeline exploitation, and lower clock Cycles Per Instruction (CPI) values.

Our Fast Solution

The smaller the dictionary, the smaller the corresponding State Transition Table (STT). If the STT is small enough, we can keep it entirely in the local store rather than in main memory. This speeds up the algorithm because main memory accesses are slow. Our fast implementation is based on the simple idea of keeping the STT in the local store.

As we anticipated, a single automaton poorly utilizes the SPE hardware. Its implementation uses only four registers out of the 127 available, and no SIMD instructions. Moreover, an SPE has two pipelines and could issue two instructions at the same time if they do not conflict, but this implementation does not exploit this feature. As a result, the CPI is high (2.6) and throughput is low (1.35 Gbps).

But if we fuse together 16 automata, things improve a lot, because with SIMD instructions our code can process data from four automata at the same time. This implementation better exploits the registers (40 out of 127) and the two pipelines (43.8 percent of the issues are dual), reaching a CPI of 0.67.

This is a big improvement, but the code still stalls 7.4 percent of the cycles. To get rid of these stalls, we unroll the code manually, creating longer basic blocks. This gives the compiler more freedom to reschedule instructions and fill the stalls. We found experimentally that the best conditions happen when we unroll the main loop three times. The resulting code uses almost all the registers (124 out of 127) without spilling, it has a high dual-issue rate, and no stalls. This leads to a remarkable throughput of 5.11 Gbps.

Loading text from memory is not in the way: We adopt a double buffering scheme to load the next chunk while the current one is processed. This completely hides the load latency, because transfers happen asynchronously and take less time than processing. In fact, loading a 16-KB text chunk takes 5.94 microseconds, whereas processing it takes 25.64 microseconds.

All this happens on a single SPE. To unleash all of the Cell's power, the remaining seven SPEs can be put to work, too. We can do this in two ways:

  • When they operate in parallel, all the SPEs have identical STTs, but they process distinct chunks of input text. This multiplies the combined throughput by 8, which is 40 Gbps.
  • Alternatively, we can put SPEs "in series." The SPEs are all given the same chunk of input text, but they employ distinct STTs. Each SPE has an STT that represents a portion of the initial dictionary. This way, the available dictionary capacity is multiplied by a factor of 8.

Thanks to the regularity of the DFA and absence of variable-latency operations, the throughput values reported above are independent from the input. This is highly desired in security applications, because it means immunity to content-based attacks.

The main drawback is that the STT must be small enough to fit the local store. Ranging between 190-214 KB in size, each STT will contain between 1520 and 1712 states, assuming an alphabet of 32 characters. Such an STT may encode dictionaries of approximately 150 strings, of average length 10 (more, if many prefixes are common).


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.