Channels ▼
RSS

Parallel

Is Larrabee For the Rest of Us?


Show Me Some Code Already!

Enough with the preliminaries, it's time to get our hands dirty. My example runs through the four pieces of code in Listings 1 through 4. These excerpts describe a simple tokenizer, to be used in a search engine indexer after case folding, not much different from the actual tokenizer that Lucene [7], the most popular open-source search engine library, uses internally. Listing 1 shows the flex specification; Listing 2 contains the tokenizer kernel automatically generated by flex; Listing 3 is a rewrite that I find simpler to read and optimize; and, finally, Listing 4 shows my parallel implementation that exploits SIMD LRBni intrinsics.

Listing 1 is my specification that tells flex what patterns I'm looking for, and what actions to perform when patterns match. Note the three sections, separated by "%%" lines (if you are unfamiliar with this syntax, take a quick look at flex's manual [8]). The first section defines patterns such as letters, digits, alphabetic, and alphanumeric strings, e-mail addresses, company names, and stop words (i.e., words that you want to filter out). The second section says that we are looking for three classes of tokens: (1) stop words and stray single characters, which we ignore; (2) regular tokens that we keep unchanged; and (3) acronyms, that we emit with a special flag, so that they are marked for some post-processing (e.g. dot removal) in later stages. The third section, omitted here, would be the C implementation of utility functions like emit_token.


LETTER       [a-z]
DIGIT        [0-9] 
P            ("_"|[,-/])
HAS_DIGIT    ({LETTER}|{DIGIT})*{DIGIT}({LETTER}|{DIGIT})*
ALPHA        {LETTER}+
ALPHANUM     ({LETTER}|{DIGIT})+
ACRONYM      {ALPHA}"."({ALPHA}".")+
COMPANY      {ALPHA}("&"|"@"){ALPHA}
EMAIL        {ALPHANUM}(("."|"-"|"_"){ALPHANUM})*"@"{ALPHANUM}(("."|"-"){ALPHANUM})+
HOST         {ALPHANUM}("."{ALPHANUM})+
NUM          {ALPHANUM}{P}{HAS_DIGIT}|{HAS_DIGIT}{P}{ALPHANUM}|{ALPHANUM}({P}{HAS_DIGIT}{P}{ALPHANUM})+|{HAS_DIGIT}({P}{ALPHANUM}{P}{HAS_DIGIT})+|{ALPHANUM}{P}{HAS_DIGIT}({P}{ALPHANUM}{P}{HAS_DIGIT})+|{HAS_DIGIT}{P}{ALPHANUM}({P}{HAS_DIGIT}{P}{ALPHANUM})+
STOPWORD     "a"|"an"|"and"|"are"|"as"|"at"|"be"|"but"|"by"|"for"|"if"|"in"|"into"|"is"|"it"|"no"|"not"|"of"|"on"|"or"|"s"|"such"|"t"|"that"|"the"|"their"|"then"|"there"|"these"|"they"|"this"|"to"|"was"|"will"|"with"
KEPT_AS_IS    {ALPHANUM}|{COMPANY}|{EMAIL}|{HOST}|{NUM}
%%
{STOPWORD}|.|\n    /*do nothing*/;
{KEPT_AS_IS}       emit_token(CLASS_TOKEN, yytext);
{ACRONYM}          emit_token(CLASS_ACRONYM, yytext);
%%
/* C code omitted */ 

Listing 1

You might remember from your college days that a finite-state machine (FSM) is sufficient to recognize a regular language. Flex generates that machine, in the form of C code, from the specification in Listing 1. That C code contains (among a sea of nitty-gritty details) two main blocks: The state transition table of the state machine (split in variables yy_nxt and yy_accept), and the C kernel that runs that machine, summarized in Listing 2. Don't be scared, I don't expect you to dive into this code, I just want to point out a few details. This code loops over the input, and it is divided into an inner loop and a switch block. The inner loop runs an FSM using the state transition table yy_nxt until it finds a valid pattern (i.e., the input characters between pointers yy_bp and yy_cp). When that happens, the switch block performs an appropriate action. Note that flex has preserved the actions we indicated in the specification, and they appear here as case 1, 2, and 3.


while ( 1 ) /* loops until end-of-file is reached */
  {
    yy_cp = (yy_c_buf_p);
    
    /* Support of yytext. */
    *yy_cp = (yy_hold_char);
    
    /* yy_bp points to the position in yy_ch_buf of the start of
     * the current run. */
    yy_bp = yy_cp;    
    yy_current_state = (yy_start);
  yy_match:
    while ( (yy_current_state = yy_nxt[yy_current_state][ YY_SC_TO_UI(*yy_cp) ]) > 0 )
      {
	if ( yy_accept[yy_current_state] )
	  {
	    (yy_last_accepting_state) = yy_current_state;
	    (yy_last_accepting_cpos) = yy_cp;
	  }
	
	++yy_cp;
      }
      yy_current_state = -yy_current_state;
    
  yy_find_action:
    yy_act = yy_accept[yy_current_state];
    
    YY_DO_BEFORE_ACTION;
    
  do_action:      /* This label is used only to access EOF actions. */
    
    switch ( yy_act )
      { /* beginning of action switch */
      case 0: /* must back up */
	/* undo the effects of YY_DO_BEFORE_ACTION */
	*yy_cp = (yy_hold_char);
	yy_cp = (yy_last_accepting_cpos) + 1;
	yy_current_state = (yy_last_accepting_state);
	goto yy_find_action;
	
      case 1:
	/* rule 1 can match eol */
	YY_RULE_SETUP
	/*do nothing*/;
        YY_BREAK

      case 2:
	YY_RULE_SETUP
	emit_token(CLASS_TOKEN, yytext);
        YY_BREAK

      case 3:
	YY_RULE_SETUP
	emit_token(CLASS_ACRONYM, yytext);
        YY_BREAK
	  ...
      }
  }

Listing 2

The purpose of case 0 is related to the concept of back-up, and it requires a little explanation. Consider that flex's tokenizers always try to find the longest match; for example, when looking for pattern "i[a-z]*n", the string "internationalization" matches, dominating all its shorter substrings ("in", "intern", "internation", "ionalization", "ization", etc.), which are ignored. This feature is called longest-of-the-leftmost semantics. To implement it, the tokenizer may delay acting on a valid pattern; rather, it saves the current state and position in the input (the if statement within the inner while loop), and continues examining more input in an attempt to find a longer match. Sometimes the attempt succeeds, sometimes not. When it fails, the machine falls into a "back-up state", and reaches case 0. There, the machine restores the last saved accepting position, and executes the saved associated action.

For my convenience I rewrite that kernel in the form of Listing 3. The new form is easier to read and uses a single table yy_next, where each entry (a state code) contains both a state number and condition flags. The machine behavior is now regulated by condition flags. Whenever a match is found, the machine enters a final state, i.e., BIT_FINAL is active in the state code. If a token is matched (as opposed to a stop word), then BIT_TOKEN is active. If the token is an acronym, BIT_ACRONYM is active (to support more token types, you just need to reserve more flag bits). Finally, the save/back-up mechanism is cleanly regulated by BIT_SAVE and BIT_RESTORE flags. I wrote a simple utility to convert tables yy_nxt and yy_accept from the format that flex generates to table yy_next, in the format I just described. Also, you might have noticed that Listing 3 only saves and restores the last accepting position, and not the associated action. This is the result of an optimizing manipulation of the state machine, which is beyond the scope of this article [5, Section 5.5].


void tokenize ( unsigned char * const input, 
		unsigned char * const input_end)
{    
  state_t current_state = 0; // initial state
    
  unsigned char * yy_bp = input; // base pointer
  unsigned char * yy_cp = input; // current pointer
  
  unsigned char * last_accepting_cpos   = (unsigned char*) UNKNOWN;

  while (1) {   
    const unsigned char in_chr = *yy_cp;      
    const state_t next_code  = yy_next[current_state * ALPHABET + in_chr];
    const state_t next_state = next_code >> 8 /* last 8 bits reserved for flags */;

    const bool is_final    = next_code & BIT_FINAL;
    const bool is_save     = next_code & BIT_SAVE;
    const bool is_restore  = next_code & BIT_RESTORE;
    const bool is_token    = next_code & BIT_TOKEN;
    const bool is_acronym  = next_code & BIT_ACRONYM;
  
    if (! is_final ) {	
      ++yy_cp;
      if ( is_save ) last_accepting_cpos   = yy_cp;
      if ( yy_cp >= input_end ) goto the_end;
    } else {     
      if ( is_restore ) 
	yy_cp = last_accepting_cpos;      
      
      if (is_token) 
	emit_token(CLASS_TOKEN + is_acronym, yy_bp, yy_cp);       
            
      yy_bp = yy_cp;        	
    }	    
    current_state = next_state;	
  }
  
 the_end:
  ;
}

Listing 3

The code you see in Listing 3 is sequential (as opposed to parallel) and scalar (no SIMD). I like to say it operates on one lane. To exploit the Larrabee's SIMD instructions, we extend this code to operate on the 16 lanes that Larrabee's 512-bit registers offer. The result is in Listing 4.


#define LANES 16

typedef union {
  __m512i mm;
  __m512  mf;
  __m512d md;
  uint32_t u32[16];
} variant512_t;

#define EXTRACT(_x_,_i_) ({ variant512_t t = {mm:_x_}; t.u32[_i_];})
#define CAST_F2I(_x_)    ({ variant512_t t = {mf:_x_}; t.mm;})
#define CAST_I2F(_x_)    ({ variant512_t t = {mm:_x_}; t.mf;})

inline __m512i _mm512i_mask_movd(__m512i v1_old, __mmask k1, __m512i v2)
{ 
  return CAST_F2I( _mm512_mask_movd( CAST_I2F(v1_old), k1, CAST_I2F(v2)) );
}

typedef struct { 
  uint32_t start;
  uint32_t stop;
  uint32_t type;
  uint32_t unused;
} tt_entry_t;

//unified output token table
tt_entry_t ttable_all  [ TEST_TOKEN_TABLE_SIZE * LANES ] align64;

tt_entry_t * ttp_start [ LANES ]; // beginning of per-automaton output table partition
tt_entry_t * ttp_end   [ LANES ]; // end of per-automaton output table partition
tt_entry_t * ttp_used  [ LANES ]; // end of valid data in the output table partition

void tokenize ( const unsigned char * const input_starts [LANES], 
		const unsigned char * const input_ends   [LANES])
{      
  const unsigned input_base  = 0;          /* zero for the moment, useful when the input will be above 4GB */
  void * const   stt_base   = yy_next;

  __m512i mm_current_states      = _mm512_set_1to16_pi( 0 ); 
  __m512i mm_next_states;

  //////// CONSTANTS /////////////////////////////////////////////////////////////////////////////////////////
  const __mmask mask_all_ones       = 0xFFFF; /* sixteen bits, all ones, used to check collective termination  */
  const __m512i mm_zeroes           = _mm512_set_1to16_pi( 0 ); 
  const __m512i mm_ones             = _mm512_set_1to16_pi( 1 ); 
  const __m512i mm_twos             = _mm512_set_1to16_pi( 2 );  /* to increment tteptr */

  const __m512i mm_final_mask       = _mm512_set_1to16_pi( BIT_FINAL   ); 
  const __m512i mm_save_mask        = _mm512_set_1to16_pi( BIT_SAVE    );
  const __m512i mm_restore_mask     = _mm512_set_1to16_pi( BIT_RESTORE );
  const __m512i mm_token_mask       = _mm512_set_1to16_pi( BIT_TOKEN   );
  const __m512i mm_acronym_mask     = _mm512_set_1to16_pi( BIT_ACRONYM );
  const __m512i mm_kill_flags_mask  = _mm512_set_1to16_pi( 0xFFFFFF00  );

  const __m512i mm_SR_ofs       = _mm512_set_1to16_pi( 8 /* shift right 8 bits */ );
  assert(ALPHABET == 128 /* if it is not, change the 7 below */);
  const __m512i mm_SL_alpha     = _mm512_set_1to16_pi( 7 ); // for shift left 7 ( == mul 128) operations  
  ////////////////////////////////////////////////////////////////////////////////////////////////////////////

  __m512i mm_bp   = * (const __m512i*) input_starts;
  __m512i mm_cp   = * (const __m512i*) input_starts;
  __m512i mm_lacp = * (const __m512i*) input_starts;
  __m512i mm_eob  = * (const __m512i*) input_ends;

  /* quadword index into the token table used, for use with scatterd */
#define TTEP_IDX(_dfa_) ( U32( (ttp_used[ (_dfa_)]-ttable_all) ) * sizeof(tt_entry_t) / (4 /*SCALE*/) )
 
  __m512i mm_tte_idx = { TTEP_IDX( 0), TTEP_IDX( 1), TTEP_IDX( 2), TTEP_IDX( 3), 
			 TTEP_IDX( 4), TTEP_IDX( 5), TTEP_IDX( 6), TTEP_IDX( 7), 
			 TTEP_IDX( 8), TTEP_IDX( 9), TTEP_IDX(10), TTEP_IDX(11), 
			 TTEP_IDX(12), TTEP_IDX(13), TTEP_IDX(14), TTEP_IDX(15) }; 
 
  while (1) {   
    const __m512i mm_in_chars = CAST_F2I( _mm512_gatherd( mm_cp, input_base, _MM_FULLUPC_UINT8I, _MM_SCALE_1, _MM_HINT_NONE ));
        
    const __m512i mm_states_xalpha     = _mm512_sll_pi ( mm_current_states, mm_SL_alpha );   
    const __m512i mm_next_code_indexes = _mm512_add_pi ( mm_states_xalpha, mm_in_chars );        
    const __mmask mask_not_eob         = _mm512_cmpnle_pu(mm_eob, mm_cp); // "not at end-of-buffer" condition; 
    const __mmask mask_eob             = ~mask_not_eob;                   // "at end-of-buffer"     condition
    
          __m512i mm_next_codes        = CAST_F2I ( _mm512_mask_gatherd( CAST_I2F(mm_next_codes), mask_not_eob, mm_next_code_indexes, 
									 stt_base, _MM_FULLUPC_NONE, _MM_SCALE_4, _MM_HINT_NONE ));                               
    const __m512i mm_next_states       =  _mm512_sra_pi ( mm_next_codes, mm_SR_ofs);

    const __mmask mask_final           = _mm512_cmpnle_pu( _mm512_and_pi ( mm_next_codes, mm_final_mask   ),  mm_zeroes);
    const __mmask mask_save            = _mm512_cmpnle_pu( _mm512_and_pi ( mm_next_codes, mm_save_mask    ),  mm_zeroes);
    const __mmask mask_restore         = _mm512_cmpnle_pu( _mm512_and_pi ( mm_next_codes, mm_restore_mask ),  mm_zeroes);
    const __mmask mask_token           = _mm512_cmpnle_pu( _mm512_and_pi ( mm_next_codes, mm_token_mask   ),  mm_zeroes);

    const __m512i mm_are_acronym       = _mm512_and_pi ( mm_next_codes, mm_acronym_mask ); /* can be more than 1 bit */
    mm_next_codes = _mm512_and_pi ( mm_next_codes, mm_kill_flags_mask );
    const __mmask mask_not_final       = ~ mask_final;

    mm_cp             = _mm512_mask_add_pi ( mm_cp, mask_not_final & mask_not_eob,  mm_cp, mm_ones); 
    mm_current_states = _mm512i_mask_movd  ( mm_current_states, mask_not_eob, mm_next_states ); 
    mm_lacp           = _mm512i_mask_movd  ( mm_lacp,           mask_save,    mm_cp          );
    mm_cp             = _mm512i_mask_movd  ( mm_cp,             mask_restore, mm_lacp        ); 
  
    if ( mask_eob == mask_all_ones ) goto the_end; // all automata are complete
    
    /* selectively commit results into the output token tables */
    _mm512_mask_scatterd(ttable_all, mask_token, mm_tte_idx, CAST_I2F(mm_bp),          _MM_DOWNC_NONE, _MM_SCALE_4, _MM_HINT_NT );
    mm_tte_idx = _mm512_mask_add_pi ( mm_tte_idx, mask_token, mm_tte_idx, mm_ones ); 
    _mm512_mask_scatterd(ttable_all, mask_token, mm_tte_idx, CAST_I2F(mm_cp),          _MM_DOWNC_NONE, _MM_SCALE_4, _MM_HINT_NT );
    mm_tte_idx = _mm512_mask_add_pi ( mm_tte_idx, mask_token, mm_tte_idx, mm_ones ); 
    _mm512_mask_scatterd(ttable_all, mask_token, mm_tte_idx, CAST_I2F(mm_are_acronym), _MM_DOWNC_NONE, _MM_SCALE_4, _MM_HINT_NT );
    mm_tte_idx = _mm512_mask_add_pi ( mm_tte_idx, mask_token, mm_tte_idx, mm_twos /*1 + padding*/); 

    mm_bp = _mm512i_mask_movd( mm_bp, mask_final, mm_cp); 
  }

 the_end: ;  
  // export token table pointers to global vars
  const unsigned quadwords_per_entry = sizeof (tt_entry_t) / sizeof (uint32_t);
  for ( unsigned dfa ; dfa<LANES ; dfa++) {
    const unsigned n_entries = EXTRACT(mm_tte_idx, dfa) / quadwords_per_entry;
    ttp_used[ dfa] = ttable_all + n_entries;      
  }
}

Listing 4

In the preamble, I define a variant512_t union to allow convenient access to the contents of a 512-bit variable, as if it were a SIMD vector of integers, floats or doubles, or as a C array of 32-bit unsigned integers. Thanks to this union, I can write macros (lines 10-12) to extract elements and cast float vectors to integer vectors and vice versa, using clean, pointer-less GCC code (beware of my use of statement expressions [9]). Other methods use brute-force pointer casting and dereferencing, and have disadvantages: They hijack the language type safety and mess with the compiler's type-based pointer aliasing rules (with noisy warnings and potentially incorrect code).

The need to cast SIMD vectors and the need for a masked move that operates on integer SIMD vectors (my _mm512i_mask_movd function) arise from Intel's choice to define gather and scatter intrinsics only for float vectors. I use a CAST_I2F before a scatter, to pretend I'm storing floating points, and a CAST_F2I after a gather. It's okay, Intel, we non-numerical programmers are used to being treated as a "secondary audience". We have been used to this since the days of MMX and SSEx.

The code explosion from Listing 3 to Listing 4 is common when doing SIMDization. The code grows for two reasons. First, we write at a lower level, where C operators like +, & and > become verbose intrinsics like _mm512_add_pi, _mm512_and_pi, and _mm512_cmpnle_pu. Except for manual register allocation, this is pretty much like writing assembly code. We have also inlined function emit_token.

A second, more profound, reason is that we have replaced control flow with data flow. With the exception of the if that verifies the termination condition, the 16 FSMs on the 16 lanes are processed by the same instructions, and therefore must follow the control flow. They can't diverge like separate threads, they can't take separate paths, they can't have their own ifs, whiles, and switchs. Each iteration in the while loop describes 16 FSMs, all simultaneously consuming input, performing a state transition and possibly generating output. The execution of these FSMs is conjoined like the life of 16 siamese twins (called "sexdecuplets", if Wikipedia can be trusted). This is true to the extent that if one FSM gets to the end of its input, it must keep iterating until all the other FSMs also complete.

Each FSM sets a bit in variable mask_eob upon its end of buffer; when the mask is all ones, we are done. This is not a major issue and can be solved without performance degradation, but not without making the code more complex and difficult to explain.

As a consequence, we must rewrite the FSM's choices as branchless expressions. We use a technique sometimes called software-level speculation: we compute the result of both sides of a branch and then choose the relevant one with a selection instruction (and this takes more lines of code than the original branch.) On the Cell, you use the spu_sel intrinsic; on Larrabee, you use masked instructions. To see this technique in action, note the three masked scatter instructions in lines 109, 111, and 113. For each FSM, they store the values of bp, cp, and token_type at the end of that FSM's own token table. But this only happens if that FSM has matched a valid token in this iteration, as reflected by the respective bit in mask_token. Three associated masked adds (lines 110, 112, 114) conditionally advance the end-of-table indexes (mm_tte_idx), for the only automata that wrote output. Similarly, we turn the updates of bp, cp, and last_accepting_cpos (that appeared under if conditions in Listing 3) into masked adds or masked moves (lines 101-104).

Except for the if statement that detects the global termination condition, the code is branchless and it corresponds to the data flow of Figure 1. The figure represents precisely one iteration of the loop in Listing 4, where 16 conjoined FSMs read the respective inputs, load the next state codes from the state transition table, decode them (possibly generating outputs) and jump to the next states. A black box represents a vector variable, a grey box represents a constant, a red box represents a mask, and a blue circle represents a vector operation.

[Click image to view at full size]
Figure 1

You can unroll this code ad libitum with no special prologues or epilogues, and no impact on correctness. To increase performance, you can also remove the termination condition if everywhere but in the last unrolled iteration.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video