Conclusions
I have shown how to port a simple, sequential piece of code into its fully parallel, SIMDized version. I realize that the effort required for the data-parallel redesign is not trivial, but I don't believe that it is beyond the reach of the many programmers already involved in crafting hand-optimized code.
This kind of human effort is crucial because compilers are unlikely to deliver results of similar quality. It matters even more on Larrabee than on previous Intel machines because Larrabee's SIMD units are four times wider than the usual 128-bit SIMD units. As a consequence, non-SIMD code will likely leave a much larger fraction of the available performance on the table.
On the bright side, coding with LRBni intrinsics seems more natural and less verbose than coding with SSE intrinsics. One iteration of the loop in the code I presented needs approximately 35 instructions to transition 16 finite-state machines (approximately 2.2 instructions/transition), while 128-bit SIMD instruction sets require at least 100 instructions to transition 4 machines (approximately 25 instructions/transition).
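The arithmetic behind these figures can be reproduced with a minimal sketch like the one below. The function names are hypothetical, introduced here only for illustration, and the inputs are the approximate instruction counts quoted above rather than measured values.

```c
#include <assert.h>
#include <stdio.h>

/* Instructions spent per finite-state-machine transition in one loop
   iteration, given the instruction count of the loop body and the number
   of machines (SIMD lanes) advanced per iteration. */
double instructions_per_transition(int instructions, int machines)
{
    assert(machines > 0);
    return (double)instructions / (double)machines;
}

/* Prints the comparison using the approximate counts quoted in the text.
   These are instruction counts only; they say nothing about clock cycles. */
void print_instruction_economy(void)
{
    double lrbni   = instructions_per_transition(35, 16);  /* 16 lanes */
    double simd128 = instructions_per_transition(100, 4);  /*  4 lanes */

    printf("LRBni:        %.2f instructions/transition\n", lrbni);
    printf("128-bit SIMD: %.2f instructions/transition\n", simd128);
    printf("ratio:        %.1f\n", simd128 / lrbni);
}
```

Calling `print_instruction_economy()` from a `main` function shows the two per-transition figures side by side, together with their ratio of roughly 11 to 1 in instruction count.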
Don't get me wrong. Without information on instruction latencies, you cannot translate these instruction-economy statistics into performance figures. At this time, nobody except Intel can estimate the number of clock cycles taken by any LRBni instruction. Scatter/gather instructions, for example, will likely be decomposed into multiple pointer-arithmetic and store/load microinstructions, which might take a high cumulative number of clock cycles to complete. The performance of this code depends on how successful Intel engineers will be in squeezing LRBni instructions into a handful of clock cycles. It's not an easy task, especially for complex instructions like scaled, masked scatter/gather with type conversion.
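To make the amount of work hidden in such an instruction concrete, here is a scalar sketch of what a scaled, masked gather with type conversion has to accomplish for 16 lanes. It is a semantic model only, not a description of how the hardware decomposes the instruction. The function name, the choice of an 8-bit to 32-bit conversion, and the rule that masked-off lanes keep their previous value are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 16 };  /* 512-bit vector of 32-bit elements */

/* Scalar model (hypothetical, for illustration only) of a scaled, masked
   gather that widens 8-bit memory values to 32-bit lanes.
   For each lane whose mask bit is set:
     1. pointer arithmetic: address = base + index[lane] * scale
     2. load:               read one byte from that address
     3. type conversion:    zero-extend the byte to 32 bits
   Lanes whose mask bit is clear are left unchanged (an assumption). */
void gather_u8_to_i32_scalar(int32_t dst[LANES], const uint8_t *base,
                             const int32_t index[LANES], int scale,
                             uint16_t mask)
{
    assert(dst != NULL && base != NULL && index != NULL);

    for (int lane = 0; lane < LANES; ++lane) {
        if ((mask >> lane) & 1u) {
            const uint8_t *addr =
                base + (ptrdiff_t)index[lane] * (ptrdiff_t)scale;
            dst[lane] = (int32_t)*addr;
        }
    }
}
```

Even in this simplified form, one gather implies up to 16 address computations, 16 loads, and 16 conversions, which is why its cycle cost is hard to guess from the instruction count alone.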
Acknowledgments
Thanks to Sally A. McKee, Jamin Naghmouchi, Michael Perrone and Greg Pfister for their useful comments.
Disclaimer: Any claim or information reported here on Larrabee or other Intel products may not be final or reliable. The author assumes no liability for the use or interpretation of information contained herein. This article reflects the views and the opinions solely of the author, which may not necessarily be endorsed or approved by IBM.
References
[1] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, P. Hanrahan, "Larrabee: A Many-Core x86 Architecture for Visual Computing", ACM Transactions on Graphics, 27, 3, 2008. http://portal.acm.org/citation.cfm?doid=1360612.1360617
[2] M. Abrash, "A First Look at the Larrabee New Instructions (LRBni)", Dr. Dobb's, April 1st, 2009. http://www.ddj.com/hpc-high-performance-computing/216402188
[3] Intel Software Network, "C++ Larrabee Prototype Library", June 19, 2009. http://software.intel.com/en-us/articles/prototype-primitives-guide/
[4] K. Asanovic, R. Bodik, J. Demmel, J. Kubiatowicz, K. Keutzer, E. Lee, G. Necula, D. Patterson, K. Sen, J. Shalf, J. Wawrzynek, K. Yelick, "The Landscape of Parallel Computing Research: A View from Berkeley". http://science.officeisp.net/ManycoreComputingWorkshop07/Presentations/David%20Patterson.pdf
[5] D. P. Scarpazza, G. F. Russell, "High-performance Regular Expression Scanning on the Cell/B.E. Processor", 23rd International Conference on Supercomputing (ICS'09), IBM T.J. Watson Research Center, Yorktown Heights, NY, USA, June 2009. http://domino.research.ibm.com/comm/research_people.nsf/pages/scarpazza.pubs.html/$FILE/2009-06-ICS-scarpazza.pdf
[6] D. P. Scarpazza, G. W. Braudaway, "Workload Characterization and Optimization of High-performance Text Indexing on the Cell Processor", IEEE International Symposium on Workload Characterization (IISWC'09), Austin, TX, October 4, 2009. http://domino.research.ibm.com/comm/research_people.nsf/pages/scarpazza.pubs.html/$FILE/2009-10-04-IISWC-scarpazza.pdf
[7] The Apache Software Foundation. Lucene. http://lucene.apache.org
[8] V. Paxson, flex: a fast lexical analyzer generator. http://flex.sourceforge.net/manual/
[9] The Free Software Foundation, Using the GNU compiler collection (GCC), "Section 5.1: Statements and Declarations in Expressions". http://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Statement-Exprs.html


