Dr. Dobb's is part of the Informa Tech Division of Informa PLC


Getting to 1 Teraflop on the Intel Phi Coprocessor

Generating an Autoencoder

The genFunc.py code can be used to generate a more complicated autoencoder that performs more floating-point operations per byte fetched from main memory and therefore gets closer to peak performance on the Phi coprocessor. Copy the simpleExample directory to auto2x10x1x10x2 and generate the new fcn.h source code with the commands shown in Example 11.

  cp -r simpleExample auto2x10x1x10x2
  python genFunc.py > auto2x10x1x10x2/fcn.h

Example 11: Commands to generate and build a PCA autoencoder.

The PCA version can be built by simply changing the command-line argument to the BUILD_LINEAR_TIMING script in Example 12. The nonlinear version can be built with the BUILD_NONLINEAR_TIMING script.

  sh BUILD_LINEAR_TIMING auto2x10x1x10x2

Example 12: Using the BUILD_LINEAR_TIMING shell script to build an autoencoder.

Runtime Results

To facilitate testing, the PCA_TIMING and NLPCA_TIMING shell scripts always create binaries of the same name. It is therefore important to pay attention to which binary is being used and to the output line that reports the function being evaluated. The binaries created are:

  • timing.omp: an OpenMP executable that runs on the host processor
  • timing.mic: a native mode Intel Xeon Phi coprocessor executable
  • timing.off: an offload mode executable that will run on both the host and Intel Xeon Phi coprocessor

Running a linear PCA function using the timing code in auto2x10x1x10x2 in offload mode generates the output in Example 13.

$ ./timing.off 10 1000 60000000
TIMING TEST: generated_PCA_func LINEAR()
Runtime 8.41651, aveObjRuntime 0.00841619
number OMP threads 240
DataLoadTime 1.86613
AveObjTime 0.00841619, countObjFunc 1000, totalObjTime 8.41619
Estimated flops in myFunc 128, estimated average GFlop/s 912.527
Estimated maximum GFlop/s 979.096, minimum GFLop/s 319.506
Double check based on overall runtime
maxTime 0.0240381, minTime 0.00784516, aveTime 0.00841639 per call
maxGFLOPS 978.947, minGFLOPS 319.493, aveGFLOPS 912.505 per call

Example 13: Example of output for a linear offload run.

The output tells us that we are running a linear PCA function. In this example, the data transfer to the Phi coprocessor achieved 257 MB/second. The timing information should be reliable because the average runtime is consistent between the overall and per-call measurement methods. The Phi coprocessor utilized 240 threads.
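The 257 MB/second figure can be reproduced from the numbers in Example 13. The sketch below assumes that the third command-line argument is the number of examples, that each example has two inputs (as the 2x10x1x10x2 name suggests), and that the data is single precision. None of these are stated explicitly above, so treat them as assumptions.

```python
# Assumptions (not stated in the article): the third argument to timing.off
# is the example count, each example has 2 inputs, and values are 4-byte floats.
N_EXAMPLES = 60_000_000      # third argument to ./timing.off in Example 13
INPUTS_PER_EXAMPLE = 2       # assumed from the 2x10x1x10x2 architecture name
BYTES_PER_VALUE = 4          # assumed single precision
DATA_LOAD_TIME = 1.86613     # DataLoadTime from Example 13, in seconds


def transfer_rate_mb_per_s(n_examples, inputs, bytes_per_value, seconds):
    """Return the average transfer rate in MB/s (1 MB = 10**6 bytes)."""
    total_bytes = n_examples * inputs * bytes_per_value
    return total_bytes / seconds / 1e6


rate = transfer_rate_mb_per_s(
    N_EXAMPLES, INPUTS_PER_EXAMPLE, BYTES_PER_VALUE, DATA_LOAD_TIME
)
print(f"{rate:.0f} MB/s")
```

Under these assumptions the data set is 480 MB, which matches the reported rate when divided by DataLoadTime.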

The key floating-point metrics show that average performance is 912 GFlop/s. The fastest offload run achieves nearly 1 TFlop/s. However, there is roughly a factor of 3 difference between the fastest and slowest runtimes, even though the code performed 10 warm-up runs.
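The reported rates follow directly from the per-call times. As a rough check, the sketch below recomputes them from the values printed in Example 13, assuming that each call evaluates myFunc once for each of the 60,000,000 examples (an assumption based on the command line, not something the output states).

```python
FLOPS_PER_EXAMPLE = 128      # "Estimated flops in myFunc" from Example 13
N_EXAMPLES = 60_000_000      # assumed: third argument to ./timing.off


def gflops(seconds_per_call,
           flops_per_example=FLOPS_PER_EXAMPLE,
           n_examples=N_EXAMPLES):
    """Return the sustained rate in GFlop/s for one objective-function call."""
    return flops_per_example * n_examples / seconds_per_call / 1e9


average = gflops(0.00841619)   # AveObjTime
fastest = gflops(0.00784516)   # minTime
slowest = gflops(0.0240381)    # maxTime

print(f"average {average:.1f}, fastest {fastest:.1f}, slowest {slowest:.1f}")
print(f"fastest/slowest ratio {fastest / slowest:.2f}")
```

The ratio between the fastest and slowest calls comes out slightly above 3, which is the spread discussed above.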

The timing executable can run natively on the Phi coprocessor after copying the executable and libiomp5.so to the device and setting LD_LIBRARY_PATH correctly, as shown in Example 14. (Alternatively, use the micnativeloadex script or keep one window open on the Intel Xeon Phi coprocessor.)


DEV=mic0   # target coprocessor; LIBDIR must point at the directory holding libiomp5.so
scp $LIBDIR/libiomp5.so $DEV:/tmp
scp timing.mic $DEV:
ssh $DEV "export LD_LIBRARY_PATH=/tmp; export OMP_NUM_THREADS=240; \
./timing.mic 10 1000 60000000"

Example 14: Commands to prepare and run on mic0.

Even though the number of threads is the same, the same code running in native mode was on average 1.173x faster than in offload mode, as shown in Example 15. Offload mode can be nearly as efficient as native mode when the time spent performing the computation is large relative to the latency of the data transfers on the PCIe bus. The relative runtime difference will increase as the problem size decreases.
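To see why the gap widens for smaller problems, consider a toy model in which the whole difference between the offload and native per-call times is a fixed overhead, and the compute time scales linearly with the number of examples. Both are simplifying assumptions made for illustration, not measurements from the article.

```python
OFFLOAD_TIME = 0.00841619    # AveObjTime in offload mode (Example 13), seconds
NATIVE_TIME = 0.00717398     # AveObjTime in native mode (Example 15), seconds
N_EXAMPLES = 60_000_000      # problem size used for both runs

# Assumption: the difference is a fixed per-call cost (latency, driver work).
overhead = OFFLOAD_TIME - NATIVE_TIME


def offload_slowdown(n_examples):
    """Estimated offload/native runtime ratio for a given problem size."""
    # Assumption: compute time is proportional to the number of examples.
    compute = NATIVE_TIME * n_examples / N_EXAMPLES
    return (compute + overhead) / compute


for n in (60_000_000, 6_000_000, 600_000):
    print(f"{n:>10} examples: offload is {offload_slowdown(n):.2f}x slower")
```

At the measured problem size the model reproduces the 1.173x ratio by construction; the smaller sizes are extrapolations and should be verified by rerunning the timing binaries.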

libiomp5.so                                   100%  956KB 955.6KB/s   00:00    
timing.mic                                    100%   36KB  36.5KB/s   00:00    
TIMING TEST: generated_PCA_func LINEAR()
Runtime 7.17827, aveObjRuntime 0.00717398
number OMP threads 240
DataLoadTime 0
AveObjTime 0.00717398, countObjFunc 1000, totalObjTime 7.17398
Estimated flops in myFunc 128, estimated average GFlop/s 1070.54
Estimated maximum GFlop/s 1086.89, minimum GFLop/s 1032.81
Double check based on overall runtime
maxTime 0.00743794, minTime 0.00706887, aveTime 0.00717716 per call
maxGFLOPS 1086.45, minGFLOPS 1032.54, aveGFLOPS 1070.06 per call

Example 15: Output from a native mode linear run.

Operating system jitter, discussed in the first article, appears to be the cause of much of the variation between the minimum and maximum runtimes. (An excellent starting paper on this topic is "The Case of the Missing Supercomputer Performance.") In offload mode, performance also drops because of the small latencies incurred as the device driver moves the parameters onto the Phi and the single floating-point error estimate off the device.

Running the OpenMP version on a 12-core 3.3 GHz Intel Xeon X5680 (Westmere) system shows that the linear code runs on average 8.5x slower than the offload code and 10x slower than the native-mode code; see Example 16.

$ ./timing.omp 10 1000 60000000
TIMING TEST: generated_PCA_func LINEAR()
Runtime 71.2253, aveObjRuntime 0.0712247
number OMP threads 24
DataLoadTime 0
AveObjTime 0.0712247, countObjFunc 1000, totalObjTime 71.2247
Estimated flops in myFunc 128, estimated average GFlop/s 107.828
Estimated maximum GFlop/s 121.569, minimum GFLop/s 54.8578
Double check based on overall runtime
maxTime 0.139998, minTime 0.063174, aveTime 0.0712251 per call
maxGFLOPS 121.569, minGFLOPS 54.8578, aveGFLOPS 107.827 per call

Example 16: Output from a linear run on a conventional processor.

Examining fcn.h in auto2x10x1x10x2 shows that the processor core is performing a large number of dot products that utilize the fused multiply-add instruction. Changing to a nonlinear function illustrates the impact of adding a division and an absolute value to the calculation through the Elliott activation function, G(x) = x/(1+|x|). The following results were obtained with timing programs built with the NLPCA_TIMING script:

  • timing.mic: maxGFLOPS 360.278, minGFLOPS 281.438, aveGFLOPS 358.51 per call
  • timing.off: maxGFLOPS 349.66, minGFLOPS 326.766, aveGFLOPS 348.257 per call
  • timing.omp: maxGFLOPS 91.4893, minGFLOPS 71.9806, aveGFLOPS 90.8493 per call
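For reference, the Elliott activation function is simple to write down. The sketch below is a plain Python version for illustration only; the generated fcn.h is C code, and the function name used here is an assumption.

```python
def elliott(x: float) -> float:
    """Elliott activation G(x) = x / (1 + |x|).

    Compared with a linear neuron, each evaluation adds an absolute
    value, an addition, and a division.
    """
    return x / (1.0 + abs(x))


# The function is odd, passes through the origin, and saturates toward
# -1 and +1 without ever reaching them.
for value in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"G({value:+.1f}) = {elliott(value):+.3f}")
```

The division is the expensive part: unlike the multiply-adds in the dot products, it cannot be folded into a fused multiply-add instruction.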

The key point is that the Intel Xeon Phi coprocessor in native mode runs the nonlinear problem with performance comparable to offload mode. This indicates that the runtime is dominated by the computation rather than by latency-limited operations such as the summation spinlock and PCIe data transfers. Note that there is still significant variation between minimum and maximum performance. Performance profiling with Intel's profiler, VTune, will help identify the reasons for these performance changes.


Peak performance is a useful marketing metric that condenses the complexity of any machine from a cellphone to a leadership-class supercomputer into a single number that people can easily grasp and categorize. While peak performance has its place, sophisticated performance competitions such as the TOP500 and GRAPH500 attempt to more realistically evaluate system performance for specific problem domains.

The key to entering the high-performance arena with the Phi product family is to express sufficient parallelism and vector capability to fully utilize the device. Optimized libraries such as MKL can achieve very high performance. Matrix multiplication is a useful computational tool that also makes a great benchmark because it can show how close a device can get to its theoretical peak performance.

The massively parallel mapping utilized in this article has proven to be an excellent framework for solving real-world problems, as a teaching tool, and as a performance evaluation tool. The autoencoder objective functions used in this tutorial solve real-world PCA and NLPCA problems, yet they can also be modified to stress either the memory subsystem or the floating-point capability of a device. It is also possible to define an autoencoder architecture that is limited not by memory bandwidth or computation, but rather by the synchronization required to perform a reduction on a parallel computer. The heavy use of the fused multiply-add instruction means that it is possible to fully utilize the floating-point capability of some devices and to achieve high performance across a wide range of devices. The near-linear scaling of this mapping means that you can run it with high performance on a single device or on a supercomputer over a wide range of problem domains.

I encourage you to explore the Intel Xeon Phi coprocessor performance envelope through the use of the provided Python code generator and by writing your own functions. My next article will demonstrate that these objective functions can indeed solve real optimization problems with high performance.



The current source code should be compiled with the older 13.0.0 Intel compiler. The code also compiles with the more recent compilers (13.0.1 and 13.1.0), but care must be taken to confirm that the loop in the objective function vectorizes.

Rather than generating myFunc(), it would be more convenient to write a single function that loops over the connections between neurons in different layers. However, using loops in myFunc() appears to prevent vectorization and results in a significant performance drop. Unfortunately, loop unrolling does not appear to help.
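The idea of generating straight-line code instead of loops can be illustrated with a small sketch. This is not the genFunc.py generator: the function name gen_layer, the parameter array P, the variable naming, and the bias-first parameter layout are all hypothetical choices made for this example.

```python
def gen_layer(n_in, n_out, in_name, out_name, first_param=0):
    """Emit straight-line C statements for one fully connected layer.

    Each output neuron starts from a bias parameter and adds one
    multiply-add term per incoming connection, so the emitted code
    contains no loops. Returns the generated lines and the index of
    the next unused parameter.
    """
    lines = []
    p = first_param
    for o in range(n_out):
        terms = [f"P[{p}]"]                      # bias (hypothetical layout)
        p += 1
        for i in range(n_in):
            terms.append(f"P[{p}] * {in_name}{i}")
            p += 1
        lines.append(f"float {out_name}{o} = " + " + ".join(terms) + ";")
    return lines, p


# Generate a 2-input, 3-neuron layer as loop-free C statements.
lines, next_param = gen_layer(2, 3, "in", "h")
print("\n".join(lines))
```

Because every connection appears as an explicit term, the only loop left for the compiler to vectorize is the outer loop over examples that calls the generated function.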

The article "Optimization and Performance Tuning for Intel Xeon Phi Coprocessors Part 1: Optimization Essentials" is a useful reference for high-performance Intel Xeon Phi coprocessor programming. It notes that alignment of the vectors with __declspec(align(64)) is important. Utilizing -vec-report=6 when compiling confirms that the values are aligned. The Intel article also notes, "Code will run best when data are accessed in sequential address-order from memory. Frequently, developers will change the data structure to allow this linear access pattern. A common transformation is from an array of structures to a structure of arrays (AoS to SoA)." Users can test the performance effects of AoS versus SoA by changing the IN() macro.
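The difference between the two layouts comes down to how an (example, input) pair maps to a position in memory. The definition of the IN() macro is not reproduced here, so the two index functions below are hypothetical stand-ins that show the two mappings for a flat array.

```python
def index_aos(example, feature, n_examples, n_features):
    """Array of structures: all inputs of one example are adjacent."""
    return example * n_features + feature


def index_soa(example, feature, n_examples, n_features):
    """Structure of arrays: all examples of one input are adjacent."""
    return feature * n_examples + example


N_EXAMPLES, N_FEATURES = 4, 2

# Stride between consecutive examples when reading the same input.
aos_stride = (index_aos(1, 0, N_EXAMPLES, N_FEATURES)
              - index_aos(0, 0, N_EXAMPLES, N_FEATURES))
soa_stride = (index_soa(1, 0, N_EXAMPLES, N_FEATURES)
              - index_soa(0, 0, N_EXAMPLES, N_FEATURES))
print(f"AoS stride {aos_stride}, SoA stride {soa_stride}")
```

With SoA, a vector unit that processes consecutive examples reads consecutive addresses (stride 1); with AoS, the same reads are separated by the number of inputs per example.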

The article "Test-Driving Intel Xeon Phi Coprocessors with a Basic N-body Simulation" is a good additional resource to consult when evaluating how to write high-performance code for Intel Xeon Phi coprocessors.

Rob Farber is a frequent contributor to Dr. Dobb's on CPU and GPGPU programming topics.

Related Article

Programming Intel's Xeon Phi: A Jumpstart Introduction

Numerical and Computational Optimization on the Intel Phi
