Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

Channels ▼


Getting to 1 Teraflop on the Intel Phi Coprocessor

Definition of the PCA/NLPCA Vector Function

For flexibility, myFunc.h uses the C preprocessor to include the code for the objective function from the file fcn.h. Note that both linear and nonlinear G() functions are defined via C preprocessor macros in the myFunc.h source code. Example 6 is the full source listing for the 2x1x2 autoencoder that includes all the preprocessor defines.

#define N_INPUT (2)
#define N_H1 (1)
#define N_OUTPUT (0)
#define EXAMPLE_SIZE (2)
inline float myFunc(const int vIndex, const float  * restrict P, 
                    const float * restrict ex, const int nExamples,
                    float * restrict pred)
   float in[2];

   // load the input vector
   in[0] = ex[IN(0, nExamples, vIndex)];
   in[1] = ex[IN(1, nExamples, vIndex)];

   register float h1_0 = P[0]; // bottleneck neuron 0 offset

   h1_0 += in[0] * P[1]; // input 0 to the bottleneck neuron
   h1_0 += in[1] * P[2]; // input 1 to the bottleneck neuron

   h1_0 = G(h1_0); // Calculate G() if nonlinear

   register float o,sum = 0.f;

   // Bottleneck neuron to Output 0
   o = P[3]; // Output 0 offset
   o += h1_0 * P[4];
#ifdef DO_PRED
   pred[0] = o;
   o -= in[0]; // calculate error
   sum += o*o;

   // Bottleneck neuron to Output 1
   o = P[5]; // Output 0 offset
   o += h1_0 * P[6];
#ifdef DO_PRED
   pred[1] = o;
   o -= in[1]; // calculate error
   sum += o*o;


#define N_PARAM (7)
#define FLOP_ESTIMATE (14 + 1 * G_ESTIMATE)

Example 6: A simple 2x1x2 autoencoder inline function.

As noted previously, this code makes heavy use of the fused-multiply-add instruction so it should provide high-floating point performance. Loops were avoided because they did interfere with vectorization. (Loop unrolling pragmas did not appear to help.) Further, the code for an autoencoder is very nice for benchmarking purposes because all computations occur in vector registers once the in[] vector is read from main memory.

Example 7 is a Python code generator that provides a convenient way to generate arbitrary nInput x Nh3 x nh3 x NH3 x nInput autoencoders. By varying the size of the autoencoder, the reader can adjust the number of flops per byte fetched from memory from low to very high, as well as investigate the size of program that can be utilized before instruction fetch overhead dominates the runtime. By default, this Python code generates a 2x10x1x10x2 autoencoder.


print "#define N_INPUT (%d)" % (nInput) 
print "#define N_H1 (%d)" % (nH1) 
print "#define N_h3 (%d)" % (nh3) 
print "#define N_H3 (%d)" % (nH3) 
print "#define N_OUTPUT (%d)" % (0) 
print "#define EXAMPLE_SIZE (%d)" % (nInput) 


#print start of inline function
print "inline float myFunc(const int vIndex, const float  * restrict P, " 
print "                    const float * restrict ex, const int nExamples,"
print "                    float * restrict pred)"
print "{"
print "   float in[%d];" % (nInput)

for i in range(0,nInput):
    print "   in[%d] = ex[IN(%d, nExamples, vIndex)];" % (i,i)

for i in range(0,nH1):
   print "   register float h1_%d = P[%d];" % (i,index)
   index += 1

#input to h1
for i in range(0,nInput):
    for j in range(0,nH1):
        print "   h1_%d += in[%d] * P[%d];" % (j,i,index)
        index += 1
        flopEstimate += 2

for j in range(0,nH1):
    print "   h1_%d = G(h1_%d);" % (j,j)
    gCalls += 1
for i in range(0,nh3):
   print "   register float h3_%d = P[%d];" % (i,index)
   index += 1
for i in range(0,nH1):
    for j in range(0,nh3):
        print "   h3_%d += h1_%d * P[%d];" % (j,i,index)
        index += 1
        flopEstimate += 2

for i in range(0,nH3):
   print "   register float h3_%d = P[%d];" % (i,index)
   index += 1
for i in range(0,nh3):
    for j in range(0,nH3):
        print "   h3_%d += h3_%d * P[%d];" % (j,i,index)
        index += 1
        flopEstimate += 2

for j in range(0,nH3):
    print "   h3_%d = G(h3_%d);" % (j,j)
    gCalls += 1
print "   register float o,sum = 0.f;"

for i in range(0,nInput):
    print "   o = P[%d];" % (index)
    index += 1
    for j in range(0,nH3):
        print "   o += h3_%d * P[%d];" % (j,index)
        index += 1
        flopEstimate += 2

    print "#ifdef DO_PRED"
    print "   pred[%d] = o;" %(i)
    print "#endif"
    print "   o -= in[%d];" % (i)
    print "   sum += o*o;"
    flopEstimate += 3

print "   return(sum);"
print "}"
print "#define N_PARAM (%d)" % (index) 
flopEstimate += 2 # to account for the square and global sum
print "#define FLOP_ESTIMATE (%d + %d * G_ESTIMATE)" % (flopEstimate, gCalls) 

Example 7: Python code to generate a range of autoencoders.

The Timing Framework

Example 8 is the complete source code for the timing code that calls the objective function. This framework creates a user-specified-size dataset with random numbers. In offload mode, the ALLOC preprocessor define is used along with an in() clause to move the data to the device. The code then fills the parameter vector with random numbers and calls the objective function a user-specified number of times. The user-reported timing statistics are created by calling the objective function a user-specified number of times.

A simple sanity check is provided, which verifies that the objective function returns the same result during the timing run. An overall timing check is also performed to see if the granularity of the measured wall clock time has affected the timing results.

// Rob Farber
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <math.h>

#include "myFunc.h"

int main(int argc, char* argv[])
  userData_t uData = {0};

  if(argc < 3) {
    fprintf(stderr,"Use: nWarmup nTests nExamples\n");
    return -1;
  int nWarmup=atoi(argv[1]);
  int nTests=atoi(argv[2]);
  int nExamples=atoi(argv[3]);


  // ********* Allocation  *********************
  // aligned allocation of the data
  uData.example = memalign(64,nExamples*EXAMPLE_SIZE*sizeof(float));
  if(!uData.example) {
    fprintf(stderr,"Not enough memory for examples!\n");
  // aligned allocation of the on-device parameters
  uData.param = memalign(64,N_PARAM*sizeof(float));
  if(!uData.param) {
    fprintf(stderr,"Not enough memory for the parameters!\n");

  // ********* Fill data  *********************
  // generate random examples
  // fill with random data
  for(int i=0; i < uData.nExamples*EXAMPLE_SIZE; i++)
    uData.example[i]= (rand()/RAND_MAX);


  // Define some random parameters
  double x[N_PARAM]; 
  // fill with random data
  for(int i=0; i < N_PARAM; i++) x[i] = 0.1*(rand()/(double)RAND_MAX);

  // Do warmup calls, summing errors to ensure compilation
  uData.isWarmup = 1;
  float tmp=0;
  for(int i=0; i < nWarmup; i++)
    tmp += objFunc(N_PARAM,  x, NULL, &uData);
  uData.isWarmup = 0;

  // Perform timing runs
  double result[nTests]; // Sanity check for consistency
  double perCall[nTests];
  double startTime=getTime();

  for(int i=0; i < nTests; i++) {
    double callStart=getTime();
    result[i] = objFunc(N_PARAM,  x, NULL, &uData);
    perCall[i] = getTime() - callStart;
  double totalTime=getTime()-startTime;

  printf("TIMING TEST: %s\n", desc);
  printf("Runtime %g, aveObjRuntime %g\n",totalTime, uData.timeObjFunc/uData.countObjFunc);

  // Report times and free MIC data

  // calculate and report the min, max and average runtime behavior
  double minTime=perCall[0];
  double maxTime=perCall[0];
  double aveTime=0.;

  for(int i=0; i < nTests; i++) {
    minTime = (minTime < perCall[i])?minTime:perCall[i];
    maxTime = (maxTime > perCall[i])?maxTime:perCall[i];
    aveTime += perCall[i];
  aveTime /= (double) nTests;

  printf("Double check based on overall runtime\n");
  printf("maxTime %g, minTime %g, aveTime %g per call\n",maxTime,minTime,aveTime);
  printf("maxGFLOPS %g, minGFLOPS %g, aveGFLOPS %g per call\n",

  for(int i=0; i < nTests; i++)
    printf("ObjFunc time %g\n", perCall[i]);

  // check that the results are the same
  for(int i=1; i < nTests; i++) {
    if(result[i] != result[0])
      fprintf(stderr,"result %d differs from result[0] (%g,%g)\n",i, result[0],result[i]);

Example 8: Source code for timing.c.

Building the Timing Framework

Save the source files myFunc.h and fcn.h in a subdirectory. (For example, the previous source listings should be saved in the directory simpleExample. Keep the source code for timing.c and genFunc.py in the top-level directory).

Example 9 is a shell script builds the OpenMP native and offload versions of the timing function in the simpleExample directory:

FLAGS="-DUSE_LINEAR -O3 -std=gnu99"
icc -openmp -no-offload -Wno-unknown-pragmas $FLAGS -I $APP timing.c -o $APP/timing.omp
icc -O3 -mmic -openmp -Wno-unknown-pragmas $FLAGS -I $APP timing.c -o $APP/timing.mic
icc -O3 $FLAGS -I $APP timing.c -o $APP/timing.off

Example 9: Shell script to build linear examples.

Simply by removing the preprocessor definition –DUSE_LINEAR, this same code will compile to an NLPCA run using the x/(1+|x|) sigmoid, or S-shaped curve. This activation function is handy for timing purposes because we know how many floating-point operations, but as noted in my article, "A Better Activation Function for Artificial Neural Networks," it might actually take more optimization steps to reach a solution when solving real problems. It is not clear how many floating-point operations are required to calculate a tanhf() or expf(). By default, the code in myFunc.h assumes seven floating-point operations per call. The article "Test-driving Intel Xeon Phi Coprocessors with a Basic N-body Simulation" notes the actual number of floating-point operations performed by the expf() call varies.

FLAGS="-DUSE_LINEAR -O3 -std=gnu99"
icc -openmp -no-offload -Wno-unknown-pragmas $FLAGS -I $APP timing.c -o $APP/timing.omp
icc -O3 -mmic -openmp -Wno-unknown-pragmas $FLAGS -I $APP timing.c -o $APP/timing.mic
icc -O3 $FLAGS -I $APP timing.c -o $APP/timing.off

Example 10: Shell script to build nonlinear examples.

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.