Numerical and Computational Optimization on the Intel Phi


Listing Four, genData.c, creates a cigar-shaped cloud of points around a two-dimensional line. Figure 2 shows the point-cloud distribution produced by the compiled genData.c application for 100 data points with a variance of 0.1.

Example of a linear dataset for PCA
Figure 2: Example of a linear dataset for PCA.

The program writes the data in binary format to a user-specified file for use in the data-fitting step. The data can be piped through stdout to another program (say, one running natively on the Intel Xeon Phi coprocessor) by specifying a filename of "-" on the command line.
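The binary layout the program writes is a three-word header followed by nExamples rows of floats. A reader for that header might look like the following sketch (DataHeader and readDataHeader are illustrative names, not part of the article's code):

```c
#include <stdio.h>
#include <stdint.h>

// File layout produced by genData():
//   uint32_t nInput, nOutput, nExamples;        // header
//   float    data[nExamples][nInput + nOutput]; // examples
typedef struct { uint32_t nInput, nOutput, nExamples; } DataHeader;

// Read the three-word header; returns 0 on success, -1 on a short read.
static int readDataHeader(FILE *fn, DataHeader *h)
{
  if (fread(&h->nInput,    sizeof(uint32_t), 1, fn) != 1) return -1;
  if (fread(&h->nOutput,   sizeof(uint32_t), 1, fn) != 1) return -1;
  if (fread(&h->nExamples, sizeof(uint32_t), 1, fn) != 1) return -1;
  return 0;
}
```

Each example that follows the header is nInput + nOutput consecutive floats, so a consumer can loop fread() calls of that width nExamples times.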

Listing Four

// Rob Farber
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// get a uniform random number between -1 and 1 
inline float f_rand() {
  return 2*(rand()/((float)RAND_MAX)) -1.;
}
void genData(FILE *fn, int nVec, float xVar)
{
  float xMax = 1.1; float xMin = -xMax;
  float xRange = (xMax - xMin);

  // write header info
  uint32_t nInput=2; fwrite(&nInput,sizeof(uint32_t), 1, fn);
  uint32_t nOutput=0; fwrite(&nOutput,sizeof(uint32_t), 1, fn);
  uint32_t nExamples=nVec; fwrite(&nExamples,sizeof(uint32_t), 1, fn);

  for(int i=0; i < nVec; i++) {
    float t = xRange * f_rand();
    float z1 = t +  xVar * f_rand();
#ifdef USE_LINEAR
    float z2 = t +  xVar * f_rand();
#else
    float z2 = t*t*t +  xVar * f_rand();
#endif
    fwrite(&z1, sizeof(float), 1, fn);
    fwrite(&z2, sizeof(float), 1, fn);
  }
}

int main(int argc, char *argv[])
{
  if(argc < 5) { // four arguments required: the seed is argv[4]
    fprintf(stderr,"Use: %s filename nVec variance seed\n",argv[0]);
    exit(1);
  }
  char *filename=argv[1];
  FILE *fn=stdout;

  if(strcmp("-", filename) != 0)
    fn=fopen(filename,"wb"); // binary mode

  if(!fn) {
    fprintf(stderr,"Cannot open %s\n",filename);
    exit(1);
  }
  int nVec = atoi(argv[2]);
  float variance = atof(argv[3]);
  srand(atoi(argv[4]));
  genData(fn, nVec, variance);

  if(fn != stdout) fclose(fn);
  return 0;
}
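The f_rand() helper maps rand()'s [0, RAND_MAX] output onto [-1, 1]; a standalone sketch of the same mapping (uniform_pm1 is an illustrative name) makes the range easy to verify:

```c
#include <stdlib.h>

// Same mapping as the listing's f_rand(): scale rand() from
// [0, RAND_MAX] onto [0, 2], then shift down to [-1, 1].
static float uniform_pm1(void)
{
  return 2*(rand()/((float)RAND_MAX)) - 1.f;
}
```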

Building and Running the PCA Analysis

Once the nlopt library has been built, copy the train.c and pred.c source files from this article to disk, along with the genFunc.py Python script from the previous article. Create a subdirectory called pca and change into it, then copy the genData.c and myFunc.h files to this subdirectory. Now use the genFunc.py script to create a 2x10x1x10x2 autoencoder:

  python ../genFunc.py > fcn.h

The following bash commands will build the executable files. The nlopt default installation directory $HOME/install was used to access the nlopt include and library files.

APP=pca
FLAGS="-DUSE_LINEAR -std=c99 -O3 -openmp -fgnu89-inline "
INC=$HOME/install/include
LIB=$HOME/install/lib

icc $FLAGS genData.c -o gen_$APP

icc $FLAGS ../train.c -I . -I $INC  -L $LIB -lnlopt -lm -o train_$APP.off

icc $FLAGS -Wno-unknown-pragmas -no-offload -O3 ../train.c -I . -I $INC \
				-L $LIB -lnlopt -lm -o train_$APP.omp

icc $FLAGS -Wno-unknown-pragmas -no-offload ../pred.c -I . -lm -o pred_$APP

FLAGS+=" -mmic -Wno-unknown-pragmas"
INC=$HOME/install_mic/include
LIB=$HOME/install_mic/lib

icc $FLAGS ../train.c -I . -I $INC   -L $LIB -lnlopt -lm -o train_$APP.mic

These commands make up the BUILD script, which will create the following applications:

  • gen_pca: Generates the PCA data set.
  • train_pca.mic: The native mode training application.
  • train_pca.off: The offload mode training application.
  • train_pca.omp: A training application that will run in parallel on the host processor cores.
  • pred_pca: The sequential prediction program that will run on the host.

Fitting a PCA Autoencoder Using Offload Mode

The following RUN_OFFLOAD script generates a PCA data set of 30,000,000 observations with a variance of 0.1, which the offload mode train_pca.off executable will fit. A 1000-point prediction set with zero variance is used for prediction purposes. The UNIX tail command strips the informative messages from the beginning of the prediction results and saves the remainder in the file plot.txt, making the final result easy to graph. The original results are kept in the file output.txt.

APP=pca
EXAMPLES=30000000
VAR=0.1

./gen_$APP $APP.train.dat $EXAMPLES $VAR 1234
./train_$APP.off $APP.train.dat $APP.param 
./gen_$APP $APP.pred.dat 1000 0 1
./pred_$APP $APP.param $APP.pred.dat > output.txt

# create file for gnuplot
tail -n +3 output.txt > plot.txt

rm *.dat
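The tail -n +3 invocation in the script keeps everything from the third line of output.txt onward, dropping the two header lines; in isolation:

```shell
# Fake prediction output: two informative lines, then the data.
printf 'header line 1\nheader line 2\n0.1 0.2\n0.3 0.4\n' > output.txt
tail -n +3 output.txt > plot.txt   # plot.txt now holds only the data lines
cat plot.txt
rm output.txt plot.txt
```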

Results in:

$ sh RUN_OFFLOAD 
myFunc generated_PCA_func LINEAR()
nExamples 30000000
Number Parameters 83
Optimization Time 94.0847
found minimum 28.28534843 ret 1
number OMP threads 240
DataLoadTime 3.02742
AveObjTime 0.00425339, countObjFunc 22108, totalObjTime 94.0339
Estimated flops in myFunc 128, estimated average GFlop/s 902.81
Estimated maximum GFlop/s 942.816, minimum GFLop/s 13.2578

The offload mode training averaged 902 GFlop/s, below the 942 GFlop/s maximum, in part because at least one call to the objective function ran at only 13 GFlop/s. Use of the timing framework in "Getting to 1 Teraflop on the Phi" shows that the first few calls to the Phi can be slow relative to the performance of the remaining calls.

The gnuplot application is used to generate a scatterplot of the predicted data using the fitted parameters:

  gnuplot -e "unset key; set term png; set output \"pca_pred.png\"; \
     plot \"plot.txt\" u 5:6"

Comparing the resulting graph (Figure 3) with the data in Figure 2 shows that the optimized autoencoder found a reasonable-looking fit.

Offload mode PCA line prediction
Figure 3: Offload mode PCA line prediction.

VTune Performance Analysis

Symbol table information can be added to the offload compilation command with the -g option, which enables the runtime behavior of the Phi coprocessor to be examined with the Intel VTune performance analyzer. The following commands were used to compile train_pca.off for VTune analysis. Note the addition of the -g option to the FLAGS variable:

APP=pca
FLAGS="-g -DUSE_LINEAR -std=gnu99 -O3 -openmp"
INC=$HOME/install/include
LIB=$HOME/install/lib

icc $FLAGS ../train.c -I . -I $INC  -L $LIB -lnlopt -lm -o train_$APP.off

The basic information for utilizing VTune can be found in the Intel VTune Amplifier XE 2013 Start Here document.

Once configured, start amplxe-gui and run the "Knights Corner Platform — Lightweight Hotspots Analysis." Set "automatically stop application" to 60 seconds. Click start, and the summary screen shown in Figure 4 will appear after the application runs for one minute.

Numerical Optimization and VTune Profiling
Figure 4: CPI and Top Hotspots.

The summary shows that most of the runtime is spent in myFunc(). Reviewing the Intel document "Intel VTune Performance Analyzer Basics: What Is CPI and How Do I Use It?" makes it reasonable to suspect that the warning about the high CPI is the result of heavy use of the per-core wide-vector units on the KNC chip.

Some of the time is spent in the Intel Xeon Phi coprocessor threads and some on the CPU. In a tightly coupled calculation like a reduction, the slowest thread is the one that controls the overall runtime.

Looking at the source-level view in Figure 5, we see that the dot products consume a significant amount of the runtime (left side). The VTune analyzer highlights the assembly-language instructions in the Assembly view (right side) associated with the C source line selected on the left. The highlighted instructions show that the dot product is composed of a data move and a wide-vector fused multiply-add instruction, VFMADD213PS. The "Intel Xeon Phi Coprocessor Instruction Set Architecture Reference Manual" states:

"VFMADD213PS - Multiply First Source By Destination and Add Second Source Float32 Vectors. Performs an element-by-element multiplication between float32 vector zmm2 and float32 vector zmm1 and then adds the result to the float32 vector result of the swizzle/broadcast/conversion process on memory or vector float32 zmm3. The final sum is written into float32 vector zmm1."

VTune Source and Assembly views
Figure 5: VTune Source and Assembly views.

From this brief analysis using VTune, we have a fair degree of confidence that most of the computation time is spent in myFunc() performing efficient fused multiply-add wide-vector instructions.

