Listing Four, genData.c, creates a cigar-shaped cloud of points around a two-dimensional line. Figure 2 shows the point-cloud distribution produced by the compiled genData.c application for one hundred data points with a variance of 0.1.
Figure 2: Example of a linear dataset for PCA.
The program writes the data in binary format to a user-specified file for use in the data-fitting step. The data can also be piped through stdout to another program (say, one running natively on the Intel Xeon Phi coprocessor) by specifying a filename of "-" on the command line.
Listing Four
// Rob Farber
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// get a uniform random number between -1 and 1
inline float f_rand() {
  return 2*(rand()/((float)RAND_MAX)) -1.;
}

void genData(FILE *fn, int nVec, float xVar)
{
  float xMax = 1.1; float xMin = -xMax;
  float xRange = (xMax - xMin);

  // write header info
  uint32_t nInput=2; fwrite(&nInput,sizeof(int32_t), 1, fn);
  uint32_t nOutput=0; fwrite(&nOutput,sizeof(int32_t), 1, fn);
  uint32_t nExamples=nVec; fwrite(&nExamples,sizeof(int32_t), 1, fn);

  for(int i=0; i < nVec; i++) {
    float t = xRange * f_rand();
    float z1 = t + xVar * f_rand();
#ifdef USE_LINEAR
    float z2 = t + xVar * f_rand();
#else
    float z2 = t*t*t + xVar * f_rand();
#endif
    fwrite(&z1, sizeof(float), 1, fn);
    fwrite(&z2, sizeof(float), 1, fn);
  }
}

int main(int argc, char *argv[])
{
  if(argc < 5) {
    fprintf(stderr,"Use: filename nVec variance seed\n");
    exit(1);
  }
  char *filename=argv[1];
  FILE *fn=stdout;
  if(strcmp("-", filename) != 0)
    fn=fopen(filename,"w");
  if(!fn) {
    fprintf(stderr,"Cannot open %s\n",filename);
    exit(1);
  }

  int nVec = atoi(argv[2]);
  float variance = atof(argv[3]);
  srand(atoi(argv[4]));

  genData(fn, nVec, variance);

  if(fn != stdout) fclose(fn);
  return 0;
}
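The binary format written by genData() is simply a three-value uint32_t header (nInput, nOutput, nExamples) followed by nExamples pairs of float values. The short reader below is a minimal sketch, not part of the article's sources, showing one way to read the file back and dump the points as text:

// readData.c: minimal sketch (hypothetical helper, not part of the article's sources)
// that reads the binary file written by genData() and prints the points as text.
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
  if(argc < 2) { fprintf(stderr,"Use: filename\n"); exit(1); }
  FILE *fn = fopen(argv[1],"r");
  if(!fn) { fprintf(stderr,"Cannot open %s\n",argv[1]); exit(1); }

  uint32_t nInput, nOutput, nExamples;
  // header: number of inputs, number of outputs, number of examples
  fread(&nInput,    sizeof(uint32_t), 1, fn);
  fread(&nOutput,   sizeof(uint32_t), 1, fn);
  fread(&nExamples, sizeof(uint32_t), 1, fn);
  printf("# nInput %u nOutput %u nExamples %u\n", nInput, nOutput, nExamples);

  // each example is nInput+nOutput consecutive floats (2+0 for this dataset)
  for(uint32_t i=0; i < nExamples; i++) {
    float z[2];                       // sized for the two-value examples used here
    if(fread(z, sizeof(float), nInput+nOutput, fn) != nInput+nOutput) break;
    printf("%g %g\n", z[0], z[1]);
  }
  fclose(fn);
  return 0;
}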
Building and Running the PCA Analysis
Once the nlopt library has been built, copy the train.c and pred.c source files from this article to disk, along with the genFunc.py Python script from the previous article. Then create a subdirectory called pca, change into it, and copy the genData.c and myFunc.h files there. Now use the genFunc.py script to create a 2x10x1x10x2 autoencoder:
python ../genFunc.py > fcn.h
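For orientation, a 2x10x1x10x2 autoencoder maps the two inputs through hidden layers of 10, 1, and 10 neurons back to two outputs; the single-neuron bottleneck forces the network to learn a one-dimensional representation of the data, which is what lets it act as a nonlinear PCA. The sketch below is not the generated fcn.h; it only illustrates the forward pass of such a network, and the tanhf() activation and parameter layout are assumptions made for illustration:

// Schematic forward pass of a 2x10x1x10x2 autoencoder (NOT the generated fcn.h).
// The tanhf() activation and the parameter layout are illustrative assumptions.
#include <math.h>

static void layer(int nIn, int nOut, const float *in, float *out,
                  const float **p)            // p walks through the parameter vector
{
  for(int j=0; j < nOut; j++) {
    float sum = *(*p)++;                      // bias
    for(int i=0; i < nIn; i++)
      sum += *(*p)++ * in[i];                 // weight * input
    out[j] = tanhf(sum);
  }
}

// maps (in[0],in[1]) -> hidden 10 -> bottleneck 1 -> hidden 10 -> (out[0],out[1])
void autoencoder(const float *param, const float in[2], float out[2])
{
  float h1[10], h2[1], h3[10];
  const float *p = param;
  layer( 2, 10, in, h1, &p);
  layer(10,  1, h1, h2, &p);
  layer( 1, 10, h2, h3, &p);
  layer(10,  2, h3, out, &p);                 // a linear output layer would drop the tanhf here
}

Counting weights and biases in this layout gives 30 + 11 + 20 + 22 = 83 parameters, which agrees with the "Number Parameters 83" reported by the training run later in this article.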
The following bash commands will build the executable files. The default nlopt installation directory, $HOME/install, was used to access the nlopt include and library files.
APP=pca FLAGS="-DUSE_LINEAR -std=c99 -O3 -openmp -fgnu89-inline " INC=$HOME/install/include LIB=$HOME/install/lib icc $FLAGS genData.c -o gen_$APP icc $FLAGS ../train.c -I . -I $INC -L $LIB -lnlopt -lm -o train_$APP.off icc $FLAGS -Wno-unknown-pragmas -no-offload -O3 ../train.c -I . -I $INC \ -L $LIB -lnlopt -lm -o train_$APP.omp icc $FLAGS -Wno-unknown-pragmas -no-offload ../pred.c -I . -lm -o pred_$APP FLAGS+=" -mmic -Wno-unknown-pragmas" INC=$HOME/install_mic/include LIB=$HOME/install_mic/lib icc $FLAGS ../train.c -I . -I $INC -L $LIB -lnlopt -lm -o train_$APP.mic
These commands make up the BUILD script, which creates the following applications:

gen_pca: Generates the PCA data set.
train_pca.mic: The native mode training application.
train_pca.off: The offload mode training application.
train_pca.omp: A training application that will run in parallel on the host processor cores.
pred_pca: The sequential prediction program that will run on the host.
Fitting a PCA Autoencoder Using Offload Mode
The following RUN_OFFLOAD script generates a PCA data set of 30,000,000 observations with a variance of 0.1, which the offload mode train_pca.off executable will fit. A 1000-point prediction set with zero variance is used for prediction purposes. The UNIX tail command strips off the informative messages at the beginning of the prediction results and saves the remainder in the file plot.txt, which makes it easy to graph the final result. The original results are kept in the output.txt file.
APP=pca
EXAMPLES=30000000
VAR=0.1

./gen_$APP $APP.train.dat $EXAMPLES $VAR 1234
./train_$APP.off $APP.train.dat $APP.param

./gen_$APP $APP.pred.dat 1000 0 1
./pred_$APP $APP.param $APP.pred.dat > output.txt

# create file for gnuplot
tail -n +3 output.txt > plot.txt
rm *.dat
Results in:
$ sh RUN_OFFLOAD
myFunc generated_PCA_func LINEAR()
nExamples 30000000
Number Parameters 83
Optimization Time 94.0847
found minimum 28.28534843 ret 1
number OMP threads 240
DataLoadTime 3.02742
AveObjTime 0.00425339, countObjFunc 22108, totalObjTime 94.0339
Estimated flops in myFunc 128, estimated average GFlop/s 902.81
Estimated maximum GFlop/s 942.816, minimum GFLop/s 13.2578
The offload mode training averaged 902 gigaflops per second, in part because at least one call to the objective function ran at only 13 gigaflops per second. Use of the timing framework from "Getting to 1 teraflop on the Phi" shows that the first few calls to the Phi can be slow relative to the performance of the remaining calls.
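A minimal sketch of that kind of per-call timing, using omp_get_wtime() around a stand-in objective function (the objFunc() workload below is hypothetical, not the article's actual instrumentation), looks like this:

// Per-call timing sketch (illustrative only, not the article's instrumentation).
// Recording every call's wall-clock time separates slow warm-up calls from
// steady-state performance when computing average and minimum GFlop/s.
#include <stdio.h>
#include <omp.h>

// stand-in for the real objective function (hypothetical workload)
static double objFunc(int n)
{
  double sum = 0.;
#pragma omp parallel for reduction(+:sum)
  for(int i=0; i < n; i++) sum += (double)i * 1e-9;
  return sum;
}

int main(void)
{
  const int nCalls = 20, n = 10000000;
  double minT = 1e30, maxT = 0., totalT = 0.;

  for(int i=0; i < nCalls; i++) {
    double t0 = omp_get_wtime();
    objFunc(n);
    double t = omp_get_wtime() - t0;
    if(t < minT) minT = t;
    if(t > maxT) maxT = t;
    totalT += t;
  }
  // the maximum per-call time typically comes from the first (warm-up) calls
  printf("AveObjTime %g, min %g, max %g\n", totalT/nCalls, minT, maxT);
  return 0;
}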
The gnuplot application is used to generate a scatterplot of the predicted data using the fitted parameters:
gnuplot -e "unset key; set term png; set output \"pca_pred.png\"; \
            plot \"plot.txt\" u 5:6"
Comparing the resulting graph (Figure 3) with Figure 2 shows that the optimized autoencoder did find a reasonable-looking fit to the data.
Figure 3: Offload mode PCA line prediction.
VTune Performance Analysis
Symbol table information can be added to the offload compilation command with the -g option, which allows the runtime behavior of the Phi coprocessor to be examined with the Intel VTune performance analyzer. The following commands were used to compile train_pca.off for VTune analysis. Note the addition of the -g option to the FLAGS variable:
APP=pca FLAGS="-g -DUSE_LINEAR -std=gnu99 -O3 -openmp" INC=$HOME/install/include LIB=$HOME/install/lib icc $FLAGS ../train.c -I . -I $INC -L $LIB -lnlopt -lm -o train_$APP.off
The basic information for utilizing VTune can be found in the Intel VTune Amplifier XE 2013 Start Here document.
Once configured, start amplxe-gui and select the "Knights Corner Platform Lightweight Hotspots Analysis." Set "automatically stop application" to 60 seconds. Click start, and the summary screen shown in Figure 4 will appear after the application runs for one minute.
Figure 4: CPI and Top Hotspots.
The summary shows that most of the runtime is spent in myFunc(). Reviewing the Intel document "Intel VTune Performance Analyzer Basics: What Is CPI and How Do I Use It?" makes it reasonable to suspect that the warning about the high CPI is the result of heavy use of the per-core wide-vector units on the KNC chip.
Some of the time is spent in the Intel Xeon Phi coprocessor threads and on the CPU. In a tightly coupled calculation like a reduction, the slowest thread is the one that controls the overall runtime.
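To see why, consider a simple OpenMP reduction (an illustrative example, not code from train.c): the combined result cannot be produced until the slowest thread has finished its share of the partial sums.

// Illustrative OpenMP reduction (not from train.c): the combined sum is not
// available until every thread has finished its private partial sum, so the
// slowest thread determines the runtime of the whole reduction.
#include <stdio.h>
#include <omp.h>

int main(void)
{
  const int n = 100000000;
  double sum = 0.;

#pragma omp parallel for reduction(+:sum)
  for(int i=0; i < n; i++)
    sum += (double)i;           // each thread accumulates a private partial sum

  printf("threads %d sum %g\n", omp_get_max_threads(), sum);
  return 0;
}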
Looking at the source-level view in Figure 5, we see that the dot products consume a significant amount of the runtime (see the left side). The VTune analyzer very nicely highlights the assembly-language instructions in the Assembly view (shown on the right) associated with the C source line on the left. The highlighted assembly instructions show that the dot product is composed of a data move and a wide-vector fused multiply-add instruction, VFMADD213PS. The "Intel Xeon Phi Coprocessor Instruction Set Architecture Reference Manual" states:

"VFMADD213PS - Multiply First Source By Destination and Add Second Source Float32 Vectors: Performs an element-by-element multiplication between float32 vector zmm2 and float32 vector zmm1 and then adds the result to the float32 vector result of the swizzle/broadcast/conversion process on memory or vector float32 zmm3. The final sum is written into float32 vector zmm1."
Figure 5: VTune Source and Assembly views.
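To connect the source view to that instruction, a dot-product loop of the kind highlighted in Figure 5 looks roughly like the sketch below (illustrative code; the function and variable names are not taken from myFunc()):

// Illustrative dot-product loop (names are not from myFunc()).
// With icc -O3 targeting the Xeon Phi, a loop of this form is typically
// vectorized into 512-bit loads plus VFMADD213PS fused multiply-add
// instructions, accumulating 16 float products per instruction.
float dotProduct(int n, const float * restrict a, const float * restrict b)
{
  float sum = 0.f;
  for(int i=0; i < n; i++)
    sum += a[i] * b[i];     // one multiply and one add per element: a fused multiply-add
  return sum;
}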
From this brief analysis with VTune, we have a fair degree of confidence that most of the computation time is spent in myFunc() performing efficient fused multiply-add wide-vector instructions.