# Numerical and Computational Optimization on the Intel Phi

Listing Four, `genData.c`, creates a cigar-shaped cloud of points around a two-dimensional line. Figure 2 shows the point-cloud distribution produced by the compiled `genData.c` application for one hundred data points with a 0.1 variance.

Figure 2: Example of a linear dataset for PCA.

The program writes the data in binary format to a user-specified file for use in the data-fitting step. The data can be piped through `stdout` to another program (say, one running natively on the Intel Xeon Phi coprocessor) by specifying a filename of "-" on the command line.

Listing Four

```c
// Rob Farber
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// get a uniform random number between -1 and 1
inline float f_rand() {
  return 2*(rand()/((float)RAND_MAX)) - 1.f;
}

void genData(FILE *fn, int nVec, float xVar)
{
  float xMax = 1.1; float xMin = -xMax;
  float xRange = (xMax - xMin);

  uint32_t nInput=2; fwrite(&nInput,sizeof(uint32_t), 1, fn);
  uint32_t nOutput=0; fwrite(&nOutput,sizeof(uint32_t), 1, fn);
  uint32_t nExamples=nVec; fwrite(&nExamples,sizeof(uint32_t), 1, fn);

  for(int i=0; i < nVec; i++) {
    float t = xRange * f_rand();
    float z1 = t + xVar * f_rand();
#ifdef USE_LINEAR
    float z2 = t + xVar * f_rand();
#else
    float z2 = t*t*t + xVar * f_rand();
#endif
    fwrite(&z1, sizeof(float), 1, fn);
    fwrite(&z2, sizeof(float), 1, fn);
  }
}

int main(int argc, char *argv[])
{
  if(argc < 5) { // need filename, nVec, variance, and seed
    fprintf(stderr,"Use: %s filename nVec variance seed\n", argv[0]);
    exit(1);
  }
  char *filename=argv[1];
  FILE *fn=stdout;

  if(strcmp("-", filename) != 0)
    fn=fopen(filename,"wb"); // binary output

  if(!fn) {
    fprintf(stderr,"Cannot open %s\n",filename);
    exit(1);
  }
  int nVec = atoi(argv[2]);
  float variance = atof(argv[3]);
  srand(atoi(argv[4]));
  genData(fn, nVec, variance);

  if(fn != stdout) fclose(fn);
  return 0;
}
```

### Building and Running the PCA Analysis

Once the nlopt library has been built, copy the `train.c` and `pred.c` source files from this article to disk, along with the Python script `genFunc.py` from the previous article. Create a subdirectory called `pca`, change into it, and copy the `genData.c` and `myFunc.h` files there. Now use the `genFunc.py` script to create a 2x10x1x10x2 autoencoder:

`  python ../genFunc.py > fcn.h`

The following bash commands will build the executable files. The nlopt default installation directory `$HOME/install` was used to access the nlopt include and library files.

```bash
APP=pca
FLAGS="-DUSE_LINEAR -std=c99 -O3 -openmp -fgnu89-inline"
INC=$HOME/install/include
LIB=$HOME/install/lib

icc $FLAGS genData.c -o gen_$APP

icc $FLAGS ../train.c -I . -I $INC -L $LIB -lnlopt -lm -o train_$APP.off

icc $FLAGS -Wno-unknown-pragmas -no-offload ../train.c -I . -I $INC \
    -L $LIB -lnlopt -lm -o train_$APP.omp

icc $FLAGS -Wno-unknown-pragmas -no-offload ../pred.c -I . -lm -o pred_$APP

FLAGS+=" -mmic -Wno-unknown-pragmas"
INC=$HOME/install_mic/include
LIB=$HOME/install_mic/lib

icc $FLAGS ../train.c -I . -I $INC -L $LIB -lnlopt -lm -o train_$APP.mic
```

These commands make up the `BUILD` script, which will create the following applications:

• `gen_pca`: Generates the PCA data set.
• `train_pca.mic`: The native mode training application.
• `train_pca.off`: The offload mode training application.
• `train_pca.omp`: A training application that will run in parallel on the host processor cores.
• `pred_pca`: The sequential prediction program that will run on the host.

### Fitting a PCA Autoencoder Using Offload Mode

The following `RUN_OFFLOAD` script will generate a PCA data set of 30,000,000 observations with a variance of 0.1 that the offload mode `train_pca.off` executable will fit. A 1000-point prediction set with zero variance will be used for prediction purposes. The UNIX `tail` command strips off the informative messages at the beginning of the prediction results and saves the remainder in the file `plot.txt` to make it easy to graph the final result. The original results are kept in the `output.txt` file.

```bash
APP=pca
EXAMPLES=30000000
VAR=0.1

./gen_$APP $APP.train.dat $EXAMPLES $VAR 1234
./train_$APP.off $APP.train.dat $APP.param
./gen_$APP $APP.pred.dat 1000 0 1
./pred_$APP $APP.param $APP.pred.dat > output.txt

# create file for gnuplot
tail -n +3 output.txt > plot.txt

rm *.dat
```

Results in:

```
$ sh RUN_OFFLOAD
myFunc generated_PCA_func LINEAR()
nExamples 30000000
Number Parameters 83
Optimization Time 94.0847
found minimum 28.28534843 ret 1
AveObjTime 0.00425339, countObjFunc 22108, totalObjTime 94.0339
Estimated flops in myFunc 128, estimated average GFlop/s 902.81
Estimated maximum GFlop/s 942.816, minimum GFLop/s 13.2578
```

The offload mode training averaged 902 gigaflops, below the 942 gigaflop maximum in part because at least one call to the objective function ran at only 13 gigaflops per second. Use of the timing framework in "Getting to 1 Teraflop on the Phi" shows that the first few calls to the Phi can be slow relative to the performance of the remaining calls.

The gnuplot application is used to generate a scatterplot of the predicted data using the fitted parameters:

```bash
gnuplot -e "unset key; set term png; set output \"pca_pred.png\"; \
plot \"plot.txt\" u 5:6"
```

Comparing the resulting graph (Figure 3) with Figure 2 shows that the optimized autoencoder did find a reasonable-looking fit to the data.

Figure 3: Offload mode PCA line prediction.

### VTune Performance Analysis

Symbol table information can be added to the offload compilation command with the `-g` option. This option enables the runtime behavior of the Phi coprocessor to be examined with the Intel VTune performance analyzer. The following commands were used to compile `train_pca.off` for VTune analysis. Note the addition of the `-g` option to the `FLAGS` variable:

```bash
APP=pca
FLAGS="-g -DUSE_LINEAR -std=gnu99 -O3 -openmp"
INC=$HOME/install/include
LIB=$HOME/install/lib

icc $FLAGS ../train.c -I . -I $INC -L $LIB -lnlopt -lm -o train_$APP.off
```

The basic information for utilizing VTune can be found in the Intel VTune Amplifier XE 2013 Start Here document.

Once configured, start `amplxe-gui` and run the "Knights Corner Platform - Lightweight Hotspots Analysis." Set "automatically stop application" to 60 seconds. Click Start, and the summary screen shown in Figure 4 will appear after the application runs for one minute.

Figure 4: CPI and Top Hotspots.

The summary shows that most of the runtime is spent in `myFunc()`. Reviewing the Intel document "Intel VTune Performance Analyzer Basics: What Is CPI and How Do I Use It?" makes it reasonable to suspect that the warning about the high CPI is the result of heavy use of the per-core wide-vector units on the KNC chip.

Some of the time is spent in the Intel Xeon Phi coprocessor threads and on the CPU. In a tightly coupled calculation like a reduction, the slowest thread is the one that controls the overall runtime.

Looking at the source-level view in Figure 5, we see that the dot products consume a significant amount of the runtime (see the left side). The VTune analyzer very nicely highlights the assembly-language instructions in the Assembly view (shown on the right) associated with the C source line on the left. The highlighted assembly instructions show that the dot product is composed of a data move and a wide-vector fused multiply-add instruction, `VFMADD213PS`. The "Intel Xeon Phi Coprocessor Instruction Set Architecture Reference Manual" states:

"Performs an element-by-element multiplication between float32 vector zmm2 and float32 vector zmm1 and then adds the result to the float32 vector result of the swizzle/broadcast/conversion process on memory or vector float32 zmm3. The final sum is written into float32 vector zmm1."

Figure 5: VTune Source and Assembly views.

From this brief analysis using VTune, we have a fair degree of confidence that most of the computation time is spent in `myFunc()` performing efficient fused multiply-add wide-vector instructions.
