Fitting a PCA Autoencoder Using Native Mode
The following commands utilize the train_pca.mic executable to fit the data. Note that the data is piped to the executable to preserve precious onboard RAM resources. The variable DEV can be modified to run on any Phi coprocessor in the system. In this example, DEV is set to mic1. The scp command is used to transfer data and results between the Intel Xeon Phi coprocessor and host. It is assumed that the libiomp5.so shared object file was previously copied to /tmp on the device.
APP=pca
DEV=mic1
scp train_$APP.mic $DEV:
./gen_$APP - 30000000 0.1 1234 \
   | ssh $DEV "export LD_LIBRARY_PATH=/tmp; ./train_$APP.mic - $APP.param"
scp $DEV:$APP.param .
#clean up
ssh $DEV "rm train_$APP.mic $APP.param"
./gen_$APP - 1000 0 1 | ./pred_$APP $APP.param - > output.txt
# create file for gnuplot
tail -n +3 output.txt > plot.txt
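The "-" filename convention that makes this piping work is a small idiom in C. A minimal sketch of the idea, assuming the trainer takes the data source as its first argument (the helper names here are illustrative, not the article's actual train.c source):

```c
#include <stdio.h>
#include <string.h>

/* Open the training-data source. Mirroring the script above, a filename
 * of "-" means read from stdin, so data piped over ssh never has to be
 * staged on the coprocessor's small RAM-backed filesystem.
 * (Illustrative helper, not the article's actual train.c code.) */
static FILE *open_data(const char *name)
{
    if (strcmp(name, "-") == 0)
        return stdin;            /* data arrives through the pipe   */
    return fopen(name, "rb");    /* otherwise read an ordinary file */
}

/* Close only real files; stdin belongs to the caller. */
static void close_data(FILE *f)
{
    if (f && f != stdin)
        fclose(f);
}
```

With this convention, `./gen_pca - ... | ssh ...` streams the generated data directly into the trainer with no intermediate file on the device.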
Figure 6 shows the performance of the linear 2x10x1x10x2 autoencoder-based objective function as the data set size varies and according to processing mode. As can be seen, native mode performance on the Phi coprocessor quickly outstrips both offload mode and the 3.3 GHz Westmere x5680 dual-socket host processor.
Figure 6: Performance of a linear 2x10x1x10x2 PCA autoencoder according to size, machine, and mode.
The performance of the offload mode gradually improves as the latency and bandwidth limitations of the PCIe bus are overshadowed by the runtime of the objective function. Offload performance is expected to improve with time, especially as offload is the only way to utilize multiple devices within a system or as MPI processes in a compute cluster.
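Why this happens can be seen with a back-of-the-envelope model (the numbers below are illustrative assumptions, not measured values from the article): every offload evaluation pays a roughly fixed PCIe cost, so the effective rate approaches the device's native rate only as the compute time per evaluation grows.

```c
/* Toy model of offload efficiency: each objective-function evaluation
 * pays a fixed PCIe transfer overhead on top of its compute time.
 * All parameters are illustrative assumptions, not measured values. */
static double effective_gflops(double native_gflops,
                               double compute_seconds,
                               double pcie_seconds)
{
    double flops = native_gflops * 1e9 * compute_seconds;    /* useful work     */
    return flops / ((compute_seconds + pcie_seconds) * 1e9); /* wall-clock rate */
}
```

With a hypothetical 300 GFlop/s device and a fixed 1-second transfer cost, a 1-second evaluation runs at an effective 150 GFlop/s, while a 100-second evaluation runs at roughly 297 GFlop/s, which is the same trend the survey shows.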
A Nonlinear Principal Components Optimization
While PCA fits straight lines, NLPCA can utilize continuous open or closed curves to account for variance in the data. As a result, NLPCA can represent nonlinear problems in a lower-dimensional space. NLPCA has wide applicability to numerous challenging problems, including image and handwriting analysis, biological modeling, climate, and chemistry.
Building and Running the NLPCA Analysis
An NLPCA analysis can be performed by changing the definition of the G() function in myFunc.h, which is done simply by editing the "-DUSE_LINEAR" flag in the build script. The source code was designed to make this process as easy as copying the pca directory to an nlpca directory and editing the BUILD script. Here is the complete BUILD script for the nlpca directory:
APP=nlpca
FLAGS="-DUSE_ELLIOTT -std=gnu99 -O3 -openmp "
INC=$HOME/install/include
LIB=$HOME/install/lib

icc $FLAGS genData.c -o gen_$APP
icc $FLAGS ../train.c -I . -I $INC -L $LIB -lnlopt -lm -o train_$APP.off
icc $FLAGS -Wno-unknown-pragmas -no-offload -O3 ../train.c -I . -I $INC \
   -L $LIB -lnlopt -lm -o train_$APP.omp
icc $FLAGS -Wno-unknown-pragmas -no-offload ../pred.c -I . -lm -o pred_$APP

FLAGS+=" -mmic -Wno-unknown-pragmas"
INC=$HOME/install_mic/include
LIB=$HOME/install_mic/lib
icc $FLAGS ../train.c -I . -I $INC -L $LIB -lnlopt -lm -o train_$APP.mic
These commands comprise the BUILD script, which will create the following applications:
gen_nlpca: Generates the NLPCA data set.
train_nlpca.mic: The native mode training application.
train_nlpca.off: The offload mode training application.
train_nlpca.omp: A training application that will run in parallel on the host processor cores.
pred_nlpca: The sequential prediction program that will run on the host.
The Elliott activation function, x/(1+|x|), used in this article is nice for timing purposes because we know how many floating-point operations it requires. Unfortunately, this activation function, as noted in "A Better Activation Function for Artificial Neural Networks," may require more optimization steps to reach a solution than more-conventional activation functions when solving real problems. Several conventional activation functions, such as tanh() and the logistic function, can be enabled by simply changing the definition of G() with a preprocessor define. For timing purposes, each call to expf() is assumed to take seven floating-point operations. Note that this is only an estimate because the number of instructions required for each of these functions varies. However, you can use this code to experiment with different activation functions.
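The compile-time selection of G() might be sketched as follows. This is a hypothetical reconstruction of the idea behind myFunc.h and the -DUSE_LINEAR/-DUSE_ELLIOTT flags, not the article's actual source, and the per-call flop estimates are assumptions in the spirit of the expf() guess above:

```c
#include <math.h>

/* Select the activation function at compile time, as the BUILD script
 * does with -DUSE_LINEAR or -DUSE_ELLIOTT. Hypothetical sketch of the
 * idea behind myFunc.h; flop counts per call are assumptions. */
#if defined(USE_LINEAR)
#  define G(x)     (x)                        /* PCA: identity                */
#  define G_FLOPS  0
#elif defined(USE_ELLIOTT)
#  define G(x)     ((x) / (1.0f + fabsf(x)))  /* Elliott: add, fabsf, divide  */
#  define G_FLOPS  3
#else
#  define G(x)     tanhf(x)                   /* conventional activation      */
#  define G_FLOPS  7                          /* assumed, like expf()         */
#endif
```

Because G() is a macro resolved at compile time, the optimizer can still vectorize the inner loops regardless of which activation is chosen.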
Fitting an NLPCA Autoencoder Using Offload Mode
The following bash script is nearly identical to RUN_OFFLOAD in the pca directory, but the modified script will create an NLPCA data set of 30,000,000 observations generated with a variance of 0.1 that the offload mode train_nlpca.off executable will fit. A 1,000-point prediction set with zero variance will be used for prediction purposes. This is identical in size and character to the pca runs. The UNIX tail command strips off the informative messages at the beginning of the prediction results, saving the remainder in the file plot.txt to make it easy to graph the final result. The original results are kept in the output.txt file.
APP=nlpca
EXAMPLES=30000000
VAR=0.1
./gen_$APP $APP.train.dat $EXAMPLES $VAR 1234
./train_$APP.off $APP.train.dat $APP.param
./gen_$APP $APP.pred.dat 1000 0 1
./pred_$APP $APP.param $APP.pred.dat > output.txt
# create file for gnuplot
tail -n +3 output.txt > plot.txt
rm *.dat
The output of the NLPCA training when running in offload mode on the Intel Xeon Phi coprocessor follows. Note that the objective function was called 109,197 times and delivered an average of 342 gigaflops of performance.
$ sh RUN_OFFLOAD
myFunc generated func Eliott activation: x/(1+fabsf(x))
nExamples 30000000
Number Parameters 83
Optimization Time 1800.01
found minimum 39.71399312 ret 6
number OMP threads 240
DataLoadTime 3.02155
AveObjTime 0.0164814, countObjFunc 109197, totalObjTime 1799.72
Estimated flops in myFunc 188, estimated average GFlop/s 342.203
Estimated maximum GFlop/s 345.952, minimum GFLop/s 14.8949
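The reported average rate can be reproduced from the other numbers in the log: 188 estimated flops per example times 30,000,000 examples per objective call, divided by the 0.0164814-second average call time. A minimal sketch of that arithmetic:

```c
/* Reproduce the "estimated average GFlop/s" line from the log:
 * (flops per example) x (examples per objective call) / (average
 * objective time), expressed in units of 1e9 flops per second. */
static double avg_gflops(double flops_per_example,
                         double n_examples,
                         double ave_obj_seconds)
{
    return flops_per_example * n_examples / ave_obj_seconds / 1e9;
}

/* avg_gflops(188.0, 30e6, 0.0164814) comes out near 342.2, matching
 * the "estimated average GFlop/s 342.203" line in the log. */
```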
The resulting graph (Figure 7) shows that the optimized autoencoder found a reasonable-looking fit to the data shown in Figure 6.
Figure 7: Offload mode NLPCA line prediction.
VTune Performance Analysis
After building the train_nlpca.off executable with the -g flag, running amplxe-gui, and performing a Hot Spot analysis limited to one minute, we see that CPU usage is in the ideal range and that myFunc consumes most of the runtime. (The G() function also constitutes a hot spot.) As with the PCA timeline, most threads start and complete at the same time, and many appear to fully occupy their processing core. The dot products still consume a significant amount of runtime, and even a simple G() function adds instructions and data movement. VTune highlighted the assembly language instructions associated with the appropriate line of C source code (in this case, the small number of instructions that perform the Elliott activation in G()). However, this operation does not perform two operations per clock and hence slows overall performance.
Fitting an NLPCA Autoencoder Using Native Mode
The following commands utilize the train_nlpca.mic executable to fit the data. As with the pca run, the data is piped to the executable to preserve precious onboard RAM resources. The variable DEV can be modified to run on any Phi coprocessor in the system. In this example, DEV is set to mic1. The scp command is again used to transfer data and results between the Phi coprocessor and host.
APP=nlpca
DEV=mic1
scp train_$APP.mic $DEV:
./gen_$APP - 30000000 0.1 1234 \
   | ssh $DEV "export LD_LIBRARY_PATH=/tmp; ./train_$APP.mic - $APP.param"
scp $DEV:$APP.param .
#clean up
ssh $DEV "rm train_$APP.mic $APP.param"
./gen_$APP - 1000 0 1 | ./pred_$APP $APP.param - > output.txt
# create file for gnuplot
tail -n +3 output.txt > plot.txt
Figure 8 shows the performance of a 2x10x1x10x2 autoencoder-based objective function using the Elliott activation function as the data set size varies. The graph includes surveys of the host Westmere processor and of the Intel Xeon Phi coprocessor operating in both offload and native modes.
As can be seen, native mode performance on the Phi coprocessor quickly outstrips both the offload and the 3.3 GHz Westmere x5680 dual-socket host processor.
The performance of the offload mode gradually improves as the latency and bandwidth limitations of the PCIe bus are amortized over the runtime of the objective function. Offload performance is expected to improve with time, especially since offload is the only way to utilize multiple devices within a system or as MPI processes in a compute cluster.
Figure 8: Performance of a 2x10x1x10x2 NLPCA autoencoder according to size, machine, and mode.
This article demonstrates how to combine Phi coprocessor-based objective functions with existing numerical optimization libraries to solve real problems with high performance. The freely available nlopt library was built to run on both the host and the Intel Xeon Phi coprocessor. The objective functions discussed in "Getting to 1 Teraflop on the Intel Phi Coprocessor" were used to fit example data sets in both native and offload modes while delivering performance in the 300-gigaflop-to-teraflop-per-second range. A survey across problem sizes was performed to get a sense of how offload mode compares with native execution and against a 24-core 3.3 GHz Westmere processor set. Small problems in particular performed nicely on the Phi coprocessor in native mode due to the elimination of latencies associated with the PCIe bus.
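For readers adapting the framework, the coupling point is the objective-function callback that nlopt repeatedly invokes. A minimal sketch of its shape, with a trivial sum-of-squares standing in for the autoencoder error (the real train.c objective differs):

```c
#include <stddef.h>

/* Shape of an nlopt objective callback: the optimizer repeatedly calls
 * a function with this signature, and in this article's framework the
 * body would launch the offloaded or native coprocessor computation.
 * Here a trivial quadratic stands in for the autoencoder error. */
static double toy_objective(unsigned n, const double *x,
                            double *grad, void *data)
{
    (void)grad;                   /* derivative-free optimizer: unused */
    (void)data;                   /* would carry the training set      */
    double err = 0.0;
    for (unsigned i = 0; i < n; i++)
        err += x[i] * x[i];       /* least-squares-style error term    */
    return err;
}
```

In train.c-style code, such a callback is registered with nlopt_create() and nlopt_set_min_objective(), then driven by nlopt_optimize(); the "ret 6" in the training output corresponds to nlopt's NLOPT_MAXTIME_REACHED result, indicating the configured time limit expired.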
The Intel VTune performance analyzer confirmed that the application was effectively using the Intel Xeon Phi wide-vector instructions. Thread utilization across all the cores was excellent. The use of the VTune analyzer allowed us to examine hotspots and the memory bandwidth behavior of complex functions.
Finally, I encourage you to explore the Phi coprocessor performance envelope through the use of the provided Python code generator and by performing your own optimizations. The software framework in this article is general enough so that Phi coprocessors can be integrated into existing analytic workflows.
Rob Farber is a frequent contributor to Dr. Dobb's on CPU and GPGPU programming topics.