Generating an Autoencoder
The genFunc.py code can be used to generate a more complicated autoencoder that performs more floating-point operations per byte fetched from main memory and will get closer to peak performance on the Phi coprocessor. Copy the simpleExample directory to auto2x10x1x10x2 and generate the new fcn.h source code as shown in Example 11.
cp -r simpleExample auto2x10x1x10x2
python genFunc.py > auto2x10x1x10x2/fcn.h
Example 11: Commands to generate a PCA autoencoder.
The PCA version can be built by simply changing the command-line argument to the BUILD_LINEAR_TIMING script, as shown in Example 12. The nonlinear version can be built with the corresponding NLPCA_TIMING script.
sh BUILD_LINEAR_TIMING auto2x10x1x10x2
Example 12: Using the BUILD_LINEAR_TIMING shell script to build an autoencoder.
To facilitate testing, the NLPCA_TIMING shell scripts always create binaries of the same name. It is important to pay attention to which binary is being used and to the output that reports the function being evaluated. The binaries created are:
- timing.omp: an OpenMP executable that runs on the host processor
- timing.mic: a native mode Intel Xeon Phi coprocessor executable
- timing.off: an offload mode executable that will run on both the host and Intel Xeon Phi coprocessor
Running a linear PCA function using the timing code in auto2x10x1x10x2 in offload mode generates the output in Example 13.
$ ./timing.off 10 1000 60000000
TIMING TEST: generated_PCA_func LINEAR()
Runtime 8.41651, aveObjRuntime 0.00841619
number OMP threads 240
DataLoadTime 1.86613
AveObjTime 0.00841619, countObjFunc 1000, totalObjTime 8.41619
Estimated flops in myFunc 128, estimated average GFlop/s 912.527
Estimated maximum GFlop/s 979.096, minimum GFLop/s 319.506
Double check based on overall runtime
maxTime 0.0240381, minTime 0.00784516, aveTime 0.00841639 per call
maxGFLOPS 978.947, minGFLOPS 319.493, aveGFLOPS 912.505 per call
Example 13: Example of output for a linear offload run.
The output tells us that we are running a linear PCA function. In this example, the data transfer to the Phi coprocessor achieved 257 MB/second. This timing information should be reliable because the average runtime is consistent across both per-call measurement methods. The Phi coprocessor utilized 240 threads.
The key floating-point metrics show that average performance is 912 GFlop/s. The fastest offload runtimes achieved nearly a TFlop/s. However, there is nearly a 3x difference between the fastest and slowest runtimes, even though the code performed 10 warm-up runs.
The timing executable can run natively on the Phi coprocessor after copying the executable and libiomp5.so to the device and setting LD_LIBRARY_PATH correctly. (Alternatively, the micnativeloadex utility can be used, or a shell window can be kept open on the Intel Xeon Phi coprocessor.) See Example 14.
DEV=mic0
LIBDIR=/opt/intel/composer_xe_2013.0.079/compiler/lib/mic
scp $LIBDIR/libiomp5.so mic0:/tmp
scp timing.mic $DEV:
ssh $DEV "export LD_LIBRARY_PATH=/tmp; export OMP_NUM_THREADS=240; \
./timing.mic 10 1000 60000000"
Example 14: Commands to prepare and run on mic0.
Even though the number of threads is the same, the same code running in native mode was on average 1.173x faster than in offload mode, as shown in Example 15. Offload mode can be nearly as efficient as native mode when the time spent performing the computation is large relative to the latency of the data transfers on the PCIe bus. The runtime difference will increase as the problem size decreases.
$ sh RUN_NATIVE_TIMING
libiomp5.so    100%  956KB 955.6KB/s   00:00
timing.mic     100%   36KB  36.5KB/s   00:00
TIMING TEST: generated_PCA_func LINEAR()
Runtime 7.17827, aveObjRuntime 0.00717398
number OMP threads 240
DataLoadTime 0
AveObjTime 0.00717398, countObjFunc 1000, totalObjTime 7.17398
Estimated flops in myFunc 128, estimated average GFlop/s 1070.54
Estimated maximum GFlop/s 1086.89, minimum GFLop/s 1032.81
Double check based on overall runtime
maxTime 0.00743794, minTime 0.00706887, aveTime 0.00717716 per call
maxGFLOPS 1086.45, minGFLOPS 1032.54, aveGFLOPS 1070.06 per call
Example 15: Output from a native mode linear run.
In addition, the operating system jitter discussed in the first article appears to be the cause of much of this variation. (An excellent starting paper on this topic is "The Case of the Missing Supercomputer Performance.") In offload mode, performance is further reduced by small latencies incurred as the device driver moves the parameters onto the Phi coprocessor and the single floating-point error estimate back off the device.
Running the OpenMP version on a 12-core 3.3 GHz Intel X5680 Westmere chipset shows the linear code runs on average 8.5x slower than the offload code and 10x slower than the native mode code; see Example 16.
$ ./timing.omp 10 1000 60000000
TIMING TEST: generated_PCA_func LINEAR()
Runtime 71.2253, aveObjRuntime 0.0712247
number OMP threads 24
DataLoadTime 0
AveObjTime 0.0712247, countObjFunc 1000, totalObjTime 71.2247
Estimated flops in myFunc 128, estimated average GFlop/s 107.828
Estimated maximum GFlop/s 121.569, minimum GFLop/s 54.8578
Double check based on overall runtime
maxTime 0.139998, minTime 0.063174, aveTime 0.0712251 per call
maxGFLOPS 121.569, minGFLOPS 54.8578, aveGFLOPS 107.827 per call
Example 16: Output from a linear run on a conventional processor.
The generated source in auto2x10x1x10x2 shows that the processor core performs a large number of dot products that utilize the fused multiply-add instruction. Changing to a nonlinear function illustrates the impact of adding a division and an absolute value to the calculation via the Elliott activation function, G(x) = x/(1+|x|), as can be seen in the following results, where the timing programs were built with the NLPCA_TIMING script:
timing.mic: maxGFLOPS 360.278, minGFLOPS 281.438, aveGFLOPS 358.51 per call
timing.off: maxGFLOPS 349.66, minGFLOPS 326.766, aveGFLOPS 348.257 per call
timing.omp: maxGFLOPS 91.4893, minGFLOPS 71.9806, aveGFLOPS 90.8493 per call
The key point is that the Intel Xeon Phi coprocessor in native mode runs the nonlinear problem with performance comparable to offload mode. This indicates that the runtime is dominated by computation rather than by latency-limited operations such as the summation spinlock and PCIe data transfers. Note that there is still significant variation between minimum and maximum performance. Performance profiling with Intel's profiler, VTune, will help identify the reasons for these performance changes.
Peak performance is a useful marketing metric that condenses the complexity of any machine from a cellphone to a leadership-class supercomputer into a single number that people can easily grasp and categorize. While peak performance has its place, sophisticated performance competitions such as the TOP500 and GRAPH500 attempt to more realistically evaluate system performance for specific problem domains.
The key to entering the high-performance arena with the Phi product family is to express sufficient parallelism and vector capability to fully utilize the device. Optimized libraries such as MKL can achieve very high performance. Matrix multiplication is a useful computational tool that also makes a great benchmark because it can show how close a device can get to theoretical peak performance.
The massively parallel mapping utilized in this article has proven to be an excellent framework for solving real-world problems, as a teaching tool, and as a performance evaluation tool. The autoencoder objective functions used in this tutorial solve real-world PCA and NLPCA problems, yet they can also be modified to stress either the memory subsystem or the floating-point capability of a device. It is also possible to define an autoencoder architecture that is limited not by memory bandwidth or computation, but rather by the synchronization required to perform a reduction on a parallel computer. The heavy use of the fused multiply-add instruction means that it is possible to fully utilize the floating-point capability of some devices and achieve high performance across a wide range of devices. The near-linear scaling of this mapping means that you can run it with high performance on a single device or on a supercomputer over a wide range of problem domains.
I encourage you to explore the Intel Xeon Phi coprocessor performance envelope through the use of the provided Python code generator and by writing your own functions. My next article will demonstrate that these objective functions can indeed solve real optimization problems with high performance.
The current source code needs to be compiled with the older 13.0.0 Intel compiler. While the code can be compiled with the more recent compilers (13.0.1 and 13.1.0), care must be taken that the loop in the objective function vectorizes.
Rather than generating myFunc(), it would be more convenient to write a single function that loops over the connections between neurons in different layers. However, using loops in myFunc() appears to prevent vectorization and results in a significant performance drop. Unfortunately, loop unrolling does not appear to help.
The article "Optimization and Performance Tuning for Intel Xeon Phi Coprocessors Part 1: Optimization Essentials" is a useful reference for high-performance Intel Xeon Phi coprocessor programming. It notes that alignment of the vectors with __declspec(align(64)) is important. Compiling with -vec-report=6 confirms that the values are aligned. The Intel article also notes, "Code will run best when data are accessed in sequential address-order from memory. Frequently, developers will change the data structure to allow this linear access pattern. A common transformation is from an array of structures to a structure of arrays (AoS to SoA)." Users can test the performance effects of AoS versus SoA by changing the data layout used in the generated code.
The article "Test-Driving Intel Xeon Phi Coprocessors with a Basic N-body Simulation" is a good additional resource to consult when evaluating how to write high-performance code for Intel Xeon Phi coprocessors.
Rob Farber is a frequent contributor to Dr. Dobb's on CPU and GPGPU programming topics.