Performance of a Two Coprocessor Workstation
The script in Listing Three was used to optimize a 2x10x1x10x2 autoencoder using the mpiTrain executable in both the pca_mpi and nlpca_mpi directories. Note that a data set is generated individually for each MPI client using genData, with a different seed passed to genData for each client's dataset.
Listing Three: The RUN_MPI script.
APP=pca_mpi
NUM_NODES=2
EXAMPLES=`expr 30000000 / $NUM_NODES`
VAR=0.1
NUM_CLIENTS=`expr $NUM_NODES - 1`

for i in `seq 0 $NUM_CLIENTS`
do
    ./gen_$APP $APP.train.dat.$i $EXAMPLES $VAR $i &
done
wait

mpiexec -np $NUM_NODES ./mpiTrain_$APP.off $APP.train.dat $APP.param $APP.timing.txt

./gen_$APP $APP.pred.dat 1000 0 1
./pred_$APP $APP.param $APP.pred.dat > output.txt

# create file for gnuplot
tail -n +3 output.txt > plot.txt
rm *.dat.*
The following results are based on a 12-core 3.3 GHz X5680 Westmere workstation containing two Intel Xeon Phi coprocessors.
The PCA run utilized 15 million examples per MPI client. The overall average sustained MPI performance for 30 million examples on two Intel Xeon Phi devices was 1,621 GF/s. The average sustained performance on the master was 810 GF/s, which indicates the MPI code delivers a 2x speedup with two coprocessors.
$ sh RUN_MPI
Number of tasks= 2 My rank= 0, number clients 2
myFunc generated_PCA_func LINEAR()
nExamples 15000000
Number Parameters 83
Optimization Time 58.5383
found minimum 39.97145961 ret 1
----------- performance times for the Master ----------
number OMP threads 240
DataLoadTime 2.10481
AveObjTime 0.00236871, countObjFunc 24432, totalObjTime 57.8723
Estimated flops in myFunc 128, estimated average GFlop/s 810.569
Estimated maximum GFlop/s 874.763, minimum GFLop/s 7.33881
----------- performance times for the MPI run ----------
function: generated_PCA_func LINEAR()
totalExamples 3e+07
AveObjTime 0.00236871, countObjFunc 24432, totalObjTime 57.8723
Estimated flops in myFunc 128, estimated average TFlop/s 1.62114, nClients 2
Estimated maximum TFlop/s 1.74953, minimum TFLop/s 0.0146776
Figure 4: Results of a two-client PCA optimization using MPI.
In comparison, the multi-coprocessor PCA code in the previous article (which did not utilize the MPI API) reported 1,523 GF/s on the same sized problem, indicating the MPI and multi-coprocessor applications run at essentially the same speed. However, the MPI code on a single workstation utilizes shared memory for client communication; a physical network interface connecting two distributed computational nodes will likely exhibit lower performance.
The average sustained performance for an NLPCA optimization utilizing 30 million examples on two coprocessors is 669 GF/s. This is approximately twice the performance of the master node, which reports a 334.5 GF/s average.
$ sh RUN_MPI
Number of tasks= 2 My rank= 0, number clients 2
Number of coprocessors per node= 2
myFunc generated func Eliott activation: x/(1+fabsf(x))
nExamples 15000000
Number Parameters 83
Optimization Time 900.006
found minimum 45.02399591 ret 6
----------- performance times for the Master ----------
number OMP threads 240
DataLoadTime 1.97757
AveObjTime 0.0084287, countObjFunc 106177, totalObjTime 894.934
Estimated flops in myFunc 188, estimated average GFlop/s 334.571
Estimated maximum GFlop/s 342.978, minimum GFLop/s 10.0361
----------- performance times for the MPI run ----------
function: generated func Eliott activation: x/(1+fabsf(x))
totalExamples 3e+07
AveObjTime 0.0084287, countObjFunc 106177, totalObjTime 894.934
Estimated flops in myFunc 188, estimated average TFlop/s 0.669142, nClients 2
Estimated maximum TFlop/s 0.685956, minimum TFLop/s 0.0200723
Figure 5: Results reported for a two-client NLPCA MPI run.
In comparison, the multi-coprocessor NLPCA code in the previous article (which did not utilize the MPI API) reported 660 GF/s on the same sized problem, again indicating the MPI and multi-coprocessor applications run at the same speed. As noted, a physical network interface connecting two distributed computational nodes will likely exhibit lower performance.
Performance on Stampede
The TACC Stampede system is a 10 PFLOPS (PF) Dell Linux cluster based on 6,400+ Dell PowerEdge server nodes, each outfitted with two Intel Xeon E5 (Sandy Bridge) processors and a single Intel Xeon Phi coprocessor. The aggregate peak performance of the Xeon Phi coprocessors alone is greater than seven petaflops.
The 56 Gb/s FDR InfiniBand interconnect consists of Mellanox switches, fiber cables, and HCAs (Host Channel Adapters). Eight 648-port SX6536 core switches and more than 320 36-port SX6025 endpoint switches (two in each compute-node rack) form a 2-level Clos fat tree topology, illustrated in Figure 6. Core and endpoint switches have capacities of 73 and 4.0 Tb/s, respectively. There is a 5:4 oversubscription at the endpoint (leaf) switches (20 node input ports to 16 core-switch output ports). Any MPI message travels 5 hops or fewer from source to destination.
Figure 6: Stampede network architecture (courtesy of TACC).
The Texas Advanced Computing Center kindly provided access to their system during preproduction testing and follow-up. The scaling curve in Figure 7 demonstrates near-linear scaling to 3,000 Intel Xeon Phi coprocessors, with an average sustained performance of 2.2 PF/s on 3,000 nodes. The maximum reported performance was nearly 3 PF/s. It is likely the average sustained performance will increase to 4.6 petaflops once Stampede receives all 6,400 Intel Xeon Phi coprocessors.
Figure 7: Observed scaling to 3000 nodes of the TACC Stampede supercomputer.
Extraordinary Research Opportunities
The ability to access petaflops of performance with compute clusters containing multiple Intel Xeon Phi coprocessors opens the door to extraordinary research opportunities. For example, the 256-node runs provided an average performance that exceeded the theoretical peak performance of the $30-million PNNL Chinook supercomputer. Even small research organizations can afford a 256-node Intel Xeon Phi computational cluster. Further, the performance of this tiny portion of the Stampede supercomputer approaches that of the TACC Ranger supercomputer.
While this tutorial series has focused on optimization and machine learning, the petaflop era opens new vistas for climate modeling and prediction, materials modeling, brain modeling, and a wide assortment of other computation-based areas of research.
This article demonstrates how to utilize Intel Xeon Phi coprocessors to evaluate a single objective function across a computational cluster using MPI. The example code can be used with existing numerical optimization libraries to solve real problems of interest to data scientists. Performance results show that the TACC Stampede supercomputer is indeed capable of sustaining many petaflops of average effective performance; in other words, "honest flops" that take all communications overhead into account. Small compute clusters containing 256 nodes, which are affordable for schools and small research organizations, can exceed the peak theoretical performance of multimillion-dollar machines that are still operational at the smaller U.S. national laboratories, and can deliver performance approaching that of even large leadership-class supercomputers that are only a few years old.
Rob Farber is a frequent contributor to Dr. Dobb's on high-performance and massively parallel computing topics.