Exceeding Supercomputer Performance with Intel Phi


Performance of a Two Coprocessor Workstation

The script in Listing Three was used to optimize a 2x10x1x10x2 autoencoder with the mpiTrain executable in both the pca_mpi and nlpca_mpi directories. Note that a separate training set is generated for each MPI client, with a different seed passed to the data generator (gen_$APP in the script) for each client's data set.

Listing Three: The RUN_MPI script.

APP=pca_mpi
NUM_NODES=2
EXAMPLES=`expr 30000000 / $NUM_NODES`
VAR=0.1
NUM_CLIENTS=`expr $NUM_NODES - 1`

# generate a separate training file for each MPI rank, each with its own seed
for i in `seq 0 $NUM_CLIENTS`
do
   ./gen_$APP $APP.train.dat.$i $EXAMPLES $VAR $i &
done
wait

# train across all ranks, then generate noise-free data and predict with the fitted parameters
mpiexec -np $NUM_NODES ./mpiTrain_$APP.off $APP.train.dat $APP.param $APP.timing.txt
./gen_$APP $APP.pred.dat 1000 0 1
./pred_$APP $APP.param $APP.pred.dat > output.txt

# create file for gnuplot
tail -n +3 output.txt > plot.txt
rm *.dat.*
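
A common way to distribute a single sum-based objective function with MPI, consistent with the master/client results reported below, is a broadcast/partial-evaluation/reduce cycle: the master broadcasts its current parameter vector, every rank evaluates the objective over its local slice of the training data (offloading the arithmetic to its coprocessor in the real application), and the partial sums are reduced back onto the master for the optimizer. The following is a minimal, self-contained sketch of that pattern; it is not the mpiTrain source, and the names (NPARAM, partialObj, and so on) are hypothetical stand-ins for the application-specific pieces.

// Minimal sketch: evaluating one sum-based objective across MPI ranks.
// Names (NPARAM, partialObj, localData) are hypothetical placeholders.
#include <mpi.h>
#include <cstdio>
#include <vector>

const int NPARAM = 83;   // parameter count reported in the runs above

// Hypothetical: each rank computes the partial error over its own slice of
// the training data (the real application offloads this to the coprocessor).
double partialObj(const double *p, const std::vector<double> &localData)
{
  double sum = 0.;
  for (size_t i = 0; i < localData.size(); i++)
    sum += localData[i] * localData[i];   // placeholder for the autoencoder error
  return sum;
}

// Called once per optimizer iteration; every rank must enter it.
double objFunc(double *p, const std::vector<double> &localData)
{
  // All ranks receive the master's current parameter vector ...
  MPI_Bcast(p, NPARAM, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  // ... compute their local contribution ...
  double myErr = partialObj(p, localData);

  // ... and the partial errors are summed back onto the master.
  double totalErr = 0.;
  MPI_Reduce(&myErr, &totalErr, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  return totalErr;   // only meaningful on rank 0
}

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);
  int rank, nTasks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nTasks);

  std::vector<double> localData;       // hypothetical: filled from $APP.train.dat.<rank>
  std::vector<double> p(NPARAM, 0.1);  // current parameter vector

  if (rank == 0) {
    // The master drives the optimizer; each objective call triggers a
    // broadcast/partial-evaluation/reduce cycle on every rank.
    double err = objFunc(p.data(), localData);
    printf("objective %g on %d tasks\n", err, nTasks);
  } else {
    // Clients simply mirror the master's objective calls.
    objFunc(p.data(), localData);
  }

  MPI_Finalize();
  return 0;
}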

The following results are based on a 12-core 3.3 GHz X5680 Westmere workstation containing two Intel Xeon Phi coprocessors.

PCA

The PCA run used 15 million examples per MPI client. The overall average sustained performance of the MPI run with 30 million examples on two Intel Xeon Phi devices was 1,621 GF/s. The average sustained performance on the master was 810 GF/s, which indicates the MPI code delivers a 2x speedup with two coprocessors.

$ sh RUN_MPI
Number of tasks= 2 My rank= 0, number clients 2
myFunc generated_PCA_func LINEAR()
nExamples 15000000
Number Parameters 83
Optimization Time 58.5383
found minimum 39.97145961 ret 1
----------- performance times for the Master ----------
number OMP threads 240
DataLoadTime 2.10481
AveObjTime 0.00236871, countObjFunc 24432, totalObjTime 57.8723
Estimated flops in myFunc 128, estimated average GFlop/s 810.569
Estimated maximum GFlop/s 874.763, minimum GFLop/s 7.33881
----------- performance times for the MPI run ----------
function: generated_PCA_func LINEAR()
totalExamples 3e+07
AveObjTime 0.00236871, countObjFunc 24432, totalObjTime 57.8723
Estimated flops in myFunc 128, estimated average TFlop/s 1.62114, nClients 2
Estimated maximum TFlop/s 1.74953, minimum TFLop/s 0.0146776

Figure 4: Results of a two-client PCA optimization using MPI.
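
The reported rates follow directly from the values printed above; as a quick sanity check:

average GFlop/s (master)  = nExamples × flops per example / AveObjTime
                          = 15,000,000 × 128 / 0.00236871 s ≈ 810.6 GF/s
average TFlop/s (MPI run) = 30,000,000 × 128 / 0.00236871 s ≈ 1.621 TF/s

The factor of two comes from doubling the number of examples processed per objective call at essentially the same per-call time, which is exactly the 2x speedup noted above. The same arithmetic applies to the NLPCA run below.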

In comparison, the multi-coprocessor PCA code in the previous article (which did not use the MPI API) reported 1,523 GF/s on the same-sized problem, indicating that the MPI and multi-coprocessor applications run at essentially the same speed. Note, however, that the MPI code on a single workstation uses shared memory for client communication; a physical network interface connecting two distributed computational nodes will likely deliver lower performance.

NLPCA

The average sustained performance for an NLPCA optimization using 30 million examples on two coprocessors is 669 GF/s, approximately twice the 334.5 GF/s average performance reported on the master.

$ sh RUN_MPI
Number of tasks= 2 My rank= 0, number clients 2
Number of coprocessors per node= 2
myFunc generated func Eliott activation: x/(1+fabsf(x))
nExamples 15000000
Number Parameters 83
Optimization Time 900.006
found minimum 45.02399591 ret 6
----------- performance times for the Master ----------
number OMP threads 240
DataLoadTime 1.97757
AveObjTime 0.0084287, countObjFunc 106177, totalObjTime 894.934
Estimated flops in myFunc 188, estimated average GFlop/s 334.571
Estimated maximum GFlop/s 342.978, minimum GFLop/s 10.0361
----------- performance times for the MPI run ----------
function: generated func Eliott activation: x/(1+fabsf(x))
totalExamples 3e+07
AveObjTime 0.0084287, countObjFunc 106177, totalObjTime 894.934
Estimated flops in myFunc 188, estimated average TFlop/s 0.669142, nClients 2
Estimated maximum TFlop/s 0.685956, minimum TFLop/s 0.0200723

Figure 5: Results reported for a two-client NLPCA MPI run.

In comparison, the multi-coprocessor NLPCA code in the previous article (which did not use the MPI API) reported 660 GF/s on the same-sized problem, again indicating that the MPI and multi-coprocessor applications run at essentially the same speed. As noted above, a physical network interface connecting two distributed computational nodes will likely deliver lower performance.

Performance on Stampede

The TACC Stampede system is a 10 PFLOPS (PF) Dell Linux cluster based on 6,400+ Dell PowerEdge server nodes, each outfitted with two Intel Xeon E5 (Sandy Bridge) processors and a single Intel Xeon Phi coprocessor. The aggregate peak performance of the Xeon Phi coprocessors alone is greater than seven petaflops.

The FDR InfiniBand interconnect (56 Gb/s per link) consists of Mellanox switches, fiber cables, and HCAs (Host Channel Adapters). Eight 648-port SX6536 core switches and more than 320 36-port SX6025 endpoint switches (2 in each compute-node rack) form a 2-level Clos fat-tree topology, illustrated in Figure 6. The core and endpoint switches have capacities of 73 Tb/s and 4.0 Tb/s, respectively. There is a 5:4 oversubscription at the endpoint (leaf) switches (20 node input ports to 16 core-switch output ports), and any MPI message traverses at most 5 hops from source to destination.
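
Those port counts translate directly into the stated oversubscription: each leaf switch accepts 20 × 56 Gb/s = 1,120 Gb/s of injection bandwidth from its compute nodes but offers only 16 × 56 Gb/s = 896 Gb/s toward the core, and 1,120:896 reduces to 5:4.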

Figure 6: Stampede network architecture (courtesy of TACC).

The Texas Advanced Computing Center kindly provided access to the system during preproduction testing and follow-up. The scaling curve in Figure 7 demonstrates near-linear scaling to 3,000 Intel Xeon Phi coprocessors, with an average sustained performance of 2.2 PF/s on 3,000 nodes. The maximum reported performance was nearly 3 PF/s. The average sustained performance will likely increase to 4.6 petaflops once Stampede receives all 6,400 Intel Xeon Phi coprocessors.

Figure 7: Observed scaling to 3000 nodes of the TACC Stampede supercomputer.
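
The 4.6 PF estimate is a simple extrapolation, assuming the per-node rate observed in Figure 7 holds at full scale: 2.2 PF/s across 3,000 nodes is roughly 0.73 TF/s per node, and 6,400 nodes × 0.73 TF/s ≈ 4.7 PF/s, consistent with the quoted figure.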

Extraordinary Research Opportunities

The ability to access petaflops of performance with compute clusters containing multiple Intel Xeon Phi coprocessors opens the door to extraordinary research opportunities. For example, the 256-node runs provided an average performance that exceeded the theoretical peak performance of the $30-million PNNL Chinook supercomputer. Even small research organizations can afford a 256-node Intel Xeon Phi computational cluster. Further, the performance of this tiny portion of the Stampede supercomputer approaches that of the TACC Ranger supercomputer.

While this tutorial series has focused on optimization and machine learning, the petaflop era opens new vistas for climate modeling and prediction, materials modeling, brain modeling, and a wide assortment of other computationally intensive areas of research.

Conclusion

This article demonstrates how to utilize Intel Xeon Phi coprocessors to evaluate a single objective function across a computational cluster using MPI. The example code can be used with existing numerical optimization libraries to solve real problems of interest to data scientists. The performance results show that the TACC Stampede supercomputer is indeed capable of sustaining many petaflops of average effective performance; in other words, "effective performance," or "honest flops," that accounts for all communication overhead. Small compute clusters containing 256 nodes, which are affordable for schools and small research organizations, can exceed the peak theoretical performance of multimillion-dollar machines that are still operational at the smaller U.S. national laboratories, and can deliver performance approaching that of even large leadership-class supercomputers that are only a few years old.


Rob Farber is a frequent contributor to Dr. Dobb's on high-performance and massively parallel computing topics.


Related Articles

Programming Intel's Xeon Phi: A Jumpstart Introduction

CUDA vs. Phi: Phi Programming for CUDA Developers

Getting to 1 Teraflop on the Intel Phi Coprocessor

Numerical and Computational Optimization on the Intel Phi

