This article is part of an ongoing series of tutorials on how to understand and use the new Intel Xeon Phi coprocessors to create applications and adapt legacy software to run at high performance. The previous article in this series showed how to utilize multiple Intel Xeon Phi coprocessors to solve a single problem. Asynchronous data transfers and concurrent execution were utilized to increase application performance by a factor approaching the number of Intel Xeon Phi coprocessors in the system.
This article discusses the use of MPI (Message Passing Interface) to tie together thousands of Intel Xeon Phi coprocessors. It reports scaling and performance results from Stampede, the new TACC (Texas Advanced Computing Center) supercomputer, which utilizes Intel Xeon Phi coprocessors to deliver a theoretical peak of 10 PFlop/s (10 quadrillion floating-point operations per second). The MPI code in this example delivers 2.2 PFlop/s when running on 3,000 nodes. MPI performance results from a workstation containing two Intel Xeon Phi coprocessors are also reported.
A key challenge to programming large numbers of MPI nodes lies in the scalability of the parallel mapping and in minimizing or hiding communications overhead through the use of global broadcast or by overlapping data transfers and computation. This article combines C language MPI calls with Intel offload pragmas to evaluate a single least-squares objective function across all the nodes in an MPI system. As discussed in the previous articles in this series, many other device characteristics such as memory bandwidth, memory access pattern, and the number and types of floating-point calculations per datum all determine how close an application can get to peak performance when running across all the Intel Xeon Phi cores in each computational node.
Near-linear Scalability of the Parallel Mapping
My MPI mapping (Figure 1) is the same as the mapping used to create a purely offload mode, multi-device objective function in the previous tutorial.
Figure 1: Mapping an objective function to multiple devices.
This mapping has demonstrated near-linear scaling on most parallel architectures developed since the 1980s. The scalability to 60,000 processor cores on the TACC Ranger supercomputer is shown in Figure 2.
Figure 2: Scalability of the mapping to 60,000 processing cores.
Walking through the runtime of the three steps in the mapping:
- Parameters are communicated via a global broadcast. In general, global broadcast is extremely efficient and exhibits near-constant runtime in practice. The time spent broadcasting data is usually limited by the bandwidth of the network interconnect hardware.
- Each partial sum calculated in this step runs independently on its MPI node, which means this step scales linearly with the number of processing nodes. The runtime of this step is determined by the amount of data per MPI node and the speed of the node.
- The scalability of the mapping for very large systems is affected mainly by the scalability of the reduction operation. Most MPI reduction operations require O(log2(N)) time, where N is the number of MPI nodes. In practice, the latency of the network interconnect limits the performance of this step, as the number of bytes in each partial sum is small.
While the scalability graph in Figure 2 looks linear to the eye, the O(log2(N)) cost of the reduction operation means the runtime is only near-linear. There is also a slight bend at the beginning of the scaling curve because additional routers were required to communicate with larger numbers of MPI nodes. The extra layers of routers introduced additional latency into the computation.
It is important to account for all communications overhead when reporting the performance of a parallel algorithm. This is why Figure 2 references an "Effective Rate," and the example code measures runtimes that include communications delays.
A Scalable Data Load
The high performance of current supercomputers means that large data sets must be loaded so that the runtime is dominated by the calculation of the partial sums rather than by the latency of the reduction operation. For example, the TACC Stampede supercomputer will eventually contain 51.2 trillion bytes of Intel Xeon Phi coprocessor memory. Happily, current parallel supercomputer file systems such as Lustre and GPFS can support tens of thousands of computational nodes and deliver hundreds of GB/s of streaming storage bandwidth.
A scalable data load is illustrated in Figure 3.
Figure 3: A scalable data load.
In this example, the master MPI node broadcasts the starting offset and number of examples to all the nodes. This way, each MPI client knows what data it should read. Each client then:
- Opens the data file and seeks to its correct offset.
- Reads its data.
- Closes the file.
Each disk in the file system (denoted by the blue disk icons) contains a portion of the data set stored in a single complete file. The parallel file system has been designed to concurrently transfer all the requested data segments (indicated by the colored boxes) from each disk to the requesting client. In this fashion, the aggregate throughput of the parallel file system can be realized with large numbers of streaming I/O operations, which is why even very large multi-terabyte data sets can be loaded on big supercomputers such as Stampede in a few seconds to a few minutes.
The same high performance can be achieved when many individual files are read by the MPI clients. For simplicity, the MPI clients in this article will append the MPI rank to the filename passed via the command line for the data load. Utilizing a single file is preferred, however, because it does not require that the user manually split the data based on the number of nodes in the MPI run. Instead, the application code can perform the partitioning to eliminate a potential source of error.
The mpiTrain.c Source Code
The train.c source file used in the previous tutorials has been modified to work in an MPI environment. While these changes could have been made to myFunc.h, a separate training source file should make it less confusing for readers who wish to adapt the mpiTrain.c source code to run multiple coprocessors in an MPI environment or in a hybrid coprocessor and CPU environment.