Demonstration: Scalability to 120 Threads Is Recommended
Listing One is a C source code snippet that implements doMult(), a function that multiplies two square matrices A and B and assigns the result into matrix C. This function will be used to quantify the performance impact of the number of threads utilized per core.
Listing One: A simple square matrix multiplication function.
// An OpenMP simple matrix multiply
void doMult(int size, float (* restrict A)[size],
float (* restrict B)[size], float (* restrict C)[size])
{
#pragma offload target(mic:MIC_DEV) \
in(A:length(size*size)) in( B:length(size*size)) \
out(C:length(size*size))
{
// Zero the C matrix
#pragma omp parallel for default(none) shared(C,size)
for (int i = 0; i < size; ++i)
for (int j = 0; j < size; ++j)
C[i][j] =0.f;
// Compute matrix multiplication.
#pragma omp parallel for default(none) shared(A,B,C,size)
for (int i = 0; i < size; ++i)
for (int k = 0; k < size; ++k)
for (int j = 0; j < size; ++j)
C[i][j] += A[i][k] * B[k][j];
}
}
Readers familiar with OpenMP are comfortable using pragmas to annotate their code so it can be parallelized by an OpenMP-compliant compiler. Listing Two is a complete source code with test harness to demonstrate the average native (Intel Xeon Phi as a Linux SMP computer) runtime performance.
Listing Two: Source code for firstMatrix.c.
/* firstMatrix.c (Rob Farber) */
#ifndef MIC_DEV
#define MIC_DEV 0
#endif
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <mkl.h>
#include <math.h>
// An OpenMP simple matrix multiply
void doMult(int size, float (* restrict A)[size],
float (* restrict B)[size], float (* restrict C)[size])
{
#pragma offload target(mic:MIC_DEV) \
in(A:length(size*size)) in( B:length(size*size)) \
out(C:length(size*size))
{
// Zero the C matrix
#pragma omp parallel for default(none) shared(C,size)
for (int i = 0; i < size; ++i)
for (int j = 0; j < size; ++j)
C[i][j] =0.f;
// Compute matrix multiplication.
#pragma omp parallel for default(none) shared(A,B,C,size)
for (int i = 0; i < size; ++i)
for (int k = 0; k < size; ++k)
for (int j = 0; j < size; ++j)
C[i][j] += A[i][k] * B[k][j];
}
}
int main(int argc, char *argv[])
{
if(argc != 4) {
fprintf(stderr,"Use: %s size nThreads nIter\n",argv[0]);
return -1;
}
int i,j,k;
int size=atoi(argv[1]);
int nThreads=atoi(argv[2]);
int nIter=atoi(argv[3]);
omp_set_num_threads(nThreads);
float (*restrict A)[size] = malloc(sizeof(float)*size*size);
float (*restrict B)[size] = malloc(sizeof(float)*size*size);
float (*restrict C)[size] = malloc(sizeof(float)*size*size);
// Fill the A and B arrays
#pragma omp parallel for default(none) shared(A,B,size) private(i,j,k)
for (i = 0; i < size; ++i) {
for (j = 0; j < size; ++j) {
A[i][j] = (float)i + j;
B[i][j] = (float)i - j;
}
}
// warm up
doMult(size, A,B,C);
double aveTime,minTime=1e6,maxTime=0.;
for(int i=0; i < nIter; i++) {
double startTime = dsecnd();
doMult(size, A,B,C);
double endTime = dsecnd();
double runtime = endTime-startTime;
maxTime=(maxTime > runtime)?maxTime:runtime;
minTime=(minTime < runtime)?minTime:runtime;
aveTime += runtime;
}
aveTime /= nIter;
printf("%s nThrds %d matrix %d maxRT %g minRT %g aveRT %g ave_GFlop/s %g\n",
argv[0], omp_get_num_threads(), size,
maxTime,minTime,aveTime, 2e-9*size*size*size/aveTime);
free(A); free(B); free(C);
return 0;
}
The firstMatrix.c source code can be compiled to run natively on an Intel Xeon Phi coprocessor as an OpenMP application with the Intel C Compiler (icc) command shown in Listing Three. The –mmic argument specifies native compilation for Xeon Phi.
Listing Three: Intel compiler command to build firstMult.c for native execution.
icc -mkl -O3 -mmic -openmp -L /opt/intel/lib/mic -Wno-unknown-pragmas -std=c99 -vec-report3 firstMatrix.c -o firstMatrix.mic -liomp5
Listing Four is a shell script that was used to observe how the runtime varied according to thread count when multiplying 1000x1000 square matrices natively on a Phi coprocessor.
Listing Four: Shell script for observing runtime variance.
export KMP_AFFINITY="granularity=thread,balanced" export LD_LIBRARY_PATH=/tmp i=1 while [ $i -lt 480 ] do ./firstMatrix.mic 1000 $i 10 let i++ done
The KMP_AFFINITY environment variable specifies the thread-to-core affinity. There are three preset schemes: compact, scatter, and balanced. Intel recommends the user explicitly define the affinity that works best for their application. The reason is that the default runtime thread affinity can change between software releases. For consistent application performance across software releases, do not rely on the default affinity scheme.
Compacttries to use minimum number of cores by pinning four threads to a core before filling the next coreScattertries to evenly distribute threads across all coresBalancedtries to equally scatter threads across all cores such that adjacent threads (sequential thread numbers) are pinned to the same core. One caveat being that all cores refers to the total number of cores -1 because one core is reserved for the operating system during an offload
Interested readers can find more about the affinitization schemes in "Best Known Methods for Using OpenMP on Intel Many Integrated Core Architecture."
The runtimes in Figure 4 show that the first performance peak is around 120 threads, or 2x the coprocessor core count. (Remember that one core is reserved for the operating system.) The highest and broadest performance peak is observed around 240 threads, or 4x the core count.
Figure 4: Average GFlop/s as a function of thread count when multiplying 1000x1000 matrices.
Figure 5 illustrates the variation in runtime performance with 240 threads is clearly larger than the variation with 120 threads. However, an average over 10 samples shows that the minimum observed runtime does not dramatically affect the overall average runtime (marked with the green tic mark).
Figure 5: Variation in runtime in native mode.
Be aware that operating system jitter due to system daemons and multiple user processes can introduce performance variations. (An excellent starting paper on this topic is "The Case of the Missing Supercomputer Performance.") In particular, the default round-robin scheduling with multiple devices can introduce performance variations.
Offload Programming
The offload pragma in Listing One, provides additional annotation so the compiler can correctly move data to and from the external Phi card. Note that multiple OpenMP loops can be contained within the scope of the offload directive. The clauses are interpreted as follows:
• offload: The offload pragma keyword specifies that the following clauses contain information relevant to offloading to the target device. target(mic:MIC_DEV) is the target clause that tells the compiler to generate code for both the host processor and the specified offload device. In this example, the target will be a Xeon Phi device associated with the integer specified by the constant MIC_DEV. Note that:
- The offload runtime will schedule offload work within a single application in a round-robin fashion, which can be useful to share the workload amongst multiple devices. It is the responsibility of the programmer to ensure that any persistent data resides on all the devices when round-robin scheduling is used! In general, only use persistent data when the device number is specified or bizarre errors can result. Note that the use of persistent data on the device is required by the three rules of high-performance computing to avoid PCIe bottlenecks.
- The offload runtime will utilize the host processor when no coprocessors are present and no device number is specified (for example,
target(mic)). - Alternatively, programmers can add
use _Offload_toto specify a device in their code.
• in(A:length(size*size)): The in(var-list modifiersopt) clause explicitly copies data from the host to the coprocessor. Note that:
- The
length(element-count-expr)specifies the number of elements to be transferred. The compiler will perform the conversion to bytes based on the type of the elements. - By default, memory will be allocated on the device and deallocated on exiting the scope of the directive.
- The
alloc_if(condition)andfree_if(condition)modifiers can change the default behavior.
• out(C:length(size*size)): The out(var-list modifiersopt) clause explicitly copies data from the coprocessor to the host. Note that:
- The
length(element-count-expr)specifies the number of elements to be transferred. The compiler will perform the conversion to bytes based on the type of the elements. By default, memory will be deallocated on exiting the scope of the directive. - The
free_if(condition)modifier can change the default behavior.
More information about the syntax of the offload directive is available from Intel.
Note that the call to doMult() utilizes the variable size to dynamically specify at runtime the number of columns in the 2D matrices. The ability to index contiguous memory through variable length multi-dimensional arrays (such as, 2d, 3d, and so on) arrays was added to the C programming language in the ANSI C99 specification. This feature is important because the offload pragmas transfer only contiguous regions of memory. Old-school C programmers have been trained to manually calculate the offset for each multi-dimensional array access from the start of a contiguous memory region. This article and the upcoming installments use the newer C99 VLA (Variable-Length Array) feature to make the examples easier to read, potentially enable more compiler optimizations, and also achieve high data transfer performance as each multidimensional array can be transferred in one operation. For compatibility reasons, it is also important to list the variables used in the multidimensional array declarations first in the calling sequence because some compilers (such as the Intel compiler) do not forward variable references within an argument list.
More information about the Xeon Phi offload syntax is available from Intel.




