Programming Intel's Xeon Phi: A Jumpstart Introduction


Demonstration: Scalability to 120 Threads Is Recommended

Listing One is a C source code snippet that implements doMult(), a function that multiplies two square matrices A and B and stores the result in matrix C. This function will be used to quantify the performance impact of the number of threads used per core.

Listing One: A simple square matrix multiplication function.

// An OpenMP simple matrix multiply
void doMult(int size, float (* restrict A)[size],
	    float (* restrict B)[size], float (* restrict C)[size]) 
{
#pragma offload target(mic:MIC_DEV) \
                in(A:length(size*size)) in(B:length(size*size)) \
                out(C:length(size*size))
  {
    // Zero the C matrix
#pragma omp parallel for default(none) shared(C,size)
    for (int i = 0; i < size; ++i)
      for (int j = 0; j < size; ++j)
        C[i][j] = 0.f;
    
    // Compute matrix multiplication.
#pragma omp parallel for default(none) shared(A,B,C,size)
    for (int i = 0; i < size; ++i)
      for (int k = 0; k < size; ++k)
        for (int j = 0; j < size; ++j)
          C[i][j] += A[i][k] * B[k][j];
  }
}

Readers familiar with OpenMP will be comfortable using pragmas to annotate their code so it can be parallelized by an OpenMP-compliant compiler. Listing Two is the complete source code, with a test harness, used to measure average runtime performance in native mode (that is, with the Intel Xeon Phi acting as a standalone Linux SMP computer).

Listing Two: Source code for firstMatrix.c.

/* firstMatrix.c (Rob Farber) */
#ifndef MIC_DEV
#define MIC_DEV 0
#endif

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <mkl.h>
#include <math.h>

// An OpenMP simple matrix multiply
void doMult(int size, float (* restrict A)[size],
	    float (* restrict B)[size], float (* restrict C)[size]) 
{
#pragma offload target(mic:MIC_DEV) \
                in(A:length(size*size)) in(B:length(size*size)) \
                out(C:length(size*size))
  {
    // Zero the C matrix
#pragma omp parallel for default(none) shared(C,size)
    for (int i = 0; i < size; ++i)
      for (int j = 0; j < size; ++j)
        C[i][j] = 0.f;
    
    // Compute matrix multiplication.
#pragma omp parallel for default(none) shared(A,B,C,size)
    for (int i = 0; i < size; ++i)
      for (int k = 0; k < size; ++k)
        for (int j = 0; j < size; ++j)
          C[i][j] += A[i][k] * B[k][j];
  }
}

int main(int argc, char *argv[])
{

  if(argc != 4) {
    fprintf(stderr,"Use: %s size nThreads nIter\n",argv[0]);
    return -1;
  }

  int i,j,k;
  int size=atoi(argv[1]);
  int nThreads=atoi(argv[2]);
  int nIter=atoi(argv[3]);
  
  omp_set_num_threads(nThreads);

  float (*restrict A)[size] = malloc(sizeof(float)*size*size);
  float (*restrict B)[size] = malloc(sizeof(float)*size*size);
  float (*restrict C)[size] = malloc(sizeof(float)*size*size);

  // Fill the A and B arrays
#pragma omp parallel for default(none) shared(A,B,size) private(i,j,k)
  for (i = 0; i < size; ++i) {
    for (j = 0; j < size; ++j) {
      A[i][j] = (float)i + j;
      B[i][j] = (float)i - j;
    }
  }
  
  // warm up
  doMult(size, A,B,C);

  double aveTime=0., minTime=1e6, maxTime=0.;  // aveTime must start at zero because it accumulates runtimes
  for(int i=0; i < nIter; i++) {
    double startTime = dsecnd();
    doMult(size, A,B,C);
    double endTime = dsecnd();
    double runtime = endTime-startTime;
    maxTime=(maxTime > runtime)?maxTime:runtime;
    minTime=(minTime < runtime)?minTime:runtime;
    aveTime += runtime;
  }
  aveTime /= nIter;
  
  printf("%s nThrds %d matrix %d maxRT %g minRT %g aveRT %g ave_GFlop/s %g\n",
	 argv[0], omp_get_num_threads(), size,
	 maxTime,minTime,aveTime, 2e-9*size*size*size/aveTime);

  free(A); free(B); free(C);
  return 0;
}

The firstMatrix.c source code can be compiled to run natively on an Intel Xeon Phi coprocessor as an OpenMP application with the Intel C Compiler (icc) command shown in Listing Three. The -mmic argument specifies native compilation for Xeon Phi.

Listing Three: Intel compiler command to build firstMatrix.c for native execution.

icc -mkl -O3 -mmic -openmp -L /opt/intel/lib/mic -Wno-unknown-pragmas -std=c99 -vec-report3 firstMatrix.c -o firstMatrix.mic -liomp5
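The same source also builds as a host-side offload application; a minimal sketch (the exact flags are an assumption, not from the article) simply drops -mmic so the compiler generates host code plus coprocessor code for the offload regions:

icc -mkl -O3 -openmp -std=c99 -vec-report3 firstMatrix.c -o firstMatrix.off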

Listing Four is a shell script that was used to observe how the runtime varied according to thread count when multiplying 1000x1000 square matrices natively on a Phi coprocessor.

Listing Four: Shell script for observing runtime variance.

export KMP_AFFINITY="granularity=thread,balanced"
export LD_LIBRARY_PATH=/tmp

i=1
while [ $i -lt 480 ]
do
   ./firstMatrix.mic 1000 $i 10
   let i++
done
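Running the sweep natively requires that the binary and the OpenMP and MKL runtime libraries be present on the card; the LD_LIBRARY_PATH=/tmp line above assumes they were copied there. A minimal sketch of that setup, assuming the first coprocessor is reachable as mic0, that the script above is saved as runPhi.sh, and that the library paths match a typical Intel tools installation:

# Copy the runtime libraries, the binary, and the script to the card (paths/hostname are assumptions)
scp /opt/intel/lib/mic/libiomp5.so mic0:/tmp
scp /opt/intel/mkl/lib/mic/libmkl*.so mic0:/tmp
scp firstMatrix.mic runPhi.sh mic0:/tmp

# Log in and run the thread-count sweep from Listing Four
ssh mic0 "cd /tmp && sh runPhi.sh | tee runtimes.txt"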

The KMP_AFFINITY environment variable specifies the thread-to-core affinity. There are three preset schemes: compact, scatter, and balanced (example settings appear after the list below). Intel recommends that users explicitly define the affinity that works best for their application, because the default runtime thread affinity can change between software releases; for consistent application performance across releases, do not rely on the default affinity scheme.

  1. Compact tries to use the minimum number of cores by pinning four threads to a core before filling the next core.
  2. Scatter tries to distribute threads evenly across all cores.
  3. Balanced tries to scatter threads equally across all cores so that adjacent threads (sequential thread numbers) are pinned to the same core. Note that "all cores" here means the total core count minus one, because one core is reserved for the operating system during an offload.
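As a point of reference, the three presets are selected with settings of the following form (a minimal sketch; the granularity modifier and further options can be tuned per application):

export KMP_AFFINITY="granularity=thread,compact"    # pack threads onto as few cores as possible
export KMP_AFFINITY="granularity=thread,scatter"    # one thread per core before wrapping around
export KMP_AFFINITY="granularity=thread,balanced"   # spread evenly; keep consecutive threads on the same core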

Interested readers can find more about the affinitization schemes in "Best Known Methods for Using OpenMP on Intel Many Integrated Core Architecture."

The runtimes in Figure 4 show that the first performance peak is around 120 threads, or 2x the coprocessor core count. (Remember that one core is reserved for the operating system.) The highest and broadest performance peak is observed around 240 threads, or 4x the core count.


Figure 4: Average GFlop/s as a function of thread count when multiplying 1000x1000 matrices.

Figure 5 illustrates that the variation in runtime performance with 240 threads is clearly larger than the variation with 120 threads. However, an average over 10 samples shows that the minimum observed runtime does not dramatically affect the overall average runtime (marked with the green tick mark).


Figure 5: Variation in runtime in native mode.

Be aware that operating system jitter due to system daemons and multiple user processes can introduce performance variations. (An excellent starting paper on this topic is "The Case of the Missing Supercomputer Performance.") In particular, the default round-robin scheduling of offloads across multiple devices can also introduce variation.

Offload Programming

The offload pragma in Listing One provides additional annotation so the compiler can correctly move data to and from the external Xeon Phi card. Note that multiple OpenMP loops can be contained within the scope of the offload directive. The clauses are interpreted as follows:

offload: The offload pragma keyword specifies that the following clauses contain information relevant to offloading to the target device. target(mic:MIC_DEV) is the target clause that tells the compiler to generate code for both the host processor and the specified offload device. In this example, the target will be a Xeon Phi device associated with the integer specified by the constant MIC_DEV. Note that:

  • The offload runtime will schedule offload work within a single application in a round-robin fashion, which can be useful for sharing the workload among multiple devices. When round-robin scheduling is used, it is the programmer's responsibility to ensure that any persistent data resides on all the devices! In general, use persistent data only when the device number is specified explicitly; otherwise, bizarre errors can result. Note that the use of persistent data on the device is required by the three rules of high-performance computing to avoid PCIe bottlenecks.
  • The offload runtime will utilize the host processor when no coprocessors are present and no device number is specified (for example, target(mic)).
  • Alternatively, programmers can use _Offload_to to specify a device in their code (see the sketch after this list).
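For example, a minimal sketch (device 0 is an assumption; any installed coprocessor index works) that pins the offload from Listing One to a single card and thereby sidesteps round-robin scheduling:

// Pin the offload to coprocessor 0 rather than letting the runtime choose
#pragma offload target(mic:0) in(A:length(size*size)) in(B:length(size*size)) \
                out(C:length(size*size))
  {
    // ... the same OpenMP loops as in Listing One ...
  }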

in(A:length(size*size)): The in(var-list [modifiers]) clause explicitly copies data from the host to the coprocessor. Note that:

  • The length(element-count-expr) specifies the number of elements to be transferred. The compiler will perform the conversion to bytes based on the type of the elements.
  • By default, memory will be allocated on the device and deallocated on exiting the scope of the directive.
  • The alloc_if(condition) and free_if(condition) modifiers can change the default behavior.

out(C:length(size*size)): The out(var-list [modifiers]) clause explicitly copies data from the coprocessor to the host. Note that:

  • The length(element-count-expr) specifies the number of elements to be transferred. The compiler will perform the conversion to bytes based on the type of the elements. By default, memory will be deallocated on exiting the scope of the directive.
  • The free_if(condition) modifier can change the default behavior (see the sketch after this list).
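As an illustration of those modifiers, the following minimal sketch (the persistence pattern and device number are assumptions, not taken from the article's listings) keeps A and B resident on the coprocessor across repeated offloads so they cross the PCIe bus only once:

// First offload: allocate A and B on the card and keep them there after the region exits
#pragma offload target(mic:0) \
                in(A:length(size*size) alloc_if(1) free_if(0)) \
                in(B:length(size*size) alloc_if(1) free_if(0)) \
                out(C:length(size*size))
  { /* ... compute ... */ }

// Later offloads: reuse the copies already on the card; A and B are not retransferred
#pragma offload target(mic:0) \
                nocopy(A:length(size*size) alloc_if(0) free_if(0)) \
                nocopy(B:length(size*size) alloc_if(0) free_if(0)) \
                out(C:length(size*size))
  { /* ... compute again ... */ }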

More information about the syntax of the offload directive is available from Intel.

Note that the call to doMult() uses the variable size to specify, at runtime, the number of columns in the 2D matrices. The ability to index contiguous memory through variable-length multidimensional arrays (2D, 3D, and so on) was added to the C language in the C99 standard. This feature is important because the offload pragmas transfer only contiguous regions of memory. Old-school C programmers have been trained to manually calculate the offset of each multidimensional array access from the start of a contiguous memory region. This article and the upcoming installments use the newer C99 VLA (Variable-Length Array) feature to make the examples easier to read, potentially enable more compiler optimizations, and achieve high data-transfer performance, because each multidimensional array can be transferred in a single operation. For compatibility reasons, it is also important to list the variables used in the multidimensional array declarations first in the calling sequence, because some compilers (such as the Intel compiler) do not forward variable references within an argument list.
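To make the contrast concrete, here is a minimal sketch (the function names and the scaling operation are illustrative only, not from the article) of the old manual-offset style next to the C99 VLA-pointer style used in the listings above. Note that size is listed before the array pointer, as the calling-sequence rule just described requires:

/* Pre-C99 style: flat pointer plus hand-computed offsets into contiguous memory */
void scaleOld(int size, float *A)
{
  for (int i = 0; i < size; ++i)
    for (int j = 0; j < size; ++j)
      A[i*size + j] *= 2.0f;   /* offset calculated by hand */
}

/* C99 style: pointer to a variable-length array row; same contiguous block */
void scaleNew(int size, float (* restrict A)[size])
{
  for (int i = 0; i < size; ++i)
    for (int j = 0; j < size; ++j)
      A[i][j] *= 2.0f;         /* compiler computes the identical offset */
}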


