Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

Channels ▼


Numerical and Computational Optimization on the Intel Phi

This is the third installment in a series on understanding and using the new Intel Xeon Phi coprocessors to create applications and adapt legacy software to run at high performance. The previous article in this series, Getting to 1 Teraflop on the Intel Phi Coprocessor, showed how to write an objective function that delivers an average performance exceeding a teraflop in native mode and 900 gigaflops second in offload mode on a single Intel Xeon Phi coprocessor.

This article will employ these objective functions to solve numerical nonlinear and linear optimization problems. The Intel VTune profiler will be used to examine the runtime of the optimization code and help gain insight into the Phi coprocessor's performance envelope. In the process, I provide full working source code for a real-world example that achieves performance comparable to the best observed performance with the optimized MKL matrix multiplication method shown in the first article in this series. The intention is to show that the performance potential of Phi hardware is accessible to programmers and to provide a working example that can be modified so you can explore the device performance envelope in both native and offload programming modes.

The key to entering the high performance arena with the Phi product family is to express sufficient parallelism and vector capability to fully utilize the device. After that, other device characteristics (such as memory bandwidth, memory access pattern, the number and types of floating-point calculations per datum, plus synchronization operations) determine how close an application can get to peak performance when running on all the Intel Xeon Phi cores.

Mapping Numerical Optimization to Parallel Hardware

I'll use the generic, exascale-capable mapping of objective functions to parallel hardware developed at Los Alamos National Laboratory and the Santa Fe Institute in the early 1980s in this example. The mapping is illustrated in Figure 1 for the threads on a single Intel Xeon Phi coprocessor.

OpenMP implementation for a single SMP system
Figure 1: OpenMP implementation for a single SMP system.

It is important to note that even though this article focuses on least squares objective functions for linear and nonlinear principal components analysis, the mapping itself is generic and has applicability to a wide range of numerical problems.

Building the nlopt Library

This article and the rest of the series uses the open-source, freely available nlopt optimization library, although most numerical methods that call a user-specified function to optimize will work. As I note in my book CUDA Application Design and Development, it is important to utilize numerical methods that require the objective function to return only a single floating-point value. Techniques that utilize an error vector limit the scalability of parallel and distributed implementations by imposing data bandwidth and memory capacity hardware restrictions.

Nlopt can be built on Linux by downloading the source code from the nlopt website. Version 2.3 is currently the latest. The nlopt software is well designed and can be easily built from source because it uses the standard Linux autoconf software.

A host-processor-based version for offload CPU optimizations can be built using the Intel icc compiler with the following commands:

export CC=icc
export CFLAGS="-O3"
export CXX=icpc
export CXXFLAGS="-O3"
export LDFLAGS="-L/opt/intel/lib/intel64 -lirc -limf -lsvml"
./configure --prefix=$HOME/install
make install

nlopt can also be cross-compiled to run on the Phi coprocessor by adding -mmic to the compiler flags environment variables. Note the Phi libraries need to be linked, and so the installation directory name has been changed to reflect the device architecture to prevent conflict.

export CC=icc
export CFLAGS="-mmic -O3"
export CXX=icc
export CXXFLAGS="-mmic -O3"
export LDFLAGS=
"-L/opt/intel/composer_xe_2013.1.117/compiler/lib/mic -lirc -limf -lsvml"
./configure --host=x86  --prefix=$HOME/install_mic 
make install

A Least Squares Objective Function with Data I/O

In general, most data is gathered experimentally or culled from digital archives. For ease of use, the source code for myFunc.h is modified to include a function init() that reads data from either a file or from stdin. For the purposes of this article, the linear and nonlinear genData.c programs provide working examples on how to generate the data sets used here  The reader is free to load their own data from observation or other data generators to test and solve their own computational problems.

The format of the data file comprises a header that defines the size of the input vector in the data set, the size of the output vector, and the number of examples (or observations) in the data set. The header is followed by the input and output vector of each observation.

Listing One is the complete revised source code for myFunc.h presented in the previous article with the addition of an init() function that can read the data from either stdin or a file. The use of stdin means the data can be piped to the training code when running natively on the Phi. Thus, the data does not need to consume precious memory on the Phi RAM file system, or introduce the setup complexities, performance limitations, and potential to exacerbate operating system jitter by using an NFS file system between the Phi coprocessor and the host system.

Listing One

// Rob Farber
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <malloc.h>
#include <math.h>
#include <omp.h>

#define MIC_DEV 0

#define	ALLOC alloc_if(1) free_if(0)
#define	FREE alloc_if(0) free_if(1)
#define	REUSE alloc_if(0) free_if(0)

// Use a struct to pass and get data from the objective function
typedef struct userData { 
  // Data information
  int nExamples;
  __declspec(align(64)) float * restrict example; 
  __declspec(align(64)) float * restrict param;

  // Timing information
  int isWarmup;
  double timeObjFunc;
  int countObjFunc;
  double timeDataLoad;
  double minTime, maxTime;
} userData_t;

// function to measure wall clock time
inline double getTime() { return(omp_get_wtime());}

#pragma offload_attribute (push, target (mic))

// helper macros to index into the example array
#define IN(i,nExamples,j)  (i*nExamples+j)
#define OUT(i,nExamples,j)  ((i+N_INPUT)*nExamples+j)

// Define the Sigmoid
char *desc="generated_PCA_func LINEAR()";
inline float G(float x) { return( x ) ;} 
#define G_ESTIMATE 0 
#elif USE_TANH
char *desc="generated_func tanh()";
inline float G(float x) { return( tanhf(x) ) ;} 
#define G_ESTIMATE 7 // estimate 7 flops for G
char *desc="generated func logistic()";
inline float G(float x) { return( 1.f/(1.f+expf(-x)) ) ;} 
#define G_ESTIMATE 7 // estimate flops for G
#else // Use Elliott function
char *desc="generated func Eliott activation: x/(1+fabsf(x))";
inline float G(float x) { return( x/(1.f+fabsf(x)) ) ;} 
#define G_ESTIMATE 3 // estimate flops for G

// This file defines the function to be evaluated
#include "fcn.h"

// The offload objective function
double _objFunc(unsigned int n,  const double * restrict x,
		double * restrict grad, void * restrict my_func_data)
  double err;
  userData_t *uData = (userData_t *) my_func_data;

  // convert from double to float for speed
  for(int i=0; i < N_PARAM; i++) uData->param[i]=x[i];
  int nExamples = uData->nExamples;
  // compiler workaround
  __declspec(align(64)) float * restrict example = uData->example;
  __declspec(align(64)) float * restrict param = uData->param; 
#pragma offload target(mic:MIC_DEV) in(param:length(N_PARAM) REUSE) \
                                    out(err) in(example:length(0) REUSE)
    err=0.; // initialize error here in case offload selected
#pragma omp parallel for reduction(+ : err)
    for(int i=0; i < nExamples; i++) {
      float d=myFunc(i, param, example, nExamples, NULL);
      err += d*d;

  return sqrt(err);
#pragma offload_attribute (pop)

// The optizimation library callable objective function that gathers timing information
double objFunc(unsigned int n,  const double * restrict x, 
	       double * restrict grad, void * restrict my_func_data)
  if(grad) {
    fprintf(stderr,"Gradient not implemented!\n");
  userData_t *uData = (userData_t *) my_func_data;

  double runTime=getTime();
  double err =  _objFunc(n,x,grad,my_func_data);
  runTime = getTime() - runTime;

  if(!uData->isWarmup) {
    // Note a maxTime of zero means this is the first call 
    if(uData->maxTime == 0.) {
      uData->maxTime = uData->minTime = runTime;
    uData->maxTime = (uData->maxTime > runTime)?uData->maxTime:runTime;
    uData->minTime = (uData->minTime < runTime)?uData->minTime:runTime;
    uData->timeObjFunc += runTime;

  return( err );

// Called to free memory and report timing information
void fini(userData_t *uData)
  int nThreads=0;
  // Intel recommended way to get the number of threads in offload mode.
#pragma offload target(mic:MIC_DEV) out(nThreads)
#pragma omp parallel
#pragma omp single
	nThreads = omp_get_num_threads();
  // Ouput some information
  if(!uData->isWarmup) {
    printf("number OMP threads %d\n", nThreads);
    printf("DataLoadTime %g\n", uData->timeDataLoad);
    printf("AveObjTime %g, countObjFunc %d, totalObjTime %g\n",
	   uData->timeObjFunc/uData->countObjFunc, uData->countObjFunc, uData->timeObjFunc);
    printf("Estimated flops in myFunc %d, estimated average GFlop/s %g\n", FLOP_ESTIMATE,
	   (((double)uData->nExamples*FLOP_ESTIMATE)/(uData->timeObjFunc/uData->countObjFunc)/1.e9) );
    printf("Estimated maximum GFlop/s %g, minimum GFLop/s %g\n",
	   (((double)uData->nExamples*FLOP_ESTIMATE)/(uData->maxTime)/1.e9) );

  // free if using offload mode
  __declspec(align(64)) float * restrict example = uData->example;// compiler workaround
  __declspec(align(64)) float * restrict param = uData->param;// compiler workaround
#pragma offload target(mic:MIC_DEV) in(example: length(0) FREE) in(param : length(0) FREE)

  // free on the host
  if(uData->example) free(uData->example); uData->example=NULL;
  if(uData->param) free(uData->param); uData->param=NULL;

void offloadData(userData_t *uData)
  int nDevices =_Offload_number_of_devices();

  if(nDevices == 0) {
    fprintf(stderr,"No devices found!\n");
    exit -1;

  // If necessary, perform offload transfer and allocation
  double startOffload=getTime();
  __declspec(align(64)) float * restrict example = uData->example; // compiler workaround
  __declspec(align(64)) float * restrict param = uData->param; // compiler workaround
  int Xsiz = uData->nExamples*EXAMPLE_SIZE; // compiler workaround
  // Note: the in for param just allocates memory on the device
#pragma offload target(mic:MIC_DEV) in(example: length(Xsiz) ALLOC) in(param : length(N_PARAM) ALLOC)
  // set data load time if using offload mode
  uData->timeDataLoad = getTime() - startOffload;

// loads the binary file of the form:
//    nInput, nOutput, nExamples
//    Input [0] [0:nExamples]
//    Input [1] [0:nExamples]
//    ...
//    Output [0] [0:nExamples]
//    Output [1] [0:nExamples]
//    ...
void init(char*filename, userData_t *uData)
  FILE *fn=stdin;

  // check if reading from stdin
  if(strcmp("-", filename) != 0)

  if(!fn) {
    fprintf(stderr,"Cannot open %s\n",filename);

  // read the header information
  double startTime=getTime();
  int32_t nInput, nOutput;
  int32_t nExamples;
  fread(&nInput,sizeof(int32_t), 1, fn);
  if(nInput != N_INPUT) {
    fprintf(stderr,"Number of inputs incorrect!\n");
  fread(&nOutput,sizeof(int32_t), 1, fn);
  if(nOutput != N_OUTPUT) {
    fprintf(stderr,"Number of outputs incorrect!\n");
  fread(&nExamples,sizeof(int32_t), 1, fn);
  if(nExamples <= 0) {
    fprintf(stderr,"Number of examples incorrect!\n");
  uData->nExamples = nExamples;
  // aligned allocation of the data
  uData->example=(float*) memalign(64,nExamples*EXAMPLE_SIZE*sizeof(float));
  if(!uData->example) {
    fprintf(stderr,"Not enough memory for examples!\n");
  // aligned allocation of the on-device parameters
  uData->param=(float*) memalign(64,N_PARAM*sizeof(float));
  if(!uData->param) {
    fprintf(stderr,"Not enough memory for the parameters!\n");

  // read the data
  for(int exIndex=0; exIndex < uData->nExamples; exIndex++) {
    for(int i=0; i < nInput; i++) 
      fread(&uData->example[IN(i,uData->nExamples, exIndex)],1, sizeof(float), fn);
    for(int i=0; i < nOutput; i++) 
      fread(&uData->example[OUT(i,uData->nExamples, exIndex)],1, sizeof(float), fn);
  // offload the data

  if(fn!=stdin) fclose(fn);

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.