

Exceeding Supercomputer Performance with Intel Phi

This article is not intended to be an MPI tutorial. The Intel Xeon Phi coprocessors were designed to make transitioning MPI code to this new device family as easy as possible. Those who wish to learn more about MPI should refer to the many excellent books and online tutorials that can easily be found with an Internet search.

The first modification in the mpiTrain.c source code is the definition of the MPI_NUM_COPROC_PER_NODE preprocessor variable (line 12). By default, this code assumes there are two coprocessors per workstation or computational node. Both the number of coprocessors per node and the MPI client layout can vary from system to system. Check with your system's consultants to confirm that this mapping of MPI rank number to coprocessor number is correct. For example, the TACC Stampede system contains one coprocessor per node, so to run on Stampede this variable must be deleted or commented out. Of course, MPI_NUM_COPROC_PER_NODE should be increased on MPI systems containing more than two coprocessors per node.

On startup, the main() routine at line 66 determines the rank of the MPI client in the run. Each client then reads the data it will use during the optimization. As discussed previously, the rank number is used to identify the correct file for the MPI client.

The code then identifies the rank-zero client, which is designated as the master node that runs the optimization routine. All clients of rank greater than zero act as slaves running a simple state machine implemented in startClient() at line 35. The slave code is shown in lines 36-50. The state machine uses the following op codes:

  • 0: Return from the state machine.
  • 1: Read the parameters via a broadcast, run the objective function, and return the partial error as part of a reduction.

The master node runs the code in lines 52-64 and lines 98-149, which open files for the optimized parameters and for reporting individual node performance information. It then sets up and starts the optimization. Note that the optimization library calls mpiObjFunc() at line 52, which broadcasts the state machine op code and the model parameters. mpiObjFunc() in turn calls the objFunc() described in the previous tutorials.

Once the optimization completes, the master node broadcasts the termination op code. All the clients (both master and slaves) then gather their timing information, which is returned to the master and written to a file along with the final model parameters. The per-node timing results can help identify any slow nodes in a system.

Listing One: Source code for mpiTrain.c.

// mpiTrain.c Rob Farber
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
#include <nlopt.h>
#include "mpi.h"

int numTasks, mpiRank;

// number of coprocessors per node (else assume 1 per node)
#define MPI_NUM_COPROC_PER_NODE 2

#include "myFunc.h"

void writeParam(char *filename, int nParam, double *x)
{
	FILE *fn=fopen(filename,"w");
	if(!fn) {
		fprintf(stderr,"Cannot open %s\n",filename);
		exit(1);
	}
	uint32_t n=nParam; // ensure size is uint32_t
	fwrite(&n,sizeof(uint32_t), 1, fn);
	for(int i=0; i < nParam; i++) {
		float tmp=x[i];
		fwrite(&tmp,sizeof(float), 1, fn);
	}
	fclose(fn);
}

int masterOP=1;

void startClient(void * restrict my_func_data)
{
	int op;
	double xFromMPI[N_PARAM];
	double partialError,sum;
	for(;;) { // loop until the master says I am done - then exit
		MPI_Bcast(&op, 1, MPI_INT, 0, MPI_COMM_WORLD); // receive the op code
		if(op==0) { // we are done, normal exit
			return;
		}
		MPI_Bcast(xFromMPI, N_PARAM, MPI_DOUBLE, 0, MPI_COMM_WORLD); // receive the parameters
		partialError = objFunc(N_PARAM, xFromMPI, NULL, my_func_data);
		MPI_Reduce(&partialError, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
	}
}

double mpiObjFunc(unsigned n, const double * restrict x, double * restrict grad,
		 void * restrict my_func_data)
{
	double partialError, totalError=0.;

	MPI_Bcast(&masterOP, 1, MPI_INT, 0, MPI_COMM_WORLD); // Send the master op code
	MPI_Bcast((void*) x, N_PARAM, MPI_DOUBLE, 0, MPI_COMM_WORLD); // Send the parameters
	partialError = objFunc(N_PARAM, x, NULL, my_func_data);
	MPI_Reduce(&partialError, &totalError, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD); // get the totalError
	return totalError;
}


int main(int argc, char* argv[])
{
	nlopt_opt opt;
	userData_t uData = {0};
	FILE *fout_mpi;

	if(argc < 4) {
		fprintf(stderr,"Use: datafile paramFile mpiTimingFile\n");
		return -1;
	}

	int ret = MPI_Init(&argc,&argv);
	if (ret != MPI_SUCCESS) {
		printf ("Error in MPI_Init()!\n");
		MPI_Abort(MPI_COMM_WORLD, ret);
	}
	MPI_Comm_size(MPI_COMM_WORLD, &numTasks);
	MPI_Comm_rank(MPI_COMM_WORLD, &mpiRank);

	{ // for simplicity, append the mpiRank to the data filename
		char filename[256];
		sprintf(filename,"%s.%d", argv[1], mpiRank);
		//fprintf(stderr,"Loading %s into coprocessor %d\n", filename, MIC_DEV);
		init(filename, &uData); // load this client's data as in the previous tutorials
	}

	if(mpiRank > 0) {
		startClient(&uData); // slaves run the state machine until told to exit
	} else { // Master code
		{ // open MPI results file for this size run
			char buf[256];
			sprintf(buf, "%s.%04d.txt",argv[3],numTasks);
			fout_mpi = fopen(buf,"w");
			if(!fout_mpi) {
				fprintf(stderr,"Cannot open file %s\n",buf);
				return -1;
			}
		}
		printf ("Number of tasks= %d My rank= %d, number clients %d\n", numTasks,mpiRank,numTasks);
		printf ("Number of coprocessors per node= %d\n", MPI_NUM_COPROC_PER_NODE);
		printf("myFunc %s\n", desc);
		printf("nExamples %d\n", uData.nExamples);
		printf("Number Parameters %d\n", N_PARAM);

		opt = nlopt_create(NLOPT_LN_PRAXIS, N_PARAM); // algorithm and dimensionality
		// NOTE: alternative optimization methods ...
		//opt = nlopt_create(NLOPT_LN_NEWUOA, N_PARAM);
		//opt = nlopt_create(NLOPT_LN_COBYLA, N_PARAM);
		//opt = nlopt_create(NLOPT_LN_BOBYQA, N_PARAM);
		//opt = nlopt_create(NLOPT_LN_AUGLAG, N_PARAM);
		nlopt_set_min_objective(opt, mpiObjFunc, (void*) &uData);
		nlopt_set_maxtime(opt, 3600/4.); // maximum runtime in seconds
		///******** run for a short time to get performance info
		//nlopt_set_maxtime(opt, 60); // maximum runtime in seconds

		double minf; /* the minimum objective value, upon return */
		__declspec(align(64)) double x[N_PARAM];
		for(int i=0; i < N_PARAM; i++) x[i] = 0.1*(random()/(double)RAND_MAX);

		double startTime=getTime();
		ret=nlopt_optimize(opt, x, &minf);
		printf("Optimization Time %g\n",getTime()-startTime);

		if (ret < 0) {
			printf("nlopt failed! ret %d\n", ret);
		} else {
			printf("found minimum %0.10g ret %d\n", minf,ret);
		}
		writeParam(argv[2],N_PARAM, x);

		masterOP = 0; // signal completion
		MPI_Bcast(&masterOP, 1, MPI_INT, 0, MPI_COMM_WORLD); // Send the master op code
		printf("----------- performance times for the Master ----------\n");
	}

	// all clients (master and slaves) report their timing information
	int client_nExamples[numTasks];
	double client_timeObjFunc[numTasks];
	int client_countObjFunc[numTasks];
	double client_timeDataLoad[numTasks];
	double client_minTime[numTasks];
	double client_maxTime[numTasks];
	MPI_Gather(&uData.nExamples, 1, MPI_INT, client_nExamples, 1, MPI_INT, 0, MPI_COMM_WORLD);
	MPI_Gather(&uData.countObjFunc, 1, MPI_INT, client_countObjFunc, 1, MPI_INT, 0, MPI_COMM_WORLD);
	MPI_Gather(&uData.timeDataLoad, 1, MPI_DOUBLE, client_timeDataLoad, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
	MPI_Gather(&uData.timeObjFunc, 1, MPI_DOUBLE, client_timeObjFunc, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
	MPI_Gather(&uData.minTime, 1, MPI_DOUBLE, client_minTime, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
	MPI_Gather(&uData.maxTime, 1, MPI_DOUBLE, client_maxTime, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

	if(mpiRank==0) {
		printf("----------- performance times for the MPI run ----------\n");
		printf("function: %s\n",desc);
		uint64_t totalExamples=0;
		for(int i=0; i < numTasks; i++) totalExamples += client_nExamples[i];
		printf("totalExamples %g\n",(double) totalExamples);
		printf("AveObjTime %g, countObjFunc %d, totalObjTime %g\n",
			uData.timeObjFunc/uData.countObjFunc, uData.countObjFunc, uData.timeObjFunc);
		printf("Estimated flops in myFunc %d, estimated average TFlop/s %g, nClients %d\n", FLOP_ESTIMATE,
			(((double)totalExamples * (double)FLOP_ESTIMATE)/(uData.timeObjFunc/uData.countObjFunc)/1.e12),
			numTasks);
		printf("Estimated maximum TFlop/s %g, minimum TFlop/s %g\n",
			(((double)totalExamples*(double)FLOP_ESTIMATE)/(uData.minTime)/1.e12),
			(((double)totalExamples*(double)FLOP_ESTIMATE)/(uData.maxTime)/1.e12) );

		fprintf(fout_mpi, "nExamples countObjFunc timeDataLoad timeObjFunc minObjFcnTime maxObjFcnTime\n");
		for(int i=0; i < numTasks; i++) {
			fprintf(fout_mpi, "%g %g %g %g %g %g\n",
				(double) client_nExamples[i], (double) client_countObjFunc[i], client_timeDataLoad[i],
				client_timeObjFunc[i], client_minTime[i], client_maxTime[i]);
		}
		fclose(fout_mpi);
	}

	MPI_Finalize();
	return 0;
}

The only change to myFunc.h is the additional logic to utilize the MPI_NUM_COPROC_PER_NODE variable. To save space, that change alone is shown in Listing Two.

Listing Two: Change required to myFunc.h.

  #ifdef MPI_NUM_COPROC_PER_NODE
    #define MIC_DEV (mpiRank % MPI_NUM_COPROC_PER_NODE)
  #else
    #define MIC_DEV 0
  #endif

Save the mpiTrain.c source code to the top directory used in the previous tutorials. Copy the pca directory to pca_mpi. Make the change to myFunc.h and save to disk. Then copy the pca_mpi directory to nlpca_mpi. The mpiTrain application is built as described in the previous tutorials.
