This is the second tutorial in a series on understanding and using the new Intel Xeon Phi coprocessors to create applications and adapt legacy software to run with high performance. The previous article in this series showed how to achieve more than one teraflop (TF, one trillion floating-point operations per second) using the Intel Math Kernel library (MKL). This article demonstrates how to average more than 1 TF of performance using the Intel compiler and OpenMP directives when running natively on the Intel Xeon Phi coprocessor, and approximately 900 gigaflops (GF) using offload mode. In comparison, a 12-core, 3.3 GHz X5680 Intel Westmere chipset delivers approximately 120 GF on the same problem using the same optimized OpenMP code. The focus here is on the expression of an objective function that can be used by many derivative-free optimization frameworks. A working software framework is provided that allows you to generate functions (or provide your own) to explore the performance envelope of the Intel Xeon Phi coprocessors by varying the flops, memory bandwidth, program size, and problem size of the objective function. In addition, this timing framework will demonstrate how to use the offload pragmas so that only the minimum amount of data is transferred between the host and device between function calls. These same techniques are useful when writing library methods.

The next article in this series will utilize this objective function framework to solve both linear and nonlinear real-world optimization problems with high performance on a single Phi coprocessor. Further along in the series, I will show how to extend this framework to multiple devices within a workstation and distributed coprocessors accessible via a message passing interface (MPI).

Matrix multiplication is a useful computational tool that makes a great benchmark because it can show how close software can get to peak hardware performance. In particular, the heavy reliance on a fused multiply-add operation doubles the number of floating-point operations that most processors can provide. Vendors spend large amounts of money on optimized libraries such as MKL to highlight the capabilities of their hardware. Unfortunately, these optimized libraries make terrible teaching tools because they do not show how the designer of the optimized routines was able to optimize for peak performance on the device. For teaching purposes, this article provides full working source code for a real-world example that achieves performance close to the best observed performance in the first article in this series, which was attained using calls to the optimized MKL matrix multiplication method. The intention is to show that the peak performance potential of the Phi hardware is accessible to programmers and to provide a working example that can be modified so you can explore the performance envelope of the Phi coprocessors in both native and offload programming modes.

The key to entering the high-performance arena with the Phi product family is to express sufficient parallelism *and* vector capability to fully utilize the device. After that, many other device characteristics (such as memory bandwidth, memory access pattern, the number and types of floating-point calculations per datum, plus synchronization operations) determine how close an application can get to peak performance when running across all the Phi cores.

### Objective Functions

Many generic computational optimization problems are phrased in terms of an objective function that defines some real function that is to be maximized or minimized. A least squares objective function, sometimes referred to as a "cost function," can be used to evaluate how well a model fits a dataset for some set of model parameters. Equation 1 shows how the sum of the squares of the differences between a set of known and predicted points can be used to calculate how well a model fits the overall dataset. The data points can lie on a line, a curve, or a multidimensional surface.

**Equation 1: Sum of squares of differences error.**

Because the sum of the squares of the differences is always positive, a perfect fit will result in zero error. It is the responsibility of the optimization routine that calls the objective function to find the parameters that best fit the data. Unfortunately, it is rare to find a set of parameters that fit a dataset with zero error, as numerical techniques are subject to local minima, which means that the numerical optimization technique somehow gets stuck at a low point in the cost function from which it cannot escape. Further, most real-world datasets contain noisy and sometimes conflicting measurements that also preclude finding a set of parameters that provide a perfect fit. The good news for data scientists is that numerical optimization techniques can deal with noisy and conflicting datasets.

There are several popular libraries and tools that can call an objective function to find the minimum of a function over many variables. The book *Numerical Recipes* is an excellent source of information. (It also provides working source code that is subject to copyright.) Many free and licensed numerical toolkits are available. This article and the rest of the series uses the open-source, freely available nlopt optimization library.

### Mapping Objective Functions to Parallel Hardware

A generic, exascale-capable mapping of objective functions to parallel hardware was developed at Los Alamos National Laboratory and the Santa Fe Institute in the early 1980s. This efficient SIMD (Single Instruction Multiple Data) mapping was originally developed for a 64,000-processor machine, yet it still runs nicely on modern hardware such as GPUs and the Phi coprocessor. Figure 1 shows that this parallel mapping delivers near-linear scaling on the 60,000 core Texas Advanced Computing Center (TACC) Ranger supercomputer. (The runtime is near-linear because an* O(log _{2}(N))* reduction is required where

*N*is the number of processing elements.)

Analysis in my book *CUDA Application Design and Development* shows that this mapping also scales with near-linear performance on accelerators and clusters of GPUs. In addition, my 2008 presentation Massively Parallel Near-Linear Scalability Algorithms with Application to Unstructured Video Analysis shows how hand coding an objective function using compiler-intrinsic operations attained the peak performance of four single-precision operations per clock on a single AMD Barcelona core on the Ranger supercomputer. However, cache coherency messages between the processor cores and a reduction operation adversely affected multicore performance on conventional multicore processors.

**Figure 1: Near-linear scaling of the mapping observed on the TACC Ranger supercomputer.**

Figure 2 illustrates the mapping to multiple Phi devices. Key to this mapping is the evaluation of a sufficiently large dataset to fully utilize all the parallel hardware. It is also important to utilize optimization methods that require the objective function to return only a single floating-point value. Methods that utilize an error vector limit the scalability of parallel and distributed implementations by imposing data bandwidth and memory capacity hardware restrictions.

**Figure 2: Parallel objective function mapping for parallel and distributed computers.**

Example 1 illustrates the simplicity of an OpenMP implementation of a least squares objective function on a single Phi coprocessor (or other SMP system). The user-provided function `myFunc() `

calculates the error of the model for each example in the dataset being fit. The OpenMP compiler is responsible for correctly parallelizing the reduction loop across the processor cores. The programmer is responsible for expressing `myFunc()`

so that the compiler can correctly map the parallel instances to the per-core vector units on the Phi coprocessor.

#pragma omp parallel for reduction(+ : err) for(int i=0; i < nExamples; i++) { float d=myFunc(i, P, example, nExamples, NULL); err += d*d; }

**Example 1: OpenMP implementation for a single SMP system.**

It is important to understand that even though this tutorial focuses on objective functions for linear and nonlinear principal components analysis, the mapping itself is generic and has applicability to a wide range of optimization problems, such as neural networks, naive Bayes, Gaussian discriminative analysis (GDA), k-means, logistic regression (LR), independent component analysis, expectation maximization, support vector machine (SVM), and so on.

### A Vectorizable Function for PCA and NLPCA Analysis

Principal components analysis (PCA) and nonlinear principal components analysis (NLPCA) are extensively used in data mining and data analysis to reduce the dimensionality of a dataset and extract features from a dataset.

PCA analysis accounts for the maximum amount of variance in a dataset using a set of straight lines, where each line is defined by a weighted linear combination of the observed variables. The first line, or principal component, accounts for the greatest amount of variance; while each succeeding component accounts for as much variance as possible while remaining orthogonal to (uncorrelated with) the preceding components. For example, a cigar shaped distribution would require a single line (or principal component) to account for most of the variance in the dataset illustrated in Figure 3.

**Figure 3: Example of a linear dataset for PCA.**

While PCA utilizes straight lines, NLPCA can utilize continuous open or closed curves to account for variance in data. A circle is one example of a closed curve that joins itself so there are no end points, while Figure 4 illustrates an open curve. As a result, NLPCA has the ability to represent nonlinear problems in a lower dimensional space. Figure 4 shows one dataset where a single curve could account for most of the variance*. *NLPCA has wide applicability to numerous challenging problems including image and handwriting analysis, biological modeling, climate, and chemistry.

**Figure 4: Example of a nonlinear dataset for NLPCA.**

Both PCA and NLPCA can be implemented as an optimization problem through the use of autoencoders. An autoencoder is a neural network that learns to compress or create a low-dimensional representation of a dataset by recreating an input vector with some output neurons after forcing the information through a bottleneck layer of neurons containing fewer neurons than output neurons. The smaller number of bottleneck neurons are forced during the optimization procedure to form a compressed, lower-dimensional representation of the input vector, which can be correctly decompressed to recreate the input vector at the output neurons. A simple example of a two-input, one-bottleneck neuron autoencoder is shown in Figure 5. Geoffrey E. Hinton offers an excellent set of tutorials about the use of autoencoders and machine learning.

**Figure 5: A simple PCA autoencoder.**