It is possible to get some basic profiling information from the runtime by setting the H_TIME and H_TRACE environment variables to a non-zero value.
The following is the output produced by running the offload binary (matrix.off), compiled from the matrix.c source code, with H_TRACE set to 1. The start of each offload region for doMult() is highlighted in boldface. The transfers of the A and B matrices (highlighted in green) occur in each region, as does the return of the C matrix (highlighted in blue). The offload region is started three times: once for warm-up and twice, as specified on the command line, to determine the average runtime.
Setting the H_TIME environment variable provides the following information about the runtime of each offload region. The time to initialize the target (the "host: initialize target" rows) is far higher for the first offload region (2.109299 seconds, versus at most 0.000001 seconds in the later regions), which justifies the use of a warm-up call prior to gathering any runtime statistics. After the warm-up call, waiting for the computation (the "host: wait compute" rows) accounts for most of the runtime: in the second region it takes 0.042820 out of 0.052616 seconds, or 81% of the time.
**************************************************************
timer data (sec)
**************************************************************
Offload from file matrix.c, line 16
host: total offload time 2.480265
host: initialize target 2.109299
host: acquire target 0.000007
host: wait dependencies 0.000000
host: setup buffers 0.006309
host: allocate buffers 0.006277
host: setup misc_data 0.000004
host: allocate buffer 0.000000
host: send pointers 0.010187
host: gather inputs 0.000010
host: map IN data buffer 0.000000
host: unmap IN data buffer 0.000000
host: initiate compute 0.031128
host: wait compute 0.320771
host: initiate pointer reads 0.000027
host: scatter outputs 0.000007
host: map OUT data buffer 0.000000
host: unmap OUT data buffer 0.000000
host: wait pointer reads 0.000447
host: destroy buffers 0.002036
target: total time 0.315454
target: setup offload descriptor 0.000158
target: entry lookup 0.000013
target: entry time 0.315282
target: scatter inputs 0.000100
target: add buffer reference 0.000012
target: compute 0.311368
target: gather outputs 0.000074
target: remove buffer reference 0.000017
Offload from file matrix.c, line 16
host: total offload time 0.052616
host: initialize target 0.000000
host: acquire target 0.000000
host: wait dependencies 0.000000
host: setup buffers 0.005432
host: allocate buffers 0.005425
host: setup misc_data 0.000001
host: allocate buffer 0.000000
host: send pointers 0.002159
host: gather inputs 0.000002
host: map IN data buffer 0.000000
host: unmap IN data buffer 0.000000
host: initiate compute 0.000019
host: wait compute 0.042820
host: initiate pointer reads 0.000004
host: scatter outputs 0.000002
host: map OUT data buffer 0.000000
host: unmap OUT data buffer 0.000000
host: wait pointer reads 0.000461
host: destroy buffers 0.001712
target: total time 0.039711
target: setup offload descriptor 0.000019
target: entry lookup 0.000006
target: entry time 0.039682
target: scatter inputs 0.000010
target: add buffer reference 0.000002
target: compute 0.039627
target: gather outputs 0.000025
target: remove buffer reference 0.000010
Offload from file matrix.c, line 16
host: total offload time 0.057763
host: initialize target 0.000001
host: acquire target 0.000000
host: wait dependencies 0.000000
host: setup buffers 0.009987
host: allocate buffers 0.009980
host: setup misc_data 0.000002
host: allocate buffer 0.000000
host: send pointers 0.002066
host: gather inputs 0.000003
host: map IN data buffer 0.000000
host: unmap IN data buffer 0.000000
host: initiate compute 0.000030
host: wait compute 0.041085
host: initiate pointer reads 0.002126
host: scatter outputs 0.000002
host: map OUT data buffer 0.000000
host: unmap OUT data buffer 0.000000
host: wait pointer reads 0.000467
host: destroy buffers 0.001992
target: total time 0.040084
target: setup offload descriptor 0.000012
target: entry lookup 0.000007
target: entry time 0.040064
target: scatter inputs 0.000011
target: add buffer reference 0.000002
target: compute 0.040010
target: gather outputs 0.000023
target: remove buffer reference 0.000010
**************************************************************
Intel provides more advanced profiling capabilities than the text output from the runtime. The Intel Profiling whitepaper and VTune User Guide are good starting points for further investigation.
Conclusion
The focus of this first tutorial is to provide enough information so you can start running code on the Intel Xeon Phi as quickly as possible. The complete source code examples in this tutorial demonstrate that it is indeed possible to achieve teraflop performance on these coprocessors. From a hardware perspective, the Phi coprocessor is a versatile platform that supports both modern and legacy programming models.
The true power of Intel Xeon Phi is currently realized through the capabilities of Intel's compilers, which can transparently compile a single source file to run natively on the host or on an Intel Xeon Phi coprocessor, as well as in an offload mode that can directly exploit all the hardware resources in a system. The value of being able to support multiple configurations with a single source file (or source tree) cannot be overstated from the software design and application development perspectives.
Even though Intel Xeon Phi is new, it can leverage a tremendous number of existing tools, software development platforms, and libraries. The resulting information can be overwhelming to those just beginning to use this technology. Succinctly put, the single key concept to understand about Intel Xeon Phi is that the program must express sufficient parallelism and vector capability to achieve high performance. Measurements presented in this tutorial suggest that the application or offload region must use at least 120 concurrent threads of execution.
Related Reading
CUDA vs. Phi: Phi Programming for CUDA Developers
Getting to 1 Teraflop on the Intel Phi Coprocessor
Numerical and Computational Optimization on the Intel Phi
Rob Farber is a frequent contributor to Dr. Dobb's on CPU and GPGPU-associated programming topics.




