Programming Intel's Xeon Phi: A Jumpstart Introduction


It is possible to get some basic profiling information from the runtime by setting the H_TIME and H_TRACE environment variables to a non-zero value.

Following is the output from running the offload binary (matrix.off), compiled from the matrix.c source code, with H_TRACE set to 1. The start of each offload region for doMult() is highlighted in boldface. The transfers of the A and B matrices (highlighted in green) occur in each region, as does the return of the C matrix (highlighted in blue). The offload region is started three times: once for warm-up and twice, as specified on the command line, to determine the average runtime.

Setting the H_TIME environment variable provides the following timing breakdown for each of the offload regions. The time to initialize the target (lines 6 and 66 of the listing, counting from the first row of asterisks) is far higher for the first offload region (2.109299 seconds versus 0.000001 seconds), which justifies the use of a warm-up call prior to gathering any runtime statistics. After the warm-up call, waiting for the computation (lines 48 and 78) accounts for most of the runtime: 0.042820 out of 0.052616 seconds, or 81% of the time, in the second region.

**************************************************************
                    timer data                       (sec)          
**************************************************************
      Offload from file matrix.c, line 16
          host: total offload time                   2.480265
            host: initialize target                  2.109299
            host: acquire target                     0.000007
            host: wait dependencies                  0.000000
            host: setup buffers                      0.006309
              host: allocate buffers                 0.006277
            host: setup misc_data                    0.000004
              host: allocate buffer                  0.000000
            host: send pointers                      0.010187
            host: gather inputs                      0.000010
              host: map IN data buffer               0.000000
              host: unmap IN data buffer             0.000000
            host: initiate compute                   0.031128
            host: wait compute                       0.320771
            host: initiate pointer reads             0.000027
            host: scatter outputs                    0.000007
              host: map OUT data buffer              0.000000
              host: unmap OUT data buffer            0.000000
            host: wait pointer reads                 0.000447
            host: destroy buffers                    0.002036
          target: total time                         0.315454
            target: setup offload descriptor         0.000158
            target: entry lookup                     0.000013
            target: entry time                       0.315282
              target: scatter inputs                 0.000100
                target: add buffer reference         0.000012
              target: compute                        0.311368
              target: gather outputs                 0.000074
                target: remove buffer reference      0.000017
      Offload from file matrix.c, line 16
          host: total offload time                   0.052616
            host: initialize target                  0.000000
            host: acquire target                     0.000000
            host: wait dependencies                  0.000000
            host: setup buffers                      0.005432
              host: allocate buffers                 0.005425
            host: setup misc_data                    0.000001
              host: allocate buffer                  0.000000
            host: send pointers                      0.002159
            host: gather inputs                      0.000002
              host: map IN data buffer               0.000000
              host: unmap IN data buffer             0.000000
            host: initiate compute                   0.000019
            host: wait compute                       0.042820
            host: initiate pointer reads             0.000004
            host: scatter outputs                    0.000002
              host: map OUT data buffer              0.000000
              host: unmap OUT data buffer            0.000000
            host: wait pointer reads                 0.000461
            host: destroy buffers                    0.001712
          target: total time                         0.039711
            target: setup offload descriptor         0.000019
            target: entry lookup                     0.000006
            target: entry time                       0.039682
              target: scatter inputs                 0.000010
                target: add buffer reference         0.000002
              target: compute                        0.039627
              target: gather outputs                 0.000025
                target: remove buffer reference      0.000010
      Offload from file matrix.c, line 16
          host: total offload time                   0.057763
            host: initialize target                  0.000001
            host: acquire target                     0.000000
            host: wait dependencies                  0.000000
            host: setup buffers                      0.009987
              host: allocate buffers                 0.009980
            host: setup misc_data                    0.000002
              host: allocate buffer                  0.000000
            host: send pointers                      0.002066
            host: gather inputs                      0.000003
              host: map IN data buffer               0.000000
              host: unmap IN data buffer             0.000000
            host: initiate compute                   0.000030
            host: wait compute                       0.041085
            host: initiate pointer reads             0.002126
            host: scatter outputs                    0.000002
              host: map OUT data buffer              0.000000
              host: unmap OUT data buffer            0.000000
            host: wait pointer reads                 0.000467
            host: destroy buffers                    0.001992
          target: total time                         0.040084
            target: setup offload descriptor         0.000012
            target: entry lookup                     0.000007
            target: entry time                       0.040064
              target: scatter inputs                 0.000011
                target: add buffer reference         0.000002
              target: compute                        0.040010
              target: gather outputs                 0.000023
                target: remove buffer reference      0.000010
**************************************************************

Intel has provided more advanced profiling capabilities than just text output from the runtime. The Intel Profiling whitepaper and VTune User Guide provide a good starting point for further investigation.

Conclusion

The focus of this first tutorial is to provide enough information so you can start running code on the Intel Xeon Phi as quickly as possible. The complete example source code in this tutorial demonstrates that it is indeed possible to achieve teraflop performance on these coprocessors. From a hardware perspective, the Phi coprocessor is a versatile platform that supports both modern and legacy programming models.

The true power of Intel Xeon Phi is currently realized through the capabilities of Intel's compilers, which can transparently compile a single source file to run natively on the host, natively on an Intel Xeon Phi coprocessor, or in an offload mode that can directly exploit all the hardware resources in a system. The value of being able to support multiple configurations with a single source file (or source tree) cannot be overestimated from the software design and application development perspectives.

Even though Intel Xeon Phi is new, it can leverage a tremendous number of existing tools, software development platforms, and libraries. The resulting information can be overwhelming to those just beginning to use this technology. Succinctly put, the single key concept to understand about Intel Xeon Phi is that the program must express sufficient parallelism and vector capability to achieve high performance. Measurements presented in this tutorial suggest that the application or offload region must use at least 120 concurrent threads of execution.

Related Reading

CUDA vs. Phi: Phi Programming for CUDA Developers

Getting to 1 Teraflop on the Intel Phi Coprocessor

Numerical and Computational Optimization on the Intel Phi


Rob Farber is a frequent contributor to Dr. Dobb's on CPU and GPGPU-associated programming topics.

