How It Works
In CUDA-x86, thread blocks are mapped to x86 processor cores, and thread-level parallelism is mapped to the SSE (Streaming SIMD Extensions) or AVX SIMD units, as shown below. (AVX extends SSE to 256-bit operation.) PGI indicates that:
- The size of a warp (that is, the basic unit of code to be run) will be different than the typical 32 threads per warp for a GPU. For x86 computing, a warp might be the size of the SIMD units on the x86 core (either four or eight threads) or one thread per warp when SIMD execution is not utilized.
- In many cases, the PGI CUDA C compiler removes explicit synchronization of the thread processors when the compiler can determine it is safe to split loops.
- CUDA considers the GPU as a separate device from the host processors. CUDA-x86 maintains this memory model, which means that data movement between the host and device memory spaces still consumes application runtime. As shown in the bandwidthTest SDK example below, a modern Xeon processor can transfer data to a CUDA-x86 "device" at about 4 GB/sec. All CUDA-x86 pointers reside in the x86 memory space, so programmers can use conditional compilation to directly access memory without requiring data transfers when running on multicore processors.
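To make the first two points concrete, here is a minimal plain C++ sketch of how a thread block can run on a single core. It is an illustration only, not the code the PGI compiler actually generates, and the names reverse_block, reverse_array, and threads_per_block are hypothetical. The threads of a block become iterations of a loop, and a __syncthreads() barrier becomes the boundary between two loops, which is the loop splitting described above. The example reverses an array tile by tile through a buffer that plays the role of shared memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one CUDA thread block. In CUDA, each thread of
// the block would execute something like:
//     shared[threads_per_block - 1 - tid] = in[tid];
//     __syncthreads();
//     out[tid] = shared[tid];
// On a CPU core, the threads become loop iterations and the barrier becomes
// the boundary between two loops.
void reverse_block(const int* in, int* out, std::size_t threads_per_block) {
    std::vector<int> shared(threads_per_block);  // plays the role of __shared__ memory

    // Loop 1: the code before the barrier, run for every "thread".
    for (std::size_t tid = 0; tid < threads_per_block; ++tid) {
        shared[threads_per_block - 1 - tid] = in[tid];
    }

    // The barrier sat here. Loop 1 has finished for all threads, so no
    // explicit synchronization is needed before loop 2 starts.

    // Loop 2: the code after the barrier, run for every "thread".
    for (std::size_t tid = 0; tid < threads_per_block; ++tid) {
        out[tid] = shared[tid];
    }
}

// Hypothetical stand-in for the grid. Blocks are independent of one another,
// so a real runtime could hand each iteration of this loop to a different
// core; this sketch simply runs them in sequence.
std::vector<int> reverse_array(const std::vector<int>& in,
                               std::size_t threads_per_block) {
    assert(threads_per_block > 0 && in.size() % threads_per_block == 0);
    const std::size_t num_blocks = in.size() / threads_per_block;
    std::vector<int> out(in.size());
    for (std::size_t block = 0; block < num_blocks; ++block) {
        const int* src = in.data() + block * threads_per_block;
        // Block b writes its reversed tile to the mirrored tile position.
        int* dst = out.data() + (num_blocks - 1 - block) * threads_per_block;
        reverse_block(src, dst, threads_per_block);
    }
    return out;
}
```

Calling reverse_array on the eight values 0 through 7 with threads_per_block set to 4 returns them in the order 7 through 0. The two inner loops are simple, independent iterations, which is the kind of loop a compiler can then vectorize for SSE or AVX.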
Trying Out the Compiler
The PGI installation process is fairly straightforward:
- Register and download the latest version from PGI
- Extract the tarfile at the location of your choice and follow the instructions in INSTALL.txt.
- Under Linux, this requires running the file ./install as superuser and answering a few straightforward questions.
- Note that you should answer "yes" to the installation of CUDA even if you already have a GPU version of CUDA installed on your system; otherwise, the PGI compiler will not understand files with the .cu file extension. The PGI x86 version will not conflict with the GPU version.
- Create the license.dat file.
At this point, you have a 15-day license for the PGI compilers.
Set up the environment to build with the PGI tools as discussed in the installation guide. The following are the commands for bash under Linux:
PGI=/opt/pgi; export PGI
MANPATH=$MANPATH:$PGI/linux86-64/11.5/man; export MANPATH
LM_LICENSE_FILE=$PGI/license.dat; export LM_LICENSE_FILE
PATH=$PGI/linux86-64/11.5/bin:$PATH; export PATH
Copy the PGI NVIDIA SDK samples to a convenient location and build them:
cp -r /opt/pgi/linux86-64/2011/cuda/cudaX86SDK .
cd cudaX86SDK ; make
This is the output of deviceQuery on an Intel Xeon e5560 processor:
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA

Device 0: "DEVICE EMULATION MODE"
  CUDA Driver Version: 99.99
  CUDA Runtime Version: 99.99
  CUDA Capability Major revision number: 9998
  CUDA Capability Minor revision number: 9998
  Total amount of global memory: 128000000 bytes
  Number of multiprocessors: 1
  Number of cores: 0
  Total amount of constant memory: 1021585952 bytes
  Total amount of shared memory per block: 1021586048 bytes
  Total number of registers available per block: 1021585904
  Warp size: 1
  Maximum number of threads per block: 1021585920
  Maximum sizes of each dimension of a block: 32767 x 2 x 0
  Maximum sizes of each dimension of a grid: 1021586032 x 32767 x 1021586048
  Maximum memory pitch: 4206313 bytes
  Texture alignment: 1021585952 bytes
  Clock rate: 0.00 GHz
  Concurrent copy and execution: Yes
  Run time limit on kernels: Yes
  Integrated: No
  Support host page-locked memory mapping: Yes
  Compute mode: Unknown
  Concurrent kernel execution: Yes
  Device has ECC support enabled: Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 99.99, CUDA Runtime Version = 99.99, NumDevs = 1, Device = DEVICE EMULATION MODE

PASSED

Press <Enter> to Quit...
-----------------------------------------------------------
The output of bandwidthTest shows that device transfers work as expected:
Running on...
  Device 0: DEVICE EMULATION MODE
  Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory
  Transfer Size (Bytes)    Bandwidth(MB/s)
  33554432                 4152.5

Device to Host Bandwidth, 1 Device(s), Paged memory
  Transfer Size (Bytes)    Bandwidth(MB/s)
  33554432                 4257.0

Device to Device Bandwidth, 1 Device(s)
  Transfer Size (Bytes)    Bandwidth(MB/s)
  33554432                 8459.2

[bandwidthTest] - Test results: PASSED

Press <Enter> to Quit...
-----------------------------------------------------------
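To put these numbers in terms of application runtime, the following short C++ sketch converts a transfer size and a reported bandwidth into an estimated transfer time. The function name transfer_time_ms is hypothetical, and the sketch assumes that "MB" in the bandwidthTest output means 2^20 bytes; if the tool counts 10^6 bytes per MB instead, the estimate changes by about 5 percent.

```cpp
#include <cassert>
#include <cstdint>

// Estimated time, in milliseconds, to move `bytes` at `mb_per_s`.
// Assumption (not confirmed by the output itself): 1 MB = 2^20 bytes.
double transfer_time_ms(std::uint64_t bytes, double mb_per_s) {
    assert(mb_per_s > 0.0);
    const double bytes_per_mb = 1048576.0;  // 2^20
    const double seconds =
        (static_cast<double>(bytes) / bytes_per_mb) / mb_per_s;
    return seconds * 1000.0;
}
```

Under that assumption, transfer_time_ms(33554432, 4152.5) comes to roughly 7.7 ms for the host-to-device copy shown above. That is time an x86 run spends copying data between two regions of the same physical memory, which is why avoiding the transfers through conditional compilation can pay off on multicore processors.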
As with NVIDIA's nvcc compiler, it is easy to use the PGI pgCC compiler to build an executable from a CUDA source file. As an example, copy the arrayReversal_multiblock_fast.cu code from Part 3 of this series. To compile and run it under Linux, type:
pgCC arrayReversal_multiblock_fast.cu
./a.out
Correct!