A Gentle Introduction to OpenCL


A Simple Example

Now that you have a conceptual understanding of OpenCL's host applications and kernels, it's time to look at real code. I've provided two source files: add_numbers.c and add_numbers.cl. The first contains code for a simple host application and the second contains code for a kernel that adds a series of numbers together. This discussion won't present the OpenCL API in any depth, but will focus on how the code relates to the material discussed earlier.

Example Host Application

In general, the first step of a host application is to obtain a cl_device_id for each device that will execute a kernel. The add_numbers application accesses the first GPU device associated with the first platform. This is shown in the following code:

clGetPlatformIDs(1, &platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

Here, the platform structure corresponds to the first platform found by the OpenCL runtime. A platform identifies a vendor's OpenCL installation, so a system may have an NVIDIA platform and an AMD platform. The device structure corresponds to the first accessible device associated with that platform. Because the second parameter is CL_DEVICE_TYPE_GPU, this device must be a GPU.
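
If the system doesn't have a GPU, this call returns an error rather than a device. A minimal sketch of one way to fall back to a CPU device (the error handling here is my own addition, not part of the listing above):

err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
if(err == CL_DEVICE_NOT_FOUND) {
   /* No GPU available -- try a CPU device instead */
   err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);
}
if(err < 0) {
   perror("Couldn't access any devices");
   exit(1);
}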

Next, the application creates a context containing only one device — the device structure created earlier.

context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

After creating the context, the application creates a program from the source code in the add_numbers.cl file. Specifically, the code reads the file's content into a char array called program_buffer, and then calls clCreateProgramWithSource.

program = clCreateProgramWithSource(context, 1, 
          (const char**)&program_buffer, &program_size, &err);
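
The code that reads the file isn't shown above. A minimal sketch of how program_buffer and program_size might be filled using standard C I/O (the variable handling here is an assumption, not the article's exact listing):

FILE *program_handle;
char *program_buffer;
size_t program_size;

/* Read the kernel source into a NUL-terminated buffer */
program_handle = fopen("add_numbers.cl", "rb");
fseek(program_handle, 0, SEEK_END);
program_size = ftell(program_handle);
rewind(program_handle);
program_buffer = (char*)malloc(program_size + 1);
program_buffer[program_size] = '\0';
fread(program_buffer, sizeof(char), program_size, program_handle);
fclose(program_handle);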

Once the program is created, its source code must be compiled for the devices in the context. The function that accomplishes this is clBuildProgram, and the following code shows how it's used in the add_numbers application:

clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

The fourth parameter accepts options that configure the compilation. These are similar to the flags used by gcc. For example, you can define a macro with the option -DMACRO=VALUE and turn off optimization with -cl-opt-disable.
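
If the build fails, clBuildProgram returns an error code but not the compiler's messages. Those can be retrieved with clGetProgramBuildInfo; the following sketch (the log handling is mine, not the article's listing) prints the build log on failure:

if(clBuildProgram(program, 0, NULL, NULL, NULL, NULL) < 0) {
   size_t log_size;
   char *program_log;

   /* Query the log's size, then read and print it */
   clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
         0, NULL, &log_size);
   program_log = (char*)malloc(log_size + 1);
   program_log[log_size] = '\0';
   clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
         log_size + 1, program_log, NULL);
   printf("%s\n", program_log);
   free(program_log);
   exit(1);
}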

After a cl_program has been compiled, kernels can be created from its functions. The following code creates a cl_kernel from the function called add_numbers:

kernel = clCreateKernel(program, "add_numbers", &err);
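
The kernel's arguments also need to be set before it can run. From the discussion of the kernel below, add_numbers takes three parameters: a global buffer of input values, a local scratch array, and a global buffer for the group results. A sketch of how the host might create the buffers and bind them with clCreateBuffer and clSetKernelArg (the buffer names and memory flags here are my assumptions):

cl_mem input_buffer, sum_buffer;

/* data is a 64-element float array on the host */
input_buffer = clCreateBuffer(context, CL_MEM_READ_ONLY |
      CL_MEM_COPY_HOST_PTR, 64 * sizeof(float), data, &err);
sum_buffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
      2 * sizeof(float), NULL, &err);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &input_buffer);
clSetKernelArg(kernel, 1, 4 * sizeof(float), NULL);   /* local memory: one float per work-item in a group */
clSetKernelArg(kernel, 2, sizeof(cl_mem), &sum_buffer);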

Before the application can dispatch this kernel, it needs to create a command queue to a target device. With the right configuration, a command queue can support out-of-order kernel execution and/or profiling, which makes it possible to measure how long a kernel takes to execute. The following code, however, creates a cl_command_queue that supports neither profiling nor out-of-order execution:

queue = clCreateCommandQueue(context, device, 0, &err);
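
To turn those features on, the third argument takes a bitfield of properties. For example, a queue that supports both could be created as follows (this variant isn't part of add_numbers.c):

queue = clCreateCommandQueue(context, device,
      CL_QUEUE_PROFILING_ENABLE | CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);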

At this point, the application has created all the data structures (device, kernel, program, command queue, and context) needed by an OpenCL host application. Now, it deploys the kernel to a device with the following code:

size_t global_size = 8;
size_t local_size = 4;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, 
      &local_size, 0, NULL, NULL);

Of the OpenCL functions that run on the host, clEnqueueNDRangeKernel is probably the most important to understand. Not only does it deploy kernels to devices, it also identifies how many work-items should be generated to execute the kernel (global_size) and the number of work-items in each work-group (local_size).

In this example, the kernel is executed by eight work-items divided into two work-groups of four work-items each. Returning to my analogy, this corresponds to a school containing eight students divided into two classrooms of four students each.

Example Kernel

A practical application may generate thousands or even millions of work-items, but for the simple task of adding 64 numbers, eight work-items will suffice. The program file add_numbers.cl contains a function called add_numbers that performs this operation. Like all kernel functions, it returns void, and its name is preceded by the __kernel qualifier.
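
Based on the memory regions discussed below, the function's signature looks something like the following (a reconstruction, so the parameter names may differ from the actual file):

__kernel void add_numbers(__global float4* data,
                          __local float* local_result,
                          __global float* group_result)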

The kernel has 64 values to add together and eight work-items with which to add them. After each work-item computes its sum of eight values, these partial results will be added together to form a sum for the entire group. In the end, the kernel will produce two sums, one for each work-group executing the kernel.

When a work-item starts executing add_numbers, it begins by determining which region of global memory contains the values it needs to access. This is accomplished with the following code:

global_addr = get_global_id(0) * 2;

This statement calls get_global_id to obtain the work-item's global ID, which distinguishes it from all the other executing work-items. From this ID, the work-item computes global_addr, the address in global memory from which it should load data.

Next, the work-item loads its global data into private memory. Returning to my analogy, this is like a student going to the school-wide blackboard and copying values into a notebook. The following code shows how this works.

input1 = data[global_addr];
input2 = data[global_addr+1];
sum_vector = input1 + input2;

In the kernel code, input1, input2, and sum_vector all have the data type float4. This is a vector type and it's similar to an array of four floats, but there's one significant difference: When you operate on a float4, all four floats are operated upon in the same clock cycle. If the target device supports vector operations, the last line of this code will have it perform four floating-point additions simultaneously.
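
As a brief aside (not taken from the article's listing), OpenCL C's vector literals and component suffixes work like this:

float4 a = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
float4 b = (float4)(5.0f);      /* all four components set to 5.0 */
float4 c = a + b;               /* four additions, performed component-wise */
float first = c.s0;             /* components are accessed as .s0 through .s3 */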

Each work-item stores its final sum to local memory, which corresponds to a classroom blackboard in my analogy. This is accomplished as follows:

local_addr = get_local_id(0);
local_result[local_addr] = sum_vector.s0 + sum_vector.s1 + 
                           sum_vector.s2 + sum_vector.s3;

The first line tells the work-item where it can store the sum of its eight values. The second line computes the sum and places it in local memory.

Each work-group contains four work-items, so each block of local memory contains four partial sums. To add these partial sums together, one work-item is designated to read the values for the group and arrive at a group result by adding them together. This is accomplished with the following code:

if(get_local_id(0) == 0) {
   sum = 0.0f;
   for(int i=0; i<get_local_size(0); i++) {
      sum += local_result[i];
   }
   group_result[get_group_id(0)] = sum;
}
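
One synchronization detail deserves mention: before the first work-item reads its neighbors' entries in local_result, every work-item in the group must have finished writing its own entry. In OpenCL C, that guarantee comes from a work-group barrier placed between the local store and the if block above (the excerpt doesn't show it, so I'm noting it here):

/* Wait until every work-item in the group has written local_result */
barrier(CLK_LOCAL_MEM_FENCE);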

The group_result data is stored in global memory. This is important because the host application can only read values stored in this address space. In the case of add_numbers, the application reads the two sums in group_result, adds them together, and checks the kernel's output against the correct answer. The host application completes its operation by printing the result and deallocating the structures it used to perform OpenCL processing.
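
The read-back and cleanup steps aren't shown in the excerpts above. A minimal sketch of how the host might retrieve group_result and release its OpenCL structures (the buffer and array names are assumptions carried over from the earlier sketch):

float sum[2];

/* Blocking read: copy the two group sums from the device buffer */
clEnqueueReadBuffer(queue, sum_buffer, CL_TRUE, 0,
      sizeof(sum), sum, 0, NULL, NULL);
printf("Total: %f\n", sum[0] + sum[1]);

/* Deallocate OpenCL structures */
clReleaseKernel(kernel);
clReleaseMemObject(sum_buffer);
clReleaseMemObject(input_buffer);
clReleaseCommandQueue(queue);
clReleaseProgram(program);
clReleaseContext(context);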

Conclusion

With great power comes great complexity. OpenCL, with its wealth of features, makes it possible to code routines capable of executing on devices ranging from graphics cards to supercomputers. But to take full advantage of OpenCL, you need to have a thorough understanding of host applications and kernels.

This article has presented two analogies intended to ease the learning process, but in the end, it's the code that matters. The OpenCL API takes time and effort to understand, but once you've ascended the learning curve, you'll be able to tap into computing performance that exceeds anything a regular C/C++ programmer could hope for.


— Matthew Scarpino is a software consultant in the San Francisco Bay area. He is the author of the upcoming book OpenCL in Action from Manning Publications. This article was derived from an early draft of the book.

