Build and Deploy OpenCL Kernels in Python


In the first installment of this two-part series on developing OpenCL host programs in Python with PyOpenCL, I explained some PyOpenCL features and showed how to build and deploy a kernel for OpenCL device execution. In this article, I discuss advanced features included in PyOpenCL that reduce the code required to build and deploy OpenCL kernels for many common parallel algorithms.

Checking OpenCL Program Build Failures

The code you write for an OpenCL program might generate a build failure. For example, the following lines show a modified version of the example introduced in the previous article that includes a typo and tries to access an undefined gresult variable, instead of result:

program = cl.Program(context, """
    __kernel void matrix_dot_vector(__global const float4 *matrix,
    __global const float4 *vector, __global float *result)
    {
      int gid = get_global_id(0);
      gresult[gid] = dot(matrix[gid], vector[0]);
    }
    """).build()

When you execute code that tries to build the OpenCL program with the matrix_dot_vector kernel, PyOpenCL raises a pyopencl.RuntimeError exception that provides detailed information about the detected errors. In addition, PyOpenCL saves the OpenCL program source in a temporary file that you can inspect after the build fails. The following lines show the output from a Windows Python console when I execute the previous lines:

Traceback (most recent call last):
  File "<console>", line 8, in <module>
  File "C:\Python33\lib\site-packages\pyopencl\__init__.py", line 176, in build
    options=options, source=self._source)
  File "C:\Python33\lib\site-packages\pyopencl\__init__.py", line 216, in _build_and_catch_errors
    raise err
pyopencl.RuntimeError: clBuildProgram failed: build program failure - 

Build on <pyopencl.Device 'ATI RV730' on 'AMD Accelerated Parallel Processing' at 0x4aeee10>:
"C:\Users\gaston\AppData\Local\Temp\OCL2732.tmp.cl", line 6: error: identifier
          "gresult" is undefined
  gresult[gid] = dot(matrix[gid], vector[0]);
  ^

1 error detected in the compilation of "C:\Users\gaston\AppData\Local\Temp\OCL2732.tmp.cl".

Internal error: clc compiler invocation failed.

(options: -I c:\python33\lib\site-packages\pyopencl\cl)
(source saved as c:\users\gaston\appdata\local\temp\tmpxqpd6n.cl)

It is extremely important to check for the exceptions raised by the build method because PyOpenCL's default behavior is to use a cached build if one exists. Thus, if you don't check for exceptions and run the code again in the Python console, you will see the results of the matrix-by-vector multiplication, but they will be produced by the previous version of the kernel code (the cached build).
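
For example, the following sketch shows one way to catch a build failure and display the compiler output before deciding how to proceed; kernel_source is an illustrative variable holding the OpenCL source string, not a name from the previous listings:

try:
    program = cl.Program(context, kernel_source).build()
except cl.RuntimeError as build_error:
    # The exception message includes the compiler output shown above.
    print("OpenCL program build failed:")
    print(build_error)
    raise

Re-raising the exception prevents the rest of the host program from silently continuing with a previously cached build.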

Working with Wait Lists and Command Events

PyOpenCL provides support for OpenCL command events and wait lists with the following two classes that also allow you to access OpenCL event profiling information:

  • pyopencl.Event enables you to monitor the execution of a kernel and the completion of a data transfer operation that doesn't involve a host-side buffer.
  • pyopencl.NannyEvent (subclass of pyopencl.Event) makes it possible to monitor the completion of a data transfer operation that involves a host-side buffer.

The following lines show a new version of the matrix-by-vector multiplication sample I introduced in the previous article. The initial lines don't change, so I start at the call to the program.matrix_dot_vector method that deploys the kernel for device execution. The code uses the results of the first multiplication to generate a new float4 vector named vector2, and then makes another kernel call to multiply this vector by another matrix named matrix2. The new version uses events to monitor both the execution of the kernels and the completion of the data-transfer operations.

matrix_dot_vector_kernel_event = program.matrix_dot_vector(
    queue, matrix_dot_vector.shape, None,
    matrix_buf, vector_buf, destination_buffer)

matrix_dot_vector_copy_event = cl.enqueue_copy(
    queue, matrix_dot_vector, destination_buffer,
    is_blocking=False, wait_for=[matrix_dot_vector_kernel_event])

matrix_dot_vector_copy_event.wait()

print(matrix_dot_vector)

vector2 = numpy.zeros((1, 1), cl.array.vec.float4)
vector2['x'] = matrix_dot_vector[0]
vector2['y'] = matrix_dot_vector[1]
vector2['z'] = matrix_dot_vector[2]
vector2['w'] = matrix_dot_vector[3]

matrix2 = numpy.zeros((1, 4), cl.array.vec.float4)
matrix2[0, 0]['s0'] = 1
matrix2[0, 0]['s1'] = 2
matrix2[0, 0]['s2'] = 3
matrix2[0, 0]['s3'] = 4
matrix2[0, 1]['s0'] = 5
matrix2[0, 1]['s1'] = 6
matrix2[0, 1]['s2'] = 7
matrix2[0, 1]['s3'] = 8
matrix2[0, 2]['s0'] = 9
matrix2[0, 2]['s1'] = 10
matrix2[0, 2]['s2'] = 11
matrix2[0, 2]['s3'] = 12
matrix2[0, 3]['s0'] = 13
matrix2[0, 3]['s1'] = 14
matrix2[0, 3]['s2'] = 15
matrix2[0, 3]['s3'] = 16

matrix2_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=matrix2)
vector2_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=vector2)
matrix2_dot_vector2 = numpy.zeros(4, numpy.float32)
destination_buffer2 = cl.Buffer(context, mem_flags.WRITE_ONLY, matrix2_dot_vector2.nbytes)

matrix2_dot_vector2_kernel_event = program.matrix_dot_vector(
    queue, matrix2_dot_vector2.shape, None,
    matrix2_buf, vector2_buf, destination_buffer2,
    wait_for=[matrix_dot_vector_copy_event])

matrix2_dot_vector2_copy_event = cl.enqueue_copy(
    queue, matrix2_dot_vector2, destination_buffer2,
    is_blocking=False, wait_for=[matrix2_dot_vector2_kernel_event])

matrix2_dot_vector2_copy_event.wait()

print(matrix2_dot_vector2)

The code assigns the result of the call to the program.matrix_dot_vector method to the matrix_dot_vector_kernel_event variable, which holds an instance of pyopencl.Event. Then, the call to cl.enqueue_copy uses this pyopencl.Event instance in the list of events that the method has to wait for before starting the data transfer. Notice that the wait_for argument is a list of pyopencl.Event instances, so you can schedule an operation to start only after more than one other operation has finished. In this case, because the data transfer involves a host-side buffer, cl.enqueue_copy returns a new instance of pyopencl.NannyEvent that the code assigns to the matrix_dot_vector_copy_event variable. Notice that the is_blocking argument is set to False, so the method won't wait for completion and returns the pyopencl.NannyEvent instance immediately. Just to demonstrate one possible wait mechanism, the code calls the wait() method on this instance to block until the operation related to the event finishes.
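
If you prefer to block until several operations finish, PyOpenCL also lets you wait on a list of events with pyopencl.wait_for_events. As a minimal sketch using the two events created so far:

# Block until both the kernel execution and the data transfer finish.
cl.wait_for_events([matrix_dot_vector_kernel_event, matrix_dot_vector_copy_event])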

After the data transfer finishes, the code creates a new cl.array.vec.float4 vector (vector2) by using the results of the matrix-by-vector multiplication. In this case, the code uses the identifiers available in OpenCL for float4 vectors to assign the different values for each scalar value: 'x', 'y', 'z', and 'w'. The four lines that assign values to the different scalar values are equivalent to the following code that uses the other identifiers: 's0', 's1', 's2', and 's3'.

vector2['s0'] = matrix_dot_vector[0]
vector2['s1'] = matrix_dot_vector[1]
vector2['s2'] = matrix_dot_vector[2]
vector2['s3'] = matrix_dot_vector[3]

Next, the code defines a new matrix (matrix2) composed of four cl.array.vec.float4 vectors. In this case, the code also uses the following OpenCL identifiers to assign each scalar value: 's0', 's1', 's2', and 's3'. This example demonstrates a way you can continue using your existing OpenCL knowledge while working on the host program with PyOpenCL.

The code assigns the result of the second call to the program.matrix_dot_vector method to the matrix2_dot_vector2_kernel_event variable, which holds another instance of pyopencl.Event. In this case, the wait_for argument specifies that the kernel must not start until the operation related to the matrix_dot_vector_copy_event event has finished.

The code chains the following executions by specifying the different values in the wait_for arguments:

  • Deploy the matrix_dot_vector kernel for execution
  • Transfer the results of the matrix_dot_vector kernel to the host program
  • Deploy the matrix_dot_vector kernel for execution with new arguments
  • Transfer the new results of the matrix_dot_vector kernel to the host program

The events provided by PyOpenCL give you the same control you would have in a C or C++ host program that uses OpenCL events to specify when the different commands must be executed on OpenCL devices.
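
Because the classes listed earlier also expose profiling information, you can time the enqueued commands. The following sketch assumes you recreate the command queue with profiling enabled before enqueuing the kernels and transfers; the profile timestamps are reported in nanoseconds:

queue = cl.CommandQueue(context,
    properties=cl.command_queue_properties.PROFILING_ENABLE)

# After the event has completed, profile.start and profile.end hold
# device timestamps (in nanoseconds) for the related command.
matrix_dot_vector_kernel_event.wait()
elapsed_ns = (matrix_dot_vector_kernel_event.profile.end -
              matrix_dot_vector_kernel_event.profile.start)
print("Kernel execution time: {0} ns".format(elapsed_ns))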

Working with the Element-Wise Expression Evaluation Builder

PyOpenCL provides a pyopencl.elementwise module that allows you to build kernels that evaluate multi-stage expressions on one or more operands in a single and efficient pass, avoiding the creation of temporary intermediate results that reduce overall throughput. The following lines show how to use the pyopencl.elementwise.ElementwiseKernel class to calculate the x component of polygon vertices:

import pyopencl as cl
import pyopencl.array as cl_array
import numpy

if __name__ == "__main__":
    platform = cl.get_platforms()[0]
    
    device = platform.get_devices()[0]
    
    context = cl.Context([device])
    
    queue = cl.CommandQueue(context)
    
    n = 40000
    
    x_origin_gpu = cl_array.to_device(queue, numpy.arange(0, n, 1, dtype=numpy.float32))
    
    # x_origin[i] + r * cos(2 * PI * i / n)
    from pyopencl.elementwise import ElementwiseKernel
    calculate_polygon_vertices_x = ElementwiseKernel(context,
            "float r, float *x_origin, float *x ",
            operation="x[i] = x_origin[i] + (r * cos(2 * M_PI * i / n))",
            name="calculate_polygon_vertices_x", preamble="#define M_PI 3.14159265358979323846")
    
    x_gpu = cl_array.empty_like(x_origin_gpu)
    
    # long n is included in the element-wise kernel
    # float r, float *x_origin, float *x
    event = calculate_polygon_vertices_x(50.0, x_origin_gpu, x_gpu)
    
    print(x_gpu)

Parts of the PyOpenCL documentation are out of date because newer versions of some methods removed arguments that were no longer necessary. For example, pyopencl.array.to_device (cl_array.to_device in the previous code sample) requires a CommandQueue as one of its arguments, but it no longer requires a Context. The to_device method returns an Array whose device buffer is an exact copy of the numpy.ndarray instance received as a parameter. The following line uses this method to generate the x origin points (x_origin_gpu) that will be passed as an argument when calling the generated kernel. Notice that the to_device method and the element-wise kernel builder make it easier to pass arguments to the kernel than in the previous example:

x_origin_gpu = cl_array.to_device(queue, numpy.arange(0, n, 1, dtype=numpy.float32))
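
When you need the results back in the host program, the Array instances returned by to_device and empty_like provide a get method that copies the device data into a regular numpy.ndarray. A minimal sketch reusing x_gpu from the previous listing:

# Copy the computed vertices from the device back to the host.
x_host = x_gpu.get()
print(x_host[:10])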

