In the first installment of this two-part series on developing OpenCL host programs in Python with PyOpenCL, I explained some PyOpenCL features and showed how to build and deploy a kernel for OpenCL device execution. In this article, I discuss advanced features included in PyOpenCL that reduce the code required to build and deploy OpenCL kernels for many common parallel algorithms.
Checking OpenCL Program Build Failures
The code you write for an OpenCL program might generate a build failure. For example, the following lines show a modified version of the example introduced in the previous article that includes a typo and tries to access an undefined gresult variable instead of result:
program = cl.Program(context, """
    __kernel void matrix_dot_vector(__global const float4 *matrix,
        __global const float4 *vector, __global float *result)
    {
        int gid = get_global_id(0);
        gresult[gid] = dot(matrix[gid], vector[0]);
    }
    """).build()
When you execute code that tries to build the OpenCL program with the matrix_dot_vector kernel, PyOpenCL raises a pyopencl.RuntimeError exception and provides detailed results about the detected errors. In addition, PyOpenCL saves the OpenCL program source in a temporary file that you can access after the build fails. The following lines show the results from a Windows Python console when I execute the previous lines:
Traceback (most recent call last):
File "<console>", line 8, in <module>
File "C:\Python33\lib\site-packages\pyopencl\__init__.py", line 176, in build
options=options, source=self._source)
File "C:\Python33\lib\site-packages\pyopencl\__init__.py", line 216, in _build_and_catch_errors
raise err
pyopencl.RuntimeError: clBuildProgram failed: build program failure -
Build on <pyopencl.Device 'ATI RV730' on 'AMD Accelerated Parallel Processing' at 0x4aeee10>:
"C:\Users\gaston\AppData\Local\Temp\OCL2732.tmp.cl", line 6: error: identifier
"gresult" is undefined
gresult[gid] = dot(matrix[gid], vector[0]);
^
1 error detected in the compilation of "C:\Users\gaston\AppData\Local\Temp\OCL2732.tmp.cl".
Internal error: clc compiler invocation failed.
(options: -I c:\python33\lib\site-packages\pyopencl\cl)
(source saved as c:\users\gaston\appdata\local\temp\tmpxqpd6n.cl)
It is extremely important to check for the exceptions raised by the build method, because the default behavior for PyOpenCL is to use the cached build if one exists. Thus, if you don't check for exceptions and run the code again in the Python console, you will see the results of the matrix multiplication, but they will be produced by the previously cached version of the kernel code rather than by the source you just edited.
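As a minimal sketch of one way to catch a failed build (kernel_source is a hypothetical variable holding the OpenCL C code shown above, and context comes from the host-program setup in the previous article), you can wrap the build call in a try block and react to the exception instead of silently falling back to a cached build:

try:
    # Attempt to compile the kernel source for the devices in the context.
    program = cl.Program(context, kernel_source).build()
except cl.RuntimeError as build_error:
    # The exception message includes the compiler output and the path to
    # the temporary file where PyOpenCL saved the kernel source.
    print("OpenCL build failed:")
    print(build_error)
    raise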
Working with Wait Lists and Command Events
PyOpenCL provides support for OpenCL command events and wait lists with the following two classes, which also allow you to access OpenCL event profiling information:
- pyopencl.Event enables you to monitor the execution of a kernel and the completion of a data transfer operation that doesn't involve a host-side buffer.
- pyopencl.NannyEvent (a subclass of pyopencl.Event) makes it possible to monitor the completion of a data transfer operation that involves a host-side buffer.
The following lines show a new version of the matrix-by-vector multiplication sample I introduced in the previous article. I don't show the initial lines because they don't change, so I start with the line that calls the program.matrix_dot_vector method to deploy the kernel for device execution. The code uses the results of the matrix-by-vector multiplication to generate a new float4 vector named vector2 and makes another kernel call to multiply this vector by another matrix named matrix2. The new version uses events to monitor both the execution of the kernels and the completion of the data-transfer operations.
matrix_dot_vector_kernel_event = program.matrix_dot_vector(queue, matrix_dot_vector.shape, None, matrix_buf, vector_buf, destination_buffer)
matrix_dot_vector_copy_event = cl.enqueue_copy(queue, matrix_dot_vector, destination_buffer, is_blocking=False, wait_for=[matrix_dot_vector_kernel_event])
matrix_dot_vector_copy_event.wait()
print(matrix_dot_vector)

vector2 = numpy.zeros((1, 1), cl.array.vec.float4)
vector2['x'] = matrix_dot_vector[0]
vector2['y'] = matrix_dot_vector[1]
vector2['z'] = matrix_dot_vector[2]
vector2['w'] = matrix_dot_vector[3]

matrix2 = numpy.zeros((1, 4), cl.array.vec.float4)
matrix2[0, 0]['s0'] = 1
matrix2[0, 0]['s1'] = 2
matrix2[0, 0]['s2'] = 3
matrix2[0, 0]['s3'] = 4
matrix2[0, 1]['s0'] = 5
matrix2[0, 1]['s1'] = 6
matrix2[0, 1]['s2'] = 7
matrix2[0, 1]['s3'] = 8
matrix2[0, 2]['s0'] = 9
matrix2[0, 2]['s1'] = 10
matrix2[0, 2]['s2'] = 11
matrix2[0, 2]['s3'] = 12
matrix2[0, 3]['s0'] = 13
matrix2[0, 3]['s1'] = 14
matrix2[0, 3]['s2'] = 15
matrix2[0, 3]['s3'] = 16

matrix2_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=matrix2)
vector2_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=vector2)
matrix2_dot_vector2 = numpy.zeros(4, numpy.float32)
destination_buffer2 = cl.Buffer(context, mem_flags.WRITE_ONLY, matrix2_dot_vector2.nbytes)

matrix2_dot_vector2_kernel_event = program.matrix_dot_vector(queue, matrix2_dot_vector2.shape, None, matrix2_buf, vector2_buf, destination_buffer2, wait_for=[matrix_dot_vector_copy_event])
matrix2_dot_vector2_copy_event = cl.enqueue_copy(queue, matrix2_dot_vector2, destination_buffer2, is_blocking=False, wait_for=[matrix2_dot_vector2_kernel_event])
matrix2_dot_vector2_copy_event.wait()
print(matrix2_dot_vector2)
The code assigns the result of the call to the program.matrix_dot_vector method to the matrix_dot_vector_kernel_event variable, which holds an instance of pyopencl.Event. Then, the call to cl.enqueue_copy uses this pyopencl.Event instance in the list of events that the method has to wait for before starting the data transfer. Notice that the wait_for argument is a list of pyopencl.Event instances, so you can schedule the execution to start only after more than one event indicates that other operations have finished. In this case, because the data transfer involves a host-side buffer, cl.enqueue_copy returns a new instance of pyopencl.NannyEvent that the code assigns to the matrix_dot_vector_copy_event variable. Notice that the is_blocking argument is set to False, so the method won't wait for completion and will return the pyopencl.NannyEvent instance immediately. Just to demonstrate one of the possible wait mechanisms you can use, the code calls the wait() method on this instance to block until the operation related to the event finishes.
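If you prefer to wait for several events at once instead of calling wait() on each one, PyOpenCL also provides the module-level pyopencl.wait_for_events function. The following minimal sketch reuses the event variables from the code above:

# Block the host program until every event in the list has completed.
cl.wait_for_events([matrix_dot_vector_kernel_event,
                    matrix_dot_vector_copy_event])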
After the data transfer finishes, the code creates a new cl.array.vec.float4 vector (vector2) by using the results of the matrix-by-vector multiplication. In this case, the code uses the identifiers available in OpenCL for float4 vectors to assign each scalar value: 'x', 'y', 'z', and 'w'. The four lines that assign values to the different scalar values are equivalent to the following code that uses the other identifiers: 's0', 's1', 's2', and 's3'.
vector2['s0'] = matrix_dot_vector[0]
vector2['s1'] = matrix_dot_vector[1]
vector2['s2'] = matrix_dot_vector[2]
vector2['s3'] = matrix_dot_vector[3]
Next, the code defines a new matrix (matrix2) composed of four cl.array.vec.float4 vectors. In this case, the code also uses the following OpenCL identifiers to assign each scalar value: 's0', 's1', 's2', and 's3'.
This example demonstrates a way you can continue using your existing OpenCL knowledge while working on the host program with PyOpenCL.
The code assigns the result of the second call to the program.matrix_dot_vector method to the matrix2_dot_vector2_kernel_event variable, which holds an instance of pyopencl.Event. In this case, there is a value for the wait_for argument specifying that the method must make sure that the execution related to the matrix_dot_vector_copy_event event has finished before starting.
The code chains the following executions by specifying the different values in the wait_for arguments:
- Deploy the matrix_dot_vector kernel for execution.
- Transfer the results of the matrix_dot_vector kernel to the host program.
- Deploy the matrix_dot_vector kernel for execution with new arguments.
- Transfer the new results of the matrix_dot_vector kernel to the host program.
The events provided by PyOpenCL give you the same control you would have in a C or C++ host program that uses OpenCL events to specify when the different calls must be executed on the OpenCL devices.
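Because these event classes also expose OpenCL profiling information, you can measure how long an operation took on the device. The following minimal sketch assumes you create the command queue with profiling enabled (replacing the plain cl.CommandQueue(context) call used earlier) and then reads the standard start and end counters from one of the kernel events:

# Create the queue with profiling enabled.
queue = cl.CommandQueue(context,
    properties=cl.command_queue_properties.PROFILING_ENABLE)

# ... enqueue the kernel as before and keep the returned event ...

matrix_dot_vector_kernel_event.wait()
# profile.start and profile.end are expressed in nanoseconds.
elapsed_ns = (matrix_dot_vector_kernel_event.profile.end -
              matrix_dot_vector_kernel_event.profile.start)
print("Kernel execution took %f ms" % (elapsed_ns * 1e-6))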
Working with the Element-Wise Expression Evaluation Builder
PyOpenCL provides a pyopencl.elementwise module that allows you to build kernels that evaluate multi-stage expressions on one or more operands in a single, efficient pass, avoiding the creation of temporary intermediate results that reduce overall throughput. The following lines show how to use the pyopencl.elementwise.ElementwiseKernel class to calculate the x component of polygon vertices:
import pyopencl as cl
import pyopencl.array as cl_array
import numpy

if __name__ == "__main__":
    platform = cl.get_platforms()[0]
    device = platform.get_devices()[0]
    context = cl.Context([device])
    queue = cl.CommandQueue(context)
    n = 40000
    x_origin_gpu = cl_array.to_device(queue, numpy.arange(0, n, 1, dtype=numpy.float32))
    # x_origin[i] + r * cos(2 * PI * i / n)
    from pyopencl.elementwise import ElementwiseKernel
    calculate_polygon_vertices_x = ElementwiseKernel(context,
        "float r, float *x_origin, float *x",
        operation="x[i] = x_origin[i] + (r * cos(2 * M_PI * i / n))",
        name="calculate_polygon_vertices_x",
        preamble="#define M_PI 3.14159265358979323846")
    x_gpu = cl_array.empty_like(x_origin_gpu)
    # long n is included in the element-wise kernel
    # float r, float *x_origin, float *x
    event = calculate_polygon_vertices_x(50.0, x_origin_gpu, x_gpu)
    print(x_gpu)
The documentation for PyOpenCL contains many errors because some newer methods removed unnecessary arguments. For example, pyopencl.array.to_device (cl_array.to_device in the previous code sample) requires a CommandQueue as one of its arguments, but it no longer requires a Context. The to_device method returns an Array that is an exact copy of the numpy.ndarray instance received as a parameter. The following line uses this method to generate the x origin points (x_origin_gpu) that will be used as an argument to call the generated kernel. Notice that the to_device method and the element-wise kernel builder make it easier to pass arguments to the kernel than in the previous example:
x_origin_gpu = cl_array.to_device(queue, numpy.arange(0, n, 1, dtype=numpy.float32))
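To inspect the computed vertices on the host side, rather than just printing the pyopencl.array.Array object, you can copy the device array back into a NumPy array. This is a minimal sketch that uses the get method of the Array returned by the element-wise kernel call:

# Copy the result from the device array back into a numpy.ndarray.
x_host = x_gpu.get()
print(x_host[:10])  # show the first ten x components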