Easy OpenCL with Python


As an example, I'll walk through the 12 steps required to build and deploy a kernel with PyOpenCL. If you have experience writing C++ OpenCL host applications, you will recognize each step and see how to accomplish the same work in a PyOpenCL host application that builds and deploys your OpenCL kernels.

The following lines show the code for an OpenCL kernel that computes the product of a matrix and a vector:

__kernel void matrix_dot_vector(__global const float4 *matrix,
        __global const float4 *vector, __global float *result)
{
    int gid = get_global_id(0);
    result[gid] = dot(matrix[gid], vector[0]);
}

Both the matrix and vector kernel arguments are of type float4 and are stored in the device's global address space, also known as "global memory." The kernel code retrieves the global ID number (gid) and uses it to calculate the product of the float4 matrix row whose index is equal to the global ID number and the float4 vector. The float result is stored in the global ID number index of result. Figure 1 shows an example of a 4-by-4 matrix multiplied by a 4-element vector.

PyOpenCL
Figure 1: A 4x4 matrix multiplied by a 4-element vector with the result.

The matrix-vector multiplication shown in Figure 1 requires the following operations:

1 * 1 + 2 * 2 + 4 * 4 + 8 * 8 = 85
16 * 1 + 32 * 2 + 64 * 4 + 128 * 8 = 1360
3 * 1 + 6 * 2 + 9 * 4 + 12 * 8 = 147
5 * 1 + 10 * 2 + 15 * 4 + 25 * 8 = 285

Each row in the matrix is a float4 vector, so the kernel needs just one dot operation to compute the product of one row and the float4 vector. For example, the first matrix row is (1.0, 2.0, 4.0, 8.0) and the only element of vector is also (1.0, 2.0, 4.0, 8.0). The dot operation for the first matrix row therefore receives two arguments with four float values packed in each: (1.0, 2.0, 4.0, 8.0) for matrix[gid] and (1.0, 2.0, 4.0, 8.0) for vector[0]. The code takes advantage of the vector processing capabilities of OpenCL and demonstrates the support that PyOpenCL provides for OpenCL vector types in Python.
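To see the arithmetic concretely, here is a plain-Python equivalent of the single dot operation the kernel performs for each row (dot4 is an illustrative stand-in for OpenCL's built-in dot; no OpenCL device is required to run it):

```python
def dot4(a, b):
    """Plain-Python equivalent of OpenCL's dot() for two float4 values."""
    return sum(x * y for x, y in zip(a, b))

row = (1.0, 2.0, 4.0, 8.0)       # first matrix row
vec = (1.0, 2.0, 4.0, 8.0)       # the only element of vector
print(dot4(row, vec))            # 85.0, matching the first operation above
```

The other three rows work out the same way, producing 1360, 147, and 285.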

The following lines show Python code that uses PyOpenCL and Numpy to perform the steps required for an OpenCL host program. The code includes comments that indicate which blocks of code are performing each of the 12 steps of the typical OpenCL C++ host program. You can also run different parts of the code in the Python console.

import pyopencl as cl
from pyopencl import array
import numpy

if __name__ == "__main__":
    vector = numpy.zeros((1, 1), cl.array.vec.float4)
    matrix = numpy.zeros((1, 4), cl.array.vec.float4)
    matrix[0, 0] = (1, 2, 4, 8)
    matrix[0, 1] = (16, 32, 64, 128)
    matrix[0, 2] = (3, 6, 9, 12)
    matrix[0, 3] = (5, 10, 15, 25)
    vector[0, 0] = (1, 2, 4, 8)
    
    ## Step #1. Obtain an OpenCL platform.
    platform = cl.get_platforms()[0]
    
    ## A real host program should also check support for the
    ## required platform extensions with platform.extensions.
    
    ## Step #2. Obtain a device id for at least one device (accelerator).
    device = platform.get_devices()[0]
    
    ## A real host program should also check support for the
    ## required device extensions with device.extensions.
    
    ## Step #3. Create a context for the selected device.
    context = cl.Context([device])
    
    ## Step #4. Create the accelerator program from source code.
    ## Step #5. Build the program.
    ## Step #6. Create one or more kernels from the program functions.
    program = cl.Program(context, """
        __kernel void matrix_dot_vector(__global const float4 *matrix,
        __global const float4 *vector, __global float *result)
        {
          int gid = get_global_id(0);
          result[gid] = dot(matrix[gid], vector[0]);
        }
        """).build()
    
    ## Step #7. Create a command queue for the target device.
    queue = cl.CommandQueue(context)
    
    ## Step #8. Allocate device memory and move input data from the host to the device memory.
    mem_flags = cl.mem_flags
    matrix_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=matrix)
    vector_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=vector)
    matrix_dot_vector = numpy.zeros(4, numpy.float32)
    destination_buf = cl.Buffer(context, mem_flags.WRITE_ONLY, matrix_dot_vector.nbytes)
    
    ## Step #9. Associate the arguments to the kernel with kernel object.
    ## Step #10. Deploy the kernel for device execution.
    program.matrix_dot_vector(queue, matrix_dot_vector.shape, None, matrix_buf, vector_buf, destination_buf)
    
    ## Step #11. Move the kernel’s output data to host memory.
    cl.enqueue_copy(queue, matrix_dot_vector, destination_buf)
    
    ## Step #12. Release context, program, kernels and memory.
    ## PyOpenCL performs this step for you, and therefore,
    ## you don't need to worry about cleanup code
    
    print(matrix_dot_vector)

The first lines create and initialize the two variables that hold the matrix and the vector. Notice that vector is an array of cl.array.vec.float4 with a single element, and matrix is an array of cl.array.vec.float4 with four elements. I used numpy.zeros to create each array with the cl.array.vec.float4 type and then additional assignments to load the values shown in Figure 1, which makes the use of the cl.array.vec types easy to follow:

vector = numpy.zeros((1, 1), cl.array.vec.float4)
matrix = numpy.zeros((1, 4), cl.array.vec.float4)
matrix[0, 0] = (1, 2, 4, 8)
matrix[0, 1] = (16, 32, 64, 128)
matrix[0, 2] = (3, 6, 9, 12)
matrix[0, 3] = (5, 10, 15, 25)
vector[0, 0] = (1, 2, 4, 8)

The code retrieves the first available platform, then the first device for this platform. There is no code to check either the available extensions or the device type. However, I placed comments in the code as a reminder that these tasks are necessary in a more complex host program.
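The extension strings that platform.extensions and device.extensions return are space-separated lists of names, so such a check can be a one-liner. A minimal sketch in plain Python (the has_extension helper and the sample extension string are illustrative, not part of the article's code):

```python
def has_extension(extensions_string, name):
    """Check for an extension name in a space-separated extension string,
    such as the ones returned by platform.extensions and device.extensions."""
    return name in extensions_string.split()

# Sample string for illustration only; query the real platform or device instead.
exts = "cl_khr_icd cl_khr_fp64 cl_khr_global_int32_base_atomics"
print(has_extension(exts, "cl_khr_fp64"))   # True
print(has_extension(exts, "cl_khr_fp16"))   # False
```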

Then, the code creates an OpenCL context for the selected device and calls cl.Program to create a program for the context with the kernel source code as one of the arguments. The call to the build() method for the created cl.Program instance builds the kernel.

The code calls cl.CommandQueue with the context as an argument to create a command queue (queue) for the target device. Then, it allocates device memory and moves input data from the host to the device memory. The following lines use the most basic features provided by PyOpenCL to do this:

mem_flags = cl.mem_flags
matrix_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=matrix)
vector_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=vector)
matrix_dot_vector = numpy.zeros(4, numpy.float32)
destination_buf = cl.Buffer(context, mem_flags.WRITE_ONLY, matrix_dot_vector.nbytes)

The code defines the following buffers by calling cl.Buffer:

  • matrix_buf: A read-only buffer that copies the data from the matrix variable. The kernel will read from this buffer in the global memory space.
  • vector_buf: A read-only buffer that copies the data from the vector variable. The kernel will read from this buffer in the global memory space.
  • destination_buf: A write-only buffer that will hold the result of the matrix-by-vector multiplication. The kernel will write to this buffer in the global memory space.
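As a sanity check on those allocations: each float4 occupies 16 bytes (four 32-bit floats), so the buffer sizes can be worked out by hand in plain Python with the standard struct module:

```python
import struct

FLOAT4_SIZE = struct.calcsize("4f")       # four 32-bit floats = 16 bytes

matrix_bytes = 4 * FLOAT4_SIZE            # 4 float4 rows -> matrix_buf
vector_bytes = 1 * FLOAT4_SIZE            # a single float4 -> vector_buf
result_bytes = 4 * struct.calcsize("f")   # 4 float32 results -> destination_buf
print(matrix_bytes, vector_bytes, result_bytes)  # 64 16 16
```

These match the nbytes values that the numpy arrays report, which is what the COPY_HOST_PTR buffers and destination_buf are sized from.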

The following line associates the arguments with the kernel and deploys it for device execution by calling the method that PyOpenCL generates in program from the built kernel's name, matrix_dot_vector. The previously created queue is the first argument:

program.matrix_dot_vector(queue, matrix_dot_vector.shape, None, matrix_buf, vector_buf, destination_buf)

The kernel receives the following arguments:

  • matrix_buf for matrix
  • vector_buf for vector
  • destination_buf for result
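The second argument in the call, matrix_dot_vector.shape, sets the global work size: with shape (4,), OpenCL launches four work items, one per matrix row, and the None that follows leaves the local work size up to the implementation. That dispatch can be sketched in plain Python (run_kernel is a hypothetical stand-in for the OpenCL runtime, not a PyOpenCL function):

```python
def run_kernel(kernel, global_size, *args):
    """Mimic a 1-D NDRange launch: invoke the kernel once per global id."""
    for gid in range(global_size[0]):
        kernel(gid, *args)

def matrix_dot_vector_kernel(gid, matrix, vector, result):
    # Same work as the OpenCL kernel: result[gid] = dot(matrix[gid], vector[0])
    result[gid] = sum(m * v for m, v in zip(matrix[gid], vector[0]))

matrix = [(1, 2, 4, 8), (16, 32, 64, 128), (3, 6, 9, 12), (5, 10, 15, 25)]
vector = [(1, 2, 4, 8)]
result = [0] * 4
run_kernel(matrix_dot_vector_kernel, (4,), matrix, vector, result)
print(result)  # [85, 1360, 147, 285]
```

On a real device the four work items run in parallel rather than in a loop, which is the point of deploying the kernel in the first place.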

When the kernel finishes, it is time to move the kernel's output data (result) stored in destination_buf to the host program memory. The following line calls cl.enqueue_copy to do this, and the result will be available in the matrix_dot_vector variable.

cl.enqueue_copy(queue, matrix_dot_vector, destination_buf)

In this example, the code doesn't take advantage of the events that fire when the kernel finishes its execution. Even so, because PyOpenCL performs all the necessary cleanup operations, you don't need to worry about reference counts or about releasing the underlying OpenCL structures and resources.

Conclusion

This example shows basic features that PyOpenCL provides to Python developers who want to create OpenCL host applications. In the next article in this series, I'll dive deep into more advanced features that reduce the code required to build and deploy OpenCL kernels for many common parallel algorithms.


Gaston Hillar is a frequent contributor to Dr. Dobb's.
