
# Easy OpenCL with Python

As an example, I'll walk through the 12 steps required to build and deploy a kernel with PyOpenCL. If you have experience with C++ OpenCL host applications, this will show you how to use PyOpenCL to prepare host applications that build and deploy your OpenCL kernels.

The following lines show the code for an OpenCL kernel that computes the product of a matrix and a vector:

```
__kernel void matrix_dot_vector(__global const float4 *matrix,
        __global const float4 *vector, __global float *result)
{
    int gid = get_global_id(0);
    result[gid] = dot(matrix[gid], vector[0]);
}
```

Both the `matrix` and `vector` kernel arguments are pointers to `float4` values stored in the device's global address space, also known as "global memory." The kernel retrieves the global ID number (`gid`) and uses it to index the matrix row to multiply: the `dot` built-in computes the product of the `float4` matrix row `matrix[gid]` and the `float4` vector, and the `float` result is stored at index `gid` of `result`. Figure 1 shows an example of a 4-by-4 matrix multiplied by a 4-element vector.

Figure 1: A 4x4 matrix multiplied by a 4-element vector, with the result.

The matrix-vector multiplication shown in Figure 1 requires the following operations:

```
1 * 1 + 2 * 2 + 4 * 4 + 8 * 8 = 85
16 * 1 + 32 * 2 + 64 * 4 + 128 * 8 = 1360
3 * 1 + 6 * 2 + 9 * 4 + 12 * 8 = 147
5 * 1 + 10 * 2 + 15 * 4 + 25 * 8 = 285
```

Each row in the matrix is a `float4` vector, so the kernel needs just one `dot` operation to compute the product of one row and the `float4` vector. For example, the first matrix row is `(1.0, 2.0, 4.0, 8.0)` and the single element of `vector` is also `(1.0, 2.0, 4.0, 8.0)`. The `dot` operation for the first matrix row therefore receives two arguments, each packing four `float` values: `(1.0, 2.0, 4.0, 8.0)` for `matrix[gid]`, and `(1.0, 2.0, 4.0, 8.0)` for `vector[0]`. The code takes advantage of the vector-processing capabilities of OpenCL and demonstrates the support for vector types that PyOpenCL provides to Python.
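To see what each work-item computes, here is a host-side sketch of the same per-row `dot` operations using plain NumPy (no PyOpenCL required); each loop iteration plays the role of one work-item with global ID `gid`:

```python
import numpy as np

# The matrix rows and vector from Figure 1, as float32 like the kernel uses.
matrix = np.array([[1, 2, 4, 8],
                   [16, 32, 64, 128],
                   [3, 6, 9, 12],
                   [5, 10, 15, 25]], dtype=np.float32)
vector = np.array([1, 2, 4, 8], dtype=np.float32)

# Each work-item gid computes: result[gid] = dot(matrix[gid], vector)
result = np.array([np.dot(matrix[gid], vector)
                   for gid in range(matrix.shape[0])], dtype=np.float32)
print(result)  # [  85. 1360.  147.  285.]
```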

The following lines show Python code that uses PyOpenCL and Numpy to perform the steps required for an OpenCL host program. The code includes comments that indicate which blocks of code are performing each of the 12 steps of the typical OpenCL C++ host program. You can also run different parts of the code in the Python console.

```
import pyopencl as cl
from pyopencl import array
import numpy

if __name__ == "__main__":
    vector = numpy.zeros((1, 1), cl.array.vec.float4)
    matrix = numpy.zeros((1, 4), cl.array.vec.float4)
    matrix[0, 0] = (1, 2, 4, 8)
    matrix[0, 1] = (16, 32, 64, 128)
    matrix[0, 2] = (3, 6, 9, 12)
    matrix[0, 3] = (5, 10, 15, 25)
    vector[0, 0] = (1, 2, 4, 8)

    ## Step #1. Obtain an OpenCL platform.
    platform = cl.get_platforms()[0]

    ## It would be necessary to add some code to check the support for
    ## the necessary platform extensions with platform.extensions

    ## Step #2. Obtain a device id for at least one device (accelerator).
    device = platform.get_devices()[0]

    ## It would be necessary to add some code to check the support for
    ## the necessary device extensions with device.extensions

    ## Step #3. Create a context for the selected device.
    context = cl.Context([device])

    ## Step #4. Create the accelerator program from source code.
    ## Step #5. Build the program.
    ## Step #6. Create one or more kernels from the program functions.
    program = cl.Program(context, """
        __kernel void matrix_dot_vector(__global const float4 *matrix,
                __global const float4 *vector, __global float *result)
        {
            int gid = get_global_id(0);
            result[gid] = dot(matrix[gid], vector[0]);
        }
        """).build()

    ## Step #7. Create a command queue for the target device.
    queue = cl.CommandQueue(context)

    ## Step #8. Allocate device memory and move input data from the host to the device memory.
    mem_flags = cl.mem_flags
    matrix_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=matrix)
    vector_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=vector)
    matrix_dot_vector = numpy.zeros(4, numpy.float32)
    destination_buf = cl.Buffer(context, mem_flags.WRITE_ONLY, matrix_dot_vector.nbytes)

    ## Step #9. Associate the arguments to the kernel with the kernel object.
    ## Step #10. Deploy the kernel for device execution.
    program.matrix_dot_vector(queue, matrix_dot_vector.shape, None, matrix_buf, vector_buf, destination_buf)

    ## Step #11. Move the kernel's output data to host memory.
    cl.enqueue_copy(queue, matrix_dot_vector, destination_buf)

    ## Step #12. Release context, program, kernels and memory.
    ## PyOpenCL performs this step for you, and therefore,
    ## you don't need to worry about cleanup code.

    print(matrix_dot_vector)
```

The first lines create two variables that initialize both the matrix and the vector. Notice that `vector` is an array of `cl.array.vec.float4` with a single element and `matrix` is an array of `cl.array.vec.float4` with four elements. I used `numpy.zeros` to create each array with the `cl.array.vec.float4` type and then additional code to initialize the values shown in Figure 1, so you can easily see how the `cl.array.vec` types are used:

```
vector = numpy.zeros((1, 1), cl.array.vec.float4)
matrix = numpy.zeros((1, 4), cl.array.vec.float4)
matrix[0, 0] = (1, 2, 4, 8)
matrix[0, 1] = (16, 32, 64, 128)
matrix[0, 2] = (3, 6, 9, 12)
matrix[0, 3] = (5, 10, 15, 25)
vector[0, 0] = (1, 2, 4, 8)
```
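If you want to experiment with this layout without a working OpenCL stack, you can mimic it with a plain NumPy structured dtype: `cl.array.vec.float4` is essentially a 16-byte record of four packed `float32` components named `x`, `y`, `z`, and `w`. The `float4` dtype below is my stand-in, not a PyOpenCL type:

```python
import numpy as np

# Stand-in for cl.array.vec.float4: four packed float32 components (16 bytes).
float4 = np.dtype([('x', np.float32), ('y', np.float32),
                   ('z', np.float32), ('w', np.float32)])

matrix = np.zeros((1, 4), float4)
matrix[0, 0] = (1, 2, 4, 8)
matrix[0, 1] = (16, 32, 64, 128)
matrix[0, 2] = (3, 6, 9, 12)
matrix[0, 3] = (5, 10, 15, 25)

print(matrix.itemsize)       # 16 bytes per float4 element
print(matrix[0, 1]['y'])     # 32.0, the second component of the second row
```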

The code retrieves the first available platform, and then the first device for this platform. There is no code to check either the available extensions or the device type. However, I placed comments in the code as a reminder that these checks are necessary in a more complex host program.
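Such a check could follow the pattern below. The extension string here is a hypothetical example of what a device might report; in a real host program you would read `platform.extensions` or `device.extensions` instead, each of which is a space-separated list of extension names:

```python
# Hypothetical extensions string, as a device might report it;
# with PyOpenCL you would use device.extensions or platform.extensions.
device_extensions = ("cl_khr_global_int32_base_atomics "
                     "cl_khr_byte_addressable_store cl_khr_fp64")

def supports(extensions, required):
    """Return True if every required extension appears in the space-separated list."""
    available = set(extensions.split())
    return all(name in available for name in required)

print(supports(device_extensions, ["cl_khr_fp64"]))        # True
print(supports(device_extensions, ["cl_khr_gl_sharing"]))  # False
```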

Then, the code creates an OpenCL context for the selected device and calls `cl.Program` to create a program for the context, with the kernel source code as one of the arguments. The call to the `build()` method on the created `cl.Program` instance builds the kernel.

The code calls `cl.CommandQueue` with the context as an argument to create a command queue (`queue`) for the target device. Then, it allocates device memory and moves input data from the host to the device memory. The following lines use the most basic features provided by PyOpenCL to do this:

```
mem_flags = cl.mem_flags
matrix_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=matrix)
vector_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=vector)
matrix_dot_vector = numpy.zeros(4, numpy.float32)
destination_buf = cl.Buffer(context, mem_flags.WRITE_ONLY, matrix_dot_vector.nbytes)
```

The code defines the following buffers by calling `cl.Buffer`:

• `matrix_buf`: a read-only buffer that copies the data from the `matrix` variable. The kernel will read from this buffer in the global memory space.
• `vector_buf`: a read-only buffer that copies the data from the `vector` variable. The kernel will read from this buffer in the global memory space.
• `destination_buf`: a write-only buffer that will hold the result of the matrix-by-vector multiplication. The kernel will write to this buffer in the global memory space.
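Note that with `COPY_HOST_PTR`, the size of each input buffer is inferred from the host array, so it is worth confirming the arrays occupy the number of bytes you expect. A quick check, again using a plain structured dtype as my stand-in for `cl.array.vec.float4`:

```python
import numpy as np

# Stand-in for cl.array.vec.float4 (16 bytes per element).
float4 = np.dtype([('x', np.float32), ('y', np.float32),
                   ('z', np.float32), ('w', np.float32)])

matrix = np.zeros((1, 4), float4)   # 4 float4 rows  -> 64 bytes for matrix_buf
vector = np.zeros((1, 1), float4)   # 1 float4       -> 16 bytes for vector_buf
result = np.zeros(4, np.float32)    # 4 floats       -> 16 bytes for destination_buf

print(matrix.nbytes, vector.nbytes, result.nbytes)  # 64 16 16
```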

The following line associates the arguments with the kernel and deploys it for device execution by calling the method that PyOpenCL generates in `program` from the built kernel's name: `matrix_dot_vector`. The previously created `queue` is the first argument, the global work size (`matrix_dot_vector.shape`, that is, four work-items) is the second, and `None` as the third argument lets PyOpenCL choose the local work size:

`program.matrix_dot_vector(queue, matrix_dot_vector.shape, None, matrix_buf, vector_buf, destination_buf)`

The kernel receives the following arguments:

• `matrix_buf` for `matrix`
• `vector_buf` for `vector`
• `destination_buf` for `result`

When the kernel finishes, it is time to move the kernel's output data (result) stored in `destination_buf` to the host program memory. The following line calls `cl.enqueue_copy` to do this, and the result will be available in the `matrix_dot_vector` variable.

`cl.enqueue_copy(queue, matrix_dot_vector, destination_buf)`
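One way to sanity-check the copied-back result is to compute the same product on the host with NumPy and compare it against `matrix_dot_vector`. The reference computation below stands on its own and doesn't need PyOpenCL:

```python
import numpy as np

# The Figure 1 data, as ordinary float32 arrays.
m = np.array([[1, 2, 4, 8],
              [16, 32, 64, 128],
              [3, 6, 9, 12],
              [5, 10, 15, 25]], dtype=np.float32)
v = np.array([1, 2, 4, 8], dtype=np.float32)

expected = m @ v  # host-side reference for the kernel's output
print(expected)   # [  85. 1360.  147.  285.]
# After enqueue_copy, numpy.allclose(matrix_dot_vector, expected) should hold.
```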

In this example, the code doesn't take advantage of the events that are fired when the kernel finishes its execution. Also, because PyOpenCL performs all the necessary cleanup operations, you don't need to worry about reference counts or about releasing the underlying OpenCL structures and resources.

### Conclusion

This example shows basic features that PyOpenCL provides to Python developers who want to create OpenCL host applications. In the next article in this series, I'll dive deep into more advanced features that reduce the code required to build and deploy OpenCL kernels for many common parallel algorithms.

Gaston Hillar is a frequent contributor to Dr. Dobb's.

