The following lines create a `pyopencl.elementwise.ElementwiseKernel` instance named `calculate_polygon_vertices_x`. The first argument is a string formatted as a C argument list. However, you must be careful when adding arguments because the kernel builder adds some arguments for you. For example, the total number of elements in the `x_origin` array is automatically included as a `long n` argument. Thus, if you add an argument named `n`, the code will fail because there will be a duplicate argument. This information is not included in the PyOpenCL documentation, so you can spend unnecessary time trying to guess what's wrong with your code.

```python
from pyopencl.elementwise import ElementwiseKernel

calculate_polygon_vertices_x = ElementwiseKernel(
    context,
    "float r, float *x_origin, float *x",
    operation="x[i] = x_origin[i] + (r * cos(2 * M_PI * i / n))",
    name="calculate_polygon_vertices_x",
    preamble="#define M_PI 3.14159265358979323846")
```

The `operation` argument is a C snippet that performs the desired `map` operation. Notice that the current index is available in the `i` variable and the total number of elements in the `n` variable. The `preamble` argument includes a `#define` that the kernel builder will include outside of the function context in the kernel's source code. This way, I define the value for `M_PI` to avoid making the code compatible only with certain OpenCL extensions that define `M_PI` when enabled.

The following lines show the code that the element-wise kernel builder generates based on the arguments provided to the `pyopencl.elementwise.ElementwiseKernel` class. Notice that the generated kernel includes code to partition the loop based on the workgroups.

```c
#define M_PI 3.14159265358979323846
#define PYOPENCL_ELWISE_CONTINUE continue

__kernel void calculate_polygon_vertices_x(
    float r,
    __global float *x_origin__base, long x_origin__offset,
    __global float *x__base, long x__offset,
    long n)
{
  int lid = get_local_id(0);
  int gsize = get_global_size(0);
  int work_group_start = get_local_size(0)*get_group_id(0);
  long i;

  __global float *x_origin =
    (__global float *) ((__global char *) x_origin__base + x_origin__offset);
  __global float *x =
    (__global float *) ((__global char *) x__base + x__offset);

  //CL//
  for (i = work_group_start + lid; i < n; i += gsize)
  {
    x[i] = x_origin[i] + (r * cos(2 * M_PI * i / n));
  }
}
```

The `pyopencl.array.empty_like` method creates a new, uninitialized `Array` instance with the same properties as the `Array` instance received as a parameter. The code uses this method to create the `x_gpu` variable, then passes it as an argument when calling the `calculate_polygon_vertices_x` element-wise kernel.

```python
x_gpu = cl_array.empty_like(x_origin_gpu)
event = calculate_polygon_vertices_x(50.0, x_origin_gpu, x_gpu)
```
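For intuition, the same steps can be cross-checked on the host with NumPy: `numpy.empty_like` mirrors `pyopencl.array.empty_like`, and the kernel's per-index expression maps directly to array arithmetic. This is a sketch with illustrative values for `n` and `x_origin`, not the article's data:

```python
import numpy as np

n = 8                                    # illustrative vertex count
r = 50.0                                 # radius, matching the call above
x_origin = np.zeros(n, dtype=np.float32)

# Same shape and dtype as x_origin, contents uninitialized
x = np.empty_like(x_origin)

# Host-side version of the element-wise operation:
# x[i] = x_origin[i] + (r * cos(2 * M_PI * i / n))
i = np.arange(n)
x[:] = x_origin + r * np.cos(2 * np.pi * i / n)
```

For a regular octagon centered at the origin, `x[0]` comes out as `50.0` and `x[4]` as `-50.0`, up to float rounding.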

### Working with the Sum and Counts Builder

The `pyopencl.reduction` module enables you to build kernels that perform a map expression on each entry, then a reduce expression on the outcome of the map. The following lines show an example of using the `pyopencl.reduction.ReductionKernel` class to calculate the minimum of two random values in the map operation and then perform a sum in the reduction operation. The documentation doesn't mention this, but you need to install the Mako package before using the `ReductionKernel` class. Mako is a templating language for Python that the `ReductionKernel` class uses to build the kernel.

```python
import pyopencl as cl
import pyopencl.array as cl_array
import numpy

if __name__ == "__main__":
    platform = cl.get_platforms()[0]
    device = platform.get_devices()[0]
    context = cl.Context([device])
    queue = cl.CommandQueue(context)

    n = 5000
    x_gpu = cl_array.to_device(
        queue, numpy.random.randn(n).astype(numpy.float32))
    y_gpu = cl_array.to_device(
        queue, numpy.random.randn(n).astype(numpy.float32))

    from pyopencl.reduction import ReductionKernel
    reduction_kernel = ReductionKernel(
        context,
        dtype_out=numpy.float32,
        neutral="0",
        map_expr="min(x[i], y[i])",
        reduce_expr="a+b",
        arguments="__global float *x, __global float *y",
        name="reduction_kernel")

    result = reduction_kernel(x_gpu, y_gpu).get()
    print(result)
```
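As a host-side sanity check, the map-then-reduce above is equivalent to a single NumPy expression. This is a sketch with its own illustrative random data, so the printed value will differ from any given GPU run:

```python
import numpy as np

rng = np.random.default_rng(0)           # seeded for reproducibility
n = 5000
x = rng.standard_normal(n).astype(np.float32)
y = rng.standard_normal(n).astype(np.float32)

# map_expr: min(x[i], y[i]); reduce_expr: a+b starting from the neutral 0
result = np.minimum(x, y).sum(dtype=np.float32)
```

Because the minimum of two independent standard normals has a negative expected value, the result is strongly negative for `n = 5000`.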

Both `x_gpu` and `y_gpu` use the previously introduced `pyopencl.array.to_device` method to create two arrays with random values, ready to be used as arguments for the kernel execution. Then, the code creates a `pyopencl.reduction.ReductionKernel` instance named `reduction_kernel`. In this case, there is no consistency with the other kernel builder, and the `context` is required as an argument. The code specifies the following additional arguments:

- `dtype_out` indicates the `numpy.dtype` for the reduction result
- `neutral` specifies the initial value, which can be either a float or an integer but formatted as a string
- `map_expr` is a string with a C snippet that performs the desired map operation. In this case, the code just determines the minimum value between two values. You can call functions defined in the optional `preamble` argument
- `reduce_expr` is a string with a C snippet that performs the binary reduction operation with two values, `a` and `b`. In this case, the code just sums the two values. You can call functions defined in the optional `preamble` argument
- `arguments` is a string with a C argument list
- `name` is the kernel's name
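The `neutral` value must be the identity element of `reduce_expr`, because it seeds the accumulator in every work-item; for `a+b` that identity is `0`. A plain-Python sketch of the same idea, with illustrative values:

```python
from functools import reduce

values = [3.0, 1.5, 2.5]

# Seeding the fold with the identity of addition (0) leaves the
# result unchanged; a non-identity seed would bias every partial sum.
total = reduce(lambda a, b: a + b, values, 0.0)
```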

In this case, I haven't included a value for the optional `preamble` argument. The following lines show the code that the reduction kernel builder generates based on the arguments provided to the `pyopencl.reduction.ReductionKernel` class. Notice that the generated kernel includes code to partition the loop based on the workgroups and adds many arguments that weren't specified in the `arguments` string.

```c
#define GROUP_SIZE 32
#define READ_AND_MAP(i) (min(x[i], y[i]))
#define REDUCE(a, b) (a+b)

#include <pyopencl-complex.h>

typedef float out_type;

__kernel void reduction_kernel_stage1(
    __global out_type *out,
    __global float *x, __global float *y,
    unsigned int seq_count, unsigned int n)
{
  __local out_type ldata[GROUP_SIZE];

  unsigned int lid = get_local_id(0);
  unsigned int i = get_group_id(0)*GROUP_SIZE*seq_count + lid;

  out_type acc = 0;
  for (unsigned s = 0; s < seq_count; ++s)
  {
    if (i >= n)
      break;
    acc = REDUCE(acc, READ_AND_MAP(i));
    i += GROUP_SIZE;
  }

  ldata[lid] = acc;
  barrier(CLK_LOCAL_MEM_FENCE);

  if (lid < 32)
  {
    __local volatile out_type *lvdata = ldata;
    lvdata[lid] = REDUCE(lvdata[lid], lvdata[lid + 16]);
    lvdata[lid] = REDUCE(lvdata[lid], lvdata[lid + 8]);
    lvdata[lid] = REDUCE(lvdata[lid], lvdata[lid + 4]);
    lvdata[lid] = REDUCE(lvdata[lid], lvdata[lid + 2]);
    lvdata[lid] = REDUCE(lvdata[lid], lvdata[lid + 1]);
  }

  if (lid == 0)
    out[get_group_id(0)] = ldata[0];
}
```
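The stage-1 structure (one partial result per workgroup, combined later in a second stage) can be mimicked on the host. This NumPy sketch illustrates the idea with its own illustrative data; it is not the generated code. It pads with the neutral value `0` so the data splits into whole groups:

```python
import numpy as np

GROUP_SIZE = 32
rng = np.random.default_rng(1)
n = 5000
mapped = np.minimum(rng.standard_normal(n),
                    rng.standard_normal(n)).astype(np.float32)  # READ_AND_MAP

# Pad with the neutral value 0 so the array splits into whole workgroups
pad = (-n) % GROUP_SIZE
groups = np.pad(mapped, (0, pad)).reshape(-1, GROUP_SIZE)

partials = groups.sum(axis=1, dtype=np.float64)  # stage 1: per-group partials
total = partials.sum()                           # stage 2: reduce the partials
```

Padding with `0` is safe only because `0` is the neutral element of the `a+b` reduction; any other value would change the result.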

### Conclusion

The kernel builders really simplify the OpenCL code you need to write, allowing you to focus on providing a few arguments for common parallel algorithms. You won't face limitations using PyOpenCL and Python because PyOpenCL gives you access to the entire OpenCL 1.2 API. If you've worked with C or C++ OpenCL host programs, you might want to analyze the additional features that can simplify your code even further. You can combine PyOpenCL host programs with NumPy features and other Python packages, and you will definitely increase your productivity with OpenCL kernels.

*Gaston Hillar is a frequent contributor to Dr. Dobb's.*