Channels ▼
RSS

Tools

Build and Deploy OpenCL Kernels in Python


The following lines create a pyopencl.elementwise.ElementwiseKernel instance named calculate_polygon_vertices_x. The first argument is a string formatted as a C argument list. However, you must be careful with the addition of arguments because the kernel builder adds some arguments for you. For example, the total number of elements in the x_origin array is automatically included as a long n argument. Thus, if you add an argument named n, the code will fail because there will be a duplicate argument. That information is not included in the PyOpenCL documentation and you can spend unnecessary time trying to guess what's wrong with your code.

from pyopencl.elementwise import ElementwiseKernel
calculate_polygon_vertices_x = ElementwiseKernel(context,
        "float r, float *x_origin, float *x ",
        operation="x[i] = x_origin[i] + (r * cos(2 * M_PI * i / n))",
        name="calculate_polygon_vertices_x", preamble="#define M_PI 3.14159265358979323846")

The operation argument is a C snippet that performs the desired map operation. Notice that the current index is available in the i variable and the total number of elements in the n variable. The preamble argument includes a #define that the kernel builder will include outside of the function context in the kernel's source code. This way, I define the value for M_PI to avoid making the code compatible only with certain OpenCL extensions that define M_PI when enabled.

The following lines show the code that the element-wise kernel builder generates based on the arguments provided to the pyopencl.elementwise.ElementwiseKernel class. Notice that the generated kernel includes code to partition the loop based on the workgroups.

#define M_PI 3.14159265358979323846

#define PYOPENCL_ELWISE_CONTINUE continue

__kernel void calculate_polygon_vertices_x(float r, __global float *x_origin__base, long x_origin__offset, __global float *x__base, long x__offset, long n)
{
  int lid = get_local_id(0);
  int gsize = get_global_size(0);
  int work_group_start = get_local_size(0)*get_group_id(0);
  long i;

  __global float *x_origin = (__global float *) ((__global char *) x_origin__base + x_origin__offset);
__global float *x = (__global float *) ((__global char *) x__base + x__offset);;
  //CL//
  for (i = work_group_start + lid; i < n; i += gsize)
  {
    x[i] = x_origin[i] + (r * cos(2 * M_PI * i / n));
  }
  
  ;
}

The pyopencl.array.empty_like method creates a new and uninitialized Array instance that has the same properties that the Array instance received as a parameter. The code uses this method to create the x_gpu variable, then uses it as an argument to call the calculate_polygon_vertices_x element-wise kernel.

  x_gpu = cl_array.empty_like(x_origin_gpu)
  event = calculate_polygon_vertices_x(50.0, x_origin_gpu, x_gpu)

Working with the Sum and Counts Builder

The pyopencl.reduction module enables you to build kernels that perform a map expression on each entry, then a reduce expression on the outcome of the map. The following lines show an example of the usage of the pyopencl.reduction.ReductionKernel class to calculate the minimum between two random values in the map operation and then perform a sum in the reduction operation. The documentation doesn't mention this, but you need to install the Mako package before using the ReductionKernel class. Mako is a templating language for Python that the ReductionKernel class uses to build the kernel.

import pyopencl as cl
import pyopencl.array as cl_array
import numpy

if __name__ == "__main__":
    platform = cl.get_platforms()[0]
    
    device = platform.get_devices()[0]
    
    context = cl.Context([device])
    
    queue = cl.CommandQueue(context)
    
    n = 5000
    x_gpu = cl_array.to_device(queue, numpy.random.randn(n).astype(numpy.float32))
    y_gpu = cl_array.to_device(queue, numpy.random.randn(n).astype(numpy.float32))
    
    from pyopencl.reduction import ReductionKernel
    reduction_kernel = ReductionKernel(
            context,
            dtype_out=numpy.float32,
            neutral="0",
            map_expr="min(x[i], y[i])",
            reduce_expr="a+b",
            arguments="__global float *x, __global float *y",
            name="reduction_kernel")
    
    result = reduction_kernel(x_gpu, y_gpu).get()
    
    print(result)

Both x_gpu and y_gpu use the previously introduced pyopencl.array.to_device method to create two arrays with random values, ready to be used as arguments for the kernel execution. Then, the code creates a pyopencl.reduction.ReductionKernel instance named reduction_kernel. In this case, there is no consistency with the other kernel builder and the context is required as an argument. The code specifies the following additional arguments:

  • dtype_out indicates the numpy.dtype for the reduction result
  • neutral specifies the initial value that can be either a float or an integer but formatted as a string
  • map_expr is a string with a C snippet that performs the desired map operation. In this case, the code just determines the minimum value between two values. You can call functions defined in the optional preamble argument
  • reduce_expr is a string with a C snippet that performs the binary reduction operation with two values, a and b. In this case, the code just sums the two values You can call functions defined in the optional preamble argument
  • arguments is a string with a C argument list
  • name is the kernel's name

In this case, I haven't included a value for the optional preamble argument. The following lines show the code that the reduction kernel builder generates based on the arguments provided to the pyopencl.reduction.ReductionKernel class. Notice that the generated kernel includes code to partition the loop based on the workgroups and adds many arguments that weren't specified in the arguments string.

#define GROUP_SIZE 32
#define READ_AND_MAP(i) (min(x[i], y[i]))
#define REDUCE(a, b) (a+b)

#include <pyopencl-complex.h>

typedef float out_type;

__kernel void reduction_kernel_stage1(
  __global out_type *out, __global float *x, __global float *y,
  unsigned int seq_count, unsigned int n)
{
    __local out_type ldata[GROUP_SIZE];

    unsigned int lid = get_local_id(0);

    unsigned int i = get_group_id(0)*GROUP_SIZE*seq_count + lid;

    out_type acc = 0;
    for (unsigned s = 0; s < seq_count; ++s)
    {
      if (i >= n)
        break;
      acc = REDUCE(acc, READ_AND_MAP(i));

      i += GROUP_SIZE;
    }

    ldata[lid] = acc;

        barrier(CLK_LOCAL_MEM_FENCE);

        if (lid < 32)
        {
            __local volatile out_type *lvdata = ldata;

                lvdata[lid] = REDUCE(
                  lvdata[lid],
                  lvdata[lid + 16]);

                lvdata[lid] = REDUCE(
                  lvdata[lid],
                  lvdata[lid + 8]);

                lvdata[lid] = REDUCE(
                  lvdata[lid],
                  lvdata[lid + 4]);

                lvdata[lid] = REDUCE(
                  lvdata[lid],
                  lvdata[lid + 2]);

                lvdata[lid] = REDUCE(
                  lvdata[lid],
                  lvdata[lid + 1]);
        }

    if (lid == 0) out[get_group_id(0)] = ldata[0];
}

Conclusion

The kernel builders really simplify the OpenCL code you need to write and they allow you to focus on providing a few arguments for common parallel algorithms. You won't face limitations using PyOpenCL and Python because PyOpenCL allows you to access the entire OpenCL 1.2 API. If you've worked with C or C++ OpenCL host programs, you might want to analyze additional features that might simplify your code even further. You can combine PyOpenCL host programs with Numpy features and other Python packages, and you will definitely increase your productivity with OpenCL kernels.


Gaston Hillar is a frequent contributor to Dr. Dobb's.

Related Article

Easy OpenCL with Python


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video