One of OpenCL's great advantages is that kernels can execute on high-performance computing devices such as GPUs. To take advantage of these parallel-processing capabilities in code, an OpenCL developer needs to clearly understand two points:
- The OpenCL Execution Model: Kernels are executed by one or more work-items. Work-items are collected into work-groups and each work-group executes on a compute unit.
- The OpenCL Memory Model: Kernel data must be specifically placed in one of four address spaces: global memory, constant memory, local memory, or private memory. The location of the data determines how quickly it can be processed.
It takes time to understand these models, but the better you understand them, the faster your code is likely to run. To explain how the models interact, I've devised a second analogy: The execution of a kernel is like a day at school.
A Day at School
When I was in middle school, my teacher assigned 30 problems to her 30 students every day. But she didn't check all 900 answers. Instead, she assigned a different number to each of us, and we'd go to the front of the class and solve the problem with that number. We'd copy our work from our notebook to the blackboard, and if the teacher liked what she saw, we'd get a good grade. Clever, huh?
An OpenCL device is like a school composed of classrooms like mine. Each classroom contains students performing math problems. The students in a class share the same blackboard, but each student has a separate notebook. Students in the same class can work together at their blackboard, but students in different classes can't work together.
Here's where it gets tricky: None of these classrooms have a teacher. Also, every student in the school works on the same math problem, but with different values. For example, if the problem involves adding two numbers, one student might do 1+2, another might do 3+4, and another might do 5+6. When all the students in a classroom complete their calculations, they can leave. Then, the blackboard will be erased and a new class of students will come in and work on the same problem, but with different values.
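To make the "same problem, different values" idea concrete, here is a minimal sketch in plain Python. It models the analogy only and is not OpenCL code; the names add_problem and run_school are made up for illustration, and the loop stands in for students who would really be working at the same time.

```python
# Plain-Python model of the analogy; this is not OpenCL API code.

def add_problem(student_id, a, b, answers):
    """The one math problem every student solves, each with different values."""
    answers[student_id] = a[student_id] + b[student_id]


def run_school(a, b):
    """Hand the same problem to one student per pair of values."""
    answers = [None] * len(a)
    # A real device runs many students at once; a loop is enough to model the result.
    for student_id in range(len(a)):
        add_problem(student_id, a, b, answers)
    return answers


print(run_school([1, 3, 5], [2, 4, 6]))  # prints [3, 7, 11]
```

Each call to add_problem touches only its own slot in answers, which is why the students never need to coordinate with one another to get the right result.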
Each student entering a class automatically knows what problem they'll be solving, but they don't know what values they'll be working with. The blackboard in each classroom is initially blank, so students go to a central blackboard that contains values for the entire school. This central blackboard is much larger than the blackboards in the classrooms, but because of the long hallway, it takes a great deal of time for students to read its values. Figure 3 depicts the relationship between classes, classrooms, students, notebooks, and blackboards.
For most math problems, each student will go to the central blackboard only twice: once to read the values for their problem, and once to write down their final answer. Because the central blackboard is so far away, students do their actual solving using their notebooks and classroom blackboards. Once all of the final answers are on the central blackboard, the school day is over.
Students in different classes can't talk to one another, so the students in class 1 won't know when the students in class 2 have finished. The only way to be certain that a class has finished is to wait until the school day ends. It's important to make the distinction between a classroom and a class. A classroom is a physical area with a blackboard. A class is a group of students that occupy a classroom. As one class leaves a classroom, another can enter.
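One practical consequence is that the result shouldn't depend on which class happens to run first. The sketch below is a plain-Python model of that idea, not OpenCL code. The function name run_school_day, the "double each value" problem, and the seed parameter (used only to shuffle the order) are all assumptions made for illustration.

```python
import random


def run_school_day(classes, seed):
    """Run every class in a shuffled order; return the answers once all are done."""
    order = list(range(len(classes)))
    random.Random(seed).shuffle(order)  # the order of classes is not ours to choose
    answers = [None] * len(classes)
    for class_index in order:
        # Each class works only on its own values and writes only its own answers.
        answers[class_index] = [2 * value for value in classes[class_index]]
    return answers  # only at the end of the day is every class known to be finished


classes = [[1, 2], [3, 4], [5, 6]]
print(run_school_day(classes, seed=1) == run_school_day(classes, seed=2))  # prints True
```

Because no class reads another class's answers, any order produces the same final blackboard.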
To keep things organized, each class has an identifier that distinguishes it from every other class. Each student has two identifiers: one that is unique within the class, and one that is unique within the whole school. As an example, a student may have a class ID of 12 and a school ID of 638.
Kernel Execution on a Device
In my analogy, the school corresponds to an OpenCL device and the math problem represents the kernel. Each student corresponds to a work-item and each class corresponds to a work-group. A classroom corresponds to a compute unit (processing core). Just as each classroom can be occupied by a class, each compute unit can be occupied by a work-group. Figure 4 depicts this.
Identification numbers play a large role in OpenCL, and each work-item has two IDs: a global ID and a local ID. The global ID identifies the work-item among all other work-items executing the kernel. The local ID identifies the work-item among other work-items in the work-group. Work-items in different work-groups may have the same local ID, but they'll never have the same global ID. Returning to the school analogy, a work-item's local ID corresponds to a student's class ID and a work-item's global ID corresponds to the student's school ID.
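For a one-dimensional kernel launched with no global offset, the two IDs are tied together by the work-group's own ID and size: global ID = group ID x work-group size + local ID. Inside a real kernel these values come from the OpenCL C built-ins get_global_id(0), get_group_id(0), get_local_size(0), and get_local_id(0). The sketch below only models the arithmetic in plain Python; the function names and the sample numbers are illustrative assumptions.

```python
def global_id(group_id, local_size, local_id):
    """Global ID of a work-item in a 1-D range with zero global offset."""
    if not 0 <= local_id < local_size:
        raise ValueError("local_id must be in [0, local_size)")
    return group_id * local_size + local_id


def split_global_id(gid, local_size):
    """Inverse mapping: return (group_id, local_id) for a global ID."""
    return divmod(gid, local_size)


print(global_id(9, 64, 12))      # prints 588
print(split_global_id(588, 64))  # prints (9, 12)
```

Two work-items in different work-groups can both have local ID 12, but the group ID term guarantees that their global IDs differ.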
Now let's talk about memory. The OpenCL device model identifies four address spaces:
- Global memory: Stores data for the entire device.
- Constant memory: Similar to global memory, but kernels can only read from it.
- Local memory: Stores data for the work-items in a work-group.
- Private memory: Stores data for an individual work-item.
In my analogy, the central blackboard corresponds to global memory, which can be read from and written to by both the host and the device. When the host application transfers data to the device, the data is stored in global memory. Similarly, when the host reads data from a device, the data comes from the device's global memory. This memory is commonly the largest memory region on an OpenCL device, but it's also the slowest for work-items to access.
Local memory corresponds to the blackboard in each classroom. Work-items can access it much faster than they can access global/constant memory (on the order of 100x, though the exact ratio depends on the hardware). Local memory isn't nearly as large as global/constant memory, but because of the access speed, it's a good place for work-items to store their intermediate results. Just as students in the same class can work together at the classroom blackboard, work-items in the same work-group can access the same block of local memory.
The private memory in an OpenCL device corresponds to the notebook each student uses to solve a math problem. Each work-item has exclusive access to its private memory and it can access this memory faster than it can access local memory or global/constant memory. But this address space is much smaller than any other address space, so it's important not to use too much of it.
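The way the three tiers cooperate can be sketched with a plain-Python model in which lists and variables stand in for the address spaces. This is an illustration under stated assumptions, not OpenCL code: the names global_in, global_out, local_mem, and private_value are hypothetical, the "sum of squares per work-group" problem is made up, and the input length is assumed to be an exact multiple of the work-group size. In a real kernel the buffers would carry the __global and __local qualifiers, and the work-items would need a barrier(CLK_LOCAL_MEM_FENCE) before anyone reads the shared results.

```python
def run_work_group(group_id, local_size, global_in, global_out):
    """Model one work-group: read global once, work locally, write global once."""
    local_mem = [0] * local_size  # classroom blackboard, shared by the group
    for local_id in range(local_size):
        gid = group_id * local_size + local_id
        private_value = global_in[gid]  # notebook: one trip to the central blackboard
        local_mem[local_id] = private_value * private_value  # stage result locally
    # On a real device a work-group barrier would be required at this point.
    global_out[group_id] = sum(local_mem)  # one write back per work-group


def sum_of_squares_per_group(values, local_size):
    """Run every work-group in turn and collect one result per group."""
    num_groups = len(values) // local_size  # assumes an exact multiple
    global_out = [0] * num_groups
    for group_id in range(num_groups):
        run_work_group(group_id, local_size, values, global_out)
    return global_out


print(sum_of_squares_per_group(list(range(8)), 4))  # prints [14, 126]
```

Each value is fetched from the slow tier exactly once, and only one number per work-group travels back to it.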
When I started using OpenCL, I wondered how many work-items could be generated for a kernel. As I hope this analogy has made clear, you can generate as many work-items and work-groups as you like. However, if the device contains only M compute units and each work-group holds N work-items, at most M × N work-items will execute the kernel at any given time.
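A quick back-of-the-envelope calculation shows what this means for a given launch. The sketch below is plain Python and assumes the simplified model used in this article, with one work-group per compute unit at a time; the function name schedule and the sample numbers are illustrative, and real devices may schedule work differently.

```python
def schedule(total_work_items, work_group_size, compute_units):
    """Estimate concurrency under a one-work-group-per-compute-unit model."""
    num_groups = -(-total_work_items // work_group_size)  # ceiling division
    max_concurrent = compute_units * work_group_size      # the M * N bound
    waves = -(-num_groups // compute_units)               # rounds of classes needed
    return num_groups, max_concurrent, waves


print(schedule(1024, 64, 4))  # prints (16, 256, 4)
```

Here 1,024 work-items in groups of 64 form 16 work-groups. With 4 compute units, at most 256 work-items run at once, so the classrooms must be filled 4 times before the school day ends.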