Skip ahead if you only care about the code.
Definitions:
OpenCL exposes two APIs: the original C API (the cl* functions) and the C++ wrapper API (the cl:: classes from CL/opencl.hpp). The walkthrough below uses both.
In-Order vs Out-of-Order Execution
In-Order Execution: Instructions go through fetch-decode-execute in the order the compiler generates them, top to bottom; if one stalls, they all stall.
Out-of-Order Execution: Instructions are fetched in the compiler-generated order, but they may be executed in a different order: independent instructions can run in parallel, and if one stalls, the others can still proceed.
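This distinction carries straight into OpenCL: command queues are in-order by default, but you can request an out-of-order queue if the device supports it. A minimal sketch (context and device are created later in this post):
cl::CommandQueue inOrderQueue(context, device);
// Out-of-order: enqueued commands may complete in any order,
// so dependencies between them must be expressed with events
cl::CommandQueue outOfOrderQueue(context, device,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);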
There are three models we need to be aware of: the platform model, the execution model, and the memory model.
In the platform model, a host (usually the CPU) commands one or more Compute Devices. A Compute Device could be the GPU.
The clEnqueueNDRangeKernel command allows a single kernel program to be initiated to operate in parallel across an N-dimensional data structure.
An NDRange is an N-Dimensional (ND) structure that holds the work-items. For example, an image can be expressed as a 2D NDRange, and each pixel is a work-item that a copy of the kernel, running on a single processing element, will operate on. A work-group has a size representing a group of individual work-items within an NDRange.
Several memory types in OpenCL form a hierarchy: global memory (readable and writable by all work-items), constant memory (read-only global memory), local memory (shared by the work-items of one work-group), and private memory (owned by a single work-item).
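These address spaces appear directly as qualifiers in kernel code. A hypothetical sketch (the kernel and its arguments are illustrative, not part of the example built below):
kernel void memory_spaces(global float* data,   // global: visible to every work-item
                          constant float* lut,  // constant: read-only for all work-items
                          local float* scratch) // local: shared within one work-group
{
    int id = get_global_id(0);
    float x = data[id]; // x lives in private memory: one copy per work-item
    scratch[get_local_id(0)] = x * lut[0];
    barrier(CLK_LOCAL_MEM_FENCE); // sync the work-group before reading local memory
    data[id] = scratch[get_local_id(0)];
}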
First, we need to import the necessary libraries:
#include <CL/opencl.hpp>
#include <iostream>
#include <fstream>
#include <iterator> // for std::istreambuf_iterator
#include <string>
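Depending on your SDK, the C++ bindings may warn unless a target OpenCL version is pinned; if so, add a define like this before the opencl.hpp include (an assumption about your setup, adjust the version to match your driver):
#define CL_HPP_TARGET_OPENCL_VERSION 300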
Following the steps from above in our main function:
int main() {
// Store the Platform ID
cl_platform_id platform_id;
// Get the first Available Platform
clGetPlatformIDs(1, &platform_id, NULL);
// Optionally: get the name of the platform
char platformName[128];
clGetPlatformInfo(platform_id,
CL_PLATFORM_NAME,
sizeof(platformName),
platformName, NULL
);
// Store the Device ID
cl_device_id device_id;
// Get the first GPU available on a given platform
clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);
// Optionally: get the name of the Device
char deviceName[128];
clGetDeviceInfo(device_id,
CL_DEVICE_NAME,
sizeof(deviceName),
deviceName, NULL
);
std::cout << "Current Device is: " << deviceName << std::endl;
std::cout << "Current Platform is: " << platformName << std::endl;
...
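Every cl* call above returns a cl_int status code, which we have been ignoring for brevity. A minimal sketch of what checking the first call might look like:
cl_int err = clGetPlatformIDs(1, &platform_id, NULL);
if (err != CL_SUCCESS) {
    std::cerr << "clGetPlatformIDs failed with code " << err << std::endl;
    return 1;
}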
Alternatively, we can stay in the C++ API: fetch the platforms as a list, wrap the device ID from above in a cl::Device, then create the context and command queue:
cl::vector<cl::Platform> platformList;
cl::Platform::get(&platformList);
cl::Device device(device_id);
cl::Context context = cl::Context({ device });
cl::CommandQueue queue(context, device);
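The platform can also hand us its devices as a list, avoiding the C API entirely (a sketch assuming the first platform has at least one GPU):
cl::vector<cl::Device> deviceList;
platformList[0].getDevices(CL_DEVICE_TYPE_GPU, &deviceList);
cl::Device device = deviceList[0];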
The cl::Program::Sources data structure lets us store and manage kernel source code before executing it:
cl::Program::Sources sources;
// Host input arrays (_h is host); each must hold exactly SIZE elements
const int SIZE = 10;
int A_h[] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
int B_h[] = { 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 };
// We need to allocate a memory buffer on the device itself (_d is device)
// A buffer's access mode can be one of several flags, e.g.:
// - CL_MEM_READ_ONLY
// - CL_MEM_WRITE_ONLY
// - CL_MEM_READ_WRITE (used below)
cl::Buffer A_d(context, CL_MEM_READ_WRITE, sizeof(int) * SIZE);
cl::Buffer B_d(context, CL_MEM_READ_WRITE, sizeof(int) * SIZE);
cl::Buffer C_d(context, CL_MEM_READ_WRITE, sizeof(int) * SIZE);
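// Copy the host arrays into the device buffers
// CL_TRUE makes each write blocking: the call only returns
// once the host memory has been copied across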
queue.enqueueWriteBuffer(A_d, CL_TRUE, 0, sizeof(int) * SIZE, A_h);
queue.enqueueWriteBuffer(B_d, CL_TRUE, 0, sizeof(int) * SIZE, B_h);
kernels.cl
void kernel simple_add(global const int* A, global const int* B, global int* C)
{
    int id = get_global_id(0);
    C[id] = A[id] + B[id];
}
The global keyword means the pointer points to global memory. You may also see this written as:
void __kernel simple_add(global const int* A, global const int* B, global int* C)
{
    int id = get_global_id(0);
    C[id] = A[id] + B[id];
}
These are semantically identical: kernel and __kernel (like global and __global) are synonyms.
Back in the host code, we read the .cl file in as a string:
// Read the kernel file as a string
std::ifstream file("kernels.cl");
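// Hypothetical guard (not in the original): if the path is wrong,
// kernel_code would silently end up empty
if (!file.is_open()) {
    std::cerr << "Could not open kernels.cl" << std::endl;
    return 1;
}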
std::string kernel_code((std::istreambuf_iterator<char>(file)),
(std::istreambuf_iterator<char>()));
sources.push_back({ kernel_code.c_str(), kernel_code.length() });
// Next we create a program which links
// the OpenCL code to the context
cl::Program program(context, sources);
// Building the Program
if (program.build({ device }) != CL_SUCCESS)
{
std::cout << "Error Building: "
<< program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device)
<< std::endl;
exit(1);
}
// Create a Kernel from the program
cl::Kernel simple_add = cl::Kernel(program, "simple_add");
// Add all the data as arguments to the kernel
simple_add.setArg(0, A_d);
simple_add.setArg(1, B_d);
simple_add.setArg(2, C_d);
// - We want to run the simple_add kernel
// - Global Offset: start kernel from 0 (NullRange)
// - Global Work Size: How many work-items or threads we run
// - Local Work Size: work-group size -> OpenCL chooses the size (NullRange)
queue.enqueueNDRangeKernel(simple_add,
cl::NullRange,
cl::NDRange(SIZE),
cl::NullRange
);
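// Alternative (a sketch, not part of the original flow): the C++ bindings
// can set the arguments and enqueue in one step via cl::KernelFunctor,
// equivalent to the setArg/enqueueNDRangeKernel pair above.
// Left commented out so the kernel is not enqueued twice:
// cl::KernelFunctor<cl::Buffer, cl::Buffer, cl::Buffer>
//     simple_add_functor(program, "simple_add");
// simple_add_functor(cl::EnqueueArgs(queue, cl::NDRange(SIZE)),
//     A_d, B_d, C_d);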
int C_h[SIZE];
// Read the final buffer back to the host
// - Read from C_d
// - CL_TRUE: block until the read completes
// - Start reading at offset 0
// - Read sizeof(int) * SIZE bytes
// - Destination pointer (read into C_h)
queue.enqueueReadBuffer(C_d, CL_TRUE, 0, sizeof(int) * SIZE, C_h);
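// Optional sanity check (hypothetical, not in the original):
// each element should equal A_h[i] + B_h[i]
for (int i = 0; i < SIZE; i++) {
    if (C_h[i] != A_h[i] + B_h[i]) {
        std::cerr << "Mismatch at index " << i << std::endl;
    }
}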
// output the final array
std::cout << " result: \n";
for (int i = 0; i<10; i++) {
std::cout << C_h[i] << " ";
}
std::cout << std::endl;
return 0;
}