--- title: "Chapter 09: Generic OpenCL Kernel Runners (openclPort layer)" author: "Kjell Nygren" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Chapter 09: Generic OpenCL Kernel Runners (openclPort layer)} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") ``` ## The `openclPort` namespace The `openclPort` namespace (declared in `openclPort.h` and implemented in `opencl_kernel_runners.cpp`) provides the generic OpenCL infrastructure that underlies all of `nmathopencl`'s GPU execution. It is designed to be completely independent of any particular mathematical library: it knows only about OpenCL C types and standard C++ types, not about `nmath`, distributions, or R data structures. This namespace is currently shipped inside `nmathopencl` to make the package self-contained. It will migrate to a standalone `openclPort` CRAN package in a future release (see Chapter 00 for the release roadmap). The design described here will remain stable during that migration; downstream packages using these utilities via `#include "openclPort.h"` will require only a change in their `LinkingTo` field. ## `opencl_dbl_scalar_kernel_runner` This is the single generic runner that powers every kernel in `nmathopencl`. ### Declaration ```cpp // openclPort.h (within namespace openclPort, guarded by #ifdef USE_OPENCL) void opencl_dbl_scalar_kernel_runner( const std::string& kernel_source, const char* kernel_name, const std::vector& dargs, int n_out, std::vector& out_flat ); ``` ### Arguments | Parameter | Type | Description | |-----------|------|-------------| | `kernel_source` | `const std::string&` | Complete, concatenated OpenCL program source (shims + library + kernel). Must be syntactically valid OpenCL C. | | `kernel_name` | `const char*` | Name of the `__kernel void` function to execute, e.g., `"dnorm_kernel"`. | | `dargs` | `const std::vector&` | Scalar `double` arguments to pass to the kernel, in the order the kernel declares them. | | `n_out` | `int` | Number of `double` values to read back from the output buffer. Also becomes the global work size. | | `out_flat` | `std::vector&` | Output buffer; resized to `n_out` on entry and filled with the kernel's results. | ### What it does The runner performs the complete OpenCL lifecycle for a single kernel invocation: 1. **Platform and device selection** --- queries `clGetPlatformIDs` and `clGetDeviceIDs` to select the first available GPU device. 2. **Context and command queue** --- creates a `cl_context` and a `cl_command_queue` for the selected device. 3. **Program build** --- calls `clCreateProgramWithSource` followed by `clBuildProgram`. On failure, it retrieves the build log via `CL_PROGRAM_BUILD_LOG` and throws a `std::runtime_error` containing the compiler output. 4. **Kernel creation** --- calls `clCreateKernel` with the supplied `kernel_name`. 5. **Output buffer** --- allocates a `CL_MEM_WRITE_ONLY` device buffer of size `n_out * sizeof(double)`. 6. **Argument setting** --- sets each element of `dargs` as a scalar kernel argument (`clSetKernelArg`) in index order, then sets the output buffer as the next argument, and finally `n_out` as the last integer argument. 7. **Execution** --- enqueues the kernel with a 1-D global work size of `n_out` via `clEnqueueNDRangeKernel`. 8. **Result retrieval** --- blocks on `clFinish` then reads the output buffer back to host memory via `clEnqueueReadBuffer`. 9. **Resource cleanup** --- releases all OpenCL objects in reverse order of creation. Error checking is performed after every OpenCL API call. Any non-`CL_SUCCESS` status causes the runner to clean up all already-created resources and throw a `std::runtime_error` with a message that includes the status code, its symbolic name, and the platform/device information. ### Argument layout contract The runner assumes the kernel is declared with the following argument signature: ```c __kernel void my_kernel( double arg0, double arg1, ..., double argN, __global double* out, int n_out ) ``` The `dargs` vector maps element-by-element to `arg0` through `argN`. The runner always appends the output buffer and `n_out` as the last two arguments automatically. Kernels in `inst/cl/src/` that accept vector inputs (e.g., `dnorm_kernel`) do not fit this layout and use a separate, more specialized runner in `kernel_runners.cpp`; the double-scalar runner handles the scalar-parameter kernels that produce replicated output. ### Example: calling the runner from a kernel wrapper ```cpp // Inside a kernel_wrappers.cpp function (nmathopencl namespace): namespace nmathopencl { Rcpp::NumericVector dnorm_opencl_impl( int n, double mu, double sigma, bool give_log, bool verbose) { std::string src = openclPort::load_kernel_library("nmath") + "\n" + openclPort::load_kernel_source("src/dnorm_scalar_kernel.cl"); std::vector out(n); openclPort::opencl_dbl_scalar_kernel_runner( src, "dnorm_scalar_kernel", { mu, sigma, static_cast(give_log) }, // dargs n, // n_out out ); return Rcpp::NumericVector(out.begin(), out.end()); } } // namespace nmathopencl ``` ## Error-handling utilities Five `inline` functions in `openclPort.h` provide consistent error reporting across all runners. Because they are `inline`, they are available to any translation unit that includes `openclPort.h`; downstream packages that `LinkingTo: nmathopencl` get them for free. ### `opencl_status_name` ```cpp inline const char* opencl_status_name(cl_int status); ``` Returns the symbolic name of an OpenCL error code as a C string: ``` CL_SUCCESS, CL_DEVICE_NOT_FOUND, CL_BUILD_PROGRAM_FAILURE, ... ``` Returns `"UNKNOWN_OR_VENDOR_SPECIFIC"` for codes not in the table. Covers all error codes defined by the OpenCL 3.0 specification that are commonly encountered in practice. ### `opencl_status_hint` ```cpp inline const char* opencl_status_hint(cl_int status); ``` Returns a plain-English diagnostic hint for the most actionable error codes: | Status | Hint | |--------|------| | `CL_OUT_OF_RESOURCES` | Device/runtime resource limit exceeded (watchdog, register pressure, ...) | | `CL_OUT_OF_HOST_MEMORY` | Host memory allocation failed | | `CL_MEM_OBJECT_ALLOCATION_FAILURE` | Device memory allocation failed | | `CL_BUILD_PROGRAM_FAILURE` | Kernel compilation failed; inspect build log | | `CL_INVALID_CONTEXT` | OpenCL context is invalid | | `CL_DEVICE_NOT_AVAILABLE` | Device present but temporarily unavailable | Returns `"No additional hint available."` for all other codes. ### `opencl_read_platform_info_str` / `opencl_read_device_info_str` ```cpp inline std::string opencl_read_platform_info_str( cl_platform_id platform, cl_platform_info param); inline std::string opencl_read_device_info_str( cl_device_id device, cl_device_info param); ``` Query string-valued platform or device info (e.g., `CL_PLATFORM_NAME`, `CL_DEVICE_NAME`, `CL_DRIVER_VERSION`) safely: they handle null handles, query failures, and trailing null terminators, returning `"unknown"` on any error. Typical usage: ```cpp std::string name = openclPort::opencl_read_device_info_str( device, CL_DEVICE_NAME); std::string vendor = openclPort::opencl_read_platform_info_str( platform, CL_PLATFORM_VENDOR); ``` ### `opencl_make_context_error` ```cpp inline std::runtime_error opencl_make_context_error( cl_int status, cl_platform_id platform, cl_device_id device); ``` Constructs a `std::runtime_error` with a message of the form: ``` OpenCL error at clCreateContext (status=-6, name=CL_OUT_OF_HOST_MEMORY). platform_name=NVIDIA CUDA, platform_vendor=NVIDIA Corporation, device_name=NVIDIA GeForce RTX 4090, driver_version=560.94. This may indicate a transient driver/runtime context failure. ``` Used inside `opencl_dbl_scalar_kernel_runner` and `f2_f3_kernel_runner` immediately after a `clCreateContext` failure. ## Other utilities in `openclPort` ### Rcpp -> `std::vector` conversion ```cpp std::vector openclPort::flattenMatrix(const Rcpp::NumericMatrix& mat); std::vector openclPort::copyVector(const Rcpp::NumericVector& vec); ``` `flattenMatrix` converts a column-major R matrix to a row-major (or simply contiguous) `std::vector` suitable for passing as a device buffer. `copyVector` does the same for a numeric vector. Both are implemented in `OpenCL_helper.cpp`. ### Device probing (C++ / `openclPort`) At the R prompt, host GPU inventory uses **`opencltools::gpu_names()`** (see Chapters 00--01). Inside the package DLL, **`openclPort`** exposes related C++ helpers for kernel runners: ```cpp Rcpp::CharacterVector openclPort::gpu_names(); int openclPort::detect_num_gpus_internal(); bool openclPort::has_opencl(); int openclPort::get_opencl_core_count(); ``` - `openclPort::gpu_names()` --- OpenCL device names from the runtime linked into this package (distinct from `opencltools::gpu_names()` R helper). - `detect_num_gpus_internal()` --- GPU device count for envelope-style runners. - `nmathopencl_has_opencl()` --- compile-time `USE_OPENCL` flag in this DLL. - `openclPort::get_opencl_core_count()` --- internal C++ helper (delegates to opencltools); at the R prompt use **`opencltools::get_opencl_core_count()`**. ### OpenCL build configuration ```cpp struct openclPort::OpenCLConfig { bool have_expm1; bool have_log1p; std::string buildOptions; }; openclPort::OpenCLConfig openclPort::configureOpenCL( cl_context context, cl_device_id device); ``` `configureOpenCL` probes the device to determine which standard math functions are available as OpenCL built-ins (specifically `expm1` and `log1p`). It returns a set of `buildOptions` (e.g., `-DHAVE_EXPM1 -DHAVE_LOG1P`) that are passed to `clBuildProgram` so that the nmath `.cl` files can conditionally use the built-in or the ported version. ## Using `openclPort` from a downstream package A package that `LinkingTo: nmathopencl` can include `openclPort.h` directly: ```cpp // In a downstream package's source file: #include "openclPort.h" void my_kernel_runner(const std::string& program_source, int n) { std::vector out; openclPort::opencl_dbl_scalar_kernel_runner( program_source, "my_kernel", { 1.0, 2.5 }, // scalar args n, out ); } ``` No additional CMake or autoconf work is needed beyond the `LinkingTo` field, because `openclPort.h` is installed to `inst/include/` and the exported symbols are visible in the shared library.