--- title: "Chapter 03: Structure of nmath Kernel Programs" author: "Kjell Nygren" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Chapter 03: Structure of nmath Kernel Programs} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") ``` ## Why "programs" rather than compiled objects Normal C++ code in an R package is compiled ahead of time by the package build toolchain and linked into a DLL/`.so`. OpenCL device code works differently: the GPU source is compiled **at runtime**, on the device, by the GPU driver. This means there is no offline linker for device code --- all GPU code that will run together must be collected into a single source string before the driver sees it. In OpenCL terms, that string is a **program**. The C++ call sequence is: ```cpp // kernel_source is one big string containing all device code program = clCreateProgramWithSource(context, 1, &src_ptr, &src_len, &status); clBuildProgram(program, 0, nullptr, nullptr, nullptr, nullptr); kernel = clCreateKernel(program, "dnorm_kernel", &status); ``` `clBuildProgram` invokes the vendor GPU compiler on `kernel_source` and produces a device binary. A build failure here is equivalent to a C++ compilation error, except it happens at the moment of first use, not at `R CMD INSTALL` time. ## No `#include` on the device OpenCL C is a restricted dialect of C99. There is no file system visible to device code: `#include` is simply not available inside an OpenCL program string. Every declaration or definition that a kernel function needs must appear **earlier in the same string**. This creates the fundamental porting challenge for `nmathopencl`. R's `nmath` library is written as standard C99 and relies on a chain of `#include` directives: ```c #include "nmath.h" /* includes R_ext/Arith.h, R_ext/Error.h, ... */ #include "dpq.h" ``` Those headers pull in standard C types (`FILE*`, `size_t`, `int32_t`), R-specific runtime types (`SEXP`, `Rboolean`), and R callbacks (`error()`, `warning()`, `R_alloc()`). None of these concepts exist on a GPU. The solution is to assemble the program string in layers, prepending OpenCL-compatible replacements for everything the nmath sources expect to find via `#include`. ## The four-layer program structure A complete `nmathopencl` kernel program is a concatenation of four layers: ``` ┌─────────────────────────────────────────────────────────────┐ │ Layer 0 – OPENCL.cl │ │ Global configuration: fp64 pragma, HAVE_* macros, ML_NAN │ ├─────────────────────────────────────────────────────────────┤ │ Layer 1 – Upstream shims (inst/cl/R_ext_types/, R_shims/, │ │ R_ext_runtime/, System/, ...) │ │ OpenCL-C replacements for C headers the nmath sources need │ ├─────────────────────────────────────────────────────────────┤ │ Layer 2 – nmath library (inst/cl/nmath/) │ │ The ported Mathlib functions, exactly the subset required │ │ by this kernel (determined by @all_depends_nmath) │ ├─────────────────────────────────────────────────────────────┤ │ Layer 3 – Kernel function (inst/cl/src/) │ │ The __kernel entry point: GPU-parallel version of a CPU │ │ loop over the nmath function │ └─────────────────────────────────────────────────────────────┘ ``` Each layer is a plain text block that is physically concatenated (in order) to form the final string passed to `clCreateProgramWithSource`. ## Layer 0: `OPENCL.cl` --- the global configuration header `inst/cl/OPENCL.cl` is always placed at the very top. It sets the conditions under which the rest of the program will compile: ```c #pragma OPENCL EXTENSION cl_khr_fp64 : enable /* require 64-bit floats */ /* Enable expm1/log1p built-ins on OpenCL >= 1.2 */ #if defined(__OPENCL_C_VERSION__) && (__OPENCL_C_VERSION__ >= 120) #define HAVE_EXPM1 1 #define HAVE_LOG1P 1 #endif #undef HAVE_LONG_DOUBLE /* no long double on GPU */ #define ML_NAN (0.0/0.0) #define ML_POSINF INFINITY #define ML_NEGINF -INFINITY #define INLINE static inline ``` This one file handles the biggest portability concern: double-precision support (`cl_khr_fp64`) must be declared before any `double` variable appears in device code, and the `HAVE_*` macros tell the nmath sources which C99 math built-ins are available. ## Layer 1: The upstream shim layer R's nmath sources transitively include a long chain of headers. The shim layer provides OpenCL C replacements, grouped by what they replace: ### Type stubs | Files | What they replace | |-------|-------------------| | `System/stdint.cl` | ``: `int32_t`, `uint64_t`, ... using OpenCL built-in integer types | | `R_ext_types/Boolean.cl`, `Complex.cl`, `Constants.cl`, etc. | Type-only parts of ``, ``, etc. | | `R_shims/Rinternals.cl` | Skeletal `SEXP`, `SEXPREC`, `Rboolean` so transitively included headers compile | None of these types are *used* by the nmath math logic; the stubs exist so the preprocessor does not abort on an unrecognized identifier. ### Capability macros `R_shims/Rconfig.cl` reproduces what R's `configure` script writes into `config.h`: ```c #define HAVE_EXPM1 1 #define HAVE_LOG1P 1 #define IEEE_754 1 ``` Without this, conditional compilation branches inside nmath would take the wrong path (e.g., falling back to a Taylor series instead of calling the built-in `expm1`). ### Runtime no-ops R's `error()` and `warning()` are host-side callbacks that `longjmp` into the R runtime --- they cannot exist on the GPU. `R_ext_runtime/Error.cl` replaces them with silent macros: ```c #define error(...) /* no-op */ #define warning(...) /* no-op */ ``` Similarly, `R_ext_runtime/Memory.cl` stubs `R_alloc()` and friends (nmath's hot paths do not actually allocate, so these stubs satisfy the preprocessor without ever executing), and `R_ext_runtime/Print.cl` stubs `Rprintf`. ### Math declarations `R_ext_runtime/Arith.cl` is the most actively *used* shim. It maps R's numeric predicates to their OpenCL equivalents: ```c #define R_FINITE(x) isfinite(x) #define ISNAN(x) isnan(x) ``` `nmath/Rmath.cl` and `nmath/nmath.cl` close out the shim layer by providing the forward declarations, mathematical constants, and error-code macros that the nmath function files expect to find in `Rmath.h` / `nmath.h`. ### The design principle Every shim is **minimally intrusive**: it defines only what the nmath math logic actually needs, and stubs or silences everything else. The goal is that the nmath `.cl` files are as close to verbatim copies of the C sources as possible; the shims carry all the differences. ## Layer 2: The nmath library and its annotation scheme `inst/cl/nmath/` contains approximately 180 `.cl` files, one per source file in R's `src/nmath/`. They implement the full suite of Mathlib functions in OpenCL C. ### `@provides` and `@depends` annotations Because the program is assembled at runtime, the loader must know the correct concatenation order. Each nmath file declares what it provides and what it depends on: ```c // From inst/cl/nmath/dnorm.cl // @source_origin: dnorm.c // @depends: nmath, dpq // @provides: dnorm, dnorm4 // @all_depends: dpq, Rmath, nmath // @load_order: 31 ``` `opencltools::load_kernel_library()` reads these tags, builds a dependency graph, and performs a topological sort --- files with no unsatisfied dependencies are emitted first, then files whose dependencies have been emitted, and so on. The result is a load order in which every function is declared before it is called. ### Only the required subset is loaded A kernel that calls `dnorm` needs only 4 nmath files. A kernel that calls `dbinom` needs 18. Loading all ~180 files for every kernel would be wasteful and slow to compile. Instead, each kernel file carries a pre-computed `@all_depends_nmath` annotation listing the exact transitive closure: ```c // dnorm_kernel.cl -- needs 4 files // @all_depends_nmath: dpq, Rmath, nmath, dnorm // dbinom_kernel.cl -- needs 18 files // @all_depends_nmath: dpq, refactored, Rmath, nmath, stirlerr_cycle_free, // chebyshev, cospi, fmax2, gammalims, lgammacor, log1p, gamma, lgamma, // pgamma_utils, stirlerr_cycle_dependent, bd0, stirlerr, dbinom ``` `load_library_for_kernel()` reads `@all_depends_nmath`, looks up each stem in a pre-built index (`kernel_dependency_index.tsv`) to get the global load order, and concatenates exactly those files in the correct sequence. No topological sort is computed at install time --- the sort was done once when the index was built. ## Layer 3: The kernel function The `__kernel` function is the entry point that the GPU driver exposes. Conceptually it is a **GPU-parallel version of a CPU `for` loop**: ```c /* CPU equivalent: for (int i = 0; i < n; i++) out[i] = dnorm(x[i], mu[i], sigma[i], gl); */ __kernel void dnorm_kernel( __global const double* x, __global const double* mu, __global const double* sigma, __global const int* give_log, __global double* out, const int len ) { int i = get_global_id(0); /* this work-item handles element i */ if (i >= len) return; int gl = (give_log[i] != 0) ? 1 : 0; out[i] = dnorm(x[i], mu[i], sigma[i], gl); /* same nmath call */ } ``` `get_global_id(0)` returns the index of the current work-item in the first dimension of the work grid. The host launches exactly `n` work-items so each element of the input vector is handled by one work-item in parallel. The kernel calls `dnorm` --- the same function name as in the C source. All the layers below it (shims + nmath files) ensure that `dnorm` is defined and reachable in the same program string. ## Assembling the program in C++ `build_rmath_opencl_program()` in `src/kernel_loader.cpp` shows precisely how the four layers are concatenated: ```cpp std::string build_rmath_opencl_program( const std::string& kernel_relative_path, const std::string& package, const std::string& nmath_depends_annotation) { // Load exactly the nmath subset this kernel needs (via @all_depends_nmath) const std::string nmath_src = load_library_for_kernel( kernel_relative_path, "nmath", package, nmath_depends_annotation); return load_kernel_source("OPENCL.cl", package) + "\n" // Layer 0 + load_kernel_library("libR_shims", package, false) + "\n" // Layer 1 + load_kernel_library("R_ext_types", package, false) + "\n" + load_kernel_library("R_shims", package, false) + "\n" + load_kernel_library("R_ext_runtime", package, false) + "\n" + load_kernel_library("R_ext_internals", package, false) + "\n" + load_kernel_library("System", package, false) + "\n" + nmath_src + "\n" // Layer 2 + load_kernel_source(kernel_relative_path, package); // Layer 3 } ``` The returned string is passed directly to `clCreateProgramWithSource` and then compiled by the device driver. At that point the GPU can execute the kernel function. Note the dual guard: each loader function itself is compiled to a no-op when `USE_OPENCL` is absent (see Chapter 02), so `build_rmath_opencl_program` is always callable but returns an empty string on CPU-only builds. Callers check `nmathopencl_has_opencl()` before dispatching to the GPU path. ## What this means for downstream developers A downstream package that builds on `nmathopencl` does not need to replicate this assembly machinery. Kernel text loading uses **opencltools** (`load_kernel_source`, `load_kernel_library` with `package = "nmathopencl"`). `load_library_for_kernel` is re-exported from **nmathopencl** (and available via `LinkingTo: nmathopencl`) and handles all of layers 0, 1, and 2 automatically. The downstream developer's responsibilities are: 1. **Write the `__kernel` function** (Layer 3): call the nmath functions your computation needs. 2. **Annotate it** with `@depends_nmath` and `@all_depends_nmath` so the loader knows which nmath subset to pull in. 3. **Write the C++ runner** that calls `build_rmath_opencl_program()` (or the equivalent chain), dispatches work-items, and transfers results. Chapter 07 covers how to write and annotate kernel functions. Chapter 10 walks through all three steps end-to-end for the `ex_glmbayes` example.