--- title: "Chapter 10: Case Study --- Building Custom GLM Kernels (ex_glmbayes)" author: "Kjell Nygren" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Chapter 10: Case Study --- Building Custom GLM Kernels (ex_glmbayes)} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") ``` ## Overview Chapters 03-10 describe the infrastructure of `nmathopencl`: the shim layer, the `nmath` library, the kernel format, and the runner API. This chapter shows how to put all of it together by walking through a concrete downstream use-case: the `ex_glmbayes` example built into the package. `ex_glmbayes` implements GPU-accelerated evaluation of the log-posterior and gradient (`f2`/`f3`) for Bayesian GLMs (Gaussian, Gamma, binomial with logit/probit/cloglog links, and Poisson). It is the type of component a package author would build on top of `nmathopencl` to accelerate their own likelihood computations. The name `ex_glmbayes` signals that this is an example, not a core part of the library; it lives entirely in `ex_glmbayes_*`-prefixed files and within the `ex_glmbayes` C++ namespace. ## Step 1 --- Identify the nmath functions your kernel needs For each GLM family, determine which nmath functions the GPU kernel must call: | Kernel | nmath function(s) called | Link function | |--------|--------------------------|---------------| | `f2_f3_binomial_logit.cl` | `dbinom_raw` | logit | | `f2_f3_binomial_probit.cl` | `dbinom_raw`, `pnorm5`, `dnorm4` | probit | | `f2_f3_binomial_cloglog.cl` | `dbinom_raw` | cloglog | | `f2_f3_gamma.cl` | `dgamma` | log | | `f2_f3_gaussian.cl` | `dnorm4` | identity/log | | `f2_f3_poisson.cl` | `lgamma` (OpenCL built-in) | log | For `f2_f3_poisson`, `lgamma` is an OpenCL built-in (part of the OpenCL 1.1 math specification) so no nmath files are needed at all. For the others, consult the `@all_depends_nmath` tag of each kernel file to get the full transitive dependency list. ## Step 2 --- Extract the minimal nmath subset The `@all_depends_nmath` tag in each kernel file gives the complete transitive closure. Taking the union across the five non-Poisson kernels yields 20 files (17 nmath functions + `Rmath`, `nmath`, `refactored` as infrastructure): ``` Rmath, nmath, refactored, chebyshev, cospi, fmax2, gammalims, lgammacor, gamma, lgamma, stirlerr_cycle_free, pgamma_utils, stirlerr_cycle_dependent, stirlerr, bd0, dpois, dgamma, dnorm, pnorm, dbinom ``` These files are stored in `inst/cl/ex_glmbayes_nmath/`. Rather than copying them manually, use `extract_library_subset()` to derive this directory automatically from the kernel annotations: ```{r, eval = FALSE} nmath_dir <- system.file("cl", "nmath", package = "nmathopencl") src_dir <- system.file("cl", "ex_glmbayes_src", package = "nmathopencl") # Load the pre-built dependency index (read once, reuse across calls) idx <- readRDS(file.path(nmath_dir, "kernel_dependency_index.rds")) result <- extract_library_subset( kernel_paths = list.files(src_dir, pattern = "\\.cl$", full.names = TRUE), library_dir = nmath_dir, dest_dir = "inst/cl/ex_glmbayes_nmath", depends_tag = "depends_nmath", index = idx ) ``` `extract_library_subset()` copies the 20 `.cl` files **plus both index files** (`kernel_dependency_index.rds` and `kernel_dependency_index.tsv`) into the destination directory. The index files are required for the indexed loading described in Step 5. A package developer would create an analogous subdirectory for their own kernel set using the same function. ## Step 3 --- Write the kernel files Each kernel file in `inst/cl/ex_glmbayes_src/` starts with the dependency metadata block followed by the kernel body: ```c // @library_deps: nmath // @calls_nmath: dbinom_raw // @depends_nmath: dbinom // @all_depends_nmath_count: 17 // @all_depends_nmath: Rmath, nmath, refactored, chebyshev, cospi, fmax2, // gammalims, lgammacor, gamma, lgamma, stirlerr_cycle_free, pgamma_utils, // stirlerr_cycle_dependent, stirlerr, bd0, dbinom // @calls_opencl_builtin: (none) #pragma OPENCL EXTENSION cl_khr_fp64 : enable #pragma OPENCL EXTENSION cl_khr_printf : enable #define MAX_L2 64 // upper bound on l2 __kernel void f2_f3_binomial_logit( __global const double* X, // design matrix, size = l1 x l2, col-major __global const double* B, // grid, size = m1 x l2 __global const double* mu, // prior mean, length = l2 __global const double* P, // prior precision, size = l2 x l2 __global const double* alpha, // offset, length = l1 __global const double* y, // response, length = l1 __global const double* wt, // weights, length = l1 __global double* qf, // out: neg log-posterior, length = m1 __global double* grad, // out: gradient, size = m1 x l2 const int l1, const int l2, const int m1 ) { int j = get_global_id(0); if (j >= m1) return; // prior quadratic form: 0.5 * (B_j - mu)' P (B_j - mu) double tmp[MAX_L2]; for (int k = 0; k < l2; ++k) { double acc = 0.0; for (int l = 0; l < l2; ++l) acc += P[k*l2 + l] * (B[j*l2 + l] - mu[l]); tmp[k] = acc; } double res_acc = 0.0; for (int k = 0; k < l2; ++k) res_acc += 0.5 * (B[j*l2 + k] - mu[k]) * tmp[k]; // gradient: prior part double g_loc[MAX_L2]; for (int k = 0; k < l2; ++k) g_loc[k] = tmp[k]; // data likelihood: logistic link + dbinom_raw for (int i = 0; i < l1; ++i) { double dot = -alpha[i]; for (int k = 0; k < l2; ++k) dot -= X[k*l1 + i] * B[j*l2 + k]; double e = exp(dot <= 0 ? dot : -dot); double p = dot <= 0 ? 1.0/(1.0+e) : e/(1.0+e); double q = 1.0 - p; res_acc += -dbinom_raw(y[i], wt[i], p, q, 1); double resid = p < 0.5 ? p*wt[i] - y[i]*wt[i] : (1-y[i])*wt[i] - q*wt[i]; for (int k = 0; k < l2; ++k) g_loc[k] += X[k*l1 + i] * resid; } qf[j] = res_acc; for (int k = 0; k < l2; ++k) grad[k * m1 + j] = g_loc[k]; } ``` Key design choices: - `MAX_L2 64` --- a compile-time upper bound on the number of predictors, used to size local arrays (OpenCL C does not support variable-length arrays in all implementations). - Column-major layout for `X` (matches R's memory layout for matrices passed from R to C without transposition). - A single work-item per grid point `j`; all `l1` observations are processed sequentially within that work-item. This is appropriate when `m1` (number of grid points) is much larger than `l1` (number of observations), as is typical for envelope sampling. ## Step 4 --- Write the kernel runner (C++) The kernel runner is the low-level C++ function that manages the OpenCL lifecycle for this specific kernel. It lives in `ex_glmbayes_kernel_runners.cpp` inside the `ex_glmbayes::opencl` namespace: ```cpp // ex_glmbayes_kernel_runners.cpp (abbreviated) namespace ex_glmbayes { namespace opencl { void f2_f3_kernel_runner( const std::string& kernel_source, const char* kernel_name, int l1, int l2, int m1, const std::vector& X_flat, const std::vector& B_flat, const std::vector& mu_flat, const std::vector& P_flat, const std::vector& alpha_flat, const std::vector& y_flat, const std::vector& wt_flat, std::vector& qf_flat, std::vector& grad_flat ) { // Select platform and device // Create context, command queue, program, kernel // Create input buffers (CL_MEM_READ_ONLY) and output buffers (CL_MEM_WRITE_ONLY) // Set kernel arguments // Enqueue kernel with global_size = m1 // Read back qf_flat and grad_flat // Release all resources } }} // namespace ex_glmbayes::opencl ``` The runner uses the error-handling utilities from `openclPort.h` (`opencl_make_context_error`, `opencl_status_name`) to produce informative exceptions on failure. ## Step 5 --- Write the kernel wrapper (C++) The kernel wrapper is the R-facing `[[Rcpp::export]]` function. It validates and converts R inputs, selects the kernel name based on family/link, assembles the program string using the dependency index, calls the runner, and returns results as R objects. It lives in `ex_glmbayes_kernel_wrappers.cpp`: ```cpp // ex_glmbayes_kernel_wrappers.cpp (abbreviated) namespace ex_glmbayes { namespace opencl { // [[Rcpp::export]] Rcpp::List f2_f3_opencl( std::string family, std::string link, Rcpp::NumericMatrix b, Rcpp::NumericVector y, Rcpp::NumericMatrix x, Rcpp::NumericMatrix mu, Rcpp::NumericMatrix P, Rcpp::NumericVector alpha, Rcpp::NumericVector wt, int progbar) { int l1 = x.nrow(), l2 = x.ncol(), m1 = b.ncol(); // 1. Flatten R matrices into contiguous vectors auto X_flat = openclPort::flattenMatrix(x); auto B_flat = openclPort::flattenMatrix(b); // ... (all inputs) // 2. Dispatch: set kernel_name and kernel_file based on family/link std::string kernel_name, kernel_file; if (family == "binomial" && link == "logit") { kernel_name = "f2_f3_binomial_logit"; kernel_file = "ex_glmbayes_src/f2_f3_binomial_logit.cl"; } // ... (all families) // 3. Assemble program source using the pre-built dependency index. // load_library_for_kernel() reads @depends_nmath from the kernel file, // looks up the transitive closure in kernel_dependency_index.tsv, and // loads only the required subset in the correct order --- no sort at runtime. std::string nmath_src = openclPort::load_library_for_kernel( kernel_file, // kernel .cl path (relative to inst/cl/) "ex_glmbayes_nmath", // library subdir containing index + .cl files "nmathopencl", // package "depends_nmath" // annotation tag listing direct nmath entry points ); std::string kernel_src = openclPort::load_kernel_source( kernel_file, "nmathopencl"); std::string program_source = nmath_src + "\n" + kernel_src; // 4. Allocate output buffers std::vector qf_flat(m1); std::vector grad_flat(static_cast(m1) * l2); // 5. Run the kernel f2_f3_kernel_runner(program_source, kernel_name.c_str(), l1, l2, m1, X_flat, B_flat, mu_flat, P_flat, alpha_flat, y_flat, wt_flat, qf_flat, grad_flat); // 6. Wrap outputs as R objects and return Rcpp::NumericVector qf(qf_flat.begin(), qf_flat.end()); // ... (reshape grad_flat into a matrix) return Rcpp::List::create(Rcpp::Named("qf") = qf, ...); } }} // namespace ex_glmbayes::opencl ``` ### Why `load_library_for_kernel()` rather than `load_kernel_library()`? `load_kernel_library()` sorts and concatenates **every** file in the subdirectory on every call. For a 137-file library like `nmath` this requires reading all files sequentially. `load_library_for_kernel()` reads the pre-built `kernel_dependency_index.tsv` once, then reads only the small subset of files that the specific kernel actually needs --- 3 files for `f2_f3_gaussian`, 17 for `f2_f3_binomial_logit`, and so on. The savings are significant for large libraries and become important when the kernel wrapper is called repeatedly. The old full-library approach is still available for cases where the exact dependency set is unknown or all files are genuinely needed: ```cpp // Old approach --- loads all files in the subdirectory (kept for reference) // std::string nmath_src = openclPort::load_kernel_library( // "ex_glmbayes_nmath", "nmathopencl", false); ``` ## Step 6 --- Write the Rcpp export wrappers (C++ and R) The `[[Rcpp::export]]` annotation on `f2_f3_opencl` causes `Rcpp::compileAttributes()` to generate: - A `.Call` entry in `RcppExports.cpp` - An R wrapper in `RcppExports.R` The R-side convenience function in `ex_glmbayes.R` adds a `nmathopencl_has_opencl()` guard and CPU fallback: ```{r, eval = FALSE} f2_f3_opencl_R <- function(family, link, b, y, x, mu, P, alpha, wt, use_opencl = TRUE) { if (use_opencl && nmathopencl_has_opencl()) { f2_f3_opencl(family, link, b, y, x, mu, P, alpha, wt, progbar = 0L) } else { f2_f3_non_opencl(family, link, b, y, x, mu, P, alpha, wt) } } ``` ## Step 7 --- Implement the CPU fallback The CPU fallback uses the same mathematical logic as the kernel but executes in C++ using standard R math calls. The relevant code lives in: - `ex_glmbayes_famfuncs_binomial.cpp` --- binomial family functions - `ex_glmbayes_famfuncs_poisson.cpp` --- Poisson family - `ex_glmbayes_famfuncs_Gamma.cpp` --- Gamma family - `ex_glmbayes_famfuncs_gaussian.cpp` --- Gaussian family - `ex_glmbayes_EnvelopeEval.cpp` --- orchestrates the CPU path - `ex_glmbayes_EnvelopeSize.cpp` --- memory allocation for the envelope These files are grouped under the `ex_glmbayes::fam` and `ex_glmbayes::env` sub-namespaces. They use R's standard `` functions (the same mathematical functions as the GPU kernels, but via the C runtime rather than OpenCL). ## File inventory The complete set of files contributing to the `ex_glmbayes` example: ### OpenCL source (`inst/cl/`) | Directory | Files | |-----------|-------| | `ex_glmbayes_nmath/` | 20 files: the minimal nmath subset | | `ex_glmbayes_src/` | 6 files: one `f2_f3_*.cl` kernel per GLM family | | `ex_glmbayes_draft_src/` | Draft versions (work in progress) | ### C++ source (`src/`) | File | Namespace | Role | |------|-----------|------| | `ex_glmbayes_kernel_runners.cpp` | `ex_glmbayes::opencl` | Low-level OpenCL runner for the `f2_f3` kernels | | `ex_glmbayes_kernel_wrappers.cpp` | `ex_glmbayes::opencl` | R-facing wrapper: input validation, program assembly, runner dispatch | | `ex_glmbayes_export_wrappers.cpp` | (Rcpp export) | `[[Rcpp::export]]` entry points for `EnvelopeSize`, `EnvelopeEval`, `glmb_Standardize_Model` | | `ex_glmbayes_famfuncs_binomial.cpp` | `ex_glmbayes::fam` | CPU binomial likelihood | | `ex_glmbayes_famfuncs_poisson.cpp` | `ex_glmbayes::fam` | CPU Poisson likelihood | | `ex_glmbayes_famfuncs_Gamma.cpp` | `ex_glmbayes::fam` | CPU Gamma likelihood | | `ex_glmbayes_famfuncs_gaussian.cpp` | `ex_glmbayes::fam` | CPU Gaussian likelihood | | `ex_glmbayes_EnvelopeEval.cpp` | `ex_glmbayes::env` | CPU envelope evaluation (f2/f3 orchestrator) | | `ex_glmbayes_EnvelopeSize.cpp` | `ex_glmbayes::env` | Envelope memory sizing | | `ex_glmbayes_glmb_Standardize_Model.cpp` | `ex_glmbayes::sim` | Design matrix standardization | ### C++ headers (`src/`) | File | Namespace declared | |------|--------------------| | `ex_glmbayes_opencl.h` | `ex_glmbayes::opencl` | | `ex_glmbayes_famfuncs.h` | `ex_glmbayes::fam` | | `ex_glmbayes_Envelopefuncs.h` | `ex_glmbayes::env` | | `ex_glmbayes_simfuncs.h` | `ex_glmbayes::sim` | ### R source (`R/`) | File | Role | |------|------| | `ex_glmbayes.R` | R-facing API: `f2_f3_opencl_R`, `f2_f3_non_opencl` | | `ex_glmbayes_rcpp_wrappers.R` | Thin Rcpp wrappers for the three exported C++ export functions | ## Adapting this pattern for a new package To build a similar component in your own package: 1. **Identify your nmath dependencies** --- write the `@depends_nmath` annotation in each of your kernel files (listing the direct nmath entry points your kernel calls). Then call `extract_library_subset()` to copy the exact transitive closure of files, plus the companion index files, into your package's `inst/cl//`: ```{r, eval = FALSE} nmath_dir <- system.file("cl", "nmath", package = "nmathopencl") idx <- readRDS(file.path(nmath_dir, "kernel_dependency_index.rds")) extract_library_subset( kernel_paths = list.files("inst/cl/mypkg_src", "\\.cl$", full.names = TRUE), library_dir = nmath_dir, dest_dir = "inst/cl/mypkg_nmath", # must exist depends_tag = "depends_nmath", index = idx ) ``` 2. **Write your kernel(s)** --- add `@library_deps`, `@calls_nmath`, and `@all_depends_nmath` tags at the top of each `.cl` file. Store them in `inst/cl//`. 3. **Write a kernel runner** --- use `openclPort::opencl_dbl_scalar_kernel_runner` directly if your kernel fits the double-scalar layout, or write a custom runner following the same OpenCL lifecycle pattern as `f2_f3_kernel_runner`. Use `openclPort::opencl_make_context_error` and `opencl_status_name` for error handling. 4. **Write a kernel wrapper** --- flatten your R inputs with `openclPort::flattenMatrix` / `openclPort::copyVector`, assemble the program string with `openclPort::load_library_for_kernel()` + `openclPort::load_kernel_source()`, call the runner, and return an R-friendly result. The `@depends_nmath` annotation in your kernel file drives the subset selection automatically. 5. **Write a CPU fallback** --- implement the same computation using standard R math (``) or base-R functions. Guard the GPU path with `nmathopencl_has_opencl()`. 6. **Add `LinkingTo: nmathopencl`** to your `DESCRIPTION` file so that `openclPort.h` is found at compile time. 7. **Add configure scripts for CRAN safety** --- this is the step most commonly overlooked. A package that references `-lOpenCL` or `CL/cl.h` in a static `src/Makevars` will **fail to compile on CRAN's build machines** (which have no GPU SDK), and no binary will be produced. The solution is a pair of configure scripts (`configure` for Linux/macOS, `configure.win` for Windows) that **generate `src/Makevars` dynamically at install time**: if the SDK is found they write `-DUSE_OPENCL` and the link flags; if not they write a CPU-only Makevars. The package always compiles cleanly. Copy the generic templates into your package root with: ```{r, eval = FALSE} use_opencl_configure() ``` This writes `configure` and `configure.win` to the current directory, sets `configure` as executable (on Linux/macOS), and prints a checklist. The templates honor `OPENCL_HOME` / `OPENCL_SDK` environment variables so developers with non-standard SDK locations can point to them without modifying system paths. The three-entity relationship that the scripts establish: ``` configure / configure.win -> detects CL/cl.h + libOpenCL at install time -> writes -DUSE_OPENCL into Makevars (or omits it for CPU-only) #ifdef USE_OPENCL (in your C++ source) -> guards all GPU code; compiles cleanly either way nmathopencl_has_opencl() (in your R code) -> calls a compiled-in bool that mirrors the compile-time decision -> TRUE only if -DUSE_OPENCL was set when the package was built ``` On Linux the `configure` script also runs a **runtime probe** (`clGetPlatformIDs`) before setting `USE_OPENCL`. This extra check catches the common case where the ICD loader (`libOpenCL.so`) is installed but no vendor platform has been registered in `/etc/OpenCL/vendors/`. `configure.win` (Windows) detects the header only; the ICD loader (`OpenCL.dll`) is installed with the GPU driver and does not require a separate probe. Add the generated files to `.gitignore` --- they are build artifacts, not source files: ``` src/Makevars src/Makevars.win ``` For more detail, including the migration plan for when these templates move to `opencltools`, see `system.file("configure-templates", "README.md", package = "nmathopencl")`. The `ex_glmbayes` files are the authoritative reference implementation for steps 1--6; step 7 applies universally to any downstream package that exposes GPU acceleration.