---
title: "Chapter 05: Kernels, Kernel Runners, and Kernel Wrappers"
author: "Kjell Nygren"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Chapter 05: Kernels, Kernel Runners, and Kernel Wrappers}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction

When an R package uses `nmathopencl` to accelerate a computation on the GPU,
the work is divided among three distinct kinds of source file.  This chapter
introduces them in the order that makes the design easiest to understand:
starting with the part that looks most like ordinary C++ code and moving
toward the part that is furthest from it.

| Role | File type | Compiled by | Conditional guard | Visible to R |
|---|---|---|---|---|
| Kernel wrapper | `*.cpp` | Host C++ compiler, at install time | Internal `#ifdef USE_OPENCL` block | Yes (directly or via a calling function) |
| Kernel runner | `*.cpp` | Host C++ compiler, at install time | Entire body inside `#ifdef USE_OPENCL` | No |
| Kernel | `*.cl` | GPU driver, *at runtime* | None needed | No |

Three design goals shape this structure:

1. **CRAN safety.**  CRAN's build servers have no OpenCL SDK.  Any package
   submitted to CRAN must compile cleanly without it.  Wrapping runner bodies
   entirely in `#ifdef USE_OPENCL` means those functions simply do not exist
   in a CPU-only build — there are no unresolved OpenCL symbols for the
   linker to complain about.

2. **Clean type boundaries.**  R and OpenCL use entirely different type
   systems and memory models.  Explicit conversion at each boundary — from
   C++ input types (including Rcpp wrapper types) to plain C++ types the
   runner accepts, to OpenCL device buffers, and back — makes those
   differences explicit and testable, rather than hiding them behind casts.

3. **Separation of concerns.**  The wrapper owns the R interface and type
   conversion.  The runner owns the OpenCL bookkeeping.  The kernel owns the
   mathematical computation.  None of the three needs to know the internal
   details of the other two.

### How this chapter is organized

Each of the three sections below follows the same pattern:

1. A **general description** of what that role does and what varies between
   implementations.
2. A concrete **example** drawn from the `ex_glmbayes` component of
   `nmathopencl`, which evaluates the GLM log-posterior and its gradient over
   a grid of candidate coefficient vectors.  This is labeled clearly as an
   example throughout.
3. Commentary on **what you would change** when writing your own version.

`ex_glmbayes` is used as the example rather than a simpler function because
it involves inputs of several different shapes, a two-dimensional parallel
structure, and two different output types.  That richness makes the
type-conversion and parallelism stories concrete.  A simpler computation would
follow the same pattern with fewer inputs and a simpler output.

---

## The computation in the example: GLM log-posterior over a grid

A Bayesian GLM sampler such as `glmbayes` needs to evaluate, at each MCMC
step, a function `f2_f3(B)` for a grid of candidate coefficient vectors
`B[1], …, B[m1]`.  For each grid point `j`, `f2_f3` returns:

- `qf[j]` — a scalar: the negative log-posterior (quadratic prior term plus
  the binomial negative log-likelihood summed over all `l1` observations).
- `grad[, j]` — a vector of length `l2`: the gradient of `qf[j]` with respect
  to the coefficients.

The grid points are completely independent of each other, which makes the
problem ideal for the GPU.  Three dimension integers — `l1` (observations),
`l2` (coefficients), `m1` (grid points) — appear in all three layers and must
be consistent throughout.

---

## Serial vs. parallel execution: the fundamental concept

Before examining the three layers in detail it is worth being precise about
what "parallel" means here, because it is the reason the whole architecture
exists.

### The serial (CPU) mental model

On a CPU, evaluating `f2_f3` over `m1` grid points looks like an ordinary
loop:

```cpp
// CPU serial execution — one grid point at a time
for (int j = 0; j < m1; ++j) {
    qf[j]      = compute_neg_log_posterior(B[j], X, y, wt, mu, P, alpha);
    grad[:, j] = compute_gradient(B[j], X, y, wt, mu, P, alpha);
}
```

The iterations execute one after the other.  Each call to
`compute_neg_log_posterior` must finish before the next begins.  With `m1 =
500` and `l1 = 300`, that is 150,000 observation-level computations done
sequentially.

### The parallel (GPU) mental model

OpenCL replaces the loop with a single API call that *launches all `m1`
iterations simultaneously*:

```cpp
// GPU parallel execution — all m1 grid points launched at once
size_t global = (size_t) m1;
clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr,
                       0, nullptr, nullptr);
```

This one call enqueues `m1` independent *work-items*.  Each work-item is an
independent execution of the `__kernel` function, distinguished only by its
unique index:

```c
int j = get_global_id(0);   // 0, 1, 2, ..., m1-1 — one per work-item
```

The GPU driver maps these work-items onto the GPU's physical compute units.
A modern GPU has dozens to hundreds of compute units, each capable of running
multiple work-items simultaneously.  With `m1 = 500`, several hundred grid
points can genuinely execute at the same time, not just interleaved.

The work-items are organized internally into *work-groups* for scheduling
purposes.  When the local work-group size is set to `nullptr` (as in the
runner), the driver chooses a group size that fits the hardware.  Work-items
within a work-group can share fast local memory, but the `ex_glmbayes` kernels
do not use local memory — each work-item is fully independent.

### When OpenCL is not available

If the package was built without `USE_OPENCL`, or if `nmathopencl_has_opencl()` returns
`FALSE` at runtime, the wrapper returns zero-filled output vectors.  There is
no serial CPU fallback that computes the actual values.

This is a deliberate design choice.  Implementing a correct CPU fallback that
matches the GPU computation would mean maintaining two versions of the same
mathematical code.  In practice, `glmbayes` calls `f2_f3_opencl` only after
checking `nmathopencl_has_opencl()`, and falls back to its own pure-R implementation when
OpenCL is not available.  The zero-return convention signals clearly that the
GPU path was not executed, rather than silently returning potentially stale or
incorrect results.

When implementing your own wrapper, you face the same choice: return zeros
(safe, unambiguous), throw an error, or call a CPU implementation of the same
computation.  Whatever you choose, make it explicit in the `#else` branch.

### Summary

| Mode | Execution | Result |
|---|---|---|
| GPU available (`USE_OPENCL` + `nmathopencl_has_opencl()`) | `m1` work-items launched simultaneously via `clEnqueueNDRangeKernel` | GPU-computed values |
| CPU-only build or no GPU at runtime | `#ifdef`/`nmathopencl_has_opencl()` guard triggers | Zero-filled outputs (or your chosen fallback) |

---

## The kernel wrapper

### General description

A kernel wrapper is an ordinary C++ function — always compiled, always
callable — that serves as the boundary between the calling code and the GPU
acceleration path.  It is the right place to start because it is the most
familiar-looking of the three roles.

Every kernel wrapper does the same four things, regardless of the specific
computation:

1. **Converts its C++ input types** to the plain `std::vector` types the
   runner requires and the OpenCL API can handle.  The exact conversions
   depend on the types in the function signature.
2. **Allocates output storage** for the runner to write into.
3. **Conditionally dispatches** to the runner if `USE_OPENCL` was defined at
   build time.  If not, the function returns the zero-initialized outputs.
4. **Converts the runner's output** from plain C++ back to the types the
   caller expects.

What varies between implementations is the **function signature** (the
specific C++ types coming in and going out), the **conversion steps** (which
helpers are needed), and the **output restructuring** (how flat C++ vectors
become the return types the caller needs).  The four-step structure, and the
`#ifdef USE_OPENCL` placement, are the same everywhere.

### Example: `f2_f3_opencl` from `ex_glmbayes`

The following is the kernel wrapper for the binomial-logit GLM gradient
computation.  Read it as one concrete instance of the general pattern
described above.

```cpp
// ── EXAMPLE: src/ex_glmbayes_kernel_wrappers.cpp ────────────────────────────
// This wrapper is specific to the GLM log-posterior computation.
// Your own wrapper will have a different name, a different signature,
// and different conversion and output steps.
// ────────────────────────────────────────────────────────────────────────────

#include <RcppArmadillo.h>
#include "openclPort.h"
#include "ex_glmbayes_opencl.h"

using namespace openclPort;
using namespace ex_glmbayes::opencl;

namespace ex_glmbayes { namespace opencl {

Rcpp::List f2_f3_opencl(
    std::string          family,   // "binomial", "poisson", etc.
    std::string          link,     // "logit", "probit", etc.
    Rcpp::NumericMatrix  b,        // coefficient grid, l2 × m1
    Rcpp::NumericVector  y,        // response, length l1
    Rcpp::NumericMatrix  x,        // design matrix, l1 × l2
    Rcpp::NumericMatrix  mu,       // prior mean
    Rcpp::NumericMatrix  P,        // prior precision, l2 × l2
    Rcpp::NumericVector  alpha,    // offsets, length l1
    Rcpp::NumericVector  wt,       // weights / trial counts, length l1
    int                  progbar
    // Your wrapper's parameters will reflect your own computation's
    // inputs. A simpler function might take just one or two vectors.
) {
  int l1 = x.nrow(), l2 = x.ncol(), m1 = b.ncol();
  // Your dimension variables will match your own inputs.
```

#### Step 1 — Convert C++ input types to flat `std::vector<double>`

```cpp
  // ── EXAMPLE: input conversion specific to this GLM computation ──────────
  // The C++ types coming in (Rcpp::NumericMatrix, Rcpp::NumericVector)
  // reflect what this particular function signature requires.
  // Your wrapper will have different input types and will apply whichever
  // combination of flattenMatrix / copyVector / manual conversion matches
  // your own function signature.

  auto X_flat     = flattenMatrix(x);
  auto B_flat     = flattenMatrix(b);
  auto mu_flat    = flattenMatrix(mu);
  auto P_flat     = flattenMatrix(P);
  auto alpha_flat = copyVector(alpha);
  auto y_flat     = copyVector(y);
  auto wt_flat    = copyVector(wt);
```

This conversion step, and the output vectors allocated next, all appear
*outside* any `#ifdef USE_OPENCL` block.  The reason is that the output
vectors (`qf_flat`, `grad_flat`) must be in scope after `#endif` when the
wrapper constructs the return value from them.  Keeping all the conversions
together, before the conditional block, makes the scoping unambiguous.

**Why not pass the Rcpp objects directly to the runner?**  A
`Rcpp::NumericMatrix` is a C++ object whose data lives in R's
garbage-collected heap (as an R SEXP).  Passing a raw pointer into that
storage to the OpenCL API is unsafe: the driver requires a stable, contiguous
memory region for the duration of the `clCreateBuffer` copy, and R's GC can
move or invalidate heap objects when control returns to R.  A
`std::vector<double>` provides the required guarantee: it owns its memory,
controls its lifetime through ordinary C++ scope rules, and will not move or
be freed until the vector itself goes out of scope.

**Why `double` and not `float`?**  The function signature uses `double`
throughout because that is the native precision of `Rcpp::NumericMatrix` and
`Rcpp::NumericVector` — and of the `double*` arrays they wrap.  Using 32-bit
`float` inside a kernel would require explicit downcast on every input and
upcast on every output, and each round-trip introduces rounding error that
makes GPU results diverge from the CPU reference computation.  Using `double`
throughout keeps the two paths numerically equivalent, at the cost of
requiring the `cl_khr_fp64` extension on the device (supported on all current
discrete GPUs and AMD APUs).

**Memory layout.**  `Rcpp::NumericMatrix` stores its data in column-major
order internally (matching R's own convention).  `flattenMatrix` preserves
that order: it traverses the matrix column-by-column, so the resulting flat
vector has the same layout.  The kernel indexes the design matrix as
`X[col * l1 + row]` — column-major indexing — so the flat vector and the
kernel's index arithmetic agree.  If the flat vector were row-major, every
off-diagonal access would silently return the wrong value.

```cpp
// openclPort helpers used above (defined in OpenCL_helper.cpp):

std::vector<double> flattenMatrix(const Rcpp::NumericMatrix& mat) {
  int nrow = mat.nrow(), ncol = mat.ncol();
  std::vector<double> out;
  out.reserve(static_cast<size_t>(nrow) * ncol);
  for (int j = 0; j < ncol; ++j)       // outer: columns
    for (int i = 0; i < nrow; ++i)     // inner: rows within column
      out.push_back(mat(i, j));
  return out;
}

std::vector<double> copyVector(const Rcpp::NumericVector& vec) {
  return std::vector<double>(vec.begin(), vec.end());
}
// For IntegerVector: std::vector<int>(v.begin(), v.end())
```

#### Step 2 — Allocate output buffers

```cpp
  // ── EXAMPLE: output sizes specific to this computation ──────────────────
  // Your output buffers will match your kernel's output parameters.
  // A simpler kernel writing one output vector would need only one
  // std::vector here.

  std::vector<double> qf_flat(m1);                      // scalar per grid point
  std::vector<double> grad_flat(static_cast<size_t>(m1) * l2);  // gradient matrix
  Rcpp::NumericVector qf(m1);   // zero-initialized; returned as-is on CPU-only build
```

`std::vector<double>` zero-initializes its elements by default.  If
`USE_OPENCL` was not defined, the wrapper returns these zero-filled objects —
a predictable, testable result on any machine without a GPU.  The
`static_cast<size_t>` prevents integer overflow in `m1 * l2` on 32-bit
platforms.

#### Step 3 — Select the kernel and assemble the program string

```cpp
#ifdef USE_OPENCL

  // ── EXAMPLE: family/link dispatch specific to GLM ────────────────────────
  // This dispatch is needed here because one wrapper covers multiple
  // GLM families. Your wrapper may call a single fixed kernel, in which
  // case kernel_name and kernel_file are constants, not variables.

  std::string kernel_name, kernel_file;

  if (family == "binomial" || family == "quasibinomial") {
    if (link == "logit") {
      kernel_name = "f2_f3_binomial_logit";
      kernel_file = "ex_glmbayes_src/f2_f3_binomial_logit.cl";
    } else if (link == "probit") {
      kernel_name = "f2_f3_binomial_probit";
      kernel_file = "ex_glmbayes_src/f2_f3_binomial_probit.cl";
    }
    /* ... other link functions ... */
  } else if (family == "poisson" || family == "quasipoisson") {
    kernel_name = "f2_f3_poisson";
    kernel_file = "ex_glmbayes_src/f2_f3_poisson.cl";
  }
  /* ... other families ... */

  // ── Program string assembly (same pattern for all wrappers) ─────────────
  // The shim libraries are always the same. What changes is the nmath
  // subset (driven by your kernel's @all_depends_nmath annotation) and
  // the kernel source file itself.

  std::string all_src =
      load_kernel_source("OPENCL.cl") +
      "\n" + load_kernel_library("libR_shims",      "nmathopencl", false) +
      "\n" + load_kernel_library("R_ext_types",     "nmathopencl", false) +
      "\n" + load_kernel_library("R_shims",         "nmathopencl", false) +
      "\n" + load_kernel_library("R_ext_runtime",   "nmathopencl", false) +
      "\n" + load_kernel_library("R_ext_internals", "nmathopencl", false) +
      "\n" + load_kernel_library("System",          "nmathopencl", false) +
      "\n" + load_library_for_kernel(
                 kernel_file, "ex_glmbayes_nmath", "nmathopencl",
                 "all_depends_nmath") +
      "\n" + load_kernel_source(kernel_file);
```

OpenCL has no `#include` directive, so the wrapper must concatenate every
source file the kernel depends on into one string before calling the runner.
The layers are assembled in dependency order (Chapter 03 explains each layer):
platform definitions, R compatibility shims, the nmath subset the kernel
needs, and finally the kernel source itself.  `load_library_for_kernel` reads
the `@all_depends_nmath` annotation from the kernel file and loads only those
nmath files in topological order, rather than loading all 137.

#### Step 4 — Dispatch to the runner

```cpp
  // ── EXAMPLE: runner call specific to this computation ───────────────────
  // Your call site will name your own runner function and pass your
  // own flat vectors. The pattern — pass the assembled string, the
  // kernel name, the dimensions, all inputs, and the output buffers —
  // is the same.

  f2_f3_kernel_runner(
      all_src, kernel_name.c_str(),
      l1, l2, m1,
      X_flat, B_flat, mu_flat, P_flat, alpha_flat, y_flat, wt_flat,
      qf_flat, grad_flat);    // filled in by the runner

  for (int j = 0; j < m1; ++j)
    qf[j] = qf_flat[j];      // copy scalar outputs into the Rcpp output vector

#else
  Rcpp::Rcout << "[INFO] OpenCL not available — returning zeros.\n";
#endif
```

#### Step 5 — Convert outputs from plain C++ back to the required return types

```cpp
  // ── EXAMPLE: two output conversion patterns ──────────────────────────────
  // qf   uses an element-by-element copy (clear, zero overhead for small m1).
  // grad uses a zero-copy Armadillo wrap to avoid copying a large matrix.
  //
  // Your wrapper will use whichever conversion pattern matches its own
  // return type: a copy loop or Rcpp::NumericVector(...) for a vector result,
  // arma::mat for a matrix, or a Rcpp::List assembling multiple outputs.

  arma::mat grad_arma(
      grad_flat.data(),   // pointer to existing flat storage — not copied
      m1,                 // number of rows
      l2,                 // number of columns
      false,              // copy_aux_mem = false: borrow the memory
      false               // strict = false
  );

  return Rcpp::List::create(
      Rcpp::Named("qf")   = qf,
      Rcpp::Named("grad") = grad_arma
  );
}

}} // namespace ex_glmbayes::opencl
```

**Two output conversion strategies are shown here.**

*Element-by-element copy (`qf`):* the runner writes results into the plain
`std::vector<double> qf_flat`.  After the runner returns, each element is
copied into the `Rcpp::NumericVector qf` that was pre-allocated before the
`#ifdef` block.  This is the clearest pattern and is effectively free when
`m1` is small.  For larger outputs, `Rcpp::NumericVector(v.begin(), v.end())`
achieves the same conversion in one line.

*Zero-copy Armadillo wrap (`grad`):* the runner writes into
`std::vector<double> grad_flat`.  Rather than copying all `m1 × l2` elements
into a new matrix object, an `arma::mat` is constructed over `grad_flat`'s
existing memory using Armadillo's `(ptr, nrow, ncol, copy_aux_mem, strict)`
constructor.  Setting `copy_aux_mem = false` tells Armadillo to use the
existing memory directly.  The matrix is valid until `grad_flat` is destroyed,
which happens after `Rcpp::List::create` serializes the result — so the
lifetime is safe.  For typical MCMC grid sizes the gradient matrix can be
several hundred kilobytes; avoiding the copy measurably reduces per-call
overhead.

Your own wrapper will use whichever conversion matches its return type: a copy
loop or `Rcpp::NumericVector(...)` for a vector result, an `arma::mat` wrap
for a large matrix, or a `Rcpp::List` assembling multiple outputs.

---

## The kernel runner

### General description

A kernel runner is a C++ function whose entire body is enclosed in
`#ifdef USE_OPENCL … #endif`.  It literally does not exist in a CPU-only
build: no prototype visible to the linker, no symbols in the object file, no
`CL/cl.h` ever seen by the preprocessor.

Every kernel runner follows the same **eight-step lifecycle**, regardless of
what computation it performs:

1. Select a device.
2. Create a context and command queue.
3. Compile the program on-device.
4. Allocate device buffers and copy input data to the GPU.
5. Bind kernel arguments.
6. Launch the kernel.
7. Read results back to the host (blocking).
8. Release all resources.

What varies between implementations is the **function signature** (the specific
`std::vector` and `int` parameters matching your computation), the **number
and arrangement of device buffers** (one `cl_mem` per kernel parameter), and
the **number of `clSetKernelArg` calls** (one per kernel parameter, in the
same order as the kernel's `__kernel` signature).  The eight steps and their
sequence are the same for every runner.

The runner's interface uses only plain C++ types — `std::vector<double>`,
`int`, `std::string`, `const char*`.  It has no Rcpp dependencies and no
knowledge of R's memory model.  This makes the runner the easiest of the
three layers to test in isolation.

### Example: `f2_f3_kernel_runner` from `ex_glmbayes`

```cpp
// ── EXAMPLE: src/ex_glmbayes_kernel_runners.cpp ─────────────────────────────
// This runner is specific to the GLM gradient computation, which has
// seven input matrices/vectors and two output buffers.
// Your runner will have a different name, a different parameter list,
// and a different number of clCreateBuffer / clSetKernelArg calls,
// but the eight lifecycle steps will be the same.
// ────────────────────────────────────────────────────────────────────────────

#ifdef USE_OPENCL

#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>

#include "openclPort.h"
using namespace openclPort;

namespace ex_glmbayes { namespace opencl {

void f2_f3_kernel_runner(
    const std::string&         kernel_source,  // fully assembled program text
    const char*                kernel_name,    // e.g. "f2_f3_binomial_logit"
    int l1, int l2, int m1,
    // ── EXAMPLE: seven input vectors specific to the GLM computation ────────
    // Your runner will have whatever std::vector parameters your kernel needs.
    // A simpler kernel might take just one or two input vectors.
    const std::vector<double>& X_flat,
    const std::vector<double>& B_flat,
    const std::vector<double>& mu_flat,
    const std::vector<double>& P_flat,
    const std::vector<double>& alpha_flat,
    const std::vector<double>& y_flat,
    const std::vector<double>& wt_flat,
    // ── EXAMPLE: two output vectors specific to this computation ────────────
    std::vector<double>&       qf_flat,        // written back on return
    std::vector<double>&       grad_flat       // written back on return
) {
```

#### Step 1 — Select a device

```cpp
  cl_platform_id platform = nullptr;
  cl_device_id   device   = nullptr;
  opencl_bind_selected_fp64_device_or_throw(platform, device);
  // This call is the same in every runner. It selects the device the
  // user chose via opencltools::set_opencl_device() and verifies that
  // it supports 64-bit floating point.
```

#### Step 2 — Create a context and command queue

```cpp
  cl_context context =
      clCreateContext(nullptr, 1, &device, nullptr, nullptr, &status);

  cl_queue_properties props[] = {0};
  cl_command_queue queue =
      clCreateCommandQueueWithProperties(context, device, props, &status);
  // These two calls are identical in every runner.
```

An OpenCL *context* is the environment within which all subsequent objects
live — buffers, programs, kernels, and queues all belong to a context and
share access to the same device memory.  A *command queue* is the channel
through which the host issues instructions to the device.  Commands are placed
in the queue and the device executes them in order, asynchronously with the
host CPU.

#### Step 3 — Compile the program on-device

```cpp
  const char* src_ptr = kernel_source.c_str();
  size_t      src_len = kernel_source.size();
  cl_program  program =
      clCreateProgramWithSource(context, 1, &src_ptr, &src_len, &status);

  status = clBuildProgram(program, 0, nullptr, nullptr, nullptr, nullptr);
  // These calls are identical in every runner. If compilation fails,
  // the runner retrieves the build log from the driver and throws a
  // std::runtime_error with the full diagnostic.

  cl_kernel kernel = clCreateKernel(program, kernel_name, &status);
  // kernel_name is the C-string name of the __kernel function in the
  // .cl file — passed in from the wrapper.
```

`clBuildProgram` triggers on-device compilation: the GPU driver parses the
OpenCL C source, optimizes it, and produces GPU-native machine code specific
to the selected device.  This is analogous to `R CMD SHLIB` but happens at
runtime inside the GPU driver process.  `clCreateKernel` then extracts the
specific `__kernel` function by name from the compiled program.

#### Step 4 — Allocate device buffers and transfer input data

```cpp
  // ── EXAMPLE: nine buffers specific to this computation ──────────────────
  // Your runner will have one cl_mem per kernel parameter (excluding
  // scalar integers, which are passed directly in Step 5).
  // The split between READ_ONLY and WRITE_ONLY reflects the kernel's
  // use of each buffer.

  // Input buffers: allocate and copy from host in one call
  cl_mem bufX = clCreateBuffer(
      context,
      CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
      sizeof(double) * X_flat.size(),
      (void*) X_flat.data(),
      &status);
  // ... similarly for bufB, bufMu, bufP, bufA, bufY, bufW ...

  // Output buffers: allocate uninitialized device memory
  cl_mem bufQF = clCreateBuffer(
      context,
      CL_MEM_WRITE_ONLY,
      sizeof(double) * m1,
      nullptr,     // no initial data — the kernel will write here
      &status);

  cl_mem bufGrad = clCreateBuffer(
      context,
      CL_MEM_WRITE_ONLY,
      sizeof(double) * l2 * m1,
      nullptr,
      &status);
```

For *input* buffers, `CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR` allocates
device memory and immediately copies the flat vector's data into it.  After
this call returns, the device buffer is fully populated and the host-side
vector is no longer needed by the driver.

For *output* buffers, `CL_MEM_WRITE_ONLY` with a `nullptr` host pointer
allocates uninitialized device memory that the kernel will write into.  The
results are read back to the host after execution.

#### Step 5 — Bind kernel arguments

```cpp
  // ── EXAMPLE: twelve arguments for this kernel ────────────────────────────
  // Your runner will have one clSetKernelArg call per kernel parameter,
  // in the exact same order as the parameters appear in the __kernel
  // function signature. Mismatched order causes a runtime error.

  int arg = 0;
  clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufX);
  clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufB);
  clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufMu);
  clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufP);
  clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufA);
  clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufY);
  clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufW);
  clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufQF);
  clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufGrad);
  // Scalar integers are passed by value, not as cl_mem handles:
  clSetKernelArg(kernel, arg++, sizeof(int),    &l1);
  clSetKernelArg(kernel, arg++, sizeof(int),    &l2);
  clSetKernelArg(kernel, arg++, sizeof(int),    &m1);
```

`clSetKernelArg` connects a host-side value to a positional parameter in the
kernel's `__kernel` signature.  The second argument is the zero-based
parameter index.  Buffer arguments (`cl_mem`) are passed by address; scalar
arguments (`int`, `double`) are also passed by address and the driver copies
the value.  The count and order of calls here must exactly mirror the count
and order of parameters in the `.cl` file.

#### Step 6 — Launch the kernel

```cpp
  // The global work size is the total number of work-items to launch.
  // For this computation it is m1: one independent instance of the kernel
  // function per grid point. All m1 instances are launched simultaneously
  // (see "Serial vs. parallel execution" earlier in this chapter).
  // Your kernel may use a different global size reflecting its own
  // parallelism structure — e.g. the number of output elements.

  size_t global = (size_t) m1;
  clEnqueueNDRangeKernel(
      queue, kernel,
      1,          // one-dimensional work space (indexed by j = 0..m1-1)
      nullptr,    // global offset: start from index 0
      &global,    // total work-items: m1
      nullptr,    // local work-group size: let the driver choose
      0, nullptr, nullptr);
  // This call places the launch command in the queue and returns immediately.
  // The GPU executes the m1 work-items asynchronously with the host CPU.
  // The blocking clEnqueueReadBuffer in Step 7 is what forces the host to
  // wait until all work-items have finished before reading the results.
```

#### Step 7 — Read results back to the host (blocking)

```cpp
  // ── EXAMPLE: two read-backs for this computation ─────────────────────────
  // Your runner will have one clEnqueueReadBuffer call per output buffer.

  clEnqueueReadBuffer(
      queue, bufQF,
      CL_TRUE,                              // blocking: wait until done
      0,
      sizeof(double) * qf_flat.size(),
      qf_flat.data(),
      0, nullptr, nullptr);

  clEnqueueReadBuffer(
      queue, bufGrad,
      CL_TRUE,
      0,
      sizeof(double) * grad_flat.size(),
      grad_flat.data(),
      0, nullptr, nullptr);
  // After these calls, qf_flat and grad_flat contain GPU-computed results.
```

`CL_TRUE` makes each call blocking: it waits until the data transfer from
device to host is complete.  Because commands execute in queue order, this
also guarantees that the kernel has finished.

#### Step 8 — Release all resources

```cpp
  // Release in reverse order of creation.
  // Every runner must release every cl_mem, cl_kernel, cl_program,
  // cl_command_queue, and cl_context it creates. Omissions leak GPU VRAM.

  clReleaseMemObject(bufGrad);
  clReleaseMemObject(bufQF);
  clReleaseMemObject(bufW);
  clReleaseMemObject(bufY);
  clReleaseMemObject(bufA);
  clReleaseMemObject(bufP);
  clReleaseMemObject(bufMu);
  clReleaseMemObject(bufB);
  clReleaseMemObject(bufX);
  clReleaseKernel(kernel);
  clReleaseProgram(program);
  clReleaseCommandQueue(queue);
  clReleaseContext(context);
}

}} // namespace ex_glmbayes::opencl

#endif // USE_OPENCL
// Everything from the opening #ifdef to this #endif is invisible to the
// preprocessor on a machine without OpenCL.
```

---

## The kernel (`*.cl`)

### General description

A kernel is an OpenCL C source file that contains the actual computation to
be run on the GPU.  It is the innermost piece: pure math, with no knowledge
of R, no knowledge of the runner's bookkeeping, and no access to host memory.
Its only interface with the outside world is the set of `__global` pointer
and scalar arguments the runner hands to it via `clSetKernelArg`.

Every kernel file has the same basic structure, regardless of the
computation:

1. **Annotation header** — comment lines with `@all_depends_nmath` (and
   optionally `@depends`, `@provides`) that the wrapper's program assembler
   reads to determine which nmath files to prepend.  If your kernel does not
   call any nmath functions, this section can be omitted.
2. **Extension pragmas** — at minimum `#pragma OPENCL EXTENSION cl_khr_fp64 :
   enable` if the kernel uses `double`.
3. **Helper functions** — any `static inline` device-side functions called
   by the `__kernel` entry point, including nmath functions that will be
   prepended by the assembler.
4. **`__kernel` entry point** — the function the runner calls via
   `clCreateKernel`.

What varies between kernels is the **`__kernel` signature** (one `__global`
parameter per device buffer allocated in the runner), the **scalar parameters**
(dimension integers and other constants), and the **computation itself**.

A kernel is never compiled when the R package is installed.  Instead, the
source text is assembled into a string by the wrapper, passed to the runner,
and handed to `clCreateProgramWithSource`.  The GPU driver compiles it to
native machine code at runtime.  This *on-device compilation* allows a single
source file to run on any GPU without vendor-specific binaries.

### Example: `f2_f3_binomial_logit` from `ex_glmbayes`

```c
// ── EXAMPLE: inst/cl/ex_glmbayes_src/f2_f3_binomial_logit.cl ───────────────
// This kernel is specific to the binomial-logit GLM gradient computation.
// Your kernel will have a different name, a different parameter list
// (matching the device buffers your runner allocates), and a different
// computation. The file structure — annotations, pragma, helpers,
// __kernel entry point — is the same.
// ────────────────────────────────────────────────────────────────────────────

// @all_depends_nmath: dpq, refactored, Rmath, nmath, stirlerr_cycle_free,
//   chebyshev, cospi, fmax2, gammalims, lgammacor, log1p, gamma, lgamma,
//   pgamma_utils, stirlerr_cycle_dependent, bd0, stirlerr, dbinom
// If your kernel calls no nmath functions, omit this annotation entirely.

#pragma OPENCL EXTENSION cl_khr_fp64 : enable
// Required whenever the kernel uses 'double'. Always include this.

#pragma OPENCL EXTENSION cl_khr_printf : enable
// Optional; enables printf() for debugging inside the kernel.

// ── EXAMPLE: compile-time constant specific to this computation ─────────────
// MAX_L2 bounds the size of the per-work-item private array g_loc[].
// GPU private memory is fixed at compile time, so a maximum is needed.
// Your kernel may need a similar constant if it uses fixed-size local arrays,
// or may not need one at all if its private storage is scalar.
#define MAX_L2 64
```

**`@all_depends_nmath`** lists the nmath files whose source must be prepended
before this kernel when the program string is assembled.  The wrapper's
`load_library_for_kernel` reads this annotation and loads exactly those files
in topological order.  Chapter 03 explains the annotation scheme and the
dependency graph in detail.

**`MAX_L2`** bounds the size of a per-work-item stack array.  Each work-item
allocates `double g_loc[MAX_L2]` in *private memory* — the GPU equivalent of
a register file — whose size must be known at compile time.  A maximum
coefficient count of 64 is sufficient for the GLM use case.  Your kernel may
need a similar constant, or may not require any fixed-size private arrays at
all.

#### The kernel signature

```c
// ── EXAMPLE: parameter list matching the nine buffers allocated in the runner
// Your kernel's __global parameters must exactly correspond (in number,
// type, and order) to the cl_mem buffers created by clCreateBuffer in the
// runner and bound by clSetKernelArg.

__kernel void f2_f3_binomial_logit(
    __global const double* X,      // design matrix, l1 × l2, column-major
    __global const double* B,      // coefficient grid, m1 × l2
    __global const double* mu,     // prior mean vector, length l2
    __global const double* P,      // prior precision matrix, l2 × l2
    __global const double* alpha,  // offsets, length l1
    __global const double* y,      // response (success proportions), length l1
    __global const double* wt,     // trial counts / weights, length l1
    __global double*       qf,     // OUTPUT: neg log-posterior, length m1
    __global double*       grad,   // OUTPUT: gradient, m1 × l2
    const int l1,
    const int l2,
    const int m1
    // Scalar integers are passed by value; no __global qualifier.
) {
```

**`__kernel`** marks this function as a GPU entry point that `clCreateKernel`
can look up by name.

**`__global`** is an OpenCL address-space qualifier.  It marks a pointer as
pointing into *global device memory* — the GPU's main VRAM — which all
work-items can access.  Pointers without an address-space qualifier point into
the work-item's private memory, which is invisible to other work-items and
cannot be mapped from the host.  Every `cl_mem` buffer created in the runner
corresponds to exactly one `__global` pointer in the kernel signature.

**`const int l1, l2, m1`** are scalar values set via `clSetKernelArg` with
`sizeof(int)` and passed by value.  Scalar kernel arguments do not use
`__global`.

#### The parallel execution body

```c
    int j = get_global_id(0);   // which grid point is this work-item?
    if (j >= m1) return;        // guard: don't process out-of-range indices
    // Your kernel's first line will also call get_global_id(0) to determine
    // which element or slice this work-item is responsible for.
```

`get_global_id(0)` returns the unique index of this work-item within the
one-dimensional work space.  When the runner enqueues the kernel with a global
size of `m1`, the GPU launches `m1` work-items simultaneously, each receiving
a different value of `j`.  Each work-item executes the function body
independently for its own grid point.  This is the GPU's answer to the CPU's
outer `for (j = 0; j < m1; j++)` loop.

The `if (j >= m1) return` guard handles the case where the GPU rounds up the
number of work-items to a multiple of the hardware's work-group size.  Without
this guard, extra work-items would write past the end of the output buffers.

```c
    // ── EXAMPLE: GLM log-posterior and gradient ───────────────────────────
    // The computation below is specific to the binomial-logit model.
    // Your kernel body will contain whatever computation you are
    // accelerating, reading from __global input pointers and writing
    // to __global output pointers.

    double g_loc[MAX_L2];   // private (per-work-item) working storage

    // 1. Quadratic prior term: 0.5 × (B[j,] − μ)ᵀ P (B[j,] − μ)
    // 2. Binomial likelihood loop over l1 observations
    // 3. Write results

    qf[j] = res_acc;

    for (int k = 0; k < l2; ++k)
        grad[k * m1 + j] = g_loc[k];
    // grad is stored as grad[col * m1 + row] — column-major — so that
    // the wrapper's arma::mat(ptr, m1, l2) interprets it correctly.
    // Your output layout should match how your wrapper will read it back.
```

The gradient is written in column-major order (`grad[col * m1 + row]`) so
that the `arma::mat(grad_flat.data(), m1, l2)` construction in the wrapper
produces an `m1 × l2` matrix in the layout R expects.  The decision about
output memory layout is made in the kernel and must be consistent with how
the wrapper reads the data back.

---

## The complete call chain

```
R:  f2_f3_opencl("binomial", "logit", b, y, x, mu, P, alpha, wt)
      │
      ▼
┌─────────────────────────────────────────────────────────────────┐
│  Kernel wrapper  (always compiled, always callable)             │
│                                                                 │
│  1. Read dimensions                                             │
│  2. Convert Rcpp inputs → std::vector<double>                   │
│     (flattenMatrix for matrices, copyVector for vectors)        │
│  3. Allocate zero-filled output vectors                         │
│  4. Select kernel name + file (family/link dispatch)            │
│  5. Assemble program string                                     │
│     (OPENCL.cl + shims + nmath subset + kernel source)          │
│                                                                 │
│  ── #ifdef USE_OPENCL ──────────────────────────────────────── │
│  6. Call f2_f3_kernel_runner(all_src, kernel_name, ...)         │
│  7. Copy qf_flat → NumericVector qf                             │
│  ── #endif ─────────────────────────────────────────────────── │
│                                                                 │
│  8. Wrap grad_flat → arma::mat (zero-copy)                      │
│  9. Return List(qf = ..., grad = ...)                           │
└─────────────────────────────────────────────────────────────────┘
      │  std::vector<double>: flat, contiguous, R-free, device-safe
      │  (only reached when USE_OPENCL is defined)
      ▼
┌─────────────────────────────────────────────────────────────────┐
│  Kernel runner  (compiled only when USE_OPENCL is defined)      │
│                                                                 │
│  1. Bind device (fp64-capable GPU)                              │
│  2. Create context + command queue                              │
│  3. clCreateProgramWithSource + clBuildProgram                  │
│     → GPU driver compiles the assembled string on-device        │
│  4. clCreateKernel("f2_f3_binomial_logit")                      │
│  5. clCreateBuffer × 9                                          │
│     inputs:  READ_ONLY | COPY_HOST_PTR → data copied to VRAM   │
│     outputs: WRITE_ONLY, nullptr → uninitialized device memory  │
│  6. clSetKernelArg × 12  (9 buffers + l1, l2, m1)              │
│  7. clEnqueueNDRangeKernel(global_size = m1)                    │
│     → queues the launch; GPU executes asynchronously            │
│  8. clEnqueueReadBuffer × 2 (blocking)                          │
│     → waits for GPU; copies results to qf_flat, grad_flat       │
│  9. clRelease* for all objects                                  │
└─────────────────────────────────────────────────────────────────┘
      │  clEnqueueNDRangeKernel(global = m1):
      │  ALL m1 work-items launched simultaneously — not a loop
      ▼
┌─────────────────────────────────────────────────────────────────┐
│  Kernel  (compiled on-device; m1 instances run in parallel)     │
│                                                                 │
│  Each work-item  j = get_global_id(0)  independently:           │
│    • quadratic prior term  qf[j] += 0.5 × (B[j,]−μ)ᵀP(B[j,]−μ)│
│    • for i in 0..l1:  sigmoid → p                               │
│                        nll_binomial_glmb_ocl(y[i], wt[i], p)    │
│                        gradient accumulation → g_loc[k]          │
│    • qf[j] = res_acc                                            │
│    • grad[k*m1 + j] = g_loc[k]  for k in 0..l2                 │
└─────────────────────────────────────────────────────────────────┘
      │
      └─► qf_flat, grad_flat  ←  runner  ←  wrapper  ←  R
```

---

## CRAN safety: how the architecture prevents build failures

CRAN's build servers have no OpenCL SDK installed.  A package that includes
`CL/cl.h` or links against `-lOpenCL` on a CRAN server will fail to build
and be rejected.

The runner being entirely absent from CPU-only builds is the critical property:

- `#include <CL/cl.h>` appears only at the top of the runner file, inside
  `#ifdef USE_OPENCL`.  On a CRAN build the preprocessor never reaches that
  line.
- Every `cl_*` type and every `clCreate*` / `clEnqueue*` call lives inside
  the same `#ifdef` block.  None of them generate any object-code symbols.
- The linker therefore never looks for `-lOpenCL` and the build succeeds on
  any machine.

The wrapper being unconditionally compiled means the full C++ API surface is
present on every build.  On a CPU-only build, `f2_f3_opencl` returns a list
with `qf` and `grad` both zero — the same structure the caller expects, with
zero values instead of GPU-computed values — making it straightforward to test
the function's existence and return shape without a GPU.

---

## Applying the pattern in your package

### File layout

```
src/
  mypkg_kernel_wrappers.cpp   # always compiled; internal #ifdef USE_OPENCL
  mypkg_kernel_runners.cpp    # all function bodies inside #ifdef USE_OPENCL
inst/cl/
  mypkg_src/
    myfunction_kernel.cl      # __kernel + @all_depends_nmath annotation
```

### Type conversion reference

For inputs (wrapper converts C++ input types → `std::vector` before calling
the runner):

| C++ input type | Conversion | Runner argument type |
|---|---|---|
| `Rcpp::NumericVector` | `copyVector(v)` | `const std::vector<double>&` |
| `Rcpp::NumericMatrix` | `flattenMatrix(m)` | `const std::vector<double>&` |
| `Rcpp::IntegerVector` | `std::vector<int>(v.begin(), v.end())` | `const std::vector<int>&` |

For outputs (wrapper converts `std::vector` → required return types after the
runner returns):

| Return type needed | Conversion from `std::vector<double>` | When to use |
|---|---|---|
| `Rcpp::NumericVector` (small) | `for (j) out[j] = flat[j]` | Clarity |
| `Rcpp::NumericVector` (large) | `Rcpp::NumericVector(v.begin(), v.end())` | Conciseness |
| `arma::mat` / `Rcpp::NumericMatrix` | `arma::mat(ptr, nrow, ncol, false, false)` | Avoid large copy |

### Headers and `DESCRIPTION`

- **Wrapper files:** include `openclPort.h` (for `load_kernel_source`,
  `load_library_for_kernel`, `flattenMatrix`, `copyVector`) and
  `RcppArmadillo.h` if returning matrices.
- **Runner files:** include `openclPort.h` unconditionally; include
  `CL/cl.h` inside `#ifdef USE_OPENCL`.
- **`DESCRIPTION`:** add `nmathopencl` to `LinkingTo`; add
  `Imports: Rcpp, RcppArmadillo` as appropriate.

Chapter 09 covers the generic `openclPort` runner infrastructure — device
selection, scalar and matrix runner templates, and error handling — in detail.
Chapter 10 applies this full three-role pattern end-to-end for all five
`ex_glmbayes` gradient kernels.