---
title: "Chapter 06: Integrating Kernel Wrappers into Your Codebase"
author: "Kjell Nygren"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Chapter 06: Integrating Kernel Wrappers into Your Codebase}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction

Chapter 05 described the internal structure of a kernel wrapper: how inputs
are converted, how the runner is dispatched, and how results are converted
back.  This chapter takes a step back and looks at how kernel wrappers fit
into the broader codebase of a package.

Two questions arise immediately:

1. **What happens when OpenCL is not available?**  Every kernel wrapper must
   have a CPU path.  A wrapper that simply returns zeros is safe but
   unhelpful; most real wrappers need to fall back to a correct CPU
   computation.

2. **How is the wrapper exposed?**  Some kernel wrappers have a direct
   interface into R (callable from R code).  Others are purely internal C++
   components, called by other C++ functions that hold the R-facing API.
   The choice depends on whether the computation has a natural direct R use.

`nmathopencl` contains examples of both patterns.  The distribution-function
wrappers (`dnorm_opencl`, `pnorm_opencl`, etc.) are exported R functions with
rich CPU fallback logic.  The GLM gradient wrapper (`f2_f3_opencl`) is a
purely internal C++ component, called by a C++ dispatcher that also has a
separate CPU implementation (`f2_f3_non_opencl`).  Both patterns are explored
in detail below.

---

## The two integration patterns

### Pattern 1: wrapper with a direct R interface

In this pattern the kernel wrapper (or a thin R function that calls it)
is exported and callable directly from R.  The CPU fallback is the equivalent
computation using standard R or C functions — in `nmathopencl`'s case, the
`stats::` distribution functions.

```
R caller
  │
  ▼
R wrapper function  (exported, input validation, recycling)
  │  if inputs are non-finite, sd == 0, etc. → fallback_full()
  │
  ▼
.opencl_try_or_fallback()
  │  if !nmathopencl_has_opencl()           → fallback_expr()  (CPU path)
  │  if OpenCL call succeeds    → return GPU result
  │  if OpenCL call fails
  │    and fallback = TRUE      → fallback_expr()  (CPU path)
  │    and fallback = FALSE     → propagate error
  ▼
C++ kernel wrapper  (internal, not exported)
  │  #ifdef USE_OPENCL + nmathopencl_has_opencl() guard
  │  type conversion + program assembly + runner dispatch
  ▼
GPU result
```

The fallback can be triggered at two separate levels:

- **R level** (before the C++ call): when input validation detects a
  condition the GPU path cannot handle (e.g. `sd == 0`, non-finite values).
  `fallback_full()` calls `stats::dnorm(x, mean, sd, log = log)` directly.

- **C++ / runtime level**: `.opencl_try_or_fallback()` checks `nmathopencl_has_opencl()`
  before attempting the GPU path.  If OpenCL is not available it calls
  `fallback_expr()` without ever touching the C++ kernel wrapper.  If a
  GPU call throws an exception and `fallback = TRUE`, it catches the error
  and calls `fallback_expr()`.

### Pattern 2: wrapper as an internal C++ component

In this pattern the kernel wrapper has no direct R interface.  It is called
from within a C++ dispatcher function alongside a CPU counterpart.  The R
interface belongs to a higher-level function that selects between the two
based on a `use_opencl` flag passed in by the caller.

```
R caller
  │
  ▼
Exported R function  (e.g. Ex_EnvelopeEval)
  │  validates inputs, passes use_opencl flag
  ▼
.EnvelopeEval_cpp()   (internal R → C++ bridge, [[Rcpp::export]])
  ▼
EnvelopeEval_cpp()    (C++ dispatcher)
  │  if use_opencl && nmathopencl_has_opencl()
  │    → f2_f3_opencl()     (OpenCL kernel wrapper)
  │  else
  │    → f2_f3_non_opencl() (pure C++ CPU implementation)
  ▼
Result (qf, grad) returned regardless of path taken
```

The two implementations — `f2_f3_opencl` and `f2_f3_non_opencl` — share the
same function signature and return the same data structure.  The caller cannot
tell from the return value which path was taken.

---

## Pattern 1 in detail: `dnorm_opencl`

### The R wrapper

`dnorm_opencl` in `R/normal_opencl.R` is the user-facing function.  It
mirrors the interface of `stats::dnorm` and adds `opencl_parallel`,
`fallback`, and `verbose` arguments.

```r
# R/normal_opencl.R  (simplified)

#' @export
dnorm_opencl <- function(x, mean = 0, sd = 1, log = FALSE,
                         opencl_parallel = NA, fallback = FALSE,
                         verbose = FALSE) {

  # ── Input validation ──────────────────────────────────────────────────────
  # These checks mirror stats::dnorm behavior.
  if (!is.numeric(x))    stop("`x` must be numeric.")
  if (!is.numeric(mean)) stop("`mean` must be numeric.")
  if (!is.numeric(sd))   stop("`sd` must be numeric.")
  if (length(x) == 0L)   return(numeric(0))

  # ── Recycling (like stats::dnorm) ─────────────────────────────────────────
  len  <- max(length(x), length(mean), length(sd))
  xv   <- rep_len(as.double(x),    len)
  mv   <- rep_len(as.double(mean), len)
  sv   <- rep_len(as.double(sd),   len)
  logv <- rep_len(log,             len)

  # ── R-level fallback function ─────────────────────────────────────────────
  # Called when inputs contain conditions the GPU path cannot handle,
  # or when OpenCL is unavailable and fallback = TRUE.
  fallback_full <- function() {
    stats::dnorm(x, mean = mean, sd = sd, log = log)
  }

  # ── R-level conditions that force the CPU path ────────────────────────────
  if (any(!is.finite(xv) | !is.finite(mv) | !is.finite(sv))) {
    return(fallback_full())   # stats::dnorm handles NaN, Inf, NA
  }
  if (any(sv < 0)) {
    stop("`sd` must be non-negative.", call. = FALSE)
  }
  if (any(sv == 0)) {
    return(fallback_full())   # degenerate case; stats::dnorm handles it
  }

  # ── Dispatch: try GPU, fall back to CPU on failure if fallback = TRUE ─────
  log_int <- as.integer(logv)
  opc     <- .encode_opencl_parallel(opencl_parallel)

  .opencl_try_or_fallback(
    opencl_expr  = function() .dnorm_opencl(xv, mv, sv, log_int, opc, verbose),
    fallback_expr = fallback_full,
    fallback      = fallback,
    verbose       = verbose,
    fn_name       = "dnorm_opencl"
  )
}
```

`.dnorm_opencl` (dot-prefixed) is the internal Rcpp-exported symbol for the
C++ kernel wrapper.  It is not part of the public API; it exists only to make
the C++ function callable from R.

### The `.opencl_try_or_fallback` helper

This helper encapsulates the runtime dispatch logic that every Pattern 1
wrapper shares:

```r
# R/opencl_linkage_utils.R

.opencl_try_or_fallback <- function(opencl_expr, fallback_expr,
                                    fallback, verbose, fn_name) {
  if (!nmathopencl_has_opencl()) {
    # OpenCL not available in this build or session — go straight to CPU.
    if (verbose)
      message(sprintf("[%s] OpenCL unavailable; using CPU fallback.", fn_name))
    return(fallback_expr())
  }

  # OpenCL available: try the GPU path.
  out <- tryCatch(opencl_expr(), error = function(e) e)

  if (inherits(out, "error")) {
    if (fallback) {
      # GPU call failed and the caller requested a fallback.
      if (verbose) {
        message(sprintf("[%s] OpenCL call failed; using CPU fallback.", fn_name))
        message(out$message)
      }
      return(fallback_expr())
    }
    stop(out$message, call. = FALSE)  # no fallback requested — propagate error
  }

  out  # GPU call succeeded
}
```

The design makes the fallback behavior explicit and controllable:

- `fallback = FALSE` (default): if the GPU call fails, the error propagates
  to the caller.  The caller sees an actual error rather than silently
  receiving CPU results.
- `fallback = TRUE`: if the GPU call fails, the CPU path is used
  transparently.  Useful in batch workflows where any result is better than
  an error.

### The C++ kernel wrapper

The C++ kernel wrapper `.dnorm_opencl` is exported to R via
`// [[Rcpp::export(name = ".dnorm_opencl")]]`.  It is the minimal C++ entry
point: it converts inputs, runs the GPU path if available, and returns zeros
if not.

```cpp
// src/kernel_wrappers.cpp  (within nmathopencl namespace)

// [[Rcpp::export(name = ".dnorm_opencl")]]
Rcpp::NumericVector dnorm_opencl(
    const Rcpp::NumericVector& x,
    const Rcpp::NumericVector& mean,
    const Rcpp::NumericVector& sd,
    const Rcpp::IntegerVector& give_log,
    int                        opencl_parallel_code,
    bool                       verbose
) {
  const int len = x.size();
  Rcpp::NumericVector out(len);   // zero-initialized

#ifdef USE_OPENCL
  if (!nmathopencl_has_opencl() || len == 0) return out;

  try {
    d_givelog_ndrange_kernel_fill(
        "src/dnorm_kernel.cl", "dnorm_kernel",
        len, {&x, &mean, &sd}, give_log, out, verbose);
  } catch (const std::exception& e) {
    if (verbose) Rcpp::Rcout << e.what() << "\n";
    throw;
  }
#endif

  return out;
}
```

Note that the C++ wrapper itself returns zeros when `!nmathopencl_has_opencl()`.  It does
**not** call `stats::dnorm`.  The R wrapper is responsible for the fallback
to `stats::dnorm`; the C++ wrapper simply reports "no GPU result" via zeros.
This keeps the C++ layer free of any R evaluation machinery.

---

## Pattern 2 in detail: `f2_f3_opencl`

### The exported R function

`Ex_EnvelopeEval` (in `R/ex_glmbayes.R`) is the user-facing function.  It
accepts a `use_opencl` flag and delegates entirely to the C++ dispatcher:

```r
# R/ex_glmbayes.R

#' @export
Ex_EnvelopeEval <- function(G4, y, x, mu, P, alpha, wt,
                            family, link,
                            use_opencl = FALSE,
                            verbose    = FALSE) {
  # Input validation (matrix/vector type checks) ...

  .EnvelopeEval_cpp(G4, y, x, mu, P, alpha, wt,
                    family, link, use_opencl, verbose)
}
```

There is no R-level fallback function here.  The fallback is handled entirely
inside the C++ dispatcher.

### The C++ dispatcher

`EnvelopeEval_cpp` (inside `src/`) receives `use_opencl` and decides which
C++ implementation to call:

```cpp
// src/ (conceptual structure — details in actual source)

Rcpp::List EnvelopeEval_cpp(
    Rcpp::NumericMatrix G4, Rcpp::NumericVector y,
    Rcpp::NumericMatrix x,  Rcpp::NumericMatrix mu,
    Rcpp::NumericMatrix P,  Rcpp::NumericVector alpha,
    Rcpp::NumericVector wt,
    std::string family, std::string link,
    bool use_opencl, bool verbose
) {
  // Prepare shared inputs (common to both paths) ...

  if (use_opencl && nmathopencl_has_opencl()) {
    // GPU path: call the OpenCL kernel wrapper
    return ex_glmbayes::opencl::f2_f3_opencl(
        family, link, b, y, x, mu, P, alpha, wt, verbose);
  } else {
    // CPU path: call the pure C++ implementation
    return ex_glmbayes::f2_f3_non_opencl(
        family, link, b, y, x, mu, P, alpha, wt);
  }
}
```

Both `f2_f3_opencl` and `f2_f3_non_opencl` return a `Rcpp::List` with
identical structure: `list(qf = numeric(m1), grad = matrix(m1, l2))`.  The
dispatcher's caller cannot tell from the return value which path was used.

### Why a dedicated CPU implementation?

For Pattern 1 (distribution functions), the CPU fallback is an existing
well-tested function from `stats::`.  No separate CPU implementation is
needed.

For the GLM gradient computation, no equivalent off-the-shelf CPU function
exists.  `f2_f3_non_opencl` is a pure C++ implementation of the same
mathematical computation, written without any OpenCL dependency.  It compiles
on every platform and produces bit-for-bit equivalent results to the GPU path
(within double-precision rounding).

Having both implementations under explicit control also makes it possible to
benchmark them directly: `use_opencl = FALSE` forces the CPU path even on a
GPU-equipped machine.

---

## Choosing between the two patterns

The choice between Pattern 1 and Pattern 2 comes down to whether there is a
natural existing CPU computation to fall back to.

| Criterion | Pattern 1 (R interface + R fallback) | Pattern 2 (C++ dispatch + CPU implementation) |
|---|---|---|
| Existing CPU function available? | Yes (`stats::`, `base::`, etc.) | No; need to write the CPU implementation |
| Does the computation have a direct R use? | Yes (called directly from R) | Often not (called from a C++ simulation loop) |
| Where does fallback live? | R level (`fallback_full()`) + runtime (`nmathopencl_has_opencl()`) | C++ level (`use_opencl && nmathopencl_has_opencl()`) |
| Caller can request optional fallback? | Yes (`fallback = TRUE/FALSE` argument) | Caller controls via `use_opencl` flag |
| Wrapper directly R-callable? | Yes (exported via `[[Rcpp::export]]`) | Not necessarily — may be purely internal C++ |

Both patterns guarantee that the package compiles and runs correctly on
any machine.  The GPU path is always optional; the CPU path always produces
a valid (if unaccelerated) result.

---

## Naming conventions

`nmathopencl` uses a consistent naming scheme to make the role of each
function clear:

| Name | Type | Role |
|---|---|---|
| `dnorm_opencl` | Exported R function | User-facing API; validates inputs; manages fallback |
| `.dnorm_opencl` | Internal R → C++ bridge | Rcpp export; positional R → C++ call only |
| `nmathopencl::dnorm_opencl` | C++ kernel wrapper | `#ifdef` guard; type conversion; runner dispatch |
| `nmathopencl::dnorm_kernel_runner` | C++ kernel runner | Full OpenCL lifecycle; `#ifdef USE_OPENCL` only |
| `Ex_EnvelopeEval` | Exported R function | User-facing API; passes `use_opencl` flag |
| `.EnvelopeEval_cpp` | Internal R → C++ bridge | Positional R → C++ call only |
| `f2_f3_opencl` | C++ kernel wrapper | OpenCL path; used inside dispatcher |
| `f2_f3_non_opencl` | C++ CPU implementation | CPU path; used inside same dispatcher |

The `.dot` prefix on internal R functions signals that they are not part of
the public API and will not appear in `?help` search or autocompletion.

For your own package, a consistent analogous scheme might be:

```
myfunc_opencl()        # exported R function  (if direct R use)
.myfunc_opencl()       # internal R → C++ bridge
mypkg::myfunc_opencl() # C++ kernel wrapper (in namespace)
mypkg::myfunc_runner() # C++ kernel runner   (in namespace, #ifdef only)
mypkg::myfunc_cpu()    # C++ CPU fallback    (if Pattern 2)
```

---

## Summary

Every kernel wrapper needs a CPU path.  The two patterns differ in where that
path lives and who controls the dispatch:

- **Pattern 1** puts the fallback logic in R, using the existing `stats::`
  ecosystem.  It is the right choice when the computation mirrors an existing
  R function and has direct R users.

- **Pattern 2** puts the fallback logic in C++, alongside a dedicated CPU
  implementation.  It is the right choice when the computation is novel,
  when it is called from a C++ simulation loop rather than directly from R,
  or when benchmarking between the two paths is important.

In both patterns the OpenCL infrastructure — the runner and the kernel — is
identical.  What differs is only how the wrapper is wired into the rest of
the package.

Chapter 12 describes the `nmathopencl` R API in full, showing how the
distribution-function wrappers are documented and organized.  Chapter 10
works through the `ex_glmbayes` pattern end-to-end.