---
title: "Chapter 00: nmathopencl --- Package Overview"
author: "Kjell Nygren"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Chapter 00: nmathopencl --- Package Overview}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment  = "#>"
)
```

## What is `nmathopencl`?

`nmathopencl` is a **developer library**: it ports R's internal `nmath`
(Mathlib) statistical math functions to OpenCL so that downstream R packages
can embed those functions inside their own custom GPU kernels. The primary
audience is **package authors** who want GPU-accelerated computation and need
statistical math functions available on the device side --- without having to
port the underlying nmath sources themselves.

A secondary audience is **end users** who want to call distribution functions
(`dnorm`, `pgamma`, `rbinom`, ...) directly on GPU hardware. The package
exports `*_opencl` wrappers for the full nmath family, but their main role is
**validation**: running them on large vectors confirms that the OpenCL pipeline
and GPU hardware are working before a downstream package is built. For modest
vector sizes the GPU often performs no better than the CPU, because the cost
of kernel compilation and host-to-device data transfer dominates. Meaningful
GPU acceleration of individual nmath calls requires very large workloads.
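
A quick way to use the wrappers for validation is to compare one of them against its `stats` counterpart on a large vector. The sketch below assumes that `dnorm_opencl()` takes the same `x`, `mean`, and `sd` arguments as `stats::dnorm()` (check the function's help page for the actual signature). The vector length and tolerance are arbitrary illustrative choices.

```{r, eval = FALSE}
# Large input: small vectors mostly measure compilation and transfer overhead.
x <- seq(-4, 4, length.out = 1e6)

# CPU reference from base R.
cpu_time <- system.time(cpu <- stats::dnorm(x, mean = 0, sd = 1))

if (requireNamespace("nmathopencl", quietly = TRUE)) {
  # Assumed signature: dnorm_opencl(x, mean, sd), mirroring stats::dnorm().
  gpu_time <- system.time(
    gpu <- nmathopencl::dnorm_opencl(x, mean = 0, sd = 1)
  )
  # TRUE when both paths agree to within the chosen tolerance.
  print(isTRUE(all.equal(cpu, gpu, tolerance = 1e-8)))
  # Elapsed seconds for each path.
  print(c(cpu = cpu_time[["elapsed"]], gpu = gpu_time[["elapsed"]]))
}
```

The first OpenCL call may include kernel compilation, so timing a second call as well is likely to give a fairer picture of steady-state cost.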

The real performance story is at the downstream package level. When nmath
calls are embedded *inside* larger GPU kernels --- alongside other expensive
device-side operations such as the gradient and envelope calculations in
`glmbayes` --- the GPU does the computation without the round-trip transfer
penalty, and substantial gains become possible. The design here supports that
pattern; the exported `*_opencl` functions demonstrate it works.

OpenCL is vendor-neutral: the same kernels run on NVIDIA, AMD, and Intel
hardware. When no OpenCL stack is present, the package still runs in CPU-only
mode, so it is safe to list as a dependency even in environments that lack a
GPU.

## Three-layer architecture

The package is organized in three layers, each corresponding to a set of
vignettes:

```
+----------------------------------------------+
|  Layer 3: Kernels  (inst/cl/src/)            |
|  __kernel functions for the R-callable API   |
+----------------------------------------------+
|  Layer 2: nmath library  (inst/cl/nmath/)    |
|  Ported nmath/Rmath functions as device-side |
|  OpenCL C functions                          |
+----------------------------------------------+
|  Layer 1: Upstream shims                     |
|  (inst/cl/R_shims/, R_ext/, System/, ...)    |
|  Type definitions, macros, and constants     |
|  that replace C headers unavailable in       |
|  OpenCL C                                    |
+----------------------------------------------+
```

Layer 1 is the foundation: it makes the rest of the ported code compile
under OpenCL's restricted C99 dialect without modification to the nmath
sources. Layer 2 is the library: ~180 `.cl` files implementing the full
suite of Mathlib functions. Layer 3 is the API surface: thin wrapper kernels
that map a GPU work-item index to an element of an input vector and call the
appropriate Layer 2 function.
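
The work-item mapping performed by a Layer 3 kernel can be pictured in plain R. The sketch below is only an emulation: `layer2_fun` and `emulate_kernel` are hypothetical names, `stats::dnorm()` stands in for a device-side nmath function, and the loop runs sequentially where a GPU would run one work-item per index in parallel.

```{r, eval = FALSE}
# Stand-in for a Layer 2 device function (hypothetical; uses the CPU version).
layer2_fun <- function(x, mean, sd) stats::dnorm(x, mean, sd)

# Emulated Layer 3 kernel: each loop index plays the role of one work-item.
emulate_kernel <- function(x, mean, sd) {
  out <- numeric(length(x))
  for (gid in seq_along(x)) {
    # Work-item `gid` reads one input element and writes one output element.
    out[gid] <- layer2_fun(x[gid], mean, sd)
  }
  out
}

x <- c(-1, 0, 1)
res <- emulate_kernel(x, mean = 0, sd = 1)
# The element-wise result matches the vectorised CPU call.
print(all.equal(res, stats::dnorm(x)))
```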

Downstream packages locate the Layer 2 sources at runtime with
`system.file("cl", package = "nmathopencl")` and assemble them into their
own OpenCL programs using `opencltools::load_kernel_library(..., package = "nmathopencl")`. They own the kernel
runners, R wrappers, and compilation lifecycle; `nmathopencl` simply provides
the portable math library they build on.
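
As a small illustration of the lookup step, the sketch below locates the installed sources and lists the Layer 2 files. It uses only base R. The `nmath` subdirectory name follows the `inst/cl/nmath/` layout described above, and the call to the loader itself is left out because its arguments are not covered in this chapter.

```{r, eval = FALSE}
# Root of the installed OpenCL sources; "" if nmathopencl is not installed.
cl_root <- system.file("cl", package = "nmathopencl")

if (nzchar(cl_root)) {
  # Files under inst/cl/nmath/ are installed to <cl_root>/nmath/.
  nmath_files <- list.files(file.path(cl_root, "nmath"), pattern = "\\.cl$")
  cat("Layer 2 files found:", length(nmath_files), "\n")
  print(head(nmath_files))
} else {
  message("nmathopencl is not installed; no OpenCL sources to list.")
}
```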

See **Chapter 03** for the detailed assembly model, including how the four
components of a complete kernel program (global configuration header, shims,
nmath subset, and kernel function) are concatenated and compiled at runtime.

## C++ layout inside the package DLL

| Layer | Location | Purpose |
|-------|----------|---------|
| **`nmathopencl`** | `nmathopencl.h`, `kernel_runners.cpp`, `kernel_wrappers.cpp` | Distribution-specific kernel runners and R-facing wrappers for all nmath functions |
| **Internal OpenCL infrastructure** | `openclPort.h`, `opencl_kernel_runners.cpp` | Generic kernel runner, error helpers, device probing, and kernel loading inside the DLL --- see **Chapter 09** |
| **`ex_glmbayes`** | `ex_glmbayes_*.cpp/.h` | Self-contained example showing how a downstream package (`glmbayes`) builds custom GLM kernels on top of the layers above |

Kernel authors whose package declares `LinkingTo: nmathopencl` may include
`openclPort.h` directly; the internal runner layer is documented in
**Chapter 09**.

## Related packages

`nmathopencl` is part of a small suite of cooperating packages:

| Package | Role | Typical entry points |
|---------|------|----------------------|
| **`nmathopencl`** (this package) | OpenCL-ported Mathlib, `*_opencl` validation API, kernel loaders, package-local device selection | `nmathopencl_has_opencl()`, `load_kernel_*`, `dnorm_opencl()` |
| **`opencltools`** ([CRAN](https://CRAN.R-project.org/package=opencltools)) | Host/runtime diagnostics and kernel-library authoring tools | `detect_environment_and_gpus()`, `verify_opencl_runtime()`, `load_library_for_kernel()`, `diagnose_glmbayes()` (opencltools-only report) |
| **`glmbayes`** ([CRAN](https://CRAN.R-project.org/package=glmbayes)) | End-user Bayesian GLMs with optional GPU paths | `glmb()`, `use_opencl = TRUE` |

**`nmathopencl` Imports `opencltools` (>= 0.8.0).** Host inventory, driver/ICD
checks, and PATH validation are delegated to **opencltools**; compile-time OpenCL
status for **this** package's DLL stays local via **`nmathopencl_has_opencl()`**. Host/runtime
probes (`detect_*`, PATH helpers, `gpu_names`, and related functions) are **not**
re-exported from **nmathopencl** --- call `opencltools::…` directly. Kernel-library
authoring helpers (`load_library_for_kernel`, `extract_library_subset`, and
related tagging tools) are re-exported for downstream kernel authors.

For OpenCL setup and enablement, start with **Chapter 01** of this package
(attach messages and the nmathopencl-specific enablement path), then see
**Chapter 01** of the **`opencltools`** vignettes (platform install details).

## R-side API families

The exported `*_opencl` functions cover the full nmath family and mirror
the structure of base R's `stats` package:

| R file | Functions |
|--------|-----------|
| `normal_opencl.R` | `dnorm_opencl`, `pnorm_opencl`, `qnorm_opencl`, `rnorm_opencl` |
| `gamma_opencl.R` | `dgamma_opencl`, `pgamma_opencl`, ... |
| `binomial_opencl.R` | `dbinom_opencl`, `pbinom_opencl`, ... |
| `poisson_opencl.R` | `dpois_opencl`, `ppois_opencl`, ... |
| `beta_opencl.R` | `dbeta_opencl`, ... |
| ... | (and so on for all families) |
| `special_opencl.R` | `lgammafn_opencl`, `gammafn_opencl`, ... |
| `math_support_opencl.R` | `fmax2_opencl`, `fmin2_opencl`, ... |

Every function accepts a scalar parameter set, dispatches to the GPU via the
kernel infrastructure, and falls back to the corresponding `stats::` or base-R
function if OpenCL is unavailable or if the call fails. As noted above, these
wrappers serve primarily as a working demonstration of the GPU pipeline; they
can show speedups at very large vector sizes but are not the primary mechanism
through which downstream packages obtain GPU acceleration.

## Checking OpenCL availability

```{r, eval = FALSE}
library(nmathopencl)

# Compile-time OpenCL support in this nmathopencl build
nmathopencl_has_opencl()

# Same check for the imported opencltools dependency
opencltools::has_opencl()

# Host/runtime diagnostic report (opencltools)
opencltools::diagnose_glmbayes()
```

- **`nmathopencl_has_opencl()`** (nmathopencl) --- was **this** package built with OpenCL
  (`-DUSE_OPENCL`)?
- **`opencltools::has_opencl()`** --- was the imported dependency built with
  OpenCL?
- **`opencltools::diagnose_glmbayes()`** --- host/runtime report from **opencltools**.

Host and driver inventory (`detect_environment_and_gpus()`,
`verify_opencl_runtime()`, and related probes) live in **`opencltools`** --- use
`opencltools::…` when calling them directly. All exported `*_opencl` wrappers
branch on `nmathopencl_has_opencl()` first. If the package was built without
OpenCL, the wrappers use the CPU path and ignore the `fallback` argument.
Otherwise, `fallback` controls whether a failed OpenCL call is replaced with
the CPU path.
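
The branching just described can be sketched in plain R. This is an illustration of the pattern, not the package's source: `dispatch_sketch`, `gpu_fun`, `cpu_fun`, and `failing_gpu` are hypothetical names, and the sketch assumes that a failed OpenCL call with `fallback = FALSE` surfaces as an R error.

```{r, eval = FALSE}
# Hypothetical dispatch helper mirroring the behaviour described in the text.
dispatch_sketch <- function(x, gpu_fun, cpu_fun, has_opencl, fallback = TRUE) {
  if (!has_opencl) {
    # Built without OpenCL: always use the CPU path; `fallback` is ignored.
    return(cpu_fun(x))
  }
  tryCatch(
    gpu_fun(x),
    error = function(e) {
      # Assumption: without fallback, the OpenCL failure is re-raised.
      if (!fallback) stop(e)
      cpu_fun(x)
    }
  )
}

# Simulated OpenCL failure, used to exercise the fallback branch.
failing_gpu <- function(x) stop("simulated OpenCL failure")

# Falls back to the CPU result because `fallback` defaults to TRUE.
res_fallback <- dispatch_sketch(0, failing_gpu, stats::dnorm, has_opencl = TRUE)

# Without compile-time OpenCL the CPU path is used even if fallback = FALSE.
res_no_opencl <- dispatch_sketch(0, failing_gpu, stats::dnorm,
                                 has_opencl = FALSE, fallback = FALSE)
```

In real use, the value passed as `has_opencl` would come from `nmathopencl_has_opencl()`.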

See **Chapter 01** for the step-by-step enablement path (attach messages,
opencltools first, then source reinstall of nmathopencl).

## Vignette guide

**Part 0: Overview**

| Vignette | Topic |
|----------|-------|
| Chapter 00 (this document) | Package overview and architecture |

**Part I: Getting Started**

| Vignette | Topic |
|----------|-------|
| Chapter 01 | OpenCL enablement for `nmathopencl` (attach messages, `opencltools` dependency, source reinstall) |
| Chapter 02 | Adding `USE_OPENCL` and `has_opencl()` to your package: `configure` scripts, `opencltools` runtime relationship |

**Part II: The Library and Program Model**

| Vignette | Topic |
|----------|-------|
| Chapter 03 | Structure of `nmath` kernel programs: the four-layer assembly model |
| Chapter 04 | The `nmath` OpenCL library (`inst/cl/nmath/`): cycles, shims, and annotation |

**Part III: Developer Guide**

| Vignette | Topic |
|----------|-------|
| Chapter 05 | Kernels, kernel runners, and kernel wrappers: roles and interaction |
| Chapter 06 | Integrating kernel wrappers into your codebase: CPU fallbacks and R interfaces |
| Chapter 07 | Writing and annotating `__kernel` functions |
| Chapter 08 | Kernel loading: `load_kernel_source` and `load_kernel_library` |
| Chapter 09 | Generic OpenCL kernel runners: the `openclPort` C++ infrastructure |
| Chapter 10 | Case study: building custom GLM kernels (`ex_glmbayes`) |
| Chapter 11 | Testing, debugging, and benchmarking GPU kernels |

**Part IV: The R API**

| Vignette | Topic |
|----------|-------|
| Chapter 12 | The `nmathopencl` R API: distribution functions on the GPU |