Chapter 02-S02: Normal–Normal Conjugacy for One Mean

library(glmbayes)
if (requireNamespace("bayesrules", quietly = TRUE)) library(bayesrules)

1. The model

1.1 Likelihood

Suppose we observe \(n\) measurements \(y_1, \dots, y_n\) drawn independently from a Normal distribution with unknown mean \(\mu\) and known standard deviation \(\sigma\):

\[ Y_i \mid \mu \;\sim\; \mathcal{N}(\mu,\, \sigma^2), \qquad i = 1,\dots,n. \]

The sufficient statistic is the sample mean \(\bar y\), which itself follows

\[ \bar{y} \mid \mu \;\sim\; \mathcal{N}\!\left(\mu,\, \frac{\sigma^2}{n}\right). \]

In practice \(\sigma\) is rarely truly known, but treating it as fixed provides the cleanest template for understanding posterior updating. The full treatment — unknown mean and unknown variance — is handled in Chapter 3 with lmb().

1.2 Prior

Since \(\mu\) can take any real value, the Normal distribution is the natural conjugate prior:

\[ \mu \;\sim\; \mathcal{N}(\mu_0,\, \tau_0^2). \]

Here \(\mu_0\) is your prior best guess and \(\tau_0\) encodes your uncertainty about it.

1.3 Posterior derivation

The key is to work with precision (reciprocal variance): \(\kappa = 1/\sigma^2\).

Symbol	Formula	Meaning
\(\kappa_0 = 1/\tau_0^2\)	Prior precision	Information in the prior
\(\kappa_L = n/\sigma^2\)	Likelihood precision	Information in the data

Multiplying the Normal likelihood by the Normal prior kernel and completing the square gives the conjugate posterior:

\[ \mu \mid \bar{y} \;\sim\; \mathcal{N}\!\left(\mu_n,\; \frac{1}{\kappa_n}\right), \]

where

\[ \kappa_n \;=\; \kappa_0 + \kappa_L \qquad\text{(precisions add)} \]

\[ \mu_n \;=\; \frac{\kappa_0\,\mu_0 + \kappa_L\,\bar{y}}{\kappa_n} \;=\; \frac{\kappa_0}{\kappa_n}\,\mu_0 \;+\; \frac{\kappa_L}{\kappa_n}\,\bar{y}. \]

The posterior mean \(\mu_n\) is a precision-weighted average of the prior mean and the data mean.

Intuition.

Precisions add because information is additive: each independent source of knowledge about \(\mu\) contributes its own precision.
A very diffuse prior (\(\tau_0 \to \infty\), \(\kappa_0 \to 0\)) gives \(\mu_n \approx \bar y\) — the data take over entirely.
A very small sample (\(n\) small) or noisy data (\(\sigma\) large) keeps \(\kappa_L\) small, so the prior retains more influence.
As \(n \to \infty\), \(\mu_n \to \bar y\) and the posterior variance \(1/\kappa_n \to 0\) — the Bayesian estimate converges to the classical MLE.

2. `bayesrules` illustration

The bayesrules package provides plot_normal_normal() and summarize_normal_normal() for quickly visualizing and summarizing the Normal–Normal update.

Scenario. A sleep researcher’s prior belief is that undergraduates average about 7 hours of sleep per night. They encode this as \(\mu_0 = 7\), \(\tau_0 = 1.5\) — fairly uncertain, allowing for means anywhere from 4 to 10 hours. The population standard deviation is assumed known at \(\sigma = 1.8\) hours (from large-scale studies). A sample of \(n = 22\) students yields a sample mean of \(\bar y = 6.1\) hours.

prior_mean <- 7;    prior_sd <- 1.5   ## prior: N(7, 1.5^2)
sigma      <- 1.8                     ## known population SD
y_bar      <- 6.1;  n_sleep  <- 22L   ## observed sample

library(bayesrules)
## Overlay prior, likelihood, and posterior
plot_normal_normal(
  mean       = prior_mean,
  sd         = prior_sd,
  sigma      = sigma,
  y_bar      = y_bar,
  n          = n_sleep,
  prior      = TRUE,
  likelihood = TRUE,
  posterior  = TRUE
)

## Tabular summary of prior, likelihood, and posterior parameters
summarize_normal_normal(
  mean  = prior_mean,
  sd    = prior_sd,
  sigma = sigma,
  y_bar = y_bar,
  n     = n_sleep
)
#>       model    mean    mode       var        sd
#> 1     prior 7.00000 7.00000 2.2500000 1.5000000
#> 2 posterior 6.15529 6.15529 0.1382253 0.3717866

Reading the output. The summarize_normal_normal() table reports the prior mean and SD, the likelihood (data) mean and SE, and the resulting posterior mean and SD. The plot overlays all three: the wide prior, the sharper likelihood centered at 6.1, and the posterior — narrower than both and sitting between them, pulled strongly toward the data because 22 observations outweigh the prior.

Analytic posterior.

prec_0 <- 1 / prior_sd^2
prec_L <- n_sleep / sigma^2
prec_n <- prec_0 + prec_L

post_mean <- (prec_0 * prior_mean + prec_L * y_bar) / prec_n
post_sd   <- sqrt(1 / prec_n)

nn_analytic <- data.frame(
  Example = "Sleep hours (BayesRules)",
  n = n_sleep,
  `y-bar` = y_bar,
  Posterior = sprintf("N(%.4f, %.4f^2)", post_mean, post_sd),
  Mean = post_mean,
  SD = post_sd,
  check.names = FALSE
)
knitr::kable(nn_analytic, digits = 4,
  caption = "Conjugate Normal--Normal posterior (known sigma)")

Conjugate Normal–Normal posterior (known sigma)
Example	n	y-bar	Posterior	Mean	SD
Sleep hours (BayesRules)	22	6.1	N(6.1553, 0.3718^2)	6.1553	0.3718

3. Real data: Old Faithful waiting times

The faithful dataset (built into R) records waiting times between eruptions of the Old Faithful geyser in Yellowstone National Park. We treat the waiting time \(Y_i\) as approximately Normal with unknown mean \(\mu\) and use a Normal–Normal conjugate model.

Prior. Based on decades of visitor records, a geologist believes the mean waiting time is around 72 minutes, with considerable uncertainty (\(\tau_0 = 15\) minutes). We estimate the population SD from the data and treat it as fixed at \(\hat\sigma\).

y_faith <- faithful$waiting
n_faith <- length(y_faith)

ybar_faith  <- mean(y_faith)
sigma_faith <- sd(y_faith)           ## treat as known

prior_mean_f <- 72;  prior_sd_f <- 15   ## N(72, 15^2) prior

cat(sprintf("n = %d,  y-bar = %.2f,  sigma = %.2f\n",
            n_faith, ybar_faith, sigma_faith))
#> n = 272,  y-bar = 70.90,  sigma = 13.59

Analytic posterior.

prec_0f <- 1 / prior_sd_f^2
prec_Lf <- n_faith / sigma_faith^2
prec_nf <- prec_0f + prec_Lf

post_mean_f <- (prec_0f * prior_mean_f + prec_Lf * ybar_faith) / prec_nf
post_sd_f   <- sqrt(1 / prec_nf)

faith_analytic <- data.frame(
  Dataset = "Old Faithful waiting",
  n = n_faith,
  `y-bar` = ybar_faith,
  Posterior = sprintf("N(%.4f, %.4f^2)", post_mean_f, post_sd_f),
  Mean = post_mean_f,
  SD = post_sd_f,
  check.names = FALSE
)
knitr::kable(faith_analytic, digits = 4,
  caption = "Conjugate Normal--Normal posterior (known sigma)")

Conjugate Normal–Normal posterior (known sigma)
Dataset	n	y-bar	Posterior	Mean	SD
Old Faithful waiting	272	70.8971	N(70.9004, 0.8231^2)	70.9004	0.8231

Reading the output. With \(n = 272\) observations, the likelihood precision dominates the prior, so the posterior mean is almost identical to \(\bar y\). The posterior SD is small — the data overwhelm the prior.

3.1 Fitting with `lmb()`

In glmbayes, the intercept-only model uses lmb() with dNormal() and fixed dispersion \(\sigma^2\):

df_faith <- data.frame(y = y_faith)

mu_f    <- matrix(prior_mean_f, nrow = 1, ncol = 1,
                  dimnames = list(NULL, "(Intercept)"))
Sigma_f <- matrix(prior_sd_f^2, nrow = 1, ncol = 1,
                  dimnames = list("(Intercept)", "(Intercept)"))

pf_faith <- dNormal(mu = mu_f, Sigma = Sigma_f,
                    dispersion = sigma_faith^2)

set.seed(2026)
fit_faith <- lmb(
  n       = 20000,
  y ~ 1,
  data    = df_faith,
  pfamily = pf_faith
)
print(fit_faith)
#> 
#> Call:  
#> lmb(formula = y ~ 1, n = 20000, data = df_faith)
#> 
#> Posterior Mean Coefficients:
#> (Intercept)  
#>       70.91

faith_compare <- data.frame(
  Dataset = "Old Faithful waiting",
  Posterior = faith_analytic$Posterior,
  `Analytic Mean` = faith_analytic$Mean,
  `Analytic SD`   = faith_analytic$SD,
  `lmb Post.Mean` = fit_faith$coef.means["(Intercept)"],
  `lmb Post.Sd`   = sd(fit_faith$coefficients[, "(Intercept)", drop = TRUE]),
  check.names = FALSE
)
knitr::kable(faith_compare, digits = 4,
  caption = "Analytic vs. lmb() posterior mean and SD")

Analytic vs. lmb() posterior mean and SD
	Dataset	Posterior	Analytic Mean	Analytic SD	lmb Post.Mean	lmb Post.Sd
(Intercept)	Old Faithful waiting	N(70.9004, 0.8231^2)	70.9004	0.8231	70.9091	0.8261

The lmb Post.Mean and lmb Post.Sd columns should match the analytic Mean and SD to Monte Carlo error.

4. Connection to glmbayes

This scalar case is the template for Gaussian linear models with lmb() (Chapter 3). In regression, the single unknown \(\mu\) is replaced by a coefficient vector \(\beta \in \mathbb{R}^p\), the scalar prior variance \(\tau_0^2\) becomes a \(p \times p\) prior covariance matrix \(\Sigma_0\), and the precision-weighted average becomes the matrix formula \(\mu_n = (\Sigma_0^{-1} + X^\top X / \sigma^2)^{-1}(\Sigma_0^{-1}\mu_0 + X^\top y / \sigma^2)\) — but the logic is identical.