GP (1) Gaussian Process

gaussian process

  • a gaussian process is a nonparametric stochastic process that uses the mean and covariance structure of function values as prior information for inferring the distribution over functions: the function values at any finite input set are assumed to jointly follow a multivariate normal distribution (a sampling sketch follows this list).

  • function definition:

    \[y_{i}=f(X_{i})+\epsilon_{i}, \quad \epsilon_{i} \sim \mathcal{N}(0,\sigma_{N}^{2})\]
  • function value vector:

    \[\begin{aligned} \mathbf{f} =\begin{bmatrix} f(X_{1}) & f(X_{2}) & \cdots & f(X_{N}) \end{bmatrix}^{T} \end{aligned}\]
  • multivariate normal distribution assumption:

    \[\mathbf{f}\sim\mathcal{N}(M(X),K_{XX})\]
  • function $f(\cdot)$ is assumed to follow a gaussian process with mean function $m(\cdot)$ and covariance function $k(\cdot, \cdot^{\prime})$:

    \[f(\cdot)\sim\mathcal{GP}(m(\cdot),k(\cdot,\cdot^{\prime}))\]
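The prior above is fully specified by $m(\cdot)$ and $k(\cdot,\cdot^{\prime})$, so functions can be drawn from it directly. A minimal numpy sketch, assuming a zero mean function and an RBF (squared-exponential) kernel, neither of which is fixed by the definitions above:

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel: k(x, x') = variance * exp(-(x - x')^2 / (2 * lengthscale^2))."""
    return variance * np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2 / lengthscale ** 2)

# Inputs X_1..X_N (1-D for simplicity) with a zero mean function M(X) = 0.
X = np.linspace(-3.0, 3.0, 50)
M_X = np.zeros_like(X)
K_XX = rbf_kernel(X, X)

# f ~ N(M(X), K_XX); the small jitter term keeps the Cholesky factorization numerically stable.
L = np.linalg.cholesky(K_XX + 1e-8 * np.eye(len(X)))
f_samples = M_X[:, None] + L @ np.random.randn(len(X), 3)   # three draws of f

# y_i = f(X_i) + eps_i with eps_i ~ N(0, sigma_n^2)
sigma_n = 0.1
y_samples = f_samples + sigma_n * np.random.randn(*f_samples.shape)
```

Each column of `f_samples` is one draw of the function value vector $\mathbf{f}$, and `y_samples` adds the observation noise $\epsilon_{i}$.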

Mercer’s theorem

  • Mercer’s theorem states that a symmetric, positive definite kernel function can be expanded in eigenfunctions, i.e., expressed as an inner product of the feature maps that form a basis of the RKHS.

    \[\begin{aligned} k(x,x^{\prime}) &=\sum_{i=1}^{\infty}{\lambda_{i}\phi_{i}(x)\phi_{i}(x^{\prime})}\\ &=\langle\phi(x),\phi(x^{\prime})\rangle_{\mathcal{H}} \end{aligned}\]
  • Covariance is the expectation of the product of centered random variables.

    \[\begin{aligned} \mathrm{Cov}\left[A,B\right] &=\mathbb{E}\left[(A-\mu_{A})(B-\mu_{B})\right] \end{aligned}\]
  • In a function space, a function is represented as a linear combination of basis functions.

    \[\begin{aligned} f(x) &=\sum_{i=1}^{\infty}{\beta_{i}\phi_{i}(x)} \quad \begin{cases}\text{$\beta_{i}$ is stochastic}\\\text{$\phi_{i}$ is deterministic}\end{cases} \end{aligned}\]
  • Therefore, the covariance of the function values is derived as follows:

    \[\begin{aligned} \mathrm{Cov}\left[f(x),f(x^{\prime})\right] &=\mathbb{E}\left([f(x)-m(x)][f(x^{\prime})-m(x^{\prime})]\right)\\ &=\sum_{i=1}^{\infty}\sum_{j=1}^{\infty}{\mathbb{E}\left[(\beta_{i}-\mu_{i})(\beta_{j}-\mu_{j})\right]\phi_{i}(x)\phi_{j}(x^{\prime})}\\ &=\sum_{i=1}^{\infty}{\lambda_{i}\phi_{i}(x)\phi_{i}(x^{\prime})}\\ \\ \because\mathbb{E}\left[(\beta_{i}-\mu_{i})(\beta_{j}-\mu_{j})\right] &=\mathrm{Cov}\left[\beta_{i},\beta_{j}\right]\\ &=\begin{cases}\mathrm{Var}\left[\beta_{i}\right]\quad &i=j\\ 0 \quad &i \ne j\end{cases} \quad\mathrm{s.t.}\quad \beta_{i}\perp\beta_{j} \end{aligned}\]
  • By Mercer’s theorem, the covariance function of $f$ is therefore itself a kernel function (a numerical check of this identity follows this list).

    \[\begin{aligned} k(x,x^{\prime}) =\sum_{i=1}^{\infty}{\lambda_{i}\phi_{i}(x)\phi_{i}(x^{\prime})} =\mathrm{Cov}\left[f(x),f(x^{\prime})\right] \end{aligned}\]
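A small numerical check of this identity, assuming an illustrative truncated cosine basis and hand-picked weights $\lambda_{i}$ (neither comes from the derivation above): draw independent $\beta_{i}\sim\mathcal{N}(0,\lambda_{i})$, estimate $\mathrm{Cov}[f(x),f(x^{\prime})]$ by Monte Carlo, and compare it with $\sum_{i}\lambda_{i}\phi_{i}(x)\phi_{i}(x^{\prime})$:

```python
import numpy as np

rng = np.random.default_rng(0)

# A finite, deterministic basis (illustrative choice; the infinite expansion is truncated).
def phi(x, i):
    return np.cos(i * x)

n_basis = 10
lambdas = 1.0 / (np.arange(1, n_basis + 1) ** 2)   # eigenvalue-like weights lambda_i
x, x_prime = 0.3, 1.2

# Monte Carlo estimate of Cov[f(x), f(x')] with independent beta_i ~ N(0, lambda_i).
n_draws = 200_000
betas = rng.normal(size=(n_draws, n_basis)) * np.sqrt(lambdas)
phi_x = np.array([phi(x, i) for i in range(1, n_basis + 1)])
phi_xp = np.array([phi(x_prime, i) for i in range(1, n_basis + 1)])
cov_mc = np.cov(betas @ phi_x, betas @ phi_xp)[0, 1]

# Direct evaluation of k(x, x') = sum_i lambda_i phi_i(x) phi_i(x').
k_direct = float(np.sum(lambdas * phi_x * phi_xp))

print(cov_mc, k_direct)   # the two values agree up to Monte Carlo error
```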

bayes rule

  • prior represents a probabilistic assumption about the central tendency $m(\cdot)$ and dispersion $k(\cdot,\cdot^{\prime})$ of the function values:

    \[\begin{aligned} p(f) &=\mathcal{N}(M(X),K_{XX}) \end{aligned}\]
  • likelihood:

    \[\begin{aligned} p(y \mid f) &=\mathcal{N}(f,\sigma_{N}^{2}\mathbf{I}) \end{aligned}\]
  • evidence (marginal likelihood; a log-evidence sketch follows this list):

    \[\begin{aligned} p(y) &=\int{p(y \mid f)p(f)\mathrm{d}f}\\ &=\mathcal{N}(M(X),K_{XX}+\sigma_{N}^{2}\mathbf{I}) \end{aligned}\]
  • posterior represents the updated probabilistic assumption about the mean and covariance structure of the function after conditioning on the observations:

    \[\begin{aligned} p(f \mid y) &=\frac{p(y \mid f)p(f)}{p(y)}\\ &= \mathcal{N}(M^{\prime}(X),K^{\prime}_{XX}) \end{aligned}\]
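Because the evidence is a Gaussian in closed form, its log density can be evaluated directly; a minimal sketch, assuming $K_{XX}$ has already been computed (the helper name `log_evidence` is illustrative):

```python
import numpy as np

def log_evidence(y, M_X, K_XX, sigma_n):
    """log p(y) for y ~ N(M(X), K_XX + sigma_n^2 I), evaluated via a Cholesky factorization."""
    n = len(y)
    L = np.linalg.cholesky(K_XX + sigma_n ** 2 * np.eye(n))
    resid = y - M_X
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, resid))   # (K_XX + sigma_n^2 I)^{-1} (y - M(X))
    return (-0.5 * resid @ alpha
            - np.sum(np.log(np.diag(L)))                      # equals -0.5 * log|K_XX + sigma_n^2 I|
            - 0.5 * n * np.log(2.0 * np.pi))
```

In practice this log evidence is the quantity that is typically maximized to choose the kernel hyperparameters and the noise level $\sigma_{N}$.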

joint prob. dist. application

  • joint prob. dist. of multivariate normal dist.:

    \[\begin{bmatrix}X_{A}\\X_{B}\end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\mu_{A}\\ \mu_{B}\end{bmatrix},\begin{bmatrix}\Sigma_{AA}&\Sigma_{AB}\\ \Sigma_{BA}&\Sigma_{BB}\end{bmatrix}\right)\]
  • analytical representation of conditional dist.:

    \[p(X_{A}\mid X_{B})=\mathcal{N}(\mu_{A}+\Sigma_{AB}\Sigma_{BB}^{-1}\cdot(X_{B}-\mu_{B}),\Sigma_{AA}-\Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA})\]
  • application to posterior inference (we condition on $y$ rather than $\mathbf{f}$, because only $y$ is observable; a Cholesky-based implementation of the resulting mean and covariance follows this list):

    \[\begin{bmatrix}u\\y\end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}M(Z)\\ M(X)\end{bmatrix},\begin{bmatrix}K_{ZZ}&K_{ZX}\\ K_{XZ}&K_{XX} + \sigma_{N}^{2}\mathbf{I}\end{bmatrix}\right) ,\quad u=f(Z)\]
  • therefore:

    \[p(u \mid y) =\mathcal{N}(M^{\prime}(Z),K^{\prime}_{ZZ})\]
  • posterior mean function:

    \[M^{\prime}(Z) =M(Z) + K_{ZX}\cdot(K_{XX}+\sigma_{N}^{2}\mathbf{I})^{-1}\cdot(y-M(X))\]
  • posterior cov. function:

    \[K^{\prime}_{ZZ} =K_{ZZ}-K_{ZX}\cdot(K_{XX}+\sigma_{N}^{2}\mathbf{I})^{-1}K_{XZ}\]
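A numpy sketch of these two expressions (the function name `gp_posterior` and its argument names are illustrative; the Cholesky-based triangular solves stand in for the explicit inverse $(K_{XX}+\sigma_{N}^{2}\mathbf{I})^{-1}$ for numerical stability):

```python
import numpy as np

def gp_posterior(K_ZZ, K_ZX, K_XX, M_Z, M_X, y, sigma_n):
    """Posterior mean M'(Z) and covariance K'_ZZ at test inputs Z, following the equations above."""
    A = K_XX + sigma_n ** 2 * np.eye(len(y))
    L = np.linalg.cholesky(A)
    # alpha = A^{-1} (y - M(X)); V is chosen so that V.T @ V = K_ZX A^{-1} K_XZ.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - M_X))
    V = np.linalg.solve(L, K_ZX.T)
    mean = M_Z + K_ZX @ alpha
    cov = K_ZZ - V.T @ V
    return mean, cov
```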

predictive dist.

prior predictive dist. (unobs)

  • plate diagram:

    \[f \rightarrow \cancel{y}, \quad f_{*} \rightarrow y_{*}, \quad f_{*} \leftrightarrow f\]
  • prior:

    \[\begin{aligned} p(f_{*}) &= \int{p(f_{*}\mid f)p(f)\mathrm{d}f}\\ &=\mathcal{N}(M(X_{*}),K_{**}) \end{aligned}\]
  • prior predictive dist. (sampled in the sketch after this list):

    \[\begin{aligned} p(y_{*}) &= \int{p(y_{*}\mid f_{*})p(f_{*})\mathrm{d}f_{*}}\\ &= \mathcal{N}(M(X_{*}),K_{**}+\sigma_{N}^{2}\mathbf{I}) \end{aligned}\]
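A minimal sampling sketch for this prior predictive, again assuming a zero mean function and an RBF kernel (illustrative choices, as before):

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    return variance * np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2 / lengthscale ** 2)

# Test inputs X_* with zero mean M(X_*) = 0.
X_star = np.linspace(-3.0, 3.0, 40)
M_star = np.zeros_like(X_star)
sigma_n = 0.1

# y_* ~ N(M(X_*), K_** + sigma_n^2 I): the noise variance is added before sampling.
cov = rbf_kernel(X_star, X_star) + sigma_n ** 2 * np.eye(len(X_star))
L = np.linalg.cholesky(cov)
y_star_samples = M_star[:, None] + L @ np.random.randn(len(X_star), 5)   # five draws
```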

posterior predictive dist. (obs)

  • plate diagram:

    \[f \rightarrow y, \quad f_{*} \rightarrow y_{*}, \quad f_{*} \leftrightarrow f\]
  • conditional prior:

    \[\begin{aligned} p(f_{*} \mid y) &= \int{p(f_{*}\mid f)p(f \mid y)\mathrm{d}f}\\ &= \mathcal{N}(M^{\prime}(X_{*}),K^{\prime}_{**}) \end{aligned}\]
  • posterior predictive dist. (sampled end-to-end in the sketch after this list):

    \[\begin{aligned} p(y_{*} \mid y) &= \int{p(y_{*}\mid f_{*})p(f_{*} \mid y)\mathrm{d}f_{*}}\\ &= \mathcal{N}(M^{\prime}(X_{*}),K^{\prime}_{**} + \sigma_{N}^{2}\mathbf{I}) \end{aligned}\]
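An end-to-end sketch of the posterior predictive, repeating the posterior equations of the previous section and assuming a zero prior mean, an RBF kernel, and small illustrative training data:

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    return variance * np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2 / lengthscale ** 2)

# Illustrative observations y_i = f(X_i) + eps_i and test inputs X_*.
X = np.array([-2.0, -0.5, 1.0, 2.5])
y = np.sin(X) + 0.1 * np.random.randn(len(X))
X_star = np.linspace(-3.0, 3.0, 40)
sigma_n = 0.1

K_y = rbf_kernel(X, X) + sigma_n ** 2 * np.eye(len(X))   # K_XX + sigma_n^2 I
K_sX = rbf_kernel(X_star, X)                              # K_ZX with Z = X_*
K_ss = rbf_kernel(X_star, X_star)                         # K_ZZ

# Posterior mean M'(X_*) and covariance K'_** (zero prior mean, so M(X) = M(X_*) = 0).
L = np.linalg.cholesky(K_y)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
V = np.linalg.solve(L, K_sX.T)
mean = K_sX @ alpha
cov = K_ss - V.T @ V

# y_* | y ~ N(M'(X_*), K'_** + sigma_n^2 I): add the observation noise before sampling.
Lp = np.linalg.cholesky(cov + sigma_n ** 2 * np.eye(len(X_star)))
y_star_samples = mean[:, None] + Lp @ np.random.randn(len(X_star), 5)   # five draws
```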
This post is licensed under CC BY 4.0 by the author.