GP (2) Sparse Gaussian Process
sparse gaussian process
- sparse gaussian process is a technique that achieves the following two advantages by introducing $M \ll N$ inducing points that can represent all $N$ (potentially $N\to\infty$) observations:
- the infinite-dimensional function space is projected onto a low-dimensional inducing-point space, thereby ensuring computational efficiency.
- uncertainty is confined to the low-dimensional inducing-point space, thereby ensuring inference stability.
- nonparametric function:
\[f(X)=\sum_{i=1}^{\infty}{\beta_{i}k(X,X_{i})}\]
- the nonparametric function space $\mathcal{F}$, with $f \in \mathcal{F}$, is an infinite-dimensional function space spanned by infinitely many basis functions:
\[\mathcal{F} =\mathrm{span}\left\{k(\cdot,X_{1}),k(\cdot,X_{2}),\cdots\right\}\]
- inducing point approximation:
\[\sum_{i=1}^{\infty}{\beta_{i}k(X,X_{i})} \approx\sum_{i=1}^{M}\gamma_{i}k(X,Z_{i})\]
- the infinite-dimensional function space is thus approximated by a finite-dimensional space whose basis functions are the kernel functions centered at the $M$ inducing points (see the sketch after this list):
\[\mathcal{F}\approx\mathrm{span}\left\{k(\cdot,Z_{1}),k(\cdot,Z_{2}),\cdots,k(\cdot,Z_{M})\right\}\]
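As a concrete illustration of the inducing point approximation, here is a minimal NumPy sketch (the RBF kernel, the toy data, and the least-squares fit for $\gamma$ are assumptions made for illustration, not the GP conditioning formula derived later): a kernel expansion over $N$ points is re-expressed in the span of $M$ inducing-point basis functions.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel matrix k(A, B) for 1-D inputs."""
    sq = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(0)
N, M = 500, 20                          # many observations, few inducing points
X = np.sort(rng.uniform(-3, 3, N))      # observation inputs
Z = np.linspace(-3, 3, M)               # inducing inputs
beta = rng.normal(size=N)               # coefficients of the "full" expansion

Xs = np.linspace(-3, 3, 200)            # evaluation grid
f_full = rbf_kernel(Xs, X) @ beta       # f(x) = sum_i beta_i k(x, X_i)

# project onto span{k(., Z_1), ..., k(., Z_M)} by solving a small least-squares problem
Psi = rbf_kernel(Xs, Z)                 # basis functions centered at inducing points
gamma, *_ = np.linalg.lstsq(Psi, f_full, rcond=None)
f_sparse = Psi @ gamma                  # sum_j gamma_j k(x, Z_j)

rel_err = np.max(np.abs(f_full - f_sparse)) / np.max(np.abs(f_full))
print("relative error of the M-term approximation:", rel_err)
```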
idea
- representativeness of information: the statistical assumption that the correlation structure between samples arises only through the representative values (the inducing variables $u$), and accordingly the samples are conditionally independent given those representative values.
\[\begin{gathered} f_{i} \leftrightarrow f_{j}\\ \Downarrow\\ f_{i} \leftarrow u \rightarrow f_{j} \end{gathered}\]
- conditional independence: when information C is given, two pieces of information A and B become independent of each other; in other words, C "breaks" or "absorbs" the dependence between A and B (see the sketch after this list).
\[\begin{gathered} f_{i} \perp f_{j}\mid u\\ \Updownarrow\\ \mathrm{Cov}(f_{i},f_{j}\mid u) =\underbrace{K_{XX}-\overbrace{K_{XZ}K_{ZZ}^{-1}K_{ZX}}^{\text{low-rank approx.}}}_{\text{residual covariance}}\approx 0\\ \Updownarrow\\ p(f_{i},f_{j}\mid u)=p(f_{i}\mid u)p(f_{j}\mid u) \end{gathered}\]
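A minimal sketch of the conditional-independence assumption (the RBF kernel, its lengthscale, and the inducing locations are illustrative assumptions): when the inducing points cover the inputs well, the residual covariance $K_{XX}-K_{XZ}K_{ZZ}^{-1}K_{ZX}$ is close to zero.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    sq = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 200))              # observation inputs
Z = np.linspace(-3, 3, 15)                        # inducing inputs covering the same range

K_XX = rbf_kernel(X, X)
K_XZ = rbf_kernel(X, Z)
K_ZZ = rbf_kernel(Z, Z) + 1e-8 * np.eye(len(Z))   # jitter for numerical stability

# residual covariance Cov(f | u) = K_XX - K_XZ K_ZZ^{-1} K_ZX
Q_XX = K_XZ @ np.linalg.solve(K_ZZ, K_XZ.T)
residual = K_XX - Q_XX
print("largest residual covariance entry:", np.max(np.abs(residual)))
# with well-placed inducing points this is ~0, i.e. f_i and f_j are
# approximately conditionally independent given u
```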
low-rank functional approximation
- inducing observations $u$, which represent the full observations $f$:
\[u=f(Z) \sim \mathcal{N}(m(Z),K_{ZZ})\]
- to compress the function space into the inducing-point space:
\[\begin{aligned} p(f) \to p(f \mid u) \end{aligned}\]
- joint prob. dist. application:
\[\begin{bmatrix}f\\u\end{bmatrix}\sim\mathcal{N}\left(\begin{bmatrix}m(X)\\m(Z)\end{bmatrix},\begin{bmatrix}K_{XX}&K_{XZ}\\K_{ZX}&K_{ZZ}\end{bmatrix}\right)\]
- $f$ can be expressed as a linear combination of $u$:
\[\mathbb{E}\left[f \mid u\right] = m(X)+K_{XZ}K_{ZZ}^{-1}(u-m(Z))\]
- to convert from the function-value perspective $f$ to the function perspective $f(\cdot)$ (the mean term is dropped in the approximation, e.g. for a zero-mean prior):
\[\begin{aligned} \mathbb{E}\left[f(\cdot)\mid f(Z)\right] &= m(\cdot) + \sum_{i=1}^{M}{\gamma_{i}k(\cdot,Z_{i})}\\ &\approx \sum_{i=1}^{M}{\gamma_{i}k(\cdot,Z_{i})}\\ \gamma_{i} &= \sum_{j=1}^{M}{(K_{ZZ}^{-1})_{i,j}\left(u_{j}-m(Z_{j})\right)} \end{aligned}\]
- therefore the infinite-dimensional space spanned by the kernel functions $\phi(\cdot)=k(\cdot,X)$ for all inputs $X$ can be approximated by the finite-dimensional space spanned by the kernel functions $\psi(\cdot)=k(\cdot,Z)$ at the inducing points $Z$ with $u=f(Z)$:
\[\mathcal{F}\approx\mathrm{span}\left\{k(\cdot,Z_{1}),k(\cdot,Z_{2}),\cdots,k(\cdot,Z_{M})\right\}\]
- under conditional independence, uncertainty outside the inducing-point space is eliminated (see the sketch after this list):
\[\mathrm{Cov}\left[f \mid u\right] = K_{XX}-K_{XZ}K_{ZZ}^{-1}K_{ZX} \approx 0\]
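A minimal sketch of the functional approximation above (zero mean function, RBF kernel, and inducing locations are illustrative assumptions): the conditional mean $\mathbb{E}[f(\cdot)\mid u]$ is evaluated as a finite kernel expansion with coefficients $\gamma = K_{ZZ}^{-1}(u - m(Z))$, and the residual covariance is checked to be near zero.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    sq = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(1)
M = 15
Z = np.linspace(-3, 3, M)                        # inducing inputs
K_ZZ = rbf_kernel(Z, Z) + 1e-8 * np.eye(M)

# draw inducing observations u = f(Z) ~ N(0, K_ZZ)  (zero mean function assumed)
u = rng.multivariate_normal(np.zeros(M), K_ZZ)

# gamma = K_ZZ^{-1} (u - m(Z)); with m = 0 this is K_ZZ^{-1} u
gamma = np.linalg.solve(K_ZZ, u)

# the conditional mean is a finite kernel expansion over the inducing points
Xs = np.linspace(-3, 3, 200)
mean_f = rbf_kernel(Xs, Z) @ gamma               # E[f(x) | u] = sum_j gamma_j k(x, Z_j)

# residual uncertainty outside the inducing-point space
cov_f = rbf_kernel(Xs, Xs) - rbf_kernel(Xs, Z) @ np.linalg.solve(K_ZZ, rbf_kernel(Z, Xs))
print("max conditional variance:", np.max(np.diag(cov_f)))
```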
low-rank kernel approximation
- original bayes rule:
\[\underbrace{p(f\mid y)}_{\text{posterior}} \propto \underbrace{p(y\mid f)}_{\text{likelihood}}\cdot \underbrace{p(f)}_{\text{prior}}\]
- updated bayes rule:
\[\underbrace{p(f, u\mid y)}_{\text{posterior}} \propto \underbrace{p(y\mid f)}_{\text{likelihood}}\cdot \underbrace{p(f \mid u)p(u)}_{\text{joint prior}}\]
- plate diagram:
\[u \rightarrow f, \quad f \rightarrow y, \quad u \nrightarrow y\]
- marginal likelihood:
\[\begin{aligned} p(y \mid u) &= \int{\underbrace{p(y \mid f)}_{\text{likelihood}}\underbrace{p(f \mid u)}_{\text{conditional prior}}\mathrm{d}f} \end{aligned}\]
- the posterior of the inducing observations $u$ can be expressed through the observations $y$:
\[\begin{aligned} p(u \mid y) &\propto \underbrace{p(y \mid u)}_{\text{marginal likelihood}} \cdot \underbrace{p(u)}_{\text{prior}} \end{aligned}\]
- joint prob. dist. application:
\[\begin{aligned} \begin{bmatrix}u\\y\end{bmatrix} &\sim\mathcal{N}\left(\begin{bmatrix}m(Z)\\ m(X)\end{bmatrix},\begin{bmatrix}K_{ZZ}&K_{ZX}\\K_{XZ}&K_{XX}+\sigma_{N}^{2}\mathbf{I}\end{bmatrix}\right) \end{aligned}\]
- therefore:
\[\begin{aligned} p(u \mid y) &= \mathcal{N}\left(m(Z)+K_{ZX}(K_{XX}+\sigma_{N}^{2}\mathbf{I})^{-1}(y-m(X)),\; K_{ZZ}-K_{ZX}(K_{XX}+\sigma_{N}^{2}\mathbf{I})^{-1}K_{XZ}\right) \end{aligned}\]
- replace $K_{XX}$ with its low-rank approximation, exploiting the conditional-independence property:
\[K_{XX} \approx \underbrace{K_{XZ}K_{ZZ}^{-1}K_{ZX}}_{=:Q_{XX}}\]
- woodbury identity:
\[(Q_{XX}+\sigma_{N}^{2}\mathbf{I})^{-1} =\frac{1}{\sigma_{N}^{2}}\mathbf{I}-\frac{1}{\sigma_{N}^{4}}K_{XZ}\underbrace{\left(K_{ZZ}+\frac{1}{\sigma_{N}^{2}}K_{ZX}K_{XZ}\right)^{-1}}_{M \times M}K_{ZX}\]
- computational cost reduced (see the sketch after this list):
\[\begin{aligned} \mathrm{Cost}\left[(K_{XX}+\sigma_{N}^{2}\mathbf{I})^{-1}\right] &= \mathcal{O}(N^{3})\\ \mathrm{Cost}\left[(Q_{XX}+\sigma_{N}^{2}\mathbf{I})^{-1}\right] &= \mathcal{O}(N^{2}M) \end{aligned}\]
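A minimal sketch of the low-rank kernel approximation (the RBF kernel, noise level, toy data, and zero prior mean are illustrative assumptions): the posterior mean of $u$ is computed with $K_{XX}$ replaced by $Q_{XX}$, applying the Woodbury identity to vectors only, so the dominant cost is $\mathcal{O}(NM^{2})$ and no $N \times N$ inverse is ever formed; the result is checked against the direct $N \times N$ solve.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    sq = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(2)
N, M, noise = 1000, 20, 0.1
X = np.sort(rng.uniform(-3, 3, N))
Z = np.linspace(-3, 3, M)
y = np.sin(2 * X) + noise * rng.normal(size=N)    # toy observations (zero prior mean)

K_ZZ = rbf_kernel(Z, Z) + 1e-8 * np.eye(M)
K_ZX = rbf_kernel(Z, X)
sigma2 = noise**2

# Woodbury: (Q_XX + sigma2 I)^{-1} = I/sigma2 - K_XZ A_mm^{-1} K_ZX / sigma2^2,
# with A_mm = K_ZZ + K_ZX K_XZ / sigma2  (only an M x M system is ever solved)
A_mm = K_ZZ + K_ZX @ K_ZX.T / sigma2

def apply_inv_low_rank(v):
    """Apply (Q_XX + sigma2 I)^{-1} to a vector without forming any N x N matrix."""
    return v / sigma2 - K_ZX.T @ np.linalg.solve(A_mm, K_ZX @ v) / sigma2**2

# posterior mean of u given y, with K_XX ~ Q_XX:  E[u | y] = K_ZX (Q_XX + sigma2 I)^{-1} y
mean_u_sparse = K_ZX @ apply_inv_low_rank(y)

# reference: direct computation that explicitly builds the N x N matrix
Q_XX = K_ZX.T @ np.linalg.solve(K_ZZ, K_ZX)
mean_u_direct = K_ZX @ np.linalg.solve(Q_XX + sigma2 * np.eye(N), y)

print("max difference between Woodbury and direct solve:",
      np.max(np.abs(mean_u_sparse - mean_u_direct)))
```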