
GP (3) Random Fourier Features Gaussian Process

Random Fourier Feature Gaussian Process

  • Random Fourier Feature Gaussian Process is a technique for approximating an infinite-dimensional function space spanned by kernel functions with a finite-dimensional function space spanned by real sinusoids.

  • nonparametric function:

    \[f(x)=\sum_{i=1}^{\infty}{\beta_{i}k(x,x_{i})}\]
  • nonparametric function space:

    \[\mathcal{F} =\mathrm{span}\left\{k(\cdot,x_{1}),k(\cdot,x_{2}),\cdots\right\}\]
  • Random Fourier Feature:

    \[\sum_{i=1}^{\infty}{\beta_{i}k(x,x_{i})} \approx \sum_{m=1}^{M}{\gamma_{m}\psi_{m}(x)}\]
  • the infinite-dimensional function space is approximated by a finite-dimensional space that takes a finite number of real sinusoids as basis functions:

    \[\mathcal{F} \approx\mathrm{span}\left\{\psi_{1}(\cdot),\psi_{2}(\cdot),\cdots,\psi_{M}(\cdot)\right\}\]
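
Concretely, any member of this finite-dimensional space is just a weighted sum of \(M\) cosines. A minimal numpy sketch (the frequencies \(\omega_m\), phases \(b_m\), and weights \(\gamma_m\) are arbitrary placeholders here; how to sample them properly is derived in the next section):

```python
import numpy as np

rng = np.random.default_rng(0)
M = 100  # number of random features

# placeholder frequencies, phases, and coefficients (illustrative only)
omega = rng.normal(size=M)
b = rng.uniform(0.0, 2.0 * np.pi, size=M)
gamma = rng.normal(size=M)

def psi(x):
    """Feature vector psi_m(x) = cos(omega_m * x + b_m), shape (M,)."""
    return np.cos(omega * x + b)

def f(x):
    """f(x) = sum_m gamma_m * psi_m(x): one member of the span."""
    return gamma @ psi(x)

print(f(0.5))
```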

how to approximate


  • by Bochner’s theorem, a shift-invariant and positive-definite kernel function can be expressed as an infinite-dimensional inner product of continuous complex sinusoids.

    \[\begin{aligned} k(x,x^{\prime}) &=\int_{\mathbb{R}^{d}}{\phi_{\omega}(x)\phi_{\omega}^{*}(x^{\prime})p(\omega)\mathrm{d}\omega}\\ \phi_{\omega}(x) &=\exp\left(i\omega^{T}x\right),\quad\omega\in\mathbb{R}^{d} \end{aligned}\]
  • Random Fourier Feature is a technique for approximating the infinite-dimensional inner product of continuous complex sinusoids with a finite-dimensional inner product of discrete real sinusoids.

    \[\begin{aligned} \int_{\mathbb{R}^{d}}{\phi_{\omega}(x)\phi_{\omega}^{*}(x^{\prime})p(\omega)\mathrm{d}\omega} &\approx\frac{2}{M}\sum_{m=1}^{M}{\psi_{m}(x)\psi_{m}(x^{\prime})}\\ \psi_{m}(x) &=\cos\left(\omega_{m}^{T}x+b_{m}\right),\quad\begin{cases} \omega_{m}\sim p(\omega)\\ b_{m}\sim\mathrm{Uniform}(0,2\pi] \end{cases} \end{aligned}\]
  • substituting this approximation into the kernel function, which serves as the basis of the nonparametric function space, the nonparametric function can be expressed as a linear combination of discrete real sinusoids.

    \[\begin{aligned} f(x) &=\sum_{i=1}^{\infty}{\beta_{i}k(x,x_{i})}\\ &\approx\sum_{i=1}^{\infty}{\beta_{i}}\left(\frac{2}{M}\sum_{m=1}^{M}{\psi_{m}(x)\psi_{m}(x_{i})}\right)\\ &=\sum_{m=1}^{M}{\underbrace{\frac{2}{M}\left(\sum_{i=1}^{\infty}{\beta_{i}\psi_{m}(x_{i})}\right)}_{=:\gamma_{m}}\psi_{m}(x)}\\ &=\sum_{m=1}^{M}{\gamma_{m}\psi_{m}(x)} \end{aligned}\]
  • therefore, the infinite-dimensional function space spanned by the kernel basis functions $k(\cdot,x)$, defined for all input values $x$, is approximated by the finite-dimensional function space spanned by the discrete real sinusoids $\psi_{m}(x)=\cos(\omega_{m}^{T}x+b_{m})$.

    \[\mathcal{F} \approx\mathrm{span}\left\{\psi_{1}(\cdot),\psi_{2}(\cdot),\cdots,\psi_{M}(\cdot)\right\}\]
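
The Monte Carlo approximation above can be checked numerically. A sketch for the RBF kernel \(k(x,x^{\prime})=\exp(-(x-x^{\prime})^{2}/2)\), whose spectral density \(p(\omega)\) is the standard normal by Bochner's theorem (the kernel choice and sample size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
M = 5000  # number of random features

# for the unit-lengthscale RBF kernel, p(omega) is the standard normal
omega = rng.normal(size=M)
b = rng.uniform(0.0, 2.0 * np.pi, size=M)

def psi(x):
    # psi_m(x) = cos(omega_m * x + b_m)
    return np.cos(omega * x + b)

def k_rff(x, xp):
    # Monte Carlo estimate: k(x, x') ≈ (2/M) sum_m psi_m(x) psi_m(x')
    return (2.0 / M) * psi(x) @ psi(xp)

def k_exact(x, xp):
    return np.exp(-0.5 * (x - xp) ** 2)

for x, xp in [(0.0, 0.0), (0.0, 0.5), (0.0, 2.0)]:
    print(x, xp, k_exact(x, xp), k_rff(x, xp))
```

As \(M\) grows, the estimate concentrates around the exact kernel value at the usual Monte Carlo rate \(O(1/\sqrt{M})\).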

bayes rule


  • The basis functions \(\psi_{m}(x):=\cos(\omega_{m}^{T}x+b_{m})\) are deterministic after the Monte Carlo approximation. Therefore, the randomness of \(f(x):=\sum_{m=1}^{M}{\gamma_{m}\psi_{m}(x)}\) comes entirely from the coefficients \(\gamma_{m}\).

    \[\gamma_{m}:=\frac{2}{M}\sum_{i=1}^{\infty}{\beta_{i}\psi_{m}(x_{i})}\]
  • proxy variable of $f(x)$:

    The distribution of the function value vector \(\mathbf{f}_{X}:=\{f(x_{i})\}_{i=1}^{N}\) for the input set \(\{x_{i}\}_{i=1}^{N}\) can be expressed as the pushforward of the distribution \(p(\gamma)\) of \(\gamma\) through the linear transformation \(T_{X}:=\Psi_{X}^{T}\), where \((\Psi_{X})_{mi}=\psi_{m}(x_{i})\).

    \[\begin{gathered} \mathbf{f}_{X}=\Psi_{X}^{T}\gamma=T_{X}(\gamma)\quad\Rightarrow\quad p(\mathbf{f}_{X})=(T_{X})_{\#}p(\gamma) \end{gathered}\]
    • \(\mathbb{E}\left[\mathbf{f}_{X}\right]=T_{X}\mathbb{E}\left[\gamma\right]\) \(\quad\)
    • \(\mathrm{Cov}\left[\mathbf{f}_{X}\right]=T_{X}\mathrm{Cov}\left[\gamma\right]T_{X}^{T}\) \(\quad\)
  • prior:

    \[\begin{aligned} p(\gamma)=\mathcal{N}(\mathbf{0},\mathbf{I}) \end{aligned}\]
  • likelihood:

    \[\begin{gathered} y_{i} =\sum_{m=1}^{M}{\gamma_{m}\psi_{m}(x_{i})}+\epsilon_{i},\quad\epsilon_{i}\sim\mathcal{N}(0,\sigma^{2})\\ \therefore p(y\mid\gamma) =\mathcal{N}(\Psi_{X}^{T}\gamma,\sigma^{2}\mathbf{I}) \end{gathered}\]
  • posterior:

    \[\begin{aligned} p(\gamma\mid y) &\propto p(y\mid\gamma)p(\gamma)\\ &\propto \exp\left[-\frac{1}{2\sigma^{2}}\Vert y-\Psi_{X}^{T}\gamma\Vert^{2}-\frac{1}{2}\Vert\gamma\Vert^{2}\right]\\ &\propto \mathcal{N}(\mu_{\gamma}^{\prime},\Sigma_{\gamma}^{\prime}) \end{aligned}\]
    • $\mu_{\gamma}^{\prime}=(1/\sigma^{2})\Sigma_{\gamma}^{\prime}\Psi_{X}y$
    • $\Sigma_{\gamma}^{\prime}=\left[(1/\sigma^{2})\Psi_{X}\Psi_{X}^{T}+\mathbf{I}\right]^{-1}$
  • posterior predictive:

    \[\begin{aligned} \mathbf{f}_{*} &=\Psi_{*}^{T}\gamma\\ \therefore p(\mathbf{f}_{*}\mid y) &=\int{p(\mathbf{f}_{*}\mid\gamma)p(\gamma\mid y)\mathrm{d}\gamma}\\ &=\mathcal{N}(\Psi_{*}^{T}\mu_{\gamma}^{\prime},\Psi_{*}^{T}\Sigma_{\gamma}^{\prime}\Psi_{*}) \end{aligned}\]
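
Putting the pieces together, a sketch of the full posterior-predictive computation in numpy. Here the \(\sqrt{2/M}\) scaling is absorbed into the feature matrix so that the simple \(\mathcal{N}(\mathbf{0},\mathbf{I})\) prior on \(\gamma\) matches the GP prior; the unit-lengthscale RBF kernel, the \(\sin\) target, and all hyperparameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, sigma = 300, 50, 0.1

# random features for the unit-lengthscale RBF kernel: omega ~ N(0, 1)
omega = rng.normal(size=M)
b = rng.uniform(0.0, 2.0 * np.pi, size=M)

def Psi(x):
    """Feature matrix of shape (M, len(x)), scaled so Psi.T @ Psi ≈ K."""
    return np.sqrt(2.0 / M) * np.cos(np.outer(omega, x) + b[:, None])

# noisy observations of a smooth target function
x_train = np.linspace(-3.0, 3.0, N)
y_train = np.sin(x_train) + sigma * rng.normal(size=N)

# posterior over gamma with prior N(0, I):
#   Sigma' = [(1/sigma^2) Psi Psi^T + I]^{-1}
#   mu'    = (1/sigma^2) Sigma' Psi y
P = Psi(x_train)
Sigma_post = np.linalg.inv(P @ P.T / sigma**2 + np.eye(M))
mu_post = Sigma_post @ P @ y_train / sigma**2

# posterior predictive at test inputs: N(Psi_*^T mu', Psi_*^T Sigma' Psi_*)
x_test = np.array([-1.0, 0.0, 1.0])
P_star = Psi(x_test)
mean_star = P_star.T @ mu_post
cov_star = P_star.T @ Sigma_post @ P_star
print(mean_star)  # close to sin(x_test) for this data
```

Note that the posterior requires inverting only an \(M\times M\) matrix, independent of the number of data points \(N\), which is the main computational appeal of the weight-space view.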
This post is licensed under CC BY 4.0 by the author.