Posterior Estimation (3) Kernelized Stein Discrepancy

Posted Jul 20, 2024

By jayarnim

7 min read

Stein Method

Stein Operator

테스트 함수 $f(x)$ 를 확률 분포 $p(x)$ 의 구조가 반영된 형태로 변환하는 연산자로서, $x$ 의 변화에 따른 $f(x)$ 의 변화율를 측정하고, 그 변화를 $x$ 가 발생할 가능성이 반영된 형태로 조정하는 방식으로 동작함

\[\begin{aligned} \mathcal{T}_{p}f(x) &= \nabla_{x}f(x) + f(x)\cdot\nabla_{x}\log{p(x)} \end{aligned}\]

$\nabla_{x}f(x)$ : $x$ 에 대한 $f(x)$ 의 변화율
$\nabla_{x}\log{p(x)}=\displaystyle\frac{p^{\prime}(x)}{p(x)}$ : 스코어 함수(Score Function)로서 $x$ 에 대한 $\log{p(x)}$ 의 변화율

Stein Class

Stein Operator 을 통해 변화율 조정 가능한 테스트 함수 $f(x)$ 는 확률 분포 $p(x)$ 의 정의역 경계에서 $0$ 으로 수렴해야 하고(Boundary Condition), 정의역에서 미분 가능해야 함(Differentiability Condition)

Boundary Condition
\[\begin{aligned} \lim_{x \to \text{boundary}}{f(x)p(x)}=0 \end{aligned}\]
Differentiability Condition
\[\begin{aligned} \forall x \in \mathbb{R}, \quad \exists \nabla_{x}f(x) = \lim_{h \to 0}{\frac{f(x+h)-f(x)}{h}} \end{aligned}\]

Stein’s Identity

$x$ 의 변화에 따른 테스트 함수 $f(x)$ 의 변화 양상을 $x$ 발생 확률 분포로써 조정했을 때(Stein Operator) 조정된 변화율의 기대값이 $0$ 이 되는 성질로서, 특정한 형태의 연산(Stein Operator)을 거친 테스트 함수 $f(x)$ 가 특정 확률 분포 $p(x)$ 에서 평균적으로 변화가 없음을 보장하는 정리

\[\begin{aligned} \mathbb{E}_{x \sim p}\left[\mathcal{T}_{p}f(x)\right] &= \int{p(x) \cdot \mathcal{T}_{p}f(x)\text{d}x}\\ &= \int{p(x)\cdot\Big(f^{\prime}(x) + f(x) \cdot \nabla_{x}\log{p(x)} \Big)\text{d}x}\\ &= \int{\Big[p(x) \cdot f^{\prime}(x) + p(x) \cdot f(x) \cdot \nabla_{x}\log{p(x)} \Big]\text{d}x}\\ &= \int{p(x) \cdot f^{\prime}(x)\text{d}x} + \int{p(x) \cdot f(x) \cdot \nabla_{x}\log{p(x)}\text{d}x}\\ &= 0 \end{aligned}\]

부정적분:
\[\begin{aligned} \int_{a}^{b}{u(x) \cdot v^{\prime}(x)\text{d}x} &= \Big[u(x) \cdot v(x)\Big]_{a}^{b} - \int_{a}^{b}{u^{\prime}(x) \cdot v(x)\text{d}x} \end{aligned}\]
대입:
\[\begin{aligned} \int_{-\infty}^{\infty}{f(x) \cdot p^{\prime}(x)\text{d}x} &= \Big[f(x) \cdot p(x)\Big]_{-\infty}^{\infty} - \int_{-\infty}^{\infty}{f^{\prime}(x) \cdot p(x)\text{d}x} \end{aligned}\]
$p(x)$ 가 확률 밀도 함수이고, $f(x)$ 가 Bounded Function 이라고 하자.
$x \to \pm\infty$ 이면 $p(x) \to 0$ 이므로,
\[\begin{aligned} \Big[f(x) \cdot p(x)\Big]_{-\infty}^{\infty} &=0 \end{aligned}\]
따라서:
\[\begin{aligned} \int_{-\infty}^{\infty}{f^{\prime}(x) \cdot p(x)\text{d}x} &= -\int_{-\infty}^{\infty}{f(x) \cdot p^{\prime}(x)\text{d}x}\\ &= \int_{-\infty}^{\infty}{-f(x) \cdot p^{\prime}(x)\text{d}x} \end{aligned}\]
로그함수의 미분:
\[\begin{aligned} \nabla_{x}\log{p(x)} = \frac{\text{d}}{\text{d}x}\log{p(x)} = \frac{p^{\prime}(x)}{p(x)} \end{aligned}\]
따라서:
\[\begin{aligned} p(x) \cdot f(x) \cdot \nabla_{x}\log{p(x)} &= \cancel{p(x)} \cdot f(x) \cdot \frac{p^{\prime}(x)}{\cancel{p(x)}}\\ &= f(x) \cdot p^{\prime}(x) \end{aligned}\]

KSD

RKHS

RKHS(Reproducing Kernel Hilbert Space) : 특정한 재생 커널(Reproducing Kernel)에 의해 정의된 힐베르트 공간으로서, 이 함수 공간에서 모든 함수는 커널 함수의 선형 조합으로 표현됨
\[\begin{aligned} f(x) &= \sum_{i=1}^{n}{\alpha_{i} \cdot \mathcal{K}(x, x_{i})} \end{aligned}\]
- 힐베르트 공간(Hilbert Space) : 유클리드 공간을 일반화한 개념으로서, 내적과 거리가 정의된 완비 공간
  \[\begin{aligned} \langle f, g \rangle _{\mathcal{H}}:=\sum_{i=1}^{n}\sum_{j=1}^{m}{\alpha_{i}\beta_{j}\mathcal{K}(x_{i}, x_{j})}, \quad \Vert f \Vert _{\mathcal{H}} := \sqrt{\langle f, f \rangle _{\mathcal{H}}} \end{aligned}\]
  - $f(x)=\sum_{i=1}^{n}{\alpha_{i} \cdot \mathcal{K}(x, x_{i})}$
  - $g(x)=\sum_{j=1}^{m}{\beta_{j} \cdot \mathcal{K}(x, x_{j})}$
- 재생(Reproducing) : 모든 함수 $f(x)$ 가 커널 함수의 선형 조합으로 표현됨에 따라 $f(x)$ 를 직접 계산하지 않고도 커널 함수를 통해 함수 값을 재생할 수 있는 성질
  \[\begin{aligned} f(x) = \langle f, \mathcal{K}(\cdot, x) \rangle _{\mathcal{H}} = \left\langle \sum_{i=1}^{n}{\alpha_{i} \cdot \mathcal{K}(\cdot, x_{i})}, \mathcal{K}(\cdot, x) \right\rangle _{\mathcal{H}} = \sum_{i=1}^{n}{\alpha_{i} \cdot \mathcal{K}(x, x_{i})} \end{aligned}\]

재생 커널(Reproducing Kernel)

Name	Function
Linear	$\mathcal{K}\left(X,X^{\prime}\right) = X \cdot X^{\prime}$
Polynomial	$\mathcal{K}\left(X,X^{\prime}\right) = \left(X \cdot X^{\prime} + \beta\right)^{d}$
RBF	$\mathcal{K}\left(X,X^{\prime}\right) = \exp{\left[-\displaystyle\frac{\Vert X-X^{\prime} \Vert^{2}}{2\ell^{2}}\right]}$
Matern	$\mathcal{K}\left(X,X^{\prime}\right) = \displaystyle\frac{2^{1-\nu}}{\Gamma\left(\nu\right)} \cdot \left(\displaystyle\frac{\sqrt{2\nu} \Vert X-X^{\prime} \Vert}{\ell}\right)^{\nu} \cdot K_{\nu}\left(\displaystyle\frac{\sqrt{2\nu} \Vert X-X^{\prime} \Vert}{\ell}\right)$

대칭성(Symmetry):
\[\begin{aligned} \mathcal{K}(x, x^{\prime}) &= \mathcal{K}(x^{\prime}, x) \end{aligned}\]
양의 정부호성(Positive Definiteness):
\[\begin{aligned} \overrightarrow{\mathbf{v}}^{T} \mathbf{K} \overrightarrow{\mathbf{v}} > 0, \quad \forall \overrightarrow{\mathbf{v}} \in \mathbb{R}^{N}\setminus \{0\} \end{aligned}\]
- $\mathbf{K}$ is Kernel Matrix
  \[\begin{aligned} \mathbf{K} = \begin{pmatrix} \mathcal{K}(x_{1}, x_{1}) & \mathcal{K}(x_{1}, x_{2}) & \cdots & \mathcal{K}(x_{1}, x_{N})\\ \mathcal{K}(x_{2}, x_{1}) & \mathcal{K}(x_{2}, x_{2}) & \cdots & \mathcal{K}(x_{2}, x_{N})\\ \vdots & \vdots & \ddots & \vdots\\ \mathcal{K}(x_{N}, x_{1}) & \mathcal{K}(x_{N}, x_{2}) & \cdots & \mathcal{K}(x_{N}, x_{N})\\ \end{pmatrix} \end{aligned}\]

KSD

Stein’s Identity Application
- if $q(x)=p(x)$:
  \[\begin{aligned} \mathbb{E}_{x \sim q}\left[\mathcal{T}_{p}f(x)\right] = \begin{cases}\int{q(x) \cdot \mathcal{T}_{p}f(x)\text{d}x}\\ \sum{q(x) \cdot \mathcal{T}_{p}f(x)} \end{cases} = 0 \end{aligned}\]
- else:
  \[\begin{aligned} \mathbb{E}_{x \sim q}\left[\mathcal{T}_{p}f(x)\right] = \begin{cases}\int{q(x) \cdot \mathcal{T}_{p}f(x)\text{d}x}\\ \sum{q(x) \cdot \mathcal{T}_{p}f(x)} \end{cases} \ne 0 \end{aligned}\]
KSD(Kernelized Stein Discrepancy) : Stein’s Identity 를 활용하여 두 확률 분포 간 차이를 측정함에 있어, RKHS 의 커널 함수를 이용하는 기법
- Stein Discrepancy:
  \[\begin{aligned} D_{\text{STEIN}}(q \parallel p) &= \sup_{f \in \mathcal{F}}{\mathbb{E}_{x \sim q}\left[\mathcal{T}_{p}f(x)\right]} \end{aligned}\]
- If Test Function is defined in RKHS:
  \[\begin{aligned} f(x) = \sum_{i=1}^{n}{\alpha_{i} \cdot \mathcal{K}(x, x_{i})},\quad f \in \mathcal{H}_{K} \end{aligned}\]
- Argument of Supremum is:
  \[\begin{aligned} f^{*}(x) &= \text{arg} \sup_{f \in \mathcal{F}}{\mathbb{E}_{x \sim q}\left[\mathcal{T}_{p}f(x)\right]}\\ &= \mathbb{E}_{x^{\prime} \sim q}[\mathcal{K}(x, x^{\prime}) \cdot \nabla_{x^{\prime}}\log{p(x^{\prime})}+\nabla_{x^{\prime}}\mathcal{K}(x, x^{\prime})] \end{aligned}\]
- Therefore:
  \[\begin{aligned} D_{\text{STEIN}}(q \parallel p) &= \mathbb{E}_{x,x^{\prime}\sim q}\left[\nabla_{x}\log{p(x)}^{T} \cdot \mathcal{K}(x,x^{\prime}) \cdot \nabla_{x^{\prime}}\log{p(x^{\prime})} + \nabla_{x}\nabla_{x^{\prime}}\mathcal{K}(x,x^{\prime})\right] \end{aligned}\]

SVGD

SVGD(Stein Variational Gradient Descent) : Stein Discrepancy 를 활용하여 제안 분포 $W \sim Q$ 를 목표 분포 $W \mid \mathcal{D} \sim P$ 에 근사하는 비모수 방법론(Non-Parametric Method)
- Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm(2016)

Vector Optimization vs. Distribution Optimization through Gradient Descent

	Euclidean Gradient	Wasserstein Gradient
Space	Euclidean	Measure
Distance	Euclidean	Wasserstein
Target	$x$	$q(x)$
Loss Function	$\mathcal{L}(x)$	$D(q \parallel p)$
Gradient	$\partial \mathcal{L}(x) / \partial x$	$\delta D(q \parallel p) / \delta q$
Update	$x \to x^{*}$	$q \to p$

Distribution Optimization $\to$ Particle Optimization

Stein Variational Gradient Descent 는 파라미터의 분포를 직접 모델링하지 않고 개별 입자들을 모델링함. 즉, 제안 분포와 목표 분포 간 차이를 줄임으로써 분포 전체를 직접 최적화하는 방식이 아니라(변분 미분), 제안 분포에서 샘플링된 개별 입자들의 위치가 목표 분포를 따르도록 조정하는 방식으로 학습이 진행됨(편미분).
입자가 이동해야 하는 최적 방향을 제공하는 지표는 분포 간 차이를 나타내는 $D_{\text{STEIN}}(q \parallel p)$ 에 개별 입자의 위치를 편미분한 값 $\phi(x_{i})=\displaystyle\frac{\partial}{\partial x_{i}}D_{\text{STEIN}}(q \parallel p)$ 으로, 이는 RKHS에서 유도된 최적의 함수와 동일함.
Stein Gradient:
\[\begin{aligned} \phi(w_{i}) &= \mathbb{E}_{w_{j} \sim q}[\mathcal{K}(w_{i}, w_{j}) \cdot \nabla_{w_{j}}\log{p(w_{j} \mid \mathcal{D})}+\nabla_{w_{j}}\mathcal{K}(w_{i}, w_{j})]\\ &= \frac{1}{M}\sum_{j=1}^{M}{\underbrace{\mathcal{K}(w_{i}, w_{j})}_{\text{kernel}} \cdot \underbrace{\nabla_{w_{j}}\log{p(w_{j} \mid \mathcal{D})}}_{\text{score function}}+\underbrace{\nabla_{w_{j}}\mathcal{K}(w_{i}, w_{j})}_{\text{repulsion effect}}} \end{aligned}\]
- 스코어 함수(Score Function) : 입자 $w_{i} \sim q$ 가 확률 분포 $p$ 를 따라 움직이도록 유도함
  \[\begin{aligned} \underbrace{\log{p(w_{j} \mid \mathcal{D})}}_{\text{posterior}} &= \underbrace{\log{p(\mathcal{D} \mid w_{j})}}_{\text{likelihood}} + \underbrace{\log{p(w_{j})}}_{\text{prior}} - \underbrace{\log{p(\mathcal{D})}}_{\text{evidence}}\\ \therefore \underbrace{\nabla_{w_{j}}\log{p(w_{j} \mid \mathcal{D})}}_{\text{score function}} &= \underbrace{\nabla_{w_{j}}\log{p(\mathcal{D} \mid w_{j})}}_{\text{likelihood gradient}} + \underbrace{\nabla_{w_{j}}\log{p(w_{j})}}_{\text{prior gradient}} \end{aligned}\]
- 반발 효과(Repulsion Effect) : 입자들이 서로 퍼지도록 함으로써 입자 간 분포를 유지함
Update:
\[\begin{aligned} w_{i} \gets w_{i} + \epsilon \cdot \phi(w_{i}) \end{aligned}\]

Data Mining Techs, Bayesian Modeling

This post is licensed under CC BY 4.0 by the author.