Bayesian Attention Modules

Fan, X., Zhang, S., Chen, B., & Zhou, M. (2020). Bayesian attention modules. Advances in Neural Information Processing Systems, 33, 16362-16376.

Posted Sep 19, 2024

By jayarnim

Bayesian Attention Modules

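Motivation: the alignment between query information and reference information is uncertain
- Deterministic Attention Mechanism: aligning a query against the reference information yields a single, fixed attention value
- when the data distribution is uncertain (the data are complex and noisy), the relationship between query and reference information is hard to determine with a fixed value
- the relational structure between query and reference information can be multi-layered and diverse
Bayesian Attention Modules: a method that applies an empirical Bayes framework to the Attention Score so that the uncertainty in how strongly a query attends to the reference information is reflected
- the Attention Score is assumed to be a random variable following a probability distribution, not a constant taking a fixed value
- the exponential of the similarity function between query and reference information is taken as the expected value of the Attention Score
- the attention of an individual query to a reference item is regularized by prior information (the overall attention of queries in general to that reference item)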

Notation

$y$: response variable
$q,k,v,\cdots \in X$: explanatory variables
$q,k,v$: query, key, value
$S$: random variable of attention score
$s$: sample of attention score
$\Omega$: random variable of attention weight
$\omega$: sample of attention weight
$P_{\theta}(\cdot)$: posterior dist.
$Q_{\phi}(\cdot)$: approx. dist.
$\Pi_{\eta}(\cdot)$: prior dist.
$\mathcal{L}(\cdot)$: likelihood
$\mathbf{h}$: linear transformation vector
$\mathbf{W}$: linear transformation matrix
$\mathbf{b}$: bias vector

How to Model

attention score distribution must be defined over non-negative random variables, and the Approx. dist. and Prior are paired so that the KL term stays tractable
- if Approx. is Weibull Dist., Prior is Gamma Dist.
  - Weibull: $S \sim \mathrm{Weibull}(k,\lambda)$ (shape $k$ is a hyper-parameter, not to be confused with the key $k_{j}$)
  - Gamma: $S \sim \mathrm{Gamma}(\alpha,\beta)$ (rate $\beta$ is a hyper-parameter)
- if Approx. is Lognormal Dist., Prior is Lognormal Dist.
  - Lognormal: $S \sim \mathrm{Lognormal}(\mu,\sigma^{2})$ ($\sigma$ is a hyper-parameter)
the similarity function values $\psi_{i,j}$ set the parameters of the Approx. dist. $Q_{\phi}(S)$ by matching its mean to $\exp{\psi_{i,j}}$
\[\begin{aligned} \mathbb{E}\left[S_{i,j}\right]=\exp{\psi_{i,j}} \quad \mathrm{for} \quad \psi_{i,j}=f(q_{i},k_{j}) \end{aligned}\]
- Weibull: $\lambda=\exp{\psi} / \Gamma(1+1/k)$
- Lognormal: $\mu=\psi - \sigma^{2}/2$
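The two parameter formulas above follow from matching the mean of each distribution to $\exp{\psi}$. A minimal sketch that checks this numerically, using only the standard library; the function names and the example values of `psi`, `k`, and `sigma` are illustrative assumptions, not taken from the paper.

```python
import math

def weibull_scale(psi: float, k: float) -> float:
    """Scale lambda such that Weibull(shape=k, scale=lambda) has mean exp(psi)."""
    return math.exp(psi) / math.gamma(1.0 + 1.0 / k)

def lognormal_mu(psi: float, sigma: float) -> float:
    """Location mu such that Lognormal(mu, sigma^2) has mean exp(psi)."""
    return psi - sigma ** 2 / 2.0

def weibull_mean(k: float, lam: float) -> float:
    # Mean of a Weibull distribution: lambda * Gamma(1 + 1/k)
    return lam * math.gamma(1.0 + 1.0 / k)

def lognormal_mean(mu: float, sigma: float) -> float:
    # Mean of a lognormal distribution: exp(mu + sigma^2 / 2)
    return math.exp(mu + sigma ** 2 / 2.0)

# Arbitrary example values (assumptions for illustration only)
psi, k, sigma = 0.7, 10.0, 0.5

print(weibull_mean(k, weibull_scale(psi, k)), math.exp(psi))
print(lognormal_mean(lognormal_mu(psi, sigma), sigma), math.exp(psi))
```

Both printed pairs should agree up to floating-point error, which is all the mean-matching condition asks for.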
Prior $\Pi_{\eta}(S)$ is a contextual dist. whose parameters are computed from the keys alone, not from the query
\[\begin{aligned} \mathbb{E}\left[S_{j}\right]=\exp{\psi_{j}} \quad \mathrm{for} \quad \psi_{j}=\mathrm{softmax}\left[\mathbf{h}^{T}(\mathbf{W}k_{j}+\mathbf{b})\right] \end{aligned}\]
- Gamma: $\alpha=\exp{\psi}\cdot\beta$
- Lognormal: $\mu=\psi - \sigma^{2}/2$
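The key-based prior can be sketched in plain Python as follows. This assumes the softmax is taken across the keys $j$ and that $\beta$ is the rate of the Gamma distribution (so its mean is $\alpha/\beta$); the helper names and the toy values of `keys`, `W`, `b`, and `h` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def prior_psi(keys, W, b, h):
    """psi_j = softmax_j[ h^T (W k_j + b) ], computed from the keys only."""
    logits = []
    for k_j in keys:
        # hidden = W k_j + b
        hidden = [sum(w * x for w, x in zip(row, k_j)) + b_i
                  for row, b_i in zip(W, b)]
        # scalar logit = h^T hidden
        logits.append(sum(h_i * z for h_i, z in zip(h, hidden)))
    return softmax(logits)

def gamma_prior_shape(psi, beta):
    """alpha_j = exp(psi_j) * beta, so that the prior mean alpha_j / beta is exp(psi_j)."""
    return [math.exp(p) * beta for p in psi]

def lognormal_prior_mu(psi, sigma):
    """mu_j = psi_j - sigma^2 / 2, so that the prior mean is exp(psi_j)."""
    return [p - sigma ** 2 / 2.0 for p in psi]

# Toy inputs (assumptions): three 2-d keys and a 2-unit linear map
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[0.5, -0.2], [0.1, 0.3]]
b = [0.0, 0.1]
h = [1.0, -1.0]

psi = prior_psi(keys, W, b, h)
alpha = gamma_prior_shape(psi, beta=2.0)
print(psi, alpha)
```

Because no query enters `prior_psi`, every query shares the same prior over a given key, which is what lets the prior act as a regularizer on per-query attention.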
attention weights are obtained by L1-normalizing the sampled non-negative scores over the keys, rather than by applying softmax, so the randomness of the scores carries over to the weights
\[\begin{aligned} \omega_{i,j} &= \frac{s_{i,j}}{\sum_{l}{s_{i,l}}} \end{aligned}\]
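A minimal sketch of sampling and normalizing for one query, assuming the Weibull case. The sample is drawn by inverting the Weibull CDF on a uniform draw, which is one common way to reparameterize it; the function names and example values are assumptions for illustration.

```python
import math
import random

def sample_weibull(k: float, lam: float, rng: random.Random) -> float:
    """Draw s ~ Weibull(shape=k, scale=lam) by inverse-CDF sampling."""
    u = rng.random()  # uniform on [0, 1)
    return lam * (-math.log(1.0 - u)) ** (1.0 / k)

def l1_normalize(scores):
    """Divide non-negative scores by their sum over all keys."""
    total = sum(scores)
    return [s / total for s in scores]

def sample_attention_weights(psi_row, k: float, rng: random.Random):
    """Sample scores with mean exp(psi_j) for one query, then L1-normalize."""
    norm = math.gamma(1.0 + 1.0 / k)
    scores = [sample_weibull(k, math.exp(p) / norm, rng) for p in psi_row]
    return l1_normalize(scores)

# psi values of one query against three keys (arbitrary example)
psi_row = [0.1, 1.2, -0.5]
rng = random.Random(0)
print(sample_attention_weights(psi_row, k=5.0, rng=rng))
```

As the shape `k` grows, the Weibull concentrates around its mean $\exp{\psi}$, so the normalized weights approach the ordinary softmax of `psi_row`; a smaller `k` yields noisier weights.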
Bayesian framework
\[\begin{aligned} P_{\theta}(S \mid y,X) = \frac{\mathcal{L}(y \mid S, X)\Pi_{\eta}(S)}{P(y \mid X)} \end{aligned}\]
- posterior: $P_{\theta}(S \mid y,X)$
- likelihood: $\mathcal{L}(y \mid S, X)$
- prior: $\Pi_{\eta}(S)$
variational inference
\[\begin{aligned} \mathrm{ELBO} = \mathbb{E}_{S \sim Q_{\phi}}\left[\log{\mathcal{L}(y \mid S, X)}\right]-\mathrm{KL}\left[Q_{\phi}(S) \parallel \Pi_{\eta}(S)\right] \end{aligned}\]
- approx: $Q_{\phi}(S) \approx P_{\theta}(S \mid y,X)$
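The KL term of the ELBO has a closed form for both pairings of Approx. dist. and Prior. The sketch below writes out the standard Weibull-to-Gamma and lognormal-to-lognormal expressions per attention score, assuming $\beta$ is the Gamma rate; the function names are hypothetical and the likelihood term is left out.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def kl_weibull_gamma(k: float, lam: float, alpha: float, beta: float) -> float:
    """KL[ Weibull(shape=k, scale=lam) || Gamma(shape=alpha, rate=beta) ]."""
    return (EULER_GAMMA * alpha / k
            - alpha * math.log(lam)
            + math.log(k)
            + beta * lam * math.gamma(1.0 + 1.0 / k)
            - EULER_GAMMA - 1.0
            - alpha * math.log(beta)
            + math.lgamma(alpha))

def kl_lognormal(mu_q: float, sigma_q: float, mu_p: float, sigma_p: float) -> float:
    """KL[ Lognormal(mu_q, sigma_q^2) || Lognormal(mu_p, sigma_p^2) ].

    Equal to the KL between the underlying Gaussians of log S.
    """
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

# Sanity check: Weibull(k=1, lam) is Exponential(rate=1/lam) = Gamma(1, 1/lam),
# so the divergence between them should vanish.
print(kl_weibull_gamma(1.0, 2.0, 1.0, 0.5))
print(kl_lognormal(0.0, 1.0, 1.0, 1.0))
```

In training, these per-score terms would be summed over queries and keys and subtracted from the Monte Carlo estimate of the expected log-likelihood.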

This post is licensed under CC BY 4.0 by the author.
