Book Notes: Pattern Recognition and Machine Learning -- Ch10 Variational Inference

For the pdf slides, click here

Variational Inference

Introduction to the variational inference method

Definitions

  • Variational inference is also called variational Bayes; thus
    • all parameters are viewed as random variables, and
    • they will have prior distributions.
  • We denote the set of all latent variables and parameters by $Z$
    • Note: the parameter vector $\theta$ no longer appears, because it is now a part of $Z$
  • Goal: find approximations for
    • the posterior distribution $p(Z \mid X)$, and
    • the marginal likelihood $p(X)$, also called the model evidence

Model evidence equals lower bound plus KL divergence

  • Goal: We want to find a distribution $q(Z)$ that approximates the posterior distribution $p(Z \mid X)$. In other words, we want to minimize the KL divergence $\mathrm{KL}(q \,\|\, p)$.

  • Note the decomposition of the marginal likelihood: $$\log p(X) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p)$$ where $$\mathcal{L}(q) = \int q(Z) \log\left\{\frac{p(X, Z)}{q(Z)}\right\} dZ, \qquad \mathrm{KL}(q \,\|\, p) = -\int q(Z) \log\left\{\frac{p(Z \mid X)}{q(Z)}\right\} dZ$$

  • Thus, maximizing the lower bound (also called the ELBO) $\mathcal{L}(q)$ is equivalent to minimizing the KL divergence $\mathrm{KL}(q \,\|\, p)$.
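
As a quick numerical illustration of this decomposition (not from the book; the joint probabilities below are made up), we can verify $\log p(X) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p)$ for a toy discrete latent variable:

```python
# Toy check of log p(X) = L(q) + KL(q || p) with a discrete latent Z in {0, 1};
# the joint probabilities and q are arbitrary illustrative numbers.
import numpy as np

p_joint = np.array([0.10, 0.30])        # p(X = x_obs, Z = z) for z = 0, 1
p_X = p_joint.sum()                     # model evidence p(X)
p_post = p_joint / p_X                  # posterior p(Z | X)

q = np.array([0.6, 0.4])                # a variational distribution q(Z)

elbo = np.sum(q * np.log(p_joint / q))  # L(q) = sum_Z q(Z) log{p(X, Z) / q(Z)}
kl = np.sum(q * np.log(q / p_post))     # KL(q || p(Z | X))

print(np.log(p_X), elbo + kl)           # both print the same number
```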

Mean field family

  • Goal: restrict the family of distributions $q(Z)$ so that it comprises only tractable distributions, while allowing the family to be sufficiently flexible that it can approximate the posterior distribution well

  • Mean field family: partition the elements of $Z$ into disjoint groups denoted by $Z_j$, for $j = 1, \dots, M$, and assume $q$ factorizes wrt these groups: $$q(Z) = \prod_{j=1}^{M} q_j(Z_j)$$
    • Note: we place no restriction on the functional forms of the individual factors $q_j(Z_j)$

Solution for mean field families: derivation

  • We will optimize wrt each $q_j(Z_j)$ in turn.

  • For $q_j$, the lower bound (to be maximized) can be decomposed as $$\begin{aligned} \mathcal{L}(q) &= \int \prod_k q_k \Big\{ \log p(X, Z) - \sum_k \log q_k \Big\}\, dZ \\ &= \int q_j \underbrace{\Big\{ \int \log p(X, Z) \prod_{k \neq j} q_k \, dZ_k \Big\}}_{\mathbb{E}_{k \neq j}[\log p(X, Z)]} dZ_j - \int q_j \log q_j \, dZ_j + \text{const} \\ &= -\mathrm{KL}\big(q_j \,\|\, \tilde{p}(X, Z_j)\big) + \text{const} \end{aligned}$$

    • Here the new distribution $\tilde{p}(X, Z_j)$ is defined as $$\log \tilde{p}(X, Z_j) = \mathbb{E}_{k \neq j}[\log p(X, Z)] + \text{const}$$

Solution for mean field families

  • A general expression for the optimal solution $q_j^\star(Z_j)$ is $$\log q_j^\star(Z_j) = \mathbb{E}_{k \neq j}[\log p(X, Z)] + \text{const}$$

    • We can only use this solution in an iterative manner, because the expectations must be computed wrt the other factors $q_k(Z_k)$ for $k \neq j$ (see the sketch below).
    • Convergence is guaranteed because the bound is convex wrt each factor $q_j$.
    • On the right-hand side we only need to retain those terms that have some functional dependence on $Z_j$.
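
The iterative scheme can be written as a small generic loop; this is just a sketch of my own (the function and parameter names are not from the book), with each update function implementing one $\log q_j^\star$ equation:

```python
# A generic coordinate-ascent VI loop: cycle through the factor updates until
# the variational parameters stop changing. Each update function maps the
# current parameter dict to a new one, holding the other factors fixed.
import numpy as np

def coordinate_ascent_vi(update_fns, params, max_iter=100, tol=1e-10):
    for _ in range(max_iter):
        old = {name: np.copy(value) for name, value in params.items()}
        for update in update_fns:
            params = update(params)
        if all(np.allclose(old[name], params[name], atol=tol) for name in params):
            break                       # converged: no factor changed appreciably
    return params
```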

Example: approximate a bivariate Gaussian using two independent distributions

  • Target distribution: a bivariate Gaussian $$p(z) = \mathcal{N}(z \mid \mu, \Lambda^{-1}), \qquad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Lambda = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{12} & \Lambda_{22} \end{pmatrix}$$

  • We use a factorized form to approximate $p(z)$: $$q(z) = q_1(z_1)\, q_2(z_2)$$

  • Note: we do not assume any functional forms for $q_1$ and $q_2$

VI solution to the bivariate Gaussian problem

$$\begin{aligned} \log q_1^\star(z_1) &= \mathbb{E}_{z_2}[\log p(z)] + \text{const} \\ &= \mathbb{E}_{z_2}\Big[ -\tfrac{1}{2}(z_1 - \mu_1)^2 \Lambda_{11} - (z_1 - \mu_1)\Lambda_{12}(z_2 - \mu_2) \Big] + \text{const} \\ &= -\tfrac{1}{2} z_1^2 \Lambda_{11} + z_1 \mu_1 \Lambda_{11} - z_1 \Lambda_{12}\big(\mathbb{E}[z_2] - \mu_2\big) + \text{const} \end{aligned}$$

  • Thus we identify a normal distribution, with mean depending on $\mathbb{E}[z_2]$: $$q^\star(z_1) = \mathcal{N}(z_1 \mid m_1, \Lambda_{11}^{-1}), \qquad m_1 = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}\big(\mathbb{E}[z_2] - \mu_2\big)$$

  • By symmetry, $q^\star(z_2)$ is also normal; its mean depends on $\mathbb{E}[z_1]$: $$q^\star(z_2) = \mathcal{N}(z_2 \mid m_2, \Lambda_{22}^{-1}), \qquad m_2 = \mu_2 - \Lambda_{22}^{-1}\Lambda_{12}\big(\mathbb{E}[z_1] - \mu_1\big)$$

  • We treat the above variational solutions as re-estimation equations and cycle through the variables in turn, updating them until some convergence criterion is satisfied
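
A minimal sketch of this cycling for a concrete bivariate Gaussian (the particular $\mu$ and $\Lambda$ values are made up for illustration):

```python
# Re-estimation for the factorized approximation of a bivariate Gaussian:
# m1 = mu1 - Lam11^{-1} Lam12 (E[z2] - mu2), and symmetrically for m2.
import numpy as np

mu = np.array([1.0, -1.0])              # true mean
Lam = np.array([[2.0, 0.8],
                [0.8, 1.5]])            # true precision matrix

E_z1, E_z2 = 0.0, 0.0                   # initial guesses for E[z1], E[z2]
for _ in range(50):
    E_z1 = mu[0] - (Lam[0, 1] / Lam[0, 0]) * (E_z2 - mu[1])   # update q(z1)
    E_z2 = mu[1] - (Lam[0, 1] / Lam[1, 1]) * (E_z1 - mu[0])   # update q(z2)

print(E_z1, E_z2)                       # converges to the true mean (1.0, -1.0)
# Note: the factor variances 1/Lam[0,0] and 1/Lam[1,1] under-estimate the true
# marginal variances, which are the diagonal entries of inv(Lam).
```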

Visualize VI solution to bivariate Gaussian

  • Variational inference minimizes $\mathrm{KL}(q \,\|\, p)$: the mean of the approximation is correct, but the variance (along the orthogonal direction) is significantly under-estimated

  • Expectation propagation minimizes $\mathrm{KL}(p \,\|\, q)$: the solution equals the product of the marginals of $p$


Figure 1: Left: variational inference. Right: expectation propagation

Another example to compare $\mathrm{KL}(q \,\|\, p)$ and $\mathrm{KL}(p \,\|\, q)$

  • To approximate a mixture of two Gaussians $p$ (blue contours)
  • Use a single Gaussian $q$ (red contours) to approximate $p$
    • By minimizing $\mathrm{KL}(p \,\|\, q)$: figure (a)
    • By minimizing $\mathrm{KL}(q \,\|\, p)$: figures (b) and (c) show two local minima

  • For a multimodal distribution
    • a variational solution will tend to find one of the modes,
    • but an expectation propagation solution averages across all of the modes, which can lead to a poor predictive distribution (because the average of two good parameter values is typically not itself a good parameter value)

Example: univariate Gaussian


  • Suppose the data $\mathcal{D} = \{x_1, \dots, x_N\}$ are iid from a normal distribution: $x_i \sim \mathcal{N}(\mu, \tau^{-1})$

  • The prior distributions are $$\mu \mid \tau \sim \mathcal{N}\big(\mu_0, (\lambda_0 \tau)^{-1}\big), \qquad \tau \sim \mathrm{Gam}(a_0, b_0)$$

  • Factorized variational approximation: $$q(\mu, \tau) = q(\mu)\, q(\tau)$$

Variational solution for μ

$$\begin{aligned} \log q^\star(\mu) &= \mathbb{E}_\tau[\log p(\mathcal{D} \mid \mu, \tau) + \log p(\mu \mid \tau)] + \text{const} \\ &= -\frac{\mathbb{E}[\tau]}{2}\Big\{ \lambda_0(\mu - \mu_0)^2 + \sum_{i=1}^N (x_i - \mu)^2 \Big\} + \text{const} \end{aligned}$$

Thus, the variational solution for $\mu$ is $$q^\star(\mu) = \mathcal{N}(\mu \mid \mu_N, \lambda_N^{-1}), \qquad \mu_N = \frac{\lambda_0 \mu_0 + N\bar{x}}{\lambda_0 + N}, \quad \lambda_N = (\lambda_0 + N)\,\mathbb{E}[\tau]$$

Variational solution for τ

$$\begin{aligned} \log q^\star(\tau) &= \mathbb{E}_\mu[\log p(\mathcal{D} \mid \mu, \tau) + \log p(\mu \mid \tau) + \log p(\tau)] + \text{const} \\ &= (a_0 - 1)\log\tau - b_0\tau + \frac{N}{2}\log\tau - \frac{\tau}{2}\,\mathbb{E}_\mu\Big[\lambda_0(\mu - \mu_0)^2 + \sum_{i=1}^N (x_i - \mu)^2\Big] + \text{const} \end{aligned}$$

Thus, the variational solution for $\tau$ is $$q^\star(\tau) = \mathrm{Gam}(\tau \mid a_N, b_N), \qquad a_N = a_0 + \frac{N}{2}, \quad b_N = b_0 + \frac{1}{2}\,\mathbb{E}_\mu\Big[\lambda_0(\mu - \mu_0)^2 + \sum_{i=1}^N (x_i - \mu)^2\Big]$$
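
A minimal sketch of these two re-estimation equations on simulated data (the data-generating values and the prior hyperparameters below are arbitrary choices, not from the book):

```python
# Iterate the q(mu) and q(tau) updates for the univariate Gaussian example.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=200)   # simulated data
N, xbar = x.size, x.mean()

mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0         # prior hyperparameters
E_tau = a0 / b0                                 # initial guess for E[tau]

for _ in range(100):
    # q(mu) = N(mu_N, 1 / lambda_N)
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # moments of q(mu) needed below
    E_mu = mu_N
    E_mu2 = mu_N**2 + 1.0 / lam_N
    # q(tau) = Gam(a_N, b_N)
    a_N = a0 + N / 2
    b_N = b0 + 0.5 * (lam0 * (E_mu2 - 2 * E_mu * mu0 + mu0**2)
                      + np.sum(x**2) - 2 * E_mu * np.sum(x) + N * E_mu2)
    E_tau = a_N / b_N

print(mu_N, b_N / a_N)   # approx. posterior mean of mu; 1/E[tau] estimates the variance
```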

Visualization of VI solution to univariate normal

Model selection

Model selection (comparison) under variational inference

  • In addition to making inference about the latent variables and parameters $Z$, we may also want to compare a set of candidate models, indexed by $m$

  • We then consider the factorization $q(Z, m) = q(Z \mid m)\, q(m)$ to approximate the posterior $p(Z, m \mid X)$

  • We can maximize the lower bound $$\mathcal{L}_m = \sum_m \sum_Z q(Z \mid m)\, q(m) \log\left\{\frac{p(Z, X, m)}{q(Z \mid m)\, q(m)}\right\}$$ which is a lower bound on $\log p(X)$

  • The maximized $q(m)$ can then be used for model selection

Variational Mixture of Gaussians

Mixture of Gaussians

  • For each observation $x_n \in \mathbb{R}^D$, we have a corresponding latent variable $z_n$, a 1-of-$K$ binary group-indicator vector

  • Mixture of Gaussians joint likelihood, based on $N$ observations: $$p(Z \mid \pi) = \prod_{n=1}^N \prod_{k=1}^K \pi_k^{z_{nk}}, \qquad p(X \mid Z, \mu, \Lambda) = \prod_{n=1}^N \prod_{k=1}^K \mathcal{N}(x_n \mid \mu_k, \Lambda_k^{-1})^{z_{nk}}$$


Figure 2: Graph representation of mixture of Gaussians

Conjugate priors

  • Dirichlet prior for $\pi$: $$p(\pi) = \mathrm{Dir}(\pi \mid \alpha_0) \propto \prod_{k=1}^K \pi_k^{\alpha_0 - 1}$$

  • Independent Gaussian-Wishart priors for $\mu, \Lambda$: $$p(\mu, \Lambda) = \prod_{k=1}^K p(\mu_k \mid \Lambda_k)\, p(\Lambda_k) = \prod_{k=1}^K \mathcal{N}\big(\mu_k \mid m_0, (\beta_0 \Lambda_k)^{-1}\big)\, \mathcal{W}(\Lambda_k \mid W_0, \nu_0)$$

    • Usually, the prior mean $m_0 = 0$

Variational distribution

  • Joint distribution: $$p(X, Z, \pi, \mu, \Lambda) = p(X \mid Z, \mu, \Lambda)\, p(Z \mid \pi)\, p(\pi)\, p(\mu \mid \Lambda)\, p(\Lambda)$$

  • The variational distribution factorizes between the latent variables and the parameters: $$q(Z, \pi, \mu, \Lambda) = q(Z)\, q(\pi, \mu, \Lambda) = q(Z)\, q(\pi) \prod_{k=1}^K q(\mu_k, \Lambda_k)$$

Variational solution for Z

  • Optimized factor: $$\begin{aligned} \log q^\star(Z) &= \mathbb{E}_{\pi,\mu,\Lambda}[\log p(X, Z, \pi, \mu, \Lambda)] + \text{const} \\ &= \mathbb{E}_\pi[\log p(Z \mid \pi)] + \mathbb{E}_{\mu,\Lambda}[\log p(X \mid Z, \mu, \Lambda)] + \text{const} \\ &= \sum_{n=1}^N \sum_{k=1}^K z_{nk} \log \rho_{nk} + \text{const} \end{aligned}$$ where $$\log \rho_{nk} = \mathbb{E}[\log \pi_k] + \frac{1}{2}\mathbb{E}[\log |\Lambda_k|] - \frac{D}{2}\log(2\pi) - \frac{1}{2}\mathbb{E}_{\mu_k,\Lambda_k}\big[(x_n - \mu_k)^\top \Lambda_k (x_n - \mu_k)\big]$$

  • Thus, the factor $q^\star(Z)$ takes the same functional form as the prior $p(Z \mid \pi)$: $$q^\star(Z) = \prod_{n=1}^N \prod_{k=1}^K r_{nk}^{z_{nk}}, \qquad r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^K \rho_{nj}}$$

    • Under $q^\star(Z)$, the posterior mean (i.e., the responsibility) is $\mathbb{E}[z_{nk}] = r_{nk}$
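
In code, the responsibilities are conveniently computed from $\log \rho_{nk}$ with a numerically stable normalization; a minimal sketch (the `log_rho` values below are placeholders, not computed from real expectations):

```python
# r_{nk} = rho_{nk} / sum_j rho_{nj}, computed stably in the log domain.
import numpy as np

def responsibilities(log_rho):
    log_rho = log_rho - log_rho.max(axis=1, keepdims=True)  # subtract row max
    rho = np.exp(log_rho)
    return rho / rho.sum(axis=1, keepdims=True)

log_rho = np.array([[-1.0, -2.0],
                    [-0.5, -0.4],
                    [-3.0, -0.1]])       # N = 3 points, K = 2 components
print(responsibilities(log_rho))         # each row sums to 1
```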

Define three statistics wrt the responsibilities

  • For each group $k = 1, \dots, K$, denote $$N_k = \sum_{n=1}^N r_{nk}, \qquad \bar{x}_k = \frac{1}{N_k}\sum_{n=1}^N r_{nk}\, x_n, \qquad S_k = \frac{1}{N_k}\sum_{n=1}^N r_{nk}\,(x_n - \bar{x}_k)(x_n - \bar{x}_k)^\top$$
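
A minimal sketch of these statistics in code, given a responsibility matrix `r` of shape (N, K) and data `X` of shape (N, D) (the function and variable names are my own):

```python
# N_k, xbar_k, and S_k as responsibility-weighted counts, means, and covariances.
import numpy as np

def mixture_statistics(r, X):
    Nk = r.sum(axis=0)                                   # shape (K,)
    xbar = (r.T @ X) / Nk[:, None]                       # shape (K, D)
    K, D = Nk.size, X.shape[1]
    S = np.zeros((K, D, D))
    for k in range(K):
        diff = X - xbar[k]                               # shape (N, D)
        S[k] = (r[:, k, None] * diff).T @ diff / Nk[k]   # weighted covariance
    return Nk, xbar, S
```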

Variational solution for π

  • Optimized factor: $$\log q^\star(\pi) = \log p(\pi) + \mathbb{E}_Z[\log p(Z \mid \pi)] + \text{const} = (\alpha_0 - 1)\sum_{k=1}^K \log \pi_k + \sum_{k=1}^K \sum_{n=1}^N r_{nk} \log \pi_k + \text{const}$$

  • Thus, $q^\star(\pi)$ is a Dirichlet distribution: $$q^\star(\pi) = \mathrm{Dir}(\pi \mid \alpha), \qquad \alpha_k = \alpha_0 + N_k$$

Variational solution for μk,Λk

  • Optimized factor for $(\mu_k, \Lambda_k)$: $$\log q^\star(\mu_k, \Lambda_k) = \mathbb{E}_Z\Big[\sum_{n=1}^N z_{nk} \log \mathcal{N}(x_n \mid \mu_k, \Lambda_k^{-1})\Big] + \log p(\mu_k \mid \Lambda_k) + \log p(\Lambda_k) + \text{const}$$

  • Thus, $q^\star(\mu_k, \Lambda_k)$ is Gaussian-Wishart: $$q^\star(\mu_k \mid \Lambda_k) = \mathcal{N}\big(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1}\big), \qquad q^\star(\Lambda_k) = \mathcal{W}(\Lambda_k \mid W_k, \nu_k)$$

  • The parameters are updated by the data: $$\beta_k = \beta_0 + N_k, \qquad m_k = \frac{1}{\beta_k}\big(\beta_0 m_0 + N_k \bar{x}_k\big), \qquad \nu_k = \nu_0 + N_k$$ $$W_k^{-1} = W_0^{-1} + N_k S_k + \frac{\beta_0 N_k}{\beta_0 + N_k}(\bar{x}_k - m_0)(\bar{x}_k - m_0)^\top$$
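
A minimal sketch of these updates (together with $\alpha_k = \alpha_0 + N_k$ from the previous slide), written to consume the $N_k$, $\bar{x}_k$, $S_k$ statistics defined earlier; the function name and argument order are my own:

```python
# Variational M-step-like updates for the Gaussian mixture:
# alpha_k, beta_k, m_k, nu_k, and W_k from the weighted statistics.
import numpy as np

def update_gaussian_wishart(Nk, xbar, S, alpha0, beta0, m0, W0_inv, nu0):
    alpha = alpha0 + Nk                                  # Dirichlet parameters
    beta = beta0 + Nk
    m = (beta0 * m0 + Nk[:, None] * xbar) / beta[:, None]
    nu = nu0 + Nk
    K, D = xbar.shape
    W = np.zeros((K, D, D))
    for k in range(K):
        diff = (xbar[k] - m0)[:, None]                   # shape (D, 1)
        W_inv = (W0_inv + Nk[k] * S[k]
                 + (beta0 * Nk[k] / (beta0 + Nk[k])) * (diff @ diff.T))
        W[k] = np.linalg.inv(W_inv)                      # store W_k itself
    return alpha, beta, m, nu, W
```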

Similarity between VI and EM solutions

  • Optimization of the variational posterior distribution involves cycling between two stages analogous to the E and M steps of the maximum likelihood EM algorithm

    • Finding $q(Z)$: analogous to the E step; both need to compute the responsibilities
    • Finding $q(\pi, \mu, \Lambda)$: analogous to the M step
  • The VI solution (Bayesian approach) has little computational overhead compared with the EM solution (maximum likelihood approach). The dominant computational costs for VI are

    • Evaluation of the responsibilities
    • Evaluation and inversion of the weighted data covariance matrices

Advantages of the VI solution over the EM solution:

  • Since our priors are conjugate, the variational posterior distributions have the same functional form as the priors
  1. No singularities arise of the kind found in maximum likelihood, where a Gaussian component “collapses” onto a specific data point

    • This is actually an advantage of Bayesian solutions (with priors) over frequentist ones
  2. No overfitting occurs if we choose a large number of components $K$. This is helpful in determining the optimal number of components without performing cross validation

    • For $\alpha_0 < 1$, the prior favors solutions where some of the mixing coefficients $\pi_k$ are zero, so the fit can end up with fewer than $K$ components having nonzero mixing coefficients (see the illustration below)
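
As a practical aside (not from the book), scikit-learn's `BayesianGaussianMixture` implements a variational mixture of Gaussians; the sketch below, with made-up data, illustrates how a small Dirichlet concentration and a deliberately large $K$ leave most components with negligible weight:

```python
# Fit a variational GMM with K = 10 components to data that has only two
# clusters; with a small concentration (alpha_0 < 1), most weights shrink.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 1.0, size=(200, 2)),
               rng.normal(+3.0, 1.0, size=(200, 2))])   # two true clusters

vb = BayesianGaussianMixture(
    n_components=10,                                     # intentionally too many
    weight_concentration_prior_type="dirichlet_distribution",
    weight_concentration_prior=1e-2,                     # small alpha_0
    covariance_type="full",
    max_iter=500,
    random_state=0,
).fit(X)

print(np.round(vb.weights_, 3))    # only ~2 components keep sizable weight
```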

Computing variational lower bound

  • To test for convergence, it is useful to monitor the bound during the re-estimation.
  • At each step of the iterative re-estimation, the value of the lower bound should not decrease: $$\begin{aligned} \mathcal{L} &= \sum_Z \iiint q(Z, \pi, \mu, \Lambda) \log\left\{\frac{p(X, Z, \pi, \mu, \Lambda)}{q(Z, \pi, \mu, \Lambda)}\right\} d\pi \, d\mu \, d\Lambda \\ &= \mathbb{E}[\log p(X, Z, \pi, \mu, \Lambda)] - \mathbb{E}[\log q(Z, \pi, \mu, \Lambda)] \\ &= \mathbb{E}[\log p(X \mid Z, \mu, \Lambda)] + \mathbb{E}[\log p(Z \mid \pi)] + \mathbb{E}[\log p(\pi)] + \mathbb{E}[\log p(\mu, \Lambda)] \\ &\quad - \mathbb{E}[\log q(Z)] - \mathbb{E}[\log q(\pi)] - \mathbb{E}[\log q(\mu, \Lambda)] \end{aligned}$$

Label switching problem

  • The maximum-likelihood EM solution does not suffer from the label switching problem, because the initialization leads to just one of the equivalent solutions

  • In a Bayesian setting, the label switching problem can be an issue, because the marginal posterior is multi-modal.

  • Recall that for multi-modal posteriors, variational inference usually approximates the distribution in the neighborhood of one of the modes and ignores the others

Induced factorizations

  • Induced factorizations: the additional factorizations that are a consequence of the interaction between

    • the assumed factorization, and
    • the conditional independence properties of the true distribution
  • For example, suppose we have three variable groups $A, B, C$

    • We assume the following factorization: $$q(A, B, C) = q(A, B)\, q(C)$$
    • If $A$ and $B$ are conditionally independent given $X$ and $C$, $$A \perp\!\!\!\perp B \mid X, C \quad\Longleftrightarrow\quad p(A, B \mid X, C) = p(A \mid X, C)\, p(B \mid X, C),$$ then we have the induced factorization $q(A, B) = q(A)\, q(B)$, because $$\log q^\star(A, B) = \mathbb{E}_C[\log p(A, B \mid X, C)] + \text{const} = \mathbb{E}_C[\log p(A \mid X, C)] + \mathbb{E}_C[\log p(B \mid X, C)] + \text{const}$$

Variational Linear Regression

Bayesian linear regression

  • Here, I use a notation system commonly used in statistics textbooks, so it differs from the one used in the book.

  • Likelihood function: $$p(y \mid \beta) = \prod_{n=1}^N \mathcal{N}(y_n \mid x_n^\top \beta, \phi^{-1})$$

    • $\phi = 1/\sigma^2$ is the precision parameter. We assume that it is known.
    • $\beta \in \mathbb{R}^p$ includes the intercept
  • Prior distributions (Normal and Gamma): $$p(\beta \mid \kappa) = \mathcal{N}(\beta \mid 0, \kappa^{-1} I), \qquad p(\kappa) = \mathrm{Gam}(\kappa \mid a_0, b_0)$$

Variational solution for κ

  • Variational posterior factorization: $$q(\beta, \kappa) = q(\beta)\, q(\kappa)$$

  • Variational solution for $\kappa$: $$\log q^\star(\kappa) = \log p(\kappa) + \mathbb{E}_\beta[\log p(\beta \mid \kappa)] + \text{const} = (a_0 - 1)\log\kappa - b_0\kappa + \frac{p}{2}\log\kappa - \frac{\kappa}{2}\,\mathbb{E}[\beta^\top \beta] + \text{const}$$

  • The variational posterior is a Gamma: $$q^\star(\kappa) = \mathrm{Gam}(\kappa \mid a_N, b_N), \qquad a_N = a_0 + \frac{p}{2}, \quad b_N = b_0 + \frac{\mathbb{E}[\beta^\top\beta]}{2}$$

Variational solution for β

  • Variational solution for $\beta$: $$\begin{aligned} \log q^\star(\beta) &= \log p(y \mid \beta) + \mathbb{E}_\kappa[\log p(\beta \mid \kappa)] + \text{const} \\ &= -\frac{\phi}{2}\,\|y - X\beta\|^2 - \frac{\mathbb{E}[\kappa]}{2}\,\beta^\top\beta + \text{const} \\ &= -\frac{1}{2}\,\beta^\top\big(\mathbb{E}[\kappa]\, I + \phi\, X^\top X\big)\beta + \phi\, \beta^\top X^\top y + \text{const} \end{aligned}$$

  • The variational posterior is a Normal: $$q^\star(\beta) = \mathcal{N}(\beta \mid m_N, S_N), \qquad S_N = \big(\mathbb{E}[\kappa]\, I + \phi\, X^\top X\big)^{-1}, \quad m_N = \phi\, S_N X^\top y$$

Iteratively re-estimate the variational solutions

  • Required moments of the variational posteriors: $$\mathbb{E}[\kappa] = \frac{a_N}{b_N}, \qquad \mathbb{E}[\beta^\top\beta] = m_N^\top m_N + \mathrm{tr}(S_N)$$

  • The lower bound on $\log p(y)$ can be used to monitor convergence, and also for model selection: $$\begin{aligned} \mathcal{L} &= \mathbb{E}[\log p(\beta, \kappa, y)] - \mathbb{E}[\log q(\beta, \kappa)] \\ &= \mathbb{E}_\beta[\log p(y \mid \beta)] + \mathbb{E}_{\beta,\kappa}[\log p(\beta \mid \kappa)] + \mathbb{E}_\kappa[\log p(\kappa)] - \mathbb{E}_\beta[\log q(\beta)] - \mathbb{E}_\kappa[\log q(\kappa)] \end{aligned}$$
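
A minimal sketch of this re-estimation cycle on simulated data (the design matrix, true coefficients, and hyperparameters below are arbitrary illustrative choices):

```python
# Variational linear regression: alternate the q(beta) and q(kappa) updates.
import numpy as np

rng = np.random.default_rng(0)
N, p, phi = 100, 3, 4.0                        # phi = 1 / sigma^2, assumed known
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])   # with intercept
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(scale=1 / np.sqrt(phi), size=N)

a0, b0 = 1e-2, 1e-2                            # Gamma prior on kappa
E_kappa = a0 / b0

for _ in range(100):
    # q(beta) = N(m_N, S_N)
    S_N = np.linalg.inv(E_kappa * np.eye(p) + phi * X.T @ X)
    m_N = phi * S_N @ X.T @ y
    # q(kappa) = Gam(a_N, b_N)
    a_N = a0 + p / 2
    b_N = b0 + 0.5 * (m_N @ m_N + np.trace(S_N))
    E_kappa = a_N / b_N

print(np.round(m_N, 3))                        # close to beta_true
```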

TO BE CONTINUED

Exponential Family Distributions

Local Variational Methods

Variational Logistic Regression

Expectation Propagation

References

  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.