Variational Inference
Introduction of the variational inference method
Definitions
- Variational inference is also called variational Bayes, thus
- all parameters are viewed as random variables, and
- they will have prior distributions.
- We denote the set of all latent variables and parameters by $\mathbf{Z}$
- Note: the parameter vector no longer appears separately, because it is now a part of $\mathbf{Z}$
- Goal: find approximations for
- the posterior distribution $p(\mathbf{Z} \mid \mathbf{X})$, and
- the marginal likelihood $p(\mathbf{X})$, also called the model evidence
Model evidence equals lower bound plus KL divergence
Goal: We want to find a distribution $q(\mathbf{Z})$ that approximates the posterior distribution $p(\mathbf{Z} \mid \mathbf{X})$. In other words, we want to minimize the KL divergence $\mathrm{KL}(q \,\|\, p)$.
Note the decomposition of the log marginal likelihood
$$\ln p(\mathbf{X}) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p), \qquad \mathcal{L}(q) = \int q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z})}{q(\mathbf{Z})}\, d\mathbf{Z}, \qquad \mathrm{KL}(q \,\|\, p) = -\int q(\mathbf{Z}) \ln \frac{p(\mathbf{Z} \mid \mathbf{X})}{q(\mathbf{Z})}\, d\mathbf{Z}$$
Thus, maximizing the lower bound $\mathcal{L}(q)$ (also called the ELBO) is equivalent to minimizing the KL divergence $\mathrm{KL}(q \,\|\, p)$.
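As a quick numerical check of this identity (my own sketch, not from the book), the snippet below evaluates both sides for a small discrete toy problem; the joint probabilities and the variational distribution $q$ are arbitrary made-up numbers.

```python
import numpy as np

# Toy joint p(x0, z) over 3 latent states for a single fixed observation x0
# (made-up numbers; they need not sum to one).
p_joint = np.array([0.10, 0.25, 0.15])
p_x = p_joint.sum()                      # model evidence p(x0)
p_post = p_joint / p_x                   # posterior p(z | x0)

q = np.array([0.3, 0.4, 0.3])            # an arbitrary variational distribution q(z)

elbo = np.sum(q * np.log(p_joint / q))   # L(q) = E_q[ln p(x0, z) - ln q(z)]
kl = np.sum(q * np.log(q / p_post))      # KL(q || p(z | x0))

# ln p(x0) = L(q) + KL(q || p): both printed values agree.
print(np.log(p_x), elbo + kl)
```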
Mean field family
Goal: restrict the family of distributions $q(\mathbf{Z})$ so that it comprises only tractable distributions, while allowing the family to be sufficiently flexible that it can approximate the posterior distribution well
- Mean field family: partition the elements of $\mathbf{Z}$ into disjoint groups denoted by $\mathbf{Z}_i$, for $i = 1, \dots, M$, and assume $q$ factorizes wrt these groups:
$$q(\mathbf{Z}) = \prod_{i=1}^{M} q_i(\mathbf{Z}_i)$$
- Note: we place no restriction on the functional forms of the individual factors $q_i(\mathbf{Z}_i)$
Solution for mean field families: derivation
We will optimize wrt each factor $q_j(\mathbf{Z}_j)$ in turn.
For $q_j(\mathbf{Z}_j)$, the lower bound (to be maximized) can be decomposed as
$$\mathcal{L}(q) = \int q_j \ln \tilde{p}(\mathbf{X}, \mathbf{Z}_j)\, d\mathbf{Z}_j - \int q_j \ln q_j\, d\mathbf{Z}_j + \text{const}$$
- Here the new distribution $\tilde{p}(\mathbf{X}, \mathbf{Z}_j)$ is defined as
$$\ln \tilde{p}(\mathbf{X}, \mathbf{Z}_j) = \mathbb{E}_{i \neq j}\left[\ln p(\mathbf{X}, \mathbf{Z})\right] + \text{const}$$
Solution for mean field families
A general expression for the optimal solution $q_j^*(\mathbf{Z}_j)$ is
$$\ln q_j^*(\mathbf{Z}_j) = \mathbb{E}_{i \neq j}\left[\ln p(\mathbf{X}, \mathbf{Z})\right] + \text{const}$$
- We can only use this solution in an iterative manner, because the expectation on the right-hand side is computed wrt the other factors $q_i(\mathbf{Z}_i)$, $i \neq j$.
- Convergence is guaranteed because the bound is convex wrt each of the factors $q_i(\mathbf{Z}_i)$
- On the right-hand side we only need to retain those terms that have some functional dependence on $\mathbf{Z}_j$
Example: approximate a bivariate Gaussian using two independent distributions
Target distribution: a bivariate Gaussian
$$p(\mathbf{z}) = \mathcal{N}\left(\mathbf{z} \mid \boldsymbol{\mu}, \Lambda^{-1}\right), \qquad \mathbf{z} = (z_1, z_2)^T, \quad \boldsymbol{\mu} = (\mu_1, \mu_2)^T, \quad \Lambda = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix}$$
We use a factorized form to approximate $p(\mathbf{z})$:
$$q(\mathbf{z}) = q_1(z_1)\, q_2(z_2)$$
Note: we do not assume any functional forms for $q_1(z_1)$ and $q_2(z_2)$
VI solution to the bivariate Gaussian problem
Thus we identify $q_1^*(z_1)$ as a normal, with mean depending on $\mathbb{E}[z_2]$:
$$q_1^*(z_1) = \mathcal{N}\left(z_1 \mid m_1, \Lambda_{11}^{-1}\right), \qquad m_1 = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}\left(\mathbb{E}[z_2] - \mu_2\right)$$
By symmetry, $q_2^*(z_2)$ is also normal; its mean depends on $\mathbb{E}[z_1]$
We treat the above variational solutions as re-estimation equations and cycle through the variables in turn, updating them until some convergence criterion is satisfied
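A minimal sketch of this cycling scheme, assuming illustrative values for $\boldsymbol{\mu}$ and $\Lambda$, is given below; the factor means converge to the true means, while each factor's fixed variance $\Lambda_{ii}^{-1}$ under-estimates the true marginal variance.

```python
import numpy as np

# Target: bivariate Gaussian p(z) = N(z | mu, inv(Lam)); illustrative values.
mu = np.array([1.0, 2.0])
Lam = np.array([[2.0, 0.8],
                [0.8, 1.0]])             # precision matrix

# Factorized approximation q(z) = q1(z1) q2(z2): each optimal factor is normal
# with fixed precision Lam[i, i] and a mean that depends on the other factor.
m1, m2 = 0.0, 0.0                        # initial guesses
for _ in range(100):
    m1_new = mu[0] - (Lam[0, 1] / Lam[0, 0]) * (m2 - mu[1])
    m2_new = mu[1] - (Lam[1, 0] / Lam[1, 1]) * (m1_new - mu[0])
    if max(abs(m1_new - m1), abs(m2_new - m2)) < 1e-10:
        m1, m2 = m1_new, m2_new
        break
    m1, m2 = m1_new, m2_new

print(m1, m2)                            # converges to the true means (1, 2)
# Marginal variance of z1 is inv(Lam)[0, 0] ~ 0.74, but q1 uses 1 / Lam[0, 0] = 0.5.
```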
Visualize VI solution to bivariate Gaussian
Variational inference minimizes $\mathrm{KL}(q \,\|\, p)$: the mean of the approximation is correct, but the variance (along the orthogonal direction) is significantly under-estimated
Expectation propagation minimizes $\mathrm{KL}(p \,\|\, q)$: the solution equals the product of the marginal distributions

Figure 1: Left: variational inference. Right: expectation propagation
Another example to compare $\mathrm{KL}(q \,\|\, p)$ and $\mathrm{KL}(p \,\|\, q)$
- To approximate a mixture of two Gaussians (blue contours)
- Use a single Gaussian (red contours) to approximate it
- By minimizing $\mathrm{KL}(p \,\|\, q)$: figure (a)
- By minimizing $\mathrm{KL}(q \,\|\, p)$: figures (b) and (c) show two local minima
- For a multimodal distribution
- a variational solution will tend to find one of the modes,
- but an expectation propagation solution would lead to a poor predictive distribution (because the average of the two good parameter values is typically itself not a good parameter value); a one-dimensional numerical sketch of the two objectives follows below
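The sketch below fits a single Gaussian to a made-up two-component 1D mixture under both objectives: minimizing $\mathrm{KL}(p \,\|\, q)$ reduces to moment matching and covers both modes, while numerically minimizing $\mathrm{KL}(q \,\|\, p)$ locks onto one mode. All parameter values here are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Target p: a two-component 1D Gaussian mixture (illustrative parameters).
w, mus, sds = np.array([0.5, 0.5]), np.array([-2.0, 2.0]), np.array([0.7, 0.7])

grid = np.linspace(-8, 8, 4001)
dx = grid[1] - grid[0]
p = (w[:, None] * norm.pdf(grid, mus[:, None], sds[:, None])).sum(axis=0)

# Minimizing KL(p || q) over Gaussians q: closed form, moment matching.
m_pq = np.sum(grid * p) * dx
s_pq = np.sqrt(np.sum((grid - m_pq) ** 2 * p) * dx)

# Minimizing KL(q || p): numerical optimization over (mean, log-sd).
def kl_qp(params):
    m, log_s = params
    q = norm.pdf(grid, m, np.exp(log_s))
    return np.sum(q * (norm.logpdf(grid, m, np.exp(log_s)) - np.log(p))) * dx

res = minimize(kl_qp, x0=[1.0, 0.0])     # start near the right-hand mode
m_qp, s_qp = res.x[0], np.exp(res.x[1])

print("KL(p||q) fit:", m_pq, s_pq)       # broad Gaussian covering both modes
print("KL(q||p) fit:", m_qp, s_qp)       # narrow Gaussian sitting on one mode
```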
Example: univariate Gaussian
Suppose the data $\mathbf{x} = (x_1, \dots, x_N)$ follow an iid normal distribution:
$$p(\mathbf{x} \mid \mu, \tau) = \prod_{n=1}^{N} \mathcal{N}\left(x_n \mid \mu, \tau^{-1}\right)$$
The prior distributions are
$$p(\mu \mid \tau) = \mathcal{N}\left(\mu \mid \mu_0, (\lambda_0 \tau)^{-1}\right), \qquad p(\tau) = \mathrm{Gam}(\tau \mid a_0, b_0)$$
Factorized variational approximation:
$$q(\mu, \tau) = q_\mu(\mu)\, q_\tau(\tau)$$
Variational solution for $\mu$
Thus, the variational solution for $\mu$ is a normal distribution:
$$q_\mu(\mu) = \mathcal{N}\left(\mu \mid \mu_N, \lambda_N^{-1}\right), \qquad \mu_N = \frac{\lambda_0 \mu_0 + N\bar{x}}{\lambda_0 + N}, \qquad \lambda_N = (\lambda_0 + N)\,\mathbb{E}[\tau]$$
Variational solution for $\tau$
Thus, the variational solution for $\tau$ is a gamma distribution:
$$q_\tau(\tau) = \mathrm{Gam}(\tau \mid a_N, b_N), \qquad a_N = a_0 + \frac{N+1}{2}, \qquad b_N = b_0 + \frac{1}{2}\,\mathbb{E}_\mu\!\left[\lambda_0(\mu - \mu_0)^2 + \sum_{n=1}^{N}(x_n - \mu)^2\right]$$
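A minimal sketch of these coupled re-estimation equations on simulated data, assuming the conjugate prior setup above with arbitrary weakly informative hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=200)   # simulated data
N, xbar = x.size, x.mean()

# Prior hyperparameters (arbitrary, weakly informative).
mu0, lam0, a0, b0 = 0.0, 1e-3, 1e-3, 1e-3

E_tau = 1.0                                    # initial guess for E[tau]
aN = a0 + (N + 1) / 2                          # fixed across iterations
for _ in range(100):
    # Update q(mu) = N(muN, 1/lamN) given the current E[tau].
    muN = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lamN = (lam0 + N) * E_tau
    # Update q(tau) = Gam(aN, bN) given the current q(mu).
    E_mu, E_mu2 = muN, muN**2 + 1.0 / lamN
    bN = b0 + 0.5 * (np.sum(x**2) - 2 * E_mu * np.sum(x) + N * E_mu2
                     + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0**2))
    E_tau_new = aN / bN
    if abs(E_tau_new - E_tau) < 1e-12:
        E_tau = E_tau_new
        break
    E_tau = E_tau_new

print("E[mu]  =", muN)     # close to the true mean 5
print("E[tau] =", E_tau)   # roughly 1 / 2**2 = 0.25
```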
Visualization of VI solution to univariate normal
Model selection
Model selection (comparison) under variational inference
In addition to making inference on the latent variables and parameters $\mathbf{Z}$, we may also want to compare a set of candidate models, denoted by the index $m$
We should consider the factorization $q(\mathbf{Z}, m) = q(\mathbf{Z} \mid m)\, q(m)$ to approximate the posterior $p(\mathbf{Z}, m \mid \mathbf{X})$
We can maximize the information lower bound $\mathcal{L}$, which is a lower bound of $\ln p(\mathbf{X})$
The maximized $q(m)$ can then be used for model selection
Variational Mixture of Gaussians
Mixture of Gaussians
For each observation $\mathbf{x}_n$, we have a corresponding latent variable $\mathbf{z}_n$, a 1-of-$K$ binary group indicator vector
Mixture of Gaussians joint likelihood, based on the $N$ observations:
$$p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \Lambda) = \prod_{n=1}^{N}\prod_{k=1}^{K} \left[\pi_k\, \mathcal{N}\left(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \Lambda_k^{-1}\right)\right]^{z_{nk}}$$

Figure 2: Graphical representation of the mixture of Gaussians
Conjugate priors
Dirichlet for $\boldsymbol{\pi}$:
$$p(\boldsymbol{\pi}) = \mathrm{Dir}(\boldsymbol{\pi} \mid \boldsymbol{\alpha}_0)$$
Independent Gaussian-Wishart for $(\boldsymbol{\mu}_k, \Lambda_k)$:
$$p(\boldsymbol{\mu}, \Lambda) = \prod_{k=1}^{K} \mathcal{N}\left(\boldsymbol{\mu}_k \mid \mathbf{m}_0, (\beta_0 \Lambda_k)^{-1}\right)\mathcal{W}\left(\Lambda_k \mid \mathbf{W}_0, \nu_0\right)$$
- Usually, the prior mean $\mathbf{m}_0 = \mathbf{0}$ by symmetry
Variational distribution
Joint posterior $p(\mathbf{Z}, \boldsymbol{\pi}, \boldsymbol{\mu}, \Lambda \mid \mathbf{X})$
The variational distribution factorizes between the latent variables and the parameters:
$$q(\mathbf{Z}, \boldsymbol{\pi}, \boldsymbol{\mu}, \Lambda) = q(\mathbf{Z})\, q(\boldsymbol{\pi}, \boldsymbol{\mu}, \Lambda)$$
Variational solution for $\mathbf{Z}$
Optimized factor
$$q^*(\mathbf{Z}) = \prod_{n=1}^{N}\prod_{k=1}^{K} r_{nk}^{z_{nk}}$$
Thus, the factor $q^*(\mathbf{Z})$ takes the same functional form as the prior $p(\mathbf{Z} \mid \boldsymbol{\pi})$
- By $\mathbb{E}[z_{nk}] = r_{nk}$, the posterior mean of $z_{nk}$ is the responsibility $r_{nk}$
Define three statistics wrt the responsibilities
- For each group $k$, denote
$$N_k = \sum_{n=1}^{N} r_{nk}, \qquad \bar{\mathbf{x}}_k = \frac{1}{N_k}\sum_{n=1}^{N} r_{nk}\,\mathbf{x}_n, \qquad \mathbf{S}_k = \frac{1}{N_k}\sum_{n=1}^{N} r_{nk}\left(\mathbf{x}_n - \bar{\mathbf{x}}_k\right)\left(\mathbf{x}_n - \bar{\mathbf{x}}_k\right)^T$$
Variational solution for $\boldsymbol{\pi}$
Optimized factor $q^*(\boldsymbol{\pi})$
Thus, $q^*(\boldsymbol{\pi})$ is a Dirichlet distribution:
$$q^*(\boldsymbol{\pi}) = \mathrm{Dir}(\boldsymbol{\pi} \mid \boldsymbol{\alpha}), \qquad \alpha_k = \alpha_0 + N_k$$
Variational solution for $(\boldsymbol{\mu}_k, \Lambda_k)$
Optimized factor for $(\boldsymbol{\mu}_k, \Lambda_k)$
Thus, $q^*(\boldsymbol{\mu}_k, \Lambda_k)$ is Gaussian-Wishart:
$$q^*(\boldsymbol{\mu}_k, \Lambda_k) = \mathcal{N}\left(\boldsymbol{\mu}_k \mid \mathbf{m}_k, (\beta_k \Lambda_k)^{-1}\right)\mathcal{W}\left(\Lambda_k \mid \mathbf{W}_k, \nu_k\right)$$
Its parameters are updated by the data through the statistics $N_k$, $\bar{\mathbf{x}}_k$, and $\mathbf{S}_k$, as sketched below
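As a sketch of how these updates can be coded (my own illustration, following the update equations in Bishop's book), the function below takes a responsibility matrix from the $q(\mathbf{Z})$ update, which is not shown here, and returns the updated Dirichlet and Gaussian-Wishart parameters; the prior hyperparameters are placeholders supplied by the caller.

```python
import numpy as np

def vb_gmm_update(X, r, alpha0, beta0, m0, W0_inv, nu0):
    """Update q(pi) and q(mu_k, Lambda_k) given responsibilities r (shape N x K)."""
    N, D = X.shape
    K = r.shape[1]
    Nk = r.sum(axis=0)                                   # N_k
    xbar = (r.T @ X) / Nk[:, None]                       # xbar_k, shape K x D
    alpha = alpha0 + Nk                                  # Dirichlet parameters
    beta = beta0 + Nk
    nu = nu0 + Nk
    m = (beta0 * m0 + Nk[:, None] * xbar) / beta[:, None]
    W_inv = np.empty((K, D, D))
    for k in range(K):
        diff = X - xbar[k]                               # N x D
        Sk = (r[:, k, None] * diff).T @ diff / Nk[k]     # S_k
        d0 = (xbar[k] - m0)[:, None]
        W_inv[k] = (W0_inv + Nk[k] * Sk
                    + (beta0 * Nk[k] / (beta0 + Nk[k])) * (d0 @ d0.T))
    return alpha, beta, m, nu, W_inv
```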
Similarity between VI and EM solutions
Optimization of the variational posterior distribution involves cycling between two stages analogous to the E and M steps of the maximum likelihood EM algorithm
- Finding $q(\mathbf{Z})$: analogous to the E step; both need to compute the responsibilities
- Finding $q(\boldsymbol{\pi}, \boldsymbol{\mu}, \Lambda)$: analogous to the M step
The VI solution (Bayesian approach) has little computational overhead compared with the EM solution (maximum likelihood approach). The dominant computational costs for VI are
- Evaluation of the responsibilities
- Evaluation and inversion of the weighted data covariance matrices
Advantage of the VI solution over the EM solution:
- Since our priors are conjugate, the variational posterior distributions have the same functional form as the priors
No singularities arise in the variational (Bayesian) treatment, unlike in maximum likelihood, where a Gaussian component can "collapse" onto a specific data point
- This is an advantage of Bayesian solutions (with priors) over frequentist ones
No overfitting arises if we choose a large number of components $K$. This is helpful for determining the optimal number of components without performing cross validation
- For $\alpha_0 < 1$, the prior favors solutions where some of the mixing coefficients are zero, and thus can result in fewer than $K$ components having nonzero mixing coefficients (see the scikit-learn sketch below)
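scikit-learn's `BayesianGaussianMixture` implements a variational mixture of Gaussians along these lines; the sketch below, on simulated data with arbitrary settings, illustrates how a small Dirichlet concentration parameter prunes unused components when the number of components is deliberately set too large.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Simulated data from 3 well-separated clusters.
X = np.vstack([rng.normal(c, 0.5, size=(200, 2)) for c in ([-4, 0], [0, 4], [4, 0])])

# Deliberately over-specify 10 components; a small Dirichlet concentration
# favors solutions in which many mixing coefficients are driven toward zero.
vbgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_distribution",
    weight_concentration_prior=1e-3,
    max_iter=500,
    random_state=0,
).fit(X)

print(np.round(vbgmm.weights_, 3))       # only ~3 components keep appreciable weight
print(vbgmm.converged_, vbgmm.lower_bound_)
```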
Computing variational lower bound
- To test for convergence, it is useful to monitor the bound during the re-estimation.
- At each step of the iterative re-estimation, the value of the lower bound should not decrease
Label switching problem
The EM solution under maximum likelihood does not have the label switching problem, because the initialization leads to just one of the equivalent solutions
In a Bayesian setting, the label switching problem can be an issue, because the marginal posterior is multi-modal.
Recall that for multi-modal posteriors, variational inference usually approximates the distribution in the neighborhood of one of the modes and ignores the others
Induced factorizations
Induced factorizations: the additional factorizations that are a consequence of the interaction between
- the assumed factorization, and
- the conditional independence properties of the true distribution
For example, suppose we have three variable groups $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$
- We assume the factorization $q(\mathbf{A}, \mathbf{B}, \mathbf{C}) = q(\mathbf{A}, \mathbf{B})\, q(\mathbf{C})$
- If $\mathbf{A}$ and $\mathbf{B}$ are conditionally independent given the data and $\mathbf{C}$, then we have the induced factorization $q(\mathbf{A}, \mathbf{B}) = q(\mathbf{A})\, q(\mathbf{B})$
Variational Linear Regression
Bayesian linear regression
Here, I use a notation system commonly used in statistics textbooks, so it is different from the one used in this book.
Likelihood function
- The noise precision parameter is assumed known.
- The coefficient vector includes the intercept.
Prior distributions: Normal Gamma
Variational solution
Variational posterior factorization: the variational posterior factorizes between the coefficient vector and the prior precision of the coefficients
Variational solution for the prior precision
The variational posterior is a Gamma distribution
Variational solution for the coefficient vector
The variational posterior is a Normal distribution
Iteratively re-estimate the variational solutions
Means of the variational posteriors
The lower bound of the log marginal likelihood can be used for convergence monitoring, and also for model selection
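Since the notation used in this part is not reproduced here, the sketch below falls back on the setup of Bishop Section 10.3 (known noise precision, a normal prior on the coefficients, and a gamma prior on the coefficient-prior precision) with simulated data; it iterates the two variational updates until the posterior mean of that precision stabilizes.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, M - 1))])   # design with intercept
w_true = np.array([1.0, 2.0, -0.5])
beta = 4.0                                                       # known noise precision
y = X @ w_true + rng.normal(scale=1 / np.sqrt(beta), size=N)

a0, b0 = 1e-3, 1e-3        # gamma prior on alpha, the precision of the coefficient prior
E_alpha = 1.0
aN = a0 + M / 2            # fixed across iterations
for _ in range(100):
    SN = np.linalg.inv(E_alpha * np.eye(M) + beta * X.T @ X)     # covariance of q(w)
    mN = beta * SN @ X.T @ y                                     # mean of q(w)
    bN = b0 + 0.5 * (mN @ mN + np.trace(SN))                     # update for q(alpha)
    E_alpha_new = aN / bN
    if abs(E_alpha_new - E_alpha) < 1e-10:
        E_alpha = E_alpha_new
        break
    E_alpha = E_alpha_new

print(mN)          # posterior mean of the coefficients, close to w_true
print(E_alpha)     # posterior mean of the prior precision alpha
```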
TO BE CONTINUED
Exponential Family Distributions
Local Variational Methods
Variational Logistic Regression
Expectation Propagation
References
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.