Book Notes: Pattern Recognition and Machine Learning -- Ch9 Mixture Models and EM Algorithm

K-means Clustering vs Mixtures of Gaussians

K-means clustering

K-means clustering: problem

  • Data

    • $D$-dimensional observations: $\mathbf{x}_1, \ldots, \mathbf{x}_N$
  • Parameters

    • $K$ cluster means: $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K$
    • Binary indicator $r_{nk} \in \{0, 1\}$: whether observation $n$ is assigned to cluster $k$
  • Goal: find values for $\{\boldsymbol{\mu}_k\}$ and $\{r_{nk}\}$ that minimize the objective function (called a distortion measure) $$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2$$

K-means clustering: solution

  • Two-stage optimization

    • Update $r_{nk}$ and $\boldsymbol{\mu}_k$ alternately, and repeat until convergence (a minimal sketch of both steps follows this list)
    • Resembles the E step and M step in the EM algorithm
  1. E (expectation) step: update $r_{nk}$

    • Assign the $n$th data point to the closest cluster center: $$r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \lVert \mathbf{x}_n - \boldsymbol{\mu}_j \rVert^2 \\ 0 & \text{otherwise} \end{cases}$$
  2. M (maximization) step: update $\boldsymbol{\mu}_k$

    • Set each cluster mean to the mean of all data points assigned to that cluster: $$\boldsymbol{\mu}_k = \frac{\sum_n r_{nk} \mathbf{x}_n}{\sum_n r_{nk}}$$
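
The two updates above translate into a few lines of NumPy. The sketch below is a minimal illustration, not the book's pseudocode; the function name, the random initialization, and the iteration cap are my own choices.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: alternate the E-like (assignment) and M-like (mean) updates."""
    rng = np.random.default_rng(seed)
    # Initialize the K means with randomly chosen data points.
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # E step: hard-assign each point to its closest cluster center (r_nk).
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        assign = d2.argmin(axis=1)
        # M step: set each mean to the average of the points assigned to it.
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):   # converged: the means no longer move
            break
        mu = new_mu
    return mu, assign
```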

Mixture of Gaussians

Mixture of Gaussians: definition

  • Mixture of Gaussians: log likelihood $$\log p(\mathbf{x}) = \log \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \Sigma_k) \right\}$$

  • Introduce a $K$-dimensional latent indicator variable $\mathbf{z} \in \{0, 1\}^K$, where $z_k = 1$ if $\mathbf{x}$ is from the $k$-th Gaussian component

The marginal distribution of $\mathbf{z}$ is multinomial: $p(z_k = 1) = \pi_k$

  • We call the posterior probability $p(z_k = 1 \mid \mathbf{x})$ the responsibility that component $k$ takes for explaining the observation $\mathbf{x}$ (computed numerically in the sketch below): $$\gamma(z_k) = p(z_k = 1 \mid \mathbf{x}) = \frac{\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \Sigma_j)}$$
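
As a concrete companion to the formula, here is a minimal NumPy/SciPy sketch that evaluates the responsibilities for a batch of observations; the variable names (`pi`, `mus`, `Sigmas`) are illustrative, not from the book.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mus, Sigmas):
    """gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    K = len(pi)
    # Weighted component densities pi_k N(x_n | mu_k, Sigma_k), shape (N, K).
    weighted = np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
        for k in range(K)
    ])
    # Normalize each row so the K responsibilities of every observation sum to one.
    return weighted / weighted.sum(axis=1, keepdims=True)
```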

Mixture of Gaussians: singularity problem with MLE

  • Problem with maximum likelihood estimation: presence of singularities. Whenever a mixture component collapses onto a single data point, the corresponding covariance matrix shrinks toward zero and the likelihood diverges to infinity (see the short derivation after this list)

    • Therefore, when finding the MLE, we should avoid such singular solutions and instead seek well-behaved local maxima of the likelihood function: see the following EM approach

    • Alternatively, we can adopt a Bayesian approach
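
To make the singularity concrete (a quick sketch of the standard argument, not a passage from the notes): suppose component $k$ sits exactly on one data point, $\boldsymbol{\mu}_k = \mathbf{x}_n$, with isotropic covariance $\Sigma_k = \sigma_k^2 I$. That component's contribution to the density at $\mathbf{x}_n$ is $$\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \mathbf{x}_n, \sigma_k^2 I) = \frac{\pi_k}{(2\pi \sigma_k^2)^{D/2}} \longrightarrow \infty \quad \text{as } \sigma_k \to 0,$$ while the remaining data points are still explained by the other components, so the log likelihood can be made arbitrarily large.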

Illustration of singularities

Figure 1: Illustration of singularities

Conditional MLE of μk

  • Suppose we observe $N$ data points $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$
  • Similarly, we write the $N$ latent variables as $\mathbf{Z} = \{\mathbf{z}_1, \ldots, \mathbf{z}_N\}$

  • Setting the derivative of $\log p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \Sigma)$ with respect to $\boldsymbol{\mu}_k$ to zero gives $$0 = -\sum_{n=1}^{N} \gamma(z_{nk}) \, \Sigma_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k)$$ Then we obtain $$\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, \mathbf{x}_n$$ where $N_k$ is the effective number of points assigned to cluster $k$: $$N_k = \sum_{n=1}^{N} \gamma(z_{nk})$$
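
Filling in the intermediate step (a quick check, not spelled out in the notes): multiplying the zero-gradient condition through by $\Sigma_k$ and rearranging gives $$\sum_{n=1}^{N} \gamma(z_{nk}) \, \boldsymbol{\mu}_k = \sum_{n=1}^{N} \gamma(z_{nk}) \, \mathbf{x}_n \quad\Longrightarrow\quad \boldsymbol{\mu}_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) \, \mathbf{x}_n}{\sum_{n=1}^{N} \gamma(z_{nk})} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, \mathbf{x}_n$$ so each mean is a responsibility-weighted average of the data points.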

Conditional MLE of Σk and πk

  • Similarly, setting the derivative of the log likelihood with respect to $\Sigma_k$ to zero, we have $$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\top}$$

  • Use a Lagrange multiplier to maximize the log likelihood with respect to $\pi_k$ under the constraint that all $\pi_k$ add up to one: $$\log p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)$$ We get the solution (derived just after this list) $$\pi_k = \frac{N_k}{N}$$

  • The above results for $\boldsymbol{\mu}_k$, $\Sigma_k$, $\pi_k$ do not constitute a closed-form solution, because the responsibilities $\gamma(z_{nk})$ depend on those parameters in a complex way.
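
For completeness, here is the short derivation of $\pi_k = N_k / N$ (a sketch of the standard Lagrange-multiplier argument): setting the derivative of the Lagrangian with respect to $\pi_k$ to zero gives $$\sum_{n=1}^{N} \frac{\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \Sigma_k)}{\sum_{j} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \Sigma_j)} + \lambda = 0$$ Multiplying through by $\pi_k$ turns the sum into $\sum_n \gamma(z_{nk}) = N_k$, so $N_k + \lambda \pi_k = 0$. Summing over $k$ and using $\sum_k \pi_k = 1$ gives $\lambda = -N$, hence $\pi_k = N_k / N$.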

EM algorithm for mixture of Gaussians

  1. Initialize $\boldsymbol{\mu}_k$, $\Sigma_k$, $\pi_k$, usually using the K-means algorithm.

  2. E step: compute responsibilities using the current parameters $$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \Sigma_j)}$$

  3. M step: re-estimate the parameters using the current responsibilities, where $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$: $$\boldsymbol{\mu}_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, \mathbf{x}_n$$ $$\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})^{\top}$$ $$\pi_k^{\text{new}} = \frac{N_k}{N}$$

  4. Check for convergence of either the parameters or the log likelihood. If not converged, return to step 2.
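
The four steps above fit in a compact NumPy/SciPy loop. The sketch below is an illustrative implementation under my own choices (random initialization instead of K-means, a small ridge added to each covariance to guard against the singularities discussed earlier, and a fixed iteration cap), not the book's code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=200, tol=1e-6, seed=0):
    """EM for a mixture of Gaussians: alternate responsibility (E) and parameter (M) updates."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # Step 1: initialize pi, mu, Sigma (the notes suggest K-means; random points used here).
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(D) for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(n_iters):
        # Step 2 (E step): gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / normalizer.
        weighted = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)
        ])
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # Step 3 (M step): re-estimate the parameters from the responsibilities.
        Nk = gamma.sum(axis=0)                    # effective number of points per cluster
        mu = (gamma.T @ X) / Nk[:, None]          # new means
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
        # Step 4: check convergence of the (incomplete-data) log likelihood.
        ll = np.log(weighted.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma, gamma
```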

Connection between K-means and Gaussian mixture model

  • K-means algorithm itself is often used to initialize the parameters in a Gaussian mixture model before applying the EM algorithm

  • Mixture of Gaussians: soft assignment of data points to clusters, using posterior probabilities

  • K-means can be viewed as a limiting special case of a mixture of Gaussians in which the covariances of the mixture components are all $\epsilon I$, where $\epsilon$ is a variance parameter shared by all components.

    • In the responsibility calculation, $$\gamma(z_{nk}) = \frac{\pi_k \exp\{-\lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2 / 2\epsilon\}}{\sum_j \pi_j \exp\{-\lVert \mathbf{x}_n - \boldsymbol{\mu}_j \rVert^2 / 2\epsilon\}}$$ In the limit $\epsilon \to 0$, for each observation $n$, the responsibilities $\{\gamma(z_{nk}), k = 1, \ldots, K\}$ have exactly one entry equal to one and all the rest equal to zero, recovering the hard assignments of K-means.
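
A one-line justification of the limit (my paraphrase of the standard argument): let $j^{*} = \arg\min_j \lVert \mathbf{x}_n - \boldsymbol{\mu}_j \rVert^2$ and divide the numerator and denominator by $\exp\{-\lVert \mathbf{x}_n - \boldsymbol{\mu}_{j^{*}} \rVert^2 / 2\epsilon\}$. Every term with $j \neq j^{*}$ then carries a factor $\exp\{-(\lVert \mathbf{x}_n - \boldsymbol{\mu}_j \rVert^2 - \lVert \mathbf{x}_n - \boldsymbol{\mu}_{j^{*}} \rVert^2)/2\epsilon\} \to 0$ as $\epsilon \to 0$, so $\gamma(z_{nk}) \to 1$ for $k = j^{*}$ and $\gamma(z_{nk}) \to 0$ otherwise (assuming a unique closest mean and nonzero mixing coefficients).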

EM Algorithm

The general EM algorithm

EM algorithm: definition

  • Goal: maximize the likelihood $p(\mathbf{X} \mid \boldsymbol{\theta})$ with respect to the parameters $\boldsymbol{\theta}$, for models having latent variables $\mathbf{Z}$.

  • Notations
    • $\mathbf{X}$: observed data; also called incomplete data
    • $\boldsymbol{\theta}$: model parameters
    • $\mathbf{Z}$: latent variables; usually each observation has a latent variable
    • $\{\mathbf{X}, \mathbf{Z}\}$ is called the complete data
  • Log likelihood $$\log p(\mathbf{X} \mid \boldsymbol{\theta}) = \log \left\{ \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) \right\}$$
    • The sum over $\mathbf{Z}$ can be replaced by an integral if $\mathbf{Z}$ is continuous
    • The presence of the sum prevents the logarithm from acting directly on the joint distribution. This complicates the MLE solution, even when the complete-data distribution belongs to the exponential family.

General EM algorithm: two-stage iterative optimization

  1. Choose the initial parameters $\boldsymbol{\theta}^{\text{old}}$

  2. E step: since the conditional posterior $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})$ contains all of our knowledge about the latent variables $\mathbf{Z}$, we compute the expected complete-data log likelihood under it: $$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) = \mathbb{E}_{\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}} \left[ \log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) \right] = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}) \log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$$

  3. M step: revise the parameter estimate $$\boldsymbol{\theta}^{\text{new}} = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$$
    • Note that in the M step the logarithm acts directly on the joint likelihood $p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$, so the maximization is typically tractable.
  4. Check for convergence of the log likelihood or the parameter values. If not converged, replace $\boldsymbol{\theta}^{\text{old}}$ with $\boldsymbol{\theta}^{\text{new}}$ and return to step 2.
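
The loop structure is the same for any latent-variable model; the Gaussian-mixture algorithm above is one concrete instance. Below is a generic skeleton where the callables `e_step` and `m_step`, the tuple-of-arrays parameterization, and the convergence test are illustrative placeholders of my own, not notation from the book.

```python
import numpy as np

def em(X, theta_init, e_step, m_step, n_iters=100, tol=1e-6):
    """Generic EM loop.

    e_step(X, theta) -> statistics of p(Z | X, theta) needed for Q (e.g. responsibilities)
    m_step(X, stats) -> new theta maximizing the expected complete-data log likelihood Q
    theta is a tuple of NumPy arrays, e.g. (pi, mu, Sigma) for a Gaussian mixture.
    """
    theta = theta_init
    for _ in range(n_iters):
        stats = e_step(X, theta)         # E step: posterior over the latent variables
        theta_new = m_step(X, stats)     # M step: argmax_theta Q(theta, theta_old)
        # Convergence check on the parameters (the log likelihood could be used instead).
        if all(np.allclose(a, b, atol=tol) for a, b in zip(theta, theta_new)):
            break
        theta = theta_new
    return theta
```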

Gaussian mixtures revisited

  • Recall the latent variables $\mathbf{Z} \in \{0, 1\}^{N \times K}$, with $z_{nk} = 1$ if $\mathbf{x}_n$ is from the $k$-th Gaussian component
  • Complete-data log likelihood $$\log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \Sigma, \boldsymbol{\pi}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \log \pi_k + \log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \Sigma_k) \right\}$$
    • Comparing this with the incomplete-data log likelihood above, the sum over $k$ and the logarithm are interchanged, so the logarithm now acts directly on the Gaussian density.

Mixture of Gaussians, treating latent variables as observed

Figure 2: Mixture of Gaussians, treating latent variables as observed

Continue: Gaussian mixtures revisited

  • Conditional posterior of $\mathbf{Z}$: $$p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\mu}, \Sigma, \boldsymbol{\pi}) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} \left[ \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \Sigma_k) \right]^{z_{nk}}$$ Thus, under the conditional posterior, the $\{\mathbf{z}_n\}$ are independent

  • Conditional expectations $$\mathbb{E}_{\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\mu}^{\text{old}}, \Sigma^{\text{old}}, \boldsymbol{\pi}^{\text{old}}} \left[ z_{nk} \right] = \gamma(z_{nk})^{\text{old}}$$

  • Thus the objective function in the M step is $$\mathbb{E}_{\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\mu}^{\text{old}}, \Sigma^{\text{old}}, \boldsymbol{\pi}^{\text{old}}} \left[ \log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \Sigma, \boldsymbol{\pi}) \right] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk})^{\text{old}} \left\{ \log \pi_k + \log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \Sigma_k) \right\}$$
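
As a numeric companion to this expression, the expected complete-data log likelihood can be evaluated directly from a responsibility matrix. A minimal sketch (the variable names `gamma`, `pi`, `mus`, `Sigmas` are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def expected_complete_data_ll(X, gamma, pi, mus, Sigmas):
    """Q = sum_n sum_k gamma_nk { log pi_k + log N(x_n | mu_k, Sigma_k) }."""
    K = len(pi)
    # log pi_k + log N(x_n | mu_k, Sigma_k), shape (N, K).
    log_comp = np.column_stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mus[k], cov=Sigmas[k])
        for k in range(K)
    ])
    # Weight each term by the responsibility computed with the old parameters.
    return float((gamma * log_comp).sum())
```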