Mixture Models & EM
Nicholas Ruozzi, University of Texas at Dallas
Based on the slides of Vibhav Gogate
Previously
- We looked at $k$-means and hierarchical clustering as mechanisms for unsupervised learning
- $k$-means was simply a block coordinate descent scheme for a specific objective function
Today: how to learn probabilistic models for unsupervised learning problems
EM: Soft Clustering
- Clustering (e.g., $k$-means) typically assumes that each instance is given a hard assignment to exactly one cluster
- Does not allow uncertainty in class membership or for an instance to belong to more than one cluster
- Problematic because data points that lie roughly midway between cluster centers are assigned to one cluster
- Soft clustering gives probabilities that an instance belongs to each of a set of clusters
Probabilistic Clustering
- Try a probabilistic model!
- Allows overlaps, clusters of different size, etc.
- Can tell a generative story for the data: $p(x \mid y)\, p(y)$
- Challenge: we need to estimate model parameters without labeled $y$'s (i.e., in the unsupervised setting)

  Z    X1     X2
  ?    0.1    2.1
  ?    0.5   -1.1
  ?    0.0    3.0
  ?   -0.1   -2.0
  ?    0.2    1.5
Probabilistic Clustering
- Clusters of different shapes and sizes
- Clusters can overlap! ($k$-means doesn't allow this)
Finite Mixture Models
- Given a data set: $x^{(1)}, \dots, x^{(M)}$
- Mixture model: $\Theta = \{\lambda_1, \dots, \lambda_k, \theta_1, \dots, \theta_k\}$

  $$ p(x \mid \Theta) = \sum_{y=1}^{k} \lambda_y\, p_y(x \mid \theta_y) $$

- $p_y(x \mid \theta_y)$ is a mixture component from some family of probability distributions parameterized by $\theta_y$, and the $\lambda_y \ge 0$ with $\sum_y \lambda_y = 1$ are the mixture weights
- We can think of $\lambda_y = p(Y = y \mid \Theta)$ for some random variable $Y$ that takes values in $\{1, \dots, k\}$
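To make the generative story concrete, here is a minimal numpy sketch (mine, not from the slides) that samples from a univariate mixture of two Gaussians; the weights and component parameters are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture: weights lambda_y and per-component Gaussian parameters theta_y.
weights = np.array([0.3, 0.7])   # lambda_1, lambda_2 (nonnegative, sum to 1)
means = np.array([-2.0, 3.0])    # mu_1, mu_2
stds = np.array([0.5, 1.0])      # sigma_1, sigma_2

def sample_mixture(n):
    # Generative story: first draw the component label Y ~ p(y) = lambda_y, ...
    y = rng.choice(len(weights), size=n, p=weights)
    # ... then draw x ~ p_y(x | theta_y) from the chosen component.
    x = rng.normal(loc=means[y], scale=stds[y])
    return x, y

x, y = sample_mixture(1000)
print(x[:5], y[:5])
```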
Finite Mixture Models
- Uniform mixture of 3 Gaussians
Multivariate Gaussian
- A $d$-dimensional multivariate Gaussian distribution is defined by a $d \times d$ covariance matrix $\Sigma$ and a mean vector $\mu$

  $$ p(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right) $$

- The covariance matrix describes the degree to which pairs of variables vary together
- The diagonal elements correspond to the variances of the individual variables
Multivariate Gaussian
- A $d$-dimensional multivariate Gaussian distribution is defined by a $d \times d$ covariance matrix $\Sigma$ and a mean vector $\mu$

  $$ p(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right) $$

- The covariance matrix must be a symmetric positive definite matrix in order for the above to make sense
- Positive definite: all eigenvalues are positive & the matrix is invertible
- Ensures that the exponent, $-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)$, is a concave quadratic function of $x$
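A small numpy sketch (not from the slides) of the density above; the Cholesky factorization both evaluates the quadratic form stably and fails if $\Sigma$ is not symmetric positive definite.

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density p(x | mu, Sigma)."""
    d = len(mu)
    # Cholesky succeeds only if Sigma is symmetric positive definite.
    L = np.linalg.cholesky(Sigma)
    diff = x - mu
    # Solve L z = diff, so that z^T z = diff^T Sigma^{-1} diff (the quadratic form).
    z = np.linalg.solve(L, diff)
    quad = z @ z
    # det(Sigma) = prod(diag(L))^2, so sqrt((2 pi)^d det Sigma) = (2 pi)^(d/2) * prod(diag(L)).
    norm_const = (2 * np.pi) ** (d / 2) * np.prod(np.diag(L))
    return np.exp(-0.5 * quad) / norm_const

# Example with assumed parameters:
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
print(gaussian_density(np.array([0.1, -0.2]), mu, Sigma))
```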
Gaussian Mixture Models (GMMs)
- We can define a GMM by choosing the $k$-th component of the mixture to be a Gaussian density with parameters $\theta_k = (\mu_k, \Sigma_k)$

  $$ p(x \mid \mu_k, \Sigma_k) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma_k}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right) $$

- We could cluster by fitting a mixture of Gaussians to our data
- How do we learn these kinds of models?
Learning Gaussian Parameters
- MLE for a supervised univariate Gaussian

  $$ \mu_{MLE} = \frac{1}{M} \sum_{i=1}^{M} x^{(i)} \qquad \sigma^2_{MLE} = \frac{1}{M} \sum_{i=1}^{M} \left(x^{(i)} - \mu_{MLE}\right)^2 $$

- MLE for a supervised multivariate Gaussian

  $$ \mu_{MLE} = \frac{1}{M} \sum_{i=1}^{M} x^{(i)} \qquad \Sigma_{MLE} = \frac{1}{M} \sum_{i=1}^{M} \left(x^{(i)} - \mu_{MLE}\right)\left(x^{(i)} - \mu_{MLE}\right)^T $$
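A numpy sketch of the supervised multivariate estimates, assuming the $\frac{1}{M}$ normalization written above; the function and variable names are mine.

```python
import numpy as np

def gaussian_mle(X):
    """MLE for a multivariate Gaussian given data X of shape (M, d)."""
    M = X.shape[0]
    mu = X.mean(axis=0)            # (1/M) sum_i x^(i)
    diff = X - mu
    Sigma = (diff.T @ diff) / M    # (1/M) sum_i (x^(i) - mu)(x^(i) - mu)^T
    return mu, Sigma

X = np.random.default_rng(1).normal(size=(500, 2))
mu_hat, Sigma_hat = gaussian_mle(X)
print(mu_hat, Sigma_hat)
```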
Learning Gaussian Parameters
- MLE for a supervised multivariate mixture of Gaussian distributions

  $$ \mu_{k,MLE} = \frac{1}{|M_k|} \sum_{i \in M_k} x^{(i)} \qquad \Sigma_{k,MLE} = \frac{1}{|M_k|} \sum_{i \in M_k} \left(x^{(i)} - \mu_{k,MLE}\right)\left(x^{(i)} - \mu_{k,MLE}\right)^T $$

- The sums are over $M_k$, the set of observations that were generated by the $k$-th mixture component (this requires that we know which points were generated by which distribution!)
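In code, the supervised mixture estimate just applies the single-Gaussian MLE separately to the points carrying each label; this sketch (mine) also estimates the mixture weights as the empirical fractions of points per component, which is not written on the slide but is the standard companion estimate.

```python
import numpy as np

def supervised_mixture_mle(X, y, k):
    """Per-component Gaussian MLE when each point's generating component y^(i) is known."""
    M, d = X.shape
    lambdas, mus, Sigmas = [], [], []
    for c in range(k):
        Xc = X[y == c]                      # M_c: observations generated by component c
        mu_c = Xc.mean(axis=0)
        diff = Xc - mu_c
        Sigmas.append(diff.T @ diff / len(Xc))
        mus.append(mu_c)
        lambdas.append(len(Xc) / M)         # fraction of points drawn from component c
    return np.array(lambdas), np.array(mus), np.array(Sigmas)
```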
The Unsupervised Case
- What if our observations do not include information about which of the mixture components generated them?
- Consider a joint probability distribution over data points, $x^{(i)}$, and mixture assignments, $y \in \{1, \dots, k\}$

  $$ \arg\max_\Theta \prod_{i=1}^{M} p(x^{(i)} \mid \Theta) = \arg\max_\Theta \prod_{i=1}^{M} \sum_{y=1}^{k} p(x^{(i)}, Y = y \mid \Theta) = \arg\max_\Theta \prod_{i=1}^{M} \sum_{y=1}^{k} p(x^{(i)} \mid Y = y, \Theta)\, p(Y = y \mid \Theta) $$

- We only know how to compute the probabilities for each mixture component
The Unsupervised Case
- In the case of a Gaussian mixture model

  $$ p(x^{(i)} \mid Y = y, \Theta) = \mathcal{N}(x^{(i)} \mid \mu_y, \Sigma_y) \qquad p(Y = y \mid \Theta) = \lambda_y $$

- Differentiating the MLE objective yields a system of equations that is difficult to solve in general
- The solution: modify the objective to make the optimization easier
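For reference, this is what the unsupervised objective looks like numerically; the sketch below (my own, using scipy's multivariate_normal and logsumexp) evaluates $\log \prod_i p(x^{(i)} \mid \Theta)$ for a given GMM.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, lambdas, mus, Sigmas):
    """log l(Theta) = sum_i log sum_y lambda_y N(x^(i) | mu_y, Sigma_y)."""
    M, k = X.shape[0], len(lambdas)
    log_p = np.empty((M, k))
    for y in range(k):
        log_p[:, y] = np.log(lambdas[y]) + multivariate_normal.logpdf(X, mus[y], Sigmas[y])
    # log-sum-exp over components for each data point, then sum over the data.
    return logsumexp(log_p, axis=1).sum()
```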
Expectation Maximization
Jensen's Inequality
- For a convex function $f: \mathbb{R}^n \to \mathbb{R}$, any $a_1, \dots, a_k \in [0,1]$ such that $a_1 + \dots + a_k = 1$, and any $x^{(1)}, \dots, x^{(k)} \in \mathbb{R}^n$,

  $$ a_1 f(x^{(1)}) + \dots + a_k f(x^{(k)}) \ge f\left(a_1 x^{(1)} + \dots + a_k x^{(k)}\right) $$

- The inequality is reversed for concave functions
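As a concrete instance of the concave case, which is the one used in the EM derivation on the next slide, take $f = \log$:

$$ \log\left(\sum_{y=1}^{k} a_y z_y\right) \;\ge\; \sum_{y=1}^{k} a_y \log z_y, \qquad a_y \ge 0,\;\; \sum_{y} a_y = 1,\;\; z_y > 0. $$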
EM Algorithm

  $$ \log \ell(\Theta) = \sum_{i=1}^{M} \log \sum_{y=1}^{k} p(x^{(i)}, Y = y \mid \Theta) = \sum_{i=1}^{M} \log \sum_{y=1}^{k} q_i(y)\, \frac{p(x^{(i)}, Y = y \mid \Theta)}{q_i(y)} $$
  $$ \ge \sum_{i=1}^{M} \sum_{y=1}^{k} q_i(y) \log \frac{p(x^{(i)}, Y = y \mid \Theta)}{q_i(y)} \equiv F(\Theta, q) $$

- The inequality follows from Jensen's inequality applied to the concave $\log$ function
- $q_i(y)$ is an arbitrary positive probability distribution over the mixture components
EM Algorithm

  $$ \arg\max_{\Theta,\, q_1, \dots, q_M} \sum_{i=1}^{M} \sum_{y=1}^{k} q_i(y) \log \frac{p(x^{(i)}, Y = y \mid \Theta)}{q_i(y)} $$

- This objective is not jointly concave in $\Theta$ and $q_1, \dots, q_M$
- The best we can hope for is a local maximum (and there could be A LOT of them)
- The EM algorithm is a block coordinate ascent scheme that finds a local optimum of this objective
- Start from an initialization $\Theta^0$ and $q_1^0, \dots, q_M^0$
EM Algorithm
- E step: with the parameters $\Theta$ fixed, maximize the objective over $q$

  $$ q^{t+1} \in \arg\max_{q_1, \dots, q_M} \sum_{i=1}^{M} \sum_{y=1}^{k} q_i(y) \log \frac{p(x^{(i)}, Y = y \mid \Theta^t)}{q_i(y)} $$

- Using the method of Lagrange multipliers for the constraint that $\sum_y q_i(y) = 1$ gives

  $$ q_i^{t+1}(y) = p(Y = y \mid X = x^{(i)}, \Theta^t) $$
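A sketch of that Lagrangian calculation for a single data point $i$ ($\nu$ is my name for the multiplier; only the terms involving $q_i$ are kept):

$$ \mathcal{L}(q_i, \nu) = \sum_{y} q_i(y)\left[\log p(x^{(i)}, Y = y \mid \Theta^t) - \log q_i(y)\right] + \nu\left(1 - \sum_{y} q_i(y)\right) $$

Setting $\partial \mathcal{L} / \partial q_i(y) = 0$ gives $\log p(x^{(i)}, Y = y \mid \Theta^t) - \log q_i(y) - 1 - \nu = 0$, so $q_i(y) \propto p(x^{(i)}, Y = y \mid \Theta^t)$; normalizing so that $\sum_y q_i(y) = 1$ yields

$$ q_i^{t+1}(y) = \frac{p(x^{(i)}, Y = y \mid \Theta^t)}{\sum_{y'} p(x^{(i)}, Y = y' \mid \Theta^t)} = p(Y = y \mid X = x^{(i)}, \Theta^t). $$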
EM Algorithm
- M step: with the $q$'s fixed, maximize the objective over $\Theta$

  $$ \Theta^{t+1} \in \arg\max_{\Theta} \sum_{i=1}^{M} \sum_{y=1}^{k} q_i^{t+1}(y) \log \frac{p(x^{(i)}, Y = y \mid \Theta)}{q_i^{t+1}(y)} $$

- For the case of GMMs, we can compute this update in closed form
- This is not necessarily the case for every model; it may require gradient ascent
EM Algorithm
- Start with random parameters
- E-step: maximizes a lower bound on the log-sum for fixed parameters
- M-step: solves an MLE estimation problem for fixed probabilities
- Iterate between the E-step and M-step until convergence
EM for Gaussian Mixtures
- E-step:

  $$ q_i^t(y) = \frac{\lambda_y^t\, p(x^{(i)} \mid \mu_y^t, \Sigma_y^t)}{\sum_{y'} \lambda_{y'}^t\, p(x^{(i)} \mid \mu_{y'}^t, \Sigma_{y'}^t)} $$

  Here $p(x^{(i)} \mid \mu_y^t, \Sigma_y^t)$ is the probability of $x^{(i)}$ under the appropriate multivariate normal distribution, and the denominator is the probability of $x^{(i)}$ under the mixture model

- M-step:

  $$ \mu_y^{t+1} = \frac{\sum_i q_i^t(y)\, x^{(i)}}{\sum_i q_i^t(y)} \qquad \Sigma_y^{t+1} = \frac{\sum_i q_i^t(y)\, \left(x^{(i)} - \mu_y^{t+1}\right)\left(x^{(i)} - \mu_y^{t+1}\right)^T}{\sum_i q_i^t(y)} \qquad \lambda_y^{t+1} = \frac{1}{M} \sum_i q_i^t(y) $$
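Putting the E-step and M-step together, here is a compact numpy/scipy sketch of EM for a GMM; the initialization, the fixed iteration count, and the small ridge added to each covariance are my choices, not part of the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=100, seed=0):
    """Fit a k-component Gaussian mixture to X (shape (M, d)) with EM."""
    rng = np.random.default_rng(seed)
    M, d = X.shape
    # Simple initialization: uniform weights, random data points as means, identity covariances.
    lambdas = np.full(k, 1.0 / k)
    mus = X[rng.choice(M, size=k, replace=False)]
    Sigmas = np.array([np.eye(d) for _ in range(k)])

    for _ in range(n_iters):
        # E-step: responsibilities q_i(y) proportional to lambda_y * N(x^(i) | mu_y, Sigma_y).
        q = np.empty((M, k))
        for y in range(k):
            q[:, y] = lambdas[y] * multivariate_normal.pdf(X, mus[y], Sigmas[y])
        q /= q.sum(axis=1, keepdims=True)

        # M-step: weighted MLE updates.
        Nk = q.sum(axis=0)                   # effective number of points per component
        lambdas = Nk / M
        mus = (q.T @ X) / Nk[:, None]
        for y in range(k):
            diff = X - mus[y]
            # Small ridge for numerical stability (my addition, not in the slide updates).
            Sigmas[y] = (q[:, y, None] * diff).T @ diff / Nk[y] + 1e-6 * np.eye(d)

    return lambdas, mus, Sigmas
```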
Gaussian Mixture Example: Start
After first iteration
After 2nd iteration
After 3rd iteration
After 4th iteration
After 5th iteration
After 6th iteration
After 20th iteration
Properties of EM
- EM converges to a local optimum
- This is because each iteration improves the log-likelihood
- Proof is the same as for $k$-means (just block coordinate ascent)
- The E-step can never decrease the likelihood
- The M-step can never decrease the likelihood
- If we make hard assignments instead of soft ones, the algorithm is equivalent to $k$-means!