Mixture Models & EM. Nicholas Ruozzi, University of Texas at Dallas. Based on the slides of Vibhav Gogate.



Previously
We looked at k-means and hierarchical clustering as mechanisms for unsupervised learning. k-means was simply a block coordinate descent scheme for a specific objective function. Today: how to learn probabilistic models for unsupervised learning problems.

EM: Soft Clustering
Clustering (e.g., k-means) typically assumes that each instance is given a hard assignment to exactly one cluster. This does not allow uncertainty in class membership, or for an instance to belong to more than one cluster, which is problematic because data points that lie roughly midway between cluster centers get forced into a single cluster. Soft clustering instead gives the probabilities that an instance belongs to each of a set of clusters.

Probabilistic Clustering
Try a probabilistic model! It allows overlaps, clusters of different sizes, etc., and we can tell a generative story for the data: $p(x \mid y)\,p(y)$. Challenge: we need to estimate the model parameters without labeled $y$'s (i.e., in the unsupervised setting).

  Z     X1     X2
  ??    0.1    2.1
  ??    0.5   -1.1
  ??    0.0    3.0
  ??   -0.1   -2.0
  ??    0.2    1.5

Probabilistic Clustering
Clusters can have different shapes and sizes, and clusters can overlap! (k-means doesn't allow this.)

Finite Mixture Models
Given a dataset $x^{(1)}, \dots, x^{(M)}$, a mixture model with parameters $\Theta = \{\lambda_1, \dots, \lambda_k, \theta_1, \dots, \theta_k\}$ is
$$p(x \mid \Theta) = \sum_{y=1}^{k} \lambda_y \, p_y(x \mid \theta_y)$$
where $p_y(x \mid \theta_y)$ is a mixture component from some family of probability distributions parameterized by $\theta_y$, and the mixture weights satisfy $\lambda_y \ge 0$ and $\sum_y \lambda_y = 1$. We can think of $\lambda_y = p(Y = y \mid \Theta)$ for some random variable $Y$ that takes values in $\{1, \dots, k\}$.
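To make the generative story concrete, the following is a minimal NumPy sketch (not from the slides; the mixture parameters are made-up illustrative values) that samples from a finite mixture: first draw a component $y$ with probability $\lambda_y$, then draw $x$ from that component's distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up mixture parameters: k = 3 Gaussian components in d = 2 dimensions.
lambdas = np.array([0.5, 0.3, 0.2])                         # mixture weights, sum to 1
mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])       # component means
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2), np.eye(2)])  # component covariances

def sample_mixture(M):
    """Draw M samples (x, y): pick a component y ~ lambda, then x ~ N(mu_y, Sigma_y)."""
    ys = rng.choice(len(lambdas), size=M, p=lambdas)
    xs = np.array([rng.multivariate_normal(mus[y], Sigmas[y]) for y in ys])
    return xs, ys

X, Y = sample_mixture(500)   # in the unsupervised setting, Y would be hidden
```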

Finite Mixture Models
[Figure: a uniform mixture of 3 Gaussians]

Multivariate Gaussian
A $d$-dimensional multivariate Gaussian distribution is defined by a $d \times d$ covariance matrix $\Sigma$ and a mean vector $\mu$:
$$p(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$
The covariance matrix describes the degree to which pairs of variables vary together; the diagonal elements correspond to the variances of the individual variables.
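As a sanity check on the formula, here is a small NumPy sketch (the helper name `gaussian_pdf` is mine, not from the slides) that evaluates the density directly from $\mu$ and $\Sigma$.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density of a d-dimensional Gaussian N(mu, Sigma) evaluated at the point x."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    return norm_const * np.exp(-0.5 * quad)

# Example: a standard 2-d Gaussian evaluated at the origin gives 1/(2*pi) ~ 0.159.
print(gaussian_pdf(np.zeros(2), np.zeros(2), np.eye(2)))
```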

Multivariate Gaussian
A $d$-dimensional multivariate Gaussian distribution is defined by a $d \times d$ covariance matrix $\Sigma$ and a mean vector $\mu$:
$$p(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$
The covariance matrix must be a symmetric positive definite matrix in order for the above to make sense. Positive definite: all eigenvalues are positive and the matrix is invertible. This ensures that the quadratic form in the exponent, $-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)$, is concave.

Gaussian Mixture Models (GMMs)
We can define a GMM by choosing the $k$-th component of the mixture to be a Gaussian density with parameters $\theta_k = (\mu_k, \Sigma_k)$:
$$p(x \mid \mu_k, \Sigma_k) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma_k}} \exp\left(-\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)\right)$$
We could cluster by fitting a mixture of Gaussians to our data. How do we learn these kinds of models?

Learning Gaussian Parameters
MLE for a supervised univariate Gaussian:
$$\mu_{MLE} = \frac{1}{M}\sum_{i} x^{(i)} \qquad \sigma^2_{MLE} = \frac{1}{M-1}\sum_{i} \left(x^{(i)} - \mu_{MLE}\right)^2$$
MLE for a supervised multivariate Gaussian:
$$\mu_{MLE} = \frac{1}{M}\sum_{i} x^{(i)} \qquad \Sigma_{MLE} = \frac{1}{M-1}\sum_{i} \left(x^{(i)} - \mu_{MLE}\right)\left(x^{(i)} - \mu_{MLE}\right)^T$$
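A minimal NumPy sketch of these supervised estimates (following the slide's $1/(M-1)$ normalization for the covariance; the function and variable names are mine):

```python
import numpy as np

def gaussian_mle(X):
    """X is an (M, d) array of i.i.d. samples; returns the mean and covariance estimates."""
    M = X.shape[0]
    mu = X.mean(axis=0)                  # (1/M) * sum_i x^(i)
    diff = X - mu
    Sigma = diff.T @ diff / (M - 1)      # (1/(M-1)) * sum_i (x^(i) - mu)(x^(i) - mu)^T
    return mu, Sigma

X = np.random.default_rng(1).normal(size=(1000, 2))
mu_hat, Sigma_hat = gaussian_mle(X)      # should be close to 0 and the identity
```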

Learning Gaussian Parameters
MLE for a supervised multivariate mixture of Gaussian distributions:
$$\mu_{k,MLE} = \frac{1}{|M_k|}\sum_{i \in M_k} x^{(i)} \qquad \Sigma_{k,MLE} = \frac{1}{|M_k|}\sum_{i \in M_k} \left(x^{(i)} - \mu_{k,MLE}\right)\left(x^{(i)} - \mu_{k,MLE}\right)^T$$
The sums are over $M_k$, the set of observations that were generated by the $k$-th mixture component (this requires that we know which points were generated by which distribution!).
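In the supervised case the labels tell us which points go into which sum, so the fit is just the previous estimator applied to each group of points. A hedged sketch (names and conventions are mine):

```python
import numpy as np

def supervised_mixture_mle(X, y, k):
    """Fit one Gaussian per component using known labels y in {0, ..., k-1}."""
    mus, Sigmas, lambdas = [], [], []
    for c in range(k):
        Xc = X[y == c]                     # the observations generated by component c
        mu = Xc.mean(axis=0)
        diff = Xc - mu
        mus.append(mu)
        Sigmas.append(diff.T @ diff / len(Xc))
        lambdas.append(len(Xc) / len(X))   # fraction of the data drawn from component c
    return np.array(mus), np.array(Sigmas), np.array(lambdas)
```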

The Unsupervised Case
What if our observations do not include information about which of the mixture components generated them? Consider a joint probability distribution over data points, $x^{(i)}$, and mixture assignments, $y \in \{1, \dots, k\}$:
$$\arg\max_{\Theta} \prod_{i} p(x^{(i)} \mid \Theta) = \arg\max_{\Theta} \prod_{i} \sum_{y=1}^{k} p(x^{(i)}, Y = y \mid \Theta) = \arg\max_{\Theta} \prod_{i} \sum_{y=1}^{k} p(x^{(i)} \mid Y = y, \Theta)\, p(Y = y \mid \Theta)$$
We only know how to compute the probabilities for each mixture component.
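The quantity being maximized is a log of a sum over components for every data point. Here is a small sketch (mine, not the course's code) of how that log-likelihood is typically evaluated, working in log space with SciPy's `logsumexp` for numerical stability:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_likelihood(X, lambdas, mus, Sigmas):
    """sum_i log sum_y lambda_y * N(x^(i) | mu_y, Sigma_y) for an (M, d) data array X."""
    # log_p[i, y] = log lambda_y + log N(x^(i) | mu_y, Sigma_y)
    log_p = np.stack([np.log(lam) + multivariate_normal.logpdf(X, mean=mu, cov=Sig)
                      for lam, mu, Sig in zip(lambdas, mus, Sigmas)], axis=1)
    return logsumexp(log_p, axis=1).sum()
```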

The Unsupervised Case
In the case of a Gaussian mixture model,
$$p(x^{(i)} \mid Y = y, \Theta) = \mathcal{N}(x^{(i)} \mid \mu_y, \Sigma_y) \qquad p(Y = y \mid \Theta) = \lambda_y$$
Differentiating the MLE objective yields a system of equations that is difficult to solve in general. The solution: modify the objective to make the optimization easier.

Expectation Maximization

Jensen's Inequality
For a convex function $f: \mathbb{R}^n \to \mathbb{R}$, any $a_1, \dots, a_k \in [0,1]$ such that $a_1 + \dots + a_k = 1$, and any $x^{(1)}, \dots, x^{(k)} \in \mathbb{R}^n$,
$$a_1 f\big(x^{(1)}\big) + \dots + a_k f\big(x^{(k)}\big) \ge f\big(a_1 x^{(1)} + \dots + a_k x^{(k)}\big)$$
The inequality is reversed for concave functions.
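A quick numerical check of both directions of the inequality (the reversed, concave case is the one the EM derivation below relies on), using the convex $f(x) = x^2$ and the concave $\log$:

```python
import numpy as np

a = np.array([0.3, 0.7])          # convex weights, sum to 1
x = np.array([1.0, 4.0])

# Convex f(x) = x^2:  a1*f(x1) + a2*f(x2) >= f(a1*x1 + a2*x2)
print((a * x**2).sum(), ">=", (a @ x) ** 2)        # 11.5 >= 9.61

# Concave log, inequality reversed:  sum_y a_y*log(x_y) <= log(sum_y a_y*x_y)
print((a * np.log(x)).sum(), "<=", np.log(a @ x))  # ~0.97 <= ~1.13
```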

EM Algorithm
$$\log \ell(\Theta) = \sum_{i=1}^{M} \log \sum_{y=1}^{k} p\big(x^{(i)}, Y = y \mid \Theta\big) = \sum_{i=1}^{M} \log \sum_{y=1}^{k} q_i(y)\, \frac{p\big(x^{(i)}, Y = y \mid \Theta\big)}{q_i(y)}$$
$$\ge \sum_{i=1}^{M} \sum_{y=1}^{k} q_i(y) \log \frac{p\big(x^{(i)}, Y = y \mid \Theta\big)}{q_i(y)} \;\equiv\; F(\Theta, q)$$
Here $q_i(y)$ is an arbitrary positive probability distribution over the mixture components, and the lower bound follows from Jensen's inequality applied to the concave $\log$ function.

EM Algorithm
$$\arg\max_{\Theta,\, q_1, \dots, q_M} \sum_{i=1}^{M} \sum_{y=1}^{k} q_i(y) \log \frac{p\big(x^{(i)}, Y = y \mid \Theta\big)}{q_i(y)}$$
This objective is not jointly concave in $\Theta$ and $q_1, \dots, q_M$, so the best we can hope for is a local maximum (and there could be A LOT of them). The EM algorithm is a block coordinate ascent scheme that finds a local optimum of this objective, starting from an initialization $\Theta^0$ and $q_1^0, \dots, q_M^0$.
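To make the bound tangible, here is a hedged sketch that evaluates $F(\Theta, q)$ for an arbitrary $q$ and checks that it never exceeds the true log-likelihood; all names and toy parameters below are mine, not the course's.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def joint_log_prob(X, lambdas, mus, Sigmas):
    """log p(x^(i), Y=y | Theta) as an (M, k) array."""
    return np.stack([np.log(lam) + multivariate_normal.logpdf(X, mean=mu, cov=Sig)
                     for lam, mu, Sig in zip(lambdas, mus, Sigmas)], axis=1)

def lower_bound(X, lambdas, mus, Sigmas, q):
    """F(Theta, q) = sum_i sum_y q_i(y) * log( p(x^(i), Y=y | Theta) / q_i(y) )."""
    log_joint = joint_log_prob(X, lambdas, mus, Sigmas)
    return (q * (log_joint - np.log(q))).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
lambdas = np.array([0.5, 0.5])
mus = np.array([[0.0, 0.0], [1.0, 1.0]])
Sigmas = np.array([np.eye(2), np.eye(2)])

q = rng.random((100, 2))
q /= q.sum(axis=1, keepdims=True)      # arbitrary positive distributions q_i
loglik = logsumexp(joint_log_prob(X, lambdas, mus, Sigmas), axis=1).sum()
print(lower_bound(X, lambdas, mus, Sigmas, q), "<=", loglik)   # the bound holds for any q
```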

EM Algorithm
E step: with the $\theta$'s fixed, maximize the objective over $q$:
$$q^{t+1} \in \arg\max_{q_1, \dots, q_M} \sum_{i=1}^{M} \sum_{y=1}^{k} q_i(y) \log \frac{p\big(x^{(i)}, Y = y \mid \Theta^t\big)}{q_i(y)}$$
Using the method of Lagrange multipliers for the constraint that $\sum_y q_i(y) = 1$ gives
$$q_i^{t+1}(y) = p\big(Y = y \mid X = x^{(i)}, \Theta^t\big)$$
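A sketch of the Lagrange-multiplier computation that the slide skips over (for a single fixed $i$; the terms not involving $q_i$ drop out of the objective):
$$\mathcal{L}(q_i, \beta) = \sum_{y} \Big( q_i(y) \log p\big(x^{(i)}, Y = y \mid \Theta^t\big) - q_i(y) \log q_i(y) \Big) + \beta\Big(\sum_y q_i(y) - 1\Big)$$
Setting $\partial \mathcal{L} / \partial q_i(y) = \log p(x^{(i)}, Y = y \mid \Theta^t) - \log q_i(y) - 1 + \beta = 0$ gives $q_i(y) \propto p(x^{(i)}, Y = y \mid \Theta^t)$; normalizing so that $\sum_y q_i(y) = 1$ yields exactly the posterior $q_i^{t+1}(y) = p(Y = y \mid X = x^{(i)}, \Theta^t)$.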

EM Algorithm
M step: with the $q$'s fixed, maximize the objective over $\Theta$:
$$\Theta^{t+1} \in \arg\max_{\Theta} \sum_{i=1}^{M} \sum_{y=1}^{k} q_i^{t+1}(y) \log \frac{p\big(x^{(i)}, Y = y \mid \Theta\big)}{q_i^{t+1}(y)}$$
For the case of GMMs, we can compute this update in closed form. This is not necessarily the case for every model; the M step may require gradient ascent.

EM Algorithm
Start with random parameters. The E-step maximizes a lower bound on the log-sum for fixed parameters; the M-step solves an MLE estimation problem for fixed probabilities. Iterate between the E-step and M-step until convergence.

EM for Gaussian Mixtures
E-step:
$$q_i^t(y) = \frac{\lambda_y^t \, p\big(x^{(i)} \mid \mu_y^t, \Sigma_y^t\big)}{\sum_{y'} \lambda_{y'}^t \, p\big(x^{(i)} \mid \mu_{y'}^t, \Sigma_{y'}^t\big)}$$
The numerator weights the probability of $x^{(i)}$ under the appropriate multivariate normal distribution by $\lambda_y^t$; the denominator is the probability of $x^{(i)}$ under the mixture model.
M-step:
$$\mu_y^{t+1} = \frac{\sum_i q_i^t(y)\, x^{(i)}}{\sum_i q_i^t(y)} \qquad \Sigma_y^{t+1} = \frac{\sum_i q_i^t(y)\, \big(x^{(i)} - \mu_y^{t+1}\big)\big(x^{(i)} - \mu_y^{t+1}\big)^T}{\sum_i q_i^t(y)} \qquad \lambda_y^{t+1} = \frac{1}{M} \sum_i q_i^t(y)$$
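Putting the E-step and M-step together, here is a compact NumPy implementation of these updates. It is a sketch under my own naming and initialization choices (means picked from random data points, identity covariances, uniform weights), not the course's reference code:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=100, seed=0):
    """Fit a k-component Gaussian mixture to the (M, d) array X with EM."""
    rng = np.random.default_rng(seed)
    M, d = X.shape

    # Initialization Theta^0: means at random data points, identity covariances, uniform weights.
    mus = X[rng.choice(M, size=k, replace=False)]
    Sigmas = np.array([np.eye(d) for _ in range(k)])
    lambdas = np.full(k, 1.0 / k)

    for _ in range(n_iters):
        # E-step: q_i(y) = lambda_y N(x^(i)|mu_y,Sigma_y) / sum_y' lambda_y' N(x^(i)|mu_y',Sigma_y')
        dens = np.stack([lambdas[y] * multivariate_normal.pdf(X, mean=mus[y], cov=Sigmas[y])
                         for y in range(k)], axis=1)          # shape (M, k)
        q = dens / dens.sum(axis=1, keepdims=True)

        # M-step: weighted MLE updates from the slide.
        Nk = q.sum(axis=0)                                    # sum_i q_i(y), shape (k,)
        lambdas = Nk / M                                      # lambda_y = (1/M) sum_i q_i(y)
        mus = (q.T @ X) / Nk[:, None]                         # weighted means
        for y in range(k):
            diff = X - mus[y]
            Sigmas[y] = (q[:, y, None] * diff).T @ diff / Nk[y] + 1e-6 * np.eye(d)

    return lambdas, mus, Sigmas, q
```

The small `1e-6` ridge added to each covariance is a common guard against a component collapsing onto a single data point, which would make its covariance singular.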

Gaussian Mixture Example
[Figures: the Gaussian mixture fit at the start and after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations of EM.]

Properties of EM
EM converges to a local optimum. This is because each iteration improves the log-likelihood: the E-step can never decrease the likelihood, and the M-step can never decrease the likelihood (the proof is the same as for k-means, since EM is just block coordinate ascent). If we make hard assignments instead of soft ones, the algorithm is equivalent to k-means!