On the Convergence of Bound Optimization Algorithms

Ruslan Salakhutdinov, Sam Roweis
University of Toronto, 6 King's College Rd, M5S 3G4, Canada
{rsalakhu,roweis}@cs.toronto.edu

Zoubin Ghahramani
Gatsby Computational Neuroscience Unit, University College London
17 Queen Square, London WC1N 3AR, UK
zoubin@gatsby.ucl.ac.uk

Abstract

Many practitioners who use EM and related algorithms complain that they are sometimes slow. When does this happen, and what can be done about it? In this paper, we study the general class of bound optimization algorithms, including EM, Iterative Scaling, Non-negative Matrix Factorization and CCCP, and their relationship to direct optimization algorithms such as gradient-based methods for parameter learning. We derive a general relationship between the updates performed by bound optimization methods and those of gradient and second-order methods, and identify analytic conditions under which bound optimization algorithms exhibit quasi-Newton behavior, and conditions under which they possess poor, first-order convergence. Based on this analysis, we consider several specific algorithms, interpret and analyze their convergence properties, and provide some recipes for preprocessing input to these algorithms to yield faster convergence behavior. We report empirical results supporting our analysis and showing that simple data preprocessing can result in dramatically improved performance of bound optimizers in practice.

1 Bound Optimization Algorithms

Many problems in machine learning and pattern recognition ultimately reduce to the optimization of a scalar valued function $L(\Theta)$ of a free parameter vector $\Theta$. For example, in (supervised and unsupervised) probabilistic modeling the objective function may be the (conditional) data likelihood or the posterior over parameters. In discriminative learning we may use a classification or regression score; in reinforcement learning we may use average discounted reward. Optimization may also arise during inference; for example, we may want to reduce the cross entropy between two distributions or minimize a function such as the Bethe free energy.

Bound optimization (BO) algorithms take advantage of the fact that many objective functions arising in practice have a special structure. We can often exploit this structure to obtain a bound on the objective function and proceed by optimizing this bound. Ideally, we seek a bound that is valid everywhere in parameter space, easily optimized, and equal to the true objective function at one (or more) point(s). A general form of a bound maximizer which iteratively lower bounds the objective function is given below:

General Bound Optimizer for maximizing $L(\Theta)$:
Assume: $G(\Theta, \Psi)$ such that for any $\Theta$ and $\Psi$:
1. $G(\Theta, \Theta) = L(\Theta)$ and $L(\Theta) \ge G(\Theta, \Psi)$ for all $\Theta, \Psi$;
2. $\arg\max_\Theta G(\Theta, \Psi)$ can be found easily for any $\Psi$.
Iterate: $\Theta^{t+1} = \arg\max_\Theta G(\Theta, \Theta^t)$
Guarantee: $L(\Theta^{t+1}) = G(\Theta^{t+1}, \Theta^{t+1}) \ge G(\Theta^{t+1}, \Theta^t) \ge G(\Theta^t, \Theta^t) = L(\Theta^t)$

A bound optimizer does nothing more than coordinate ascent in the functional $G(\Theta, \Psi)$, alternating between maximizing $G$ with respect to $\Psi$ for fixed $\Theta$ and with respect to $\Theta$ for fixed $\Psi$. These algorithms enjoy a strong guarantee: they never worsen the objective function. Many popular iterative algorithms are bound optimizers, including the EM algorithm for maximum likelihood learning in latent variable models [2], iterative scaling (IS) algorithms for parameter estimation in maximum entropy models [1], non-negative matrix factorization (NMF) [3], and the recent CCCP algorithm for minimizing the Bethe free energy in approximate inference problems [10]. In this paper we explore two questions of theoretical and practical interest: when will bound optimization be fast or slow relative to other standard approaches, and what can be done to improve convergence rates of these algorithms when they are slow?
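As a concrete (if minimal) sketch of this scheme, the following Python loop captures the iterate-and-guarantee structure; the helper `argmax_G`, standing in for step 2, is our own illustrative assumption, not part of any specific algorithm:

```python
def bound_optimizer(argmax_G, L, theta0, tol=1e-10, max_iter=1000):
    """Generic bound optimizer: repeatedly maximize the bound G(., theta).

    argmax_G : maps the current iterate Psi to argmax_Theta G(Theta, Psi)
    L        : the true objective, used here only to monitor progress
    """
    theta = theta0
    for _ in range(max_iter):
        theta_new = argmax_G(theta)   # maximize the bound that touches L at theta
        # monotone guarantee: L(new) >= G(new, old) >= G(old, old) = L(old)
        if L(theta_new) - L(theta) < tol:
            return theta_new
        theta = theta_new
    return theta
```

EM, GIS, NMF and CCCP in the sections below are all instances of this loop, differing only in the choice of bound $G$.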
2 Convergence Behavior and Analysis

How large are the steps that bound optimization methods take? Any bound optimizer implicitly defines a mapping:
$M: \Theta \mapsto \Theta'$, from parameter space to itself, so that $\Theta^{t+1} = M(\Theta^t)$. If the iterates $\Theta^t$ converge to a fixed point $\Theta^*$, then $\Theta^* = M(\Theta^*)$. If $M(\Theta)$ is continuous and differentiable, we can Taylor expand it in the neighborhood of the fixed point $\Theta^*$:

$$\Theta^{t+1} - \Theta^* \approx M'(\Theta^*)\,(\Theta^t - \Theta^*) \quad (1)$$

where $M'(\Theta^*) = \frac{\partial M}{\partial\Theta}\big|_{\Theta=\Theta^*}$. Since $M'(\Theta^*)$ is typically nonzero, a bound optimizer can essentially be seen as a linear iteration algorithm with convergence rate matrix $M'(\Theta^*)$. Near a local optimum, this matrix is related to the curvature of the functional $G(\Theta, \Psi)$:

$$\lim_{\Theta^t \to \Theta^*} M'(\Theta^t) = -\nabla^2 G(\Theta^*, \Psi^*)\,\big[\nabla^2 G(\Theta^*)\big]^{-1} \quad (2)$$

where we define the mixed partials and Hessian as:

$$\nabla^2 G(\Theta^*, \Psi^*) = \frac{\partial^2 G(\Theta, \Psi)}{\partial\Theta\,\partial\Psi^T}\bigg|_{\Theta=\Theta^*,\,\Psi=\Theta^*} \quad (3) \qquad \nabla^2 G(\Theta^*) = \frac{\partial^2 G(\Theta, \Psi)}{\partial\Theta\,\partial\Theta^T}\bigg|_{\Theta=\Theta^*,\,\Psi=\Theta^*} \quad (4)$$

We assume we can easily find $\arg\max_\Theta G(\Theta, \Psi)$, and thus that $\nabla^2 G(\Theta^*)$ is negative definite (invertible).

Proof sketch of eq. (2): Performing a Taylor series expansion of $\nabla G(\Theta_2, \Theta_1) = \frac{\partial G(\Theta, \Theta_1)}{\partial\Theta}\big|_{\Theta=\Theta_2}$ around $(\Theta^*, \Theta^*)$, we have:

$$\nabla G(\Theta_2, \Theta_1) = \nabla G(\Theta^*, \Theta^*) + (\Theta_2 - \Theta^*)^T\nabla^2 G(\Theta^*) + (\Theta_1 - \Theta^*)^T\nabla^2 G(\Theta^*, \Psi^*) + \dots$$

Substituting $\Theta^t$ for $\Theta_1$ and $M(\Theta^t)$ for $\Theta_2$, and noting that the gradient of the bound vanishes both at its maximizer $M(\Theta^t)$ and at the fixed point, gives

$$0 = \big(M(\Theta^t) - \Theta^*\big)^T\nabla^2 G(\Theta^*) + (\Theta^t - \Theta^*)^T\nabla^2 G(\Theta^*, \Psi^*) + \dots$$

In the limit, $\Theta^* = M(\Theta^*)$ and $0 = \lim_{\Theta^t\to\Theta^*} M'(\Theta^t)\,\nabla^2 G(\Theta^*) + \nabla^2 G(\Theta^*, \Psi^*)$.

What directions do bound optimizers move in, in parameter space? For most objective functions, the BO step $\Theta^{(t+1)} - \Theta^{(t)}$ and the true gradient vector $\nabla L(\Theta^t) = \frac{\partial L(\Theta)}{\partial\Theta}\big|_{\Theta=\Theta^t}$ can be related by a transformation matrix $P(\Theta^t)$ that changes at each iteration:

$$\Theta^{(t+1)} - \Theta^{(t)} = P(\Theta^t)\,\nabla L(\Theta^t) \quad (5)$$

Under certain conditions, this transformation matrix $P(\Theta^t)$ is guaranteed to be positive definite with respect to the gradient. In particular, if

C1: $G(\Theta, \Theta^t)$ is well-defined and differentiable everywhere in $\Theta$; and
C2: for any fixed $\Theta^t \neq \Theta^{(t+1)}$, $G(\Theta, \Theta^t)$ has only a single critical point along any direction, located at the maximum $\Theta^{t+1}$;

then

$$\nabla L(\Theta^t)^T\,P(\Theta^t)\,\nabla L(\Theta^t) > 0 \quad \forall\,\Theta^t \quad (6)$$

The second condition may seem very strong; however, it is satisfied in many practical cases. For example, for the EM algorithm it is satisfied whenever the M-step has a single unique solution (in particular, it holds for exponential family models due to the concavity of $G(\Theta, \Theta^t)$); for GIS, NMF, CCCP, and many others, it is satisfied due to the concavity of $G(\Theta, \Theta^t)$ (although C2 does not imply concavity).

Proof sketch of eq. (6): Consider $\nabla G(\Theta^t)^T\big(\Theta^{(t+1)} - \Theta^{(t)}\big)$, where $\nabla G(\Theta^t) = \frac{\partial G(\Theta, \Theta^t)}{\partial\Theta}\big|_{\Theta=\Theta^t}$; projected onto the step, this is the directional derivative of $G(\Theta, \Theta^t)$ in the direction of $\Theta^{(t+1)} - \Theta^{(t)}$. C1 and C2 together imply that this quantity is positive: otherwise, by the Mean Value Theorem (C1), $G(\Theta, \Theta^t)$ would have a critical point along some direction located at a point other than $\Theta^{t+1}$, contradicting C2. Using the identity $\nabla L(\Theta^t) = \frac{\partial G(\Theta, \Theta^t)}{\partial\Theta}\big|_{\Theta=\Theta^t}$ (the bound touches the objective at $\Theta^t$), we have $\nabla L(\Theta^t)^T P(\Theta^t)\,\nabla L(\Theta^t) = \nabla G(\Theta^t)^T\big(\Theta^{(t+1)} - \Theta^{(t)}\big) > 0$.

The important consequence of the above analysis is that when the bound function has a unique optimum, BO has the appealing quality of always taking a step $\Theta^{(t+1)} - \Theta^{(t)}$ with positive projection onto the true gradient of the objective function $L(\Theta^t)$. This makes BO similar to a first order method operating on the gradient of a locally reshaped likelihood function. For maximum likelihood learning of a mixture of Gaussians model using the EM algorithm, this positive definite transformation matrix $P(\Theta^t)$ was first described by Xu and Jordan [9]. We have extended their results by deriving the explicit form of the transformation matrix for several other latent variable models such as Factor Analysis (FA), Probabilistic Principal Component Analysis (PPCA), mixtures of PPCAs, mixtures of FAs, and Hidden Markov Models [5]; we have also derived the general form of the $P(\Theta^t)$ matrix for exponential family models in terms of natural parameters. One can further study the structure of the transformation matrix $P(\Theta^t)$ and relate it to the convergence rate matrix $M'$.
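The positive-projection property (6) is easy to verify numerically. Below is a small self-contained check (our illustration, not code from the paper) for EM on a two-component, unit-variance, equal-weight mixture of Gaussians in which only the means are learned; here the M-step and the gradient have simple closed forms:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(2, 1, 200)])

def resp(mu):
    """Posterior responsibilities for a 2-component, unit-variance MoG."""
    log_p = -0.5 * (x[:, None] - mu[None, :]) ** 2
    log_p -= log_p.max(axis=1, keepdims=True)     # stabilize the softmax
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def em_step(mu):
    r = resp(mu)                                             # E-step
    return (r * x[:, None]).sum(axis=0) / r.sum(axis=0)     # M-step for the means

def grad_L(mu):
    r = resp(mu)
    return (r * (x[:, None] - mu[None, :])).sum(axis=0)     # dL/dmu_k

mu = np.array([-0.5, 0.5])
for _ in range(5):
    step = em_step(mu) - mu
    assert step @ grad_L(mu) > 0   # eq. (6): positive projection onto the gradient
    mu += step
```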
Our main result is that when the derivative of the BO mapping is small ($M'$ has small eigenvalues), the transformation matrix approaches the negative inverse Hessian, and bound optimization behaves like a second-order Newton method. In particular, in the neighborhood of a local optimum $\Theta^*$:

$$\lim_{\Theta^t\to\Theta^*} P(\Theta^t) = -\big[I - M'(\Theta^*)\big]\,S(\Theta^*)^{-1} \quad (7)$$

where $S(\Theta^*) = \frac{\partial^2 L(\Theta)}{\partial\Theta^2}\big|_{\Theta=\Theta^*}$ is the Hessian of the objective function. We assume that $P(\Theta)$ and $M(\Theta)$ are differentiable and that $S(\Theta^*)^{-1}$ exists.

Proof sketch of eq. (7): Taking negative derivatives of (5) with respect to $\Theta^t$ yields

$$I - M'(\Theta^t) = -P'(\Theta^t)\,\nabla L(\Theta^t) - P(\Theta^t)\,S(\Theta^t)$$

where $M'(\Theta^t) = \partial\Theta^{t+1}_i/\partial\Theta^t_j$ is the input-output derivative matrix of the BO mapping and $P'(\Theta^t) = \frac{\partial P(\Theta)}{\partial\Theta}\big|_{\Theta=\Theta^t}$ is the tensor derivative of $P(\Theta^t)$ with respect to $\Theta^t$. In the limit, near a fixed point, the first term on the right vanishes since the gradient goes to zero (assuming $P'(\Theta^t)$ does not become infinite); the equality (7) readily follows.

This shows that the nature of the quasi-Newton behavior is controlled by the convergence rate matrix $M'(\Theta^*)$. When the matrix $M'$ has small eigenvalues, then near a local optimum bound optimization may exhibit quasi-Newton convergence behavior. This is also true in plateau regions where the gradient is very small, even if they are not near a local optimum.
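The convergence rate matrix can also be probed numerically: given any iterative map $M$, a finite-difference Jacobian at the fixed point exposes the eigenvalues that control local convergence. A minimal sketch (ours; a toy linear map stands in for a real bound optimizer):

```python
import numpy as np

def rate_matrix(M, theta_star, eps=1e-6):
    """Finite-difference estimate of M'(Theta*) for an iterative map M."""
    d = theta_star.size
    J = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        J[:, j] = (M(theta_star + e) - M(theta_star - e)) / (2 * eps)
    return J

# Toy linear map with fixed point at the origin; the spectral radius of M'
# governs the local rate: close to 1 means slow, first-order behavior.
A = np.array([[0.9, 0.05], [0.0, 0.2]])
J = rate_matrix(lambda th: A @ th, np.zeros(2))
print(np.max(np.abs(np.linalg.eigvals(J))))   # ~0.9: ~10% error reduction per step
```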
Figure 1: Contour plots of the likelihood function $L(\Theta)$ for MoG examples using well-separated (upper panels) and not-well-separated (lower panels) one-dimensional data sets. Axes correspond to the two means $\mu_1$ and $\mu_2$. The dash-dot line shows the direction of the true gradient $\nabla L(\Theta)$, the solid line shows the direction of $P(\Theta)\nabla L(\Theta)$, and the dashed line shows the direction of $-S(\Theta)^{-1}\nabla L(\Theta)$. Right panels are blowups of the dashed regions on the left. The numbers indicate the log of the $\ell_2$ norm of $\nabla L(\Theta)$. For the well-separated case, in the vicinity of $\Theta^*$, the vectors $P(\Theta)\nabla L(\Theta)$ and $-S(\Theta)^{-1}\nabla L(\Theta)$ become identical.

We can study the form and properties of this matrix by examining its structure, its eigenvalues, or the ratio of its two top eigenvalues. In particular, if the top eigenvalue of $M'(\Theta^*)$ tends to zero, then BO becomes a true Newton method, rescaling the gradient by exactly the negative inverse Hessian. As the eigenvalues tend to unity, BO takes smaller and smaller stepsizes, giving poor, first-order convergence.

3 Common Bound Optimizers

3.1 Expectation-Maximization (EM)

We now consider a particular bound optimizer, the popular Expectation-Maximization (EM) algorithm, and derive specific cases of the results above for models which use EM to adjust their parameters. To begin, consider a probabilistic model of observed data $x$ which uses latent variables $y$. For any value of $\Psi$, it can easily be verified that the following difference of two terms is a lower bound on the log likelihood:

$$G(\Theta, \Psi) = Q(\Theta, \Psi) - H(\Psi, \Psi) = \int p(y|x, \Psi)\ln p(x, y|\Theta)\,dy - \int p(y|x, \Psi)\ln p(y|x, \Psi)\,dy$$

The log likelihood function can be written as:

$$L(\Theta) = \ln p(x|\Theta) = \int p(y|x, \Theta)\ln p(x|\Theta)\,dy = G(\Theta, \Theta) \ge G(\Theta, \Psi) \quad \forall\,\Psi$$

By (2), we can easily establish:

$$\nabla^2 G(\Theta^*) = \frac{\partial^2 Q(\Theta, \Theta^*)}{\partial\Theta^2}\bigg|_{\Theta=\Theta^*} \qquad \nabla^2 G(\Theta^*, \Psi^*) = -\frac{\partial^2 H(\Theta, \Theta^*)}{\partial\Theta^2}\bigg|_{\Theta=\Theta^*}$$

and therefore we have an expression for $M'(\Theta^*)$:

$$M'(\Theta)\Big|_{\Theta=\Theta^*} = \frac{\partial^2 H(\Theta, \Theta^*)}{\partial\Theta^2}\bigg|_{\Theta=\Theta^*}\left[\frac{\partial^2 Q(\Theta, \Theta^*)}{\partial\Theta^2}\bigg|_{\Theta=\Theta^*}\right]^{-1}$$

This can be interpreted as the ratio of missing information to complete information near the local optimum [2]. According to (7), in the neighborhood of a solution (for sufficiently large $t$):

$$P(\Theta^t) \approx -\left[I - \frac{\partial^2 H}{\partial\Theta^2}\Big(\frac{\partial^2 Q}{\partial\Theta^2}\Big)^{-1}\right]\Bigg|_{\Theta=\Theta^t} S(\Theta^t)^{-1}$$

This formulation of the EM algorithm has a very interesting interpretation which is applicable to any latent variable model: when the missing information is small compared to the complete information, EM exhibits quasi-Newton behavior and enjoys fast, typically superlinear, convergence in the neighborhood of $\Theta^*$. If the fraction of missing information approaches unity, the eigenvalues of the first term above approach zero and EM will exhibit extremely slow convergence. Figure 1 illustrates these results in the case of fitting a mixture of Gaussians model to well-clustered and not-well-clustered data. Many other models also show this same effect. For example, when Hidden Markov Models or Aggregate Markov Models [7] are trained on very structured sequences, EM exhibits quasi-Newton behavior, in particular when the state transition matrix is sparse and the output distributions are almost deterministic at each state.
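The fraction of missing information is matrix-valued, but a crude scalar proxy, the average entropy of the posterior responsibilities, already signals which regime EM is in. The sketch below is our illustration under that simplifying assumption:

```python
import numpy as np

def mean_posterior_entropy(r):
    """Average entropy (nats) of responsibilities r, shape (n_points, n_components).

    Near 0 for well-separated clusters (little missing information, quasi-Newton
    EM); near log(n_components) for heavily overlapping clusters (slow EM).
    """
    return -np.mean(np.sum(r * np.log(np.clip(r, 1e-12, None)), axis=1))

r_separated = np.array([[0.999, 0.001], [0.002, 0.998]])
r_overlapped = np.array([[0.55, 0.45], [0.48, 0.52]])
print(mean_posterior_entropy(r_separated), mean_posterior_entropy(r_overlapped))
```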
3.2 Generalized Iterative Scaling (GIS)

In this section we consider the Generalized Iterative Scaling (GIS) algorithm [1], widely used for parameter estimation in maximum entropy models. Its goal is to determine the parameters $\Theta$ of an exponential family distribution $p(x|\Theta) = \frac{1}{Z(\Theta)}\exp(\Theta^T F(x))$ such that certain generalized marginal constraints are preserved: $\sum_x p(x|\Theta^*)F(x) = \sum_x \tilde p(x)F(x)$, where $Z(\Theta)$ is the normalizing factor, $\tilde p$ is a given empirical distribution, and $F(x) = [f_1(x), \dots, f_d(x)]^T$ is a given feature vector on the inputs. The GIS algorithm requires that $f_i(x) > 0$ (but we will not require $\sum_i f_i(x) = 1$) [4]. The log likelihood is:

$$L(\Theta) = \sum_x \tilde p(x)\ln p(x|\Theta) = \sum_x \tilde p(x)\,\Theta^T F(x) - \ln Z(\Theta)$$

We note that $-\ln Z(\Theta) \ge -Z(\Theta)/Z(\Psi) - \ln Z(\Psi) + 1$ for any $\Psi$, and, by Jensen's inequality, $\exp\big(\sum_i \pi_i t_i\big) \le \sum_i \pi_i \exp(t_i) + 1 - \sum_i \pi_i$ for $\pi_i \ge 0$ with $\sum_i \pi_i \le 1$. Defining $s = \max_x \sum_i f_i(x)$, we construct a lower bound:

$$L(\Theta) \ge \sum_x \tilde p(x)\sum_i \theta_i f_i(x) - \ln Z(\Psi) + 1 - \sum_x p(x|\Psi)\Big[\sum_i \frac{f_i(x)}{s}\exp\big(s(\theta_i - \psi_i)\big) + 1 - \sum_i \frac{f_i(x)}{s}\Big] = G(\Theta, \Psi)$$

This lower bound has the useful property that its maximization is decoupled across the parameters $\theta_i$. The GIS algorithm is then given by:

$$\theta_i^{t+1} = \theta_i^t + \frac{1}{s}\ln\frac{\sum_x \tilde p(x) f_i(x)}{\sum_x p(x|\Theta^t) f_i(x)}$$

Define $\bar F(\Theta^*) = \sum_x p(x|\Theta^*)F(x)$ to be the mean of the feature vectors, $D(\Theta^*) = \mathrm{diag}[\bar F(\Theta^*)]$ to be the corresponding diagonal matrix, and $\mathrm{Cov}(\Theta^*) = \sum_x p(x|\Theta^*)F(x)F(x)^T - \bar F(\Theta^*)\bar F(\Theta^*)^T$ to be the covariance of the feature vectors under the model distribution $p(x|\Theta^*)$. We can compute second order statistics using (2):

$$\nabla^2 G(\Theta^*) = -s\,\mathrm{diag}[\bar F(\Theta^*)] = -sD(\Theta^*) \qquad \nabla^2 G(\Theta^*, \Psi^*) = sD(\Theta^*) - \mathrm{Cov}(\Theta^*)$$

According to (7), in the neighborhood of a solution (for sufficiently large $t$), the step GIS takes in parameter space and the true gradient are related by the matrix:

$$P(\Theta^t) \approx -\frac{1}{s}\,\mathrm{Cov}(\Theta^t)\,D(\Theta^t)^{-1}\,S(\Theta^t)^{-1}$$

Due to the concavity of $G(\Theta, \Psi)$ for any fixed $\Psi$, the step the GIS algorithm takes in parameter space always has positive projection onto the true gradient of the objective function. The convergence rate matrix $M'(\Theta^*)$ is of the form:

$$M'(\Theta)\Big|_{\Theta=\Theta^*} = I - \frac{1}{s}\,\mathrm{Cov}(\Theta^*)\,D(\Theta^*)^{-1} \quad (8)$$

and depends on the covariance and the mean of the feature vectors. We can interpret this result as follows: when feature vectors become less correlated and closer to the origin, GIS exhibits faster convergence in the neighborhood of $\Theta^*$. If features are highly dependent, then GIS will exhibit extremely slow convergence.

3.3 Non-Negative Matrix Factorization (NMF)

Given a non-negative matrix $V$, the NMF algorithm [3] tries to find non-negative matrices $W$ and $H$ such that $V \approx WH$. Posed as an optimization problem, we are interested in minimizing the divergence $L(W, H) = D(V\|WH)$, subject to $W, H \ge 0$ elementwise:

$$L(W, H) = \sum_{ij}\Big[V_{ij}\ln\frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij}\Big]$$

We use $\ln\sum_c W_{ic}H_{cj} \ge \sum_c \alpha_{ij}(c)\ln\frac{W_{ic}H_{cj}}{\alpha_{ij}(c)}$, where $\alpha_{ij}(c) = W^t_{ic}H^t_{cj}/\sum_r W^t_{ir}H^t_{rj}$, so that the $\alpha_{ij}(c)$ sum to one over $c$. Defining $\Theta = (W, H)$ and $\Psi = (W^t, H^t)$, we can construct an upper bound on the cost function:

$$L(\Theta) \le \sum_{ij}\Big[V_{ij}\ln V_{ij} - V_{ij} + \sum_c W_{ic}H_{cj}\Big] - \sum_{ij}V_{ij}\sum_c \alpha_{ij}(c)\ln\frac{W_{ic}H_{cj}}{\alpha_{ij}(c)} = G(\Theta, \Psi) \quad (9)$$

One can now compute second order statistics using (2). In the appendix we derive the explicit form of the convergence rate matrix $M'$. We also note that the convergence matrix of NMF much resembles the convergence matrix of GIS, since both algorithms make use of the bound that comes from Jensen's inequality.

3.4 Concave-Convex Procedure (CCCP)

The CCCP [10] optimizer seeks to minimize an energy function $E(\Theta)$ which can be decomposed into a convex function $E_{vex}(\Theta)$ and a concave function $E_{cave}(\Theta)$:

$$E(\Theta) = E_{vex}(\Theta) + E_{cave}(\Theta) \quad (10)$$

It is easy to see that CCCP belongs to the class of bound optimization algorithms, and therefore can be analyzed as a first order iterative algorithm. Since a concave function lies below its tangent plane, its bound function is:

$$E(\Theta) \le E_{vex}(\Theta) + E_{cave}(\Psi) + (\Theta - \Psi)^T\nabla E_{cave}(\Psi) = G(\Theta, \Psi)$$

The CCCP algorithm is then given by solving, at each iteration:

$$\nabla E_{vex}(\Theta^{t+1}) = -\nabla E_{cave}(\Theta^t)$$
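For a one-dimensional energy the inner solve is often available in closed form. As a sketch (ours, not the paper's code), take the quartic energy analyzed in Section 5, $E = x^4 - 3x^2 + 2x - 2$, with the low-curvature decomposition $E_{vex} = x^4 + 2x$, $E_{cave} = -3x^2 - 2$; each CCCP step solves $4x_{t+1}^3 + 2 = 6x_t$:

```python
import numpy as np

# CCCP on E(x) = x^4 - 3x^2 + 2x - 2 with E_vex = x^4 + 2x, E_cave = -3x^2 - 2.
# Each iteration solves E_vex'(x_new) = -E_cave'(x_old): 4 x_new^3 + 2 = 6 x_old.
x = 2.0
for _ in range(30):
    x = np.cbrt((6.0 * x - 2.0) / 4.0)   # closed-form solve of the convex subproblem
print(x, x**4 - 3*x**2 + 2*x - 2)        # a stationary point of E (here x = 1)
```

Fixed points of this iteration satisfy $4x^3 - 6x + 2 = 0$, which is exactly $E'(x) = 0$, so CCCP can only stop at stationary points of $E$.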
Employing (2), we have:

$$\nabla^2 G(\Theta^*) = \frac{\partial^2 E_{vex}(\Theta)}{\partial\Theta\,\partial\Theta^T}\bigg|_{\Theta=\Theta^*} \qquad \nabla^2 G(\Theta^*, \Psi^*) = \frac{\partial^2 E_{cave}(\Psi)}{\partial\Psi\,\partial\Psi^T}\bigg|_{\Psi=\Theta^*}$$

The convergence rate matrix is given by:

$$M'(\Theta^*) = -\frac{\partial^2 E_{cave}(\Psi)}{\partial\Psi\,\partial\Psi^T}\bigg|_{\Psi=\Theta^*}\left[\frac{\partial^2 E_{vex}(\Theta)}{\partial\Theta\,\partial\Theta^T}\bigg|_{\Theta=\Theta^*}\right]^{-1}$$

which can be interpreted as a ratio of concave curvature to convex curvature. According to (7), in the neighborhood of a solution (for sufficiently large $t$), the gradient and step are related by:

$$P(\Theta^t) \approx -\Big[I + \nabla^2 E_{cave}\,\big(\nabla^2 E_{vex}\big)^{-1}\Big]\Big|_{\Theta=\Theta^t}\,S(\Theta^t)^{-1}$$

Of course, the step CCCP takes in parameter space always has positive projection onto the direction of steepest descent of the original energy function $E(\Theta)$. The above view of CCCP has an interesting interpretation: if the concave energy term has small curvature compared to the convex energy term in the neighborhood of $\Theta^*$, CCCP will exhibit quasi-Newton behavior and will possess fast, typically superlinear convergence. As the ratio of concave to convex curvature approaches one, CCCP will exhibit extremely slow, first-order convergence behavior. Figure 4 illustrates exactly such an example.

4 Improving Convergence Rates

The above analyses helped to answer the question: when and why will bound optimizers converge slowly? They can also help to answer the more practical question: what can we do to speed up convergence? In the case of EM, it is possible to estimate the key quantity controlling convergence (the fraction of missing information) and switch to direct (gradient-based) optimization when we predict slow behavior of EM. We have experimented with such a hybrid approach with some success [6]. For other bound optimizers, similar hybrid algorithms are possible. But there is another, intriguing approach to improving convergence speed: modify the original input to the algorithms based on our analysis of convergence rates. In the case of GIS this involves transforming features; in the case of NMF, this requires translating data vectors; and for CCCP this comes down to designing different convex-concave decompositions of the objective.

Beginning with GIS, we can show that translating feature vectors to bring them closer to the origin and decorrelating (whitening) them both speed up convergence. (Homogeneously rescaling all features by a single constant does not affect convergence.) In particular, the optimal translation of features is given by $F^{new}(x) = F(x) - V$ with $V_i = \min_x f_i(x)$, and the optimal linear transformation of features is one that makes the covariance of the transformed features, $A\,\mathrm{Cov}(\Theta^*)A^T$, equal to the identity matrix. (We provide sketch proofs of both results in the appendix.) Of course, the covariance in the second condition cannot be evaluated until the optimal parameters are known, but it can be approximated by using the sample covariance of the features on the training set. For NMF, similar to GIS, we can show that translating data vectors to bring them closer to the origin speeds up convergence, whereas homogeneously rescaling all data by a single constant does not affect convergence. For CCCP, it is well known that any energy function has many convex-concave decompositions, but no clear principle for finding a good one has been known. Our analysis provides guidance in this regard: we should minimize the ratio of curvatures between the convex and concave parts of the energy. In the next section we illustrate that appropriate preprocessing of the input to these various bound optimization algorithms does result in a much faster rate of convergence.
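A sketch of the two preprocessing steps for GIS, under the practical approximation just mentioned (the sample covariance of the training features replaces $\mathrm{Cov}(\Theta^*)$); the function name and data are our own illustration:

```python
import numpy as np

def preprocess_features(F):
    """Translate, then decorrelate, feature vectors F of shape (n, d) for GIS.

    Translation by the per-feature minimum (Claim 1) moves features toward the
    origin while keeping them non-negative; ZCA whitening with the sample
    covariance approximates the ideal transform of Claim 2. Note: whitening may
    reintroduce negative entries (see the appendix footnote), in which case a
    further translation is needed, trading off the two claims.
    """
    F = F - F.min(axis=0)                  # optimal translation: V_i = min_x f_i(x)
    w, U = np.linalg.eigh(np.cov(F, rowvar=False))
    A = U @ np.diag(1.0 / np.sqrt(np.clip(w, 1e-12, None))) @ U.T
    return F @ A                           # sample covariance becomes ~identity

F = np.abs(np.random.default_rng(1).normal(2.0, 1.0, size=(200, 5)))
print(np.round(np.cov(preprocess_features(F), rowvar=False), 2))
```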
5 Experimental Results

We now present empirical results to support the validity of our analysis for several bound optimization algorithms. We first apply EM to learning the parameters of two latent variable models: Mixtures of Gaussians (MoG) and Hidden Markov Models (HMM). We then analyze and apply Iterative Scaling (IS) to a logistic regression model. Next, we show the effect of data translation on the convergence properties of NMF. Finally, we finish by describing and analyzing the effect of various energy function decompositions on the convergence behavior of the CCCP algorithm. Though not shown, we confirmed that the convergence results presented below do not vary significantly for different random initial starting points in the parameter space.

First, consider a mixture of Gaussians (MoG) model. In this model the proportion of missing information corresponds to how well or not well the data is separated into distinct clusters. We therefore considered two types of data sets: a well-separated case, and a not-well-separated case in which the data overlaps in one contiguous region. As predicted by our analysis, in the well-separated case, in the vicinity of the local optimum $\Theta^*$ the directions of the vectors $P(\Theta)\nabla L(\Theta)$ and $-S(\Theta)^{-1}\nabla L(\Theta)$ become identical (fig. 1), showing that EM will have quasi-Newton convergence behavior. In the not-well-separated case, due to the large proportion of missing information, these directions are significantly different and EM possesses poor, first-order convergence behavior. We also applied the MoG model to cluster a large collection of 8x8 grayscale pixel image patches.¹ Figure 2 displays the convergence behavior of EM for a smaller and a larger number of mixture components $M$. The experimental results reveal that with fewer mixture components, EM converges quickly to a local optimum, since the components generally model the data with fairly distinct, non-contiguous clusters.

¹The data set used was the mlog data set publicly available at ftp://hlab.phys.rug.nl/pub/samples/mlog
Figure 2: Learning curves of the EM algorithm for two models: MoG and HMM. Different data sets are shown on the same plots for convenience. The iteration number is shown on the horizontal axis, and log-likelihood (plus a constant) on the vertical axis, with the zero level corresponding to the convergence point of the EM algorithm; example structured and unstructured HMM training sequences are shown in the plot. For well-separated and structured data (A), EM possesses quasi-Newton convergence behavior, converging within tens of iterations under the stopping criterion $[L(\Theta^{t+1}) - L(\Theta^t)]/|L(\Theta^{t+1})| < 10^{-10}$. For overlapping, aliased data (B), EM possesses poor, first-order convergence. The right panel displays the convergence behavior of EM when fitting the larger as opposed to the smaller MoG model on the same data set of gray image patches.

Figure 3: Learning curves (left panels) of the Iterative Scaling algorithm for the logistic regression model, showing the effect that translation and whitening of the feature vectors have on IS convergence behavior (curves A, B, C in the top panels; X, Y, Z in the bottom panels; right panels show the corresponding feature configurations). Top panels show an experiment with 2,000 2-dimensional feature vectors drawn from a standard normal; bottom panels display an identical experiment with 2,000 feature vectors drawn from a normal with oriented covariance. The top right panel shows that scaling feature vectors by a constant does not affect the convergence of IS.

As the number of mixture components increases, clusters overlap in contiguous regions, creating a relatively high proportion of missing information. In this case the convergence of EM slows by several orders of magnitude.

We then applied EM to training Hidden Markov Models (HMMs). Missing information in this model is high when the observed data do not well determine the underlying state sequence (given the parameters). We therefore generated two synthetic data sets from an HMM with a small number of states and a small output alphabet. The first data set ("aliased" sequences) was generated from an HMM where the output parameters were set to uniform values plus some small noise. The second data set ("structured" sequences) was generated from an HMM with sparse transition and output matrices. Figure 2 shows that for the very structured data, EM performs well and exhibits second order convergence in the vicinity of the local optimum. For the ambiguous or aliased data, EM possesses extremely slow, first-order convergence behavior. This analysis may also shed light on why hard-clustering algorithms such as k-means and Viterbi-style E-steps for HMMs appear to have faster convergence than their softer cousins: they suppress the missing information.

To confirm our analysis of GIS, we applied the iterative scaling algorithm to a simple 2-class logistic regression model: $p(y = \pm 1\,|\,x, w) = 1/(1 + \exp(-y\,w^T x))$ [4]. In our first experiment, $N$ feature vectors of dimensionality $d$ were drawn from a spherical normal, $x \sim \mathcal N(0, 2I_d)$, with the true parameter vector $w$ randomly chosen on the surface of the $d$-dimensional sphere with radius 2. To make the features positive, the data set was modified by adding 20 to all feature values. Figure 3 shows that for $N = 2{,}000$ and $d = 2$, naive IS, run on the original unpreprocessed features, converges markedly more slowly than IS run on preprocessed features.
Figure 4: Learning curves of the NMF and CCCP algorithms. For NMF, we show the effect that data translation has on the convergence behavior of NMF (in our case black pixels correspond to 0, white to 30). Applying CCCP to minimize a simple energy function $E = x^4 - 3x^2 + 2x - 2$, we display the effect that different energy decompositions ($E_{vex1} + E_{cave1}$, $E_{vex2} + E_{cave2}$, $E_{vex3} + E_{cave3}$; left panel) have on CCCP convergence.

When feature vectors are translated closer to the origin, IS converges to exactly the same maximum likelihood solution, but beats naive IS by a factor of almost twelve. Our second experiment was similar, but the feature vectors were drawn from a Gaussian with oriented covariance. Figure 3 shows that translating features improves the convergence of IS by a factor of over 4, whereas translating and whitening the feature vectors results in a speedup by a factor of over twenty. Similar results are obtained if the dimensionality of the data is increased.

Next, we experimented with the NMF algorithm. Data vectors were drawn from a standard normal: $x \sim \mathcal N(0, I_{16})$. To make the data positive, the data set was modified by adding 20 to all data values, forming the non-negative matrix $V$. We then applied NMF to compute the non-negative factorization $V \approx WH$. Figure 4 reveals that naive NMF, run on the original unpreprocessed data (data set B), takes over 1,300 iterations to converge. Once the data vectors are translated closer to the origin (data set A), NMF converges to exactly the same value of the cost function in about 230 iterations, outperforming naive NMF by a factor of over five.

Finally, we experimented with the CCCP algorithm. We considered a simple energy function $E = x^4 - 3x^2 + 2x - 2$, which has many decompositions (fig. 4). A decomposition which minimizes the ratio of concave to convex curvature is $E_{cave} = -3x^2 - 2$ and $E_{vex} = x^4 + 2x$. Other decompositions, such as $E_{cave} = -13x^2 - 2$ with $E_{vex} = x^4 + 10x^2 + 2x$, and $E_{cave} = -9x^4 - 3x^2 - 2$ with $E_{vex} = 10x^4 + 2x$, clearly increase the proportion of concave to convex curvature. In our experiment, all runs of CCCP were started from the same initial point in parameter space. Figure 4 reveals that as the proportion of local concave-convex curvature increases, the convergence rate of CCCP slows down significantly, by several orders of magnitude.

6 Discussion

In this paper we have analyzed a large class of bound optimization algorithms and their relationship to direct optimization algorithms such as gradient-based methods. We have also analyzed and determined conditions under which BO algorithms exhibit local-gradient and fast quasi-Newton convergence behaviors. Based on this analysis and interpretation, we have also provided some recommendations for how the input to these algorithms can be preprocessed to yield faster convergence. Currently, using the derivation of an explicit form of the convergence rate matrix, we are also working on identifying analytic conditions under which CCCP possesses fast or extremely slow convergence in minimizing Bethe and Kikuchi free energies in approximate inference problems. Similar analysis can be applied to other bound optimization algorithms; for example, Sha et al. [8] recently introduced a multiplicative algorithm for training SVMs and provided a convergence analysis based on margins.
The analysis and experiments motivate the use of alternative optimization techniques in the regime where the convergence rate matrix has large eigenvalues and a bound optimizer is likely to perform poorly. Slow convergence is expected when missing information is high while learning with EM; when feature vectors are highly dependent while estimating parameters with GIS or NMF; or when the ratio of concave to convex curvature is large when minimizing an energy function with CCCP. In these cases, direct optimization algorithms such as conjugate gradient are likely to have far superior performance; either such alternatives should be employed or else the input should be preprocessed to speed convergence.

Acknowledgments

Funded in part by the IRIS project, Precarn Canada.

References

[1] Stephen Della Pietra, Vincent J. Della Pietra, and John D. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380-393, 1997.
[2] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977.
[3] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788-791, 1999.
[4] Tom Minka. Algorithms for maximum-likelihood logistic regression. Technical Report 758, Dept. of Statistics, Carnegie Mellon University, 2001.
[5] Ruslan Salakhutdinov. Relationship between gradient and EM steps for several latent variable models. http://www.cs.toronto.edu/~rsalakhu/ecg
[6] Ruslan Salakhutdinov, Sam Roweis, and Zoubin Ghahramani. Optimization with EM and Expectation-Conjugate-Gradient. Submitted to Proc. 20th International Conf. on Machine Learning, 2003.
[7] Lawrence Saul and Fernando Pereira. Aggregate and mixed-order Markov models for statistical language processing. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pages 81-89, 1997.
[8] Fei Sha, Lawrence Saul, and Daniel Lee. Multiplicative updates for nonnegative quadratic programming in support vector machines. In Advances in NIPS, volume 15, 2003.
[9] L. Xu and M. I. Jordan. On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation, 8(1):129-151, 1996.
[10] Alan Yuille and Anand Rangarajan. The concave-convex procedure (CCCP). In Advances in NIPS, volume 13, 2001.

Appendix

Claim 1: Translating feature vectors closer to the origin speeds up convergence of GIS. The optimal translation of features is given by $F^{new}(x) = F(x) - V$ with $V_i = \min_x f_i(x)$.

Proof sketch: Consider setting $F^{new} = F - V$ as above. We have

$$M'_{new}(\Theta^*) = I - \frac{1}{s_{new}}\,\mathrm{Cov}(\Theta^*)\,D_{new}(\Theta^*)^{-1} \quad (11)$$

with $D_{new}(\Theta^*) = D(\Theta^*) - \mathrm{diag}[V]$ and $s_{new} = s - \sum_i V_i$. Let us denote $Q(\Theta^*) = \mathrm{Cov}(\Theta^*)D(\Theta^*)^{-1}$, $Q_{new}(\Theta^*) = \mathrm{Cov}(\Theta^*)D_{new}(\Theta^*)^{-1}$, and $\lambda_{max}(A)$ the largest eigenvalue of a matrix $A$. We now show that this translation forces the top eigenvalue of $M'(\Theta^*)$ to decrease: $\lambda_{max}(M'_{new}(\Theta^*)) \le \lambda_{max}(M'(\Theta^*))$, where from (8) $M'(\Theta^*) = I - \frac{1}{s}Q(\Theta^*)$. Note that $\lambda_{max}(M'_{new}(\Theta^*)) = 1 - \lambda_{min}\big(\frac{1}{s_{new}}Q_{new}(\Theta^*)\big)$. Hence, our task reduces to showing:

$$\lambda_{min}\Big(\frac{1}{s_{new}}Q_{new}(\Theta^*)\Big) \ge \lambda_{min}\Big(\frac{1}{s}Q(\Theta^*)\Big) \iff \lambda_{max}\big(s_{new}Q_{new}^{-1}(\Theta^*)\big) \le \lambda_{max}\big(sQ^{-1}(\Theta^*)\big) \quad (12)$$

Taking into account that $s_{new} \le s$, the above inequality is obvious by examining:

$$\lambda_{max}\big(s_{new}Q_{new}^{-1}(\Theta^*)\big) = s_{new}\,\lambda_{max}\big((D(\Theta^*) - \mathrm{diag}[V])\,\mathrm{Cov}^{-1}(\Theta^*)\big) \le s\,\lambda_{max}\big(D(\Theta^*)\,\mathrm{Cov}^{-1}(\Theta^*)\big) = s\,\lambda_{max}\big(Q^{-1}(\Theta^*)\big)$$

It is now clear that the optimal translation of features is given by $F^{new}(x) = F(x) - V$ with $V_i = \min_x f_i(x)$.

Claim 2: Decorrelating (whitening) feature vectors speeds up convergence of GIS. In particular, the optimal linear transformation is one that makes the covariance of the transformed features the identity.²

Proof sketch: Consider the spectral decomposition $\mathrm{Cov}(\Theta^*) = WHW^T$. Let $F^{new} = H^{-1/2}W^T F$, in which case $\mathrm{Cov}_{new}(\Theta^*) = I$. Then:

$$M'_{new}(\Theta^*) = I - \frac{1}{s_{new}}\,\mathrm{Cov}_{new}(\Theta^*)\,D_{new}(\Theta^*)^{-1} = I - \frac{1}{s_{new}}\,D_{new}(\Theta^*)^{-1} \quad (13)$$

with $s_{new} = \max_x \sum_i f^{new}_i(x)$ and $D_{new}(\Theta^*) = \mathrm{diag}\big[\sum_x p(x|\Theta^*)F^{new}(x)\big] = \mathrm{diag}[\bar F^{new}]$. We now show that, in general, $\lambda_{max}(M'_{new}(\Theta^*)) \le \lambda_{max}(M'(\Theta^*))$. This task reduces to showing (see eq. (12)): $\lambda_{max}\big(s_{new}D_{new}(\Theta^*)\big) \le \lambda_{max}\big(sQ^{-1}(\Theta^*)\big)$. First note that:

$$\lambda_{max}\big(sQ^{-1}(\Theta^*)\big) = s\,\lambda_{max}\big(D(\Theta^*)\,\mathrm{Cov}^{-1}(\Theta^*)\big) \quad (14)$$

On the other side, since $\max_i \bar f^{new}_i \le \sum_i \bar f^{new}_i \le s_{new}$:

$$\lambda_{max}\big(s_{new}D_{new}(\Theta^*)\big) = s_{new}\,\max_i \bar f^{new}_i \le s_{new}^2 \quad (15)$$

It can also be shown that $s_{new}^2 \le s\,\lambda_{max}\big(Q^{-1}(\Theta^*)\big)$. By using the above facts, the slightly more relaxed bound holds:

$$\lambda_{max}\big(s_{new}D_{new}(\Theta^*)\big) \le s_{new}^2 \le s\,\lambda_{max}\big(Q^{-1}(\Theta^*)\big) \quad (16)$$

Therefore, in general, whitening feature vectors pushes down the top eigenvalue of the convergence rate matrix, which, according to our analysis, results in a faster rate of convergence.

Non-Negative Matrix Factorization: We use the bound on the objective function (9) to derive the explicit form of the convergence rate matrix $M'$.
Defining $\Theta = (W, H)$ and $\Psi = (W^t, H^t)$, we employ (2). Writing $(WH)_{ij} = \sum_c W_{ic}H_{cj}$, letting $\delta$ denote the Kronecker delta, and evaluating all quantities at $\Theta = \Psi = \Theta^*$, so that $\alpha_{ij}(c) = W_{ic}H_{cj}/(WH)_{ij}$, the blocks of $\nabla^2 G(\Theta^*)$ are:

$$\frac{\partial^2 G(\Theta^*)}{\partial W_{ic}\,\partial W_{kp}} = \delta_{ik}\delta_{cp}\sum_j \frac{V_{ij}}{(WH)_{ij}}\frac{H_{cj}}{W_{ic}} \qquad \frac{\partial^2 G(\Theta^*)}{\partial H_{cj}\,\partial H_{pl}} = \delta_{jl}\delta_{cp}\sum_i \frac{V_{ij}}{(WH)_{ij}}\frac{W_{ic}}{H_{cj}} \qquad \frac{\partial^2 G(\Theta^*)}{\partial W_{ic}\,\partial H_{pl}} = \delta_{cp}$$

and the blocks of the mixed second derivatives $\nabla^2 G(\Theta^*, \Psi^*)$ are:

$$\frac{\partial^2 G(\Theta^*, \Psi^*)}{\partial W_{ic}\,\partial W^t_{kp}} = -\delta_{ik}\sum_j \frac{V_{ij}}{(WH)_{ij}}\frac{H_{cj}}{W_{ip}}\big(\delta_{cp} - \alpha_{ij}(p)\big) \qquad \frac{\partial^2 G(\Theta^*, \Psi^*)}{\partial H_{cj}\,\partial H^t_{pl}} = -\delta_{jl}\sum_i \frac{V_{ij}}{(WH)_{ij}}\frac{W_{ic}}{H_{pj}}\big(\delta_{cp} - \alpha_{ij}(p)\big)$$

$$\frac{\partial^2 G(\Theta^*, \Psi^*)}{\partial W_{ic}\,\partial H^t_{pl}} = -\frac{V_{il}}{(WH)_{il}}\frac{H_{cl}}{H_{pl}}\big(\delta_{cp} - \alpha_{il}(p)\big) \qquad \frac{\partial^2 G(\Theta^*, \Psi^*)}{\partial H_{cj}\,\partial W^t_{kp}} = -\frac{V_{kj}}{(WH)_{kj}}\frac{W_{kc}}{W_{kp}}\big(\delta_{cp} - \alpha_{kj}(p)\big)$$

The convergence rate matrix $M'$ is then of the form:

$$M'(\Theta)\Big|_{\Theta=\Theta^*} = -\nabla^2 G(\Theta^*, \Psi^*)\,\big[\nabla^2 G(\Theta^*)\big]^{-1}$$

²Here we are assuming that the new feature vector $F^{new}$ has only positive entries. If $F^{new}$ has negative entries it might be necessary to decorrelate and then add a translation, which trades off the advantages of Claim 1 and Claim 2.
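To tie the appendix back to the experiment of Section 5, the sketch below runs the standard multiplicative updates of [3] for the divergence objective (9) on the same data twice, far from and near the origin; the function and tolerances are our own illustration, and exact iteration counts depend on the random draw:

```python
import numpy as np

def nmf_iterations(V, k=4, tol=1e-7, max_iter=5000, seed=0):
    """Lee-Seung multiplicative updates for D(V || WH); returns the number of
    iterations until the relative drop in divergence falls below tol."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + 0.1
    H = rng.random((k, V.shape[1])) + 0.1
    div = lambda WH: np.sum(V * np.log(V / WH) - V + WH)
    d_old = div(W @ H)
    for t in range(1, max_iter + 1):
        H *= (W.T @ (V / (W @ H))) / W.sum(axis=0)[:, None]
        W *= ((V / (W @ H)) @ H.T) / H.sum(axis=1)[None, :]
        d_new = div(W @ H)
        if d_old - d_new < tol * abs(d_old):
            return t
        d_old = d_new
    return max_iter

X = np.abs(np.random.default_rng(3).normal(0.0, 1.0, size=(50, 16)))
print(nmf_iterations(X + 20.0), nmf_iterations(X + 0.1))   # far vs. near the origin
```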