
CS 188: Artificial Intelligence, Spring 2011
Lecture 19: Dynamic Bayes Nets, Naïve Bayes
4/6/2011
Pieter Abbeel, UC Berkeley
Slides adapted from Dan Klein.

Announcements
- W4 out, due next week Monday
- P4 out, due next week Friday
- Mid-semester survey

Announcements II
- Course contest
  - Regular tournaments. Instructions have been posted!
  - First week extra credit for top 20, next week top 10, then top 5, then top 3.
  - First nightly tournament: tentatively Monday night

P4: Ghostbusters 2.0
- Plot: Pacman's grandfather, Grandpac, learned to hunt ghosts for sport. He was blinded by his power, but could hear the ghosts' banging and clanging.
- Transition Model: All ghosts move randomly, but are sometimes biased
- Emission Model: Pacman knows a "noisy" distance to each ghost
[Figure: noisy distance probability for true distance = 8; axis labels 15, 13, 11, 9, 7, 5, 3, 1]

Today
- Dynamic Bayes Nets (DBNs) [sometimes called temporal Bayes nets]
- Demos:
  - Localization
  - Simultaneous Localization And Mapping (SLAM)
- Start machine learning

Dynamic Bayes Nets (DBNs)
- We want to track multiple variables over time, using multiple sources of evidence
- Idea: Repeat a fixed Bayes net structure at each time
- Variables from time t can condition on those from t-1
[Figure: DBN unrolled over t = 1, 2, 3 with hidden variables G_1^a, G_2^a, G_3^a and G_1^b, G_2^b, G_3^b, and evidence variables E_1^a, E_1^b, E_2^a, E_2^b, E_3^a, E_3^b]
- Discrete valued dynamic Bayes nets are also HMMs

Exact Inference in DBNs
- Variable elimination applies to dynamic Bayes nets
- Procedure: "unroll" the network for T time steps, then eliminate variables until P(X_T | e_{1:T}) is computed
[Figure: DBN unrolled over t = 1, 2, 3 with hidden variables G_t^a, G_t^b and evidence variables E_t^a, E_t^b]
- Online belief updates: Eliminate all variables from the previous time step; store factors for current time only

DBN Particle Filters
- A particle is a complete sample for a time step
- Initialize: Generate prior samples for the t=1 Bayes net
  - Example particle: G_1^a = (3,3), G_1^b = (5,3)
- Elapse time: Sample a successor for each particle
  - Example successor: G_2^a = (2,3), G_2^b = (6,3)
- Observe: Weight each entire sample by the likelihood of the evidence conditioned on the sample
  - Likelihood: P(E_1^a | G_1^a) * P(E_1^b | G_1^b)
- Resample: Select prior samples (tuples of values) in proportion to their likelihood
[Demo]
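The four particle filter steps above can be sketched in a few lines. This is only a minimal illustration under made-up assumptions, not the P4 code: the 1-D corridor of size SIZE, the uniform -1/0/+1 motion in transition, and the noise model in emission_prob are all hypothetical stand-ins for the real transition and emission models.

```python
import random

SIZE = 10  # hypothetical 1-D corridor with cells 0..SIZE-1; Pacman sits at cell 0


def transition(pos, rng):
    """Assumed motion model: move -1, 0, or +1 uniformly, clipped to the corridor."""
    return min(SIZE - 1, max(0, pos + rng.choice((-1, 0, 1))))


def emission_prob(reading, pos):
    """Assumed noise model: P(reading | true distance) decays with the error."""
    return 2.0 ** (-abs(reading - pos))


def particle_filter_step(particles, evidence, rng):
    """One elapse-time / observe / resample update.

    particles: list of tuples (g_a, g_b), one position per ghost.
    evidence:  tuple (e_a, e_b), one noisy distance reading per ghost.
    """
    # Elapse time: sample a successor for every ghost in every particle.
    moved = [tuple(transition(g, rng) for g in p) for p in particles]
    # Observe: weight each whole particle by P(e_a | g_a) * P(e_b | g_b).
    weights = []
    for p in moved:
        w = 1.0
        for g, e in zip(p, evidence):
            w *= emission_prob(e, g)
        weights.append(w)
    # Resample: draw the same number of particles in proportion to weight.
    return rng.choices(moved, weights=weights, k=len(moved))


rng = random.Random(0)
# Initialize: prior samples for the t=1 net (here, uniform over positions).
particles = [(rng.randrange(SIZE), rng.randrange(SIZE)) for _ in range(200)]
for evidence in [(3, 7), (3, 6), (2, 6)]:
    particles = particle_filter_step(particles, evidence, rng)
```

Each particle carries a value for every hidden variable at the current time step, which is why the weight is the product of the per-ghost likelihoods.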

Trick I to Improve Particle Filtering Performance: Low Variance Resampling
- Advantages:
  - More systematic coverage of space of samples
  - If all samples have same importance weight, no samples are lost
  - Lower computational complexity
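A common way to realize low variance resampling is to draw a single random offset and then step through the cumulative weights at evenly spaced pointers. The sketch below assumes that formulation; the function name and signature are illustrative and not taken from the lecture.

```python
import random


def low_variance_resample(particles, weights, rng=random):
    """Resample len(particles) items using one random draw and evenly spaced pointers."""
    n = len(particles)
    step = sum(weights) / n
    pointer = rng.random() * step  # single random offset in [0, step)
    resampled = []
    i = 0
    cumulative = weights[0]
    for _ in range(n):
        # Advance to the particle whose cumulative-weight interval holds the pointer.
        while pointer >= cumulative and i < n - 1:
            i += 1
            cumulative += weights[i]
        resampled.append(particles[i])
        pointer += step
    return resampled


# With equal weights every particle survives exactly once.
print(low_variance_resample(["a", "b", "c", "d"], [1.0, 1.0, 1.0, 1.0]))
```

The single pass over the weights is what gives the lower cost: it is linear in the number of particles, whereas drawing each sample independently typically needs a search per draw.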

Trick II to Improve Particle Filtering Performance: Regularization
- If there is no or little noise in the transition model, all particles will start to coincide
- Regularization: introduce additional (artificial) noise into the transition model

SLAM
- SLAM = Simultaneous Localization And Mapping
  - We do not know the map or our location
  - Our belief state is over maps and positions!
  - Main techniques: Kalman filtering (Gaussian HMMs) and particle methods
- [DEMOS] DP-SLAM, Ron Parr

Robot Localization
- In robot localization:
  - We know the map, but not the robot's position
  - Observations may be vectors of range finder readings
  - State space and readings are typically continuous (works basically like a very fine grid) and so we cannot store B(X)
  - Particle filtering is a main technique
- [Demos] Global-floor

SLAM
- SLAM = Simultaneous Localization And Mapping
  - We do not know the map or our location
  - State consists of position AND map!
  - Main techniques: Kalman filtering (Gaussian HMMs) and particle methods

Particle Filter Example
- 3 particles
[Figure: map of particle 1, map of particle 2, map of particle 3]

SLAM
- DEMOS: fastslam.avi, visionslam_helioffice.wmv

Further readings
- We are done with Part II: Probabilistic Reasoning
- To learn more (beyond scope of 188):
  - Koller and Friedman, Probabilistic Graphical Models (CS281A)
  - Thrun, Burgard and Fox, Probabilistic Robotics (CS287)

Part III: Machine Learning
- Up until now: how to reason in a model and how to make optimal decisions
- Machine learning: how to acquire a model on the basis of data / experience
  - Learning parameters (e.g. probabilities)
  - Learning structure (e.g. BN graphs)
  - Learning hidden concepts (e.g. clustering)

Machine Learning Today
- An ML Example: Parameter Estimation
  - Maximum likelihood
  - Smoothing
- Applications
- Main concepts
- Naïve Bayes

Parameter Estimation
[Figure: samples labeled r and g]
- Estimating the distribution of a random variable
- Elicitation: ask a human (why is this hard?)
- Empirically: use training data (learning!)
  - E.g.: for each outcome x, look at the empirical rate of that value:
    P_ML(x) = count(x) / total samples
    (for the sample r g g, P_ML(r) = 1/3)
  - This is the estimate that maximizes the likelihood of the data
- Issue: overfitting. E.g., what if we only observed 1 jelly bean?
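The empirical-rate estimate is just counting. A minimal sketch (the function name ml_estimate is made up for illustration):

```python
from collections import Counter


def ml_estimate(samples):
    """Relative-frequency (maximum likelihood) estimate of P(x) from a list of samples."""
    counts = Counter(samples)
    total = len(samples)
    return {x: c / total for x, c in counts.items()}


print(ml_estimate(["r", "g", "g"]))  # r gets 1/3, g gets 2/3
print(ml_estimate(["r"]))            # one jelly bean: r gets probability 1, g is never mentioned
```

The second call shows the overfitting issue: after a single observation, every unseen outcome implicitly gets probability zero.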

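Estimation: Smoothing
- Relative frequencies are the maximum likelihood estimates
- In Bayesian statistics, we think of the parameters as just another random variable, with its own distribution

Estimation: Laplace Smoothing
- Laplace's estimate:
  - Pretend you saw every outcome once more than you actually did:
    P_LAP(x) = (c(x) + 1) / (N + |X|)
    where c(x) is the count of x, N the number of samples, and |X| the number of possible outcomes
  - Example sample: H H T
  - Can derive this as a MAP estimate with Dirichlet priors (see cs281)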
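As a rough sketch of Laplace's estimate on the H H T sample: the pseudo-count argument k below is an added generalization (k = 1 is "once more than you actually did"; k = 0 falls back to relative frequencies), and the function name is illustrative.

```python
from collections import Counter


def laplace_estimate(samples, outcomes, k=1):
    """Add k imaginary observations of every possible outcome before normalizing.

    outcomes must list every possible value, including unseen ones.
    With k = 0 and no samples the denominator is zero, so that case is not handled.
    """
    counts = Counter(samples)
    denom = len(samples) + k * len(outcomes)
    return {x: (counts[x] + k) / denom for x in outcomes}


flips = ["H", "H", "T"]
print(laplace_estimate(flips, ["H", "T"]))         # H: 3/5, T: 2/5
print(laplace_estimate(flips, ["H", "T"], k=0))    # H: 2/3, T: 1/3 (maximum likelihood)
print(laplace_estimate(flips, ["H", "T"], k=100))  # both close to 1/2
```

Larger pseudo-counts pull the estimate toward uniform, so a handful of flips barely moves it.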

Estimation: Laplace Smoothing
- Laplace's estimate (extended):
  - Pretend you saw every outcome k extra times:
    P_LAP,k(x) = (c(x) + k) / (N + k|X|)
  - Example sample: H H T
  - What's Laplace with k = 0?
  - k is the strength of the prior
- Laplace for conditionals:
  - Smooth each condition independently:
    P_LAP,k(x | y) = (c(x, y) + k) / (c(y) + k|X|)

Example: Spam Filter
- Input: email
- Output: spam/ham
- Setup:
  - Get a large collection of example emails, each labeled "spam" or "ham"
  - Note: someone has to hand label all this data!
  - Want to learn to predict labels of new, future emails
- Features: The attributes used to make the ham / spam decision
  - Words: FREE!
  - Text Patterns: $dd, CAPS
  - Non-text: SenderInContacts
  - ...
- Example emails:
  Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret.
  TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT. 99 MILLION EMAIL ADDRESSES FOR ONLY $99
  Ok, Iknow this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.

Example: Digit Recognition
- Input: images / pixel grids
- Output: a digit 0-9
- Setup:
  - Get a large collection of example images, each labeled with a digit
  - Note: someone has to hand label all this data!
  - Want to learn to predict labels of new, future digit images
- Features: The attributes used to make the digit decision
  - Pixels: (6,8)=ON
  - Shape Patterns: NumComponents, AspectRatio, NumLoops
  - ...
[Figure: example digit images labeled 0, 1, 2, 1, ??]

Other Classification Tasks
- In classification, we predict labels y (classes) for inputs x
- Examples:
  - Spam detection (input: document, classes: spam / ham)
  - OCR (input: images, classes: characters)
  - Medical diagnosis (input: symptoms, classes: diseases)
  - Automatic essay grader (input: document, classes: grades)
  - Fraud detection (input: account activity, classes: fraud / no fraud)
  - Customer service email routing
  - ... many more
- Classification is an important commercial technology!

Important Concepts
- Data: labeled instances, e.g. emails marked spam/ham
  - Training set
  - Held-out set
  - Test set
- Features: attribute-value pairs which characterize each x
- Experimentation cycle
  - Learn parameters (e.g. model probabilities) on the training set
  - (Tune hyperparameters on the held-out set)
  - Compute accuracy on the test set
  - Very important: never "peek" at the test set!
- Evaluation
  - Accuracy: fraction of instances predicted correctly
- Overfitting and generalization
  - Want a classifier which does well on test data
  - Overfitting: fitting the training data very closely, but not generalizing well
  - We'll investigate overfitting and generalization formally in a few lectures
[Figure: data split into Training Data, Held-Out Data, Test Data]

Bayes Nets for Classification
- One method of classification:
  - Use a probabilistic model!
  - Features are observed random variables F_i
  - Y is the query variable
  - Use probabilistic inference to compute the most likely Y
- You already know how to do this inference

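Simple Classification
- Simple example: two binary features
[Figure: Bayes net over M, S, F]
  - Direct estimate: P(m | s, f)
  - Bayes estimate (no assumptions): P(m | s, f) = P(s, f | m) P(m) / P(s, f)
  - Conditional independence: P(s, f | m) = P(s | m) P(f | m)

General Naïve Bayes
- A general naive Bayes model:
  P(Y, F_1 ... F_n) = P(Y) Π_i P(F_i | Y)
  - Full joint table: |Y| x |F|^n parameters
  - Naïve Bayes: |Y| parameters for P(Y), plus n x |F| x |Y| parameters for the P(F_i | Y)
[Figure: Bayes net with Y as the parent of F_1, F_2, ..., F_n]
- We only specify how each feature depends on the class
- Total number of parameters is linear in n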
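To make the "linear in n" point concrete, the two parameter counts can be compared directly. This small sketch counts table entries the same way the slide does; the function names are illustrative.

```python
def full_joint_size(num_classes, num_values, n):
    """Entries in a full joint table over Y and n features: |Y| x |F|^n."""
    return num_classes * num_values ** n


def naive_bayes_size(num_classes, num_values, n):
    """Entries in P(Y) plus n tables P(F_i | Y): |Y| + n x |F| x |Y|."""
    return num_classes + n * num_values * num_classes


# 10 classes (digits), binary features
for n in (3, 20):
    print(n, full_joint_size(10, 2, n), naive_bayes_size(10, 2, n))
# n = 3:  80 vs 70
# n = 20: 10485760 vs 410
```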

Inference for Naïve Bayes
- Goal: compute posterior over causes
  - Step 1: get joint probability of causes and evidence
  - Step 2: get probability of evidence
  - Step 3: renormalize

General Naïve Bayes
- What do we need in order to use naïve Bayes?
  - Inference (you know this part)
    - Start with a bunch of conditionals, P(Y) and the P(F_i | Y) tables
    - Use standard inference to compute P(Y | F_1 ... F_n)
    - Nothing new here
  - Estimates of local conditional probability tables
    - P(Y), the prior over labels
    - P(F_i | Y) for each feature (evidence variable)
    - These probabilities are collectively called the parameters of the model and denoted by θ
    - Up until now, we assumed these appeared by magic, but they typically come from training data: we'll look at this now
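The three inference steps translate almost line for line into code. The sketch below assumes the tables are stored as nested dictionaries; that layout and the toy numbers are illustrative choices, not part of the lecture.

```python
def naive_bayes_posterior(prior, cpts, features):
    """Compute P(Y | f_1 ... f_n).

    prior:    {y: P(y)}
    cpts:     list with one table per feature, each {y: {value: P(value | y)}}
    features: observed feature values, in the same order as cpts
    """
    # Step 1: joint probability P(y, f_1 ... f_n) = P(y) * prod_i P(f_i | y)
    joint = {}
    for y, p_y in prior.items():
        p = p_y
        for cpt, f in zip(cpts, features):
            p *= cpt[y][f]
        joint[y] = p
    # Step 2: probability of the evidence, summing out y
    evidence = sum(joint.values())
    # Step 3: renormalize
    return {y: p / evidence for y, p in joint.items()}


# Toy numbers, made up for illustration
prior = {"spam": 0.5, "ham": 0.5}
cpt = {"spam": {"on": 0.8, "off": 0.2}, "ham": {"on": 0.2, "off": 0.8}}
print(naive_bayes_posterior(prior, [cpt, cpt], ["on", "on"]))
```

With many features the product in Step 1 underflows, so practical implementations usually sum log probabilities instead.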

A Digit Recognizer
- Input: pixel grids
- Output: a digit 0-9

Naïve Bayes for Digits
- Simple version:
  - One feature F_ij for each grid position <i,j>
  - Possible feature values are on / off, based on whether intensity is more or less than 0.5 in the underlying image
  - Each input maps to a feature vector (one on/off value per grid position)
  - Here: lots of features, each is binary valued
- Naïve Bayes model:
  P(Y | all F_ij) ∝ P(Y) Π_{i,j} P(F_ij | Y)
- What do we need to learn?
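Mapping a pixel grid to on/off features might look like the sketch below. Two details are assumptions: the image is taken to be a list of rows with intensities in [0, 1], and an intensity of exactly 0.5 is treated as off, since the slide only says "more or less than 0.5".

```python
def pixel_features(image, threshold=0.5):
    """Return {(i, j): 'on' or 'off'} with one binary feature per grid position.

    Assumes image[i][j] is the intensity at row i, column j.
    """
    return {
        (i, j): "on" if intensity > threshold else "off"
        for i, row in enumerate(image)
        for j, intensity in enumerate(row)
    }


tiny_image = [
    [0.0, 0.9],
    [0.6, 0.1],
]
print(pixel_features(tiny_image))
```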

Examples: CPTs
(A and B are two example pixel features F_ij)

  Y    P(Y)    P(A = on | Y)    P(B = on | Y)
  1    0.1     0.01             0.05
  2    0.1     0.05             0.01
  3    0.1     0.05             0.90
  4    0.1     0.30             0.80
  5    0.1     0.80             0.90
  6    0.1     0.90             0.90
  7    0.1     0.05             0.25
  8    0.1     0.60             0.85
  9    0.1     0.50             0.60
  0    0.1     0.80             0.80

Parameter Estimation
- Estimating the distribution of random variables like X or X | Y
- Empirically: use training data
  - For each outcome x, look at the empirical rate of that value:
    P_ML(x) = count(x) / total samples
    (for the sample r g g, P_ML(r) = 1/3)
  - This is the estimate that maximizes the likelihood of the data
- Elicitation: ask a human!
  - Usually need domain experts, and sophisticated ways of eliciting probabilities (e.g. betting games)
  - Trouble calibrating

A Spam Filter
- Naïve Bayes spam filter
- Data:
  - Collection of emails, labeled spam or ham
  - Note: someone has to hand label all this data!
  - Split into training, held-out, test sets
- Classifiers
  - Learn on the training set
  - (Tune it on a held-out set)
  - Test it on new emails

Naïve Bayes for Text
- Bag-of-Words Naïve Bayes:
  - Predict unknown class label (spam vs. ham)
  - Assume evidence features (e.g. the words) are independent
  - Warning: subtly different assumptions than before!
- Generative model:
  P(C, W_1 ... W_n) = P(C) Π_i P(W_i | C)
  where W_i is the word at position i, not the i-th word in the dictionary!
- Tied distributions and bag-of-words
  - Usually, each variable gets its own conditional probability distribution P(F | Y)
  - In a bag-of-words model
    - Each position is identically distributed
    - All positions share the same conditional probs P(W | C)
    - Why make this assumption?

Example: Spam Filtering
- Model: P(C, W_1 ... W_n) = P(C) Π_i P(W_i | C)
- What are the parameters?

  P(C)            P(W | spam)       P(W | ham)
  ham : 0.66      the : 0.0156      the : 0.0210
  spam: 0.33      to  : 0.0153      to  : 0.0133
                  and : 0.0115      of  : 0.0119
                  of  : 0.0095      2002: 0.0110
                  you : 0.0093      with: 0.0108
                  a   : 0.0086      from: 0.0107
                  with: 0.0080      and : 0.0105
                  from: 0.0075      a   : 0.0100
                  ...               ...

- Where do these tables come from?

Spam Example
(Tot Spam and Tot Ham are running totals of log probabilities)

  Word      P(w|spam)   P(w|ham)   Tot Spam   Tot Ham
  (prior)   0.33333     0.66666    -1.1       -0.4
  Gary      0.00002     0.00021    -11.8      -8.9
  would     0.00069     0.00084    -19.1      -16.0
  you       0.00881     0.00304    -23.8      -21.8
  like      0.00086     0.00083    -30.9      -28.9
  to        0.01517     0.01339    -35.1      -33.2
  lose      0.00008     0.00002    -44.5      -44.0
  weight    0.00016     0.00002    -53.3      -55.0
  while     0.00027     0.00027    -61.5      -63.2
  you       0.00881     0.00304    -66.2      -69.0
  sleep     0.00006     0.00001    -76.0      -80.5

  P(spam | w) = 98.9%
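The running totals in the table can be approximately reproduced by summing natural logs of the listed probabilities. The sketch below uses the rounded values printed in the table, so its totals differ slightly from the slide's, which were presumably computed from unrounded probabilities; the function names are made up for illustration.

```python
import math


def running_log_totals(prior, word_probs):
    """Cumulative log P(class) + sum_i log P(w_i | class), one entry per table row."""
    total = math.log(prior)
    totals = [total]
    for p in word_probs:
        total += math.log(p)
        totals.append(total)
    return totals


def posterior_of_first(log_a, log_b):
    """Normalize two log joint scores; subtracting the max avoids underflow."""
    m = max(log_a, log_b)
    a, b = math.exp(log_a - m), math.exp(log_b - m)
    return a / (a + b)


# Rounded values from the table: Gary would you like to lose weight while you sleep
p_w_spam = [0.00002, 0.00069, 0.00881, 0.00086, 0.01517,
            0.00008, 0.00016, 0.00027, 0.00881, 0.00006]
p_w_ham = [0.00021, 0.00084, 0.00304, 0.00083, 0.01339,
           0.00002, 0.00002, 0.00027, 0.00304, 0.00001]

spam_totals = running_log_totals(0.33333, p_w_spam)
ham_totals = running_log_totals(0.66666, p_w_ham)
p_spam = posterior_of_first(spam_totals[-1], ham_totals[-1])
print(round(spam_totals[-1], 1), round(ham_totals[-1], 1), round(p_spam, 3))
```

Working in log space matters here: the raw joint probabilities are on the order of e^-76, and longer emails would underflow floating point entirely.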

Example: Overfitting
[Figure: digit classification example, captioned "2 wins!!"]

Example: Overfitting
- Posteriors determined by relative probabilities (odds ratios):

  P(W | ham) / P(W | spam)      P(W | spam) / P(W | ham)
  south-west : inf              screens    : inf
  nation     : inf              minute     : inf
  morally    : inf              guaranteed : inf
  nicely     : inf              $205.00    : inf
  extent     : inf              delivery   : inf
  seriously  : inf              signature  : inf
  ...                           ...

- What went wrong here?

Generalization and Overfitting
- Relative frequency parameters will overfit the training data!
  - Just because we never saw a 3 with pixel (15,15) on during training doesn't mean we won't see it at test time
  - Unlikely that every occurrence of "minute" is 100% spam
  - Unlikely that every occurrence of "seriously" is 100% ham
  - What about all the words that don't occur in the training set at all?
  - In general, we can't go around giving unseen events zero probability
- As an extreme case, imagine using the entire email as the only feature
  - Would get the training data perfect (if deterministic labeling)
  - Wouldn't generalize at all
  - Just making the bag-of-words assumption gives us some generalization, but isn't enough
- To generalize better: we need to smooth or regularize the estimates

Estimation: Smoothing
- Problems with maximum likelihood estimates:
  - If I flip a coin once, and it's heads, what's the estimate for P(heads)?
  - What if I flip 10 times with 8 heads?
  - What if I flip 10M times with 8M heads?
- Basic idea:
  - We have some prior expectation about parameters (here, the probability of heads)
  - Given little evidence, we should skew towards our prior
  - Given a lot of evidence, we should listen to the data

Estimation: Linear Interpolation
- In practice, Laplace often performs poorly for P(X | Y):
  - When |X| is very large
  - When |Y| is very large
- Another option: linear interpolation
  - Also get P(X) from the data
  - Make sure the estimate of P(X | Y) isn't too different from P(X):
    P_LIN(x | y) = α P_ML(x | y) + (1 - α) P_ML(x)
  - What if α is 0? 1?
- For even better ways to estimate parameters, as well as details of the math, see cs281, cs288
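Linear interpolation mixes the conditional estimate with the unconditional one. A minimal sketch, assuming both estimates are relative-frequency tables stored as dictionaries (the word probabilities are made up):

```python
def interpolate(p_x_given_y, p_x, alpha):
    """Return alpha * P(x | y) + (1 - alpha) * P(x) for every outcome x.

    Outcomes never seen with this y are missing from p_x_given_y and count as 0 there.
    """
    return {
        x: alpha * p_x_given_y.get(x, 0.0) + (1 - alpha) * p_x[x]
        for x in p_x
    }


p_word = {"free": 0.1, "the": 0.9}   # P(W), made-up numbers
p_word_given_ham = {"the": 1.0}      # "free" never seen in ham
print(interpolate(p_word_given_ham, p_word, 0.5))  # free: 0.05, the: 0.95
```

With alpha = 1 this is the unsmoothed conditional estimate; with alpha = 0 the class is ignored entirely.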

Real NB: Smoothing
- For real classification problems, smoothing is critical
- New odds ratios:

  P(W | ham) / P(W | spam)      P(W | spam) / P(W | ham)
  helvetica : 11.4              verdana : 28.8
  seems     : 10.8              Credit  : 28.4
  group     : 10.2              ORDER   : 27.2
  ago       : 8.4               <FONT>  : 26.9
  areas     : 8.3               money   : 26.5
  ...                           ...

- Do these make more sense?

Tuning on Held-Out Data
- Now we've got two kinds of unknowns
  - Parameters: the probabilities P(X | Y), P(Y)
  - Hyperparameters, like the amount of smoothing to do: k, α
- Where to learn?
  - Learn parameters from training data
  - Must tune hyperparameters on different data. Why?
  - For each value of the hyperparameters, train and test on the held-out data
  - Choose the best value and do a final test on the test data
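The tuning loop itself is short. In this sketch, train and evaluate are hypothetical caller-supplied functions (fit a model with a given hyperparameter value; return accuracy on a data set); the toy stand-ins below exist only to make the example runnable.

```python
def tune_hyperparameter(candidates, train, evaluate, training_data, held_out_data):
    """Pick the hyperparameter value whose trained model scores best on held-out data."""
    best_value, best_accuracy = None, -1.0
    for value in candidates:
        model = train(training_data, value)        # parameters come from training data only
        accuracy = evaluate(model, held_out_data)  # hyperparameter is judged on held-out data
        if accuracy > best_accuracy:
            best_value, best_accuracy = value, accuracy
    return best_value, best_accuracy


# Toy stand-ins: the "model" is just k, and accuracy peaks at k = 2.
toy_train = lambda data, k: k
toy_evaluate = lambda model, data: 1.0 - abs(model - 2) / 10

print(tune_hyperparameter([0, 1, 2, 5], toy_train, toy_evaluate, None, None))
```

The test set never appears in this loop; it is used once, after the best value has been chosen.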

Baselines
- First step: get a baseline
  - Baselines are very simple "straw man" procedures
  - Help determine how hard the task is
  - Help know what a "good" accuracy is
- Weak baseline: most frequent label classifier
  - Gives all test instances whatever label was most common in the training set
  - E.g. for spam filtering, might label everything as ham
  - Accuracy might be very high if the problem is skewed
  - E.g. calling everything "ham" gets 66%, so a classifier that gets 70% isn't very good
- For real research, usually use previous work as a (strong) baseline

Confidences from a Classifier
- The confidence of a probabilistic classifier:
  - Posterior over the top label
  - Represents how sure the classifier is of the classification
  - Any probabilistic model will have confidences
  - No guarantee confidence is correct
- Calibration
  - Weak calibration: higher confidences mean higher accuracy
  - Strong calibration: confidence predicts accuracy rate
  - What's the value of calibration?
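The weak baseline described above, the most frequent label classifier, takes only a few lines. The labels below are toy data chosen to mirror the two-thirds ham example.

```python
from collections import Counter


def most_frequent_label(training_labels):
    """Return the label that occurs most often in the training set."""
    return Counter(training_labels).most_common(1)[0][0]


def accuracy(predictions, gold):
    """Fraction of instances predicted correctly."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


train_labels = ["ham", "ham", "spam"]
test_labels = ["ham", "spam", "ham"]

baseline = most_frequent_label(train_labels)   # "ham"
predictions = [baseline] * len(test_labels)    # same label for every test instance
print(baseline, accuracy(predictions, test_labels))
```

Any learned classifier should be compared against this number before its accuracy is called good.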

Precision vs. Recall
- Let's say we want to classify web pages as homepages or not
  - In a test set of 1K pages, there are 3 homepages
  - Our classifier says they are all non-homepages
  - 99.7% accuracy!
  - Need new measures for rare positive events
- Precision: fraction of guessed positives which were actually positive
- Recall: fraction of actual positives which were guessed as positive
- Say we guess 5 homepages, of which 2 were actually homepages
  - Precision: 2 correct / 5 guessed = 0.4
  - Recall: 2 correct / 3 true = 0.67
- Which is more important in customer support email automation?
- Which is more important in airport face recognition?
[Figure: Venn diagram with regions labeled -, guessed +, actual +]

Precision vs. Recall
- Precision/recall tradeoff
  - Often, you can trade off precision and recall
  - Only works well with weakly calibrated classifiers
- To summarize the tradeoff:
  - Break-even point: precision value when p = r
  - F-measure: harmonic mean of p and r:
    F = 2pr / (p + r)
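The homepage example can be checked with a small helper. This sketch represents guessed and actual positives as sets of page ids; the ids are made up, and returning 0.0 when a denominator is empty is a convention chosen here, not something stated in the lecture.

```python
def precision_recall_f1(guessed, actual):
    """Precision, recall, and F-measure for sets of guessed and actual positives."""
    guessed, actual = set(guessed), set(actual)
    correct = len(guessed & actual)
    precision = correct / len(guessed) if guessed else 0.0
    recall = correct / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of p and r
    return precision, recall, f1


actual_homepages = {1, 2, 3}      # 3 true homepages (hypothetical ids)
guessed = {1, 2, 10, 11, 12}      # 5 guesses, 2 of them correct
print(precision_recall_f1(guessed, actual_homepages))  # precision 0.4, recall about 0.67, F 0.5

# The all-negative classifier: high accuracy on 1K pages, but zero recall.
print(precision_recall_f1(set(), actual_homepages))
```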

Errors, and What to Do
- Examples of errors:
  Dear GlobalSCAPE Customer, GlobalSCAPE has partnered with ScanSoft to offer you the latest version of OmniPage Pro, for just $99.99* - the regular list price is $499! The most common question we've received about this offer is - Is this genuine? We would like to assure you that this offer is authorized by ScanSoft, is genuine and valid. You can get the . . .
  . . . To receive your $30 Amazon.com promotional certificate, click through to http://www.amazon.com/apparel and see the prominent link for the $30 offer. All details are there. We hope you enjoyed receiving this message. However, if you'd rather not receive future e-mails announcing new store launches, please click . . .

What to Do About Errors?
- Need more features: words aren't enough!
  - Have you emailed the sender before?
  - Have 1K other people just gotten the same email?
  - Is the sending information consistent?
  - Is the email in ALL CAPS?
  - Do inline URLs point where they say they point?
  - Does the email address you by (your) name?
- Can add these information sources as new variables in the NB model
- Next class we'll talk about classifiers which let you add arbitrary features more easily

Summary: Naïve Bayes Classifier
- Bayes rule lets us do diagnostic queries with causal probabilities
- The naïve Bayes assumption takes all features to be independent given the class label
- We can build classifiers out of a naïve Bayes model using training data
- Smoothing estimates is important in real systems
- Classifier confidences are useful, when you can get them