Essential Probability Concepts


Naïve Bayes

Essential Probability Concepts

Marginalization:   P(B) = Σ_{v ∈ values(A)} P(B ∧ A = v)

Conditional probability:   P(A|B) = P(A ∧ B) / P(B)

Bayes' rule:   P(A|B) = P(B|A) · P(A) / P(B)

Independence:
  A ⊥ B  ↔  P(A ∧ B) = P(A) · P(B)  ↔  P(A|B) = P(A)
  A ⊥ B | C  ↔  P(A ∧ B | C) = P(A|C) · P(B|C)

Density Estimation

Recall the Joint Distribution...

                  alarm                      ¬alarm
           earthquake  ¬earthquake    earthquake  ¬earthquake
burglary      0.01        0.08           0.001       0.009
¬burglary     0.01        0.09           0.01        0.79

How Can We Obtain a Joint Distribution?

Option 1: Elicit it from an expert human
Option 2: Build it up from simpler probabilistic facts
  e.g., if we knew P(a) = 0.7, P(b|a) = 0.2, and P(b|¬a) = 0.1, then we could compute P(a|b)
Option 3: Learn it from data...

Based on slide by Andrew Moore
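As a sketch of the "build it up from simpler facts" option, the three quantities in the example above pin down the joint, so P(a|b) follows from marginalization plus Bayes' rule:

```python
# P(a), P(b|a), and P(b|¬a) from the example above
p_a = 0.7
p_b_given_a = 0.2
p_b_given_not_a = 0.1

# Marginalization: P(b) = P(b|a) P(a) + P(b|¬a) P(¬a)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' rule: P(a|b) = P(b|a) P(a) / P(b)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 4))  # 0.8235
```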

Learning a Joint Distribution

Step 1: Build a JD table for your attributes, in which the probabilities are unspecified:

A B C | Prob
0 0 0 | ?
0 0 1 | ?
0 1 0 | ?
0 1 1 | ?
1 0 0 | ?
1 0 1 | ?
1 1 0 | ?
1 1 1 | ?

Step 2: Then, fill in each row with
  P̂(row) = (records matching row) / (total number of records)

A B C | Prob
0 0 0 | 0.30
0 0 1 | 0.05
0 1 0 | 0.10
0 1 1 | 0.05
1 0 0 | 0.05
1 0 1 | 0.10
1 1 0 | 0.25
1 1 1 | 0.10

e.g., 0.25 is the fraction of all records in which A and B are true but C is false.

Slide by Andrew Moore
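A minimal sketch of this counting procedure. The 20-record dataset here is made up, but chosen so the learned table matches the example probabilities:

```python
from collections import Counter

def learn_joint(records):
    """P̂(row) = (# records matching row) / (total # of records)."""
    n = len(records)
    return {row: count / n for row, count in Counter(records).items()}

# A made-up 20-record dataset over three binary attributes (A, B, C)
data = ([(0, 0, 0)] * 6 + [(0, 0, 1)] * 1 + [(0, 1, 0)] * 2 + [(0, 1, 1)] * 1 +
        [(1, 0, 0)] * 1 + [(1, 0, 1)] * 2 + [(1, 1, 0)] * 5 + [(1, 1, 1)] * 2)

jd = learn_joint(data)
print(jd[(1, 1, 0)])  # 0.25: the fraction of records where A and B are true but C is false
```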

Example of Learning a Joint PD

This joint PD was obtained by learning from three attributes in the UCI Adult Census Database [Kohavi 1995].

Slide by Andrew Moore

Density Estimation

Our joint distribution learner is an example of something called Density Estimation. A density estimator learns a mapping from a set of attributes to a probability:

Input attributes → Density Estimator → Probability

Slide by Andrew Moore

Density Estimation

Compare it against the two other major kinds of models:

Input attributes → Classifier → Prediction of categorical output
Input attributes → Density Estimator → Probability
Input attributes → Regressor → Prediction of real-valued output

Slide by Andrew Moore

Evaluating Density Estimation

Test-set criterion for estimating performance on future data:

Input attributes → Classifier → Prediction of categorical output   (test-set accuracy)
Input attributes → Density Estimator → Probability                 (?)
Input attributes → Regressor → Prediction of real-valued output    (test-set accuracy)

Slide by Andrew Moore

Evaluating a Density Estimator

Given a record x, a density estimator M can tell you how likely the record is:
  P̂(x | M)

The density estimator can also tell you how likely the dataset is, under the assumption that all records were independently generated from the density estimator's JD (that is, i.i.d.):
  P̂(x_1 ∧ x_2 ∧ ... ∧ x_n | M) = ∏_{i=1}^n P̂(x_i | M)

Slide by Andrew Moore

Example Small Dataset: Miles Per Gallon

From the UCI repository (thanks to Ross Quinlan); 192 records in the training set.

mpg   modelyear  maker
good  75to78     asia
bad   70to74     america
bad   75to78     europe
bad   70to74     america
bad   70to74     america
bad   70to74     asia
bad   70to74     asia
bad   75to78     america
:     :          :
bad   70to74     america
good  79to83     america
bad   75to78     america
good  79to83     america
bad   75to78     america
good  79to83     america
good  79to83     america
bad   70to74     america
good  75to78     europe
bad   75to78     europe

Slide by Andrew Moore

Example Small Dataset: Miles Per Gallon

For the same 192-record training set:
  P̂(dataset | M) = ∏_{i=1}^n P̂(x_i | M) = 3.4 × 10^-203 (in this case)

Slide by Andrew Moore

Log Probabilities

For decent-sized data sets, this product will underflow:
  P̂(dataset | M) = ∏_{i=1}^n P̂(x_i | M)

Therefore, since probabilities of datasets get so small, we usually use log probabilities:
  log P̂(dataset | M) = log ∏_{i=1}^n P̂(x_i | M) = Σ_{i=1}^n log P̂(x_i | M)

Based on slide by Andrew Moore
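A quick demonstration of the underflow, with hypothetical numbers (400 records, each assigned probability 1e-3 by the model; these values are made up for illustration):

```python
import math

# Hypothetical per-record likelihoods: the direct product underflows
# to 0.0 in double precision, while the sum of logs stays well-behaved.
probs = [1e-3] * 400

product = 1.0
for p in probs:
    product *= p

log_likelihood = sum(math.log(p) for p in probs)

print(product)         # 0.0 (underflow)
print(log_likelihood)  # ≈ -2763.1
```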

Example Small Dataset: Miles Per Gallon

For the same 192-record training set:
  log P̂(dataset | M) = Σ_{i=1}^n log P̂(x_i | M) = -466.19 (in this case)

Slide by Andrew Moore

Pros/Cons of the Joint Density Estimator

The good news: We can learn a density estimator from data, and density estimators can do many good things:
- Can sort the records by probability, and thus spot weird records (anomaly detection)
- Can do inference
- Ingredient for Bayes classifiers (coming very soon...)

The bad news: Density estimation by directly learning the joint is trivial, mindless, and dangerous.

Slide by Andrew Moore

The Joint Density Estimator on a Test Set

An independent test set with 196 cars has a much worse log-likelihood. In fact, it's a billion quintillion quintillion quintillion quintillion times less likely.

Density estimators can overfit... and the full joint density estimator is the overfittest of them all!

Slide by Andrew Moore

Overfitting Density Estimators

If this ever happens, the joint PDE learns that there are certain combinations that are impossible:
  log P̂(dataset | M) = Σ_{i=1}^n log P̂(x_i | M) = -∞ if, for any i, P̂(x_i | M) = 0

Slide by Andrew Moore

Curse of Dimensionality

Slide by Christopher Bishop

The Joint Density Estimator on a Test Set

The only reason that the test set didn't score -∞ is that the code was hard-wired to always predict a probability of at least 1/10^20.

We need density estimators that are less prone to overfitting...

Slide by Andrew Moore

The Naïve Bayes Classifier

Bayes' Rule

Recall Bayes' rule:
  P(hypothesis | evidence) = P(evidence | hypothesis) · P(hypothesis) / P(evidence)

Equivalently, we can write
  P(Y = y_k | X = x_i) = P(Y = y_k) · P(X = x_i | Y = y_k) / P(X = x_i)
where X is a random variable representing the evidence and Y is a random variable for the label.

This is actually short for
  P(Y = y_k | X = x_i) = P(Y = y_k) · P(X_1 = x_{i,1} ∧ ... ∧ X_d = x_{i,d} | Y = y_k) / P(X_1 = x_{i,1} ∧ ... ∧ X_d = x_{i,d})
where X_j denotes the random variable for the j-th feature.

Naïve Bayes Classifier

Idea: Use the training data to estimate P(X | Y) and P(Y). Then, use Bayes' rule to infer P(Y | X_new) for new data:

  P(Y = y_k | X = x_i) = P(Y = y_k) · P(X_1 = x_{i,1} ∧ ... ∧ X_d = x_{i,d} | Y = y_k) / P(X_1 = x_{i,1} ∧ ... ∧ X_d = x_{i,d})

- The prior P(Y = y_k): easy to estimate from data
- The class-conditional joint P(X_1 = x_{i,1} ∧ ... ∧ X_d = x_{i,d} | Y = y_k): impractical, but necessary; recall that estimating the joint probability distribution P(X_1, X_2, ..., X_d | Y) is not practical
- The denominator: unnecessary, as it turns out
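One way to quantify "impractical" is to count parameters. Assuming, purely for illustration, d binary attributes, the class-conditional joint needs exponentially many independent numbers per class:

```python
# Number of independent parameters needed to specify P(X_1, ..., X_d | Y = y_k)
# for d binary attributes (a hypothetical d = 30 for illustration)
d = 30

full_joint_params = 2 ** d - 1  # one probability per attribute combination, minus the sum-to-1 constraint
naive_bayes_params = d          # one parameter P(X_j = 1 | Y = y_k) per attribute under independence

print(full_joint_params)   # 1073741823
print(naive_bayes_params)  # 30
```

With only a few hundred training records, there is no hope of estimating a billion probabilities reliably, which is why the factored form below matters.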

Naïve Bayes Classifier

Problem: estimating the joint PD or CPD isn't practical: it severely overfits, as we saw before.

However, if we make the assumption that the attributes are independent given the class label, estimation is easy!
  P(X_1, X_2, ..., X_d | Y) = ∏_{j=1}^d P(X_j | Y)

In other words, we assume all attributes are conditionally independent given Y. Often this assumption is violated in practice, but more on that later.
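With this factorization, training reduces to per-class counting. A sketch on the four-record weather dataset used in the training slides that follow (only P(play), P(Sky = sunny | play), and P(Humid = high | play) are computed here):

```python
# Per-class counting on the four-record weather dataset from the slides
data = [
    # (Sky, Temp, Humid, Wind, Water, Forecast, Play?)
    ("sunny", "warm", "normal", "strong", "warm", "same",   "yes"),
    ("sunny", "warm", "high",   "strong", "warm", "same",   "yes"),
    ("rainy", "cold", "high",   "strong", "warm", "change", "no"),
    ("sunny", "warm", "high",   "strong", "cool", "change", "yes"),
]

plays = [r for r in data if r[-1] == "yes"]
p_play = len(plays) / len(data)                                        # 3/4
p_sunny_given_play = sum(r[0] == "sunny" for r in plays) / len(plays)  # 1
p_high_given_play = sum(r[2] == "high" for r in plays) / len(plays)    # 2/3

print(p_play, p_sunny_given_play, p_high_given_play)
```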

Training Naïve Bayes

Estimate P(X_j | Y) and P(Y) directly from the training data by counting!

Sky    Temp  Humid   Wind    Water  Forecast  Play?
sunny  warm  normal  strong  warm   same      yes
sunny  warm  high    strong  warm   same      yes
rainy  cold  high    strong  warm   change    no
sunny  warm  high    strong  cool   change    yes

P(play) = 3/4                   P(¬play) = 1/4
P(Sky = sunny | play) = 1       P(Sky = sunny | ¬play) = 0
P(Humid = high | play) = 2/3    P(Humid = high | ¬play) = 1
...

Laplace Smoothing

Notice that some probabilities estimated by counting might be zero; possible overfitting!

Fix by using Laplace smoothing, which adds 1 to each count:

  P(X_j = v | Y = y_k) = (c_v + 1) / (Σ_{v' ∈ values(X_j)} c_{v'} + |values(X_j)|)

where c_v is the count of training instances with a value of v for attribute j and class label y_k, and |values(X_j)| is the number of values X_j can take on.
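The formula above can be sketched as a small helper. On the weather data, Sky is "sunny" in all three play = yes instances, yet the smoothed estimates avoid the extreme values 1 and 0:

```python
def laplace_estimate(value, observed_values, all_values):
    """Add-one (Laplace) estimate of P(X_j = value | Y = y_k).

    observed_values: attribute values of the training instances with label y_k
    all_values: every value the attribute can take on
    """
    c_v = sum(v == value for v in observed_values)
    return (c_v + 1) / (len(observed_values) + len(all_values))

# Sky among the three play=yes instances is "sunny" every time
sky_given_play = ["sunny", "sunny", "sunny"]
print(laplace_estimate("sunny", sky_given_play, {"sunny", "rainy"}))  # 0.8 (= 4/5)
print(laplace_estimate("rainy", sky_given_play, {"sunny", "rainy"}))  # 0.2 (= 1/5)
```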

Training Naïve Bayes with Laplace Smoothing

Estimate P(X_j | Y) and P(Y) directly from the training data by counting, with Laplace smoothing:

Sky    Temp  Humid   Wind    Water  Forecast  Play?
sunny  warm  normal  strong  warm   same      yes
sunny  warm  high    strong  warm   same      yes
rainy  cold  high    strong  warm   change    no
sunny  warm  high    strong  cool   change    yes

P(play) = 3/4                   P(¬play) = 1/4
P(Sky = sunny | play) = 4/5     P(Sky = sunny | ¬play) = 1/3
P(Humid = high | play) = 3/5    P(Humid = high | ¬play) = 2/3
...

Using the Naïve Bayes Classifier

Now we have
  P(Y = y_k | X = x_i) = P(Y = y_k) · ∏_{j=1}^d P(X_j = x_{i,j} | Y = y_k) / P(X = x_i)

The denominator P(X = x_i) is constant for a given instance, and so irrelevant to our prediction.

To classify a new point x (where x_j is the j-th attribute value of x):
  h(x) = argmax_{y_k} P(Y = y_k) · ∏_{j=1}^d P(X_j = x_j | Y = y_k)
       = argmax_{y_k} log P(Y = y_k) + Σ_{j=1}^d log P(X_j = x_j | Y = y_k)

In practice, we use log-probabilities to prevent underflow.

The Naïve Bayes Classifier Algorithm

For each class label y_k:
  - Estimate P(Y = y_k) from the data
  - For each value x_{i,j} of each attribute X_i:
      Estimate P(X_i = x_{i,j} | Y = y_k)

Classify a new point via:
  h(x) = argmax_{y_k} log P(Y = y_k) + Σ_{j=1}^d log P(X_j = x_j | Y = y_k)

In practice, the independence assumption often doesn't hold true, but Naïve Bayes performs very well despite it.
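The algorithm above can be sketched end to end: training by counting with Laplace smoothing, classification by summed log-probabilities, on the three-attribute weather data from the example slides:

```python
import math
from collections import Counter

def train_naive_bayes(X, y):
    """Estimate log P(Y) and Laplace-smoothed log P(X_j | Y) by counting."""
    n = len(y)
    class_counts = Counter(y)
    log_prior = {k: math.log(c / n) for k, c in class_counts.items()}
    d = len(X[0])
    values = [{row[j] for row in X} for j in range(d)]  # values(X_j)
    # counts[k][j][v] = number of training instances with label k and X_j = v
    counts = {k: [Counter() for _ in range(d)] for k in class_counts}
    for row, k in zip(X, y):
        for j, v in enumerate(row):
            counts[k][j][v] += 1

    def log_cond(k, j, v):
        # Laplace smoothing: (c_v + 1) / (count of class k + |values(X_j)|)
        return math.log((counts[k][j][v] + 1) / (class_counts[k] + len(values[j])))

    return log_prior, log_cond

def classify(x, log_prior, log_cond):
    # h(x) = argmax_k [ log P(Y=k) + sum_j log P(X_j = x_j | Y=k) ]
    return max(log_prior,
               key=lambda k: log_prior[k] + sum(log_cond(k, j, v) for j, v in enumerate(x)))

# Toy weather data (Sky, Temp, Humid) from the slides
X = [("sunny", "warm", "normal"), ("sunny", "warm", "high"),
     ("rainy", "cold", "high"), ("sunny", "warm", "high")]
y = ["yes", "yes", "no", "yes"]
log_prior, log_cond = train_naive_bayes(X, y)
print(classify(("rainy", "warm", "normal"), log_prior, log_cond))  # yes
```

This reproduces the slides' worked example: the point (rainy, warm, normal) is classified as "yes" (play).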

Computing Probabilities (Not Just Predicting Labels)

The NB classifier gives predictions, not probabilities, because we ignore P(X) (the denominator in Bayes' rule). We can produce probabilities as follows:

For each possible class label y_k, compute the unnormalized score
  P̃(Y = y_k | X = x) = P(Y = y_k) · ∏_{j=1}^d P(X_j = x_j | Y = y_k)

This is the numerator of Bayes' rule, and is therefore off from the true probability by a factor α that makes the probabilities sum to 1:
  α = 1 / Σ_{k=1}^{#classes} P̃(Y = y_k | X = x)

The class probability is then given by
  P(Y = y_k | X = x) = α · P̃(Y = y_k | X = x)
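Continuing the weather example, normalizing the two unnormalized scores for x = (rainy, warm, normal) turns the prediction into class probabilities:

```python
import math

# Unnormalized scores P(Y = y_k) · Π_j P(X_j = x_j | Y = y_k) for
# x = (rainy, warm, normal), using the smoothed CPTs from the example
score = {
    "yes": (3/4) * (1/5) * (4/5) * (2/5),
    "no":  (1/4) * (2/3) * (1/3) * (1/3),
}

alpha = 1 / sum(score.values())  # α makes the probabilities sum to 1
posterior = {k: alpha * s for k, s in score.items()}

print(round(posterior["yes"], 3))                  # 0.722
print(math.isclose(sum(posterior.values()), 1.0))  # True
```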

Naïve Bayes Applications

Text classification:
- Which e-mails are spam?
- Which e-mails are meeting notices?
- Which author wrote a document?

Classifying mental states:
- Learning P(BrainActivity | WordCategory)
- Pairwise classification accuracy: 85% (People Words vs. Animal Words)

The Naïve Bayes Graphical Model

Label (hypothesis): node Y, with P(Y)
Attributes (evidence): nodes X_1, ..., X_i, ..., X_d, with P(X_1 | Y), ..., P(X_i | Y), ..., P(X_d | Y)

Nodes denote random variables. Edges denote dependency. Each node has an associated conditional probability table (CPT), conditioned upon its parents.

Example NB Graphical Model

Data:
Sky    Temp  Humid   Play?
sunny  warm  normal  yes
sunny  warm  high    yes
rainy  cold  high    no
sunny  warm  high    yes

Graph: Play is the parent of Sky, Temp, and Humid.

CPTs (attribute CPTs use Laplace smoothing):

Play?  P(Play)
yes    3/4
no     1/4

Sky    Play?  P(Sky | Play)
sunny  yes    4/5
rainy  yes    1/5
sunny  no     1/3
rainy  no     2/3

Temp  Play?  P(Temp | Play)
warm  yes    4/5
cold  yes    1/5
warm  no     1/3
cold  no     2/3

Humid  Play?  P(Humid | Play)
high   yes   3/5
norm   yes   2/5
high   no    2/3
norm   no    1/3

There are some redundancies in these CPTs that can be eliminated: within each conditioning case the entries sum to 1, so one row of each complementary pair determines the other.

Example Using NB for Classification

Using the CPTs above and
  h(x) = argmax_{y_k} log P(Y = y_k) + Σ_{j=1}^d log P(X_j = x_j | Y = y_k)

Goal: predict the label for x = (rainy, warm, normal).

score(yes) = 3/4 · 1/5 · 4/5 · 2/5 = 0.048
score(no) = 1/4 · 2/3 · 1/3 · 1/3 ≈ 0.0185

Since score(yes) > score(no), we predict PLAY.

Naïve Bayes Summary

Advantages:
- Fast to train (single scan through the data)
- Fast to classify
- Not sensitive to irrelevant features
- Handles real and discrete data
- Handles streaming data well

Disadvantages:
- Assumes independence of features

Slide by Eamonn Keogh