Bayesian Methods: Naïve Bayes

Bayesian Methods: Naïve Bayes Nicholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate

Last Time
Parameter learning: learning the parameter of a simple coin-flipping model, prior distributions, and posterior distributions.
Today: more parameter learning and naïve Bayes.

Maximum Likelihood Estimation (MLE)
Data: an observed set of $\alpha_H$ heads and $\alpha_T$ tails.
Hypothesis: coin flips follow a binomial distribution.
Learning: find the best $\theta$.
MLE: choose $\theta$ to maximize the likelihood (the probability of $D$ given $\theta$):
$\theta_{MLE} = \arg\max_\theta p(D \mid \theta)$
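As a concrete aside (not part of the slides), the coin-flip MLE has the closed form $\theta_{MLE} = \alpha_H / (\alpha_H + \alpha_T)$; a minimal sketch in Python, assuming flips are encoded as 1 = heads, 0 = tails:

```python
# Minimal sketch (illustrative, not from the slides): MLE of the coin bias
# from observed 0/1 flips, theta_MLE = alpha_H / (alpha_H + alpha_T).
def coin_mle(flips):
    alpha_h = sum(flips)             # number of heads (1s)
    alpha_t = len(flips) - alpha_h   # number of tails (0s)
    return alpha_h / (alpha_h + alpha_t)

print(coin_mle([1, 0, 1, 1, 0]))  # 0.6
```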

MAP Estimation
Choosing $\theta$ to maximize the posterior distribution is called maximum a posteriori (MAP) estimation:
$\theta_{MAP} = \arg\max_\theta p(\theta \mid D)$
The only difference between $\theta_{MAP}$ and $\theta_{MLE}$ is that one assumes a uniform prior (MLE) and the other allows an arbitrary prior.
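For comparison, with a $Beta(\beta_H, \beta_T)$ prior (as in the earlier coin-flipping example) the MAP estimate also has a closed form, $\theta_{MAP} = \frac{\alpha_H + \beta_H - 1}{\alpha_H + \beta_H + \alpha_T + \beta_T - 2}$. A hedged sketch; the hyperparameter values below are illustrative:

```python
# Illustrative sketch: MAP estimate of the coin bias under a Beta(beta_h, beta_t) prior.
# With the uniform prior Beta(1, 1), this reduces to the MLE.
def coin_map(flips, beta_h=2.0, beta_t=2.0):
    alpha_h = sum(flips)
    alpha_t = len(flips) - alpha_h
    return (alpha_h + beta_h - 1) / (alpha_h + beta_h + alpha_t + beta_t - 2)

print(coin_map([1, 0, 1, 1, 0]))        # ~0.571, pulled toward 0.5 by the prior
print(coin_map([1, 0, 1, 1, 0], 1, 1))  # 0.6, same as the MLE
```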

Sample Complexity
How many coin flips do we need in order to guarantee that our learned parameter does not differ too much from the true parameter (with high probability)? We can use the Chernoff bound (again!).
Suppose $Y_1, \dots, Y_N$ are i.i.d. random variables taking values in $\{0, 1\}$ such that $E_p[Y_i] = y$. For $\epsilon > 0$,
$p\left(\left|y - \frac{1}{N}\sum_i Y_i\right| \geq \epsilon\right) \leq 2e^{-2N\epsilon^2}$
For the coin-flipping problem with $X_1, \dots, X_N$ i.i.d. coin flips and $\epsilon > 0$, this gives
$p\left(\left|\theta_{true} - \frac{1}{N}\sum_i X_i\right| \geq \epsilon\right) = p\left(\left|\theta_{true} - \theta_{MLE}\right| \geq \epsilon\right) \leq 2e^{-2N\epsilon^2}$
Requiring the failure probability to be at most $\delta$:
$\delta \geq 2e^{-2N\epsilon^2} \quad\Longrightarrow\quad N \geq \frac{1}{2\epsilon^2}\ln\frac{2}{\delta}$
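The last line is the standard sample-complexity bound $N \geq \frac{1}{2\epsilon^2}\ln\frac{2}{\delta}$; a quick sketch of evaluating it (the particular $\epsilon$ and $\delta$ values are just examples):

```python
import math

# Illustrative: number of flips N that guarantees |theta_true - theta_MLE| < eps
# with probability at least 1 - delta, from the Chernoff bound.
def coin_sample_complexity(eps, delta):
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(coin_sample_complexity(0.1, 0.05))   # 185
print(coin_sample_complexity(0.01, 0.05))  # 18445
```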

MLE for Gaussian Distributions
A two-parameter distribution characterized by a mean and a variance.

Some Properties of Gaussians
Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian: if $X \sim N(\mu, \sigma^2)$ and $Y = aX + b$, then $Y \sim N(a\mu + b, a^2\sigma^2)$.
The sum of independent Gaussians is Gaussian: if $X \sim N(\mu_X, \sigma_X^2)$, $Y \sim N(\mu_Y, \sigma_Y^2)$, and $Z = X + Y$, then $Z \sim N(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)$.
Gaussians are easy to differentiate, as we will see soon!
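These properties are easy to check empirically; a small sketch (illustrative only, using NumPy) that verifies the affine-transformation rule by sampling:

```python
import numpy as np

# Illustrative check: if X ~ N(mu, sigma^2) and Y = a*X + b,
# then Y ~ N(a*mu + b, a^2 * sigma^2).
rng = np.random.default_rng(0)
mu, sigma, a, b = 1.0, 2.0, 3.0, -1.0
x = rng.normal(mu, sigma, size=1_000_000)
y = a * x + b
print(y.mean(), a * mu + b)          # both close to 2.0
print(y.var(), a ** 2 * sigma ** 2)  # both close to 36.0
```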

Learning a Gaussian
Collect data: hopefully i.i.d. samples, e.g., exam scores. Learn the parameters: mean $\mu$ and variance $\sigma^2$.
i | Exam Score
0 | 85
1 | 95
2 | 100
3 | 12
... | ...
99 | 89

MLE for Gaussian
Probability of $N$ i.i.d. samples $D = x^{(1)}, \dots, x^{(N)}$:
$p(D \mid \mu, \sigma) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{N} \prod_{i=1}^{N} e^{-\frac{(x^{(i)} - \mu)^2}{2\sigma^2}}$
Log-likelihood of the data:
$\ln p(D \mid \mu, \sigma) = -\frac{N}{2}\ln(2\pi\sigma^2) - \sum_{i=1}^{N} \frac{(x^{(i)} - \mu)^2}{2\sigma^2}$

MLE for the Mean of a Gaussian
$\frac{\partial}{\partial\mu} \ln p(D \mid \mu, \sigma) = \frac{\partial}{\partial\mu}\left[-\frac{N}{2}\ln(2\pi\sigma^2) - \sum_{i=1}^{N} \frac{(x^{(i)} - \mu)^2}{2\sigma^2}\right] = \sum_{i=1}^{N} \frac{x^{(i)} - \mu}{\sigma^2} = \frac{\sum_{i=1}^{N} x^{(i)} - N\mu}{\sigma^2} = 0$
$\Rightarrow\ \mu_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)}$

MLE for the Variance of a Gaussian
$\frac{\partial}{\partial\sigma} \ln p(D \mid \mu, \sigma) = \frac{\partial}{\partial\sigma}\left[-\frac{N}{2}\ln(2\pi\sigma^2) - \sum_{i=1}^{N} \frac{(x^{(i)} - \mu)^2}{2\sigma^2}\right] = -\frac{N}{\sigma} + \sum_{i=1}^{N} \frac{(x^{(i)} - \mu)^2}{\sigma^3} = 0$
$\Rightarrow\ \sigma^2_{MLE} = \frac{1}{N}\sum_{i=1}^{N} \left(x^{(i)} - \mu_{MLE}\right)^2$

Learning Gaussian Parameters
$\mu_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)}, \qquad \sigma^2_{MLE} = \frac{1}{N}\sum_{i=1}^{N} \left(x^{(i)} - \mu_{MLE}\right)^2$
The MLE for the variance of a Gaussian is biased: the expected result of the estimation is not the true parameter!
Unbiased variance estimator: $\sigma^2_{unbiased} = \frac{1}{N-1}\sum_{i=1}^{N} \left(x^{(i)} - \mu_{MLE}\right)^2$
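A minimal sketch (not from the slides) that computes the MLE mean together with the biased and unbiased variance estimators, using a few of the exam scores above as example data:

```python
# Illustrative sketch: Gaussian MLE mean, MLE (biased) variance, and the
# unbiased variance estimator with the 1/(N - 1) correction.
def gaussian_mle(samples):
    n = len(samples)
    mu = sum(samples) / n
    var_mle = sum((x - mu) ** 2 for x in samples) / n
    var_unbiased = sum((x - mu) ** 2 for x in samples) / (n - 1)
    return mu, var_mle, var_unbiased

print(gaussian_mle([85, 95, 100, 12, 89]))
```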

Bayesian Categorization/Classification
Given features $x = (x_1, \dots, x_m)$, predict a label $y$.
If we had a joint distribution over $x$ and $y$, then given $x$ we could find the label using MAP inference: $\arg\max_y p(y \mid x_1, \dots, x_m)$.
We can compute this in exactly the same way that we did before, using Bayes' rule:
$p(y \mid x_1, \dots, x_m) = \frac{p(x_1, \dots, x_m \mid y)\, p(y)}{p(x_1, \dots, x_m)}$

Article Classification
Given a collection of news articles labeled by topic, the goal is, given an unseen news article, to predict its topic.
One possible feature vector: one feature for each word in the document, in order. Here $x_i$ corresponds to the $i$-th word, and $x_i$ can take a different value for each word in the dictionary.

Text Classification
$x_1, x_2, \dots$ is the sequence of words in the document.
The set of all possible features, and hence $p(y \mid x)$, is huge: an article has at least 1000 words, so $x = (x_1, \dots, x_{1000})$, and each $x_i$, the $i$-th word in the document, can be any word in a dictionary of at least 10,000 words.
That gives $10{,}000^{1000} = 10^{4000}$ possible values; there are only about $10^{80}$ atoms in the universe.

Bag of Words Model
Typically we assume that position in the document doesn't matter: $p(X_i = \text{"the"} \mid Y = y) = p(X_k = \text{"the"} \mid Y = y)$, i.e., all positions have the same distribution, which ignores the order of words.
This sounds like a bad assumption, but it often works well!
Features: the set of all possible words and their corresponding frequencies (the number of times each occurs in the document).

Bag of Words
aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, ..., gas 1, ..., oil 1, ..., Zaire 0
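A small illustrative sketch of turning a document into bag-of-words counts, ignoring word order (the example sentence is made up):

```python
from collections import Counter

# Illustrative: bag-of-words features are just word counts; position/order is ignored.
def bag_of_words(document):
    return Counter(document.lower().split())

print(bag_of_words("oil prices rose as Zaire oil exports fell"))
# Counter({'oil': 2, 'prices': 1, 'rose': 1, 'as': 1, 'zaire': 1, 'exports': 1, 'fell': 1})
```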

Need to Simplify Somehow
Even with the bag of words assumption, there are too many possible outcomes: too many probabilities $p(x_1, \dots, x_m \mid y)$.
Can we assume some of them are the same? $p(x_1, x_2 \mid y) = p(x_1 \mid y)\, p(x_2 \mid y)$
This is a conditional independence assumption.

Conditional Independence
$X$ is conditionally independent of $Y$ given $Z$ if the probability distribution of $X$ is independent of the value of $Y$, given the value of $Z$: $p(X \mid Y, Z) = p(X \mid Z)$.
This is equivalent to $p(X, Y \mid Z) = p(X \mid Z)\, p(Y \mid Z)$.

Naïve Bayes
The naïve Bayes assumption: features are independent given the class label,
$p(x_1, x_2 \mid y) = p(x_1 \mid y)\, p(x_2 \mid y)$
and, more generally,
$p(x_1, \dots, x_m \mid y) = \prod_{i=1}^{m} p(x_i \mid y)$
How many parameters now? If $x$ is composed of $d$ binary features, only $O(dL)$, where $L$ is the number of class labels.

The Naïve Bayes Classifier
Given: a prior $p(y)$; $m$ conditionally independent features $X_1, X_2, \dots, X_m$ given the class $Y$; and, for each $X_i$, a likelihood $p(X_i \mid Y)$.
Classify via
$y^* = h_{NB}(x) = \arg\max_y p(y)\, p(x_1, \dots, x_m \mid y) = \arg\max_y p(y) \prod_{i=1}^{m} p(x_i \mid y)$
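A hedged sketch of this classification rule, working with log-probabilities to avoid numerical underflow; the dictionary-based `prior` and `likelihood` inputs are assumptions made for illustration, not part of the slides:

```python
import math

# Illustrative sketch of h_NB(x) = argmax_y p(y) * prod_i p(x_i | y).
# `prior` maps y -> p(y); `likelihood` maps (i, x_i, y) -> p(X_i = x_i | Y = y).
def nb_classify(x, prior, likelihood):
    best_y, best_score = None, -math.inf
    for y, p_y in prior.items():
        score = math.log(p_y)
        for i, x_i in enumerate(x):
            p = likelihood.get((i, x_i, y), 0.0)  # unseen pairs get probability 0 here
            score += math.log(p) if p > 0 else -math.inf
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```

Working in log space is a standard choice: with hundreds of features, the raw product of probabilities would underflow to zero in floating point.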

MLE for the Parameters of NB
Given a dataset, count the occurrences of all pairs: $Count(X_i = x_i, Y = y)$ is the number of samples in which $X_i = x_i$ and $Y = y$.
MLE for discrete NB:
$p(Y = y) = \frac{Count(Y = y)}{\sum_{y'} Count(Y = y')}$
$p(X_i = x_i \mid Y = y) = \frac{Count(X_i = x_i, Y = y)}{\sum_{x_i'} Count(X_i = x_i', Y = y)}$
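A minimal sketch of these count-based estimates (the tuple-of-features data layout and the `nb_mle` name are assumptions for illustration); its output matches the `prior`/`likelihood` format used in the `nb_classify` sketch above:

```python
from collections import Counter

# Illustrative count-based MLE for discrete naive Bayes:
#   p(Y = y)             = Count(Y = y) / sum_y' Count(Y = y')
#   p(X_i = x_i | Y = y) = Count(X_i = x_i, Y = y) / sum_x' Count(X_i = x', Y = y)
def nb_mle(X, Y):
    label_counts = Counter(Y)
    n = len(Y)
    prior = {y: c / n for y, c in label_counts.items()}
    pair_counts = Counter((i, x_i, y) for x, y in zip(X, Y) for i, x_i in enumerate(x))
    # Each sample with label y contributes exactly one value of X_i,
    # so sum_x' Count(X_i = x', Y = y) equals Count(Y = y).
    likelihood = {(i, x_i, y): c / label_counts[y]
                  for (i, x_i, y), c in pair_counts.items()}
    return prior, likelihood

prior, likelihood = nb_mle(X=[(1, 0), (1, 1), (0, 0)], Y=["spam", "spam", "ham"])
print(prior)                       # {'spam': 0.666..., 'ham': 0.333...}
print(likelihood[(0, 1, "spam")])  # 1.0
```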

Naïve Bayes Calculations

Subtleties of the NB Classifier: #1
Usually, features are not conditionally independent:
$p(x_1, \dots, x_m \mid y) \neq \prod_{i=1}^{m} p(x_i \mid y)$
The naïve Bayes assumption is often violated, yet it performs surprisingly well in many cases.
A plausible reason: we only need the probability of the correct class to be the largest! For example, in binary classification we just need to end up on the correct side of 0.5, not to get the actual probability right (0.51 works as well as 0.99).

Subtleties of the NB Classifier: #2
What if you never see a training instance with $(X_1 = a, Y = b)$? For example, you did not see the word "Nigerian" in spam.
Then the MLE gives $p(X_1 = a \mid Y = b) = 0$, so no matter what values $X_2, \dots, X_m$ take,
$p(X_1 = a, X_2 = x_2, \dots, X_m = x_m \mid Y = b) = 0$
Why?
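The effect is easy to see numerically (a trivial illustrative sketch): one zero factor wipes out the whole product, no matter how strongly the remaining features support the class:

```python
# Illustrative: a single zero likelihood factor forces the entire product to zero.
factors = [0.0, 0.9, 0.95, 0.99]  # first factor plays the role of p(X_1 = a | Y = b) = 0
product = 1.0
for f in factors:
    product *= f
print(product)  # 0.0
```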

Subtleties of the NB Classifier: #2 (cont.)
To fix this, use a prior! We already saw how to do this in the coin-flipping example using the Beta distribution.
For NB over discrete spaces, we can use the Dirichlet prior. The Dirichlet distribution is a distribution over $z_1, \dots, z_k \in (0, 1)$ with $z_1 + \dots + z_k = 1$, characterized by $k$ parameters $\alpha_1, \dots, \alpha_k$:
$f(z_1, \dots, z_k; \alpha_1, \dots, \alpha_k) \propto \prod_{i=1}^{k} z_i^{\alpha_i - 1}$
This is called smoothing. What are the MAP estimates under these kinds of priors?
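In practice such priors amount to adding pseudo-counts, so no estimated probability is ever exactly zero. A hedged sketch of add-one (Laplace) smoothing, which corresponds to a symmetric Dirichlet prior, assuming the same counts as in the MLE slide and a known number of possible values for $X_i$:

```python
# Illustrative add-one (Laplace) smoothing for p(X_i = x_i | Y = y):
#   (Count(X_i = x_i, Y = y) + 1) / (Count(Y = y) + |range of X_i|)
# Unseen feature/label pairs get a small nonzero probability instead of 0.
def smoothed_likelihood(count_xy, count_y, num_values):
    return (count_xy + 1) / (count_y + num_values)

print(smoothed_likelihood(count_xy=0, count_y=2, num_values=2))  # 0.25 instead of 0.0
```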