Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag


Announcements Course TA: Hao Xiong Office hours: Friday 2pm-4pm in ECSS2.104A1 First homework due yesterday If you have questions about the perceptron algorithm or SVMs, stop by office hours Next homework out soon! 2

Supervised Learning Input: labeled training data, i.e., data plus desired output Assumption: there exists a function f that maps data items x to their correct labels Goal: construct an approximation to f 3

Today We've been focusing on linear separators Relatively easy to learn (using standard techniques) Easy to picture, but not clear if data will be separable Next two lectures we'll focus on other hypothesis spaces Decision trees Nearest neighbor classification 4

Application: Medical Diagnosis Suppose that you go to your doctor with flu-like symptoms How does your doctor determine if you have a flu that requires medical attention? 5

Application: Medical Diagnosis Suppose that you go to your doctor with flu-like symptoms How does your doctor determine if you have a flu that requires medical attention? Check a list of symptoms: Do you have a fever over 100.4 degrees Fahrenheit? Do you have a sore throat or a stuffy nose? Do you have a dry cough? 6

Application: Medical Diagnosis Just having some symptoms is not enough; you should also not have symptoms that are inconsistent with the flu For example, if you do not have a fever over 100.4 degrees Fahrenheit but you do have a sore throat or a stuffy nose, you probably do not have the flu (most likely just a cold) 7

Application: Medical Diagnosis In other words, your doctor will perform a series of tests and ask a series of questions in order to determine the likelihood of you having a severe case of the flu This is a method of coming to a diagnosis (i.e., a classification of your condition) We can view this decision making process as a tree 8

Decision Trees A tree in which each internal (non-leaf) node tests the value of a particular feature Each leaf node specifies a class label (in this case whether or not you should play tennis) 9

Decision Trees Features: (Outlook, Humidity, Wind) Classification is performed root to leaf The feature vector (Sunny, Normal, Strong) would be classified as a yes instance 10
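To make the root-to-leaf classification concrete, here is a minimal sketch of the play-tennis tree as nested conditionals. The branch structure (Outlook at the root, Humidity under Sunny, Wind under Rain) follows the standard textbook example and is assumed here; the slide's figure is not reproduced in the transcript.

```python
def play_tennis(outlook, humidity, wind):
    """Classify one (Outlook, Humidity, Wind) example by walking root to leaf.

    Assumes the standard play-tennis tree: Outlook at the root, Humidity
    tested under Sunny, Wind tested under Rain, Overcast always 'yes'.
    """
    if outlook == "Sunny":
        return "yes" if humidity == "Normal" else "no"
    elif outlook == "Overcast":
        return "yes"
    else:  # Rain
        return "yes" if wind == "Weak" else "no"

# The feature vector (Sunny, Normal, Strong) from the slide is a 'yes' instance.
print(play_tennis("Sunny", "Normal", "Strong"))  # -> yes
```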

Decision Trees Can have continuous features too Internal nodes for continuous features correspond to thresholds 11

Decision Trees Decision trees divide the feature space into axis-parallel rectangles (slides 12-17 build up a figure: the plane with axes x1 and x2, each running from 0 to 8, is recursively partitioned by the tests x2 > 4, x1 ≤ 6, x1 > 4, x2 > 5, and x2 > 2)

Decision Trees Worst case: a decision tree may require exponentially many nodes (figure: a labeling of points over x1, x2 ∈ {0, 1}) 18

Decision Tree Learning Basic decision tree building algorithm: Pick some feature/attribute Partition the data based on the value of this attribute Recurse over each new partition 19

Decision Tree Learning Basic decision tree building algorithm: Pick some feature/attribute (how to pick the best?) Partition the data based on the value of this attribute Recurse over each new partition (when to stop?) We'll focus on the discrete case first (i.e., each feature takes a value in some finite set) 20
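As a rough illustration of this recursion (not the lecture's code), here is a minimal Python sketch for discrete features. The attribute-selection rule is left as a placeholder, anticipating the information-gain criterion introduced later; the majority-vote behavior when attributes run out follows the "When to Stop" slide.

```python
from collections import Counter

def majority_label(labels):
    """Most common label in a (non-empty) list of labels."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(data, labels, attributes, choose_attribute):
    """Recursively build a decision tree over discrete attributes.

    data: list of dicts mapping attribute name -> value
    labels: list of class labels, parallel to data
    attributes: attribute names still available to split on
    choose_attribute: heuristic that picks the attribute to split on
    Returns either a class label (leaf) or (attribute, {value: subtree}).
    """
    if len(set(labels)) == 1:          # pure node: stop
        return labels[0]
    if not attributes:                 # out of attributes: majority vote
        return majority_label(labels)

    attr = choose_attribute(data, labels, attributes)
    remaining = [a for a in attributes if a != attr]
    children = {}
    for value in set(x[attr] for x in data):
        subset = [(x, y) for x, y in zip(data, labels) if x[attr] == value]
        sub_data, sub_labels = zip(*subset)
        children[value] = build_tree(list(sub_data), list(sub_labels),
                                     remaining, choose_attribute)
    return (attr, children)
```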

Decision Trees What functions can be represented by decision trees? Every function can be represented by a sufficiently complicated decision tree Are decision trees unique? 21

Decision Trees What functions can be represented by decision trees? Every function of +/- can be represented by a sufficiently complicated decision tree Are decision trees unique? No, many different decision trees are possible for the same set of labels 22

Choosing the Best Attribute Because the complexity of storage and classification increases with the size of the tree, we should prefer smaller trees Simpler models that explain the data are usually preferred over more complicated ones Finding the smallest tree is an NP-hard problem Instead, use a greedy, heuristic-based approach to pick the best attribute at each stage 23

Choosing the Best Attribute x1, x2 ∈ {0,1} Which attribute should you split on? Splitting on x1: the x1 = 1 branch has y = +: 0, y = -: 4 and the x1 = 0 branch has y = +: 3, y = -: 1 Splitting on x2: the x2 = 1 branch has y = +: 1, y = -: 3 and the x2 = 0 branch has y = +: 2, y = -: 2 (figure: the two candidate splits over an 8-example table of (x1, x2, y) values) 24

Choosing the Best Attribute x1, x2 ∈ {0,1} Which attribute should you split on? (same splits and counts as the previous slide) Can think of these counts as probability distributions over the labels: if x1 = 1, the probability that y = - is equal to 1 25

Choosing the Best Attribute The selected attribute is a good split if we are more certain about the classification after the split If each partition with respect to the chosen attribute has a distinct class label, we are completely certain about the classification after partitioning If the class labels are evenly divided between the partitions, the split isn't very good (we are very uncertain about the label for each partition) What about other situations? How do you measure the uncertainty of a random process? 26

Discrete Probability Sample space specifies the set of possible outcomes For example, Ω = {H, T} would be the set of possible outcomes of a coin flip Each element ω ∈ Ω is associated with a number p(ω) ∈ [0, 1] called a probability, with Σ_{ω ∈ Ω} p(ω) = 1 For example, a biased coin might have p(H) = .6 and p(T) = .4 27

Discrete Probability An event is a subset of the sample space Let Ω = {1, 2, 3, 4, 5, 6} be the 6 possible outcomes of a die roll A = {1, 5, 6} ⊆ Ω would be the event that the die roll comes up as a one, five, or six The probability of an event is just the sum of the probabilities of the outcomes that it contains: p(A) = p(1) + p(5) + p(6) 28

Independence Two events A and B are independent if p(A ∩ B) = p(A)p(B) Let's suppose that we have a fair die: p(1) = ... = p(6) = 1/6 If A = {1, 2, 5} and B = {3, 4, 6}, are A and B independent? (figure: the two disjoint events A and B) 29

Independence Two events A and B are independent if p(A ∩ B) = p(A)p(B) Let's suppose that we have a fair die: p(1) = ... = p(6) = 1/6 If A = {1, 2, 5} and B = {3, 4, 6}, are A and B independent? No! p(A ∩ B) = 0 ≠ 1/4 = p(A)p(B) 30

Independence Now, suppose that Ω = {(1,1), (1,2), ..., (6,6)} is the set of all possible rolls of two unbiased dice Let A = {(1,1), (1,2), (1,3), ..., (1,6)} be the event that the first die is a one and let B = {(1,6), (2,6), ..., (6,6)} be the event that the second die is a six Are A and B independent? (figure: the two events, overlapping only at (1,6)) 31

Independence Now, suppose that Ω = {(1,1), (1,2), ..., (6,6)} is the set of all possible rolls of two unbiased dice Let A = {(1,1), (1,2), (1,3), ..., (1,6)} be the event that the first die is a one and let B = {(1,6), (2,6), ..., (6,6)} be the event that the second die is a six Are A and B independent? Yes! p(A ∩ B) = 1/36 = (1/6)(1/6) = p(A)p(B) 32
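These two examples are easy to check by brute-force enumeration. The short script below (an illustrative sketch, not from the lecture) verifies that the single-die events A = {1, 2, 5} and B = {3, 4, 6} are not independent, while the two-dice events "first die is a one" and "second die is a six" are.

```python
from fractions import Fraction
from itertools import product

# Single fair die: A = {1, 2, 5}, B = {3, 4, 6}
omega = range(1, 7)
p = Fraction(1, 6)                      # uniform probability of each outcome
A, B = {1, 2, 5}, {3, 4, 6}
pA, pB = len(A) * p, len(B) * p
pAB = len(A & B) * p
print(pAB, pA * pB, pAB == pA * pB)     # 0, 1/4, False -> not independent

# Two fair dice: A = first die is a one, B = second die is a six
omega2 = list(product(omega, omega))
q = Fraction(1, len(omega2))            # 1/36 per outcome
A2 = {w for w in omega2 if w[0] == 1}
B2 = {w for w in omega2 if w[1] == 6}
pA2, pB2 = len(A2) * q, len(B2) * q
pA2B2 = len(A2 & B2) * q
print(pA2B2, pA2 * pB2, pA2B2 == pA2 * pB2)  # 1/36, 1/36, True -> independent
```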

Conditional Probability The conditional probability of an event A given an event B with p(B) > 0 is defined to be p(A | B) = p(A ∩ B) / p(B) This is the probability of the event A ∩ B over the sample space Ω' = B Some properties: Σ_{ω ∈ Ω} p(ω | B) = 1 If A and B are independent, then p(A | B) = p(A) 33

Discrete Random Variables A discrete random variable, X, is a function from the state space Ω into a discrete space D For each x ∈ D, p(X = x) = p({ω ∈ Ω : X(ω) = x}) is the probability that X takes the value x p(X) defines a probability distribution: Σ_{x ∈ D} p(X = x) = 1 Random variables partition the state space into disjoint events 34

Example: Pair of Dice Let Ω be the set of all possible outcomes of rolling a pair of dice Let p be the uniform probability distribution over all possible outcomes in Ω Let X(ω) be equal to the sum of the values showing on the pair of dice in the outcome ω p(X = 2) = ? p(X = 8) = ? 35

Example: Pair of Dice Let Ω be the set of all possible outcomes of rolling a pair of dice Let p be the uniform probability distribution over all possible outcomes in Ω Let X(ω) be equal to the sum of the values showing on the pair of dice in the outcome ω p(X = 2) = 1/36 p(X = 8) = ? 36

Example: Pair of Dice Let Ω be the set of all possible outcomes of rolling a pair of dice Let p be the uniform probability distribution over all possible outcomes in Ω Let X(ω) be equal to the sum of the values showing on the pair of dice in the outcome ω p(X = 2) = 1/36 p(X = 8) = 5/36 37
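These two values can be checked by enumerating all 36 outcomes; the snippet below (illustrative, not from the slides) counts how many outcomes of a pair of dice sum to each value.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # all 36 rolls of a pair of dice

def p_sum(s):
    """Probability that the sum of the two dice equals s (uniform outcomes)."""
    return Fraction(sum(1 for a, b in outcomes if a + b == s), len(outcomes))

print(p_sum(2))   # 1/36  (only (1,1))
print(p_sum(8))   # 5/36  ((2,6), (3,5), (4,4), (5,3), (6,2))
```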

Discrete Random Variables We can have vectors of random variables as well: X(ω) = [X_1(ω), ..., X_n(ω)] The joint distribution p(X_1 = x_1, ..., X_n = x_n) is typically written as p(X_1 = x_1 ∩ ... ∩ X_n = x_n) or p(x_1, ..., x_n) Because X_i = x_i is an event, all of the same rules from basic probability apply 38

Entropy A standard way to measure the uncertainty of a random variable is to use the entropy H(Y) = -Σ_y p(Y = y) log p(Y = y) Entropy is maximized for uniform distributions Entropy is minimized for distributions that place all their probability on a single outcome 39
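A small sketch of the entropy computation (illustrative; the lecture does not fix a code convention). It uses base-2 logarithms and the standard convention 0 log 0 = 0, and it reproduces the two extremes mentioned on the slide.

```python
import math

def entropy(dist):
    """Entropy (in bits) of a discrete distribution given as a list of probabilities.

    Uses the convention 0 * log(0) = 0.
    """
    return -sum(p * math.log2(p) for p in dist if p > 0)

print(entropy([0.5, 0.5]))        # 1.0  -> uniform distribution maximizes entropy
print(entropy([0.25] * 4))        # 2.0
print(entropy([1.0, 0.0, 0.0]))   # 0.0  -> all mass on one outcome minimizes it
```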

Entropy of a Coin Flip (figure: H(X) plotted as a function of p, where X = outcome of a coin flip with probability of heads p) 40

Conditional Entropy We can also compute the entropy of a random variable conditioned on a different random variable H(Y | X) = -Σ_x p(X = x) Σ_y p(Y = y | X = x) log p(Y = y | X = x) This is called the conditional entropy This is the amount of information needed to quantify the random variable Y given the random variable X 41

Information Gain Using entropy to measure uncertainty, we can greedily select an attribute that guarantees the largest expected decrease in entropy (with respect to the empirical partitions) Called the information gain IG(X) = H(Y) - H(Y | X) Larger information gain corresponds to less uncertainty about Y given X Note that H(Y | X) ≤ H(Y) 42
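As a self-contained sketch (again illustrative, using empirical frequencies from labeled data), the information gain IG(X) = H(Y) - H(Y | X) can be computed as follows.

```python
import math
from collections import Counter

def entropy_of_labels(labels):
    """Empirical entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """IG(X) = H(Y) - H(Y | X) for one attribute.

    values: the attribute value x_i of each example
    labels: the class label y_i of each example (parallel list)
    """
    n = len(labels)
    h_y = entropy_of_labels(labels)
    h_y_given_x = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        h_y_given_x += (len(subset) / n) * entropy_of_labels(subset)
    return h_y - h_y_given_x

# Toy usage: an attribute that perfectly determines the label has maximal gain.
print(information_gain([0, 0, 1, 1], ['-', '-', '+', '+']))  # 1.0
print(information_gain([0, 1, 0, 1], ['-', '-', '+', '+']))  # 0.0
```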

Decision Tree Learning Basic decision tree building algorithm: Pick the feature/attribute with the highest information gain Partition the data based on the value of this attribute Recurse over each new partition 43

Choosing the Best Attribute x1, x2 ∈ {0,1} Which attribute should you split on? What is the information gain in each case? (same splits and counts as before: splitting on x1 gives branches with y = +: 0, y = -: 4 and y = +: 3, y = -: 1; splitting on x2 gives branches with y = +: 1, y = -: 3 and y = +: 2, y = -: 2) 44

Choosing the Best Attribute (same splits, counts, and data table as the previous slide) H(Y) = -(5/8) log(5/8) - (3/8) log(3/8) H(Y | X1) = .5 (-0 log 0 - 1 log 1) + .5 (-.75 log .75 - .25 log .25) H(Y | X2) = .5 (-.5 log .5 - .5 log .5) + .5 (-.75 log .75 - .25 log .25) [H(Y) - H(Y | X1)] - [H(Y) - H(Y | X2)] = H(Y | X2) - H(Y | X1) = -.5 log .5 > 0 Should split on x1 45
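A quick numeric check of this comparison, using only the per-branch label counts shown on the slide ((0, 4) and (3, 1) for x1; (1, 3) and (2, 2) for x2); this is an illustrative verification, not the lecture's own code.

```python
import math

def entropy_from_counts(counts):
    """Entropy (in bits) of the label distribution given raw counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def cond_entropy(branches):
    """H(Y | X) from a list of per-branch label-count tuples."""
    n = sum(sum(b) for b in branches)
    return sum((sum(b) / n) * entropy_from_counts(b) for b in branches)

h_y = entropy_from_counts([3, 5])                 # 3 of one label, 5 of the other
ig_x1 = h_y - cond_entropy([(0, 4), (3, 1)])      # split on x1
ig_x2 = h_y - cond_entropy([(1, 3), (2, 2)])      # split on x2
print(ig_x1, ig_x2, ig_x1 > ig_x2)                # ~0.549, ~0.049, True -> split on x1
```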

When to Stop If the current set is pure (i.e., has a single label in the output), stop If you run out of attributes to recurse on, even if the current data set isn't pure, stop and use a majority vote If a partition contains no data points, there is nothing to recurse on; use the majority vote at its parent in the tree For fixed-depth decision trees, the final label is determined by majority vote 46
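These stopping rules slot directly into the recursion sketched earlier; a minimal helper (illustrative names, not the lecture's code) might look like this.

```python
from collections import Counter

def stopping_label(labels, attributes, parent_labels=None, max_depth=None, depth=0):
    """Return a leaf label if a stopping rule fires, else None (keep recursing).

    Rules from the slide: empty partition -> parent majority vote; pure set ->
    its single label; no attributes left or depth limit hit -> majority vote.
    """
    if not labels:                                   # empty partition
        return Counter(parent_labels).most_common(1)[0][0]
    if len(set(labels)) == 1:                        # pure
        return labels[0]
    if not attributes or (max_depth is not None and depth >= max_depth):
        return Counter(labels).most_common(1)[0][0]  # majority vote
    return None
```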

Handling Real-Valued Attributes For continuous attributes, use threshold splits Split the tree into x_k < t and x_k ≥ t Can split on the same attribute multiple times on the same path down the tree How to pick the threshold t? 47

Handling Real-Valued Attributes For continuous attributes, use threshold splits Split the tree into x_k < t and x_k ≥ t Can split on the same attribute multiple times on the same path down the tree How to pick the threshold t? Try every possible t How many possible t are there? 48

Handling Real-Valued Attributes Sort the data according to the k-th attribute: z_1 > z_2 > ... > z_n Only a finite number of thresholds make sense (figure: the sorted values z on a number line with a candidate threshold t) 49

Handling Real-Valued Attributes Sort the data according to the k-th attribute: z_1 > z_2 > ... > z_n Only a finite number of thresholds make sense Just split in between each consecutive pair of data points (e.g., splits of the form t = (z_i + z_{i+1}) / 2) 50

Handling Real-Valued Attributes Sort the data according to the k-th attribute: z_1 > z_2 > ... > z_n Only a finite number of thresholds make sense Just split in between each consecutive pair of data points (e.g., splits of the form t = (z_i + z_{i+1}) / 2) Does it make sense for a threshold to appear between two x's with the same class label? 51
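A small sketch of the candidate-threshold enumeration (illustrative, not the lecture's code). The optional filtering of same-label neighbors reflects the question on the slide, which hints that such midpoints need not be considered.

```python
def candidate_thresholds(values, labels, skip_same_label=True):
    """Candidate split thresholds for one real-valued attribute.

    Sorts the (value, label) pairs and returns the midpoint between each pair of
    consecutive distinct values; if skip_same_label is True, midpoints between two
    examples with the same class label are dropped (per the slide's question).
    """
    pairs = sorted(zip(values, labels))
    thresholds = []
    for (z1, y1), (z2, y2) in zip(pairs, pairs[1:]):
        if z1 == z2:
            continue
        if skip_same_label and y1 == y2:
            continue
        thresholds.append((z1 + z2) / 2)
    return thresholds

print(candidate_thresholds([1.0, 2.0, 3.0, 4.0], ['-', '-', '+', '+']))  # [2.5]
```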

Handling Real-Valued Attributes Compute the information gain of each threshold Let X:t denote splitting with threshold t and compute H(Y | X:t) = -p(X < t) Σ_y p(Y = y | X < t) log p(Y = y | X < t) - p(X ≥ t) Σ_y p(Y = y | X ≥ t) log p(Y = y | X ≥ t) In the learning algorithm, maximize over all attributes and all possible thresholds of the real-valued attributes: max_t H(Y) - H(Y | X:t) for real-valued X, and H(Y) - H(Y | X) for discrete X 52
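Putting the last few slides together, here is an illustrative sketch that scores one real-valued attribute by maximizing the information gain over its candidate thresholds; it reuses the empirical-entropy idea from the earlier information-gain sketch.

```python
import math
from collections import Counter

def entropy_of_labels(labels):
    """Empirical entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Best split (max IG(X:t)) of one real-valued attribute.

    Candidate thresholds are midpoints between consecutive sorted values;
    returns (best_t, best_gain).
    """
    n = len(labels)
    h_y = entropy_of_labels(labels)
    pairs = sorted(zip(values, labels))
    best_t, best_gain = None, -1.0
    for (z1, _), (z2, _) in zip(pairs, pairs[1:]):
        if z1 == z2:
            continue
        t = (z1 + z2) / 2
        left = [y for x, y in zip(values, labels) if x < t]
        right = [y for x, y in zip(values, labels) if x >= t]
        h_y_given_split = (len(left) / n) * entropy_of_labels(left) \
                        + (len(right) / n) * entropy_of_labels(right)
        gain = h_y - h_y_given_split
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

print(best_threshold([1.0, 2.0, 3.0, 4.0], ['-', '-', '+', '+']))  # (2.5, 1.0)
```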

Decision Trees Because of speed/ease of implementation, decision trees are quite popular Can be used for regression too Decision trees will always overfit! It is always possible to obtain zero training error on the input data with a deep enough tree (if there is no noise in the labels) Solution? 53