Logistic Regression. Hongning Wang
1 Logistic Regression Hongning Wang
2 Today's lecture
- Logistic regression model: a discriminative classification model
- Two different perspectives to derive the model
- Parameter estimation
3 Review: Bayes risk minimization
- Risk: assigning an instance to the wrong class
- y* = argmax_y P(y|X)
- (Figure: class densities p(X, y=0) and p(X, y=1) over X; the optimal Bayes decision boundary lies where P(X|y=0)P(y=0) = P(X|y=1)P(y=1), and the density on the wrong side of it gives the false negative and false positive errors.)
- We have learned multiple ways to estimate these distributions
4 Instance-based solution: k nearest neighbors
- Approximate the Bayes decision rule using a subset of data around the testing point
5 Instance-based solution: k nearest neighbors
- Approximate the Bayes decision rule using a subset of data around the testing point
- Let V be the volume of the m-dimensional ball around x that contains its k nearest neighbors. Then p(x) V = k / N, so p(x) = k / (N V), where N is the total number of instances.
- Similarly, p(x|y=1) = k_1 / (N_1 V), where N_1 is the total number of instances in class 1 and k_1 counts the nearest neighbors that come from class 1; and p(y=1) = N_1 / N.
- With Bayes' rule: p(y=1|x) = p(x|y=1) p(y=1) / p(x) = [k_1 / (N_1 V)] [N_1 / N] / [k / (N V)] = k_1 / k
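The estimate above reduces to counting labels among the neighbors. A minimal numpy sketch, assuming Euclidean distance; the arrays X_train, y_train and the query point are illustrative toy data:

```python
# Minimal sketch: estimate p(y=1|x) as k_1 / k from the k nearest neighbors.
import numpy as np

def knn_posterior(x, X_train, y_train, k=5):
    """Fraction of the k nearest training points (Euclidean distance) labeled 1."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    return np.mean(y_train[nearest] == 1)         # k_1 / k

# toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_posterior(np.array([0.95, 1.05]), X_train, y_train, k=3))
```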
6 Generative solution: Naïve Bayes classifier
- y* = argmax_y P(y|X)
     = argmax_y P(X|y) P(y)                    (by Bayes' rule)
     = argmax_y P(y) ∏_{i=1}^d P(x_i|y)        (by the conditional independence assumption)
- Graphical model: y -> x_1, x_2, x_3, ..., x_v
7 Estimating parameters
- Maximum likelihood estimator:
  P(x_i|y) = Σ_d Σ_j δ(x_j^d = x_i, y^d = y) / Σ_d δ(y^d = y)
  P(y) = Σ_d δ(y^d = y) / Σ_d 1
- (Example table: counts of terms such as "text", "information", "identify", "mining", "mined", "is", "useful", "to", "from", "apple", "delicious" across documents D with labels Y.)
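A small sketch of this kind of maximum-likelihood counting, instantiated here as a multinomial model over words (the toy corpus and labels, and the choice to normalize by the total word count of each class, are assumptions for illustration; no smoothing is applied):

```python
# Sketch: MLE by counting for a multinomial Naïve Bayes text model.
from collections import Counter, defaultdict

docs = [["text", "mining", "is", "useful"],
        ["apple", "is", "delicious"],
        ["identify", "information", "from", "text"]]
labels = [1, 0, 1]   # toy labels: 1 = about text mining, 0 = not

# P(y) = (# documents with label y) / (# documents)
label_counts = Counter(labels)
p_y = {y: c / len(labels) for y, c in label_counts.items()}

# P(x_i|y) = (count of word x_i in class-y documents) / (total word count in class y)
word_counts = defaultdict(Counter)
for doc, y in zip(docs, labels):
    word_counts[y].update(doc)
p_x_given_y = {y: {w: c / sum(cnt.values()) for w, c in cnt.items()}
               for y, cnt in word_counts.items()}

print(p_y)                          # {1: 0.667, 0: 0.333}
print(p_x_given_y[1].get("text"))   # 2 / 8 = 0.25
```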
8 Discriminative vs. generative models
- Generative: all instances are considered for probability density estimation
- Discriminative: directly model y = f(x); more attention is put on the boundary points
9 Parametric form of decision boundary
- For the binary case in Naïve Bayes:
  f(X) = sgn(log P(y=1|X) - log P(y=0|X))
       = sgn(log [P(y=1)/P(y=0)] + Σ_{i=1}^V c(x_i, d) log [P(x_i|y=1)/P(x_i|y=0)])
       = sgn(w^T X)
  where w = (log P(y=1)/P(y=0), log P(x_1|y=1)/P(x_1|y=0), ..., log P(x_V|y=1)/P(x_V|y=0))
  and X = (1, c(x_1, d), ..., c(x_V, d))
- Linear regression?
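A sketch of this reduction from Naïve Bayes parameters to a linear score sgn(w^T X); the vocabulary, probabilities, and word counts below are made-up toy values:

```python
# Sketch: build w from (toy) Naïve Bayes probabilities and classify by sign(w^T X).
import numpy as np

vocab = ["good", "bad", "movie"]
p_w_pos = np.array([0.5, 0.1, 0.4])    # toy P(x_i | y=1)
p_w_neg = np.array([0.1, 0.6, 0.3])    # toy P(x_i | y=0)
prior_pos, prior_neg = 0.5, 0.5

w = np.concatenate(([np.log(prior_pos / prior_neg)],
                    np.log(p_w_pos / p_w_neg)))   # bias + per-word log-odds

counts = np.array([2, 0, 1])                      # c(x_i, d): word counts in a document
X = np.concatenate(([1.0], counts))               # prepend the constant feature
y_hat = int(w @ X > 0)                            # sgn(w^T X)
print(w, y_hat)
```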
10 Regression for classification?
- Linear regression: y ≈ w^T X
- Models the relationship between a scalar dependent variable y and one or more explanatory variables
11 Regression for classification?
- Linear regression: y ≈ w^T X
- Models the relationship between a scalar dependent variable y and one or more explanatory variables
- But y is discrete in a classification problem! E.g., predict ŷ = 1 iff w^T X > 0.5
- What if we have an outlier? (Figure: the optimal regression fit, and hence the 0.5 threshold, shifts when an outlier is added.)
12 Regression for classification?
- Logistic regression: p(y|x) = σ(w^T X) = 1 / (1 + exp(-w^T X))
- Directly models the class posterior
- Sigmoid function (Figure: P(y|x) as an S-shaped curve in x)
- What if we have an outlier?
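A minimal sketch of this posterior, assuming the first feature column is the constant 1 (bias) and the weights are toy values:

```python
# Sketch: the logistic (sigmoid) posterior p(y=1|X) = 1 / (1 + exp(-w^T X)).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, X):
    """p(y=1|X) for a feature vector, or row-wise for a matrix of instances."""
    return sigmoid(X @ w)

w = np.array([0.5, -1.0, 2.0])           # toy weights
X = np.array([[1.0, 0.3, 0.8],           # rows are instances; first column is the bias feature
              [1.0, 2.0, 0.1]])
print(predict_proba(w, X))
```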
13 Logistic regression for classification
- Why the sigmoid function?
  P(y=1|X) = P(X|y=1) P(y=1) / [P(X|y=1) P(y=1) + P(X|y=0) P(y=0)]
           = 1 / [1 + P(X|y=0) P(y=0) / (P(X|y=1) P(y=1))]
- Assumptions: y is Binomial with P(y=1) = α; P(X|y=0) = N(μ_0, σ²) and P(X|y=1) = N(μ_1, σ²), i.e., Normal with identical variance
- (Figure: the resulting posterior P(y|x) is an S-shaped curve in x.)
14 Logistic regression for classification
- Why the sigmoid function?
  P(y=1|X) = P(X|y=1) P(y=1) / [P(X|y=1) P(y=1) + P(X|y=0) P(y=0)]
           = 1 / [1 + P(X|y=0) P(y=0) / (P(X|y=1) P(y=1))]
           = 1 / [1 + exp(-ln [P(X|y=1) P(y=1) / (P(X|y=0) P(y=0))])]
15 Logistic regression for classification
- Why the sigmoid function?
  ln [P(X|y=1) P(y=1) / (P(X|y=0) P(y=0))]
    = ln [P(y=1)/P(y=0)] + Σ_{i=1}^V ln [P(x_i|y=1)/P(x_i|y=0)]
  With P(x_i|y) = (1 / (σ_i √(2π))) exp(-(x_i - μ_{yi})² / (2σ_i²)):
    = ln [α/(1-α)] + Σ_{i=1}^V [ (μ_{1i} - μ_{0i}) x_i / σ_i² + (μ_{0i}² - μ_{1i}²) / (2σ_i²) ]
    = w_0 + Σ_{i=1}^V [(μ_{1i} - μ_{0i}) / σ_i²] x_i
    = w^T X
- Origin of the name: the logit function
16 Logistic regression for classification
- Why the sigmoid function?
  P(y=1|X) = P(X|y=1) P(y=1) / [P(X|y=1) P(y=1) + P(X|y=0) P(y=0)]
           = 1 / [1 + exp(-ln [P(X|y=1) P(y=1) / (P(X|y=0) P(y=0))])]
           = 1 / [1 + exp(-w^T X)]
- A Generalized Linear Model
- Note: it is still a linear relation among the features!
17 Logistic regression for classification
- For multi-class categorization:
  P(y=k|X) = exp(w_k^T X) / Σ_{j=1}^K exp(w_j^T X), i.e., P(y=k|X) ∝ exp(w_k^T X)
- Warning: redundancy in the model parameters, since Σ_{j=1}^K P(y=j|X) = 1
- When K = 2:
  P(y=1|X) = exp(w_1^T X) / [exp(w_1^T X) + exp(w_0^T X)] = 1 / [1 + exp(-(w_1 - w_0)^T X)]
  so only w = w_1 - w_0 matters
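A sketch of the multi-class posterior; the weight matrix and feature vector are toy values, and the max-subtraction is a standard numerical-stability trick rather than something from the slides:

```python
# Sketch: p(y=k|X) = exp(w_k^T X) / sum_j exp(w_j^T X)  (softmax over class scores).
import numpy as np

def softmax_posterior(W, x):
    """W: K x d weight matrix (one row w_k per class); x: d-dim feature vector."""
    scores = W @ x
    scores -= scores.max()        # subtract the max score for numerical stability
    e = np.exp(scores)
    return e / e.sum()

W = np.array([[0.2, 1.0, -0.5],
              [1.5, -0.3, 0.0],
              [-1.0, 0.4, 0.9]])  # toy weights for K = 3 classes
x = np.array([1.0, 0.5, 2.0])
print(softmax_posterior(W, x))    # sums to 1
```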
18 Logistic regression for classification
- Decision boundary for the binary case:
  ŷ = 1 if p(y=1|X) > 0.5, and 0 otherwise
  p(y=1|X) > 0.5  iff  exp(-w^T X) < 1  iff  w^T X > 0
  so ŷ = 1 iff w^T X > 0, and 0 otherwise
- A linear model!
19 Logistic regression for classification
- Decision boundary in general:
  ŷ = argmax_y p(y|X) = argmax_y exp(w_y^T X) = argmax_y w_y^T X
- A linear model!
20 Logistic regression for classification
- Summary:
  P(y=1|X) = P(X|y=1) P(y=1) / [P(X|y=1) P(y=1) + P(X|y=0) P(y=0)]
           = 1 / [1 + P(X|y=0) P(y=0) / (P(X|y=1) P(y=1))]
- Assumptions: y is Binomial with P(y=1) = α; P(X|y=0) = N(μ_0, σ²) and P(X|y=1) = N(μ_1, σ²), Normal with identical variance
- (Figure: the resulting posterior P(y|x) as a sigmoid in x.)
21 A different perspective
- Imagine we have the following observation: documents containing {happy, good, purchase, item, indeed} have positive sentiment, i.e.,
  p(x="happy", y=1) + p(x="good", y=1) + p(x="purchase", y=1) + p(x="item", y=1) + p(x="indeed", y=1) = 1
- Question: find a distribution p(x, y) that satisfies this observation.
- Answer 1: p(x="item", y=1) = 1, and all the others 0
- Answer 2: p(x="indeed", y=1) = 0.5, p(x="good", y=1) = 0.5, and all the others 0
- We have too little information to favor either one of them.
22 Occam's razor
- A problem-solving principle: among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected. (William of Ockham)
- Principle of Insufficient Reason: "when one has no information to distinguish between the probability of two events, the best strategy is to consider them equally likely." (Pierre-Simon Laplace)
23 A different perspective
- Same observation as before:
  p(x="happy", y=1) + p(x="good", y=1) + p(x="purchase", y=1) + p(x="item", y=1) + p(x="indeed", y=1) = 1
- Question: find a distribution p(x, y) that satisfies this observation.
- As a result, a safer choice would be: p(x=·, y=1) = 0.2 for each of the five words
- Equally favor every possibility
24 A different perspective
- Now two observations: documents with {happy, good, purchase, item, indeed} are positive, and 30% of the time {good, item} are positive:
  p(x="happy", y=1) + p(x="good", y=1) + p(x="purchase", y=1) + p(x="item", y=1) + p(x="indeed", y=1) = 1
  p(x="good", y=1) + p(x="item", y=1) = 0.3
- Question: find a distribution p(x, y) that satisfies these observations.
- Again, a safer choice would be: p(x="good", y=1) = p(x="item", y=1) = 0.15, and each of the others 7/30
- Equally favor every possibility
25 A different perspective
- Now three observations: {happy, good, purchase, item, indeed} are positive; 30% of the time {good, item} are positive; 50% of the time {good, happy} are positive:
  p(x="happy", y=1) + p(x="good", y=1) + p(x="purchase", y=1) + p(x="item", y=1) + p(x="indeed", y=1) = 1
  p(x="good", y=1) + p(x="item", y=1) = 0.3
  p(x="good", y=1) + p(x="happy", y=1) = 0.5
- Question: find a distribution p(x, y) that satisfies these observations.
- Time to think about: 1) what do we mean by equally/uniformly favoring the models? 2) given all these constraints, how could we find the most preferred model?
26 Maximum entropy modeling
- A measure of the uncertainty of random events: H(X) = E[I(X)] = -Σ_{x∈X} P(x) log P(x)
- Maximized when P(X) is the uniform distribution
- Question 1 is answered; then how about question 2?
27 Represent the constraints
- Indicator function, e.g., to express the observation that the word "good" occurs in a positive document:
  f(x, y) = 1 if y = 1 and x = "good", and 0 otherwise
- Usually referred to as a feature function
28 Represent the constraints
- Empirical expectation of a feature function over a corpus:
  E~[f] = Σ_{x,y} p~(x, y) f(x, y), where p~(x, y) = c(f(x, y)) / N,
  i.e., the frequency of observing f(x, y) in the given collection
- Expectation of the feature function under a given statistical model:
  E_p[f] = Σ_{x,y} p~(x) p(y|x) f(x, y),
  where p~(x) is the empirical distribution of x in the same collection and p(y|x) is the model's estimate of the conditional distribution
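A sketch of both expectations for the single feature f(x, y) = 1 iff x = "good" and y = 1; the tiny (x, y) corpus and the conditional model p(y=1|x) below are made-up values for illustration:

```python
# Sketch: empirical vs. model expectation of one feature function.
from collections import Counter

pairs = [("good", 1), ("good", 1), ("item", 1), ("bad", 0), ("good", 0)]   # toy (x, y) corpus
N = len(pairs)

def f(x, y):
    return 1.0 if x == "good" and y == 1 else 0.0

# Empirical expectation: sum over (x, y) of p~(x, y) f(x, y)
emp = sum(f(x, y) for x, y in pairs) / N

# Model expectation: sum over x of p~(x) * sum over y of p(y|x) f(x, y),
# using a toy conditional model p(y=1|x).
p_y1_given_x = {"good": 0.7, "item": 0.6, "bad": 0.2}
p_x = {x: c / N for x, c in Counter(x for x, _ in pairs).items()}
model = sum(p_x[x] * (p_y1_given_x[x] * f(x, 1) + (1 - p_y1_given_x[x]) * f(x, 0))
            for x in p_x)

# The maxent constraint would require these two to match; this toy model is not yet fitted.
print(emp, model)
```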
29 Represent the constraints
- When a feature is important, we require our preferred statistical model to accord with it:
  C = { p ∈ P | E_p[f_i] = E~[f_i], i ∈ {1, 2, ..., n} }
  E_p[f_i] = E~[f_i]  ⟺  Σ_{x,y} p~(x) p(y|x) f_i(x, y) = Σ_{x,y} p~(x, y) f_i(x, y)
- We only need to specify p(y|x) in our preferred model!
- Is question 2 answered?
30 Represent the constraints
- Let's visualize this (figure with four panels): (a) no constraint, (b) under-constrained, (c) feasible constraint, (d) over-constrained
- How to deal with these situations?
31 Maximum entropy principle
- To select a model from a set C of allowed probability distributions, choose the model p* ∈ C with maximum entropy H(p):
  p* = argmax_{p ∈ C} H(p(y|x))
- Both questions are answered!
32 Maximum entropy principle
- Let's solve this constrained optimization problem with Lagrange multipliers: a strategy for finding the local maxima and minima of a function subject to equality constraints
- Primal: p* = argmax_{p ∈ C} H(p)
- Lagrangian: L(p, λ) = H(p) + Σ_i λ_i (E_p[f_i] - E~[f_i])
33 Maximum entropy principle
- Let's solve this constrained optimization problem with Lagrange multipliers
- Lagrangian: L(p, λ) = H(p) + Σ_i λ_i (E_p[f_i] - E~[f_i])
- Dual: p_λ(y|x) = (1/Z_λ(x)) exp(Σ_i λ_i f_i(x, y))
  Ψ(λ) = -Σ_x p~(x) log Z_λ(x) + Σ_i λ_i E~[f_i]
34 Maximum entropy principle
- Let's solve this constrained optimization problem with Lagrange multipliers
- Dual: Ψ(λ) = -Σ_x p~(x) log Z_λ(x) + Σ_i λ_i E~[f_i]
  where Z_λ(x) = Σ_y exp(Σ_i λ_i f_i(x, y))
35 Maximum entropy principle
- Let's take a close look at the dual function:
  Ψ(λ) = -Σ_x p~(x) log Z_λ(x) + Σ_i λ_i E~[f_i]
  where Z_λ(x) = Σ_y exp(Σ_i λ_i f_i(x, y))
36 Maximum entropy principle
- Let's take a close look at the dual function:
  Ψ(λ) = -Σ_x p~(x) log Z_λ(x) + Σ_i λ_i E~[f_i]
       = Σ_{x,y} p~(x, y) [ Σ_i λ_i f_i(x, y) - log Z_λ(x) ]
       = Σ_{x,y} p~(x, y) log [ exp(Σ_i λ_i f_i(x, y)) / Z_λ(x) ]
       = Σ_{x,y} p~(x, y) log p_λ(y|x)
- The maximum likelihood estimator (the conditional log-likelihood of the training data)!
37 Maximum entropy principle
- Primal: maximum entropy, p* = argmax_{p ∈ C} H(p)
- Dual: logistic regression, p_λ(y|x) = (1/Z_λ(x)) exp(Σ_i λ_i f_i(x, y)),
  where Z_λ(x) = Σ_y exp(Σ_i λ_i f_i(x, y))
- λ is determined by maximizing Ψ(λ)
38 Questions that haven't been answered
- Class-conditional density: why should it be Gaussian with equal variance?
- Model parameters: what is the relationship between w and λ? How to estimate them?
39 Maximum entropy principle
- The maximum entropy model subject to the constraints C has a parametric solution p_λ(y|x), where the parameters λ can be determined by maximizing the likelihood of p_λ(y|x) over a training set.
- For a given variance, differential entropy is maximized by a Gaussian distribution.
- Features follow Gaussian distributions -> maximum entropy model -> logistic regression
40 Parameter estimation
- Maximum likelihood estimation:
  L(w) = Σ_{d∈D} [ y_d log p(y_d=1|X_d) + (1 - y_d) log p(y_d=0|X_d) ]
- Take the gradient of L(w) with respect to w:
  ∂L(w)/∂w = Σ_{d∈D} ∂/∂w [ y_d log p(y_d=1|X_d) + (1 - y_d) log p(y_d=0|X_d) ]
41 Parameter estimation
- Maximum likelihood estimation:
  ∂ log p(y_d=1|X_d) / ∂w = -∂ log(1 + exp(-w^T X_d)) / ∂w
                          = [exp(-w^T X_d) / (1 + exp(-w^T X_d))] X_d
                          = (1 - p(y_d=1|X_d)) X_d
  ∂ log p(y_d=0|X_d) / ∂w = (0 - p(y_d=1|X_d)) X_d
42 Parameter estimation
- Maximum likelihood estimation:
  L(w) = Σ_{d∈D} [ y_d log p(y_d=1|X_d) + (1 - y_d) log p(y_d=0|X_d) ]
- Take the gradient of L(w) with respect to w:
  ∂L(w)/∂w = Σ_{d∈D} [ y_d (1 - p(y_d=1|X_d)) X_d + (1 - y_d)(0 - p(y_d=1|X_d)) X_d ]
           = Σ_{d∈D} (y_d - p(y=1|X_d)) X_d
- L(w) is a concave function of w
- Good news: a neat form that can be easily generalized to the multi-class case
- Bad news: no closed-form solution
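A sketch of the log-likelihood and its gradient Σ_d (y_d - p(y=1|X_d)) X_d in numpy; the design matrix, labels, and starting point are toy values:

```python
# Sketch: log-likelihood L(w) and its gradient for binary logistic regression.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    p = sigmoid(X @ w)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(w, X, y):
    """X: N x d design matrix, y: length-N 0/1 labels; returns sum_d (y_d - p_d) X_d."""
    return X.T @ (y - sigmoid(X @ w))

# toy data: first column is the bias feature
X = np.array([[1.0, 0.2], [1.0, 1.5], [1.0, -0.7]])
y = np.array([0, 1, 0])
w = np.zeros(2)
print(log_likelihood(w, X, y), gradient(w, X, y))
```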
43 Gradient-based optimization
- Gradient:
  ∇_w L(w) = [ ∂L(w)/∂w_0, ∂L(w)/∂w_1, ..., ∂L(w)/∂w_V ]
- Iterative updating (ascent on the log-likelihood L(w), i.e., gradient descent on -L(w)):
  w^(t+1) = w^(t) + η^(t) ∇_w L(w^(t))
- The step size η^(t) affects convergence
44 Parameter estimation
- Stochastic gradient descent:
  while not converged:
    randomly choose d ∈ D
    ∇_w L_d(w) = [ ∂L_d(w)/∂w_0, ∂L_d(w)/∂w_1, ..., ∂L_d(w)/∂w_V ]
    w^(t+1) = w^(t) + η^(t) ∇_w L_d(w)
    η^(t+1) = a η^(t)   (gradually shrink the step size)
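A sketch of this loop as stochastic gradient ascent on L(w), one instance per step with a geometrically shrinking step size; the data, step size, shrink factor, and step count are illustrative choices:

```python
# Sketch: stochastic gradient updates for binary logistic regression.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, eta=0.1, shrink=0.999, n_steps=5000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        d = rng.integers(len(y))                    # randomly choose one training instance
        grad_d = (y[d] - sigmoid(X[d] @ w)) * X[d]  # per-instance gradient of L_d(w)
        w += eta * grad_d                           # ascent step (we maximize L(w))
        eta *= shrink                               # gradually shrink the step size
    return w

X = np.array([[1.0, 0.2], [1.0, 1.5], [1.0, -0.7], [1.0, 2.2]])
y = np.array([0, 1, 0, 1])
print(sgd_logistic(X, y))
```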
45 Parameter estimation
- Batch gradient descent:
  while not converged:
    compute the gradient w.r.t. all training instances:
      ∇_w L_D(w) = [ ∂L_D(w)/∂w_0, ∂L_D(w)/∂w_1, ..., ∂L_D(w)/∂w_V ]
    compute the step size η^(t)   (line search is required to ensure sufficient descent)
    w^(t+1) = w^(t) + η^(t) ∇_w L_D(w)
- This is a first-order method; second-order methods, e.g., quasi-Newton methods and conjugate gradient, provide faster convergence
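Where the slide points to quasi-Newton methods, here is a sketch that hands the negative log-likelihood and its gradient to scipy's L-BFGS solver; the toy data are chosen non-separable so the maximum likelihood estimate is finite:

```python
# Sketch: full-batch fitting with a quasi-Newton solver (L-BFGS via scipy),
# minimizing the negative log-likelihood (equivalent to maximizing L(w)).
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(w, X, y):
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def neg_gradient(w, X, y):
    return -X.T @ (y - sigmoid(X @ w))

X = np.array([[1.0, 0.2], [1.0, 1.5], [1.0, 0.3], [1.0, 2.2]])
y = np.array([0.0, 1.0, 1.0, 0.0])       # not linearly separable, so the MLE is finite
res = minimize(neg_log_likelihood, np.zeros(2), args=(X, y),
               jac=neg_gradient, method="L-BFGS-B")
print(res.x)
```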
46 Model regularization
- Avoid over-fitting: we may not have enough samples to estimate the model parameters of logistic regression well
- Regularization: impose additional constraints over the model parameters
- E.g., a sparsity constraint enforces the model to have more zero parameters
47 Model regularization
- L2-regularized logistic regression
- Assume the model parameter w is drawn from a Gaussian prior: w ~ N(0, σ²)
  p(y_d, w | X_d) ∝ p(y_d | X_d, w) p(w)
  L(w) = Σ_{d∈D} [ y_d log p(y_d=1|X_d) + (1 - y_d) log p(y_d=0|X_d) ] - w^T w / (2σ²)
- The penalty term is the (squared) L2-norm of w
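A sketch of the L2-regularized (MAP) objective and its gradient, which simply adds a -w/σ² term to the earlier gradient; the data and σ² are toy values:

```python
# Sketch: L2-regularized log-likelihood and its gradient.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2_objective(w, X, y, sigma2=1.0):
    p = sigmoid(X @ w)
    # log-likelihood minus the Gaussian-prior penalty w^T w / (2 sigma^2)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) - w @ w / (2 * sigma2)

def l2_gradient(w, X, y, sigma2=1.0):
    # gradient of the penalized objective: sum_d (y_d - p_d) X_d  -  w / sigma^2
    return X.T @ (y - sigmoid(X @ w)) - w / sigma2

X = np.array([[1.0, 0.2], [1.0, 1.5], [1.0, -0.7], [1.0, 2.2]])
y = np.array([0, 1, 0, 1])
w = np.ones(2)
print(l2_objective(w, X, y), l2_gradient(w, X, y))
```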
48 Generative vs. discriminative models
- Generative: specifies the joint distribution; a full probabilistic specification of all the random variables; dependence assumptions have to be specified for p(X|y) and p(y); flexible, can also be used in unsupervised learning
- Discriminative: specifies the conditional distribution; only explains the target variable; arbitrary features can be incorporated for modeling p(y|X); needs labeled data, so only suitable for (semi-)supervised learning
49 Naïve Bayes vs. logistic regression
- Naïve Bayes: conditional independence assumption, p(X|y) = ∏_i p(x_i|y); a distributional assumption on each p(x_i|y); k(V+1) parameters; model estimation by closed-form MLE; asymptotic convergence ε_{NB,n} ≤ ε_{NB,∞} + O(√(log V / n))
- Logistic regression: no independence assumption; a functional-form assumption, p(y|X) ∝ exp(w_y^T X); (k-1)(V+1) parameters; model estimation by gradient-based MLE; asymptotic convergence ε_{LR,n} ≤ ε_{LR,∞} + O(√(V / n)), i.e., it needs more training data
50 Naïve Bayes vs. logistic regression
- (Figure: learning curves of LR and NB on UCI data sets, from "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes", Ng & Jordan, NIPS 2002.)
51 What you should know
- Two different derivations of logistic regression:
  - Functional form from Naïve Bayes assumptions: p(x|y) follows an equal-variance Gaussian, giving the sigmoid function
  - Maximum entropy principle: primal/dual optimization
- Generalization to the multi-class case
- Parameter estimation: gradient-based optimization; regularization
- Comparison with Naïve Bayes
52 Today's reading
- Speech and Language Processing, Chapter 6: Hidden Markov and Maximum Entropy Models
  - 6.6 Maximum entropy models: background
  - 6.7 Maximum entropy modeling