CS249: ADVANCED DATA MINING

Size: px

Start display at page:

Download "CS249: ADVANCED DATA MINING"

Jasmine Susanna Taylor
5 years ago
Views:

1 CS249: ADVANCED DATA MINING Linear Regression, Logistic Regression, and GLMs Instructor: Yizhou Sun April 24, 2017

2 About WWW2017 Conference 2

3 Turing Award Winner Sir Tim Berners-Lee 3

4 Course Project Announcements Team formation due this Wednesday PTE will be decided by this weekend Office hour for TA (Jae LEE): 3-5 PM on

5 Methods to Learn Classification Clustering Vector Data Text Data Recommender System Decision Tree; Naïve Bayes; Logistic Regression SVM; NN K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means PLSA; LDA Matrix Factorization Graph & Network Label Propagation SCAN; Spectral Clustering Prediction Ranking Linear Regression GLM Collaborative Filtering PageRank Feature Representation Word embedding Network embedding 5

6 How to learn these algorithms? Three levels When it is applicable? Input, output, strengths, weaknesses, time complexity How it works? Pseudo-code, work flows, major steps Can work out a toy problem by pen and paper Why it works? Intuition, philosophy, objective, derivation, proof 6

7 Vector Data: Regression Vector Data Linear Regression Model Logistic Regression Model Generalized Linear Model Summary 7

8 Example A matrix of n pp: n data objects / points p attributes / dimensions x 11 x i1 x n1 x 1f x if x nf x 1p x ip x np 8

9 Numerical E.g., height, income Attribute Type Categorical / discrete E.g., Sex, Race 9

10 Categorical Attribute Types Nominal: categories, states, or names of things Hair_color = {auburn, black, blond, brown, grey, red, white} marital status, occupation, ID numbers, zip codes Binary Nominal attribute with only 2 states (0 and 1) Symmetric binary: both outcomes equally important e.g., gender Asymmetric binary: outcomes not equally important. e.g., medical test (positive vs. negative) Convention: assign 1 to most important outcome (e.g., HIV positive) Ordinal Values have a meaningful order (ranking) but magnitude between successive values is not known. Size = {small, medium, large}, grades, army rankings 10

11 Vector Data: Regression Vector Data Linear Regression Model Logistic Regression Model Generalized Linear Model Summary 11

12 Linear Regression Ordinary Least Square Regression Closed form solution Gradient descent Linear Regression with Probabilistic Interpretation 12

13 The Linear Regression Problem Any Attributes to Continuous Value: x y {age; major ; gender; race} GPA {income; credit score; profession} loan {college; major ; GPA} future income 13

14 Example of House Price Living Area (sqft) # of Beds Price (1000$) x y 14

15 Illustration 15

16 Formalization Data: n independent data objects yy ii, i = 1,, nn xx ii = xx iii, xx iii,, xx iiii T, i = 1,, nn A constant factor is added to model the bias term, i. e., xx iii = 1 T New x: xx ii = xx iii, xx iii, xx iii,, xx iiii Model: yy: dependent variable xx: explanatory variables ββ = ββ 0, ββ 1,, ββ pp TT : wwwwwwwwwww vvvvvvvvvvvv yy = xx TT ββ = ββ 0 + xx 1 ββ 1 + xx 2 ββ xx pp ββ pp 16

17 Model Construction A 3-step Process Use training data to find the best parameter ββ, denoted as ββ Model Selection Use validation data to select the best model E.g., Feature selection Model Usage Apply the model to the unseen data (test data): yy = xx TT ββ 17

18 Least Square Estimation Cost function (Total Square Error): JJ ββ = 1 2 ii xx ii TT ββ yy ii 2 Matrix form: JJ ββ = Xββ yy TT (XXββ yy)/2 1, 1, 1, x 11 x i1 x n1 or Xββ yy 2 /2 x 1f x if x nf x 1p x ip x np XX: nn pp + 11 matrix yy 1 yy ii yy nn y: nn 11 vvvvvvvvvvvv 18

19 Ordinary Least Squares (OLS) Goal: find ββ that minimizes JJ ββ JJ ββ = 1 2 Xββ yy TT XXββ yy = 1 2 (ββtt XX TT XXββ yy TT XXββ ββ TT XX TT yy + yy TT yy) Ordinary least squares Set first derivative of JJ ββ as 0 ββ = ββtt X TT X yy TT XX = 0 ββ = XX TT XX 1 XX TT yy 19

20 Gradient Descent Minimize the cost function by moving down in the steepest direction 20

21 Batch Gradient Descent Move in the direction of steepest descend Repeat until converge { } ββ (tt+1) :=ββ (t) ηη ββ ββ=ββ (t), e.g., ηη = 0.01 Where JJ ββ = 1 2 ii xx TT 2 ii ββ yy ii = ii JJ ii (ββ) and ββ = JJ ii ββ = xx ii (xx TT ii ββ yy ii ) ii ii 21

22 Stochastic Gradient Descent When a new observation, i, comes in, update weight immediately (extremely useful for largescale datasets): Repeat { for i=1:n { ββ (tt+1) :=ββ (t) + ηη(yy ii xx ii TT ββ (tt) )xx ii } } If the prediction for object i is smaller than the real value, ββ should move forward to the direction of xx ii 22

23 Other Practical Issues What if XX TT XX is not invertible? Add a small portion of identity matrix, λii, to it (ridge regression* ) What if some attributes are categorical? Set dummy variables E.g., xx = 1, iiii ssssss = FF; xx = 0, iiii ssssss = MM Nominal variable with multiple values? Create more dummy variables for one variable What if non-linear correlation exists? Transform features, say, xx to xx 2 23

24 Probabilistic Interpretation Review of normal distribution X~NN μμ, σσ 2 ff XX = xx = 1 2ππσσ 2 ee xx μμ 2 2σσ 2 24

25 Probabilistic Interpretation Model: yy ii = xx ii TT ββ + ε ii ε ii ~NN(0, σσ 2 ) yy ii xx ii, ββ~nn(xx ii TT ββ, σσ 2 ) EE yy ii xx ii, ββ = xx ii TT ββ Likelihood: LL ββ = ii pp yy ii xx ii, ββ) = ii 1 exp{ yy TT 2 ii xx ii ββ 2ππσσ 2 2σσ 2 } Maximum Likelihood Estimation find ββ that maximizes L ββ arg max LL = arg min JJ, Equivalent to OLS! 25

26 Vector Data: Regression Vector Data Linear Regression Model Logistic Regression Model Generalized Linear Model Summary 26

27 Linear Regression VS. Logistic Regression Linear Regression (prediction) Y: cccccccccccccccccccc vvvvvvvvvv, + Y = xx TT ββ = ββ 0 + xx 1 ββ 1 + xx 2 ββ xx pp ββ pp Y xx, ββ~nn(xx TT ββ, σσ 2 ) Logistic Regression (classification) Y: dddddddddddddddd vvvvvvvvvv ffffffff mm cccccccccccccc pp YY = CC jj [0,1] aaaaaa jj pp YY = CC jj = 1 27

28 Logistic Function Logistic Function / sigmoid function: σσ xx = 1 1+ee xx NNNNNNNN: σσ xx = σσ(xx)(1 σσ(xx)) 28

29 Modeling Probabilities of Two Classes PP YY = 1 XX, ββ = σσ XX TT ββ = 1 1+exp{ XX TT ββ} = exp{xxtt ββ} 1+exp{XX TT ββ} PP YY = 0 XX, ββ = 1 σσ XX TT ββ = exp{ XXTT ββ} 1+exp{ XX TT ββ} = 1 1+exp{XX TT ββ} ββ = ββ 0 ββ 1 ββ pp In other words YY X, ββ~bbeeeeeeeeeeeeeeee(σσ XX TT ββ ) 29

30 The 1-d Situation PP YY = 1 xx, ββ 0, ββ 1 = σσ ββ 1 xx + ββ , =0, =1 0 1 =0, =2 0 1 ) 1 =5, = Prob(Y=1 x, x 30

31 Example Q: What is ββ 0 here? 31

32 MLE estimation Parameter Estimation Given a dataset DD, wwwwwww nn dddddddd pppppppppppp For a single data object with attributes xx ii, class label yy ii Let pp ii = pp YY = 1 xx ii, ββ, ttttt pppppppp. oooo ii iiii cccccccccc 1 The probability of observing yy ii would be If yy ii = 1, ttttttt pp ii If yy ii = 0, ttttttt 1 pp ii yy Combing the two cases: pp ii ii 1 1 yy ppii ii LL = ii pp ii yy ii 1 ppii 1 yy ii = ii exp XX TT ββ 1+exp XX TT ββ yy ii 1 1+exp XX TT ββ 1 yy ii 32

33 Optimization Equivalent to maximize log likelihood LL = ii yy ii xx ii TT ββ log 1 + exp xx ii TT ββ Gradient ascent update: ββ nnnnnn = ββ oooooo + ηη (ββ) Newton-Raphson update Step size where derivatives at evaluated at ββ old 33

34 First Derivative pp(xx ii ; ββ) j = 0, 1,, p 34

35 Second Derivative It is a (p+1) by (p+1) matrix, Hessian Matrix, with jth row and nth column as 35

36 What about Multiclass Classification? It is easy to handle under logistic regression, say M classes PP YY = jj XX = 1,, MM 1 PP YY = MM XX = 1+ mm=1 exp{xx TT ββ jj } MM 1 exp{xx TT ββ mm }, for j = 1 MM 1 exp{xx TT ββ mm } 1+ mm=1 36

37 Vector Data: Regression Vector Data Linear Regression Model Logistic Regression Model Generalized Linear Model Summary 37

38 Recall Linear Regression and Logistic Regression Linear Regression y xx, ββ~nn(xx TT ββ, σσ 2 ) Logistic Regression y xx, ββ~bbbbbbbbbbbbbbbbbb(σσ xx TT ββ ) How about other distributions? Yes, as long as they belong to exponential family 38

39 Canonical Form Exponential Family pp yy; ηη = bb yy exp ηη TT TT yy aa ηη ηη: natural parameter TT yy : sufficient statistic aa ηη : log partition function for normalization bb yy : function that only dependent on yy 39

40 Examples of Exponential Family Many: pp yy; ηη = bb yy exp ηη TT TT yy aa ηη Gaussian, Bernoulli, Poisson, beta, Dirichlet, categorical, For Gaussian (not interested in σσ) For Bernoulli ηη 40

41 Recipe of GLMs Determines a distribution for y E.g., Gaussian, Bernoulli, Poisson Form the linear predictor for ηη ηη = xx TT ββ Determines a link function: μμ = gg 1 (ηη) Connects the linear predictor to the mean of the distribution E.g., μμ = ηη for Gaussian, μμ = σσ ηη for Bernoulli, μμ = eeeeee ηη for Poisson 41

42 Vector Data: Regression Vector Data Linear Regression Model Logistic Regression Model Generalized Linear Model Summary 42

43 What is vector data? Attribute types Linear regression Summary OLS, Probabilistic interpretation Logistic regression Sigmoid function, multiclass classification Generalized linear model Exponential family, link function 43

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING 3: Vector Data: Logistic Regression Instructor: Yizhou Sun yzsun@cs.ucla.edu October 9, 2017 Methods to Learn Vector Data Set Data Sequence Data Text Data Classification