Special Topics: Data Science
1 Special Topics: Data Science. Linear Methods for Prediction.
Dr. Vidhyasaharan Sethu, School of Electrical Engineering & Telecommunications, University of New South Wales, Sydney, Australia. V. Sethu 1
2 Topics
1. Linear Regression
2. Regularisation
3. Bayesian View of Linear Regression
4. Classification Systems
5. Discriminant Functions
6. Logistic Regression
References:
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning. New York, NY: Springer Series in Statistics.
Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern Classification. John Wiley & Sons.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
3 Linear Regression
Input data $\mathbf{x} = (x_1, \dots, x_p)^\top$ is mapped by the machine $h_\theta$ to a prediction $\hat{y} \in \mathbb{R}$ about some quantity of interest:
$\hat{y} = \beta_0 + \sum_{i=1}^{p} \beta_i x_i$
Parameters of the model: $\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_p)^\top$
Residual sum of squares: $\mathrm{RSS}(\boldsymbol{\beta}) = \sum_j (y_j - \hat{y}_j)^2$
The least squares estimate is the $\boldsymbol{\beta}$ that corresponds to the minimum RSS. (Friedman et al., 2001)
4 Least Squares Linear Regression Model
Given a dataset $D = \{(\mathbf{x}_j, y_j)\}_{j=1}^{N}$, the residual sum of squares is
$\mathrm{RSS}(\boldsymbol{\beta}) = \sum_{j=1}^{N} (y_j - \hat{y}_j)^2 = \sum_{j=1}^{N} \left( y_j - \beta_0 - \sum_{i=1}^{p} x_{ji}\beta_i \right)^2$
In matrix notation: $\mathrm{RSS}(\boldsymbol{\beta}) = (\mathbf{y} - X\boldsymbol{\beta})^\top (\mathbf{y} - X\boldsymbol{\beta})$
where $\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_p)^\top$, $\mathbf{y} = (y_1, \dots, y_N)^\top$, and $X$ is the $N \times (p+1)$ matrix whose $j$-th row is $(1, x_{j1}, \dots, x_{jp})$.
5 Least Squares Linear Regression Model
For the least squares solution, setting $\frac{\partial \mathrm{RSS}}{\partial \boldsymbol{\beta}} = -2X^\top(\mathbf{y} - X\boldsymbol{\beta}) = \mathbf{0}$ gives
$\hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}$
Noting that $\frac{\partial^2 \mathrm{RSS}}{\partial \boldsymbol{\beta}\,\partial \boldsymbol{\beta}^\top} = 2X^\top X$ is positive definite if $X$ is of full rank, this stationary point is a minimum.
Hat matrix: $\hat{\mathbf{y}} = X\hat{\boldsymbol{\beta}} = X(X^\top X)^{-1}X^\top \mathbf{y} = H\mathbf{y}$; geometrically, $H$ projects $\mathbf{y}$ onto the space spanned by the columns of $X$.
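The closed-form least squares solution and the hat matrix can be sketched in NumPy on synthetic data (the true coefficients and noise level below are illustrative choices, not from the lecture):

```python
import numpy as np

# Synthetic regression data: N points, p features, plus an intercept column.
rng = np.random.default_rng(0)
N, p = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # first column of 1s for beta_0
beta_true = np.array([1.0, 2.0, -0.5, 0.3])                 # made-up ground truth
y = X @ beta_true + 0.01 * rng.normal(size=N)

# Least squares estimate beta_hat = (X^T X)^{-1} X^T y.
# lstsq is numerically safer than explicitly inverting X^T X.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Hat matrix H = X (X^T X)^{-1} X^T maps y onto the fitted values y_hat = H y.
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_fit = H @ y
```

Since $H$ is the projection onto the column space of $X$, `H @ y` and `X @ beta_hat` agree up to numerical precision.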
6 Sequential (On-line) Learning
The least squares solution $\hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}$ may be hard to compute in a big data setting:
- Very large $N$: computationally expensive.
- The data may not be available all at once (e.g., real-time applications where data is continuously streaming).
Consider $\mathrm{RSS}(\boldsymbol{\beta}) = \sum_j E_j$, where $E_j = \left( y_j - \beta_0 - \sum_{i=1}^{p} x_{ji}\beta_i \right)^2 = (y_j - \boldsymbol{\beta}^\top \mathbf{x}_j)^2$.
Iterative estimation of $\boldsymbol{\beta}$ can then be carried out one data point at a time.
Stochastic gradient descent: $\boldsymbol{\beta}^{(\tau+1)} = \boldsymbol{\beta}^{(\tau)} - \eta \frac{\partial E_j}{\partial \boldsymbol{\beta}}$
For the least squares case this is the LMS (Least Mean Squares) algorithm: $\boldsymbol{\beta}^{(\tau+1)} = \boldsymbol{\beta}^{(\tau)} + \eta \left( y_j - \boldsymbol{\beta}^{(\tau)\top} \mathbf{x}_j \right) \mathbf{x}_j$ (the factor of 2 from the gradient is absorbed into the step size $\eta$).
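A minimal sketch of the LMS update, sweeping over synthetic data one point at a time (step size, epoch count, and the data itself are illustrative assumptions):

```python
import numpy as np

# Synthetic data for a linear model with intercept.
rng = np.random.default_rng(1)
N, p = 500, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([0.5, 1.5, -1.0])                     # made-up ground truth
y = X @ beta_true + 0.01 * rng.normal(size=N)

# LMS: beta <- beta + eta * (y_j - beta^T x_j) * x_j, one point per update.
beta = np.zeros(p + 1)
eta = 0.01
for epoch in range(50):
    for j in range(N):
        beta = beta + eta * (y[j] - beta @ X[j]) * X[j]
```

With a small fixed step size the iterate fluctuates in a small neighbourhood of the least squares solution rather than converging exactly; shrinking $\eta$ over time removes this residual jitter.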
7 Note on Gradient Descent
The negative of the gradient points in the direction of steepest descent on the surface of the cost function. Stochastic gradient descent uses noisy gradient estimates computed from single data points (or small batches of data). Noisy gradients may be beneficial when negotiating complex cost surfaces.
8 Regularisation
Potential issues with the least squares estimate $\hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}$, $\hat{\mathbf{y}} = X\hat{\boldsymbol{\beta}}$:
- Low bias but may have high variance (prediction accuracy may suffer).
- It may be desirable to determine a subset of features that exhibit the strongest effects.
Ridge regression:
$\hat{\boldsymbol{\beta}}^{\mathrm{ridge}} = \arg\min_{\boldsymbol{\beta}} \sum_{j=1}^{N} \left( y_j - \beta_0 - \sum_{i=1}^{p} x_{ji}\beta_i \right)^2 + \lambda \sum_{i=1}^{p} \beta_i^2$
The second term is the regularisation term ($L_2$ regularisation). Note: $\beta_0$ is not included in the penalty.
Equivalent constrained form:
$\hat{\boldsymbol{\beta}}^{\mathrm{ridge}} = \arg\min_{\boldsymbol{\beta}} \sum_{j=1}^{N} \left( y_j - \beta_0 - \sum_{i=1}^{p} x_{ji}\beta_i \right)^2$ subject to $\sum_{i=1}^{p} \beta_i^2 \le t$
There is a one-to-one correspondence between $\lambda$ and $t$.
9 Regularisation
$\hat{\boldsymbol{\beta}}^{\mathrm{ridge}} = \arg\min_{\boldsymbol{\beta}} \sum_{j=1}^{N} \left( y_j - \beta_0 - \sum_{i=1}^{p} x_{ji}\beta_i \right)^2 + \lambda \sum_{i=1}^{p} \beta_i^2$
- Solutions are not equivariant under scaling of the inputs.
- Inclusion of $\beta_0$ in the penalty would make the solution depend on the origin chosen for $y$; i.e., adding a constant $c$ to all $y_j$ would not simply result in the predictions being offset by the same amount $c$.
The solution can be separated into two parts after centring the inputs, $x_{ji} \leftarrow x_{ji} - \bar{x}_i$:
- Estimate $\beta_0 = \bar{y} = \frac{1}{N}\sum_j y_j$.
- Estimate $\beta_1, \dots, \beta_p$ by ridge regression without an intercept:
$\hat{\boldsymbol{\beta}}^{\mathrm{ridge}} = \arg\min_{\boldsymbol{\beta}} (\mathbf{y} - X\boldsymbol{\beta})^\top (\mathbf{y} - X\boldsymbol{\beta}) + \lambda \boldsymbol{\beta}^\top \boldsymbol{\beta}$
where now $\boldsymbol{\beta} = (\beta_1, \dots, \beta_p)^\top$ and $X$ is the $N \times p$ matrix of centred inputs.
10 Ridge Regression
$\hat{\boldsymbol{\beta}}^{\mathrm{ridge}} = \arg\min_{\boldsymbol{\beta}} (\mathbf{y} - X\boldsymbol{\beta})^\top (\mathbf{y} - X\boldsymbol{\beta}) + \lambda \boldsymbol{\beta}^\top \boldsymbol{\beta}$
$\hat{\boldsymbol{\beta}}^{\mathrm{ridge}} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$
Adding $\lambda I$ makes the problem non-singular even if $X^\top X$ is not of full rank.
Comparing least squares ($\hat{\boldsymbol{\beta}}^{\mathrm{ls}}$) and ridge regression ($\hat{\boldsymbol{\beta}}^{\mathrm{ridge}}$) using the singular value decomposition $X = UDV^\top$:
$X\hat{\boldsymbol{\beta}}^{\mathrm{ls}} = X(X^\top X)^{-1}X^\top \mathbf{y} = UU^\top \mathbf{y}$
$X\hat{\boldsymbol{\beta}}^{\mathrm{ridge}} = X(X^\top X + \lambda I)^{-1}X^\top \mathbf{y} = UD(D^2 + \lambda I)^{-1}DU^\top \mathbf{y} = \sum_{i=1}^{p} \mathbf{u}_i \frac{d_i^2}{d_i^2 + \lambda} \mathbf{u}_i^\top \mathbf{y}$
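The closed-form ridge estimate on centred data can be sketched as follows ($\lambda$ and the data are illustrative assumptions):

```python
import numpy as np

# Synthetic centred data (ridge is applied after centring, with no intercept).
rng = np.random.default_rng(2)
N, p = 100, 4
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)                    # centre the inputs
beta_true = np.array([2.0, 0.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)
y = y - y.mean()                          # centring: beta_0 = y-bar, here 0

# Ridge: beta_ridge = (X^T X + lambda I)^{-1} X^T y, vs. ordinary least squares.
lam = 5.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
```

Via the SVD $X = UDV^\top$, the ridge fit equals $\sum_i \mathbf{u}_i \frac{d_i^2}{d_i^2+\lambda} \mathbf{u}_i^\top \mathbf{y}$, and every component is scaled by a factor less than one, so the ridge coefficient vector is strictly shorter than the least squares one.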
11 Ridge Regression
$X\hat{\boldsymbol{\beta}}^{\mathrm{ridge}} = \sum_{i=1}^{p} \mathbf{u}_i \frac{d_i^2}{d_i^2 + \lambda} \mathbf{u}_i^\top \mathbf{y}$
Ridge shrinks the directions of least variance the most.
Note: $X^\top X = VD^2V^\top$ (from the singular value decomposition $X = UDV^\top$), and the sample covariance is $S = \frac{1}{N} X^\top X$. Therefore the $\mathbf{v}_i$ (columns of $V$) are its eigenvectors. Also, $X\mathbf{v}_i = \mathbf{u}_i d_i$ (from $X = UDV^\top$). (Friedman et al., 2001)
12 Tuning the Hyperparameter $\lambda$
Effective degrees of freedom: $\mathrm{df}(\lambda) = \sum_{i=1}^{p} \frac{d_i^2}{d_i^2 + \lambda}$ (Friedman et al., 2001)
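The effective degrees of freedom follow directly from the singular values of $X$; a small sketch on made-up data:

```python
import numpy as np

# df(lambda) = sum_i d_i^2 / (d_i^2 + lambda), from the singular values of X.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))
X = X - X.mean(axis=0)
d = np.linalg.svd(X, compute_uv=False)    # singular values only

def df(lam):
    return np.sum(d**2 / (d**2 + lam))

# df(0) = p (ordinary least squares); df(lambda) -> 0 as lambda -> infinity,
# so df gives an interpretable scale on which to tune lambda.
```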
13 $L_1$ Regularisation (Lasso)
$\hat{\boldsymbol{\beta}}^{\mathrm{lasso}} = \arg\min_{\boldsymbol{\beta}} \sum_{j=1}^{N} \left( y_j - \beta_0 - \sum_{i=1}^{p} x_{ji}\beta_i \right)^2 + \lambda \sum_{i=1}^{p} |\beta_i|$
The second term is the $L_1$ regularisation term.
Equivalent constrained form: $\hat{\boldsymbol{\beta}}^{\mathrm{lasso}} = \arg\min_{\boldsymbol{\beta}} \sum_{j=1}^{N} \left( y_j - \beta_0 - \sum_{i=1}^{p} x_{ji}\beta_i \right)^2$ subject to $\sum_{i=1}^{p} |\beta_i| \le t$
- Encourages sparsity: some coefficients will be exactly zero.
- Makes the solution non-linear in $\mathbf{y}$: no closed-form solution.
- It is a quadratic programming problem, but efficient algorithms exist.
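One of the simple efficient algorithms is coordinate descent with soft-thresholding. A minimal sketch, assuming centred data and the objective $\frac{1}{2}\lVert \mathbf{y} - X\boldsymbol{\beta}\rVert^2 + \lambda \lVert\boldsymbol{\beta}\rVert_1$ (note the $\frac{1}{2}$, so $\lambda$ here is scaled differently from the slide's objective); the data and $\lambda$ are illustrative:

```python
import numpy as np

def soft_threshold(z, t):
    # S(z, t) = sign(z) * max(|z| - t, 0): the closed-form 1-D lasso solution.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Cyclic coordinate descent: optimise one beta_k at a time, others fixed.
    N, p = X.shape
    beta = np.zeros(p)
    col_sq = (X**2).sum(axis=0)
    for _ in range(n_iter):
        for k in range(p):
            r_k = y - X @ beta + X[:, k] * beta[k]   # partial residual excluding k
            beta[k] = soft_threshold(X[:, k] @ r_k, lam) / col_sq[k]
    return beta

# Synthetic sparse problem: only coefficients 0 and 3 are truly nonzero.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)
beta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=100)
y = y - y.mean()
beta_lasso = lasso_cd(X, y, lam=50.0)
```

With a sufficiently large $\lambda$, the irrelevant coefficients come out exactly zero while the active ones are shrunk towards zero, which is the feature selection behaviour the slide describes.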
14 Feature Selection with Lasso
Figure: lasso coefficient profiles plotted against the shrinkage factor $s = t / \sum_{i=1}^{p} |\hat{\beta}_i|$ (Friedman et al., 2001).
15 $L_1$ and $L_2$ Regularisation (Sparsity)
Figure: contours of the residual sum of squares together with the constraint regions $|\beta_1| + |\beta_2| \le t$ (lasso) and $\beta_1^2 + \beta_2^2 \le t^2$ (ridge); the corners of the $L_1$ region make solutions with some $\beta_i = 0$ likely (Friedman et al., 2001).
16 Bayesian View
Assuming the error model $y = \hat{y} + \varepsilon = \boldsymbol{\beta}^\top \mathbf{x} + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, we can write
$P(y \mid \mathbf{x}, \boldsymbol{\beta}, \sigma^2) = \mathcal{N}(y \mid \boldsymbol{\beta}^\top \mathbf{x}, \sigma^2)$
In the Bayesian framework the parameters $\boldsymbol{\beta}$ are treated as random variables instead of fixed but unknown parameters.
In terms of the dataset $D = \{(\mathbf{x}_j, y_j)\}_{j=1}^{N}$, the likelihood function is
$\ell(\boldsymbol{\beta}) = P(\mathbf{y} \mid X, \boldsymbol{\beta}, \sigma^2) = \prod_{j=1}^{N} \mathcal{N}(y_j \mid \boldsymbol{\beta}^\top \mathbf{x}_j, \sigma^2)$
Taking the logarithm,
$\ln P(\mathbf{y} \mid X, \boldsymbol{\beta}, \sigma^2) = -\frac{N}{2}\ln \sigma^2 - \frac{N}{2}\ln 2\pi - \frac{1}{2\sigma^2}\sum_{j=1}^{N} \left( y_j - \boldsymbol{\beta}^\top \mathbf{x}_j \right)^2$
The maximum likelihood (ML) solution is therefore equivalent to the least squares solution.
17 Bayesian View (of $L_2$ Regularisation)
Bayes' theorem states (Posterior = Likelihood $\times$ Prior / Evidence):
$P(\boldsymbol{\beta} \mid X, \mathbf{y}, \sigma^2) = \frac{P(\mathbf{y} \mid \boldsymbol{\beta}, X, \sigma^2)\, P(\boldsymbol{\beta})}{P(\mathbf{y} \mid X, \sigma^2)}$
If we assume a Gaussian prior for the parameters $\boldsymbol{\beta}$, conditional on some hyperparameter $\alpha$,
$P(\boldsymbol{\beta}) = P(\boldsymbol{\beta} \mid \alpha) = \mathcal{N}(\boldsymbol{\beta} \mid \mathbf{0}, \alpha^{-1} I)$
we obtain $P(\boldsymbol{\beta} \mid X, \mathbf{y}, \sigma^2) = \mathcal{N}(\boldsymbol{\beta} \mid \mathbf{m}, S)$, where
$S^{-1} = \alpha I + \frac{1}{\sigma^2} X^\top X$ and $\mathbf{m} = \frac{1}{\sigma^2} S X^\top \mathbf{y}$
Consequently the log posterior is given by
$\ln P(\boldsymbol{\beta} \mid X, \mathbf{y}, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{j=1}^{N} \left( y_j - \boldsymbol{\beta}^\top \mathbf{x}_j \right)^2 - \frac{\alpha}{2} \boldsymbol{\beta}^\top \boldsymbol{\beta} + \text{const}$
The maximum a posteriori (MAP) estimate is therefore equivalent to ridge regression.
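The posterior mean and covariance can be computed directly from these formulas; a sketch with assumed-known $\alpha$ and $\sigma$ (all values illustrative):

```python
import numpy as np

# Synthetic data for the error model y = beta^T x + eps, eps ~ N(0, sigma^2).
rng = np.random.default_rng(5)
N, p = 200, 3
X = rng.normal(size=(N, p))
beta_true = np.array([1.0, -2.0, 0.5])
sigma = 0.1
y = X @ beta_true + sigma * rng.normal(size=N)

# Posterior N(beta | m, S): S^{-1} = alpha I + X^T X / sigma^2, m = S X^T y / sigma^2.
alpha = 2.0
S_inv = alpha * np.eye(p) + (X.T @ X) / sigma**2
S = np.linalg.inv(S_inv)
m = S @ (X.T @ y) / sigma**2

# The MAP estimate (= posterior mean here) matches ridge with lambda = alpha * sigma^2.
ridge = np.linalg.solve(X.T @ X + alpha * sigma**2 * np.eye(p), X.T @ y)
```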
18 Visualising Bayesian Learning
Figure: sequential Bayesian learning of the parameters of a linear model, showing how the posterior over $\boldsymbol{\beta}$ sharpens as successive data points are observed (Bishop, 2006).
19 Classification
Input data $\mathbf{x} = (x_1, \dots, x_p)^\top \in \mathbb{R}^p$ is mapped by the machine $h_\theta$ to a prediction $y \in \{\omega_1, \dots, \omega_c\}$ about some quantity of interest (the class).
The Bayesian decision theory picture:
$P(\omega_j \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid \omega_j)\, P(\omega_j)}{P(\mathbf{x})}$
Defining a loss function $\xi$ such that $\xi(\alpha_i \mid \omega_j)$ describes the loss incurred by estimating the class as $\omega_i$ when the true class was $\omega_j$ (note: here $\alpha_i$ is used to denote $y = \omega_i$) gives us the expected loss associated with $\alpha_i$ as
$R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \xi(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x})$
20 Bayesian Decision Theory
For any general decision rule $\alpha(\mathbf{x})$, where $\alpha(\mathbf{x})$ assumes one of the values $\alpha_1, \dots, \alpha_c$, the overall risk is given by
$R = \int R(\alpha(\mathbf{x}) \mid \mathbf{x})\, P(\mathbf{x})\, d\mathbf{x}$
If $\alpha(\mathbf{x})$ is chosen such that $R(\alpha(\mathbf{x}) \mid \mathbf{x})$ is as small as possible for every $\mathbf{x}$, then the overall risk $R$ will be minimised.
Bayes decision rule: $\alpha^*(\mathbf{x}) = \arg\min_{\alpha_i} R(\alpha_i \mid \mathbf{x})$
This leads to the minimum overall risk, called the Bayes risk and denoted $R^*$.
21 Minimum Error Rate Classification
In many classification tasks, only the number of errors/mistakes (the error rate) is of interest. This leads to the so-called symmetric or zero-one loss function:
$\xi(\alpha_i \mid \omega_j) = \begin{cases} 0 & i = j \\ 1 & i \ne j \end{cases}$
The corresponding conditional risk is then
$R(\alpha_i \mid \mathbf{x}) = \sum_{j \ne i} P(\omega_j \mid \mathbf{x}) = 1 - P(\omega_i \mid \mathbf{x})$
Therefore, for minimum error rate:
$h_\theta(\mathbf{x}) = \omega_i \text{ if } P(\omega_i \mid \mathbf{x}) > P(\omega_j \mid \mathbf{x}) \;\forall j \ne i$
22 Discriminant Functions
There are many ways to represent classifiers. One of the most useful is in terms of a set of discriminant functions $\{g_i(\mathbf{x}) : i = 1, \dots, c\}$ such that
$h_\theta(\mathbf{x}) = \omega_i \text{ if } g_i(\mathbf{x}) > g_j(\mathbf{x}) \;\forall j \ne i$
You can replace every $g_i(\mathbf{x})$ by $f(g_i(\mathbf{x}))$, where $f$ is monotonically increasing, and the classification is unchanged. (Duda et al., 2012)
Bayes classifier: $g_i(\mathbf{x}) = -R(\alpha_i \mid \mathbf{x})$
Minimum error rate: $g_i(\mathbf{x}) = P(\omega_i \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{\sum_{k=1}^{c} P(\mathbf{x} \mid \omega_k)\, P(\omega_k)}$
Equivalent choices: $g_i(\mathbf{x}) = P(\mathbf{x} \mid \omega_i)\, P(\omega_i)$ or $g_i(\mathbf{x}) = \ln P(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)$
23 Decision Boundaries/Surfaces
Every decision rule divides the feature space into $c$ decision regions, separated by decision boundaries. The decision regions need not be simply connected. (Duda et al., 2012)
24 Gaussian Data Distributions: Linear Discriminants
Minimum error rate classification can be achieved by the discriminant functions $g_i(\mathbf{x}) = \ln P(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)$.
In the case of a multivariate normal data distribution within each class, $P(\mathbf{x} \mid \omega_i) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_i, \Sigma_i)$, the discriminant functions can be readily evaluated as
$g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^\top \Sigma_i^{-1}(\mathbf{x} - \boldsymbol{\mu}_i) - \tfrac{d}{2}\ln 2\pi - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$
The first term is the squared Mahalanobis distance; the $\tfrac{d}{2}\ln 2\pi$ term is independent of the data and class distributions and can be dropped.
Consider the case where the data distributions in all classes are normal with identical covariance matrices, i.e. $\Sigma_i = \Sigma \;\forall i$. This corresponds to a situation where data from all classes fall into hyperellipsoidal clusters of the same shape and size but different locations in the feature space:
$g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^\top \Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu}_i) + \ln P(\omega_i)$
(the $\ln|\Sigma|$ term is dropped since it is identical for all classes). Expanding the Mahalanobis distance and dropping the quadratic term $\mathbf{x}^\top \Sigma^{-1} \mathbf{x}$, which is identical for all classes, makes the discriminant functions linear:
$g_i(\mathbf{x}) = \boldsymbol{\beta}_i^\top \mathbf{x} + \beta_{i0}$, where $\boldsymbol{\beta}_i = \Sigma^{-1}\boldsymbol{\mu}_i$ and $\beta_{i0} = -\tfrac{1}{2}\boldsymbol{\mu}_i^\top \Sigma^{-1} \boldsymbol{\mu}_i + \ln P(\omega_i)$
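These linear discriminants can be evaluated directly once the class means, shared covariance, and priors are known; a sketch with made-up two-class parameters:

```python
import numpy as np

# Two Gaussian classes with a shared covariance (all parameters illustrative).
mu = np.array([[0.0, 0.0], [2.0, 2.0]])         # row i = class mean mu_i
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])      # shared covariance
priors = np.array([0.5, 0.5])

# beta_i = Sigma^{-1} mu_i ; beta_i0 = -1/2 mu_i^T Sigma^{-1} mu_i + ln P(omega_i)
Sigma_inv = np.linalg.inv(Sigma)
betas = mu @ Sigma_inv                          # row i = beta_i^T (Sigma_inv symmetric)
beta0 = -0.5 * np.einsum('ij,jk,ik->i', mu, Sigma_inv, mu) + np.log(priors)

def classify(x):
    # Pick the class with the largest linear discriminant g_i(x) = beta_i^T x + beta_i0.
    return int(np.argmax(betas @ x + beta0))
```

Points near each mean are assigned to that class; with equal priors the boundary is the hyperplane halfway between the means described on the next slide.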
25 Gaussian Data Distributions: Decision Surfaces
The decision surface between decision regions $R_i$ and $R_j$ is given by $g_i(\mathbf{x}) = g_j(\mathbf{x})$. In the case of the linear discriminant functions arising from normally distributed data with shared covariance, they are hyperplanes given by
$(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^\top \Sigma^{-1} (\mathbf{x} - \mathbf{x}_0) = 0$
where
$\mathbf{x}_0 = \tfrac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\ln\left[ P(\omega_i)/P(\omega_j) \right]}{(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^\top \Sigma^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$
The hyperplane passes through $\mathbf{x}_0$. (Duda et al., 2012)
26 Arbitrary Gaussian Data Distributions
When the data in every class are normally distributed but with arbitrary covariance matrices, the decision regions need not be simply connected (even for one-dimensional data). The decision surfaces can be hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, or hyperhyperboloids (broadly referred to as hyperquadrics). (Duda et al., 2012)
27 Quantifying Error
In a 2-class problem, the classifier separates the feature space into two decision regions, $R_1$ and $R_2$. Errors can arise in two ways: points from $\omega_1$ falling within $R_2$, and vice versa.
$P(\text{error}) = P(\mathbf{x} \in R_2, \omega_1) + P(\mathbf{x} \in R_1, \omega_2)$
$= P(\mathbf{x} \in R_2 \mid \omega_1)\, P(\omega_1) + P(\mathbf{x} \in R_1 \mid \omega_2)\, P(\omega_2)$
$= \int_{R_2} P(\mathbf{x} \mid \omega_1)\, P(\omega_1)\, d\mathbf{x} + \int_{R_1} P(\mathbf{x} \mid \omega_2)\, P(\omega_2)\, d\mathbf{x}$
(Duda et al., 2012)
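The two error integrals can be checked numerically for a simple 1-D case: two equal-variance Gaussians with equal priors, where the minimum-error boundary sits halfway between the means (all numbers below are illustrative):

```python
import numpy as np

# Two 1-D Gaussian classes, equal priors; boundary x0 halfway between the means.
mu1, mu2, sigma = 0.0, 2.0, 1.0
P1 = P2 = 0.5
x0 = 0.5 * (mu1 + mu2)                 # R1 = (-inf, x0), R2 = (x0, inf)

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# P(error) = int_{R2} p(x|w1) P(w1) dx + int_{R1} p(x|w2) P(w2) dx,
# approximated by a fine Riemann sum over a wide interval.
xs = np.linspace(-10.0, 12.0, 200001)
dx = xs[1] - xs[0]
p_err = (np.sum(gauss(xs[xs > x0], mu1, sigma)) * P1
         + np.sum(gauss(xs[xs < x0], mu2, sigma)) * P2) * dx
```

For this configuration the exact answer is $\Phi(-d'/2)$ with $d' = |\mu_2 - \mu_1|/\sigma = 2$, i.e. about 0.1587, which ties in with the discriminability measure on the next slide.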
28 Receiver Operating Characteristic (ROC) Curve
Each point on the curve corresponds to a different operating point (decision threshold) of the classifier.
Discriminability $d'$ is some measure of distance between the class distributions, e.g. $d' = \frac{|\mu_2 - \mu_1|}{\sigma}$ for the 1-D case with equal variance. (Duda et al., 2012)
29 Logistic Regression
Arises from the desire to:
1. Model the posterior probabilities of the classes as linear functions (and treat it as a regression problem).
2. Ensure that the sum of the class posteriors is one.
The model has the form (with $\omega_c$ as the reference class):
$\ln \frac{P(\omega_1 \mid \mathbf{x})}{P(\omega_c \mid \mathbf{x})} = \beta_{10} + \boldsymbol{\beta}_1^\top \mathbf{x}$
$\ln \frac{P(\omega_2 \mid \mathbf{x})}{P(\omega_c \mid \mathbf{x})} = \beta_{20} + \boldsymbol{\beta}_2^\top \mathbf{x}$
$\;\vdots$
$\ln \frac{P(\omega_{c-1} \mid \mathbf{x})}{P(\omega_c \mid \mathbf{x})} = \beta_{(c-1)0} + \boldsymbol{\beta}_{c-1}^\top \mathbf{x}$
Equivalently,
$P(\omega_j \mid \mathbf{x}) = \frac{e^{\beta_{j0} + \boldsymbol{\beta}_j^\top \mathbf{x}}}{1 + \sum_{i=1}^{c-1} e^{\beta_{i0} + \boldsymbol{\beta}_i^\top \mathbf{x}}}, \quad j = 1, \dots, c-1$
$P(\omega_c \mid \mathbf{x}) = \frac{1}{1 + \sum_{i=1}^{c-1} e^{\beta_{i0} + \boldsymbol{\beta}_i^\top \mathbf{x}}}$
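The posterior formulas can be evaluated directly; a small sketch for $c = 3$ classes with made-up parameters, showing that the posteriors are positive and sum to one by construction:

```python
import numpy as np

def posteriors(x, B0, B):
    """Class posteriors of the logistic model with reference class c.

    B0: (c-1,) intercepts beta_j0; B: (c-1, d) weight rows beta_j^T.
    Returns all c posteriors, the reference class last.
    """
    a = np.exp(B0 + B @ x)             # unnormalised odds vs. the reference class
    denom = 1.0 + a.sum()
    return np.append(a / denom, 1.0 / denom)

# Illustrative parameters for c = 3 classes in d = 2 dimensions.
B0 = np.array([0.5, -1.0])
B = np.array([[1.0, 0.0],
              [0.0, 1.0]])
p = posteriors(np.array([0.2, -0.3]), B0, B)
```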
30 2-Class Logistic Regression
Given data $D = \{(\mathbf{x}_j, y_j)\}_{j=1}^{N}$, where
$y_j = \begin{cases} 1 & \text{class is } \omega_1 \\ 0 & \text{class is } \omega_2 \end{cases}$
the log-likelihood of the logistic regression model is
$\ell(\boldsymbol{\beta}) = \ln P(D \mid \boldsymbol{\beta}) = \sum_{j=1}^{N} \left[ y_j \ln P(\omega_1 \mid \mathbf{x}_j, \boldsymbol{\beta}) + (1 - y_j) \ln\left( 1 - P(\omega_1 \mid \mathbf{x}_j, \boldsymbol{\beta}) \right) \right]$
This reduces to
$\ell(\boldsymbol{\beta}) = \sum_{j=1}^{N} \left[ y_j \boldsymbol{\beta}^\top \mathbf{x}_j - \ln\left( 1 + e^{\boldsymbol{\beta}^\top \mathbf{x}_j} \right) \right]$
To obtain the maximum likelihood solution we can set the derivatives to zero:
$\frac{\partial \ell}{\partial \boldsymbol{\beta}} = \sum_{j=1}^{N} \mathbf{x}_j \left( y_j - P(\omega_1 \mid \mathbf{x}_j, \boldsymbol{\beta}) \right) = \mathbf{0}$
These are nonlinear equations, so an iterative solution is used (the Newton-Raphson algorithm).
31 2-Class Logistic Regression
Newton update:
$\boldsymbol{\beta}^{(\tau+1)} = \boldsymbol{\beta}^{(\tau)} - \left( \frac{\partial^2 \ell}{\partial \boldsymbol{\beta}\, \partial \boldsymbol{\beta}^\top} \right)^{-1} \frac{\partial \ell}{\partial \boldsymbol{\beta}}$
In matrix notation,
$\frac{\partial \ell}{\partial \boldsymbol{\beta}} = X^\top (\mathbf{y} - \mathbf{p})$, and the Hessian is $\frac{\partial^2 \ell}{\partial \boldsymbol{\beta}\, \partial \boldsymbol{\beta}^\top} = -X^\top W X$
where $W = \mathrm{diag}(w_1, \dots, w_N)$ with $w_j = P(\omega_1 \mid \mathbf{x}_j, \boldsymbol{\beta}^{(\tau)}) \left( 1 - P(\omega_1 \mid \mathbf{x}_j, \boldsymbol{\beta}^{(\tau)}) \right)$, $X$ is the matrix with rows $\mathbf{x}_j^\top$, $\mathbf{y} = (y_1, \dots, y_N)^\top$, and $\mathbf{p} = \left( P(\omega_1 \mid \mathbf{x}_1, \boldsymbol{\beta}^{(\tau)}), \dots, P(\omega_1 \mid \mathbf{x}_N, \boldsymbol{\beta}^{(\tau)}) \right)^\top$.
Newton update for logistic regression:
$\boldsymbol{\beta}^{(\tau+1)} = \boldsymbol{\beta}^{(\tau)} + (X^\top W X)^{-1} X^\top (\mathbf{y} - \mathbf{p})$
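The Newton update above (iteratively reweighted least squares) can be sketched on synthetic 2-class data; the true parameters, sample size, and iteration count are illustrative choices:

```python
import numpy as np

# Synthetic labelled data: x augmented with a leading 1 for the intercept.
rng = np.random.default_rng(6)
N = 400
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([-1.0, 2.0])                     # made-up ground truth
prob = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = (rng.random(N) < prob).astype(float)

# Newton-Raphson: beta <- beta + (X^T W X)^{-1} X^T (y - p), W = diag(p_j (1 - p_j)).
beta = np.zeros(2)
for _ in range(20):
    p = 1.0 / (1.0 + np.exp(-X @ beta))               # current posteriors P(w1 | x_j)
    W = p * (1.0 - p)                                 # diagonal of W
    H = X.T @ (W[:, None] * X)                        # X^T W X
    beta = beta + np.linalg.solve(H, X.T @ (y - p))
```

At convergence the gradient $X^\top(\mathbf{y} - \mathbf{p})$ is essentially zero; note that Newton's method can diverge on perfectly separable data, where the ML solution does not exist.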
More informationMath 4. Unit 1: Conic Sections Lesson 1.1: What Is a Conic Section?
Unit 1: Conic Sections Lesson 1.1: What Is a Conic Section? 1.1.1: Study - What is a Conic Section? Duration: 50 min 1.1.2: Quiz - What is a Conic Section? Duration: 25 min / 18 Lesson 1.2: Geometry of
More informationOn the convergence of fitting algorithms in computer vision
On the convergence of fitting algorithms in computer vision N. Chernov Department of Mathematics University of Alabama at Birmingham Birmingham, AL 35294 chernov@math.uab.edu Abstract We investigate several
More informationSan Francisco State University ECON 560 Summer Midterm Exam 2. Monday, July hour 15 minutes
San Francisco State University Michael Bar ECON 560 Summer 2018 Midterm Exam 2 Monday, July 30 1 hour 15 minutes Name: Instructions 1. This is closed book, closed notes exam. 2. No calculators or electronic
More informationCT4510: Computer Graphics. Transformation BOCHANG MOON
CT4510: Computer Graphics Transformation BOCHANG MOON 2D Translation Transformations such as rotation and scale can be represented using a matrix M ee. gg., MM = SSSS xx = mm 11 xx + mm 12 yy yy = mm 21
More informationA point-based Bayesian hierarchical model to predict the outcome of tennis matches
A point-based Bayesian hierarchical model to predict the outcome of tennis matches Martin Ingram, Silverpond September 21, 2017 Introduction Predicting tennis matches is of interest for a number of applications:
More informationA COURSE OUTLINE (September 2001)
189-265A COURSE OUTLINE (September 2001) 1 Topic I. Line integrals: 2 1 2 weeks 1.1 Parametric curves Review of parametrization for lines and circles. Paths and curves. Differentiation and integration
More informationB. AA228/CS238 Component
Abstract Two supervised learning methods, one employing logistic classification and another employing an artificial neural network, are used to predict the outcome of baseball postseason series, given
More informationDiscussion: Illusions of Sparsity by Giorgio Primiceri
Discussion: Illusions of Sparsity by Giorgio Primiceri Pablo Guerron-Quintana Boston College June, 2018 Pablo Guerron-Quintana Discussion 1 / 17 Executive Summary Motivation: Is regular coke (dense model)
More informationDNS Study on Three Vortex Identification Methods
Γ DNS Study on Three Vortex Identification Methods Yinlin Dong Yong Yang Chaoqun Liu Technical Report 2016-07 http://www.uta.edu/math/preprint/ DNS Study on Three Vortex Identification Methods Yinlin Dong
More informationCS 7641 A (Machine Learning) Sethuraman K, Parameswaran Raman, Vijay Ramakrishnan
CS 7641 A (Machine Learning) Sethuraman K, Parameswaran Raman, Vijay Ramakrishnan Scenario 1: Team 1 scored 200 runs from their 50 overs, and then Team 2 reaches 146 for the loss of two wickets from their
More informationBayesian model averaging with change points to assess the impact of vaccination and public health interventions
Bayesian model averaging with change points to assess the impact of vaccination and public health interventions SUPPLEMENTARY METHODS Data sources U.S. hospitalization data were obtained from the Healthcare
More informationChapter 10 Aggregate Demand I: Building the IS LM Model
Chapter 10 Aggregate Demand I: Building the IS LM Model Zhengyu Cai Ph.D. Institute of Development Southwestern University of Finance and Economics All rights reserved http://www.escience.cn/people/zhengyucai/index.html
More informationDo New Bike Share Stations Increase Member Use?: A Quasi-Experimental Study
Do New Bike Share Stations Increase Member Use?: A Quasi-Experimental Study Jueyu Wang & Greg Lindsey Humphrey School of Public Affairs University of Minnesota Acknowledgement: NSF Sustainable Research
More informationWhat is Restrained and Unrestrained Pipes and what is the Strength Criteria
What is Restrained and Unrestrained Pipes and what is the Strength Criteria Alex Matveev, September 11, 2018 About author: Alex Matveev is one of the authors of pipe stress analysis codes GOST 32388-2013
More informationLesson 14: Modeling Relationships with a Line
Exploratory Activity: Line of Best Fit Revisited 1. Use the link http://illuminations.nctm.org/activity.aspx?id=4186 to explore how the line of best fit changes depending on your data set. A. Enter any
More informationFinding your feet: modelling the batting abilities of cricketers using Gaussian processes
Finding your feet: modelling the batting abilities of cricketers using Gaussian processes Oliver Stevenson & Brendon Brewer PhD candidate, Department of Statistics, University of Auckland o.stevenson@auckland.ac.nz
More informationTSP at isolated intersections: Some advances under simulation environment
TSP at isolated intersections: Some advances under simulation environment Zhengyao Yu Vikash V. Gayah Eleni Christofa TESC 2018 December 5, 2018 Overview Motivation Problem introduction Assumptions Formation
More informationSEPARATING A GAS MIXTURE INTO ITS CONSTITUENT ANALYTES USING FICA
SEPARATING A GAS MIXTURE INTO ITS CONSTITUENT ANALYTES USING FICA Aparna Mahadevan Thesis submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the
More informationA Mechanics-Based Approach for Putt Distance Optimization
University of Central Florida HIM 1990-2015 Open Access A Mechanics-Based Approach for Putt Distance Optimization 2015 Pascual Santiago-Martinez University of Central Florida Find similar works at: http://stars.library.ucf.edu/honorstheses1990-2015
More informationHIGH RESOLUTION DEPTH IMAGE RECOVERY ALGORITHM USING GRAYSCALE IMAGE.
HIGH RESOLUTION DEPTH IMAGE RECOVERY ALGORITHM USING GRAYSCALE IMAGE Kazunori Uruma 1, Katsumi Konishi 2, Tomohiro Takahashi 1 and Toshihiro Furukawa 1 1 Graduate School of Engineering, Tokyo University
More informationA statistical model for classifying ambient noise inthesea*
A statistical model for classifying ambient noise inthesea* OCEANOLOGIA, 39(3), 1997. pp.227 235. 1997, by Institute of Oceanology PAS. KEYWORDS Ambient noise Sea-state classification Wiesław Kiciński
More informationReport for Experiment #11 Testing Newton s Second Law On the Moon
Report for Experiment #11 Testing Newton s Second Law On the Moon Neil Armstrong Lab partner: Buzz Aldrin TA: Michael Collins July 20th, 1969 Abstract In this experiment, we tested Newton s second law
More informationNonlife Actuarial Models. Chapter 7 Bühlmann Credibility
Nonlife Actuarial Models Chapter 7 Bühlmann Credibility Learning Objectives 1. Basic framework of Bühlmann credibility 2. Variance decomposition 3. Expected value of the process variance 4. Variance of
More informationPredicting Tennis Match Outcomes Through Classification Shuyang Fang CS074 - Dartmouth College
Predicting Tennis Match Outcomes Through Classification Shuyang Fang CS074 - Dartmouth College Introduction The governing body of men s professional tennis is the Association of Tennis Professionals or
More informationDevelopment of Decision Support Tools to Assess Pedestrian and Bicycle Safety: Development of Safety Performance Function
Development of Decision Support Tools to Assess Pedestrian and Bicycle Safety: Development of Safety Performance Function Valerian Kwigizile, Jun Oh, Ron Van Houten, & Keneth Kwayu INTRODUCTION 2 OVERVIEW
More informationTie Breaking Procedure
Ohio Youth Basketball Tie Breaking Procedure The higher seeded team when two teams have the same record after completion of pool play will be determined by the winner of their head to head competition.
More informationHighly Cited Psychometrika Articles:
Highly Cited Psychometrika Articles: 1936 2001 1 2016 1 List compiled by Willem Heiser and Larry Hubert. Citation counts are from Google Scholar as of April, 1 Greater than 3000 citations: Akaike, H. (1987).
More informationModelling and Simulation of Environmental Disturbances
Modelling and Simulation of Environmental Disturbances (Module 5) Dr Tristan Perez Centre for Complex Dynamic Systems and Control (CDSC) Prof. Thor I Fossen Department of Engineering Cybernetics 18/09/2007
More informationSports Predictive Analytics: NFL Prediction Model
Sports Predictive Analytics: NFL Prediction Model By Dr. Ash Pahwa IEEE Computer Society San Diego Chapter January 17, 2017 Copyright 2017 Dr. Ash Pahwa 1 Outline Case Studies of Sports Analytics Sports
More information