Lecture 5. Optimisation. Regularisation


Lecture 5. Optimisation. Regularisation COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Andrey Kan Copyright: University of Melbourne

This lecture
- Iterative optimisation
- Loss functions
- Coordinate descent
- Gradient descent
- Regularisation
- Model complexity
- Constrained modelling
- Bias-variance trade-off

Iterative Optimisation
A very brief summary of a few basic optimisation methods

Supervised learning*
1. Assume a model (e.g., a linear model). Denote the parameters of the model as $\theta$; model predictions are $f(x, \theta)$.
2. Choose a way to measure the discrepancy between predictions and training data, e.g., the sum of squared residuals $\|y - X\theta\|^2$.
3. Training = parameter estimation = optimisation: $\hat{\theta} = \operatorname{argmin}_{\theta \in \Theta} L(\text{data}, \theta)$

*This is the setup of what's called frequentist supervised learning. A different view on parameter estimation/training will be presented later in the subject.
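To make step 3 concrete, here is a minimal sketch (not from the slides) of "training = optimisation" for a linear model with the sum-of-squared-residuals objective; the toy data and the use of scipy.optimize.minimize are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data (illustrative only): n = 50 examples, p = 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

def loss(theta):
    """Sum of squared residuals ||y - X theta||^2."""
    r = y - X @ theta
    return r @ r

# Training = parameter estimation = optimisation
theta0 = np.zeros(X.shape[1])          # starting guess
theta_hat = minimize(loss, theta0).x   # argmin of the loss over theta
print(theta_hat)
```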

Loss functions: measuring discrepancy
For a single training example, the discrepancy between prediction and label is measured using a loss function. Examples:
- squared loss $l_{sq} = (y - f(x, \theta))^2$
- absolute loss $l_{abs} = |y - f(x, \theta)|$
- perceptron loss (next lecture)
- hinge loss (later in the subject)
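As a small illustration (not part of the slides), the first two losses can be written directly in numpy; the helper names l_sq and l_abs are hypothetical.

```python
import numpy as np

def l_sq(y, y_hat):
    """Squared loss (y - f(x, theta))^2, applied element-wise."""
    return (y - y_hat) ** 2

def l_abs(y, y_hat):
    """Absolute loss |y - f(x, theta)|, applied element-wise."""
    return np.abs(y - y_hat)

# Example: both losses for a single prediction
print(l_sq(3.0, 2.5), l_abs(3.0, 2.5))   # 0.25 0.5
```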

Solving optimisation problems
Analytic (aka closed-form) solution
- Known only in a limited number of cases
- Use the necessary condition: $\frac{\partial L}{\partial \theta_1} = \dots = \frac{\partial L}{\partial \theta_p} = 0$
Approximate iterative solution (see the skeleton after this list):
1. Initialisation: choose a starting guess $\theta^{(1)}$, set $i = 1$
2. Update: $\theta^{(i+1)} \leftarrow \text{SomeRule}(\theta^{(i)})$, set $i \leftarrow i + 1$
3. Termination: decide whether to Stop
4. Go to Step 2
5. Stop: return $\hat{\theta} \approx \theta^{(i)}$
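A hedged sketch of the five-step scheme as a generic loop: some_rule stands in for the slide's "SomeRule", and the tolerance-based stopping criterion is one illustrative choice among the "other stopping criteria" mentioned later.

```python
import numpy as np

def iterative_optimise(some_rule, theta1, max_iter=1000, tol=1e-8):
    """Generic iterative scheme: repeatedly apply an update rule
    until the parameters stop changing (or max_iter is reached)."""
    theta = np.asarray(theta1, dtype=float)              # Step 1: initialisation
    for _ in range(max_iter):
        theta_new = some_rule(theta)                     # Step 2: update
        if np.linalg.norm(theta_new - theta) < tol:      # Step 3: termination test
            return theta_new                             # Step 5: stop, return estimate
        theta = theta_new                                # Step 4: go to Step 2
    return theta
```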

Finding the deepest point
[Figure: a loss surface over two parameters $\theta_1$ and $\theta_2$.]
In this example, the model has 2 parameters. Each location corresponds to a particular combination of parameter values; depth indicates the discrepancy between the model using those values and the data.

Coordinate descent
Suppose $\theta = [\theta_1, \dots, \theta_K]'$
1. Choose $\theta^{(1)}$ and some $T$
2. For $i$ from 1 to $T$*
   1. $\theta^{(i+1)} \leftarrow \theta^{(i)}$
   2. For $j$ from 1 to $K$
      1. Fix the components of $\theta^{(i+1)}$, except the $j$-th component
      2. Find $\hat{\theta}_j^{(i+1)}$ that minimises $L(\theta_j^{(i+1)})$
      3. Update the $j$-th component of $\theta^{(i+1)}$
3. Return $\hat{\theta} \approx \theta^{(i)}$
*Other stopping criteria can be used
[Figure: coordinate descent path over contour lines of the objective in the $(\theta_1, \theta_2)$ plane. Wikimedia Commons, author: Nicoguaro (CC4).]
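A minimal sketch of this procedure where the objective is assumed to be the least-squares loss $\|y - X\theta\|^2$ (the slide leaves $L$ generic); for that loss, each inner step has a closed-form minimiser over a single coordinate. The toy data are illustrative.

```python
import numpy as np

def coordinate_descent_ls(X, y, T=100):
    """Coordinate descent on the least-squares loss ||y - X theta||^2."""
    n, K = X.shape
    theta = np.zeros(K)                      # step 1: starting guess
    for _ in range(T):                       # step 2: outer iterations
        for j in range(K):                   # cycle over coordinates
            # Partial residual with the j-th component's contribution removed
            r_j = y - X @ theta + X[:, j] * theta[j]
            # Closed-form minimiser over the j-th coordinate alone
            theta[j] = (X[:, j] @ r_j) / (X[:, j] @ X[:, j])
    return theta

# Usage sketch with toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=100)
print(coordinate_descent_ls(X, y))
```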

Gradient
The gradient (at point $\theta$) is defined as $\left[\frac{\partial L}{\partial \theta_1}, \dots, \frac{\partial L}{\partial \theta_p}\right]'$ computed at point $\theta$. One can show that the gradient points in the direction of maximal change of $L(\theta)$ when departing from point $\theta$.
Shorthand notation: $\nabla L \equiv \left[\frac{\partial L}{\partial \theta_1}, \dots, \frac{\partial L}{\partial \theta_p}\right]'$ computed at point $\theta$. Here $\nabla$ is the nabla symbol.
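To connect the definition with computation, here is a small sketch (not from the slides) that approximates $\nabla L$ by central finite differences; this is a common way to sanity-check a hand-derived gradient. The example function is made up.

```python
import numpy as np

def numerical_gradient(L, theta, eps=1e-6):
    """Approximate [dL/dtheta_1, ..., dL/dtheta_p] at theta
    using central finite differences."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        e = np.zeros_like(theta)
        e[k] = eps
        grad[k] = (L(theta + e) - L(theta - e)) / (2 * eps)
    return grad

# Example: L(theta) = theta_1^2 + 3*theta_2, whose gradient is [2*theta_1, 3]
L = lambda t: t[0] ** 2 + 3 * t[1]
print(numerical_gradient(L, [1.0, 5.0]))   # approximately [2.0, 3.0]
```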

Gradient descent
1. Choose $\theta^{(1)}$ and some $T$
2. For $i$ from 1 to $T$*
   1. $\theta^{(i+1)} = \theta^{(i)} - \eta \nabla L(\theta^{(i)})$
3. Return $\hat{\theta} \approx \theta^{(i)}$
We assume $L$ is differentiable. Note: the step size $\eta$ is dynamically updated in each step.
*Other stopping criteria can be used
[Figure: gradient descent steps on contour lines of the objective, starting from $\theta^{(0)}$. Wikimedia Commons, authors: Olegalexandrov, Zerodamage.]
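A minimal sketch of the update rule for the least-squares loss (an assumed concrete $L$; the slide keeps it generic), using a fixed $\eta$ rather than a dynamically updated one. The data and step size are illustrative.

```python
import numpy as np

def gradient_descent_ls(X, y, eta=0.001, T=1000):
    """Gradient descent on L(theta) = ||y - X theta||^2."""
    theta = np.zeros(X.shape[1])             # step 1: starting guess
    for _ in range(T):                       # step 2: iterate
        grad = -2 * X.T @ (y - X @ theta)    # gradient of the squared loss
        theta = theta - eta * grad           # theta^(i+1) = theta^(i) - eta * grad
    return theta

# Usage sketch with toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
print(gradient_descent_ls(X, y))
```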

Regularisation
"Process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting" (Wikipedia)

Previously: Regularisation
A major technique, common in machine learning. It addresses one or more of the following related problems:
- avoid ill-conditioning
- introduce prior knowledge
- constrain modelling
This is achieved by augmenting the objective function. Not just for linear methods.

Example regression problem
[Figure: scatter plot of data in the $(X, Y)$ plane.]
How complex a model should we use?

Underfitting (linear regression)
[Figure: a straight-line fit to the data.]
The model class Θ can be too simple to possibly fit the true model.

Overfitting (non-parametric smoothing)
[Figure: a very flexible curve passing through the data points.]
The model class Θ can be so complex that it fits the true model plus noise.

Actual model ($x \sin x$)
[Figure: the true function $x \sin x$ overlaid on the data.]
The right model class Θ will sacrifice some training error in exchange for lower test error.

How to vary model complexity
- Method 1: explicit model selection
- Method 2: regularisation
Usually, method 1 can be viewed as a special case of method 2.

1. Explicit model selection
Try different classes of models. For example, try polynomial models of various degree $d$ (linear, quadratic, cubic, ...). Use held-out validation (cross-validation) to select the model:
1. Split the training data into $D_{train}$ and $D_{validate}$ sets
2. For each degree $d$ we have a model $f_d$
   1. Train $f_d$ on $D_{train}$
   2. Test $f_d$ on $D_{validate}$
3. Pick the degree $\hat{d}$ that gives the best test score
4. Re-train the model $f_{\hat{d}}$ using all data
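A minimal sketch of this recipe for polynomial degrees, assuming 1-D inputs and using numpy's polyfit/polyval to play the role of $f_d$; the toy data, split proportions and degree range are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=60)
y = x * np.sin(x) + rng.normal(scale=1.0, size=60)     # toy data, true model x*sin(x)

# 1. Split the data into D_train and D_validate
idx = rng.permutation(x.size)
train, val = idx[:40], idx[40:]

# 2. For each degree d, train f_d on D_train and test it on D_validate
val_errors = {}
for d in range(1, 10):
    w = np.polyfit(x[train], y[train], deg=d)          # train f_d
    resid = y[val] - np.polyval(w, x[val])              # test f_d
    val_errors[d] = np.mean(resid ** 2)

# 3. Pick the degree with the best validation score
d_best = min(val_errors, key=val_errors.get)

# 4. Re-train the chosen model using all data
w_best = np.polyfit(x, y, deg=d_best)
print(d_best, val_errors[d_best])
```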

2. Vary complexity by regularisation
Augment the problem: $\hat{\theta} = \operatorname{argmin}_{\theta \in \Theta} \left( L(\text{data}, \theta) + \lambda R(\theta) \right)$
E.g., ridge regression: $\hat{w} = \operatorname{argmin}_{w \in W} \|y - Xw\|_2^2 + \lambda \|w\|_2^2$
Note that $R(\theta)$ does not depend on the data. Use held-out validation/cross-validation to choose $\lambda$.
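A minimal sketch of choosing $\lambda$ by held-out validation for ridge regression; it relies on the standard closed-form ridge solution (which appears in the table a couple of slides below), and the toy data, split and $\lambda$ grid are illustrative.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=80)

# Held-out split: first 60 rows for training, last 20 for validation
train, val = np.arange(60), np.arange(60, 80)

# Choose lambda by its error on the validation set
best_lam, best_err = None, np.inf
for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    w = ridge_fit(X[train], y[train], lam)
    err = np.mean((y[val] - X[val] @ w) ** 2)
    if err < best_err:
        best_lam, best_err = lam, err
print(best_lam, best_err)
```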

Example: polynomial regression
A 9th-order polynomial regression model of the form $\hat{f} = w_0 + w_1 x + \dots + w_9 x^9$, regularised with a $\lambda \|w\|_2^2$ term.
[Figures from Bishop, pp. 7-9 and 150, showing the fitted polynomial for different values of $\lambda$, including $\lambda = 0$.]

Regulariser as a constraint
For illustrative purposes, consider a modified problem: minimise $\|y - Xw\|_2^2$ subject to $\|w\|_2^2 \le \lambda$ for $\lambda > 0$.
[Figure: two panels in the $(w_1, w_2)$ plane showing contour lines of the objective function, the solution to unconstrained linear regression, and the feasible region defined by the regulariser: left, ridge regression ($\|w\|_2^2$); right, lasso ($\|w\|_1$).]
The regulariser defines the feasible region. Lasso (L1 regularisation) encourages solutions to sit on the axes: some of the weights are set to zero, so the solution is sparse.

Regularised linear regression

Algorithm | Minimises | Regulariser | Solution
Linear regression | $\|y - Xw\|_2^2$ | None | $(X'X)^{-1} X'y$ (if the inverse exists)
Ridge regression | $\|y - Xw\|_2^2 + \lambda \|w\|_2^2$ | L2 norm | $(X'X + \lambda I)^{-1} X'y$
Lasso | $\|y - Xw\|_2^2 + \lambda \|w\|_1$ | L1 norm | No closed form, but solutions are sparse and suitable for high-dimensional data
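Since the lasso has no closed form, one common approach (not spelled out on the slide, but it ties back to the coordinate descent scheme from earlier) is coordinate descent with soft thresholding. This is a hedged sketch for the objective $\|y - Xw\|_2^2 + \lambda \|w\|_1$ with illustrative toy data.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, T=200):
    """Coordinate descent for ||y - Xw||_2^2 + lam * ||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)              # precompute x_j' x_j
    for _ in range(T):
        for j in range(p):
            # Partial residual excluding the j-th feature's contribution
            r_j = y - X @ w + X[:, j] * w[j]
            # Exact minimiser over w_j alone: soft-thresholded least-squares update
            w[j] = soft_threshold(X[:, j] @ r_j, lam / 2) / col_sq[j]
    return w

# Usage sketch: sparse true weights; several estimates land exactly on zero
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = X @ np.array([3.0, 0, 0, -2.0, 0, 0, 0, 1.0]) + rng.normal(scale=0.1, size=100)
print(lasso_cd(X, y, lam=5.0))
```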

Bias-variance trade-off
Analysis of the relations between training error, test error and model complexity

Assessing generalisation capacity
- Supervised learning: train the model on existing data, then make predictions on new data. The generalisation capacity of the model is an important consideration.
- Training the model amounts to minimisation of the training error; generalisation capacity is captured by the test error.
- Model complexity is a major factor that influences the ability of the model to generalise.
In this section, our aim is to explore the relations between training error, test error and model complexity.

Training error
Suppose the training data is $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$. Let $\hat{f}(x)$ denote the predicted value for $x$ from the model trained on $D$, and let $l(y_i, \hat{f}(x_i))$ denote the loss on the $i$-th training example.
Training error: $\frac{1}{n} \sum_{i=1}^{n} l(y_i, \hat{f}(x_i))$

Training error and model complexity
For a more complex model, the training error goes down. With a finite number of points, we can usually reduce the training error to 0 (is it always possible?).
[Figures: a flexible fit through all the data points in the $(x, y)$ plane, and training error plotted against model complexity.]

Test error
Assume $Y = h(x_0) + \varepsilon$, where $x_0$ is a fixed instance, $h(x)$ is an unknown true function, and $\varepsilon$ is noise with $E[\varepsilon] = 0$ and $Var(\varepsilon) = \sigma^2$.
Treat the training data as a random variable $\mathcal{D}$. Draw $D \sim \mathcal{D}$, train the model on $D$, make predictions. The prediction $\hat{f}(x_0)$ is a random variable, despite $x_0$ being fixed.
Test error for $x_0$: $E\left[l(Y, \hat{f}(x_0))\right]$. The expectation is taken with respect to $\mathcal{D}$ and $\varepsilon$ ($\mathcal{D}$ and $\varepsilon$ are assumed to be independent).

Bias-variance decomposition
For the following analysis, we consider squared loss as an important special case: $l(Y, \hat{f}(x_0)) = (Y - \hat{f}(x_0))^2$
Lemma (Bias-Variance Decomposition):
$E\left[l(Y, \hat{f}(x_0))\right] = \left(E[Y] - E[\hat{f}]\right)^2 + Var(\hat{f}) + Var(Y)$
Here the left-hand side is the test error for $x_0$, and the three terms on the right are the (bias)$^2$, the variance, and the irreducible error.

Decomposition proof sketch*
Here $(x_0)$ is omitted to de-clutter the notation.
$E\left[(Y - \hat{f})^2\right] = E\left[Y^2 + \hat{f}^2 - 2Y\hat{f}\right]$
$= E[Y^2] + E[\hat{f}^2] - E[2Y\hat{f}]$
$= Var(Y) + E[Y]^2 + Var(\hat{f}) + E[\hat{f}]^2 - 2E[Y]E[\hat{f}]$ (using $E[Z^2] = Var(Z) + E[Z]^2$ and the independence of $Y$ and $\hat{f}$)
$= Var(Y) + Var(\hat{f}) + \left(E[Y]^2 - 2E[Y]E[\hat{f}] + E[\hat{f}]^2\right)$
$= Var(Y) + Var(\hat{f}) + \left(E[Y] - E[\hat{f}]\right)^2$
*Green slides are non-examinable
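To make the lemma concrete, here is a small simulation sketch (illustrative assumptions: $h(x) = x \sin x$ as on the earlier slide, Gaussian noise, and degree-1 polynomial fits as the model) that estimates bias², variance and the irreducible error at a fixed $x_0$ by redrawing the training set many times; their sum should roughly match the Monte Carlo estimate of the test error.

```python
import numpy as np

rng = np.random.default_rng(0)
h = lambda x: x * np.sin(x)        # "unknown" true function (assumed for illustration)
sigma = 1.0                        # noise standard deviation
x0 = 5.0                           # fixed test instance
n, degree, repeats = 30, 1, 5000   # training set size, model complexity, no. of draws

preds, test_losses = [], []
for _ in range(repeats):
    # Draw a fresh training set D and a fresh test label Y = h(x0) + eps
    x = rng.uniform(1, 10, size=n)
    y = h(x) + rng.normal(scale=sigma, size=n)
    Y = h(x0) + rng.normal(scale=sigma)
    # Train the model on D and predict at x0
    w = np.polyfit(x, y, deg=degree)
    f_hat = np.polyval(w, x0)
    preds.append(f_hat)
    test_losses.append((Y - f_hat) ** 2)

preds = np.array(preds)
bias_sq = (h(x0) - preds.mean()) ** 2     # (E[Y] - E[f_hat])^2, since E[Y] = h(x0)
variance = preds.var()                    # Var(f_hat)
irreducible = sigma ** 2                  # Var(Y)
print(bias_sq + variance + irreducible)   # should be close to...
print(np.mean(test_losses))               # ...the Monte Carlo test error
```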

Training data as a random variable
[Figure: three different training sets $D_1$, $D_2$, $D_3$ drawn in the $(x, y)$ plane.]

Training data as a random variable (continued)
[Figure: the three training sets $D_1$, $D_2$, $D_3$ in the $(x, y)$ plane.]

Model complexity and variance
Simple model: low variance. Complex model: high variance.
[Figure: two panels in the $(x, y)$ plane, each marking the test point $x_0$: left, a simple model (low variance); right, a complex model (high variance).]

Model complexity and bias
Simple model: high bias. Complex model: low bias.
[Figure: two panels in the $(x, y)$ plane, each marking $h(x_0)$ at the test point $x_0$: left, a simple model (high bias); right, a complex model (low bias).]

Bias-variance trade-off
Simple model: high bias, low variance. Complex model: low bias, high variance.
[Figure: test error, (bias)$^2$ and variance plotted against model complexity, from simple to complex models.]

Test error and training error
[Figure: test error and training error plotted against model complexity, from simple to complex models.]

This lecture (summary)
- Iterative optimisation
- Loss functions
- Coordinate descent
- Gradient descent
- Regularisation
- Model complexity
- Constrained modelling
- Bias-variance trade-off