Lecture 10. Support Vector Machines (cont.)

COMP90051 Statistical Machine Learning. Semester 2, 2017. Lecturer: Andrey Kan. Copyright: University of Melbourne.

This lecture:
- Soft margin SVM: intuition and problem formulation
- Solving the optimisation: transforming the original objective; re-parameterisation
- Finishing touches: complementary slackness; solving the dual problem

Brief recap: hard margin SVM. An SVM is a linear binary classifier. Informally, it aims for the safest boundary; formally, we derive the distance from a point to the boundary and use a trick to resolve the scaling ambiguity. The hard margin SVM objective is
$\arg\min_w \|w\|$ s.t. $y_i (w^T x_i + b) \geq 1$ for $i = 1, \ldots, n$

When data is not linearly separable. The hard margin loss is too stringent: real data is unlikely to be linearly separable, and if the data is not separable, hard margin SVMs are in trouble. (Figure: a two-class scatter plot in $x_1$, $x_2$ with no separating line.) SVMs offer three approaches to address this problem:
1. Still use hard margin SVM, but transform the data (next lecture)
2. Relax the constraints (next slide)
3. The combination of 1 and 2

Addressing non-linearity using soft margin SVMs. We no longer assume that the data is linearly separable.

Soft margin SVM. In the soft margin SVM formulation we relax the constraints to allow points to be inside the margin or even on the wrong side of the boundary. However, we penalise boundaries by an amount that reflects the extent of violation. (Figure: the objective's penalty takes into account the orange distances from the violating points to the margin.)

Hinge loss: soft margin SVM loss.
Hard margin SVM loss:
$l_\infty = 0$ if $1 - y(w^T x + b) \leq 0$, and $\infty$ otherwise
Soft margin SVM loss (hinge loss):
$l_h = 0$ if $1 - y(w^T x + b) \leq 0$, and $1 - y(w^T x + b)$ otherwise
(Figure: plot of $l_h(s, y)$ against $s = y(w^T x + b)$; compare this with the perceptron loss.)
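
To make the two losses concrete, here is a minimal sketch in Python (assuming NumPy arrays and a single training example; the function names are illustrative, not from the lecture):

```python
import numpy as np

def hard_margin_loss(w, b, x, y):
    """Hard margin loss: 0 if the constraint 1 - y(w.x + b) <= 0 holds, infinity otherwise."""
    return 0.0 if 1.0 - y * (np.dot(w, x) + b) <= 0 else np.inf

def hinge_loss(w, b, x, y):
    """Soft margin (hinge) loss: max(0, 1 - y(w.x + b))."""
    return max(0.0, 1.0 - y * (np.dot(w, x) + b))
```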

Soft margin SVM objective.
Soft margin SVM loss (hinge loss): $l_h = 0$ if $1 - y(w^T x + b) \leq 0$, and $1 - y(w^T x + b)$ otherwise.
By analogy with ridge regression, the soft margin SVM objective can be defined as
$\arg\min_w \sum_{i=1}^n l_h(x_i, y_i, w) + \lambda \|w\|^2$
We are going to re-formulate this objective to make it more amenable to analysis.

Re-formulating the soft margin objective.
Soft margin SVM loss (hinge loss): $l_h = \max(0, 1 - y_i(w^T x_i + b))$
Define slack variables as an upper bound on the loss:
$\xi_i \geq l_h = \max(0, 1 - y_i(w^T x_i + b))$
or equivalently $\xi_i \geq 1 - y_i(w^T x_i + b)$ and $\xi_i \geq 0$.
Re-write the soft margin SVM objective as:
$\arg\min_{w, \xi} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i$
s.t. $\xi_i \geq 1 - y_i(w^T x_i + b)$ for $i = 1, \ldots, n$
$\xi_i \geq 0$ for $i = 1, \ldots, n$
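
A small sketch of evaluating this objective (assuming NumPy arrays $X$ of shape (n, d) and labels $y$ in {-1, +1}; at the optimum each slack equals the corresponding hinge loss):

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Primal soft margin objective 0.5 * ||w||^2 + C * sum_i xi_i,
    with each slack at its smallest feasible value, the hinge loss."""
    slacks = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * slacks.sum()
```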

Two variations of SVM.
Hard margin SVM objective*:
$\arg\min_w \frac{1}{2}\|w\|^2$
s.t. $y_i(w^T x_i + b) \geq 1$ for $i = 1, \ldots, n$
Soft margin SVM objective:
$\arg\min_{w, \xi} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i$
s.t. $y_i(w^T x_i + b) \geq 1 - \xi_i$ for $i = 1, \ldots, n$
$\xi_i \geq 0$ for $i = 1, \ldots, n$
In the second case, the constraints are relaxed ("softened") by allowing violations by $\xi_i$. Hence the name "soft margin".
*Changed $\|w\|$ to $\frac{1}{2}\|w\|^2$. The modified objective leads to the same solution.

The Mathy Part 2: Optimisation strikes back

SVM training preliminaries. Training an SVM means solving the corresponding optimisation problem, either hard margin or soft margin. We will focus on solving the hard margin SVM (simpler); soft margin SVM training results in a similar solution. The hard margin SVM objective is a constrained optimisation problem, called the primal problem:
$\arg\min_w \frac{1}{2}\|w\|^2$
s.t. $y_i(w^T x_i + b) - 1 \geq 0$ for $i = 1, \ldots, n$

Side note: constrained optimisation. In general, a constrained optimisation problem is
minimise $f(x)$
s.t. $g_i(x) \leq 0$, $i = 1, \ldots, n$
$h_j(x) = 0$, $j = 1, \ldots, m$
E.g., find the deepest point in the lake, south of the bridge. The big deal: common optimisation methods, such as gradient descent, cannot be directly applied to solve constrained optimisation. Method of Lagrange/Karush-Kuhn-Tucker (KKT) multipliers: transform the original (primal) problem into an unconstrained optimisation problem (the dual), then analyse/relate necessary and sufficient conditions for solutions of both problems.

Side note: Lagrangian/KKT multipliers. Introduce extra variables and define an auxiliary objective function
$L(x, \lambda, \nu) = f(x) + \sum_{i=1}^n \lambda_i g_i(x) + \sum_{j=1}^m \nu_j h_j(x)$
The auxiliary function $L(x, \lambda, \nu)$ is called the Lagrangian, and is also sometimes called the KKT objective. The additional variables $\lambda$ and $\nu$ are called Lagrange multipliers ($\lambda$ are also called KKT multipliers). This is accompanied by a list of KKT conditions; the next slides show the KKT conditions for the SVM. Under mild assumptions on $f(x)$, $g_i(x)$ and $h_j(x)$, the KKT conditions are both necessary and sufficient for $x^*$, $\lambda^*$ and $\nu^*$ being primal and dual optimal points. The assumptions about the problem are called regularity conditions or constraint qualifications.
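
As a small illustration of the construction (code not from the lecture): given Python callables for $f$, the $g_i$ and the $h_j$, the Lagrangian is just a weighted sum.

```python
def lagrangian(f, gs, hs, x, lam, nu):
    """L(x, lambda, nu) = f(x) + sum_i lambda_i * g_i(x) + sum_j nu_j * h_j(x)."""
    return (f(x)
            + sum(l * g(x) for l, g in zip(lam, gs))
            + sum(v * h(x) for v, h in zip(nu, hs)))
```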

Lagrangian for hard margin SVM. The hard margin SVM objective is a constrained optimisation problem:
$\arg\min_w \frac{1}{2}\|w\|^2$
s.t. $y_i(w^T x_i + b) - 1 \geq 0$ for $i = 1, \ldots, n$
We approach this problem using the method of Lagrange multipliers/KKT conditions. To this end, we first define the Lagrangian/KKT objective
$L_{KKT}(w, b, \lambda) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \lambda_i \left( y_i(w^T x_i + b) - 1 \right)$
where the first term is the primal objective and the second term encodes the constraints.

KKT conditions for hard margin SVM. Our Lagrangian/KKT objective is
$L_{KKT}(w, b, \lambda) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \lambda_i \left( y_i(w^T x_i + b) - 1 \right)$
The corresponding KKT conditions are:
$y_i(w^T x_i + b) - 1 \geq 0$ for $i = 1, \ldots, n$ (this just repeats the constraints, trivial)
$\lambda_i \geq 0$ for $i = 1, \ldots, n$ (non-negative multipliers are required for the theorem to work)
$\lambda_i \left( y_i(w^T x_i + b) - 1 \right) = 0$ for $i = 1, \ldots, n$ (complementary slackness; we'll come back to that)
$\nabla_{w,b} L_{KKT}(w, b, \lambda) = 0$ (zero gradient, somewhat similar to unconstrained optimisation)
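
A sketch of checking the first three conditions numerically (NumPy; illustrative only, and the zero-gradient condition is handled on the next slide):

```python
import numpy as np

def check_kkt(w, b, lam, X, y, tol=1e-6):
    """Check primal feasibility, non-negativity of the multipliers and
    complementary slackness for a candidate (w, b, lambda)."""
    margins = y * (X @ w + b) - 1.0                   # y_i (w.x_i + b) - 1
    primal_ok = np.all(margins >= -tol)               # constraints hold
    dual_ok = np.all(lam >= -tol)                     # lambda_i >= 0
    slack_ok = np.all(np.abs(lam * margins) <= tol)   # lambda_i * (...) = 0
    return primal_ok and dual_ok and slack_ok
```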

Proposition: why use KKT conditions. If $w^*$ and $b^*$ are a solution of the primal hard margin SVM problem, then there exists $\lambda^*$ such that together $w^*$, $b^*$ and $\lambda^*$ satisfy the KKT conditions. Conversely, if some $w^*$, $b^*$ and $\lambda^*$ satisfy the KKT conditions, then $w^*$ and $b^*$ are a solution of the primal problem. The proof is outside the scope of this subject; it amounts to verifying that the SVM primal problem satisfies certain regularity conditions.

Gradient of the Lagrangian. Our Lagrangian/KKT objective is
$L_{KKT}(w, b, \lambda) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \lambda_i \left( y_i(w^T x_i + b) - 1 \right)$
The following conditions are necessary:
$\frac{\partial L_{KKT}}{\partial b} = -\sum_{i=1}^n \lambda_i y_i = 0$
$\frac{\partial L_{KKT}}{\partial w_j} = w_j - \sum_{i=1}^n \lambda_i y_i (x_i)_j = 0$
Substitute these conditions into the Lagrangian to obtain
$L_{KKT}(\lambda) = \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j y_i y_j x_i^T x_j$
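
The substituted objective only involves dot products between training points, so it is easy to evaluate via the Gram matrix (a sketch, assuming NumPy arrays):

```python
import numpy as np

def dual_objective(lam, X, y):
    """L_KKT(lambda) = sum_i lam_i - 0.5 * sum_ij lam_i lam_j y_i y_j x_i.x_j."""
    K = X @ X.T                                  # Gram matrix of dot products
    return lam.sum() - 0.5 * lam @ (np.outer(y, y) * K) @ lam
```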

Re-parameterisation. Let $w^*$ and $b^*$ be a solution of the primal problem. From the last KKT condition (zero gradient) we have
$L_{KKT}(w^*, b^*, \lambda) = L_{KKT}(\lambda) = \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j y_i y_j x_i^T x_j$
The parameters $\lambda$ are still unknown. In order for $w^*$, $b^*$ and $\lambda^*$ to satisfy all KKT conditions, $\lambda^*$ must maximise $L_{KKT}(\lambda)$. The proof is outside the scope of the subject.

Lagrangian dual for hard margin SVM. Given the above considerations, in order to solve the primal problem we pose a new optimisation problem, called the Lagrangian dual problem:
$\arg\max_\lambda \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j y_i y_j x_i^T x_j$
s.t. $\lambda_i \geq 0$ and $\sum_{i=1}^n \lambda_i y_i = 0$
This is a so-called quadratic optimisation problem, a standard problem that can be solved using off-the-shelf software.
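
The lecture does not prescribe a particular solver; as one illustration, a generic constrained optimiser such as SciPy's SLSQP can handle small instances (a sketch under that assumption, not how production SVM libraries work):

```python
import numpy as np
from scipy.optimize import minimize

def solve_hard_margin_dual(X, y):
    """Maximise the dual subject to lambda_i >= 0 and sum_i lambda_i y_i = 0."""
    n = len(y)
    Q = (X @ X.T) * np.outer(y, y)
    negative_dual = lambda lam: -(lam.sum() - 0.5 * lam @ Q @ lam)  # minimise -L
    res = minimize(negative_dual, np.zeros(n), method="SLSQP",
                   bounds=[(0.0, None)] * n,
                   constraints=[{"type": "eq", "fun": lambda lam: lam @ y}])
    return res.x
```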

Hard margin SVM.
Training: find $\lambda$ that solves
$\arg\max_\lambda \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j y_i y_j x_i^T x_j$
s.t. $\lambda_i \geq 0$ and $\sum_{i=1}^n \lambda_i y_i = 0$
Making predictions: classify a new instance $x$ based on the sign of
$s = b^* + \sum_{i=1}^n \lambda_i^* y_i x_i^T x$
Here $b^*$ can be found by noting that for an arbitrary training example $j$ with $\lambda_j^* > 0$ we must have
$y_j \left( b^* + \sum_{i=1}^n \lambda_i^* y_i x_i^T x_j \right) = 1$
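
A sketch of the remaining steps, recovering $w^*$ and $b^*$ from a dual solution and classifying a new point (variable and function names are illustrative):

```python
import numpy as np

def recover_w_b(lam, X, y):
    """w* = sum_i lam_i y_i x_i; b* from a support vector j (lam_j > 0),
    for which y_j (w.x_j + b) = 1, i.e. b = y_j - w.x_j since y_j is +-1."""
    w = (lam * y) @ X
    j = int(np.argmax(lam))          # index of a point with the largest lambda
    b = y[j] - X[j] @ w
    return w, b

def predict(lam, b, X, y, x_new):
    """Sign of b + sum_i lam_i y_i x_i . x_new."""
    return np.sign(b + (lam * y) @ (X @ x_new))
```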

Soft margin SVM.
Training: find $\lambda$ that solves
$\arg\max_\lambda \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j y_i y_j x_i^T x_j$
s.t. $C \geq \lambda_i \geq 0$ ("box constraints") and $\sum_{i=1}^n \lambda_i y_i = 0$
Making predictions: classify a new instance $x$ based on the sign of
$s = b^* + \sum_{i=1}^n \lambda_i^* y_i x_i^T x$
Here $b^*$ can be found by noting that for an arbitrary training example $j$ with $C > \lambda_j^* > 0$ we must have
$y_j \left( b^* + \sum_{i=1}^n \lambda_i^* y_i x_i^T x_j \right) = 1$
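
In practice this dual is solved by a library. A hedged example using scikit-learn, whose linear SVC trains a soft margin SVM; the parameter C is the same box-constraint bound, and the data here is synthetic:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)  # toy 2-class data
clf = SVC(C=1.0, kernel="linear")   # larger C penalises margin violations more
clf.fit(X, y)
print(clf.support_)                 # indices of the support vectors
print(clf.predict(X[:5]))           # predictions for the first few points
```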


Finishing Touches

Complementary slackness. Recall that one of the KKT conditions is complementary slackness:
$\lambda_i \left( y_i(w^T x_i + b) - 1 \right) = 0$
Remember that $y_i(w^T x_i + b) - 1 > 0$ means that $x_i$ is outside the margin. Points outside the margin must therefore have $\lambda_i = 0$. Since most of the training points are outside the margin, we expect most of the $\lambda$s to be zero: the solution is sparse. The points with non-zero $\lambda$s are the support vectors. (Figure: a two-class scatter plot in $x_1$, $x_2$ highlighting the support vectors on the margin.)
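
Given a dual solution, the support vectors can be read off directly (a sketch; the tolerance accounts for numerical solvers returning tiny non-zero values):

```python
import numpy as np

def support_vector_indices(lam, tol=1e-6):
    """Indices of points with non-zero lambda: the support vectors."""
    return np.where(lam > tol)[0]
```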

Making predictions. Predictions are based on the sign of
$b^* + \sum_{i=1}^n \lambda_i^* y_i x_i^T x$
Here $x$ is a new, previously unseen instance we need to classify. The prediction depends on a weighted sum of terms: the sum iterates over the training data points, but the $\lambda_i$ are non-zero only for support vectors. So effectively, the prediction is made by taking dot products (a sort of "comparison") of the new point with each of the support vectors.
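
Because only the support vectors contribute, the prediction sum can be restricted to them (a sketch, assuming NumPy arrays and labels in {-1, +1}):

```python
import numpy as np

def predict_sparse(lam, b, X, y, x_new, tol=1e-6):
    """Sign of b + sum over support vectors of lam_i y_i x_i . x_new."""
    sv = lam > tol                                   # mask of support vectors
    return np.sign(b + (lam[sv] * y[sv]) @ (X[sv] @ x_new))
```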

Solving the dual problem. The SVM Lagrangian dual problem is a quadratic optimisation problem. Using standard algorithms, this problem can be solved in $O(n^3)$ time. This is still inefficient for large data, so several specialised solutions have been proposed. These solutions mostly involve decomposing the training data and breaking down the problem into a number of smaller optimisation problems that can be solved quickly. The original SVM training algorithm, called chunking, exploits the fact that many of the $\lambda$s will be zero. Sequential minimal optimisation (SMO) is another algorithm, which can be viewed as an extreme case of chunking: SMO is an iterative procedure that analytically optimises randomly chosen pairs of $\lambda$s in each iteration.

This lecture:
- Soft margin SVM: intuition and problem formulation
- Solving the optimisation: transforming the original objective; re-parameterisation
- Finishing touches: complementary slackness; solving the dual problem