Deconstructing Data Science

Deconstructing Data Science. David Bamman, UC Berkeley. Info 290. Lecture 4: Regression overview. Feb 1, 2016

Regression: A mapping from input data $x$ (drawn from instance space $\mathcal{X}$) to a point $y$ in $\mathbb{R}$ ($\mathbb{R}$ = the set of real numbers). $x$ = the Empire State Building; $y$ = 17444.5625

Task: predicting box office revenue. $\mathcal{X}$: movie. $\mathcal{Y}$: $\mathbb{R}$

Experiment design
training: size 80%; purpose: training models
development: size 10%; purpose: model selection
testing: size 10%; purpose: evaluation; never look at it until the very end
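The 80% / 10% / 10% split above could be sketched as follows. This is a minimal illustration, not the course's code: the function name `split_dataset`, the fixed `seed`, and the choice to shuffle before splitting are all assumptions.

```python
import random

def split_dataset(examples, seed=0):
    """Shuffle a copy of `examples` and split it 80% / 10% / 10%.

    Hypothetical helper: the name, the seed, and the shuffling step
    are assumptions made for this sketch.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10   # 80% for training models
    n_dev = len(shuffled) // 10         # 10% for model selection
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]   # remainder, kept aside until the very end
    return train, dev, test

train, dev, test = split_dataset(range(100))
print(len(train), len(dev), len(test))  # 80 10 10
```

Integer arithmetic is used for the sizes so that rounding cannot shift an example from one split to another.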

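Metrics: Measure the difference between the prediction $\hat{y}$ and the true $y$.

Mean squared error (MSE):
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2$$

Mean absolute error (MAE):
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - y_i\right|$$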
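As a small worked example of the two metrics, the sketch below computes both on four made-up predictions. The function names and the numbers are illustrative assumptions, not data from the lecture.

```python
def mean_squared_error(y_pred, y_true):
    """Average of squared differences between predictions and true values."""
    n = len(y_true)
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n

def mean_absolute_error(y_pred, y_true):
    """Average of absolute differences between predictions and true values."""
    n = len(y_true)
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / n

# Made-up values for illustration only.
y_true = [3.0, 5.0, 2.0, 8.0]
y_pred = [2.0, 5.0, 4.0, 4.0]   # errors: -1, 0, +2, -4

mse = mean_squared_error(y_pred, y_true)
mae = mean_absolute_error(y_pred, y_true)
print(mse)  # 5.25
print(mae)  # 1.75
```

The single error of 4 contributes 16 of the 21 squared-error units but only 4 of the 7 absolute-error units, which is why MSE is the more sensitive of the two to large misses.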

Linear regression:
$$\hat{y} = \sum_{i=1}^{F} x_i \beta_i$$
$\beta \in \mathbb{R}^F$ (F-dimensional vector of real numbers)

Polynomial regression:
$$\hat{y} = \sum_{i=1}^{F} x_i \beta_{a,i} + \sum_{i=1}^{F} x_i^2 \beta_{b,i}$$
$\beta_a, \beta_b \in \mathbb{R}^F$ (F-dimensional vectors of real numbers)
[Figure: plot of $x^2$]

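Polynomial regression:
$$\hat{y} = \sum_{i=1}^{F} x_i \beta_{a,i} + \sum_{i=1}^{F} x_i^2 \beta_{b,i} + \sum_{i=1}^{F} x_i^3 \beta_{c,i}$$
$\beta_a, \beta_b, \beta_c \in \mathbb{R}^F$ (F-dimensional vectors of real numbers)
[Figure: plot of $x^3$]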

Nonlinear regression: deep learning, decision trees, probabilistic graphical models, random forests, support vector machines (regression), networks, neural networks

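Number of parameters

order 1 (linear reg.):
$$\hat{y} = \sum_{i=1}^{F} x_i \beta_{a,i}$$

order 2:
$$\hat{y} = \sum_{i=1}^{F} x_i \beta_{a,i} + \sum_{i=1}^{F} x_i^2 \beta_{b,i}$$

order 3:
$$\hat{y} = \sum_{i=1}^{F} x_i \beta_{a,i} + \sum_{i=1}^{F} x_i^2 \beta_{b,i} + \sum_{i=1}^{F} x_i^3 \beta_{c,i}$$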
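The three orders above can be evaluated with one small function, which also makes the parameter count visible: each order adds one weight vector with $F$ entries. This is a sketch under assumptions: `polynomial_predict`, `count_parameters`, and the example weights are hypothetical names and values chosen for illustration.

```python
def polynomial_predict(x, betas):
    """Evaluate y_hat = sum_i x_i*beta_a[i] + sum_i x_i**2*beta_b[i] + ...

    x:     list of F feature values
    betas: one length-F weight vector per order; betas[0] multiplies x_i,
           betas[1] multiplies x_i**2, and so on.
    """
    y_hat = 0.0
    for power, beta in enumerate(betas, start=1):
        y_hat += sum((x_i ** power) * b_i for x_i, b_i in zip(x, beta))
    return y_hat

def count_parameters(betas):
    """Total number of weights across all orders."""
    return sum(len(beta) for beta in betas)

# Made-up example with F = 2 features.
x = [1.0, 2.0]
beta_a = [1.0, 1.0]
beta_b = [0.5, 0.5]
beta_c = [0.25, 0.0]

order1 = polynomial_predict(x, [beta_a])                  # 1*1 + 2*1
order2 = polynomial_predict(x, [beta_a, beta_b])          # + 1*0.5 + 4*0.5
order3 = polynomial_predict(x, [beta_a, beta_b, beta_c])  # + 1*0.25 + 8*0
print(order1, order2, order3)                             # 3.0 5.5 5.75
print(count_parameters([beta_a, beta_b, beta_c]))         # 6
```

With $F = 2$, the order 1, 2 and 3 models have 2, 4 and 6 parameters respectively.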


[Figure: the instance space, containing several sets of labeled data]

[Figures: polynomial fits of increasing degree, one plot per degree, each labeled with its training MSE]
degree 1, training MSE = 73.4
degree 2, training MSE = 71.9
degree 3, training MSE = 60.9
degree 4, training MSE = 60.6
degree 5, training MSE = 59.1
degree 6, training MSE = 50.2
degree 7, training MSE = 49.6
degree 8, training MSE = 46.8
degree 9, training MSE = 41.2
degree 10, training MSE = 35.8
degree 11, training MSE = 21.1
degree 12, training MSE = 18.4

[Figure: four panels of data, labeled with MSE values 18.4 (the degree-12 training MSE), 118.8, 136.5 and 87.7]

[Figure: four panels of data, labeled with MSE values 73.4 (the degree-1 training MSE), 86.3, 65.0 and 94.7]

Overfitting: Memorizing the nuances (and noise) of the training data that prevents generalizing to unseen data
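A contrived illustration of this effect, using only the standard library. Note the substitutions: instead of a least-squares fit of degree 12, the "memorizing" model here is the degree-7 polynomial that passes exactly through all 8 training points (evaluated in Lagrange form), and the data are hand-made (true signal $y = 2x$ with alternating $\pm 1$ noise), not the data from the slides. All function and variable names are assumptions.

```python
def mse(model, xs, ys):
    """Mean squared error of `model` on the points (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_line(xs, ys):
    """Least-squares straight line (degree 1), closed form."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_interpolant(xs, ys):
    """Polynomial of degree len(xs) - 1 through every training point."""
    def predict(x):
        total = 0.0
        for i, (x_i, y_i) in enumerate(zip(xs, ys)):
            weight = 1.0  # Lagrange basis polynomial L_i evaluated at x
            for j, x_j in enumerate(xs):
                if j != i:
                    weight *= (x - x_j) / (x_i - x_j)
            total += y_i * weight
        return total
    return predict

# Hand-made data: true signal y = 2x, training labels carry +1/-1 noise.
train_x = [float(i) for i in range(8)]
train_y = [2 * x + (1 if i % 2 == 0 else -1) for i, x in enumerate(train_x)]
test_x = [i + 0.5 for i in range(7)]   # unseen points between training points
test_y = [2 * x for x in test_x]       # noise-free, for a clean comparison

line = fit_line(train_x, train_y)
interpolant = fit_interpolant(train_x, train_y)

line_train_mse = mse(line, train_x, train_y)
line_test_mse = mse(line, test_x, test_y)
interp_train_mse = mse(interpolant, train_x, train_y)
interp_test_mse = mse(interpolant, test_x, test_y)

print("degree 1: train", line_train_mse, "test", line_test_mse)
print("degree 7: train", interp_train_mse, "test", interp_test_mse)
```

The interpolating polynomial reproduces the training labels exactly, noise included, yet its error on the unseen midpoints is far larger than the straight line's, which is the pattern the definition describes.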

Sources of error
Bias: Error due to mis-specifying the relationship between input and the output. [too few parameters, or the wrong kinds]
Variance: Error due to sensitivity to random fluctuations in the training data. If you train on different data, do you get radically different predictions? [too many parameters]

[Figure: grid illustrating low vs. high variance and low vs. high bias. Image from Flach 2012]

Example:
High bias, low variance: Always predict Berkeley (geolocation on Twitter)
High bias, high variance: Predict most frequent city in training data
Low bias, high variance: many features, some of which capture true signal but others capture random noise
Low bias, low variance: enough features to capture the true signal
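The first two cases above can be made concrete with a toy sketch. Both predictors ignore the input entirely (hence the high bias); they differ in how much the prediction depends on which training sample happened to be drawn. The three training samples below are made up for illustration, and the function names are assumptions.

```python
from collections import Counter

def always_berkeley(training_cities):
    """Ignores the training data completely: same prediction every time."""
    return "Berkeley"

def most_frequent_city(training_cities):
    """Predicts whichever city is most common in this training sample."""
    return Counter(training_cities).most_common(1)[0][0]

# Three made-up training samples, as if drawn from the same population.
samples = [
    ["New York", "New York", "Berkeley"],
    ["London", "Berkeley", "London"],
    ["Tokyo", "Tokyo", "New York"],
]

constant_predictions = {always_berkeley(s) for s in samples}
frequency_predictions = {most_frequent_city(s) for s in samples}

print(constant_predictions)         # {'Berkeley'}
print(len(frequency_predictions))   # 3
```

The constant predictor yields one distinct prediction across the three samples (low variance), while the most-frequent-city predictor yields a different prediction for each sample (high variance).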

Ordinal regression: In between classification and regression. $\mathcal{Y}$ is categorical (e.g., star ratings), and the elements of $\mathcal{Y}$ are ordered.

Ordinal regression task: predicting star ratings. $\mathcal{X}$: movie. $\mathcal{Y}$: the ordered set of star ratings
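One common way to respect the ordering of the labels is to predict a real-valued score and then cut it into ordered categories with increasing thresholds. The sketch below shows only that last step, and it is an assumption rather than the method from the lecture: the five-star label set, the threshold values, and the name `score_to_stars` are all hypothetical.

```python
import bisect

# Hypothetical ordered label set and cut points (one fewer than labels).
STAR_LABELS = [1, 2, 3, 4, 5]
THRESHOLDS = [1.5, 2.5, 3.5, 4.5]

def score_to_stars(score):
    """Map a real-valued score to the ordered category whose interval contains it."""
    return STAR_LABELS[bisect.bisect_right(THRESHOLDS, score)]

print(score_to_stars(0.1))  # 1
print(score_to_stars(3.2))  # 3
print(score_to_stars(9.0))  # 5
```

Because the thresholds are increasing, a higher score can never map to a lower rating, which is the property that separates this setup from unordered classification.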

Computational Journalism
Sarah Cohen, James T. Hamilton, and Fred Turner, Computational Journalism, Communications of the ACM (2011)
Sylvain Parasie, Data-Driven Revelation? Epistemological tensions in investigative journalism in the age of big data, Digital Journalism (2015)

Computational Journalism: Changing how stories are discovered, presented, aggregated, monetized and archived (Cohen et al. 2011). Draws on the earlier tradition of computer-assisted reporting and precision journalism (Meyer 1972)

Computational Journalism: Database linking, e.g.:
voting records to the deceased
press releases from different members of Congress
indictments/settlements from U.S. attorneys
documents from SEC, Pentagon, defense contractors to note movement to industry (Cohen 2011)
DSA database of safety status of CA public schools + US seismic zones + school list from CA Dept of (Parasie 2015)

Computational Journalism: Information extraction: need to pull out people, places, organizations and their relationships from large (often sudden) dumps of documents. Analyzing the relationship between entities

Computational Journalism: Data-driven stories about large-scale trends. Relationship between birth year and political views, NY Times (Jul 7, 2014). Change in insured Americans under the ACA, NY Times (Oct 29, 2014)

Computational Journalism: Data-driven lead generation; the outliers in analysis that point to a story