Deconstructing Data Science

Similar documents
Deconstructing Data Science

Building an NFL performance metric

Predicting Horse Racing Results with Machine Learning

Projecting Three-Point Percentages for the NBA Draft

BASKETBALL PREDICTION ANALYSIS OF MARCH MADNESS GAMES CHRIS TSENG YIBO WANG

Estimating the Probability of Winning an NFL Game Using Random Forests

Fun Neural Net Demo Site. CS 188: Artificial Intelligence. N-Layer Neural Network. Multi-class Softmax Σ >0? Deep Learning II

EEC 686/785 Modeling & Performance Evaluation of Computer Systems. Lecture 6. Wenbing Zhao. Department of Electrical and Computer Engineering

Outline. Terminology. EEC 686/785 Modeling & Performance Evaluation of Computer Systems. Lecture 6. Steps in Capacity Planning and Management

Introduction to Machine Learning NPFL 054

PREDICTING the outcomes of sporting events

A Novel Approach to Predicting the Results of NBA Matches

Unit 4: Inference for numerical variables Lecture 3: ANOVA

PREDICTING THE NCAA BASKETBALL TOURNAMENT WITH MACHINE LEARNING. The Ringer/Getty Images

Title: 4-Way-Stop Wait-Time Prediction Group members (1): David Held

Efficiency Wages in Major League Baseball Starting. Pitchers Greg Madonia

CSC242: Intro to AI. Lecture 21

Evaluating and Classifying NBA Free Agents

Environmental Science: An Indian Journal

CS 221 PROJECT FINAL

Influence of Forecasting Factors and Methods or Bullwhip Effect and Order Rate Variance Ratio in the Two Stage Supply Chain-A Case Study

JPEG-Compatibility Steganalysis Using Block-Histogram of Recompression Artifacts

Basketball data science

Predicting Season-Long Baseball Statistics. By: Brandon Liu and Bryan McLellan

Chapter 12 Practice Test

Legendre et al Appendices and Supplements, p. 1

Announcements. % College graduate vs. % Hispanic in LA. % College educated vs. % Hispanic in LA. Problem Set 10 Due Wednesday.

knn & Naïve Bayes Hongning Wang

Lecture 5. Optimisation. Regularisation

Inferring land use from mobile phone activity

Universal Style Transfer via Feature Transforms

Introduction to Pattern Recognition

Predicting Horse Racing Results with TensorFlow

Predicting NBA Shots

CS 7641 A (Machine Learning) Sethuraman K, Parameswaran Raman, Vijay Ramakrishnan

B. AA228/CS238 Component

Neural Networks II. Chen Gao. Virginia Tech Spring 2019 ECE-5424G / CS-5824

Announcements. Lecture 19: Inference for SLR & Transformations. Online quiz 7 - commonly missed questions

Predicting the Total Number of Points Scored in NFL Games

Lane changing and merging under congested conditions in traffic simulation models

Player Availability Rating (PAR) - A Tool for Quantifying Skater Performance for NHL General Managers

Two Machine Learning Approaches to Understand the NBA Data

Standard 3.1 The student will plan and conduct investigations in which

Analysis of Variance. Copyright 2014 Pearson Education, Inc.

Section I: Multiple Choice Select the best answer for each problem.

A computer program that improves its performance at some task through experience.

Machine Learning Methods for Climbing Route Classification

intended velocity ( u k arm movements

Pairwise Comparison Models: A Two-Tiered Approach to Predicting Wins and Losses for NBA Games

Smart-Walk: An Intelligent Physiological Monitoring System for Smart Families

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag

Basketball field goal percentage prediction model research and application based on BP neural network

A) The linear correlation is weak, and the two variables vary in the same direction.

Matrix-analog measure-cerrelatepredict

Novel empirical correlations for estimation of bubble point pressure, saturated viscosity and gas solubility of crude oils

Journal of Quantitative Analysis in Sports Manuscript 1039

One Way ANOVA (Analysis of Variance)

How Do Injuries in the NFL Affect the Outcome of the Game

E STIMATING KILN SCHEDULES FOR TROPICAL AND TEMPERATE HARDWOODS USING SPECIFIC GRAVITY

Planning and Design of Proposed ByPass Road connecting Kalawad Road to Gondal Road, Rajkot - Using Autodesk Civil 3D Software.

Citation for published version (APA): Canudas Romo, V. (2003). Decomposition Methods in Demography Groningen: s.n.

Performance of Fully Automated 3D Cracking Survey with Pixel Accuracy based on Deep Learning

Modeling Salmon Behavior on the Umpqua River. By Scott Jordan 6/2/2015

Name May 3, 2007 Math Probability and Statistics

Staking plans in sports betting under unknown true probabilities of the event

A Machine Learning Approach to Predicting Winning Patterns in Track Cycling Omnium

Navigate to the golf data folder and make it your working directory. Load the data by typing

Competitive Performance of Elite Olympic-Distance Triathletes: Reliability and Smallest Worthwhile Enhancement

Systematic Review and Meta-analysis of Bicycle Helmet Efficacy to Mitigate Head, Face and Neck Injuries

FISH 415 LIMNOLOGY UI Moscow

Dynamic validation of Globwave SAR wave spectra data using an observation-based swell model. R. Husson and F. Collard

What Causes the Favorite-Longshot Bias? Further Evidence from Tennis

Using Spatio-Temporal Data To Create A Shot Probability Model

CSE 190a Project Report: Golf Club Head Tracking

FISH 415 LIMNOLOGY UI Moscow

An Empirical Comparison of Regression Analysis Strategies with Discrete Ordinal Variables

Introduction to topological data analysis

COMPLETING THE RESULTS OF THE 2013 BOSTON MARATHON

Application of Bayesian Networks to Shopping Assistance

An Investigation of Freeway Capacity Before and During Incidents

Business and housing market cycles in the euro area: a multivariate unobserved component approach

Introduction to Pattern Recognition

Naïve Bayes. Robot Image Credit: Viktoriya Sukhanova 123RF.com

GLMM standardisation of the commercial abalone CPUE for Zones A-D over the period

FREEWAY WORK ZONE SPEED MODEL DOCUMENTATION

Advanced Animal Science TEKS/LINKS Student Objectives One Credit

Grade 6 Math Circles Fall October 7/8 Statistics

Neural Network in Computer Vision for RoboCup Middle Size League

Real-Time Electricity Pricing

Analysis of Curling Team Strategy and Tactics using Curling Informatics

RUGBY is a dynamic, evasive, and highly possessionoriented

1.1 The size of the search space Modeling the problem Change over time Constraints... 21

Road Data Input System using Digital Map in Roadtraffic

The Economic Factors Analysis in Olympic Game

Lab 11: Introduction to Linear Regression

Naïve Bayes. Robot Image Credit: Viktoriya Sukhanova 123RF.com

A history-based estimation for LHCb job requirements

INTER-AMERICAN TROPICAL TUNA COMMISSION SCIENTIFIC ADVISORY COMMITTEE FOURTH MEETING. La Jolla, California (USA) 29 April - 3 May 2013

MICROSCOPIC ROAD SAFETY COMPARISON BETWEEN CANADIAN AND SWEDISH ROUNDABOUT DRIVER BEHAVIOUR

For IEC use only. Technical Committee TC3: Information structures, documentation and graphical symbols

Transcription:

Deconstructing Data Science David Bamman, UC Berkele Info 29 Lecture 4: Regression overview Jan 26, 217

Regression A mapping from input data (drawn from instance space ) to a point in R (R = the set of real numbers) = the empire state building = 17444.5625

task predicting bo office revenue movie total bo office predicting stock movements predicting vote share $TWTR price at time t+1 Clinton 47%

Regression Supervised learning Given training data in the form of <, > pairs, learn ĥ()

Regression Can ou create (or find) labeled data that marks that value for a bunch of eamples? Can ou make that choice? Can ou create features that might help in distinguishing those classes?

Eperiment design training development testing size 8% 1% 1% purpose training models model selection evaluation; never look at it until the ver end

Metrics Measure difference between the prediction ŷ and the true Mean squared error 1 N (ŷ i i ) 2 (MSE) N i=1 Mean absolute error 1 N ŷ i i (MAE) N i=1

ŷ MAE MSE 1 2 1 1 1 1.1.1.1 81.7% of 1 1 99 981 total MAE 98.6% of 1 5 4 16 1-5 6 36 1 1 9 81 1 3 2 4 1.9.1.1 1 1 121.2 9939.2 total MSE MSE error penalizes outliers more than MAE

Linear regression F ŷ = i β i i=1 2 15 1 5 β R F (F-dimensional vector of real numbers) 5 1 15 2

Polnomial regression F F ŷ = i β a,i + 2 i β b,i i=1 i=1 4 3 ^2 2 1 βa, βb R F (F-dimensional vector of real numbers) -2-1 1 2

Polnomial regression F F F ŷ = i β a,i + 2 i β b,i + 3 i β c,i i=1 i=1 i=1 5 ^3-5 βa, βb, βc R F (F-dimensional vector of real numbers) -2-1 1 2

Nonlinear regression Deep learning Decision trees Probabilistic graphical models Random forests Support vector machines (regression) Networks Neural networks

Number of Parameters order 1 (linear reg.) ŷ = F i=1 i β a,i F F order 2 ŷ = i β a,i + 2 i β b,i i=1 i=1 F F F order 3 ŷ = i β a,i + 2 i β b,i + 3 i β c,i i=1 i=1 i=1

2 2-2 -2 5 1 15 2 5 1 15 2 2 2-2 -2 5 1 15 2 5 1 15 2

instance space labeled data labeled data labeled data

2-2 5 1 15 2 degree 1, training MSE = 73.4

2-2 5 1 15 2 degree 2, training MSE = 71.9

2-2 5 1 15 2 degree 3, training MSE = 6.9

2-2 5 1 15 2 degree 4, training MSE = 6.6

2-2 5 1 15 2 degree 5, training MSE = 59.1

2-2 5 1 15 2 degree 6, training MSE = 5.2

2-2 5 1 15 2 degree 7, training MSE = 49.6

2-2 5 1 15 2 degree 8, training MSE = 46.8

2-2 5 1 15 2 degree 9, training MSE = 41.2

2-2 5 1 15 2 degree 1, training MSE = 35.8

2-2 5 1 15 2 degree 11, training MSE = 21.1

2-2 5 1 15 2 degree 12, training MSE = 18.4

2 2-2 18.4-2 5 1 15 2 5 1 15 2 2 2-2 -2 5 1 15 2 5 1 15 2

2 2-2 -2 18.4 118.8 5 1 15 2 5 1 15 2 2 2-2 -2 5 1 15 2 5 1 15 2

2 2-2 -2 18.4 118.8 5 1 15 2 5 1 15 2 2 2-2 136.5-2 5 1 15 2 5 1 15 2

2 2-2 -2 18.4 118.8 5 1 15 2 5 1 15 2 2 2-2 -2 136.5 87.7 5 1 15 2 5 1 15 2

2 2-2 -2 73.4 86.3 5 1 15 2 5 1 15 2 2 2-2 -2 65. 94.7 5 1 15 2 5 1 15 2

Overfitting Memorizing the nuances (and noise) of the training data that prevents generalizing to unseen data 2 2-2 -2 5 1 15 2 5 1 15 2

Sources of error Bias: Error due to mis-specifing the relationship between input and the output. [too few parameters, or the wrong kinds] Variance: Error due to sensitivit to random fluctuations in the training data. If ou train on different data, do ou get radicall different predictions? [too man parameters]

Low variance High variance Low bias High bias Image from Flach 212

Eample: High bias, low variance: Alwas predict Berkele geolocation on Twitter High bias, high variance: Predict most frequent cit in training data Low bias, high variance: man features, some of which capture true signal but capture random noise Low bias, low variance: enough features to capture the true signal

Ordinal regression In between classification and regression Y is categorical (e.g.,,, ) Elements of Y are ordered < < <

Ordinal regression task Y predicting star ratings movie {,, }

Computational Journalism Sarah Cohen, James T. Hamilton, and Fred Turner, Computational Journalism, Communications of the ACM (211) Slvain Parasie, Data-Driven Revelation? Epistemological tensions in investigative journalism in the age of big data, Digital Journalism (215)

Computational Journalism Changing how stories are discovered, presented, aggregated, monetized and archived (Cohen et al. 212) Draws on earlier tradition of computer-assisted reporting and precision journalism (Meer 1972)

Computational Journalism Database linking, e.g.: voting records to the deceased press releases from different members of congress indictments/settlements from U.S. attornes documents from SEC, Pentagon, defense contractors to note movement to industr (Cohen 212) DSA database of safet status of CA public schools + US seismic zones + school list from CA Dept of (Parasie 215)

Computational Journalism Information etraction: need to pull out people, places, organizations and their relationship from large (often sudden) dumps of documents. Analzing the relationship between entities

Computational Journalism Data-driven stories about large-scale trends Relationship between birth ear and political views NY Times (Jul 7, 214) Change in insured Americans under the ACA, NY Times (Oct 29, 214) 43

Computational Journalism Data-driven lead generation; the outliers in analsis that point to a stor

Computational Journalism Demands: High precision Fast turnaround Needs (Stra 216): Accurate document analsis Guided search Interactive methods

Project proposal, due 2/16 Collaborative project (involving up to 3 students), where the methods learned in class will be used to draw inferences about the world and criticall assess the qualit of those results. Proposal (2 pages): outline the work ou re going to undertake formulate a hpothesis to be eamined motivate its rationale as an interesting question worth asking assess its potential to contribute new knowledge b situating it within related literature in the scientific communit. (cite 5 relevant sources) who is the team and what are each of our responsibilities (everone gets the same grade)