Deconstructing Data Science
David Bamman, UC Berkeley
Info 290
Lecture 4: Regression overview
Feb 1, 2016

Regression
A mapping from input data x (drawn from instance space $\mathcal{X}$) to a point y in $\mathbb{R}$ ($\mathbb{R}$ = the set of real numbers)
x = the Empire State Building
y = 17444.5625

task                            X       Y
predicting box office revenue   movie   R

Experiment design
          training          development       testing
size      80%               10%               10%
purpose   training models   model selection   evaluation; never look at it until the very end
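The split above can be sketched in a few lines. This is a minimal illustration, not the course's own code: the function name `split_data`, the fixed seed, and the choice to shuffle before splitting are all assumptions.

```python
import random

def split_data(examples, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle, then split into (train, dev, test); test gets the remainder."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled[:n_train]                 # used to train models
    dev = shuffled[n_train:n_train + n_dev]    # used for model selection
    test = shuffled[n_train + n_dev:]          # used once, for final evaluation
    return train, dev, test

train, dev, test = split_data(range(100))
print(len(train), len(dev), len(test))
```

Fixing the seed keeps the test set the same across runs, which makes it harder to look at it by accident through repeated re-splitting.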

Metrics
Measure the difference between the prediction $\hat{y}$ and the true $y$
Mean squared error (MSE): $\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$
Mean absolute error (MAE): $\frac{1}{N} \sum_{i=1}^{N} |\hat{y}_i - y_i|$
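As a small worked example of the two metrics, the sketch below computes both on three made-up predictions. The function names and the toy values are assumptions chosen for illustration.

```python
def mean_squared_error(y_pred, y_true):
    """Average of squared differences between predictions and true values."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

def mean_absolute_error(y_pred, y_true):
    """Average of absolute differences between predictions and true values."""
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)

y_true = [3.0, 5.0, 2.0]   # toy values, not from the lecture
y_pred = [2.0, 5.0, 4.0]   # errors are -1, 0, and +2

mse = mean_squared_error(y_pred, y_true)   # (1 + 0 + 4) / 3
mae = mean_absolute_error(y_pred, y_true)  # (1 + 0 + 2) / 3
print(mse, mae)
```

Squaring makes MSE weigh the single error of 2 more heavily than MAE does.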

Linear regression
$\hat{y} = \sum_{i=1}^{F} x_i \beta_i$
$\beta \in \mathbb{R}^F$ (F-dimensional vector of real numbers)
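The prediction rule is a dot product between the feature vector and the weight vector. A minimal sketch, with `predict_linear` and the example numbers as assumptions:

```python
def predict_linear(x, beta):
    """Return the sum over i of x[i] * beta[i] for F-dimensional x and beta."""
    if len(x) != len(beta):
        raise ValueError("x and beta must have the same number of features")
    return sum(x_i * b_i for x_i, b_i in zip(x, beta))

x = [1.0, 2.0, 3.0]       # toy feature vector, F = 3
beta = [0.5, -1.0, 2.0]   # toy weights, one per feature

y_hat = predict_linear(x, beta)  # 0.5 - 2.0 + 6.0
print(y_hat)
```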

Polynomial regression
$\hat{y} = \sum_{i=1}^{F} x_i \beta_{a,i} + \sum_{i=1}^{F} x_i^2 \beta_{b,i}$
$\beta_a, \beta_b \in \mathbb{R}^F$ (F-dimensional vector of real numbers)
[Figure: plot of x^2 for x from -2 to 2]

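Polynomial regression
$\hat{y} = \sum_{i=1}^{F} x_i \beta_{a,i} + \sum_{i=1}^{F} x_i^2 \beta_{b,i} + \sum_{i=1}^{F} x_i^3 \beta_{c,i}$
$\beta_a, \beta_b, \beta_c \in \mathbb{R}^F$ (F-dimensional vector of real numbers)
[Figure: plot of x^3 for x from -2 to 2]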
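The order-2 and order-3 formulas follow one pattern: each extra order adds another weight vector applied to the next power of every feature. A sketch under that reading, where `predict_polynomial` and the toy numbers are assumptions:

```python
def predict_polynomial(x, betas):
    """betas[k] holds the F weights applied to each feature raised to the power k + 1.

    betas = [beta_a] is linear regression, [beta_a, beta_b] is order 2,
    and [beta_a, beta_b, beta_c] is order 3.
    """
    return sum(
        (x_i ** (k + 1)) * b_i
        for k, beta in enumerate(betas)
        for x_i, b_i in zip(x, beta)
    )

x = [2.0, -1.0]          # toy feature vector, F = 2
beta_a = [1.0, 0.5]      # weights on x_i
beta_b = [0.25, 2.0]     # weights on x_i^2
beta_c = [0.5, 1.0]      # weights on x_i^3

order2 = predict_polynomial(x, [beta_a, beta_b])          # 1.5 + 3.0
order3 = predict_polynomial(x, [beta_a, beta_b, beta_c])  # 1.5 + 3.0 + 3.0
print(order2, order3)
```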

Nonlinear regression
Deep learning
Decision trees
Probabilistic graphical models
Random forests
Support vector machines (regression)
Networks
Neural networks

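Number of Parameters
order 1 (linear reg.): $\hat{y} = \sum_{i=1}^{F} x_i \beta_{a,i}$
order 2: $\hat{y} = \sum_{i=1}^{F} x_i \beta_{a,i} + \sum_{i=1}^{F} x_i^2 \beta_{b,i}$
order 3: $\hat{y} = \sum_{i=1}^{F} x_i \beta_{a,i} + \sum_{i=1}^{F} x_i^2 \beta_{b,i} + \sum_{i=1}^{F} x_i^3 \beta_{c,i}$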

[Figure: an instance space containing three sets of labeled data]

Training MSE by polynomial degree
degree   training MSE
1        73.4
2        71.9
3        60.9
4        60.6
5        59.1
6        50.2
7        49.6
8        46.8
9        41.2
10       35.8
11       21.1
12       18.4

[Figure: four panels of fits with their MSE values: 18.4, 118.8, 136.5, 87.7]

[Figure: four panels of fits with their MSE values: 73.4, 86.3, 65.0, 94.7]

Overfitting
Memorizing the nuances (and noise) of the training data in a way that prevents generalizing to unseen data
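The effect can be reproduced on toy data: training MSE can only go down as the polynomial degree grows, while error on held-out points need not. The sketch below is an illustration only, not the lecture's experiment. The data points are invented, and the least-squares solver (normal equations with Gaussian elimination) is a stand-in chosen to avoid external libraries.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [c0, ..., c_degree] via the normal equations."""
    n = degree + 1
    # Build (A^T A) and (A^T y), where A[j][k] = xs[j] ** k.
    ata = [[sum(x ** (r + c) for x in xs) for c in range(n)] for r in range(n)]
    aty = [sum((x ** r) * y for x, y in zip(xs, ys)) for r in range(n)]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    m = [row + [b] for row, b in zip(ata, aty)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (m[r][n] - tail) / m[r][r]
    return coeffs

def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial with these coefficients."""
    preds = [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

# Invented data: roughly y = 2x plus hand-picked noise.
train_x = [-1.0, -0.6, -0.2, 0.2, 0.6, 1.0]
train_y = [-1.7, -1.6, -0.2, 0.1, 1.6, 1.8]
heldout_x = [-0.8, -0.4, 0.0, 0.4, 0.8]
heldout_y = [-1.5, -0.9, 0.1, 0.7, 1.7]

results = {}
for degree in (1, 3, 5):
    coeffs = fit_polynomial(train_x, train_y, degree)
    results[degree] = {
        "train": mse(coeffs, train_x, train_y),
        "heldout": mse(coeffs, heldout_x, heldout_y),
    }
    print(degree, results[degree])
```

With six training points, the degree-5 polynomial passes through every one of them, so its training MSE is essentially zero: it has memorized the noise. Whether that helps on the held-out points is what the second number in each row reports.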

Sources of error
Bias: error due to mis-specifying the relationship between the input and the output. [too few parameters, or the wrong kinds]
Variance: error due to sensitivity to random fluctuations in the training data. If you train on different data, do you get radically different predictions? [too many parameters]
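One way to make the two terms concrete is to retrain on many freshly sampled training sets and look at the predictions for a single input: squared bias is how far the average prediction sits from the truth, and variance is how much the predictions scatter. The sketch below assumes a made-up true function (y = 2x), Gaussian noise, and two deliberately extreme models. None of these come from the lecture.

```python
import random
import statistics

def true_function(x):
    return 2.0 * x  # assumed ground truth, for illustration only

def sample_training_set(rng, n=20, noise_sd=0.5):
    """Draw n noisy (x, y) pairs with x uniform in [0, 1)."""
    xs = [rng.random() for _ in range(n)]
    return [(x, true_function(x) + rng.gauss(0.0, noise_sd)) for x in xs]

def constant_model(train, x):
    """Ignores the training data entirely: always predicts 0."""
    return 0.0

def nearest_neighbor_model(train, x):
    """Copies the label of the closest training point, noise included."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def bias_and_variance(model, x0, n_trials=500, seed=0):
    """Estimate squared bias and variance of model's prediction at x0."""
    rng = random.Random(seed)
    preds = [model(sample_training_set(rng), x0) for _ in range(n_trials)]
    bias_sq = (statistics.fmean(preds) - true_function(x0)) ** 2
    variance = statistics.pvariance(preds)
    return bias_sq, variance

bias_const, var_const = bias_and_variance(constant_model, 0.9)
bias_nn, var_nn = bias_and_variance(nearest_neighbor_model, 0.9)
print("constant:", bias_const, var_const)
print("nearest neighbor:", bias_nn, var_nn)
```

The constant model never changes its answer, so its variance is exactly zero and all of its error is bias. The nearest-neighbor model is right on average but inherits the noise of whichever training point happens to be closest.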

[Figure: 2 x 2 grid of low vs. high variance against low vs. high bias. Image from Flach 2012]

Example:
High bias, low variance: always predict Berkeley geolocation on Twitter
High bias, high variance: predict most frequent city in training data
Low bias, high variance: many features, some of which capture true signal but some of which capture random noise
Low bias, low variance: enough features to capture the true signal

Ordinal regression
In between classification and regression
Y is categorical (e.g., star ratings)
Elements of Y are ordered

Ordinal regression
task                       X       Y
predicting star ratings    movie   an ordered set of star ratings
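A quick way to see what "categorical but ordered" means in practice is to encode each label by its rank, so that comparisons between labels become meaningful, and to snap a real-valued score back to the nearest label. This is only one simple baseline (regress on the rank, then round), not necessarily the approach the lecture has in mind. The label set `RATINGS` is hypothetical.

```python
RATINGS = ["1 star", "2 stars", "3 stars"]  # hypothetical ordered label set

def to_rank(label):
    """Map a label to its position in the ordering (0 is the lowest)."""
    return RATINGS.index(label)

def from_score(score):
    """Snap a real-valued score to the nearest valid label.

    Python's round() rounds exact halves to the nearest even integer,
    and the result is clipped so out-of-range scores still map to a label.
    """
    rank = min(max(round(score), 0), len(RATINGS) - 1)
    return RATINGS[rank]

print(to_rank("1 star") < to_rank("3 stars"))  # the ordering is preserved
print(from_score(1.4), from_score(-3.0), from_score(7.2))
```

Treating ranks as numbers assumes the gaps between adjacent ratings are equal, which is a modeling choice rather than a property of ordinal data.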

Computational Journalism
Sarah Cohen, James T. Hamilton, and Fred Turner, "Computational Journalism," Communications of the ACM (2011)
Sylvain Parasie, "Data-Driven Revelation? Epistemological tensions in investigative journalism in the age of big data," Digital Journalism (2015)

Computational Journalism
Changing how stories are discovered, presented, aggregated, monetized and archived (Cohen et al. 2012)
Draws on earlier tradition of computer-assisted reporting and precision journalism (Meyer 1972)

Computational Journalism
Database linking, e.g.:
voting records to the deceased
press releases from different members of Congress
indictments/settlements from U.S. attorneys
documents from SEC, Pentagon, defense contractors to note movement to industry
(Cohen 2012)
DSA database of safety status of CA public schools + US seismic zones + school list from CA Dept of (Parasie 2015)
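Database linking usually comes down to joining two record sets on a shared key. The sketch below links voting records to death records, in the spirit of the first example. Every record, field name, and the normalization rule are invented for illustration, and real linkage would need far more careful name matching.

```python
def link_key(record):
    """Normalize name and date of birth into a comparable key."""
    name = " ".join(record["name"].lower().split())
    return (name, record["dob"])

def votes_after_death(votes, deaths):
    """Return (name, vote date, death date) for votes cast after a matching death."""
    died_on = {link_key(d): d["died"] for d in deaths}
    flagged = []
    for vote in votes:
        death_date = died_on.get(link_key(vote))
        # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
        if death_date is not None and vote["voted"] > death_date:
            flagged.append((vote["name"], vote["voted"], death_date))
    return flagged

# Hypothetical records.
votes = [
    {"name": "Ada Example", "dob": "1930-01-02", "voted": "2014-11-04"},
    {"name": "Bo Sample", "dob": "1955-03-04", "voted": "2014-11-04"},
    {"name": "Cy Placeholder", "dob": "1941-07-08", "voted": "2010-11-02"},
]
deaths = [
    {"name": "ada  example", "dob": "1930-01-02", "died": "2012-05-06"},
    {"name": "Cy Placeholder", "dob": "1941-07-08", "died": "2013-09-10"},
]

flagged = votes_after_death(votes, deaths)
print(flagged)
```

A match here is a lead to check, not a finding: two different people can share a name and a birth date.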

Computational Journalism
Information extraction: need to pull out people, places, organizations and their relationships from large (often sudden) dumps of documents.
Analyzing the relationship between entities

Computational Journalism
Data-driven stories about large-scale trends
Relationship between birth year and political views, NY Times (Jul 7, 2014)
Change in insured Americans under the ACA, NY Times (Oct 29, 2014)

Computational Journalism
Data-driven lead generation: the outliers in an analysis that point to a story
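One simple way to surface such outliers is to flag values that sit far from the mean in units of standard deviation. The z-score rule, the threshold of 2, and the district figures below are all assumptions chosen for illustration. The slide does not name a method.

```python
import statistics

def find_outliers(values_by_name, threshold=2.0):
    """Return (name, value) pairs more than `threshold` standard deviations from the mean."""
    values = list(values_by_name.values())
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []  # all values identical, so nothing stands out
    return [
        (name, value)
        for name, value in values_by_name.items()
        if abs(value - mean) / sd > threshold
    ]

# Hypothetical per-district figures.
rates = {
    "District A": 10, "District B": 12, "District C": 11, "District D": 13,
    "District E": 12, "District F": 11, "District G": 50,
}

leads = find_outliers(rates)
print(leads)
```

A flagged value is a starting point for reporting rather than a conclusion, since it may reflect a data-entry error as easily as a real story.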