Deconstructing Data Science
David Bamman, UC Berkeley
Info 290
Lecture 4: Regression overview
Jan 26, 2017
Regression
A mapping from input data x (drawn from the instance space X) to a point in ℝ (ℝ = the set of real numbers)
x = the Empire State Building, y = 17444.5625
task | x | y
predicting box office revenue | movie | total box office
predicting stock movements | $TWTR | price at time t+1
predicting vote share | | Clinton 47%
Regression
Supervised learning: given training data in the form of ⟨x, y⟩ pairs, learn ĥ(x)
Regression
• Can you create (or find) labeled data that marks that value for a bunch of examples? Can you make that choice?
• Can you create features that might help in distinguishing those classes?
Experiment design
split | size | purpose
training | 80% | training models
development | 10% | model selection
testing | 10% | evaluation; never look at it until the very end
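The 80/10/10 split above can be sketched with a small helper; this is a minimal illustration (the function name, the shuffling seed, and the use of Python's standard library are my own choices, not from the slides):

```python
import random

def train_dev_test_split(data, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle and partition data into train/dev/test splits
    (defaults give the 80/10/10 split from the slide)."""
    data = list(data)
    random.Random(seed).shuffle(data)  # fixed seed for reproducibility
    n = len(data)
    n_test = int(n * test_frac)
    n_dev = int(n * dev_frac)
    test = data[:n_test]
    dev = data[n_test:n_test + n_dev]
    train = data[n_test + n_dev:]
    return train, dev, test

train, dev, test = train_dev_test_split(range(100))
print(len(train), len(dev), len(test))  # 80 10 10
```

Shuffling before splitting matters: if the data arrives in some order (e.g. by date), a non-shuffled split would make the test set systematically different from the training set.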
Metrics
Measure the difference between the prediction ŷ and the true y:
Mean squared error (MSE): (1/N) Σ_{i=1}^N (ŷ_i − y_i)²
Mean absolute error (MAE): (1/N) Σ_{i=1}^N |ŷ_i − y_i|
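Both metrics are one-liners; a minimal sketch (the example predictions are invented for illustration):

```python
def mse(y_hat, y):
    """Mean squared error: average of squared differences."""
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)

def mae(y_hat, y):
    """Mean absolute error: average of absolute differences."""
    return sum(abs(p - t) for p, t in zip(y_hat, y)) / len(y)

preds = [1, 1, 100]   # hypothetical predictions; the last one is a big miss
truth = [2, 1.1, 1]
print(mae(preds, truth))  # (1 + 0.1 + 99) / 3 ≈ 33.37
print(mse(preds, truth))  # (1 + 0.01 + 9801) / 3 ≈ 3267.34
```

Note how the single outlier (100 vs. 1) contributes 99/100.1 of the total absolute error but 9801/9802.01 of the total squared error — the point the next slide makes in table form.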
ŷ | y | MAE | MSE
1 | 2 | 1 | 1
1 | 1.1 | 0.1 | 0.01
100 | 1 | 99 (81.7% of total MAE) | 9801 (98.6% of total MSE)
1 | 5 | 4 | 16
1 | −5 | 6 | 36
10 | 1 | 9 | 81
1 | 3 | 2 | 4
1 | 0.9 | 0.1 | 0.01
total | | 121.2 | 9939.02
MSE penalizes outliers more than MAE
Linear regression
ŷ = Σ_{i=1}^F x_i β_i
β ∈ ℝ^F (F-dimensional vector of real numbers)
[plot: linear fit to scattered data]
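A minimal sketch of fitting β by least squares with NumPy (the synthetic data and the "true" coefficients are invented; with noiseless targets the solver recovers them exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
F = 3                                     # number of features
X = rng.normal(size=(50, F))              # 50 instances, F features each
true_beta = np.array([2.0, -1.0, 0.5])    # hypothetical true weights
y = X @ true_beta                         # noiseless targets for the sketch

# Solve min_beta ||X beta - y||^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 3))                  # recovers [ 2.  -1.   0.5]
y_hat = X @ beta                          # predictions: ŷ = Σ_i x_i β_i
```

Real data would add noise to y, and the recovered β would then only approximate the true weights.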
Polynomial regression
ŷ = Σ_{i=1}^F x_i β_{a,i} + Σ_{i=1}^F x_i² β_{b,i}
β_a, β_b ∈ ℝ^F (F-dimensional vectors of real numbers)
[plot: x² curve]
Polynomial regression
ŷ = Σ_{i=1}^F x_i β_{a,i} + Σ_{i=1}^F x_i² β_{b,i} + Σ_{i=1}^F x_i³ β_{c,i}
β_a, β_b, β_c ∈ ℝ^F (F-dimensional vectors of real numbers)
[plot: x³ curve]
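Polynomial regression is just linear regression on an expanded design matrix whose columns are the powers of each feature. A sketch for a single feature x (the cubic "signal" coefficients are invented; since y is built exactly from the basis columns, least squares recovers them):

```python
import numpy as np

def poly_design(x, order):
    """Stack columns [x, x**2, ..., x**order] for a single feature x,
    matching the slide's formulas (no intercept term)."""
    return np.column_stack([x ** p for p in range(1, order + 1)])

x = np.linspace(-2, 2, 40)
y = 1.5 * x + 0.8 * x ** 2 - 0.3 * x ** 3   # hypothetical cubic signal

X3 = poly_design(x, 3)                       # order-3 expansion
beta, *_ = np.linalg.lstsq(X3, y, rcond=None)
print(np.round(beta, 3))                     # [ 1.5  0.8 -0.3]
```

The model stays linear in the parameters β, which is why the same least-squares solver works; only the features are nonlinear in x.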
Nonlinear regression
• Deep learning / neural networks
• Decision trees
• Random forests
• Probabilistic graphical models
• Support vector machines (regression)
Number of parameters
order 1 (linear reg.): ŷ = Σ_{i=1}^F x_i β_{a,i} → F parameters
order 2: ŷ = Σ_{i=1}^F x_i β_{a,i} + Σ_{i=1}^F x_i² β_{b,i} → 2F parameters
order 3: ŷ = Σ_{i=1}^F x_i β_{a,i} + Σ_{i=1}^F x_i² β_{b,i} + Σ_{i=1}^F x_i³ β_{c,i} → 3F parameters
[figure: the instance space and three different samples of labeled data drawn from it]
[plots: polynomial fits of increasing degree to the same training data]
degree 1, training MSE = 73.4
degree 2, training MSE = 71.9
degree 3, training MSE = 60.9
degree 4, training MSE = 60.6
degree 5, training MSE = 59.1
degree 6, training MSE = 50.2
degree 7, training MSE = 49.6
degree 8, training MSE = 46.8
degree 9, training MSE = 41.2
degree 10, training MSE = 35.8
degree 11, training MSE = 21.1
degree 12, training MSE = 18.4
[four plots: the degree-12 fit applied to four samples of labeled data; MSE = 18.4 (training sample), 118.8, 136.5, 87.7]
[four plots: the degree-1 fit applied to the same four samples; MSE = 73.4, 86.3, 65.0, 94.7]
Overfitting
Memorizing the nuances (and noise) of the training data, which prevents generalizing to unseen data.
[plots: a high-degree fit on training data vs. the same fit on new data]
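The pattern above can be reproduced on synthetic data: training MSE keeps falling as the polynomial degree grows, while error on fresh data from the same process does not. A minimal sketch (the sine signal, noise level, and degrees are my own choices; the slide's exact data is not available):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
signal = np.sin(3 * x)
y_train = signal + rng.normal(scale=0.3, size=x.size)
y_test = signal + rng.normal(scale=0.3, size=x.size)   # fresh noise, same signal

train_mse, test_mse = {}, {}
for degree in (1, 3, 12):
    coefs = np.polyfit(x, y_train, degree)              # least-squares polynomial fit
    train_mse[degree] = np.mean((np.polyval(coefs, x) - y_train) ** 2)
    test_mse[degree] = np.mean((np.polyval(coefs, x) - y_test) ** 2)
    print(degree, round(train_mse[degree], 3), round(test_mse[degree], 3))
```

Because a degree-12 basis contains the degree-1 basis, the degree-12 training MSE is guaranteed to be at least as low; whether its test MSE is worse depends on the noise, which is exactly the variance problem discussed next.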
Sources of error
• Bias: error due to mis-specifying the relationship between the input and the output [too few parameters, or the wrong kinds].
• Variance: error due to sensitivity to random fluctuations in the training data. If you train on different data, do you get radically different predictions? [too many parameters]
[figure: the four combinations of low/high bias × low/high variance; image from Flach 2012]
Example: geolocation on Twitter
• High bias, low variance: always predict Berkeley
• High bias, high variance: predict the most frequent city in the training data
• Low bias, high variance: many features, some of which capture the true signal but some of which capture random noise
• Low bias, low variance: enough features to capture the true signal
Ordinal regression
• In between classification and regression
• Y is categorical (e.g., ★, ★★, ★★★)
• Elements of Y are ordered: ★ < ★★ < ★★★ < ★★★★
Ordinal regression
task | x | Y
predicting star ratings | movie | {★, ★★, ★★★}
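One simple way to exploit the ordering is to predict a real-valued score with an ordinary regressor and then cut it into ordered classes at thresholds. A minimal sketch (the function name, the fixed cutpoints, and the example scores are all hypothetical; in practice the cutpoints would be learned, as in ordered logit/probit models):

```python
def to_stars(score, cutpoints=(1.5, 2.5)):
    """Map a real-valued prediction to ordered classes 1 < 2 < 3 stars
    by thresholding at fixed cutpoints (hypothetical values)."""
    stars = 1
    for c in cutpoints:
        if score > c:
            stars += 1
    return stars

print([to_stars(s) for s in (0.7, 1.9, 3.2)])  # [1, 2, 3]
```

Unlike plain classification, this construction can never predict ★★★ for an instance scored below one predicted as ★★ — the ordering of Y is respected by design.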
Computational Journalism
• Sarah Cohen, James T. Hamilton, and Fred Turner, "Computational Journalism," Communications of the ACM (2011)
• Sylvain Parasie, "Data-Driven Revelation? Epistemological tensions in investigative journalism in the age of big data," Digital Journalism (2015)
Computational Journalism
Changing how stories are "discovered, presented, aggregated, monetized and archived" (Cohen et al. 2011). Draws on the earlier traditions of computer-assisted reporting and precision journalism (Meyer 1972).
Computational Journalism
Database linking, e.g.:
• voting records to the deceased
• press releases from different members of Congress
• indictments/settlements from U.S. attorneys; documents from the SEC, Pentagon, and defense contractors, to note movement to industry (Cohen et al. 2011)
• DSA database of the safety status of CA public schools + US seismic zones + school list from the CA Dept. of Education (Parasie 2015)
Computational Journalism
Information extraction: the need to pull out people, places, and organizations — and their relationships — from large (often sudden) dumps of documents; analyzing the relationships between entities.
Computational Journalism
Data-driven stories about large-scale trends:
• the relationship between birth year and political views, NY Times (Jul 7, 2014)
• change in insured Americans under the ACA, NY Times (Oct 29, 2014)
Computational Journalism
Data-driven lead generation: the outliers in an analysis that point to a story.
Computational Journalism
Demands: high precision; fast turnaround.
Needs (Stray 2016): accurate document analysis; guided search; interactive methods.
Project proposal, due 2/16
Collaborative project (involving up to 3 students), where the methods learned in class will be used to draw inferences about the world and critically assess the quality of those results.
Proposal (2 pages):
• outline the work you're going to undertake
• formulate a hypothesis to be examined
• motivate its rationale as an interesting question worth asking
• assess its potential to contribute new knowledge by situating it within related literature in the scientific community (cite 5 relevant sources)
• who is the team and what are each of your responsibilities (everyone gets the same grade)