Model Selection Erwan Le Pennec Fall 2015


    library("dplyr")
    library("ggplot2")
    library("ggfortify")
    library("reshape2")

Model Selection

We will now use another classical dataset, birthwt, which comes from a study on risk factors associated with low infant birth weight conducted at Baystate Medical Center, Springfield, Mass. It consists of 189 observations of 10 variables.

    Variable  Content
    low       indicator of birth weight less than 2.5 kg.
    age       mother's age in years.
    lwt       mother's weight in pounds at last menstrual period.
    race      mother's race (1 = white, 2 = black, 3 = other).
    smoke     smoking status during pregnancy.
    ptl       number of previous premature labors.
    ht        history of hypertension.
    ui        presence of uterine irritability.
    ftv       number of physician visits during the first trimester.
    bwt       birth weight in grams.

Our goal will be to predict bwt, the birth weight, from all the other variables (except low!).

1. Load the dataset from the package MASS and inspect it with glimpse.

    lbw <- MASS::birthwt
    glimpse(lbw)

    Observations: 189
    Variables: 10
    $ low   (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
    $ age   (int) 19, 33, 20, 21, 18, 21, 22, 17, 29, 26, 19, 19, 22, 30,...
    $ lwt   (int) 182, 155, 105, 108, 107, 124, 118, 103, 123, 113, 95,
    $ race  (int) 2, 3, 1, 1, 1, 3, 1, 3, 1, 1, 3, 3, 3, 3, 1, 1, 2, 1, 3,...
    $ smoke (int) 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0,...
    $ ptl   (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,...
    $ ht    (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
    $ ui    (int) 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,...
    $ ftv   (int) 0, 3, 1, 2, 0, 0, 1, 1, 1, 0, 0, 1, 0, 2, 0, 0, 0, 3, 0,...
    $ bwt   (int) 2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663,

2. Fix the different factor issues.

    lbw <- mutate(lbw, low = factor(low, levels = c(0, 1), labels = c("normal", "low")))
    lbw <- mutate(lbw, race = factor(race, levels = c(1, 2, 3), labels = c("white", "black", "other")))
    lbw <- mutate(lbw, smoke = factor(smoke, levels = c(0, 1), labels = c("no", "yes")))
    lbw <- mutate(lbw, ht = factor(ht, levels = c(0, 1), labels = c("no", "yes")))
    lbw <- mutate(lbw, ui = factor(ui, levels = c(0, 1), labels = c("no", "yes")))
    lbw <- select(lbw, -low)
    glimpse(lbw)

    Observations: 189
    Variables: 9
    $ age   (int) 19, 33, 20, 21, 18, 21, 22, 17, 29, 26, 19, 19, 22, 30,...
    $ lwt   (int) 182, 155, 105, 108, 107, 124, 118, 103, 123, 113, 95,
    $ race  (fctr) black, other, white, white, white, other, white, other,...
    $ smoke (fctr) no, no, yes, yes, yes, no, no, no, yes, yes, no, no, no...
    $ ptl   (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,...
    $ ht    (fctr) no, no, no, no, no, no, no, no, no, no, no, no, yes, no...
    $ ui    (fctr) yes, no, no, yes, yes, no, no, no, no, no, no, no, no,...
    $ ftv   (int) 0, 3, 1, 2, 0, 0, 1, 1, 1, 0, 0, 1, 0, 2, 0, 0, 0, 3, 0,...
    $ bwt   (int) 2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663,

3. Verify that the dataset does not contain any missing values.

    summary(lbw)

          age             lwt           race      smoke        ptl
     Min.   :14.00   Min.   : 80.0   white:96   no :115   Min.   :
     1st Qu.:        1st Qu.:110.0   black:26   yes: 74   1st Qu.:
     Median :23.00   Median :121.0   other:67             Median :
     Mean   :23.24   Mean   :129.8                        Mean   :
     3rd Qu.:        3rd Qu.:                             3rd Qu.:
     Max.   :45.00   Max.   :250.0                        Max.   :

       ht         ui          ftv            bwt
     no :177   no :161   Min.   :       Min.   : 709
     yes: 12   yes: 28   1st Qu.:       1st Qu.:2414
                         Median :       Median :2977
                         Mean   :       Mean   :2945
                         3rd Qu.:       3rd Qu.:3487
                         Max.   :       Max.   :

4. Inspect visually all the variables independently.

    for (name in names(lbw)) {
      print(qplot(data = lbw, get(name), xlab = name))
    }
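The check for missing values asked for above can also be made explicit instead of being read off the summary. The following is a minimal sketch using only base R and the raw MASS::birthwt data; the names lbw_raw, na_per_column and no_missing are illustrative and do not appear in the handout.

```r
# Raw data set as shipped with MASS (10 columns, before any recoding)
lbw_raw <- MASS::birthwt

# Number of missing values in each column
na_per_column <- colSums(is.na(lbw_raw))
print(na_per_column)

# TRUE when no value is missing anywhere in the data frame
no_missing <- !anyNA(lbw_raw)
print(no_missing)
```

The per-column count is useful when some values are missing, since it shows where to look; the single logical is convenient as a guard at the top of a script.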

[Figures: count plots of age, lwt, race, smoke, ptl, ht, ui and ftv.]

[Figure: count plot of bwt.]

5. Inspect visually the relation between every variable and bwt. Can you infer the most useful variables?

    for (name in names(lbw)[-9]) {
      if (is.factor(lbw[[name]])) {
        # Factors: boxplot of bwt per level, with the jittered points on top
        print(ggplot(data = lbw, aes_string(x = name, y = "bwt")) +
                geom_boxplot() +
                geom_point(position = position_jitter(width = .1)))
      } else {
        # Numeric variables: scatter plot with a smoothed trend
        print(ggplot(data = lbw, aes_string(x = name, y = "bwt")) +
                geom_point(position = position_jitter(width = .1)) +
                geom_smooth())
      }
    }

[Figures: bwt plotted against age, lwt, race, smoke, ptl, ht, ui and ftv.]

6. Compute the full regression with all the variables and compute its summary (and maybe its diagnostic plots).

    reglbw <- lm(bwt ~ ., data = lbw)
    summary(reglbw)

    Call:
    lm(formula = bwt ~ ., data = lbw)

    Residuals:
        Min      1Q  Median      3Q     Max

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)                              < 2e-16 ***
    age
    lwt                                              *
    raceblack                                        **
    raceother                                        **
    smokeyes                                         **
    ptl
    htyes                                            **
    uiyes                                            ***
    ftv
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: on 179 degrees of freedom
    Multiple R-squared: , Adjusted R-squared:
    F-statistic: on 9 and 179 DF, p-value: 7.891e-08

    autoplot(reglbw)

[Figure: diagnostic plots of reglbw with four panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage.]

7. Compute the trivial regression with no variables but the intercept, as a reference for a bad method.

    reglbwtriv <- lm(bwt ~ 1, data = lbw)
    summary(reglbwtriv)

    Call:
    lm(formula = bwt ~ 1, data = lbw)

    Residuals:
        Min      1Q  Median      3Q     Max

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)                               <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: on 188 degrees of freedom

7. Create a function that, given an lm model, computes the empirical error, the debiased error, the cross-validation error, the deviance (-2 log-likelihood), the AIC criterion and the BIC criterion.

    V <- 5
    # T is the number of repetitions of the V-fold split; it is not set anywhere in this text,
    # and if it is left undefined R reads T as TRUE, that is 1.
    LbwFolds <- caret::createMultiFolds(lbw[["bwt"]], k = V, times = T)
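For reference, the quantities named in the task above can be written as follows, with n observations, p estimated regression coefficients, T repetitions of a V-fold split with held-out index sets I_v, and \hat L the maximised likelihood. The form of the debiased error is an assumption (a Mallows Cp-style correction, since the constants are not legible in the code); the AIC and BIC lines follow the definitions used by R's AIC() and BIC(), where k counts the regression coefficients plus the error variance.

```latex
\begin{align*}
\mathrm{err} &= \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i-\hat y_i\bigr)^2, \\
\mathrm{err}_{Cp} &= \mathrm{err}\,\Bigl(1+\frac{2p}{n}\Bigr), \\
\mathrm{err}_{CV} &= \frac{1}{TV}\sum_{v=1}^{TV}\frac{1}{|I_v|}\sum_{i\in I_v}\bigl(y_i-\hat y_i^{(-v)}\bigr)^2, \\
\mathrm{deviance} &= -2\log\hat L, \\
\mathrm{AIC} &= -2\log\hat L + 2k, \\
\mathrm{BIC} &= -2\log\hat L + k\log n .
\end{align*}
```

Here \hat y_i^{(-v)} denotes the prediction for observation i from the model fitted without the observations in I_v.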

    computeerrlm <- function(model, name) {
      # Empirical (training) error
      err <- mean((lbw[["bwt"]] - predict(model))^2)
      # Debiased error; the constants of the factor are not legible in the source,
      # the Cp-style factor 1 + 2 p / n is assumed here
      errcp <- err * (1 + 2 * length(model[["coefficients"]]) / nrow(lbw))
      # Cross-validation error over the T * V training sets in LbwFolds
      errcvtmp <- matrix(0, nrow = 1, ncol = (T * V))
      for (v in 1:(T * V)) {
        lbwtrain <- slice(lbw, LbwFolds[[v]])
        lbwtest <- slice(lbw, -LbwFolds[[v]])
        regtmp <- lm(formula(model), data = lbwtrain)
        predtmp <- predict(regtmp, newdata = lbwtest)
        errcvtmp[v] <- mean((lbwtest[["bwt"]] - predtmp)^2)
      }
      errcv <- mean(errcvtmp)
      # Upper bound: mean plus two standard errors
      errcvup <- errcv + 2 * sd(errcvtmp) / sqrt(T * V)
      # Deviance, AIC and BIC
      LogLik <- -2 * as.numeric(logLik(model))
      LogLikAIC <- AIC(model)
      LogLikBIC <- BIC(model)
      data.frame(method = name, err = err, errcp = errcp, errcv = errcv,
                 errcvup = errcvup, LogLik = LogLik, LogLikAIC = LogLikAIC,
                 LogLikBIC = LogLikBIC)
    }

8. Compute the errors of the trivial and the full model.

    errs <- computeerrlm(reglbwtriv, "Trivial")
    errs <- rbind(errs, computeerrlm(reglbw, "Full"))
    errs

       method err errcp errcv errcvup LogLik LogLikAIC LogLikBIC
    1 Trivial
    2    Full

9. Create a function that takes a data frame of errors for possibly several models and plots them. Test it on the full model.

    Plot_Err <- function(errs) {
      ggplot(data = melt(select(errs, -matches("loglik"))),
             aes(x = method, y = value, color = variable)) +
        geom_point(size = 5) +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))
    }
    Plot_Err(errs)
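The cross-validation part of computeerrlm can be illustrated without caret. The sketch below is not the handout's code: it draws a single random V-fold split with base R (one repetition, a fixed seed, and the illustrative names fold_id and cv_err), fits the full model on each training part and averages the held-out squared errors.

```r
# Data prepared as in the handout: drop low, turn race into a factor
lbw <- MASS::birthwt
lbw$low <- NULL
lbw$race <- factor(lbw$race, levels = 1:3, labels = c("white", "black", "other"))

set.seed(42)  # arbitrary seed, only to make the split reproducible
V <- 5
# Assign each observation to one of V folds of (almost) equal size
fold_id <- sample(rep(seq_len(V), length.out = nrow(lbw)))

# Held-out mean squared error for each fold
cv_err <- vapply(seq_len(V), function(v) {
  train <- lbw[fold_id != v, ]
  test <- lbw[fold_id == v, ]
  fit <- lm(bwt ~ ., data = train)
  mean((test$bwt - predict(fit, newdata = test))^2)
}, numeric(1))

errcv <- mean(cv_err)                        # cross-validation error
errcvup <- errcv + 2 * sd(cv_err) / sqrt(V)  # mean plus two standard errors
print(c(errcv = errcv, errcvup = errcvup))
```

Because the folds are random, the numbers change with the seed; repeating the split several times, as the repeated folds in the handout do, reduces that variability.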

[Figure: err, errcp, errcv and errcvup for the Trivial and Full models.]

    Plot_LogLik <- function(errs) {
      ggplot(data = melt(select(errs, -matches("err"))),
             aes(x = method, y = value, color = variable)) +
        geom_point(size = 5) +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))
    }
    Plot_LogLik(errs)
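The three likelihood-based columns plotted by Plot_LogLik are closely related, which a short check makes concrete. This sketch recomputes AIC and BIC for the full model from its log-likelihood and compares them with R's AIC() and BIC(); the names fit, k, aic_manual and bic_manual are illustrative.

```r
# Data prepared as in the handout: drop low, turn race into a factor
lbw <- MASS::birthwt
lbw$low <- NULL
lbw$race <- factor(lbw$race, levels = 1:3, labels = c("white", "black", "other"))

fit <- lm(bwt ~ ., data = lbw)
ll <- logLik(fit)
k <- attr(ll, "df")   # regression coefficients + 1 for the error variance
n <- nobs(fit)

deviance_ll <- -2 * as.numeric(ll)       # the LogLik column of the handout
aic_manual <- deviance_ll + 2 * k        # AIC = -2 log L + 2 k
bic_manual <- deviance_ll + log(n) * k   # BIC = -2 log L + k log n

print(all.equal(aic_manual, AIC(fit)))
print(all.equal(bic_manual, BIC(fit)))
```

Since log(189) is larger than 2, BIC penalises each extra coefficient more heavily than AIC on this dataset, so it tends to favour smaller models.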

[Figure: LogLik, LogLikAIC and LogLikBIC for the Trivial and Full models.]

10. According to the summary, which variables can be removed from the model? Test this assumption by removing them, computing the errors and plotting them for the two models.

    reglbw2 <- update(reglbw, ~ . - age - ptl - ftv)
    summary(reglbw2)

    Call:
    lm(formula = bwt ~ lwt + race + smoke + ht + ui, data = lbw)

    Residuals:
        Min      1Q  Median      3Q     Max

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)                              < 2e-16 ***
    lwt                                              *
    raceblack                                        **
    raceother                                        **
    smokeyes                                         ***
    htyes                                            **
    uiyes                                            ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: on 182 degrees of freedom
    Multiple R-squared: , Adjusted R-squared:
    F-statistic: 9.6 on 6 and 182 DF, p-value: 3.601e-09

    errs <- rbind(errs, computeerrlm(reglbw2, "Simplified"))
    Plot_Err(errs)

[Figure: err, errcp, errcv and errcvup for the Trivial, Full and Simplified models.]

    Plot_LogLik(errs)

[Figure: LogLik, LogLikAIC and LogLikBIC for the Trivial, Full and Simplified models.]

    Find_Best <- function(errs) {
      nameserr <- names(errs)[-1]
      for (nameerr in nameserr) {
        # For each criterion, print the method with the smallest value
        writeLines(strwrap(paste(nameerr, ": ",
                                 errs[["method"]][which.min(errs[[nameerr]])],
                                 "(", min(errs[[nameerr]], na.rm = TRUE), ")")))
      }
    }
    Find_Best(errs)

    err : Full ( )
    errcp : Simplified ( )
    errcv : Simplified ( )
    errcvup : Simplified ( )
    LogLik : Full ( )
    LogLikAIC : Simplified ( )
    LogLikBIC : Simplified ( )

11. What would be the next simplification? Is it efficient?

    reglbw3 <- update(reglbw2, ~ . - lwt)
    summary(reglbw3)

    Call:
    lm(formula = bwt ~ race + smoke + ht + ui, data = lbw)

    Residuals:
        Min      1Q  Median      3Q     Max

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)                              < 2e-16 ***
    raceblack                                        **
    raceother                                        ***
    smokeyes                                         ***
    htyes                                            *
    uiyes                                       e-05 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: on 183 degrees of freedom
    Multiple R-squared: , Adjusted R-squared:
    F-statistic: on 5 and 183 DF, p-value: 1.98e-08

    errs <- rbind(errs, computeerrlm(reglbw3, "Simplified2"))
    Plot_Err(errs)

[Figure: err, errcp, errcv and errcvup for the Trivial, Full, Simplified and Simplified2 models.]

    Plot_LogLik(errs)

[Figure: LogLik, LogLikAIC and LogLikBIC for the Trivial, Full, Simplified and Simplified2 models.]

    Find_Best(errs)

    err : Full ( )
    errcp : Simplified ( )
    errcv : Simplified ( )
    errcvup : Simplified ( )
    LogLik : Full ( )
    LogLikAIC : Simplified ( )
    LogLikBIC : Simplified ( )

12. Use glmulti from the package of the same name to test all the possible variable subsets without any interaction. (Use the level = 1 option!) What is the best model according to the AIC criterion?

    library(glmulti)
    bests <- glmulti(bwt ~ ., data = lbw, level = 1, family = "gaussian", plotty = FALSE) #You may use plott

    Initialization...
    TASK: Exhaustive screening of candidate set.
    Fitting...
    After 50 models:
    Best model: bwt~1+race+smoke
    Crit= Mean crit=
    After 100 models:
    Best model: bwt~1+race+smoke+lwt
    Crit= Mean crit=
    After 150 models:
    Best model: bwt~1+race+smoke+ht+lwt
    Crit= Mean crit=
    After 200 models:
    Best model: bwt~1+race+smoke+ui+lwt
    Crit= Mean crit=
    After 250 models:
    Best model: bwt~1+race+smoke+ht+ui
    Crit= Mean crit=
    Completed.

13. bests@formulas contains a list of the best models. Use this to compute all the errors for the 50 best models. Compare those errors with those of our naive attempts.

    errmulti <- data.frame()
    for (f in 1:50) {
      model <- lm(bests@formulas[[f]], data = lbw)
      errmulti <- rbind(errmulti, computeerrlm(model, sprintf("Best_%g", f)))
    }
    errs_multi <- rbind(errs, errmulti)
    Plot_Err(errs_multi)
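To make concrete what the exhaustive screening without interactions does, the search can be imitated in base R. The sketch below is a stand-in, not glmulti's implementation: it enumerates every subset of the 8 predictors (2^8 = 256 models, including the intercept-only one), fits each with lm and ranks them by AIC. The names predictors, subsets, formulas and aics are illustrative.

```r
# Data prepared as in the handout: drop low, turn race into a factor
lbw <- MASS::birthwt
lbw$low <- NULL
lbw$race <- factor(lbw$race, levels = 1:3, labels = c("white", "black", "other"))

predictors <- setdiff(names(lbw), "bwt")

# All non-empty subsets of the predictors, by increasing size
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# One formula per subset, plus the intercept-only model
formulas <- c(list(bwt ~ 1),
              lapply(subsets, function(s) reformulate(s, response = "bwt")))

# AIC of each candidate model
aics <- vapply(formulas, function(f) AIC(lm(f, data = lbw)), numeric(1))

best <- formulas[[which.min(aics)]]
print(length(formulas))
print(best)
```

The count of 256 agrees with the log above, which reports progress up to 250 models before completing. With pairwise interactions added the number of candidates grows far too quickly for this brute-force loop.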

[Figure: err, errcp, errcv and errcvup for each method (Trivial, Full, Simplified, Simplified2, Best_1 to Best_50).]

    Plot_LogLik(errs_multi)

[Figure: LogLik, LogLikAIC and LogLikBIC for the same methods.]

    Find_Best(errmulti)

    err : Best_9 ( )
    errcp : Best_1 ( )
    errcv : Best_1 ( )
    errcvup : Best_4 ( )
    LogLik : Best_9 ( )
    LogLikAIC : Best_1 ( )
    LogLikBIC : Best_1 ( )

    Find_Best(errs_multi)

    err : Full ( )
    errcp : Simplified ( )
    errcv : Simplified ( )
    errcvup : Best_4 ( )
    LogLik : Full ( )
    LogLikAIC : Simplified ( )
    LogLikBIC : Simplified ( )

14. Add the interactions of level 2 and use glmulti with method = "d" to find the number of models. Is the exhaustive search possible?

    glmulti(bwt ~ ., data = lbw, level = 2, family = "gaussian", method = "d", plotty = FALSE) #You may use p

    Initialization...
    TASK: Diagnostic of candidate set.
    Sample size:
    factor(s).
    4 covariate(s).
    0 f exclusion(s).
    0 c exclusion(s).
    0 f:f exclusion(s).
    0 c:c exclusion(s).
    0 f:c exclusion(s).
    Size constraints: min = 0 max = -1
    Complexity constraints: min = 0 max = -1
    Your candidate set contains models.
    [1]

15. Use the genetic algorithm of glmulti (method = "g") to explore those models and examine the best 25 solutions.

    bestsgen <- glmulti(bwt ~ ., data = lbw, level = 2, family = "gaussian", method = "g", plotty = FALSE) #Y

    Initialization...
    TASK: Genetic algorithm in the candidate set.
    Initialization...
    Algorithm started...

    After 10 generations:
    Best model: bwt~1+race+smoke+ht+ui+age+lwt+ptl+lwt:age+ptl:age+ftv:ptl+smoke:age+smoke:ptl+ht:age+ht:
    Crit= Mean crit=
    Change in best IC: / Change in mean IC:

    After 20 generations:
    Best model: bwt~1+race+smoke+ui+age+lwt+ptl+ftv+lwt:age+ptl:age+ptl:lwt+ftv:age+smoke:age+smoke:ptl+h
    Crit= Mean crit=
    Change in best IC: / Change in mean IC:

    After 30 generations:
    Best model: bwt~1+race+smoke+age+lwt+ptl+ftv+ptl:age+ptl:lwt+ftv:age+smoke:age+ht:age+ht:lwt+ui:lwt+u
    Crit= Mean crit=
    Change in best IC: / Change in mean IC:

(The log then repeats the same best model every 10 generations, from 40 to 100 generations, each time with Change in best IC: 0.)

    After 110 generations:
    Best model: bwt~1+race+smoke+age+lwt+ptl+ftv+ptl:lwt+ftv:age+smoke:age+ht:age+ht:lwt+ui:lwt+ui:ptl
    Crit= Mean crit=
    Change in best IC: / Change in mean IC:

    After 120 generations:
    Best model: bwt~1+race+age+lwt+ptl+ftv+ptl:lwt+ftv:age+smoke:age+ht:age+ht:lwt+ui:lwt+ui:ptl
    Crit= Mean crit=
    Change in best IC: / Change in mean IC:

(The log then repeats this best model every 10 generations, from 130 to 570 generations, each time with Change in best IC: 0.)

    After 570 generations:
    Best model: bwt~1+race+age+lwt+ptl+ftv+ptl:lwt+ftv:age+smoke:age+ht:age+ht:lwt+ui:lwt+ui:ptl
    Crit= Mean crit=
    Improvements in best and average IC have been below the specified goals.
    Algorithm is declared to have converged.
    Completed.

    errgen <- data.frame()
    for (f in 1:25) {
      model <- lm(bestsgen@formulas[[f]], data = lbw)
      errgen <- rbind(errgen, computeerrlm(model, sprintf("BestGen_%g", f)))
    }
    errs_gen <- rbind(errs_multi, errgen)
    Plot_Err(errs_gen)
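A rough count shows why the genetic algorithm is used instead of an exhaustive search once pairwise interactions are allowed. The sketch below assumes that each main effect and each pairwise interaction can be included or excluded independently, with no marginality constraint; under that assumption it gives an order of magnitude only, and it need not match the exact figure glmulti reports in its diagnostic.

```r
# The 8 predictors of bwt used in the handout
predictors <- c("age", "lwt", "race", "smoke", "ptl", "ht", "ui", "ftv")

n_main <- length(predictors)       # main effects
n_inter <- choose(n_main, 2)       # pairwise interactions
n_terms <- n_main + n_inter        # terms that can be switched on or off

# Each term is either in or out of the model
n_models <- 2^n_terms
print(c(main = n_main, interactions = n_inter, terms = n_terms))
print(format(n_models, big.mark = ",", scientific = FALSE))
```

Compared with the 256 models of the search without interactions, fitting every candidate is no longer realistic, which is what motivates a stochastic search over the candidate set.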

[Figure: err, errcp, errcv and errcvup for each method (Trivial, Full, Simplified, Simplified2, Best_1 to Best_50, BestGen_1 to BestGen_25); value axis with ticks at 4e+05, 5e+05 and 6e+05.]

    Plot_LogLik(errs_gen)

[Figure: LogLik, LogLikAIC and LogLikBIC for the same methods.]

    Find_Best(errgen)

    err : BestGen_19 ( )
    errcp : BestGen_1 ( )
    errcv : BestGen_2 ( )
    errcvup : BestGen_2 ( )
    LogLik : BestGen_19 ( )
    LogLikAIC : BestGen_1 ( )
    LogLikBIC : BestGen_2 ( )

    Find_Best(errs_gen)

    err : BestGen_19 ( )
    errcp : BestGen_1 ( )
    errcv : BestGen_2 ( )
    errcvup : BestGen_2 ( )
    LogLik : BestGen_19 ( )
    LogLikAIC : BestGen_1 ( )
    LogLikBIC : Simplified ( )

16. Use glmnet to try a regularization method to obtain a better model.

    # Design matrix with all main effects and pairwise interactions, no intercept column
    X <- model.matrix(bwt ~ .^2 - 1, data = lbw)
    Y <- lbw[["bwt"]]
    library("glmnet")
    lbw_lasso <- glmnet(X, Y, family = "gaussian")
    # One row of coefficients per value of lambda
    coeffs_lbw_lasso <- cbind(data.frame(t(as.matrix(coef(lbw_lasso)))),
                              lambda = lbw_lasso[["lambda"]])
    # The end of this call is cut off in the source after "geom_lin"; geom_line() is assumed
    ggplot(data = melt(coeffs_lbw_lasso, "lambda"),
           aes(x = lambda, y = value, color = variable)) +
      geom_line()

[Figure: lasso coefficient paths, value against lambda, one curve per coefficient (age.smokeyes, age.ptl, age.htyes, age.uiyes, age.ftv, lwt.raceblack, lwt.raceother, lwt.smokeyes, lwt.ptl, lwt.htyes, lwt.uiyes, lwt.ftv, raceblack.smokeyes, raceother.smokeyes, raceblack.ptl, raceother.ptl, raceblack.htyes, raceother.htyes, raceblack.uiyes, among others).]

    computeerrglmnet <- function(model, lambda, name) {
      # Empirical error of the lasso fit at this lambda
      err <- mean((Y - predict(model, newx = X, s = lambda))^2)
      # Debiased error; the constants of the factor are not legible in the source,
      # 1 + 2 * (number of non-zero coefficients) / n is assumed here
      errcp <- err * (1 + 2 * (sum(abs(coef(model, lambda)) > 0)) / nrow(lbw))
      errcvtmp <- matrix(0, nrow = 1, ncol = (T * V))
      for (v in 1:(T * V)) {
        Xtrain <- X[LbwFolds[[v]], ]
        Xtest <- X[-LbwFolds[[v]], ]
        Ytrain <- Y[LbwFolds[[v]]]
        Ytest <- Y[-LbwFolds[[v]]]
        regtmp <- glmnet(Xtrain, Ytrain, family = "gaussian", lambda = lambda)
        predtmp <- predict(regtmp, newx = Xtest, s = lambda)
        errcvtmp[v] <- mean((Ytest - predtmp)^2)
      }
      errcv <- mean(errcvtmp)
      errcvup <- errcv + 2 * sd(errcvtmp) / sqrt(T * V)
      data.frame(method = name, err = err, errcp = errcp, errcv = errcv,
                 errcvup = errcvup, LogLik = NA, LogLikAIC = NA, LogLikBIC = NA)
    }

    # Same as computeerrlm, but on lbwint, the data frame built from the design matrix X
    computeerrlm2 <- function(model, name) {
      err <- mean((lbwint[["bwt"]] - predict(model))^2)
      # Debiased error, with the same assumed factor 1 + 2 p / n as in computeerrlm
      errcp <- err * (1 + 2 * length(model[["coefficients"]]) / nrow(lbw))
      errcvtmp <- matrix(0, nrow = 1, ncol = (T * V))
      for (v in 1:(T * V)) {
        lbwtrain <- slice(lbwint, LbwFolds[[v]])
        lbwtest <- slice(lbwint, -LbwFolds[[v]])
        regtmp <- lm(formula(model), data = lbwtrain)
        predtmp <- predict(regtmp, newdata = lbwtest)
        errcvtmp[v] <- mean((lbwtest[["bwt"]] - predtmp)^2)
      }
      errcv <- mean(errcvtmp)
      errcvup <- errcv + 2 * sd(errcvtmp) / sqrt(T * V)
      LogLik <- -2 * as.numeric(logLik(model))
      LogLikAIC <- AIC(model)
      LogLikBIC <- BIC(model)
      data.frame(method = name, err = err, errcp = errcp, errcv = errcv,
                 errcvup = errcvup, LogLik = LogLik, LogLikAIC = LogLikAIC,
                 LogLikBIC = LogLikBIC)
    }

    errlambda <- data.frame()
    errlambdasup <- data.frame()
    dx <- data.frame(X)
    lbwint <- cbind(dx, bwt = Y)
    for (l in 1:length(lbw_lasso[["lambda"]])) {
      lambda <- lbw_lasso[["lambda"]][l]
      # Error of the lasso estimate itself
      errlambda <- rbind(errlambda, computeerrglmnet(lbw_lasso, lambda, sprintf("Lasso_%g", l)))
      # Least-squares refit on the variables selected by the lasso
      subsetlambda <- which(abs(coef(lbw_lasso, lambda)[-1]) > 0)
      if (length(subsetlambda) > 0) {
        reglambda <- lm(bwt ~ ., data = mutate(select(dx, all_of(subsetlambda)), bwt = Y))
        errlambdasup <- rbind(errlambdasup, computeerrlm2(reglambda, sprintf("LassoSup_%g", l)))
      }
    }
    errs_lasso <- rbind(errs_gen, errlambda, errlambdasup)
    Plot_Err(errs_lasso)
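The loop above evaluates every lambda of the lasso path by hand. As a point of comparison, glmnet also ships a built-in cross-validation helper, cv.glmnet, which is not used in the handout. The sketch below is therefore an alternative under stated assumptions (5 folds, an arbitrary seed, and the illustrative name cvfit), not the handout's method.

```r
library(glmnet)

# Data prepared as in the handout: drop low, recode the categorical variables
lbw <- MASS::birthwt
lbw$low <- NULL
lbw$race <- factor(lbw$race, levels = 1:3, labels = c("white", "black", "other"))
for (v in c("smoke", "ht", "ui")) {
  lbw[[v]] <- factor(lbw[[v]], levels = c(0, 1), labels = c("no", "yes"))
}

# Design matrix with main effects and pairwise interactions, no intercept column
X <- model.matrix(bwt ~ .^2 - 1, data = lbw)
Y <- lbw[["bwt"]]

set.seed(1)  # arbitrary seed: the folds are drawn at random
cvfit <- cv.glmnet(X, Y, family = "gaussian", nfolds = 5)

# Lambda with the smallest cross-validated error, and the more
# conservative choice within one standard error of it
print(cvfit$lambda.min)
print(cvfit$lambda.1se)

# Names of the coefficients that are non-zero at lambda.min
beta <- as.matrix(coef(cvfit, s = "lambda.min"))
print(rownames(beta)[beta[, 1] != 0])
```

The one-standard-error choice plays a role similar to errcvup in the handout: it prefers a sparser model whose cross-validated error is not clearly worse than the minimum.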

[Figure: err, errcp, errcv and errcvup for each method (Trivial, Full, Simplified, Simplified2, Best_1 to Best_50, BestGen_1 to BestGen_25, Lasso_1 to Lasso_93, LassoSup_2 to LassoSup_93); value axis with ticks at 0e+00, 1e+07, 2e+07 and 3e+07.]

    Plot_LogLik(errs_lasso)

[Figure: LogLik, LogLikAIC and LogLikBIC for the same methods.]

    Find_Best(errlambdasup)

    err : LassoSup_54 ( )
    errcp : LassoSup_7 ( )
    errcv : LassoSup_7 ( )
    errcvup : LassoSup_7 ( )
    LogLik : LassoSup_54 ( )
    LogLikAIC : LassoSup_7 ( )
    LogLikBIC : LassoSup_5 ( )

    Find_Best(errs_lasso)

    err : LassoSup_54 ( )
    errcp : BestGen_1 ( )
    errcv : BestGen_2 ( )
    errcvup : BestGen_2 ( )
    LogLik : LassoSup_54 ( )
    LogLikAIC : BestGen_1 ( )
    LogLikBIC : LassoSup_5 ( )

17. Find a better model...
