ISDS 4141 Sample Data Mining Work
Taylor C. Veillon
Tool Used: SAS Enterprise Guide

You may have seen the movie Moneyball, about the Oakland A's baseball team and its general manager, Billy Beane, who focused on an analytical, evidence-based approach to assembling a competitive baseball team despite Oakland's disadvantaged revenue situation. Suppose you have the following data from the 2008 MLB season and want to determine an appropriate model for predicting the Number of Wins (Y) based upon the number of Runs, Hits, Walks, Errors, and Saves. Save the data at the end of this document in Excel spreadsheets named BB2008 and BB2009, respectively. (This assignment has 50 points, which will be converted to 100 pts.)

BB2008 is your training data and BB2009 is your validation data; be sure to use the appropriate data for the specified questions below. Use what you have learned this semester to build the best model by answering questions 1-8:

1. Complete descriptive statistics, including means, standard deviations, minimum, maximum, and number of observations for all variables. Also include distributional analyses and comment on the normality of each variable. Indicate if there are any outliers when looking at the univariate variables. (Don't delete any outliers at this point; they may not be influential in the regression analysis.) (Please report to 1 decimal) (4 pts)

Descriptive Statistics
                            Wins (Y)  Runs (X1)  Hits (X2)  Walks (X3)  Errors (X4)  Saves (X5)
Mean                            81.0      745.8     1461.4       544.4         97.0        34.9
Std Dev                         11.1       60.9       76.5        72.1         11.2         7.0
Min                             59.0      637.0     1329.0       403.0         67.0        26.0
Max                            100.0      855.0     1631.0       687.0        117.0        47.0
Sample Size                       30         30         30          30           30          30
Evidence of Non-normality        Yes        Yes        Yes         Yes          Yes         Yes
Outliers                           -        Yes        Yes         Yes          Yes         Yes

(For non-normality and outliers, place a check in the box if your answer is Yes; otherwise leave blank)
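The Wins column of the descriptive statistics can be reproduced outside SAS. The following is a minimal sketch in Python, not the SAS Enterprise Guide task actually used for the assignment; the helper name `describe` is hypothetical, and the values are the 2008 Wins figures from the data table at the end of this document.

```python
import statistics

# Wins (Y) for the 30 teams, copied from the 2008 table in this document.
wins_2008 = [82, 72, 68, 95, 97, 88, 74, 81, 74, 74, 84, 86, 75, 100, 84,
             90, 88, 89, 90, 75, 92, 67, 63, 61, 72, 88, 97, 79, 86, 59]


def describe(values):
    """Return mean, sample standard deviation, min, max and n (1 decimal)."""
    return {
        "mean": round(statistics.mean(values), 1),
        # statistics.stdev uses the n - 1 (sample) denominator.
        "std_dev": round(statistics.stdev(values), 1),
        "min": min(values),
        "max": max(values),
        "n": len(values),
    }


summary = describe(wins_2008)
print(summary)
```

The same helper can be applied to the other columns once they are loaded as lists.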
2. Which team had the most Wins? California Angels (100 wins) (1 pt)
Which team had the least Wins? Washington (59 wins) (1 pt)

3. Run a baseline model using the 2008 data to predict Number of Wins (Y) using all predictors.

a. What is the estimated regression equation of your best (baseline) model? (2 pts)
Ŷ = 64.53528 + 0.05574(X1) - 0.02739(X2) - 0.03468(X3) + 0.03818(X4) + 0.86235(X5)

b. VALIDATE your results using BB2009 data. Attach a spreadsheet (like the one I provided in class when I discussed validation RMSE) showing the actual Y, predicted Y, error, error-squared and validation RMSE. Your validation RMSE is (5 pts)
RMSE = 3.171319189

c. Which predictor seems most related to Number of Wins (Y)? (1 pt)
Saves (semipartial correlation = 0.08455). The predictor is statistically significant because its p-value is less than alpha, indicating that there is a relationship between the number of saves and the number of wins for a team.

d. Which predictor seems least related to Number of Wins (Y)? (1 pt)
Errors (semipartial correlation = 0.00105). However, this predictor is still potentially significant because its p-value is less than alpha.

4. To go along with the regression analysis, review a scatter plot matrix for Number of Wins (Y) and each predictor (X) to ensure that the linearity assumption is reasonable (you don't have to include that output with your homework). For any situation where the relationship between Number of Wins (Y) and a predictor is non-linear, transform the predictor so that the proper form of the predictor is used. Try to find the best transformation (both X² and X, ln(X), log(X), exp(X), sqrt(X), or 1/X). (NOTE: Always transform variables where necessary before checking for outliers. Remember, correcting the form of the relationship may fix potential problems with outliers.)

a. Which predictor(s) need(s) transforming? (2 pts)
Saves, Runs, Hits

[Figure: Saves]
Runs: Adjusted R² is 0.4645.

Hits: Adjusted R² is 0.2393. It appears that equal variance may be an issue.

b. For each variable needing transforming, what transformation was BEST? (4 pts)
Saves (both X² and X), RunsInverse, HitsLN
Saves:
Variable Transformation Type   Adjusted R²
Saves                          0.8035
SavesInverse                   0.8500
SavesLN                        0.8313
SavesSqRt                      0.8185
SavesX2                        0.8495
SavesEXP                       0.1942

The two highest adjusted R² values are for SavesInverse and SavesX2. Of those two transformations, SavesX2 appears to improve the linearity of the data more, which leads me to conclude that SavesX2 is the best transformation for this variable. The p-value is less than alpha, so the predictor is significant. Adjusted R² is 0.8495.

[Figure: SavesX2]
Errors:
Variable Transformation Type   Adjusted R²
Errors                         0.0529
ErrorsInverse                  0.0563
ErrorsLN                       0.0561
ErrorsSqRt                     0.0549
ErrorsX2                       0.0286

I have decided to leave Errors in its original form. The original distribution of the data looks the best visually even though its adjusted R² is not the highest.
Runs:
Variable Transformation Type   Adjusted R²
Runs                           0.4645
RunsInverse                    0.4663
RunsLN                         0.4662
RunsSqRt                       0.4654
RunsX2                         0.4462

For Runs, the two transformations with the highest adjusted R² values are RunsLN and RunsInverse. Of those two, RunsInverse does a better job with the linearity and distribution of the data, leading me to believe that RunsInverse is the best of the possible transformations for this predictor. Its p-value is less than alpha, indicating that it is significant in predicting wins, and its adjusted R² is 0.4663.

[Figure: RunsInverse]
Walks:
Variable Transformation Type   Adjusted R²
Walks                          0.3259
WalksLN                        0.3091
WalksSqRt                      0.3189
WalksInverse                   0.2879
WalksX2                        0.3276

Of the transformation options for Walks, the two highest adjusted R² values are for Walks and WalksX2. Although WalksX2 has a higher adjusted R², its effect on the distribution and linearity of the graph is bad, so I chose to keep the Walks predictor in its original form. The p-value is also less than alpha, indicating that the predictor is significant.

[Figure: Walks]
Hits:
Variable Transformation Type   Adjusted R²
Hits                           0.2393
HitsLN                         0.2394
HitsSqRt                       0.2394
HitsInverse                    0.2392
HitsX2                         0.2112

The adjusted R² values for the transformation types are nearly identical for this predictor. However, the transformation with the best effect on the distribution and linearity of this predictor is HitsLN. Its p-value is also less than alpha, indicating that it is significant.
[Figure: HitsLN]
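The transformation comparison in question 4 amounts to regressing Wins on each transformed predictor and comparing adjusted R². The following is a rough sketch in Python under stated assumptions, not the SAS workflow used for the report: the function name `adjusted_r2_simple` and the `TRANSFORMS` dictionary are hypothetical, only single-term transformations are covered (the SavesX2 row uses both X² and X and would need a two-predictor fit), and the Saves and Wins values are copied from the 2008 table in this document. The untransformed fit should come out close to the 0.8035 reported for Saves.

```python
import math

# Saves and Wins for the 30 teams, in the row order of the 2008 table.
saves = [33, 28, 29, 47, 44, 33, 31, 31, 28, 27, 36, 38, 29, 47, 35,
         45, 37, 43, 39, 28, 47, 27, 28, 26, 30, 41, 40, 33, 40, 27]
wins = [82, 72, 68, 95, 97, 88, 74, 81, 74, 74, 84, 86, 75, 100, 84,
        90, 88, 89, 90, 75, 92, 67, 63, 61, 72, 88, 97, 79, 86, 59]

# Single-term candidate transformations (labels mirror the tables above).
TRANSFORMS = {
    "Saves": lambda x: float(x),
    "SavesInverse": lambda x: 1.0 / x,
    "SavesLN": math.log,
    "SavesSqRt": math.sqrt,
}


def adjusted_r2_simple(x, y):
    """Adjusted R-squared of a one-predictor least-squares fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    r2 = sxy ** 2 / (sxx * syy)
    # One predictor, so the residual degrees of freedom are n - 2.
    return 1 - (1 - r2) * (n - 1) / (n - 2)


results = {name: adjusted_r2_simple([f(v) for v in saves], wins)
           for name, f in TRANSFORMS.items()}
for name, value in results.items():
    print(f"{name:14s} {value:.4f}")
```

Adjusted R² alone does not settle the choice; as the answers above note, the scatter plots were also used to judge linearity.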
5. Run an analysis using 2008 data ONLY (as training data) to determine if there are any influential observations. Do this analysis using all of the predictors (taking into account any transformations you made in 4b above).

a. What are the cutoffs for the following values for determining which observations are influential:
leverage (h_ii): 0.4625 (1 pt)
rstudent (t_i): -2 or +2 (1 pt)
DFFITS_i: -0.95 or +0.95 (1 pt)
Cook's D_i: 0.135 (1 pt)

b. Based upon your cutoffs:
If any, which observations have high leverage? None (1 pt)
Why? No observations have a leverage (h_ii) value greater than the cutoff of 0.4625. (1 pt)
If any, which observations are discrepant (outliers)? (1 pt)
Observation 14 (rstudent = 2.648522166)
Why? (1 pt)
Observation 14 has an rstudent score beyond the cutoff of +2.0.

If any, which observations are influential? (1 pt)
No observations are influential based on leverage and discrepancy alone. However, observation 14 does appear to have influence, as does observation 3 (once observation 14 is deleted).
Why? (1 pt)
Observation 14 has a DFFITS score of 2.01519, which is far beyond the cutoff of 0.95. Its Cook's D value is 0.45988, which is also beyond the cutoff. Observation 14 affects the slopes of all of the predictors in my current prediction equation (RunsInverse, HitsLN, Walks, Errors, Saves, SavesX2) and even the intercept. It appears to be affecting SavesX2 the most.

Observation 20 was also a suspect for potential influence. It has a DFFITS score of 1.0343934, which is only slightly beyond the cutoff, and it is technically influencing the slope of LN(Hits). Cook's D for observation 20 is 0.139757. However, compared to observation 14, it does not appear to be influencing the prediction greatly; its leverage is also very low.
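For comparison, commonly cited textbook rules of thumb for these diagnostics can be computed from the sample size n and the number of predictors k. This is a hedged sketch only: the formulas below (2(k+1)/n for leverage, 2·sqrt((k+1)/n) for DFFITS, 4/n for Cook's D) are an assumption and are not confirmed to be the ones used in the course. With n = 30 and k = 6 they give values close to, but not identical to, the cutoffs reported in 5a. The function names are hypothetical, and the observation 14 diagnostics are the ones quoted above.

```python
import math


def rule_of_thumb_cutoffs(n, k):
    """Common textbook cutoffs for n observations and k predictors."""
    p = k + 1  # number of estimated parameters, including the intercept
    return {
        "leverage": 2 * p / n,
        "rstudent": 2.0,
        "dffits": 2 * math.sqrt(p / n),
        "cooks_d": 4 / n,
    }


def flag(diagnostics, cutoffs):
    """Return the names of diagnostics whose magnitude exceeds its cutoff."""
    return sorted(name for name, value in diagnostics.items()
                  if abs(value) > cutoffs[name])


# 30 teams; predictors RunsInverse, HitsLN, Walks, Errors, Saves, SavesX2.
cutoffs = rule_of_thumb_cutoffs(n=30, k=6)

# Diagnostics for observation 14 as reported in the answer to 5b.
obs14 = {"rstudent": 2.648522166, "dffits": 2.01519, "cooks_d": 0.45988}
flags = flag(obs14, cutoffs)

print({name: round(value, 4) for name, value in cutoffs.items()})
print(flags)
```

Under either set of cutoffs, observation 14 is flagged on rstudent, DFFITS, and Cook's D.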
c. Assess the situation to see if you are justified in eliminating influential observations. Will you eliminate any influential observations? Yes or No (1 pt)
Yes. Observation 14 and Observation 3.
Why or why not? (1 pt)
After eliminating observation 14, the adjusted R² increased from 0.9094 to 0.9226.
Also, using DFBETAS, the deletion statistic, the following calculations were made (the DFBETAS cutoff is -0.4 to +0.4):

Observation 14:
DFBETAS Intercept   = -0.44418
DFBETAS RunsInverse =  0.66938
DFBETAS HitsLN      =  0.642641
DFBETAS Walks       = -0.78282
DFBETAS Errors      =  0.417140
DFBETAS Saves       = -0.8027
DFBETAS SavesX2     =  1.00711

Each of these values is beyond the cutoff zone, further justifying my reasoning for deleting observation 14.

Justification for observation 3's deletion: (The influence of observation 3 became apparent after observation 14 was deleted.)
[Figure: With outliers]
[Figure: Without outliers]

6. At this point, you have decided whether or not to transform your data and whether or not to delete influential observations. With those decisions made, conduct an all-possible-subset analysis and determine the best model (use 2008 data as TRAINING data).
a. Which predictors are included in your BEST model? (1 pt)
RunsInverse (X1), HitsLN (X2), Saves (X3), SavesX2 (X4)

b. How did you arrive at that decision? (1 pt)
This model has the highest adjusted R² and the lowest root MSE, and all of the predictors' p-values are less than alpha, indicating that they are significant in predicting wins.

c. What is the estimated regression equation of your best model? (2 pts)
Ŷ = 375.82680 - 39365(X1) - 47.65786(X2) + 5.29157(X3) - 0.06268(X4)

d. VALIDATE your results using BB2009 data. Attach a spreadsheet (like the one I provided in class when I discussed validation RMSE) showing the actual Y, predicted Y, error, error-squared and validation RMSE. Your validation RMSE is 12.68766896 (5 pts)

7. Which regression model is BEST? The one you found in 3a or 6c? (2 pts)
3a

Why did you choose the model you did? (2 pts)
In my case, it appears that the best model I picked in 6c may be over-fitted and therefore does not do a great job predicting new samples from the population. Following the criterion of picking the best model based on validation, I have to choose the original model in 3a because its validation RMSE is lower (3.17 versus 12.69).
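The two fitted equations and the validation RMSE calculation can be sketched in Python as follows. This is an illustration rather than the spreadsheet used for the assignment: the function names are hypothetical, the two terms whose operators are missing in each printed equation are assumed to be subtracted (this makes each equation reproduce the 2008 sample means), and RMSE is assumed to be the square root of the mean squared error over the validation rows, which may differ from the divisor used in class.

```python
import math


def predict_baseline(runs, hits, walks, errors, saves):
    """Baseline model from 3a (all five untransformed predictors)."""
    return (64.53528 + 0.05574 * runs - 0.02739 * hits
            - 0.03468 * walks + 0.03818 * errors + 0.86235 * saves)


def predict_transformed(runs, hits, saves):
    """Model from 6c: RunsInverse, HitsLN, Saves and SavesX2."""
    return (375.82680 - 39365 * (1.0 / runs) - 47.65786 * math.log(hits)
            + 5.29157 * saves - 0.06268 * saves ** 2)


def rmse(actual, predicted):
    """Square root of the mean squared prediction error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Sanity check 1: a least-squares fit evaluated at the sample means should
# return the mean of Y (81.0 wins in the descriptive statistics table).
at_means = predict_baseline(745.8, 1461.4, 544.4, 97.0, 34.9)

# Sanity check 2: 2008 Boston (845 runs, 1369 hits, 47 saves, 95 wins).
boston_2008 = predict_transformed(845, 1369, 47)

print(round(at_means, 2), round(boston_2008, 2))
```

To validate, each function would be applied to every BB2009 row and the resulting predictions passed to `rmse` along with the actual 2009 wins; the model with the lower validation RMSE is preferred.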
8. Use your best model provided in #7 above to explain to a coach what to do to increase the number of Wins. (4 pts)
Ŷ = 64.53528 + 0.05574(X1) - 0.02739(X2) - 0.03468(X3) + 0.03818(X4) + 0.86235(X5)

Based on the original model, I would explain that to increase the number of wins the coach should focus on two variables: number of runs and number of saves. These two have the highest semipartial correlations, indicating that they explain the most variance in wins among the five predictors. Holding the other predictors constant, each additional run is associated with about 0.056 additional wins, and each additional save with about 0.862 additional wins, so he should focus on increasing the team's runs and saves.

NOTE: The person with the BEST model will receive 4 extra credit points toward the total number of points for the semester.
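In a linear regression, each coefficient is the expected change in wins for a one-unit increase in that predictor with the others held fixed. A small sketch of that arithmetic for the baseline model follows; the function name and the example scenario (10 more runs and 5 more saves) are hypothetical and chosen only for illustration, and the signs on the Hits and Walks coefficients assume the missing operators in the printed equation are minus signs.

```python
# Coefficients of the baseline model from 3a.
BASELINE_COEFFICIENTS = {
    "runs": 0.05574,
    "hits": -0.02739,
    "walks": -0.03468,
    "errors": 0.03818,
    "saves": 0.86235,
}


def expected_win_change(changes):
    """Expected change in wins for the given changes in each predictor."""
    return sum(BASELINE_COEFFICIENTS[name] * delta
               for name, delta in changes.items())


# Hypothetical scenario: 10 more runs and 5 more saves over a season.
gain = expected_win_change({"runs": 10, "saves": 5})
print(round(gain, 2))
```

These are associations estimated from one season of data, so they describe how wins move with the predictors in the sample rather than guaranteeing a causal effect.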
2008
Team                    Runs   Hits  Walks Errors  Saves   Wins
Arizona                  720   1403    451    113     33     82
Atlanta                  753   1439    586    107     28     72
Baltimore                782   1538    687    100     29     68
Boston                   845   1369    548     85     47     95
Chicago Cubs             855   1329    548     99     44     97
Chicago White Sox        810   1469    457    108     33     88
Cincinnati               704   1542    557    114     31     74
Cleveland                805   1530    444     94     31     81
Colorado                 747   1547    562     96     28     74
Detroit                  760   1541    644    113     27     74
Florida                  770   1421    586    117     36     84
Houston                  712   1453    492     67     38     86
Kansas City              691   1473    515     96     29     75
California Angels        765   1455    457     91     47    100
Los Angeles Dodgers      700   1381    480    101     35     84
Milwaukee                750   1415    528    101     45     90
Minnesota                829   1563    403    108     37     88
New York Mets            799   1415    590     83     43     89
New York Yankees         789   1478    489     83     39     90
Oakland                  646   1364    576     98     28     75
Philadelphia             799   1444    533     90     47     92
Pittsburgh               735   1631    657    107     27     67
San Diego                637   1466    561     89     28     63
Seattle                  671   1544    626     99     26     61
San Francisco            640   1416    652     96     30     72
St. Louis                779   1517    496     85     41     88
Tampa Bay                774   1349    526     90     40     97
Texas                    752   1525    625     99     33     79
Toronto                  714   1330    467     84     40     86
Washington               641   1496    588     96     27     59
2009
Team                    Wins   Runs   Hits  Walks Errors  Saves
Arizona                   70    782   1408    571    124     36
Atlanta                   86    641   1459    602     96     38
Baltimore                 64    876   1508    517     90     31
Boston                    95    736   1495    659     82     41
Chicago Cubs              83    672   1398    592    105     40
Chicago White Sox         79    732   1410    534    113     36
Cincinnati                78    723   1349    531     89     41
Cleveland                 65    865   1468    582     97     25
Colorado                  92    715   1408    660     87     45
Detroit                   86    745   1443    540     88     42
Florida                   87    766   1493    568    106     45
Houston                   74    770   1415    448     78     39
Kansas City               65    842   1432    457    117     34
California Angels         97    761   1604    547     85     51
Los Angeles Dodgers       95    611   1511    607     83     44
Milwaukee                 80    818   1447    610     98     44
Minnesota                 87    765   1539    585     76     48
New York Mets             70    757   1472    526     97     39
New York Yankees         103    753   1604    663     86     51
Oakland                   75    761   1464    527    105     38
Philadelphia              93    709   1439    589     76     44
Pittsburgh                62    768   1364    499     73     28
San Diego                 75    769   1315    586     94     45
Seattle                   85    692   1430    421    105     49
San Francisco             88    611   1411    392     88     41
St. Louis                 91    640   1436    528     96     43
Tampa Bay                 84    754   1434    642     98     41
Texas                     87    740   1436    472    106     45
Toronto                   75    771   1516    548     76     25
Washington                59    874   1416    617    143     33