Empirical Example II of Chapter 7

1. We use NBA data. The description of variables is

              storage  display   value
variable name   type   format    label   variable label
--------------------------------------------------------------------
marr            byte   %9.2f             =1 if married
wage            float  %9.2f             annual salary, millions $
exper           byte   %9.2f             years as a professional player
age             byte   %9.2f             age in years
coll            byte   %9.2f             years playing at college
games           byte   %9.2f             average games per year
minutes         int    %9.2f             minutes per season
points          float  %9.2f             points per game
rebounds        float  %9.2f             rebounds per game
assists         float  %10.2f            assists per game
draft           int    %9.2f             draft number
allstar         byte   %9.2f             all-star player
avgmin          float  %9.2f             minutes per game
black           byte   %9.2f             =1 if black
children        byte   %9.2f             =1 if has children
position        str7                     position of the player

2. We are interested in how the position of a player affects points and wage. We need to be careful, because position is a string variable and a categorical (qualitative) variable. Unlike a dummy variable, position can take three values. In probability theory, we assume position follows a multinomial distribution.
. tab position

   position |      Freq.     Percent        Cum.
------------+-----------------------------------
     center |         51       17.41       17.41
    forward |        116       39.59       57.00
      guard |        126       43.00      100.00
------------+-----------------------------------
      Total |        293      100.00

Exercise: Can you guess which position has the highest average wage?

3. The bar graph below compares the average points across positions.

. graph bar points, over(position)

[Bar graph: mean of points (vertical axis, 0 to 10) by position: center, forward, guard]

Exercise: What is the average points for centers?

4. The statistical significance of the difference cannot be seen from the graph. To assess it, we need to generate a set of dummy variables, one dummy for each position:

. gen guard = (position=="guard")
. gen center = (position=="center")
. gen forward = (position=="forward")
You can use the command tab to verify that the dummies are generated appropriately.

. tab guard

      guard |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        167       57.00       57.00
          1 |        126       43.00      100.00
------------+-----------------------------------
      Total |        293      100.00

5. The fact that the sum of guard, center and forward is one, a constant, indicates that we cannot use all three dummy variables along with the constant term. Otherwise we would run into the dummy variable trap, a situation in which perfect multicollinearity arises.

6. Intuitively, because there are only three positions, we know a player must be a center if that player is neither a forward nor a guard. In other words, the center dummy is redundant once the forward and guard dummies are included in the regression.

7. So, to avoid the dummy variable trap, we use only two dummy variables along with the constant term.

. reg points forward guard

      Source |       SS       df       MS              Number of obs =     287
-------------+------------------------------           F(  2,   284) =    3.36
       Model |  226.926754     2  113.463377           Prob > F      =  0.0363
    Residual |  9602.39007   284  33.8112326           R-squared     =  0.0231
-------------+------------------------------           Adj R-squared =  0.0162
       Total |  9829.31682   286  34.3682406           Root MSE      =  5.8147

------------------------------------------------------------------------------
      points |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     forward |   1.940836   .9782515     1.98   0.048     .0152921    3.866379
       guard |   2.502496   .9707714     2.58   0.010     .5916758    4.413316
       _cons |   8.115686   .8142268     9.97   0.000     6.513001    9.718371
------------------------------------------------------------------------------

8. $\hat{\beta}_0 = 8.115686$ is the average points for centers (the base group, for which both forward and guard equal zero, that is, the group whose dummy variable we drop). $\hat{\beta}_1 = 1.940836$ is the difference in average points between forwards and centers; $\hat{\beta}_2 = 2.502496$ is the difference in average points between guards and centers. In short, every comparison is made relative to the base group, and in this case the base group is center.

9. Exercise: What is $\hat{\beta}_0$ if we use the command reg points center guard? Which is the base group now?

10. We can test the null hypothesis that position does not matter for points, i.e., no difference between forward and center, and no difference between guard and center:

. test forward guard

 ( 1)  forward = 0
 ( 2)  guard = 0

       F(  2,   284) =    3.36
            Prob > F =    0.0363

The p-value is less than 0.05, so at the 5% level we find evidence that points differ across positions (position matters for points). Notice that this F test is reported automatically by the reg command, next to the ANOVA table. It is called the F statistic for overall significance of a regression; see page 152 of the textbook for details.

11. In fields like biology, people would say position is the treatment, and another name for the F test is analysis of variance, or ANOVA for short. Simply put, we can carry out ANOVA by regressing a variable on a set of dummy variables and conducting the F test that all coefficients of the dummy variables equal zero.

12. Alternatively, we can include all three dummy variables in the regression, but then we have to drop the constant term with the option noc.

. reg points center forward guard, noc

      Source |       SS       df       MS              Number of obs =     287
-------------+------------------------------           F(  3,   284) =  282.27
       Model |  28631.6899     3  9543.89665           Prob > F      =  0.0000
    Residual |  9602.39007   284  33.8112326           R-squared     =  0.7489
-------------+------------------------------           Adj R-squared =  0.7462
       Total |    38234.08   287  133.219791           Root MSE      =  5.8147

------------------------------------------------------------------------------
      points |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      center |   8.115686   .8142268     9.97   0.000     6.513001    9.718371
     forward |   10.05652   .5422276    18.55   0.000     8.989227    11.12382
       guard |   10.61818    .528613    20.09   0.000     9.577685    11.65868
------------------------------------------------------------------------------

The advantage of this regression without an intercept is that we get the average points for each position directly. The disadvantage is that we cannot test the differences directly, and R-squared becomes misleading: without a constant term, the first-order condition of OLS $\sum_i \hat{u}_i = 0$ is no longer imposed, and Stata reports R-squared relative to the uncentered total sum of squares (note that Total SS is now 38234.08 rather than 9829.31682).

13. For position, we can generate a non-string categorical variable called pid,

. encode position, gen(pid)
. list position pid in 1/5, nolab

     +----------------+
     | position   pid |
     |----------------|
  1. |    guard     3 |
  2. |    guard     3 |
  3. |   center     1 |
  4. |    guard     3 |
  5. |  forward     2 |
     +----------------+

and run the regression based on pid.

. reg points i.pid, nohe

------------------------------------------------------------------------------
      points |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         pid |
          2  |   1.940836   .9782515     1.98   0.048     .0152921    3.866379
          3  |   2.502496   .9707714     2.58   0.010     .5916758    4.413316
             |
       _cons |   8.115686   .8142268     9.97   0.000     6.513001    9.718371
------------------------------------------------------------------------------

Note that the regressor is specified as i.pid. The result is the same as in the regression that uses forward and guard as regressors.

14. What is wrong with the command reg points pid?
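As a quick consistency check on the regressions with and without an intercept, the group means reported by the noc regression should equal the intercept plus the corresponding dummy coefficient from the regression with a constant. The notation $\bar{y}_g$ for the average points of group $g$ is introduced here only for illustration; all numbers are taken from the Stata output above.

```latex
\begin{align*}
\bar{y}_{\text{center}}  &= \hat{\beta}_0 = 8.115686, \\
\bar{y}_{\text{forward}} &= \hat{\beta}_0 + \hat{\beta}_1
                          = 8.115686 + 1.940836 = 10.056522 \approx 10.05652, \\
\bar{y}_{\text{guard}}   &= \hat{\beta}_0 + \hat{\beta}_2
                          = 8.115686 + 2.502496 = 10.618182 \approx 10.61818.
\end{align*}
```

The sums match the coefficients on center, forward and guard in the noc regression up to rounding, which is why the two specifications have the same residual sum of squares (9602.39007) and the same Root MSE.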
15. Finally, let's see what factors matter for log wage.

. gen lwage = log(wage)
(13 missing values generated)

. reg lwage marr exper minutes points rebounds assists allstar avgmin black guard forward

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        marr |  -.0018594   .0830561    -0.02   0.982    -.1654107    .1616918
       exper |   .0712154   .0122108     5.83   0.000     .0471703    .0952604
     minutes |  -.0000538    .000121    -0.44   0.657     -.000292    .0001844
      points |   .0513876   .0165816     3.10   0.002     .0187356    .0840396
    rebounds |   .0020807    .026334     0.08   0.937    -.0497753    .0539368
     assists |   .0269989   .0322371     0.84   0.403    -.0364813    .0904791
     allstar |  -.2507736   .1620092    -1.55   0.123    -.5697966    .0682493
      avgmin |    .030364   .0159912     1.90   0.059    -.0011254    .0618534
       black |   .1288088   .1006909     1.28   0.202    -.0694682    .3270857
       guard |  -.3711601   .1513513    -2.45   0.015    -.6691958   -.0731244
     forward |  -.0901512   .1149876    -0.78   0.434    -.3165807    .1362783
       _cons |  -1.435384   .1599524    -8.97   0.000    -1.750357   -1.120412
------------------------------------------------------------------------------

Exercise: How do we interpret each coefficient?

16. What are the possible reasons that minutes, rebounds and assists are insignificant?
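As one possible starting point for the interpretation exercise, recall that in a regression with a logged dependent variable, a coefficient measures an approximate proportional change in wage. The sketch below works through the guard coefficient only; the symbol $\%\Delta\widehat{\text{wage}}$ is notation assumed here, not taken from the handout, and the other coefficients are left to the exercise.

```latex
\begin{align*}
\%\Delta\widehat{\text{wage}}
  &= 100\,\bigl(\exp(\hat{\beta}_{\text{guard}}) - 1\bigr)
   = 100\,\bigl(\exp(-0.3711601) - 1\bigr)
   \approx 100\,(0.6899 - 1) \approx -31.0, \\
\%\Delta\widehat{\text{wage}}
  &\approx 100\,\hat{\beta}_{\text{guard}} = -37.1
  \quad\text{(rough approximation, accurate only for small coefficients).}
\end{align*}
```

Under this reading, guards are estimated to earn roughly 31 percent less than centers (the base group), holding the other regressors fixed. The gap between the exact figure and the approximation shows why the shortcut of multiplying the coefficient by 100 should be used with care when a coefficient is large in absolute value.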