Sample Final Exam MAT 128/SOC 251, Spring 2018

Similar documents
STT 315 Section /19/2014

Navigate to the golf data folder and make it your working directory. Load the data by typing

Running head: DATA ANALYSIS AND INTERPRETATION 1

Chapter 12 Practice Test

1wsSMAM 319 Some Examples of Graphical Display of Data

(c) The hospital decided to collect the data from the first 50 patients admitted on July 4, 2010.

Section I: Multiple Choice Select the best answer for each problem.

Driv e accu racy. Green s in regul ation

Week 7 One-way ANOVA

Quantitative Literacy: Thinking Between the Lines

CHAPTER 2 Modeling Distributions of Data

1. The data below gives the eye colors of 20 students in a Statistics class. Make a frequency table for the data.

Effect of homegrown players on professional sports teams

Lab 5: Descriptive Statistics

Foundations of Data Science. Spring Midterm INSTRUCTIONS. You have 45 minutes to complete the exam.

Tying Knots. Approximate time: 1-2 days depending on time spent on calculator instructions.

Fundamentals of Machine Learning for Predictive Data Analytics

STA 103: Midterm I. Print clearly on this exam. Only correct solutions that can be read will be given credit.

AP Statistics Midterm Exam 2 hours

Math 146 Statistics for the Health Sciences Additional Exercises on Chapter 2

Announcements. Lecture 19: Inference for SLR & Transformations. Online quiz 7 - commonly missed questions

Full file at

Probability & Statistics - Solutions

NBA TEAM SYNERGY RESEARCH REPORT 1

North Point - Advance Placement Statistics Summer Assignment

Lesson 3 Pre-Visit Teams & Players by the Numbers

In the actual exam, you will be given more space to work each problem, so work these problems on separate sheets.

MATH 118 Chapter 5 Sample Exam By: Maan Omran

May 11, 2005 (A) Name: SSN: Section # Instructors : A. Jain, H. Khan, K. Rappaport

b) (2 pts.) Does the study show that drinking 4 or more cups of coffee a day caused the higher death rate?

Data Set 7: Bioerosion by Parrotfish Background volume of bites The question:

Organizing Quantitative Data

Math 1040 Exam 2 - Spring Instructor: Ruth Trygstad Time Limit: 90 minutes

ISDS 4141 Sample Data Mining Work. Tool Used: SAS Enterprise Guide

3.3 - Measures of Position

Reminders. Homework scores will be up by tomorrow morning. Please me and the TAs with any grading questions by tomorrow at 5pm

Diameter in cm. Bubble Number. Bubble Number Diameter in cm

Name Student Activity

% per year Age (years)

Unit 3 - Data. Grab a new packet from the chrome book cart. Unit 3 Day 1 PLUS Box and Whisker Plots.notebook September 28, /28 9/29 9/30?

STAT 101 Assignment 1

Name May 3, 2007 Math Probability and Statistics

Analysis of Variance. Copyright 2014 Pearson Education, Inc.

That pesky golf game and the dreaded stats class

Introduction to Analysis of Variance (ANOVA) The Structural Model, The Summary Table, and the One- Way ANOVA

Background Information. Project Instructions. Problem Statement. EXAM REVIEW PROJECT Microsoft Excel Review Baseball Hall of Fame Problem

Statistical Analysis of PGA Tour Skill Rankings USGA Research and Test Center June 1, 2007

ASTERISK OR EXCLAMATION POINT?: Power Hitting in Major League Baseball from 1950 Through the Steroid Era. Gary Evans Stat 201B Winter, 2010

Smoothing the histogram: The Normal Curve (Chapter 8)

Confidence Intervals with proportions

Math SL Internal Assessment What is the relationship between free throw shooting percentage and 3 point shooting percentages?

AREA - Find the area of each figure. 1.) 2.) 3.) 4.)

Descriptive Statistics Project Is there a home field advantage in major league baseball?

Histogram. Collection

How to Win in the NBA Playoffs: A Statistical Analysis

DS5 The Normal Distribution. Write down all you can remember about the mean, median, mode, and standard deviation.

Boyle s Law: Pressure-Volume. Relationship in Gases

Homework Exercises Problem Set 1 (chapter 2)

Lesson 14: Modeling Relationships with a Line

Reproducible Research: Peer Assessment 1

Select Boxplot -> Multiple Y's (simple) and select all variable names.

Example 1: One Way ANOVA in MINITAB

Chapter 6 The Standard Deviation as a Ruler and the Normal Model

Practice Test Unit 6B/11A/11B: Probability and Logic

y ) s x x )(y i (x i r = 1 n 1 s y Statistics Lecture 7 Exploring Data , y 2 ,y n (x 1 ),,(x n ),(x 2 ,y 1 How two variables vary together

PRACTICE PROBLEMS FOR EXAM 1

Lab 11: Introduction to Linear Regression

PGA Tour Scores as a Gaussian Random Variable

a) List and define all assumptions for multiple OLS regression. These are all listed in section 6.5

Stats 2002: Probabilities for Wins and Losses of Online Gambling

Bivariate Data. Frequency Table Line Plot Box and Whisker Plot

100-Meter Dash Olympic Winning Times: Will Women Be As Fast As Men?

Exploring Measures of Central Tendency (mean, median and mode) Exploring range as a measure of dispersion

PRACTICAL EXPLANATION OF THE EFFECT OF VELOCITY VARIATION IN SHAPED PROJECTILE PAINTBALL MARKERS. Document Authors David Cady & David Williams

Confidence Interval Notes Calculating Confidence Intervals

NAME: Math 403 Final Exam 12/10/08 15 questions, 150 points. You may use calculators and tables of z, t values on this exam.

LESSON 5: THE BOUNCING BALL

APPENDIX A COMPUTATIONALLY GENERATED RANDOM DIGITS 748 APPENDIX C CHI-SQUARE RIGHT-HAND TAIL PROBABILITIES 754

Distancei = BrandAi + 2 BrandBi + 3 BrandCi + i

Boyle s Law: Pressure-Volume Relationship in Gases

STAT 625: 2000 Olympic Diving Exploration

Stat 139 Homework 3 Solutions, Spring 2015

Statistical Studies: Analyzing Data III.B Student Activity Sheet 6: Analyzing Graphical Displays

Statistical Studies: Analyzing Data III.B Student Activity Sheet 6: Analyzing Graphical Displays

Was John Adams more consistent his Junior or Senior year of High School Wrestling?

Internet Technology Fundamentals. To use a passing score at the percentiles listed below:

Quiz 1.1A AP Statistics Name:

Pitching Performance and Age

Statistical Theory, Conditional Probability Vaibhav Walvekar

Legendre et al Appendices and Supplements, p. 1

An Empirical Comparison of Regression Analysis Strategies with Discrete Ordinal Variables

The pth percentile of a distribution is the value with p percent of the observations less than it.

Safety at Intersections in Oregon A Preliminary Update of Statewide Intersection Crash Rates

NCSS Statistical Software

Political Science 30: Political Inquiry Section 5

1 Hypothesis Testing for Comparing Population Parameters

The Reliability of Intrinsic Batted Ball Statistics Appendix

Assignment. To New Heights! Variance in Subjective and Random Samples. Use the table to answer Questions 2 through 7.

Name: Class: Date: (First Page) Name: Class: Date: (Subsequent Pages) 1. {Exercise 5.07}

Section 3.2: Measures of Variability

Transcription:

Sample Final Exam MAT 128/SOC 251, Spring 2018 Name: Each question is worth 10 points. You are allowed one 8 1/2 x 11 sheet of paper with hand-written notes on both sides. 1. The CSV file citieshistpop.csv contains the historical population for several north-east American cities. Write a complete Python program that plots the population of New York over time. Assume the CSV file is in the same directory as your.py file. The graph should be a line graph and have a title. citieshistpop.csv: The data in this file comes from::,,, http://www1.nyc.gov,,, http://www.iboston.org,,, http://physics.bu.edu/~rednerl,,, Year,Boston,New York,Philadelphia 1900,560892,3437202,1293697 1910,670585,4766883,1549008 1920,748060,5620048,1823779 1930,781188,6930446,1950961 1940,770816,7454995,1931334 1950,801444,7891957,2071605 1960,697197,7783314,2002512 1970,641071,7894798,1948609 1980,562994,7071639,1688210 1990,574994,7322564,1585577 2000,590433,8008278,1517313 2. (a) Below are 3 histograms. Each histogram is generated by sampling from the same unknown distribution a different number of times. Draw a sketch of the distribution. (b) Below is a boxplot of some data. Use it to answer the following questions about the data:

(i) What is the minimum data value? (ii) What is the maximum data value? (iii) What is the median? (iv) What is the 25th percentile? (v) What is the 75th percentile? 3. Assume the NBA dataset (see last page) has been loaded into the dataframe nba (a) Write a piece of Python code to calculate and print the variance of the number of free throws made by a player during the season. (b) Write a piece of Python code to calculate and print the maximum number of points scored by a starting forward (SF) who is 25 years or older. 4. Assume the NBA dataset (see last page) has been loaded into the dataframe nba (a) Write a piece of Python code that uses the NBA dataset to estimate the probability that a player is a center (position is C) and print this probability. (b) What is the formula for computing the probability that a center player scores more than 800 points in a season? 5. Assume the NBA dataset (see last page) has been loaded into the dataframe nba The following Python code creates a linear model from the data in nba: lm = smf.ols(formula = pts ~ games, data = nba).fit() (a) What is this model predicting? (b) What information is this model using to make the prediction? (c) Running the code print(lm.params) gives the following output: Intercept -204.083706 g 13.532706 dtype: float64 What does the number 13.532706 represent?

(d) Write the line of Python code to create and fit a linear model that predicts the number of points scored in a season from the number of games started and the number of free throw points scored in a season. (e) Write a piece of Python code to compute and print out the R-Squared value of your model from part (d). 6. Assume the NBA dataset (see last page) has been loaded into the dataframe nba (a) The correlation matrix for the columns age, games, games started, free throws, and pts in the nba dataframe is shown below. age games games started free throws pts age 1.000-0.012. 0.025-0.047-0.012 games -0.012 1.000 0.611 0.598 0.728 games started 0.025 0.611 1.000 0.707 0.810 free throws -0.047 0.598 0.707 1.000 0.928 pts -0.012 0.728 0.810 0.928 1.000 (i) Which two columns are the most correlated? (ii) Which two columns are the least correlated? (b) Write a piece of Python code to compute and print out the mean number of games started in during the season, as well as the 90% confidence interval. 7. (a) Write a complete Python program that computes 500 samples from the normal distribution with a mean of 7 and a standard deviation of 1.5, and plots the samples as a histogram. The histogram should have 15 bins and a title. (b) Suppose the mean length (in inches) of a New York squirrel s tail is 8 with a standard deviation of 0.5. A park ranger randomly catches a sample of 60 squirrels, measures the lengths of their tails, and computes the mean. What distribution does this sample mean come from? Why? 8. This question refers to the NBA dataset on the last page. Our (alternative) hypothesis is that players who start more than 50 games are more likely to score more than 500 points than players who start 50 games or fewer. (a) What is the null hypothesis? (b) To test our hypothesis, we count the number of players in the data meeting a certain criteria and compare it with a histogram. This histogram is made by counting the number of players meeting the same criteria in samples simulated by assuming the null hypothesis is true. (i) When we count the number of players in the data meeting a certain criteria, what was that criteria? (ii) How many players are simulated with one sample?

(iii) Assume each simulated player is assigned 1 if they scored more than 500 points and 0 if they scored 500 points or fewer. How do you compute the probability that a player is assigned a 1? (c) The histogram and count from the data (shown as a dashed line on the histogram plot) are below. Would you reject the null hypothesis? Why or why not? 9. The CSV file citieshistpop.csv from Question 1 contains the historical population for several north-east American cities. Write a complete R program that computes and prints out the maximum population of Philadelphia. citieshistpop.csv: The data in this file comes from::,,, http://www1.nyc.gov,,, http://www.iboston.org,,, http://physics.bu.edu/~rednerl,,, Year,Boston,New York,Philadelphia 1900,560892,3437202,1293697 1910,670585,4766883,1549008 1920,748060,5620048,1823779 1930,781188,6930446,1950961 1940,770816,7454995,1931334 1950,801444,7891957,2071605 1960,697197,7783314,2002512 1970,641071,7894798,1948609 1980,562994,7071639,1688210 1990,574994,7322564,1585577 2000,590433,8008278,1517313 10. Using the ggplot2 library, write a piece of R code to plot the number of games started in a season (x-axis) vs. the number of free throws in a season (y-axis) for all players on the New York Knicks (NYK) and the Brooklyn Nets (BRK). Assume the NBA dataset (see last page) has been loaded into the dataframe nba. The plot should have: the title NYC Team Statistics

each data point plotted as a point the points be colored by team NBA Dataset The following dataset is used for questions 3, 4, 5, 6, 8, and 10. It contains statistics on all NBA players for the 2012-2013 season. Assume that it has been read in from a csv file and is stored as a Pandas dataframe in the variable nba. player position age team games games started free throws pts Quincy Acy SF 23 TOT 63 0 35 171 Steven Adams C 20 OKC 81 20 79 265 Jeff Adrien PF 27 TOT 53 12 76 362 Arron Affalo SG 28 ORL 73 73 274 1330 Alexis Ajinca C 25 NOP 56 30 56 328 Tyler Zeller C 24 CLE 70 9 87 399 The columns are: player = name of NBA player position = position of player (eg. C = center, SF = starting forward) age = age of player team = player s team name, abbreviated games = number of games played by the player in the season games started = number of games started in by the player in the season free throws = number of successful free throws during the season pts = total number of points scored by player during the season