Statistical Theory, Conditional Probability
Vaibhav Walvekar

# Load some helpful libraries
library(tidyverse)

If a baseball team scores X runs, what is the probability it will win the game? Baseball is a game played between two teams who take turns batting and fielding. A run is scored when a player advances around the bases and returns to home plate. More information about the dataset can be found at http://www.retrosheet.org/.

# Reading column names for the data present in GL2010, GL2011, GL2012, GL2013
colnames <- read.csv("cnames.txt", header=TRUE)
# Creating a null df
baseballdata <- NULL
# Looping to read data from all the GL<year>.txt files in sequence from 2010 to 2013
for (year in seq(2010, 2013, by=1)){
  # Constructing the name of the file by using year dynamically to read from the working directory
  mypath <- paste('GL', year, '.txt', sep='')
  # cat(mypath,'\n')
  # Reading each GL<year>.txt file and binding them row by row,
  # using the Name column from the colnames df for header names
  baseballdata <- rbind(baseballdata, read.csv(mypath, col.names=colnames$Name))
  # Converting to a table df
  baseballdata <- tbl_df(baseballdata)
}
# baseballdata

Selecting the relevant columns (Date, Home, Visitor, HomeLeague, VisitorLeague, HomeScore, VisitorScore) to create a new data frame storing the data for analysis.

baseballdata_final <- baseballdata[, c('Date','Home','Visitor','HomeLeague','VisitorLeague','HomeScore','VisitorScore')]

Considering only games between two teams in the National League, we compute the conditional probability of a team winning given X runs scored, for X = 0, ..., 10, separately for Home and Visitor teams. We design a visualization that shows the results and discuss what we find.
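The quantity estimated in the analysis that follows is the empirical conditional probability P(win | X runs) = (# games won with X runs) / (# games with X runs). A minimal sketch on made-up scores (hypothetical toy data, not the Retrosheet logs):

```r
# Toy illustration of an empirical conditional probability P(win | score).
# The scores and outcomes here are hypothetical, not from the Retrosheet data.
scores <- c(3, 3, 5, 3, 5, 0)
won    <- c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)

# P(win | score == 3) = wins among games with 3 runs / games with 3 runs
p_win_given_3 <- sum(won[scores == 3]) / sum(scores == 3)
print(p_win_given_3)  # 2 wins out of 3 games with 3 runs -> 0.6666667
```

The full analysis computes the same ratio for every score via group_by/summarize counts.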

baseballdata_filtered <- filter(baseballdata_final,
    baseballdata_final$VisitorLeague == "NL",
    baseballdata_final$HomeLeague == "NL")
# Considering home win
baseballdata_filtered$ResultForHome <- ifelse(
    baseballdata_filtered$HomeScore > baseballdata_filtered$VisitorScore, "Won", "Lost")
# Calculating total matches for each score by the Home team
by_homescore = group_by(baseballdata_filtered, HomeScore)
sum_hs = summarize(by_homescore, totalcount = n())
# Calculating total matches for each score by the Home team when they won
by_homescore_resultforhome = group_by(baseballdata_filtered, HomeScore, ResultForHome)
sum_hs_rsh = filter(summarize(by_homescore_resultforhome, count = n()), ResultForHome == "Won")
# Merging the above two df to calculate probability
final_hs_rsh = merge(sum_hs_rsh, sum_hs, by = "HomeScore")
# Adding a new column for probability
final_hs_rsh <- mutate(final_hs_rsh, Probability = count/totalcount)
# Plotting probability of a win for each score
p3 <- ggplot(final_hs_rsh, aes(x = HomeScore, y = Probability))
p3 + geom_point() + labs(x = "Home Score", y = "Probability of Winning",
    title = "P(Home Winning | Score)")

[Plot: P(Home Winning | Score) — probability of winning vs. home score]

# Considering visitor win
baseballdata_filtered$ResultForVisitor <- ifelse(
    baseballdata_filtered$HomeScore < baseballdata_filtered$VisitorScore, "Won", "Lost")
# Calculating total matches for each score by the Visitor team
by_visitorscore = group_by(baseballdata_filtered, VisitorScore)

sum_vs = summarize(by_visitorscore, totalcount = n())
# Calculating total matches for each score by the Visitor team when they won
by_visitorscore_resultforvisitor = group_by(baseballdata_filtered, VisitorScore, ResultForVisitor)
sum_vs_rsv = filter(summarize(by_visitorscore_resultforvisitor, count = n()), ResultForVisitor == "Won")
# Merging the above two df to calculate probability
final_vs_rsv = merge(sum_vs_rsv, sum_vs, by = "VisitorScore")
# Adding a new column for probability
final_vs_rsv <- mutate(final_vs_rsv, Probability = count/totalcount)
# Plotting probability of a win for each score
p4 <- ggplot(final_vs_rsv, aes(x = VisitorScore, y = Probability))
p4 + geom_point() + labs(x = "Visitor Score", y = "Probability of Winning",
    title = "P(Visitor Winning | Score)")

[Plot: P(Visitor Winning | Score) — probability of winning vs. visitor score]

From both of the above graphs, for the Home team as well as the Visitor team, we observe that once a team scores more than 14 runs, it wins with probability 1 in this dataset. A score above 6 runs gives roughly a 75% chance of winning the match.

Repeating the above problem, but now considering the probability of winning given the number of hits.

# Including Hhits and Vhits in the df
baseballdata_final_hits <- baseballdata[, c('Date','Home','Visitor','HomeLeague','VisitorLeague','HomeScore','VisitorScore','Hhits','Vhits')]

# Filtering only NL games
baseballdata_filtered_hits <- filter(baseballdata_final_hits,
    baseballdata_final_hits$VisitorLeague == "NL",
    baseballdata_final_hits$HomeLeague == "NL")
# Adding a new column: the number of hits scored by the team winning the match
baseballdata_filtered_hits$HitsInWinningCause <- ifelse(
    baseballdata_filtered_hits$HomeScore > baseballdata_filtered_hits$VisitorScore,
    baseballdata_filtered_hits$Hhits, baseballdata_filtered_hits$Vhits)
# Calculating total matches for each hit count by the Home team
by_hhits = group_by(baseballdata_filtered_hits, Hhits)
sum_hhits = summarize(by_hhits, Hhitscount = n())
# Calculating total matches for each hit count by the Visitor team
by_vhits = group_by(baseballdata_filtered_hits, Vhits)
sum_vhits = summarize(by_vhits, Vhitscount = n())
# Merging the above two df to get the total count of matches in which each number of hits was scored
Total_Hhits_Vhits = merge(sum_hhits, sum_vhits, by.x = "Hhits", by.y = "Vhits")
# Calculating matches for each hit count in a winning cause
by_hitsinwinningcause = group_by(baseballdata_filtered_hits, HitsInWinningCause)
sum_hitsinwinningcause = summarize(by_hitsinwinningcause, hitscount_winning = n())
# Merging the above two df to calculate probability
Total_hits_winning_cause = merge(sum_hitsinwinningcause, Total_Hhits_Vhits,
    by.x = "HitsInWinningCause", by.y = "Hhits")
# Adding a new column for probability
Total_hits_winning_cause <- mutate(Total_hits_winning_cause,
    Probability = hitscount_winning/(Hhitscount + Vhitscount))
# Plotting probability of a win for each hit count
p5 <- ggplot(Total_hits_winning_cause, aes(x = HitsInWinningCause, y = Probability))
p5 + geom_point() + labs(x = "Number of Hits", y = "Probability of Winning",
    title = "P(Winning | Number of hits)")

[Plot: P(Winning | Number of hits) — probability of winning vs. number of hits]

From the above graphic we can see that as the number of hits increases for a team, the probability of that team winning the match goes up. One abnormal observation is that when the number of hits equals 20, the probability of winning dips slightly compared with lower hit counts (15-19).

Triathlon Times

In triathlons, it is common for racers to be placed into age and gender groups. Friends Leo and Mary both completed the Hermosa Beach Triathlon, where Leo competed in the Men, Ages 30-34 group while Mary competed in the Women, Ages 25-29 group. Leo completed the race in 1:22:28 (4948 seconds), while Mary completed the race in 1:31:53 (5513 seconds). Obviously Leo finished faster, but they are curious about how they did within their respective groups. Here is some information on the performance of their groups: The finishing times of the Men, Ages 30-34 group have a mean of 4313 seconds with a standard deviation of 583 seconds. The finishing times of the Women, Ages 25-29 group have a mean of 5261 seconds with a standard deviation of 807 seconds. The distributions of finishing times for both groups are approximately Normal. Remember: a better performance corresponds to a faster finish.

A short-hand description of these two normal distributions:

We are considering the finishing-time distributions of two categories at the Hermosa Beach Triathlon. One distribution is of the times of all participants in the Men, Ages 30-34 category; it has a mean of 4313 seconds and a standard deviation of 583 seconds. The second distribution is of the times of all participants in the Women, Ages 25-29 category; it has a mean of 5261 seconds and a standard deviation of 807 seconds. The finishing time of a participant is the random variable in both distributions. Both distributions are given to be approximately normal, which means the times are symmetrical about the mean.

z = (X - mu)/sigma
Normal distribution for Leo - N1(mean = 4313, sd = 583)
Normal distribution for Mary - N2(mean = 5261, sd = 807)

Calculating Z scores

Leo_time <- 4948
Mary_time <- 5513
Men_mean_time <- 4313
Women_mean_time <- 5261
Men_sd <- 583
Women_sd <- 807
zscoreLeo <- (4948-4313)/583
zscoreMary <- (5513-5261)/807
print(zscoreLeo)
## [1] 1.089194
print(zscoreMary)
## [1] 0.3122677

The z-scores for Leo and Mary are 1.089 and 0.3122 respectively. From the z-score we can say that Leo's finishing time is 1.089 standard deviations above the mean of his distribution, and Mary's finishing time is 0.3122 standard deviations above the mean of hers. Overall, z-scores let us compare scores from two different distributions, because computing a z-score transforms the value onto a standard normal scale with mean 0 and standard deviation 1. The sign (+ or -) indicates the direction of the score relative to the mean.

Comparing Ranks

As discussed above, by calculating z-scores we can now make direct comparisons between the performances of Leo and Mary. Since Leo is 1.089 standard deviations above his group mean while Mary is only 0.3122 standard deviations above hers, a larger fraction of Leo's group finished before him than of Mary's group before her, in terms of percentile. Thus we can say that Mary ranked better than Leo within their respective groups.
Percentile ranks

(1 - pnorm(zscoreLeo))*100
## [1] 13.80342

Leo finished faster than 13.803% of the triathletes in his group. The pnorm function gives the proportion of participants finishing faster than Leo, so we subtract the result from 1.

Percentile ranks

(1 - pnorm(zscoreMary))*100
## [1] 37.74186

Mary finished faster than 37.741% of the triathletes in her group. The pnorm function gives the proportion of participants finishing faster than Mary, so we subtract the result from 1.

What if the distribution is not normal?

Our answers above relied on the distributions being normal, i.e., the times in both groups being symmetrical about their means. The z-score computation assumed that subtracting the mean and dividing by the standard deviation standardizes the distribution to mean 0 and standard deviation 1. If these conditions are not satisfied, we cannot make direct comparisons between scores from two different distributions, and the answers would change.

Sampling with and without Replacement

In the following situations, assume that half of the specified population is male and the other half is female.

From a room of 10 people, compare sampling 2 females with and without replacement.

Sampling with replacement: To sample two females in a row, we need the probability of choosing one female and then choosing a second female after replacing the first in the room. As there are 10 people and 50% are female, there are 5 females in the room. Thus P(choosing a female) = number of favourable choices/total choices = 5/10 = 1/2; similarly on the second draw, since the first female is replaced in the room, P(choosing a female a second time) = 5/10 = 1/2. Thus P(sampling two females in a row) = P(choosing a female) * P(choosing a female a second time) = (1/2) * (1/2) = 1/4 = 0.25.

Sampling without replacement: Here P(choosing a female) remains 1/2, but since the chosen female is not replaced in the room, P(choosing a female a second time) = number of favourable choices/total choices = 4/9.
Therefore, P(sampling two females in a row) = P(choosing a female) * P(choosing a female a second time) = (1/2) * (4/9) = 2/9 ≈ 0.222.

From a room of 10000 people, compare sampling 2 females with and without replacement.

Sampling with replacement: As there are 10000 people and 50% are female, there are 5000 females in the room. Thus P(choosing a female) = 5000/10000 = 1/2; since the first selected female is replaced in the room, P(choosing a female a second time) = 5000/10000 = 1/2. Thus P(sampling two females in a row) = (1/2) * (1/2) = 1/4 = 0.25.

Sampling without replacement: Here P(choosing a female) remains 1/2, but since the chosen female is not replaced in the room, P(choosing a female a second time) = number of favourable choices/total choices = 4999/9999. Therefore, P(sampling two females in a row) = (1/2) * (4999/9999) = 4999/19998 ≈ 0.24997.

Samples from a large population are often treated as independent; explain whether or not this assumption is reasonable.

From the above we observe that as the population grows much larger, the probabilities with and without replacement become almost identical (0.25 vs. ≈ 0.24997). Increasing the population size does not change the with-replacement probability, but it does affect the without-replacement probability: in a very large population, the unit drawn first has a negligible effect on the second draw, whereas in a small population the second draw depends strongly on what was chosen first. Thus it is reasonable to treat samples drawn from a large population as independent.

Sample Means

You are given the following hypotheses: H_0: mu = 34, H_A: mu > 34. We know that the sample standard deviation is 10 and the sample size is 65. For what sample mean would the p-value equal 0.05? Assume that all conditions necessary for inference are satisfied.

For the p-value to equal 0.05, the sample mean must sit exactly at the boundary value: any larger mean would reject H_0. To place the sample mean at this boundary of the one-tailed distribution, we compute the t-critical (or z-critical) value at the 95th percentile, since the p-value is given as 0.05.
Following that, we can calculate the sample mean using the formula tcritical = (sample_mean - mu)/sample_sd_means, where sample_sd_means = sample_sd/sqrt(n).

n <- 65
mu <- 34
sample_sd <- 10
# zcritical <- qnorm(0.95)
# Degrees of freedom = n - 1
tcritical <- qt(.95, 64)
sample_mean <- tcritical*sample_sd/sqrt(n) + mu
print(sample_mean)
## [1] 36.07016

We use the t statistic because the population variance is unknown; when only the sample variance is known, a t test is appropriate. Since the sample size is greater than 30, the t and z statistics are nearly the same. Thus if the sample mean is 36.07016, the p-value is 0.05.
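As a consistency check (a sketch reusing the same numbers: mu = 34, sample sd = 10, n = 65), plugging the computed sample mean back into the t statistic recovers the stated p-value:

```r
# Recompute the boundary sample mean, then verify its one-sided p-value is 0.05
n <- 65
mu <- 34
sample_sd <- 10
sample_mean <- qt(0.95, df = n - 1) * sample_sd / sqrt(n) + mu

# One-sided p-value for H_A: mu > 34 at this sample mean
t_stat <- (sample_mean - mu) / (sample_sd / sqrt(n))
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)
print(p_value)  # 0.05
```

This confirms that 36.07016 is exactly the sample mean at which the one-sided test sits on the 0.05 boundary.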