Best Practices in Mathematics Education STATISTICS MODULES

Best Practices in Mathematics Education STATISTICS MODULES APEC Technical Assistance & Training Facility (APEC TATF) APEC Project HRD 01/2009A - 21st Century Mathematics Education for All in the APEC Region

Table of Contents
Introduction: Data and Statistics
Module 1: Representing and Interpreting Data
  Pictographs
  Bar Graphs
  Line Graphs
  Pie Charts
Module 2: Descriptive Statistics
  Mean/Median/Mode
  Dispersion
Module 3: Bivariate Statistics
  Scatter Plots
  Correlation
  Lines of Best Fit
Module 4: Misrepresenting data

STATISTICS MODULES Introduction: Statistics and Data 3

Data Data are facts or figures about situations or conditions. The term data refers to qualitative or quantitative attributes of a variable or set of variables. Data (plural of "datum") are typically the results of counts and measurements and are often the basis of graphs that are used to represent the data. Data are often viewed as the lowest level of abstraction from which information in the form of statistics and then knowledge are derived. Raw data, that is, unprocessed data, refer to a collection of numbers, characters, images or other outputs from devices that collect information to convert physical quantities into statistics that can be interpreted and used to make decisions. 4

Some examples of data Data about an individual: Age Height Weight Country of birth Nationality Annual income Data about an economy: Population Per capita income Land mass Population density Percent of adults with a high school education Square miles of arable land 5

Statistics Statistics gives meaning to data. Statistics is what turns data into information and knowledge. Statistics is the study of the collection, organization, analysis, and interpretation of data. It covers every stage of that process, including planning data collection through the design of surveys and experiments.

Some examples of statistics Given the population of each APEC economy ordered alphabetically, statistics is ordering the populations from greatest to least and representing the populations on a bar graph to show their relative sizes. Given the population of each APEC economy in 2000 and in 2010, statistics is calculating and interpreting the growth rate for each economy. Given growth rates, statistics is understanding and interpreting the causes of different growth rates in terms of birth rates, death rates, immigration rates and emigration rates. 7

Our approach in these modules You will learn by analyzing data. You will use the following framework: What are the questions? What data should be collected? How can the data be organized and analyzed? How can the results be interpreted? We will use, whenever possible, international data, especially from economies in the Asia-Pacific region.

STATISTICS MODULES MODULE 1: Representing Data 9

Module Overview Pictographs Bar graphs Line graphs Pie charts Deciding which representation to use 10

Pictographs Pictographs use objects or pictures to represent data. Each object or picture in a pictograph represents the same quantity, that is, each coin could represent one coin in one pictograph or each coin could represent ten coins in another pictograph. Pictographs can be drawn vertically or horizontally and are easily converted to bar graphs. 11

Reading pictographs [Pictograph: Gold Medals at the 2008 Olympic Games; key: one symbol = 1 gold medal; rows: Thailand, Ethiopia, Canada, Bulgaria] How many gold medals did Thailand win? What about Canada?

Reading pictographs [Same pictograph: Gold Medals at the 2008 Olympic Games] Who earned the most gold medals of these four economies? The least?

Reading pictographs [Same pictograph: Gold Medals at the 2008 Olympic Games] How many more gold medals did Canada receive than Bulgaria?

Understanding pictographs [Pictograph: Silver Medals at the 2008 Olympic Games; key: one symbol = 2 silver medals; rows: Japan, New Zealand, Sweden, Chinese Taipei] Note that the scale has changed. How many silver medals did New Zealand receive? Chinese Taipei?

Understanding pictographs [Same pictograph: Silver Medals at the 2008 Olympic Games; key: one symbol = 2 silver medals] How many more medals does Japan have than Sweden? Than Chinese Taipei?

Constructing pictographs Bronze Medals at the 2008 Olympic Games (key: one symbol = 3 bronze medals)
Economy     Bronze medals
Indonesia   3
Canada      6
Nigeria     3
Belarus     9
[Blank pictograph with rows: Indonesia, Canada, Nigeria, Belarus] Print this slide and construct a pictograph using the data table above.

Interpreting pictographs Silver Medals at the 2008 Olympic Games Economy Silver medals = 3 bronze medals Japan New Zealand Sweden Chinese Taipei Using the pictograph, complete the table on the right. 18

Bar Graphs A bar graph is a graph with rectangular bars whose lengths are proportional to the values that they represent. The bars can be plotted vertically or horizontally. Bar graphs are used for displaying a set of data for which the relative magnitude or size is of interest. For example, one could use a bar graph to display discrete or categorical data such as 'shoe size' or 'eye color'. Bar graphs help to identify clumps, bumps and holes in a set of data.

Understanding bar graphs [Blank table with columns: Economy, Silver medals] [Bar graph: Silver Medals at the 2008 Olympic Games; bars: Japan, New Zealand, Sweden, Chinese Taipei; vertical axis from 0 to 7] Using the bar graph, complete the table on the left.

Understanding bar graphs [Pictograph and bar graph, both titled Silver Medals at the 2008 Olympic Games; economies: Japan, New Zealand, Sweden, Chinese Taipei] The two graphs above display the same information.

Reading bar graphs [Bar graph: Total Medals at the 2008 Olympic Games; bars: Japan, Australia, Kenya, New Zealand; vertical axis from 0 to 50] How many more medals does Australia have than Japan? Than Kenya?

Reading bar graphs [Bar graph: Race Times at the 2008 Olympic Games (50M); bars: Yei Yah Yellow, Davelaar Rodion, Hamadeh Anas, Hall Luke; vertical axis from 0 to 26] When data are close together, a bar graph can be difficult to read.

Reading bar graphs [Bar graph: Race Times at the 2008 Olympic Games (50M); same four athletes; vertical axis from 23.8 to 24.5] This graph displays the same data as the previous one, but the differences between the runners' times are more noticeable because the scale on the vertical axis has been changed.

Reading bar graphs [Bar graph: Race Times at the 2008 Olympic Games (50M); same four athletes; vertical axis from 23.8 to 24.5] However, do not be misled by a change in the scale. Yei Yah Yellow did not finish three times more quickly than Hall Luke (although this is how it would appear if the axis were not labeled). Instead, the difference between their times was less than half a second. This is why it is very important to always label axes. It cannot be assumed that an axis will always start at zero.

Reading bar graphs [Horizontal bar graph: Total Medals at the 2008 Olympic Games; bars: New Zealand, Kenya, Australia, Japan; horizontal axis from 0 to 50] How many fewer medals does New Zealand have than Japan? How many fewer than Australia?

Reading bar graphs [Vertical and horizontal bar graphs, both titled Total Medals at the 2008 Olympic Games; bars: JP, AU, KE, NZ; axes from 0 to 60] A vertical bar graph and a horizontal bar graph display the same information.

Creating bar graphs
Economy   Medals
Canada    18
Georgia   6
Korea     31
Brazil    15
Using the data set above, construct a horizontal bar graph and a vertical bar graph. Remember to title each graph, to label all axes, and to include a scale and tick marks.

Line graphs Line graphs are used to display continuous data such as changes in height or population over time. A line graph is used to display information as a series of data points connected by straight line segments. Line graphs are often used to display a trend in data over intervals of time, thus the line is often drawn chronologically. 29

Reading line graphs [Line graph: Total Medals Earned by Mexico, 1996-2008; horizontal axis: 1996, 2000, 2004, 2008; vertical axis from 0 to 7] How many medals did Mexico earn in 2000? 2008?

Reading line graphs [Same line graph: Total Medals Earned by Mexico, 1996-2008] In which year did Mexico earn the fewest medals? The most?

Reading line graphs [Same line graph: Total Medals Earned by Mexico, 1996-2008] Between 1996 and 2008, how many medals did Mexico win in all?

Reading line graphs [Same line graph: Total Medals Earned by Mexico, 1996-2008] How many more medals did Mexico earn in 2004 than in 1996?

Reading line graphs [Same line graph: Total Medals Earned by Mexico, 1996-2008] Between which two years did Mexico have the greatest increase in medals?

Reading line graphs [Same line graph: Total Medals Earned by Mexico, 1996-2008] Overall, does Mexico's medal count per year seem to be increasing or decreasing?

Reading line graphs [Line graph: Total Medals Earned by Mexico and Indonesia, 1996-2008; horizontal axis: 1996, 2000, 2004, 2008; vertical axis from 0 to 7; series: Mexico, Indonesia] Indonesia's medal counts have been added to the graph.

Reading line graphs [Same line graph: Total Medals Earned by Mexico and Indonesia, 1996-2008] How many medals did Indonesia earn in 1996? 2004?

Reading line graphs [Same line graph: Total Medals Earned by Mexico and Indonesia, 1996-2008] In which year(s) did Indonesia win the fewest medals? The most?

Reading line graphs [Same line graph: Total Medals Earned by Mexico and Indonesia, 1996-2008] How many medals did Indonesia earn in all, from 1996 to 2008?

Reading line graphs [Same line graph: Total Medals Earned by Mexico and Indonesia, 1996-2008] Which economy received more medals in 1996? How many more? What about 2008?

Reading line graphs [Same line graph: Total Medals Earned by Mexico and Indonesia, 1996-2008] In which year(s) was there the largest gap between the economies' medal counts? The smallest?

Understanding line graphs [Two graphs, both titled Total Medals Earned by Mexico, 1996-2008; horizontal axis: 1996, 2000, 2004, 2008; vertical axis from 0 to 7] Here are the data for just Mexico, presented in two different ways. Looking at trends over time, which graph better represents the data? Why?

Creating line graphs [Blank graph: Total Medals Earned by Thailand, 1996-2008; horizontal axis: 1996, 2000, 2004, 2008; vertical axis from 0 to 14] In 1996, 2000, 2004, and 2008, Thailand received 2, 3, 8, and 4 medals, respectively. Construct a line graph using these data.

Pie charts Pie charts are useful for categorizing and displaying parts of a whole. A pie chart (or a circle graph) is a circle divided into sectors that represent the proportion of the total. In a pie chart, the arc length of each sector (and consequently its central angle and area), is proportional to the quantity it represents. 44
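The proportionality rule above can be sketched in a few lines of Python. This is only an illustration: the function name `sector_angles` and the medal counts are hypothetical and are not taken from the slides.

```python
def sector_angles(counts):
    """Return the central angle, in degrees, for each category.

    Each angle is the category's share of the total times 360 degrees.
    """
    total = sum(counts.values())
    return {name: 360 * value / total for name, value in counts.items()}


# Hypothetical medal counts, chosen only to illustrate the calculation.
medals = {"Gold": 2, "Silver": 5, "Bronze": 3}

for name, angle in sector_angles(medals).items():
    share = angle / 360
    print(f"{name}: {angle:.0f} degrees ({share:.0%} of the circle)")
```

Because every angle is a share of 360 degrees, the angles always add up to a full circle, which is why a pie chart only makes sense for parts of a whole.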

Reading pie charts We could use a pie chart to look at how many of each type of medal an economy received. [Pie chart: Poland's Medal Count, 2008 Olympic Games; sectors: Gold, Silver, Bronze] In the 2008 Olympic Games, which type of medal did Poland win the most? The least? Approximately what percentage of medals were silver?

Reading pie charts How many silver medals did Poland receive? How many medals did Poland receive in total? [Pie chart: Poland's Medal Count, 2008 Olympic Games; sectors: Gold, Silver, Bronze; data labels: 6, 1, 3] Now, what is the actual percentage of silver medals won? What is the actual percentage of bronze medals won?

Understanding pie charts [Bar graph and pie chart, both titled Poland's Medal Count, 2008 Olympic Games; categories: Gold, Silver, Bronze] Here is the same information, presented as a bar graph and a pie chart. When deciding what percentage of a total each part represents, which graph is better? Why?

Deciding which representation to use For each of the following data sets, which representation would be most appropriate? Why?
Athlete             50M time
Yei Yah Yellow      24.00
Davelaar Rodion     24.21
Hamadeh Anas        24.40
Hall Luke           24.41
Lee Daniel          24.92
Camal Chakyl        24.93
Roberts Niall       25.13
Attoumane Mohamed   29.63
Economy         Medals won for wrestling
China           3
Japan           6
United States   3
Canada          2
Year   Medals won by China
1996   50
2000   59
2004   63
2008   100

Additional resources Reviewing these representations: http://www.mathleague.com/help/data/data.htm#linegraphs http://www.bbc.co.uk/schools/ks3bitesize/maths/handling_data/representing_data/revise1.shtml Graph construction tools: http://www.shodor.org/interactivate/activities

STATISTICS MODULES MODULE 2: Descriptive Statistics 50

Module Overview Measures of central tendency Mean Median Range Standard deviation 51

Understanding the mean [Bar graph: Gold Medals, 2008 Olympic Games; bars: Thailand, Ukraine, Romania, France; vertical axis from 0 to 8] How many gold medals did Thailand, Ukraine, Romania, and France win in total?

Understanding the mean [Same bar graph: Gold Medals, 2008 Olympic Games] Redistribute these data to find the mean. What is the mean of these data?

Understanding the mean When calculating the mean, always start by finding the total.
Economy    Silver medals
Malaysia   1
Japan      6
Canada     9
Egypt      0
[Pictograph with rows: Malaysia, Japan, Canada, Egypt] What is the total number of silver medals won by all four economies?

Understanding the mean [Pictograph with rows: Malaysia, Japan, Canada, Egypt] After adding together the medals, divide them equally among the economies.

Understanding the mean [Pictograph with rows: Malaysia, Japan, Canada, Egypt, each now showing 4 medals] 4 is the arithmetic average, or mean, of the silver medals won by these four economies.

Understanding the mean The mean can be computed easily: First, total all data.
Economy        Silver medals
Malaysia       1
Japan          6
Canada         9
Egypt          0
Total medals   16
Then, divide by the number of categories. Total economies: 4. The quotient is your mean: 16 ÷ 4 = 4.
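The two steps above (total the data, then divide by the number of categories) can be sketched in Python. The function name `mean` and the dictionary layout are choices made for this illustration only; the medal counts are the ones from the slide.

```python
def mean(values):
    """Total the values, then divide by how many there are."""
    total = sum(values)          # step 1: total all data
    return total / len(values)   # step 2: divide by the number of categories


# Silver medal counts from the table above.
silver = {"Malaysia": 1, "Japan": 6, "Canada": 9, "Egypt": 0}

print(mean(list(silver.values())))  # 16 / 4 = 4.0
```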

Calculating the mean
Economy         2000 (Sydney)   2004 (Athens)   2008 (Beijing)
China           59              63              100
United States   91              102             110
Australia       58              49              46
Economy         Total medals    Number of categories   Mean
China           222             3
United States
Australia
Given the data above, complete the second table and find the average medal count for each economy.

Calculating the mean If (a + b + c) ÷ 3 = m, where m represents the mean, try to solve the problem below.
Round         Score
Preliminary   453.6
Semifinal     518.8
Final         ?
Mean          503.0
At the 2008 Beijing Olympics, Canadian diver Alexandre Despatie earned the silver medal in the Men's 3M Springboard competition. What score did he receive for his final dive?
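Working backwards from a mean uses the same relationship rearranged: the total must equal the mean times the number of values, so the missing value is that total minus the known values. The sketch below is an illustration only; the helper name `missing_value` is hypothetical, and it is demonstrated on the earlier silver medal data rather than on the diving scores, so the exercise is left open.

```python
def missing_value(target_mean, known_values):
    """Return the one missing value that gives the stated mean.

    If there are n values in all, their total must be target_mean * n,
    so the missing value is that total minus the sum of the known values.
    """
    n = len(known_values) + 1
    return target_mean * n - sum(known_values)


# Silver medals: Malaysia 1, Japan 6, Canada 9, mean of all four is 4.
# The missing value should be Egypt's count.
print(missing_value(4, [1, 6, 9]))  # 4 * 4 - 16 = 0
```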

Understanding the median
Rank  Athlete           Economy                Time
1     Yei Yah Yellow    Nigeria                24.00
2     Davelaar Rodion   Netherlands Antilles   24.21
3     Hamadeh Anas      Jordan                 24.40
4     Hall Luke         Swaziland              24.41
5     Lee Daniel        Sri Lanka              24.92
6     Camal Chakyl      Mozambique             24.93
7     Roberts Niall     Guyana                 25.13
Here is a table of results from the sixth Men's 50M Freestyle heat. The median is the middle value. Since the times are ordered from least to greatest, the median or middle value is 24.41, the 4th of the 7 times.

Understanding the median
Rank  Athlete             Economy                Time
1     Yei Yah Yellow      Nigeria                24.00
2     Davelaar Rodion     Netherlands Antilles   24.21
3     Hamadeh Anas        Jordan                 24.40
4     Hall Luke           Swaziland              24.41
5     Lee Daniel          Sri Lanka              24.92
6     Camal Chakyl        Mozambique             24.93
7     Roberts Niall       Guyana                 25.13
8     Attoumane Mohamed   Comoros                29.63
Here is a table of all 8 results from the sixth Men's 50M Freestyle heat. The median is the middle value, but when there is an even number of data points, the median is the average of the two middle values, here the 4th and 5th times: (24.41 + 24.92) ÷ 2 = 24.665.
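The odd and even cases described above can be sketched as one small Python function. The function name `median` is a choice made for this illustration; the times are the ones from the heat results.

```python
def median(values):
    """Return the middle value of the data after sorting.

    With an odd count there is a single middle value; with an even
    count the median is the average of the two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


times = [24.00, 24.21, 24.40, 24.41, 24.92, 24.93, 25.13, 29.63]

print(median(times[:7]))  # first 7 times: the 4th value, 24.41
print(median(times))      # all 8 times: average of the 4th and 5th values
```

Sorting inside the function means the data do not have to arrive already ordered, which is an easy step to forget when finding a median by hand.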

Using the median and the mean
Rank  Athlete             Economy                Time
1     Yei Yah Yellow      Nigeria                24.00
2     Davelaar Rodion     Netherlands Antilles   24.21
3     Hamadeh Anas        Jordan                 24.40
4     Hall Luke           Swaziland              24.41
5     Lee Daniel          Sri Lanka              24.92
6     Camal Chakyl        Mozambique             24.93
7     Roberts Niall       Guyana                 25.13
8     Attoumane Mohamed   Comoros                29.63
Here is the same table of results from the sixth Men's 50M Freestyle heat. What is the mean? What is the median? Which is a more accurate descriptor of the results? Why?

Using the median and the mean
Rank  Athlete             Economy                Time
1     Yei Yah Yellow      Nigeria                24.00
2     Davelaar Rodion     Netherlands Antilles   24.21
3     Hamadeh Anas        Jordan                 24.40
4     Hall Luke           Swaziland              24.41
5     Lee Daniel          Sri Lanka              24.92
6     Camal Chakyl        Mozambique             24.93
7     Roberts Niall       Guyana                 25.13
      MEAN: 25.2
8     Attoumane Mohamed   Comoros                29.63
The mean, 25.2, falls near the bottom of the distribution: seven of the eight times are faster than the mean.

Using the median
Rank  Athlete             Economy                Time
1     Yei Yah Yellow      Nigeria                24.00
2     Davelaar Rodion     Netherlands Antilles   24.21
3     Hamadeh Anas        Jordan                 24.40
4     Hall Luke           Swaziland              24.41
      MEDIAN: 24.665
5     Lee Daniel          Sri Lanka              24.92
6     Camal Chakyl        Mozambique             24.93
7     Roberts Niall       Guyana                 25.13
8     Attoumane Mohamed   Comoros                29.63
But the median, 24.665, falls in the middle of the distribution. The median is a better descriptor of the data in this situation.

Understanding range
Rank  Athlete             Economy                Time
1     Yei Yah Yellow      Nigeria                24.00
2     Davelaar Rodion     Netherlands Antilles   24.21
3     Hamadeh Anas        Jordan                 24.40
4     Hall Luke           Swaziland              24.41
5     Lee Daniel          Sri Lanka              24.92
6     Camal Chakyl        Mozambique             24.93
7     Roberts Niall       Guyana                 25.13
8     Attoumane Mohamed   Comoros                29.63
The range of a set of data is simply the difference between the greatest value and the least value. For this set of data, the range is 29.63 - 24.00, or 5.63 seconds.

Standard deviation Standard deviation is a widely used measure of variability or diversity used in statistics. The standard deviation shows how much variation or dispersion exists from the mean. A low standard deviation indicates that the data points tend to be very close to the mean and not very spread out. A high standard deviation indicates that the data points are spread out over a large range of values.

Calculating the standard deviation To find the standard deviation of a set of data: Find the square root of the average of the square of the difference between the mean and each data point. Algebraically, this is written as: $\sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2}$ where N = the number of data points, $x_i$ is the i-th data point and $\bar{x}$ is the mean of the data points.

Calculating the standard deviation First identify the data and calculate its mean.
Time    Mean   Difference   Difference squared
24.00   25.2
24.21   25.2
24.40   25.2
24.41   25.2
24.92   25.2
24.93   25.2
25.13   25.2
29.63   25.2

Calculating the standard deviation Then calculate the difference between each element in the data set and the mean.
Time    Mean   Difference   Difference squared
24.00   25.2   1.2
24.21   25.2   0.99
24.40   25.2   0.8
24.41   25.2   0.79
24.92   25.2   0.28
24.93   25.2   0.27
25.13   25.2   0.07
29.63   25.2   -4.43

Calculating the standard deviation Then square the differences and find the average of the squared differences. The standard deviation for these data is the square root of this average.
Time    Mean   Difference   Difference squared
24.00   25.2   1.2          1.44
24.21   25.2   0.99         0.98
24.40   25.2   0.8          0.64
24.41   25.2   0.79         0.62
24.92   25.2   0.28         0.08
24.93   25.2   0.27         0.07
25.13   25.2   0.07         0.005
29.63   25.2   -4.43        19.62
Total                       23.45
The average of the squared differences = 23.45 ÷ 8 = 2.93. The square root of 2.93 = 1.71. 1.71 is the standard deviation of these 8 times. Now calculate the standard deviation for only the first 7 times and observe how much smaller it is than the standard deviation of all 8 times.
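The same steps can be sketched in Python as a check on the hand calculation. The function name `population_sd` is a choice made for this illustration. Note one difference from the table: the code uses the unrounded mean rather than 25.2, so its intermediate values differ slightly, but the result still rounds to 1.71.

```python
import math


def population_sd(values):
    """Square root of the average squared difference from the mean."""
    n = len(values)
    mean = sum(values) / n
    # Average of the squared differences (dividing by N, as in the formula).
    variance = sum((x - mean) ** 2 for x in values) / n
    return math.sqrt(variance)


times = [24.00, 24.21, 24.40, 24.41, 24.92, 24.93, 25.13, 29.63]

print(round(population_sd(times), 2))      # all 8 times: 1.71
print(round(population_sd(times[:7]), 2))  # first 7 times only
```

Running the second line shows how strongly a single extreme value, the 29.63 second time, inflates the standard deviation.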

Standard deviation and distribution When data are normally distributed about a mean, approximately 68% of the data fall within one standard deviation of the mean, approximately 95% of the data fall within two standard deviations of the mean, and approximately 99.7% of the data fall within three standard deviations of the mean.

Using measures of central tendency
Economy         2008 Number of gold medals
Australia       14
Canada          3
China           51
Japan           9
New Zealand     3
Russia          23
Korea           13
United States   36
Use the data to find the mean, median, range and standard deviation. Explain what each statistic tells you about the data.
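The three percentages can be checked with a short Python sketch. It relies on a standard result that is not stated in the slides: for a normal distribution, the share of data within k standard deviations of the mean equals erf(k / √2), where erf is the error function. The function name `share_within` is hypothetical.

```python
import math


def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```

The printed values are slightly more precise than the rounded 68%, 95% and 99.7% figures quoted on the slide.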

Additional resources Mean, median and range: http://www.mathsisfun.com/mean.html http://www.mathsisfun.com/median.html http://www.mathsisfun.com/data/range.html Standard deviation: http://www.mathsisfun.com/data/standard-deviation.html http://www.mathsisfun.com/data/standard-deviation-calculator.html

STATISTICS MODULES MODULE 3: Bivariate statistics 77

Module Overview Scatter Plots Correlation Lines of Best Fit 78

Arranging observations At Doctor Monroe's pediatric clinic, every child has their height measured and weight checked upon each visit. One day, a nurse was looking through the height and weight charts of 9 children. She decided to compile a list of each child's height (inches) and weight (lbs) and see if there was any discernible pattern.
Child   Height   Weight
1       36       66
2       32       60
3       60       120
4       48       90
5       26       44
6       55       113
7       40       80
8       42       90
9       29       50

Bivariate Data Two Variables
Child   Height   Weight
1       36       66
2       32       60
3       60       120
4       48       90
5       26       44
6       55       113
7       40       80
8       42       90
9       29       50
When there are two variables (height and weight) for every observation (child), as in this table, the data are known as bivariate data. The nurse wanted to find a pattern between height and weight, but this table alone is not very revealing. It helps to visualize the data with a graph.

Exercise Draw two axes (x and y) and label the x-axis as Weight and the y-axis as Height. What should the range on the x-axis and the y-axis be? Should the range be the same for both axes, or different? Plot the height and weight for each of the 9 children. Result on next slide.

Scatter Plot This type of graph is called a scatter plot. This scatter plot shows a linear, upward-sloping relationship between height and weight. Notice that linear relationships are much more apparent on a scatter plot than in a table; here the plot reveals a positive relationship between height and weight.

Interpreting the Data How should we interpret this scatter plot? What does the linear pattern among the nine children seem to indicate? Are taller children always heavier? Are heavy children necessarily tall? To address these questions, we need to look at a mathematical concept known as correlation. 83

Defining Correlation In the previous example, we observed a linear relationship between height and weight. Correlation describes the strength of the linear relationship between height and weight. Look at the scatter plot and describe what you see. For example: Height and weight seem to increase together. In addition, they seem to increase at fairly regular intervals (in a straight line). When we observe such patterns, we say that height and weight seem to be correlated. 84

Types of Correlation Perfect Positive Correlation Perfect Negative Correlation 85

Types of Correlation Positive Correlation Negative Correlation 86

Types of Correlation No Correlation What does no correlation mean? Assume that for the graph on the left, the x-axis is weight and the y-axis is height. If there is no correlation between height and weight, does that mean height and weight have no determinable relationship? Does zero correlation mean that height has nothing to do with weight? 87

Defining the Correlation Coefficient We can quantify our observations on height and weight using the correlation coefficient. The correlation coefficient, also known as the Pearson Product-Moment Correlation, measures the degree of linear correlation between variables. It is written ρ, pronounced rho (for population data), or r (for sample data).

Correlation The absolute value of ρ or r describes the magnitude of the linear relationship between two variables. The sign describes the direction. If r is negative, are the points moving up or down? If r is positive?

Different Values of r The correlation coefficient always satisfies -1 ≤ r ≤ 1. r = 1: perfect positive correlation; the points lie exactly on an upward-sloping line, so as one variable gets bigger, the other variable gets bigger. r = -1: perfect negative correlation; the points lie exactly on a downward-sloping line, so as one variable gets bigger, the other variable gets smaller. r = 0: no linear correlation; the variables have no linear relationship.

Different Values of r The closer r is to 1 or -1, the stronger the correlation and the more tightly the points cluster around a line. The closer r is to 0, the weaker the correlation and the farther the points are from a line. Remember that r = 0 does NOT mean there is no relationship between two variables. It is possible for two variables to have zero linear relationship but still have a strong non-linear relationship.
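A small Python sketch can make the last point concrete. The data below are hypothetical, chosen so that y is completely determined by x (y = x²) and yet the linear correlation is zero. The helper `pearson_r` is written for this illustration, using the standard deviation-from-the-mean form of the coefficient.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about each mean."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator


# Hypothetical data: y depends entirely on x, but not along a line.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

print(pearson_r(xs, ys))  # zero linear correlation despite a perfect curve
```

The points form a U shape: the downward-sloping left half cancels the upward-sloping right half, so r cannot detect the relationship even though knowing x tells you y exactly.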

Correlation is not Causation Correlation is a mathematical concept that helps us observe patterns in bivariate data and make predictions about missing data values. But correlation between two variables does not necessarily mean that one variable caused the other; it does not demonstrate cause and effect. Think back to height v. weight. Are taller children always heavier? Are heavy children necessarily tall? Before, we only demonstrated a mathematical linear relationship, not a real world one. 92

Why is correlation useful? If we can find a linear pattern between two variables, we can estimate values for cases we have not observed. Refer back to the height vs. weight example. We only had 9 children in the sample. But because we recognize the correlation, we can use the scatter plot to make predictions about the height of a child whose weight is known or about the weight of a child whose height is known.

Visual Exercise
Match each r value with the appropriate scatter plot: r = 1, r = -0.54, r = 0.17, r = 0.89.

Exercise Observations
Notice that the weaker the correlation (an r value closer to zero), the more scattered the points are. The stronger the correlation (an r value closer to 1 or -1), the more tightly clustered the points are in a linear pattern.

Analytical Exercise
The correlation between income and education level is 0.89. The correlation between income and number of hours spent watching television is 0.23.
- True or false: A college education tends to be associated with higher overall income.
- True or false: Watching television will reduce income.

Calculating the Correlation Coefficient
The mathematical formula for the Pearson coefficient, r, between two variables X and Y is the following:
$r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}$
where x and y are deviations from the respective means, $x = X - \bar{X}$ and $y = Y - \bar{Y}$.
You will rarely have to compute this by hand. Most software packages like Excel and many graphing calculators have a correlation function that will do these calculations for you.
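
To see what a correlation function does internally, the deviation-score form of the coefficient (each value measured from its own mean) can be traced one sum at a time. This is a minimal sketch; the weights and heights below are invented for illustration and are not the module's nine-child data set.

```python
import math

# Hypothetical sample of 9 children (made-up numbers, NOT the module's data).
weights = [40, 45, 48, 52, 55, 60, 63, 68, 72]   # pounds
heights = [40, 43, 43, 46, 45, 49, 48, 52, 53]   # inches

n = len(weights)
mean_w = sum(weights) / n
mean_h = sum(heights) / n

# Deviations from the means: x = X - mean(X), y = Y - mean(Y).
x = [w - mean_w for w in weights]
y = [h - mean_h for h in heights]

# The three sums that appear in the formula.
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

r = sum_xy / math.sqrt(sum_x2 * sum_y2)

print(f"sum xy = {sum_xy:.2f}, sum x^2 = {sum_x2:.2f}, sum y^2 = {sum_y2:.2f}")
print(f"r = {r:.3f}")
```

Because the invented points sit close to a rising line, r comes out close to, but below, 1.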

From Correlation to Linear Regression
After drawing a scatter plot and calculating the correlation coefficient, we have determined that our bivariate data are correlated. Why is this useful? If there is a linear pattern between two variables, we can predict values that we have not observed. Refer back to the height vs. weight example. We had 9 children in the sample. Suppose we add a 10th child to the sample, and this new child weighs 50 pounds. Using only the data from the other 9 children, can we predict how tall the 10th child will be?

Linear Regression
We can predict the height of the 10th child through regression. Regression is a way of modeling a patterned relationship between two variables. The line produced by linear regression is also called the line of best fit or trend line. We say the line has a good fit if it is close to the data points. In most cases, the line will not pass through all of the data points, and in some cases, it may not pass through any of them. However, the line will tend to have a roughly equal number of data points above and below it.

Linear Regression, Line of Best Fit, or Trend Line
Here is the regression line, shown in red, and its equation (Y=1)44 1.23 x for the data represented by the set of points in the scatter plot.

What Does the Regression Tell Us?
This regression line is a predictive line, or in other words, a generalization based upon observed data points. With linear regression, we assume that the linear pattern we found amongst the observed data points will also hold for unobserved data. That is, the linear relationship between the two variables derived from the sample should also hold for the population.

The Error or Residual
When we draw a line of best fit, there are two types of data on the scatter plot:
- Observed values: the data points
- Predicted values: the regression line
The vertical distance between the observed and predicted values is known as the error or residual. As the name suggests, we want to make the errors as small as possible. Our regression line should thus be drawn as close as possible to the values actually observed.

Visualizing the Error
Notice the vertical distance between each of the data points and the line. A regression line is the line for which the sum of the squares of these vertical distances is a minimum. That is why a regression line is often referred to as a least squares line.

Calculating Linear Regression
Since we want to minimize the errors, we want our regression line to be as precise as possible. In fact, there is a more accurate way to find a regression line than eyeballing it on a scatter plot. We can think of linear regression as a linear equation that passes through, or fits, the observed data points. The linear regression equation always describes one variable as a function of the other. Think y = ax + b, where x and y represent bivariate data values, and y is a function of x.

Keep Your Terms Straight
In statistical analysis, terminology is very important. If weight is the independent variable and height is the dependent variable, then we regress height on weight. Once we make this distinction, we cannot say the opposite, that weight is regressed on height. The correct phrasing is always "dependent regressed on independent."

Attributes of the Regression Equation
To summarize, a linear regression equation should do the following:
- Describe a linear pattern of a bivariate data set that demonstrates correlation.
- Be a mathematical equation wherein one variable is a function of the other.
- Minimize the errors, i.e. the distances between actual and predicted values.

Calculating the Regression Equation
Now that we have established what we are looking for in a regression equation, how do we go about calculating the actual line of best fit? In statistics, all linear regressions are based upon the population equation:
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
where $\beta_0$ is the y-intercept, $\beta_1$ is the slope, and $\varepsilon_i$ is the error term.
Do not be intimidated by this equation! Interpret it just as you would interpret a simple linear equation; think of y = ax + b.

Interpreting the Population Equation
The population equation is the linear equation which accurately predicts values for the entire population, or the entire set of possible data values. Think back to our first example of height and weight. We only had nine children in the sample. Can we use the information from nine children to predict the height and weight of all children across the entire world with 100% accuracy? Unfortunately not.
The more data points we have, the more accurate the estimate will be. But where do we draw the line? When can we stop getting data? How many thousands, millions, or billions of children will we have to weigh and measure in order to get a 100% accurate linear regression equation? The truth is, we could go on collecting data forever and still not collect enough for the entire population. This means that we will never know the exact values of the population equation. The values we find for $\beta_0$ and $\beta_1$ will at best be close estimates.

The Sample Regression Equation
The sample linear regression equation is an estimate of the population regression equation:
$y_i = b_0 + b_1 x_i + e_i$, with fitted value $\hat{y}_i = b_0 + b_1 x_i$
- $\hat{y}$ is an estimate of $Y$
- $b_0$ is an estimate of $\beta_0$
- $b_1$ is an estimate of $\beta_1$
- $e_i$ is an estimate of $\varepsilon_i$
Yes, we even estimate the error.

Estimation and Confidence Bands
Estimates contain a certain degree of uncertainty. This uncertainty is expressed in confidence bands about the regression line. The confidence bands have the same interpretation as the standard error of the mean, except that the uncertainty varies according to the location along the line. The uncertainty is least at the sample mean of the X-values and gets larger as the distance from the mean increases.
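
Why the band is narrowest at the mean of the X-values can be seen from the textbook standard error of the fitted mean at a point x0, which is s * sqrt(1/n + (x0 - x̄)² / Σ(x - x̄)²), where s is the residual standard deviation. The sketch below assumes the usual simple linear regression setup and uses invented data; the name se_fitted_mean is hypothetical. A full confidence band would multiply this standard error by a t critical value, which is left out here.

```python
import math

# Hypothetical sample of 9 children (made-up numbers, NOT the module's data).
weights = [40, 45, 48, 52, 55, 60, 63, 68, 72]   # pounds (x)
heights = [40, 43, 43, 46, 45, 49, 48, 52, 53]   # inches (y)

n = len(weights)
mean_x = sum(weights) / n
mean_y = sum(heights) / n
sxx = sum((x - mean_x) ** 2 for x in weights)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(weights, heights))

# Least-squares slope and intercept.
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Residual standard deviation, using n - 2 degrees of freedom.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(weights, heights))
s = math.sqrt(sse / (n - 2))

def se_fitted_mean(x0):
    """Standard error of the estimated mean response at x0."""
    return s * math.sqrt(1 / n + (x0 - mean_x) ** 2 / sxx)

for x0 in (40, 50, mean_x, 60, 72):
    print(f"x0 = {x0:6.2f}  fitted = {b0 + b1 * x0:6.2f}  se = {se_fitted_mean(x0):.3f}")
```

The printed standard error is smallest at the sample mean of the weights and grows as x0 moves away from it in either direction.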

Minimizing the Errors
As we have already established, we want a regression equation which minimizes the distances between the actual observations, y, and the estimated values, ŷ, for each corresponding value of x. How do we minimize the difference (error, residual) between observed values and estimated values using our sample regression equation?

Minimizing the Errors Continued
Consider the following equation:
$\min \sum (y - \hat{y})$
At first, this seems correct because this equation minimizes the sum of the differences between actual and predicted y. But remember that when we drew our regression line, about half of the y values were above and half were below the line. There are therefore both positive and negative error terms which would cancel each other out when added together. This could give us a false impression of the regression line, making it seem as if the distances between actual and predicted y are negligible when really they are just cancelling each other out.

Sum of Least Squares
To solve this problem of negative and positive errors, we take the sum of squared errors:
$SSE = \sum (y - \hat{y})^2$
The regression line is the line that minimizes the SSE. By squaring the distance between y and ŷ, we make sure that we only deal with positive distances that will not cancel each other out.
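
The cancellation problem is easy to reproduce. In the sketch below (made-up points and a hypothetical helper named errors), a flat line that ignores the trend entirely still has raw errors that sum to zero; only the squared errors reveal that it fits badly.

```python
# Made-up points that lie exactly on the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

def errors(intercept, slope):
    """Observed y minus predicted y for the line y = intercept + slope * x."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

flat = errors(6, 0)     # horizontal line at the mean of y: a poor fit
exact = errors(0, 2)    # the line the points actually lie on

sum_flat = sum(flat)
sse_flat = sum(e * e for e in flat)
sum_exact = sum(exact)
sse_exact = sum(e * e for e in exact)

print(sum_flat, sse_flat)     # 0 40
print(sum_exact, sse_exact)   # 0 0
```

Both lines have a raw error sum of 0, so that sum cannot tell them apart. The sum of squared errors can: 40 for the flat line against 0 for the exact one.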

Sum of Least Squares
Using the SSE, we can derive an equation for the slope and y-intercept of the sample regression:
$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
$b_0 = \bar{y} - b_1 \bar{x}$
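
The least-squares slope and intercept can be applied directly to the earlier question about a 10th child weighing 50 pounds. The slope is Σ(x - x̄)(y - ȳ) divided by Σ(x - x̄)², and the intercept is ȳ minus the slope times x̄. The following is a sketch under stated assumptions: the nine weight and height pairs are invented, so the predicted height is illustrative only, and the helper names predict_height and sse are hypothetical.

```python
# Hypothetical sample of 9 children (made-up numbers, NOT the module's data).
weights = [40, 45, 48, 52, 55, 60, 63, 68, 72]   # pounds (independent, x)
heights = [40, 43, 43, 46, 45, 49, 48, 52, 53]   # inches (dependent, y)

n = len(weights)
mean_x = sum(weights) / n
mean_y = sum(heights) / n

# Slope and intercept from the least-squares formulas.
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(weights, heights))
      / sum((x - mean_x) ** 2 for x in weights))
b0 = mean_y - b1 * mean_x

def predict_height(weight):
    """Fitted height for a given weight on the sample regression line."""
    return b0 + b1 * weight

def sse(intercept, slope):
    """Sum of squared errors of the line y = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2
               for x, y in zip(weights, heights))

residuals = [y - predict_height(x) for x, y in zip(weights, heights)]
predicted_height = predict_height(50)

print(f"height = {b0:.2f} + {b1:.3f} * weight")
print(f"predicted height at 50 lb: {predicted_height:.1f} in")
print(f"SSE of fitted line: {sse(b0, b1):.3f}")
print(f"SSE with intercept shifted by 0.5: {sse(b0 + 0.5, b1):.3f}")
```

Shifting the intercept or the slope away from the computed values can only increase the SSE, which is what "least squares" means. The residuals of the fitted line also sum to zero, up to rounding.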

Interpolation vs. Extrapolation
Whenever you run a linear regression, the range of the data should be carefully observed. When we make predictions within the range of predictor values provided by the sample, this is called interpolation. This is generally a safe tactic. However, making predictions outside of the range of predictor values is known as extrapolation. The more removed the prediction is from the range of values used to fit the model, the riskier the prediction becomes because there is no way to check that the relationship continues to be linear. Consider, for example, a linear model relating weight gain to age for children. Applying this model to adults would be inappropriate since the correlation between age and weight gain is not consistent for all age groups.
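
A simple guard can flag risky predictions before they are made. The function below is a hypothetical helper, not part of any statistics package, and the weights are invented. It only checks whether a new predictor value falls inside the range of values used to fit the model.

```python
def prediction_kind(x_new, observed_x):
    """Label a prediction by whether x_new lies inside the observed range."""
    lowest, highest = min(observed_x), max(observed_x)
    if lowest <= x_new <= highest:
        return "interpolation"
    return "extrapolation"

# Hypothetical predictor values (weights in pounds) used to fit a model.
weights = [40, 45, 48, 52, 55, 60, 63, 68, 72]

for w in (50, 72, 30, 150):
    print(w, prediction_kind(w, weights))
```

A weight of 50 pounds falls inside the observed range of 40 to 72, so predicting its height is interpolation. A weight of 150 pounds is far outside the range, so the prediction would be an extrapolation.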

Additional resources
- Scatter plots: http://stattrek.com/ap-statistics-1/scatterplot.aspx
- Correlation: http://www.statsoft.com/textbook/basic-statistics/#correlations
- Linear regression: http://www.stat.yale.edu/courses/1997-98/101/linreg.htm and http://www.duke.edu/~rnau/regintro.htm

STATISTICS MODULES
MODULE 4: Misrepresenting Data

Misrepresentation of statistics
Using Olympic race times since 1900, researchers have noticed that women's running times are improving at a faster rate than men's. They offer the following projection:

Misrepresentation of statistics
The researchers suggest that by the 2156 Olympics, women will overtake men. This has led many news outlets to declare that females will eventually become faster runners than males:
"An analysis of the shrinking gender gap in athletic performance indicates that women athletes are catching their male counterparts"
"A study has shown that women are running faster than they have ever done over 100 meters and, at their current rate of improvement, female sprinters should have overtaken men within 150 years."
The Independent, "Women of the future on track to run faster than men"

Misrepresentation of statistics
The authors of the study admit that their analysis may have some weaknesses. What do you think these weaknesses are? Can the rate of improvement of female runners be expected to remain stable over the next several decades? Is The Independent wrong in declaring that "women of the future [are] on track to run faster than men"?

Misrepresentation of statistics
Weaknesses in the analysis:
- Though Olympic teams have historically put less effort into recruiting women, this is beginning to change.
- As it becomes more socially acceptable for women to be athletically involved, more females are taking up sports, including running.
- Olympians are outliers by nature, so applying the sample to the entire population (women) is problematic.
- Perhaps women are not getting faster, as The Independent reports, but recruiters are looking more closely for those outliers, and now have a larger selection to choose from.

Misrepresentation of statistics
A misleading linear relationship
- This claim assumes that just because women are currently improving their times at a faster rate than men, they will continue to do so. Is this a safe assumption?
- Consider biological limits to improvement; males, aided by testosterone, naturally have greater muscle mass, less body fat, a larger heart, and more hemoglobin.
- A linear pattern, in this case, does not make sense.
- Performance will undoubtedly plateau at some point; otherwise, runners by the 28th century will have race times of under two seconds, clearly an impossible feat.

Misrepresentation of statistics
[Figure: two panels, "The Projection" and "An alternative hypothesis", plotting winning time in seconds (6 to 13) against Olympic year (1900 to 2252).]
In the alternative scenario, a plateau is reached and female runners do not overtake male runners.
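
The difference between the two scenarios can be sketched with two toy models: a straight line that keeps falling forever, and a curve that levels off at a floor. Every number below is a hypothetical placeholder chosen for illustration; none of it is Olympic data or the researchers' fitted model.

```python
import math

# All parameters are hypothetical placeholders, NOT fitted to Olympic results.
START_YEAR = 1900
START_TIME = 12.0     # seconds at the start year
LINEAR_RATE = 0.012   # seconds of improvement per year in the linear model
FLOOR_TIME = 9.5      # assumed limit in the plateau model
DECAY = 0.01          # how quickly the plateau model approaches the floor

def linear_time(year):
    """Straight-line projection: improves by the same amount every year."""
    return START_TIME - LINEAR_RATE * (year - START_YEAR)

def plateau_time(year):
    """Alternative projection: improvement slows and levels off at a floor."""
    gap = START_TIME - FLOOR_TIME
    return FLOOR_TIME + gap * math.exp(-DECAY * (year - START_YEAR))

for year in (1900, 2000, 2156, 2500, 3000):
    print(f"{year}: linear = {linear_time(year):6.2f} s, "
          f"plateau = {plateau_time(year):5.2f} s")
```

Extended far enough, the straight line predicts times at or below zero seconds, which is impossible, while the plateau model never drops below its floor. Within the years actually observed the two models can look similar, which is why extrapolating a line so far beyond the data is risky.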

Summary
It is hoped that the various statistical concepts presented in this module have given you a deeper understanding of how statistics, from representing and interpreting data to using statistics to describe data, is a powerful tool for making sense of quantitative information. You might now want to turn to the 10 Statistics Investigations and the Master APEC Statistics Spreadsheet and explore how statistics can be used in realistic situations.