Biostatistics Advanced Methods in Biostatistics IV

Similar documents
Announcements. Lecture 19: Inference for SLR & Transformations. Online quiz 7 - commonly missed questions

Navigate to the golf data folder and make it your working directory. Load the data by typing

Geology 10 Activity 8 A Tsunami

Chapter 12 Practice Test

STANDARDIZED CATCH RATE OF SAILFISH (Istiophorus platypterus) CAUGHT BY BRAZILIAN LONGLINERS IN THE ATLANTIC OCEAN ( )

Section I: Multiple Choice Select the best answer for each problem.

SCIENTIFIC COMMITTEE SECOND REGULAR SESSION August 2006 Manila, Philippines

Bioequivalence: Saving money with generic drugs

Anabela Brandão and Doug S. Butterworth

Atmospheric Rossby Waves in Fall 2011: Analysis of Zonal Wind Speed and 500hPa Heights in the Northern and Southern Hemispheres

PROPOSAL OF NEW PROCEDURES FOR IMPROVED TSUNAMI FORECAST BY APPLYING COASTAL AND OFFSHORE TSUNAMI HEIGHT RATIO

save percentages? (Name) (University)

New Earthquake Classification Scheme for Mainshocks and Aftershocks in the NGA-West2 Ground Motion Prediction Equations (GMPEs)

METHODS PAPER: Downstream Bathymetry and BioBase Analyses of Substrate and Macrophytes

Boyle s Law. Pressure-Volume Relationship in Gases. Figure 1

CVEN Computer Applications in Engineering and Construction. Programming Assignment #4 Analysis of Wave Data Using Root-Finding Methods

Geology 15 Activity 5 A Tsunami

The Simple Linear Regression Model ECONOMETRICS (ECON 360) BEN VAN KAMMEN, PHD

WAVE MECHANICS FOR OCEAN ENGINEERING

Section 6. The Surface Circulation of the Ocean. What Do You See? Think About It. Investigate. Learning Outcomes

Training program on Modelling: A Case study Hydro-dynamic Model of Zanzibar channel

y ) s x x )(y i (x i r = 1 n 1 s y Statistics Lecture 7 Exploring Data , y 2 ,y n (x 1 ),,(x n ),(x 2 ,y 1 How two variables vary together

Announcements. Unit 7: Multiple Linear Regression Lecture 3: Case Study. From last lab. Predicting income

Conservation Limits and Management Targets

3.6 Magnetic surveys. Sampling Time variations Gradiometers Processing. Sampling

Real-Time Electricity Pricing

Calculation of Trail Usage from Counter Data

Multilevel Models for Other Non-Normal Outcomes in Mplus v. 7.11

Minimal influence of wind and tidal height on underwater noise in Haro Strait

Currents measurements in the coast of Montevideo, Uruguay

STUDY ON TSUNAMI PROPAGATION INTO RIVERS

Applying Hooke s Law to Multiple Bungee Cords. Introduction

DEVELOPMENT OF A ROBUST PUSH-IN PRESSUREMETER

Grade: 8. Author(s): Hope Phillips

ISyE 6414 Regression Analysis

Overview. Learning Goals. Prior Knowledge. UWHS Climate Science. Grade Level Time Required Part I 30 minutes Part II 2+ hours Part III

Boyle s Law: Pressure-Volume Relationship in Gases. PRELAB QUESTIONS (Answer on your own notebook paper)

Pitching Performance and Age

Scales of Atmospheric Motion Scale Length Scale (m) Time Scale (sec) Systems/Importance Molecular (neglected)

CENTER PIVOT EVALUATION AND DESIGN

Boyle s Law: Pressure-Volume. Relationship in Gases

THE WAY THE VENTURI AND ORIFICES WORK

Midterm Exam 1, section 2. Thursday, September hour, 15 minutes

Equation 1: F spring = kx. Where F is the force of the spring, k is the spring constant and x is the displacement of the spring. Equation 2: F = mg

The events associated with the Great Tsunami of 26 December 2004 Sea Level Variation and Impact on Coastal Region of India

A New Generator for Tsunami Wave Generation

Test 1: Ocean 116 (Oceanography Lab.)

Craig A. Brown. NOAA Fisheries, Southeast Fisheries Center, Sustainable Fisheries Division 75 Virginia Beach Drive, Miami, FL, , USA

OFFICE OF STRUCTURES MANUAL FOR HYDROLOGIC AND HYDRAULIC DESIGN CHAPTER 11 APPENDIX B TIDEROUT 2 USERS MANUAL

Monitoring the length structure of commercial landings of albacore tuna during the fishing year

Name May 3, 2007 Math Probability and Statistics

Pitching Performance and Age

Pool Plunge: Linear Relationship between Depth and Pressure

FISH 415 LIMNOLOGY UI Moscow

Copy of my report. Why am I giving this talk. Overview. State highway network

MIKE 21 Toolbox. Global Tide Model Tidal prediction

SURFACE CURRENTS AND TIDES

On the Quality of HY-2A Scatterometer Winds

Lat. & Long. Review. Angular distance N or S of equator Equator = 0º Must indicate N or S North pole = 90º N

Equilibrium Model of Tides

INTER-AMERICAN TROPICAL TUNA COMMISSION SCIENTIFIC ADVISORY COMMITTEE FOURTH MEETING. La Jolla, California (USA) 29 April - 3 May 2013

INFLUENCE OF ENVIRONMENTAL PARAMETERS ON FISHERY

An integrated three-dimensional model of wave-induced pore pressure and effective stresses in a porous seabed: II. Breaking waves

Advanced Hydraulics Prof. Dr. Suresh A. Kartha Department of Civil Engineering Indian Institute of Technology, Guwahati

Process Dynamics, Operations, and Control Lecture Notes - 20

Algebra I: A Fresh Approach. By Christy Walters

AN INDEX OF ABUNDANCE OF BLUEFIN TUNA IN THE NORTHWEST ATLANTIC OCEAN FROM COMBINED CANADA-U.S. PELAGIC LONGLINE DATA

Data Analysis of the Seasonal Variation of the Java Upwelling System and Its Representation in CMIP5 Models

Boyle s Law: Pressure-Volume Relationship in Gases

Effective Use of Box Charts

Boyle s Law: Pressure-Volume Relationship in Gases

The MACC Handicap System

International Journal of Technical Research and Applications e-issn: , Volume 4, Issue 3 (May-June, 2016), PP.

SCIENTIFIC COMMITTEE SEVENTH REGULAR SESSION August 2011 Pohnpei, Federated States of Micronesia

DUE TO EXTERNAL FORCES

Forecasting. Forecasting In Practice

United States Commercial Vertical Line Vessel Standardized Catch Rates of Red Grouper in the US South Atlantic,

Addendum to SEDAR16-DW-22

INSTITUTE AND FACULTY OF ACTUARIES. Curriculum 2019 AUDIT TRAIL

Design of Experiments Example: A Two-Way Split-Plot Experiment

SCIENCE OF TSUNAMI HAZARDS

IMO REVISION OF THE INTACT STABILITY CODE. Proposal of methodology of direct assessment for stability under dead ship condition. Submitted by Japan

ValidatingWindProfileEquationsduringTropicalStormDebbyin2012

IMPROVEMENT OF 3-DIMENSIONAL BASIN STRUCTURE MODEL USING GROUND MOTION RECORDINGS

DEPARTMENT OF FISH AND GAME

Competitive Performance of Elite Olympic-Distance Triathletes: Reliability and Smallest Worthwhile Enhancement

Validation of 12.5 km Resolution Coastal Winds. Barry Vanhoff, COAS/OSU Funding by NASA/NOAA

SCDNR Charterboat Logbook Program Data,

Atmospheric Waves James Cayer, Wesley Rondinelli, Kayla Schuster. Abstract

Gravity waves in stable atmospheric boundary layers

A Combined Recruitment Index for Demersal Juvenile Cod in NAFO Divisions 3K and 3L

WAVES, WAVE BEHAVIOR, GEOPHYSICS AND SOUND REVIEW ANSWER KEY

Global Structure of Brunt Vaisala Frequency as revealed by COSMIC GPS Radio Occultation

Evaluation of Regression Approaches for Predicting Yellow Perch (Perca flavescens) Recreational Harvest in Ohio Waters of Lake Erie

Atomspheric Waves at the 500hPa Level

STOCK ASSESSMENT OF YELLOWFIN TUNA IN THE EASTERN PACIFIC OCEAN UPDATE OF 2011 STOCK ASSESSMENT. January 1975 December 2011

a) List and define all assumptions for multiple OLS regression. These are all listed in section 6.5

Algebra I: A Fresh Approach. By Christy Walters

NUMERICAL MODELING ANALYSIS FOR TSUNAMI HAZARD ASSESMENT IN WEST COAST OF SOUTHERN THAILAND

Sea State Analysis. Topics. Module 7 Sea State Analysis 2/22/2016. CE A676 Coastal Engineering Orson P. Smith, PE, Ph.D.

Transcription:

Biostatistics 140.754 Advanced Methods in Biostatistics IV Jeffrey Leek Assistant Professor Department of Biostatistics jleek@jhsph.edu Lecture 8 1 / 22

Tip + Paper Tip In data analysis - particularly for complex high-dimensional data - it is frequently better to choose simple models for clearly defined parameters. With a lot of data, there is a strong temptation to go overboard with statistically complicated models; the danger of over-fitting/over-interpreting is extreme. The most reproducible results are often produced by sensible and statistically simple analyses (Note: being sensible and simple does not always lead to higher profile results). Beware ridiculograms! http://www.youtube.com/watch?v= YS-asmU3p_4&feature=channel_video_title Paper of the Day: The genetic landscape of a cell http://www.sciencemag.org/content/327/5964/425.short 2 / 22

A ridiculogram! (Apologies to C. Myers) 3 / 22

Outline For Today Data analysis 4 / 22

The Earthquake Data The earthquake data available from the class website look like this: Src,Eqid,Version,Datetime,Lat,Lon,Magnitude,Depth,NST,Region ak,10208729,1,"monday, April 11, 2011 21:28:47 UTC",59.8229,-150.2962,2.6,23.80,23,"Kenai Peninsula, Alaska" us,c0002nwt,6,"monday, April 11, 2011 21:21:19 UTC",38.8311,141.9716,4.7,58.00,39,"near the east coast of Honshu, Japan" ak,10208720,1,"monday, April 11, 2011 21:19:21 UTC",61.1910,-150.8566,1.5,35.00,13,"Southern Alaska" us,c0002nvr,4,"monday, April 11, 2011 21:03:02 UTC",36.3607,145.7883,4.7,15.10,19,"off the east coast of Honshu, Japan" Read the data into R > edata = read.csv("eqs7day-m1.txt",header=t) > dim(edata) [1] 1007 10 Also, the dates are for April 11. Did any of you re-download the data from the website? http://www.data.gov/raw/34. 5 / 22

Earthquake depth/magnitude The question of interest is: what is the relationship between earthquake magnitude and earthquake depth? What are earthquake depth/magnitude? Let s start here: http://earthquake.usgs.gov/earthquakes/glossary.php The focus of an earthquake is the place where the energy is initially released (i.e., where the rocks first break). The depth where the earthquake begins to rupture in km. This depth may be relative to mean sea-level or the average elevation of the seismic stations which provided arrival-time data for the earthquake location. Earthquakes can occur anywhere between the Earth s surface and about 700 kilometers below the surface. 1 Earthquake magnitude is a logarithmic measure of earthquake size. In simple terms, this means that at the same distance from the earthquake, the shaking will be 10 times as large during a magnitude 5 earthquake as during a magnitude 4 earthquake. 1 http://earthquake.usgs.gov/learn/topics/seismology/ determining_depth.php 6 / 22

Depth Values Sometimes when depth is poorly constrained by available seismic data, the location program will set the depth at a fixed value. For example, 33 km is often used as a default depth for earthquakes determined to be shallow, but whose depth is not satisfactorily determined by the data, whereas default depths of 5 or 10 km are often used in mid-continental areas and on mid-ocean ridges since earthquakes in these areas are usually shallower than 33 km. 7 / 22

Depth Values > summary(edata$depth) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.00 4.70 9.90 22.54 22.50 565.50 > integerdepths = table(edata$depth[which(edata$depth==round(edata$depth))]) > barplot(integerdepths[as.numeric(names(integerdepths)) < 50],cex.names=0.7,ylab="Frequency",xlab="Depth") 8 / 22

Magnitude Values and Depth vs. Magnitude Plot > summary(edata$magnitude) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.000 1.400 1.800 2.189 2.600 7.100 > plot(edata$depth,edata$magnitude,pch=19,col="blue") 9 / 22

f(depth) versus Magnitude > par(mfrow=c(1,2)) > plot(log(edata$depth),edata$magnitude,pch=19,col="blue") > plot(sqrt(edata$depth),edata$magnitude,pch=19,col="blue") 10 / 22

The other variables > names(edata) [1] "Src" "Eqid" "Version" [4] "Datetime" "Lat" "Lon" [7] "Magnitude" "Depth" "NST" [10] "Region" Src - the source of the information presented in the table Eqid - Event/Earthquake ID - a unique identifier for each earthquake Version - the version number as the data are updated - higher numbers mean more updates Datetime - Time of the initial rupture in UTC Lat - the latitude of the epicenter of the earthquake (positive is N, negative is S) Lon - the longitude of the epicenter of the earthquake from Greenwich (positive is E, negative is W) NST - Number of seismic stations which reported P- and S-arrival times for this earthquake Region - The region name is an automatically generated name from the Flinn-Engdahl (F-E) seismic and geographical regionalization scheme, proposed in 1965, defined in 1974 and revised in 1995. 11 / 22

log(depth) versus Magnitude - colored by Latitude, Longitude > library(hmisc) > library(rcolorbrewer) > cols1 = c(brewer.pal(12,"set3"),brewer.pal(6,"dark2")) > lonfactor = cut2(edata$lon,g=10) > latfactor = cut2(edata$lat,g=10) > par(mfrow=c(1,2)) > plot(log(edata$depth),edata$magnitude,pch=19,col=cols1[as.numeric(lonfactor)]) > plot(log(edata$depth),edata$magnitude,pch=19,col=cols1[as.numeric(latfactor)]) 12 / 22

The Ring of Fire also... 13 / 22

log(depth) versus Magnitude - colored by Region/Version > library(rcolorbrewer) > cols1 = c(brewer.pal(12,"set3"),brewer.pal(6,"dark2")) > par(mfrow=c(1,2)) > plot(log(edata$depth),edata$magnitude,pch=19,col=cols1[as.numeric(as.factor(edata$src))]) > plot(log(edata$depth),edata$magnitude,pch=19,col=cols1[as.numeric(as.factor(edata$version))]) 14 / 22

log(depth) versus Magnitude - sized by the Number of Stations > plot(log(edata$depth),edata$magnitude,pch=19,col=cols1[5],cex=5*edata$nst/max(edata$nst)) 15 / 22

Model Magnitude and log(depth) > logdepth <- log(edata$depth + 1) > glm1 <- glm(edata$magnitude logdepth,family="gamma") > glm2 <- glm(edata$magnitude logdepth,family="poisson") > plot(logdepth,edata$magnitude,pch=19,col=cols1[5]) > points(logdepth,glm1$fitted,pch=19,cex=0.5,col=cols1[4]) > points(logdepth,glm2$fitted,pch=19,cex=0.5,col=cols1[6]) 16 / 22

Model Magnitude and log(depth) + Region >logdepth <- log(edata$depth + 1) > glm3 <- glm(edata$magnitude logdepth + as.factor(edata$region),family="gamma") > plot(logdepth,edata$magnitude,pch=19,col=cols1[5]) > points(logdepth,glm3$fitted,pch=19,cex=0.5,col=cols1[4]) 17 / 22

Model Magnitude and log(depth) + Longitude >logdepth <- log(edata$depth + 1) > glm4 <- glm(edata$magnitude logdepth + edata$lon,family="gamma") 2 > plot(logdepth,edata$magnitude,pch=19,col=cols1[5]) > points(logdepth,glm4$fitted,pch=19,cex=0.5,col=cols1[6]) 2 A compromise would be to treat Longitude as a factor using cut2 18 / 22

Model Magnitude and log(depth) + Longitude: EEs The estimating equations for this model are given by: n i=1 X T i ( Y i 1 ) X T = 0 3 β where X i = [1 Log Depth i Longitude i ] T and β = [β 0 β ld β long ] T This can be solved with Newton s method (hopefully you did it). One thing to be careful with for this set of estimating equations is that we need the linear predictor Xi T β to be positive for sensible results. Sandwich estimation proceeds by the usual formula (which hopefully you worked out for your model!) i 19 / 22

Get parameter fits and confidence intervals (non-parametic, sandwich-based) > library(sandwich) > params = round(cbind(glm4$coeff,glm4$coeff + sqrt(diag(sandwich(glm4))) %o% qnorm(c(0.025,0.975))) > rownames(params) = c("intercept","log Depth", "Longitude") > colnames(params) = c("estimate","lower","upper") > params Estimate Lower Upper Intercept 0.4941 0.4619 0.5263 Log Depth -0.0385-0.0495-0.0274 Longitude -0.0010-0.0011-0.0009 First, it is clear that both Log Depth and Longitude are statistically significant at the 0.05 level. Now, how do we interpret each of these parameters? The intercept term is interpreted as the inverse of the average magnitude earthquake that occurs on the earth s surface at the same longitude as Greenwich. Interpreting the Log Depth term and the Longitude term are somewhat more difficult for the inverse link. The parameters are negative, which suggests that for two earthquakes at the same longitude, on average the trend is that increases in Log Depth are associated with increases in Magnitude (coefficients are in the opposite direction of what you would see for the Poisson model). 20 / 22

Model/Sanity Checking First lets check that the linear predictor only produces positive values > glm4$coeff[1] + max(logdepth)*glm4$coeff[2] + max(edata$lon)*glm4$coeff[3] 0.06521122 Now see if the coefficient of variation σ/µ is about constant: > dbins = cut2(logdepth,cuts=seq(0,6,by=0.5)) > par(mfcol=c(2,1)) > plot(edata$magnitude logdepth,col=cols1[5],pch=19,xlim=c(0,6.8)) > plot(tapply(logdepth,dbins,mean),tapply(glm4$residuals,dbins,sd)/ tapply(glm4$fitted,dbins,mean), pch=19,xlim=c(0,6.8),col=cols1[1],xlab="average logdepth",ylab="average Residual Sd/Average Fitted Value") 21 / 22

Checking Constant CV Assumption 22 / 22