SEAMAP Vertical Line Dataset: Challenges and Opportunities
Mark A. Albins
University of South Alabama, Dauphin Island Sea Lab
Analysis goals
  - Indices of abundance (IOA) for red snapper (RS) using available vertical line (VL) data (AL, LA, TX)
  - Relationship between IOA and region, habitat type, depth, etc.
  - Power analysis: number of samples needed to detect change
  - How does sampling across depth strata contribute to overall IOA, power to detect change, and scope of inference?
Analysis plan
  - Download VL data from SEAMAP GSMFC (seamap.gsmfc.org)
  - Check, clean, summarize data
  - Fit GLMs (or GLMMs) for RS numerical catch and RS biomass catch
      - Candidate distributions:
          - Numerical: Poisson, negative binomial, or zero-inflated version (Comp. Poisson)
          - Biomass: log-normal, or zero-inflated version (Tweedie)
      - Candidate predictors:
          - Categorical: State or region, Year, Depth stratum
          - Continuous: Effort, Depth, Latitude, Longitude
      - Also fit models with Temp, Sal, DO using a reduced dataset (where these are available)
  - Simulate data using
      - Best-fit model parameters
      - Across range of effect sizes
      - Across range of sample sizes
  - Fit models to simulated data and calculate power across range of effect sizes for a range of sample sizes
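The simulate-and-refit step can be sketched in a few lines. This is only an illustration under simplifying assumptions that are not in the plan above: catch per station is drawn from a plain Poisson distribution (not negative binomial or zero-inflated), the "change" is a single multiplicative shift between two equally sized groups of stations, and detection uses a two-sided Wald test on the log rate ratio instead of a refitted GLM. The function names and parameter values are hypothetical.

```python
import math
import random


def rpois(rng, lam):
    """Draw one Poisson variate (Knuth's method; adequate for small means)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_power(base_rate, rate_ratio, n_per_group, n_sims=500, seed=1):
    """Share of simulated surveys in which a two-sided Wald test (alpha = 0.05)
    on the log rate ratio flags the change between the two groups."""
    rng = random.Random(seed)
    z_crit = 1.959964  # standard normal quantile for alpha = 0.05, two-sided
    hits = 0
    for _ in range(n_sims):
        before = sum(rpois(rng, base_rate) for _ in range(n_per_group))
        after = sum(rpois(rng, base_rate * rate_ratio) for _ in range(n_per_group))
        if before == 0 or after == 0:
            continue  # log rate ratio undefined; counted as "not detected"
        log_rr = math.log(after / before)
        se = math.sqrt(1.0 / before + 1.0 / after)
        if abs(log_rr / se) > z_crit:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    # Hypothetical grid: mean catch of 2 fish per station, 25% increase.
    for n in (25, 50, 100):
        print(n, simulate_power(base_rate=2.0, rate_ratio=1.25, n_per_group=n))
```

In the actual analysis the simulated data would come from the best-fit model and the test would be the refitted model's own inference; the loop structure (simulate, test, tally) stays the same.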
Challenges and opportunities
  - Differences in sampling protocol
      - Among years
      - Among state partners
  - Differences in sampling design
      - Among years
      - Among state partners
  - Overall data consistency and quality
  - Independent observational units?
  - Dependent subsamples?
Basic observational unit breakdown and consequences for analytical options
  - Hierarchy (coarsest to finest): Sites, Stations, Lines, Hooks
  - If we can collapse to site level without introducing bias or eliminating too much data, we can use a fixed-effects model (GLM)
  - If we need to model repeated stations at a given site, or different numbers of non-equivalent lines (different hook sizes) at a station, then we'll need to use a mixed-effects model (GLMM)
  - GLMMs are doable, but limit options and can be more difficult to fit (optimizer convergence problems, etc.)
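Collapsing hook-level rows to one record per station could look like the sketch below. It assumes one CATCH row per hook fished; the field names `station_id` and `species` and the species label `"RS"` are hypothetical stand-ins, not the actual SEAMAP column names or codes.

```python
from collections import defaultdict


def collapse_to_station(catch_rows, species="RS"):
    """Return {station_id: {"hooks": n_hooks, "catch": n_target_fish}}."""
    totals = defaultdict(lambda: {"hooks": 0, "catch": 0})
    for row in catch_rows:
        station = totals[row["station_id"]]
        station["hooks"] += 1  # one CATCH row per hook fished (assumption)
        station["catch"] += int(row["species"] == species)
    return dict(totals)


if __name__ == "__main__":
    rows = [
        {"station_id": "A", "species": "RS"},
        {"station_id": "A", "species": ""},   # empty hook
        {"station_id": "B", "species": "RS"},
    ]
    print(collapse_to_station(rows))
```

The per-station hook count is what would later feed the effort offset, which is why missing no-catch rows matter so much for the early years.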
Differences in sampling protocol
  - Hook size/bait assignment
      - 2010: random hook size/bait type assigned within backbones
      - 2011: random hook size assigned within backbones
      - 2012-2016: single hook size assigned to entire backbone
  - Table columns: Lines | Bait | Gangions | Hooks/line | Hook size
      - Lines: three sequential; two simultaneous; three simultaneous
      - Other cells (in source order): Mackerel and Squid; 11 or 12 (depending on drop); Mackerel; Twisted 18 w/swivel sleeve; Mackerel; Twisted 18 w/out swivel sleeve; 10 or 12 (depending on drop); [9, 11] or [3, 8, 11] (depending on drop); 12; 8, 11, 13, 15; 10; 8, 11, 15
Drilling down into protocol differences
  - What do the data tell us about
      - Number of stations per site
      - Number of lines per station
      - Number of hooks per line
      - Hook sizes across lines and stations
Data structure and critical links
  - SEAMAP VL Database consists of three linked tables
      - CRUISE -- (Cruise_Id / CID) -- STATION -- (SID) -- CATCH
  - Connecting STATION with CATCH
      - Ops. Manual states SEAMAPSTATION is the link
      - SEAMAPSTATION: 6-digit date + station number for the day
      - Not a good key: surveys in TX and AL from the same day might have the same code
      - Unique SID assigned to each row in STATION, and all associated rows in CATCH were flagged with this SID
      - Resulted in some problems
Problems with SID link between STATION and CATCH
  - 215 SIDs in STATION with no corresponding rows in CATCH
      - Only 15 SEAMAPSTATION in STATION with no corresponding rows in CATCH
  - Duplicate rows in STATION (of 2063 rows)
      - 74 primaries + 207 dup vals in SEAMAPSTATION
      - 67 primaries + 200 dup vals in SEAMAPSTATION plus SOURCE
      - 66 primaries + 198 dup vals in above plus LAT, LON
      - 131 primaries + 133 dup vals in above plus TIME
      - 77 primaries + 77 dup vals in all columns (except primary key: SID)
  - 252 individual simultaneous lines fished at same stations: unique rows in STATION with different SID & SEAMAPSTATION but duplicates in all other columns
  - Duplicate rows in CATCH (of 38722 rows)
      - 92 primaries + 95 dup vals in all columns (except primary key: CDID)
SID in STATION with no corresponding rows in CATCH (215/2064)
  - 210 from 2010-2012: 14 from AL, 196 from LA
      - None have OPSCODEs indicating non-fished station
      - A few have COMMENTs suggesting that the site was not fished
          - E.g. "No Structure", "missed the site", "not close enough to rig"
      - Many have DEPTH, DEPTHFISHED, TIME and TIMESOAK
      - Some have COMMENTs indicating that fish were caught
          - E.g. "could not get otoliths from fish on hook #7", "awesome site", "same fish caught on hooks #8 and #9"
      - Many have comments indicating problems with bait, sharks, tangled lines, etc.
  - 5 from 2013-2016: all from LA, 2015
      - 3 have OPSCODEs and COMMENTs indicating non-fished station
          - Of these 3, 2 have TIMESOAK of 5 min, 1 has no Lat/Lon
      - 2 have no OPSCODE or COMMENT indicating non-fished station, but do have values for DEPTH, DEPTHFISHED, TIME, and TIMESOAK
  - Need to go back to beginning of pipeline to fix: check against field data sheets
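The check behind these counts is an anti-join: STATION rows whose key never appears in CATCH. A minimal sketch, assuming each table has been loaded as a list of dicts keyed by column name (the loading step and the toy values are hypothetical):

```python
def stations_without_catch(station_rows, catch_rows, key="SID"):
    """Return the STATION rows whose key value never occurs in CATCH."""
    catch_keys = {row[key] for row in catch_rows}
    return [row for row in station_rows if row[key] not in catch_keys]


if __name__ == "__main__":
    station = [{"SID": 1}, {"SID": 2}, {"SID": 3}]
    catch = [{"SID": 1}, {"SID": 1}, {"SID": 3}]
    print(stations_without_catch(station, catch))
```

Running the same function with `key="SEAMAPSTATION"` gives the second comparison reported on these slides.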
SEAMAPSTATION in STATION with no corresponding rows in CATCH (15/2064)
  - 3 from LA, 2015
      - 3 have OPSCODEs & COMMENTs indicating non-fished station
      - 2 have TIMESOAK = 5
  - 12 from AL, 2010-2011
      - 0 have OPSCODEs indicating non-fished station
      - COMMENTs: 5 similar to MARF 6 No Structure; 5 blank; 2 what should be STRUCTNAME (e.g. "2004 Pyramid 209")
      - All have ENV_LAT & ENV_LON
      - 5 have TEMP, SAL, DO
      - 9 have DEPTH
      - DEPTHFISHED: 2 data, 7 zeros, 3 blanks
      - All have TIME = 0
      - All have TIMESOAK = blank
  - Need to go back to beginning of pipeline to fix: check against field data sheets
Duplicate values in STATION columns
  - 77 full duplicate pairs of rows
      - Could be undocumented double lines fished at same site, same time (need to collapse into single row per station)
      - Or cut-paste/copy-paste type errors (need to eliminate)
  - 131 primaries + 133 dup vals for SEAMAPSTATION, SOURCE, LAT, LON, TIME
      - Includes above full dups + those with different values in other columns (e.g. abiotics, DEPTHFISHED, COMMENTs, OPSCODEs, etc.)
      - Could be double lines fished at same site, same time (need to collapse)
      - Or could be erroneous repeats (need to eliminate)
  - 66 primaries + 198 dup vals for SEAMAPSTATION, SOURCE, LAT, LON
      - Includes all above plus stations fished more than once in same day (different TIME)
      - Keep these as separate stations but flag as repeat visits to same site (repeated measures)
  - 67 primaries + 200 dup vals in SEAMAPSTATION, SOURCE
      - Includes all above plus 1 primary + 2 dup vals due to typo in SEAMAPSTATION column (AL, checked field data, fixed)
  - 74 primaries + 207 dup vals in SEAMAPSTATION
      - 7 primaries + 7 dup vals due to same SEAMAPSTATION being assigned in LA and TX on same day
      - Might think about including state in a unique and informative station identifier, like "AL050615VL01"
  - Need to go back to beginning of pipeline to fix most: check against field data sheets
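The "primaries + dup vals" tallies can be reproduced for any column subset with a counter. The sketch below assumes that a "primary" is a value combination occurring more than once and that "dup vals" are the extra rows beyond the first occurrence; that reading is inferred from the counts, not stated on the slides. It also shows the state-prefixed identifier suggested above, with `station_key` as a hypothetical helper name and SOURCE assumed to hold the state code.

```python
from collections import Counter


def count_duplicates(rows, columns):
    """Return (primaries, dup_vals) for the given column subset.

    primaries: value combinations that occur more than once
    dup_vals:  extra rows beyond the first occurrence of each combination
    """
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    repeated = [n for n in counts.values() if n > 1]
    return len(repeated), sum(n - 1 for n in repeated)


def station_key(state, seamapstation):
    """Prefix the state code so codes reused across states stay distinct."""
    return f"{state}{seamapstation}"


if __name__ == "__main__":
    rows = [
        {"SEAMAPSTATION": "050615VL01", "SOURCE": "LA"},
        {"SEAMAPSTATION": "050615VL01", "SOURCE": "TX"},
        {"SEAMAPSTATION": "050615VL02", "SOURCE": "LA"},
    ]
    print(count_duplicates(rows, ["SEAMAPSTATION"]))
    print(count_duplicates(rows, ["SEAMAPSTATION", "SOURCE"]))
    print(station_key("AL", "050615VL01"))
```

Adding columns to the subset can only split groups, which is why the total of primaries plus dup vals never grows as more columns are included.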
Rows in STATION with dup vals in all columns except SID & SEAMAPSTATION
  - 252 individual simultaneous lines fished at same stations mapped to unique rows (SEAMAPSTATIONs/SIDs) in STATION
      - All AL data
      - AL uses alpha code at the end of the SEAMAPSTATION identifier to distinguish lines at a station (e.g. "050515VL05A")
      - Lines mistakenly mapped to the station level at some point in the data prep and/or transfer to GSMFC SEAMAP
  - Suggest adding unique line identifier to CATCH for each station
      - GEARLOC doesn't work due to sequential drops from same GEARLOC
      - HOOKSIZE doesn't work because these were randomly assigned to lines in 2010
  - Already being fixed at beginning of pipeline
Duplicate rows in CATCH
  - 92 primaries + 95 dup vals in all columns
      - 2010-2011: 88 primaries + 91 dup vals
      - 2012-2016: 4 primaries + 4 dup vals
  - My best guess is that most of these (2010-2011) represent empty hooks but are missing HOOKNUM and HOOKSIZE info
  - Need to go back to beginning of pipeline to fix: check against field data sheets
Drilling down into protocol differences
  - Number of stations per site
  - Number of lines per station
  - Number of hooks per line
  - Hook sizes across lines and stations
  - Need to fix:
      - SID link between STATION and CATCH: differentiate clearly between non-fished and fished stations, and ensure that there are rows in CATCH for all hooks at all fished stations
      - Duplicates of critical values in STATION: eliminate erroneous repeat stations and flag true repeat stations
Options for including numbers and sizes of hooks in model
  - Differences in hook number among stations can be modeled via an effort offset in the predictors
  - Differences in hook size representation among stations = source of bias
      - Eliminate stations with non-standard hook size representation
      - Or include hook size in model: requires inclusion of subsamples (lines/hooks), which necessitates GLMM
  - Unfortunately...
Problems with hook number and size data: 2010-2011
  - Number of rows in CATCH (hooks fished) per station is very inconsistent
      - 47/233 stations with < 10 rows in CATCH data
      - 70/233 stations with n-rows in CATCH multiple of 10 or 12
  - Most empty-hook (no-catch) rows missing
  - HOOKNUM blank for 1115/5460 rows
  - HOOKSIZE blank for 5/5460 rows
  - Impossible to ensure standardization of effort and avoid hook size bias for these early years
  - Will likely require major data entry effort
  - Need to go back to beginning of pipeline to fix: check against field data sheets
Problems with hook number and size data: 2012-2016
  - Rows in CATCH (hooks fished) per station fairly consistent
      - 0/1616 stations with < 10 rows in CATCH
      - 14/1616 stations with n-rows in CATCH multiple of 10
  - Most empty-hook (no-catch) rows present
  - HOOKNUM blank for 5/33262 rows
  - HOOKSIZE blank for 41/33262 rows
  - Most can be fixed relatively easily; what can't be fixed can be eliminated from the dataset without large loss in sample size
  - Need to go back to beginning of pipeline to fix: check against field data sheets
Other problems with hook number: missing hooks vs. lost partial rig
  - Potential cause of < 10 rows in CATCH per station: single-line stations + inconsistent treatment of missing hooks/lost partial rig
      - Sometimes rows included for missing hooks, often with "M" for BAITSTATUS
      - Sometimes rows not included for missing hooks, often with COMMENT indicating lost rig or lost partial rig
  - Need to clarify difference (if any) between these categories and standardize how they are treated in the data
  - Suggestions:
      - Stations with missing hooks: flagged in OPSCODE using Y(HS)
      - Stations with lost partial rig: given full set of rows in CATCH with BAITSTATUS = M for lost hooks
      - Stations with lost (full) rig: no rows in CATCH for lost rig, flagged in OPSCODE using L(HS)
      - Treat missing hooks and lost partial rigs the same in analysis
      - Treat lost (full) rigs the same as a station with less than a full set of lines
Drilling down into protocol differences
  - Number of stations per site
  - Number of lines per station
  - Number of hooks per line
  - Hook sizes across lines and stations
  - Need to fix:
      - Blanks in HOOKSIZE & HOOKNUM
      - Missing rows in CATCH for no-catch hooks
      - Missing rows in CATCH for lost partial rigs
      - Extra rows in CATCH for lost (full) rigs
      - Easy for 2012-2016, but difficult for 2010-2011
  - Need to fix:
      - SID link between STATION and CATCH: differentiate clearly between non-fished and fished stations, and ensure that there are rows in CATCH for all hooks at all fished stations
      - Duplicates of critical values in STATION: eliminate erroneous repeat stations and flag true repeat stations
Dealing with differences in sampling design and protocols
  - Use all data (2010-2016):
      - Multiple sampling events across time
      - Stations with fewer than three lines (hook number, size)
      - Different sets of hook sizes and different bait types
      - Hook size and bait type uniform on lines in most years, but randomly assigned within lines in other years
  - Limit to recent years (2012-2016):
      - Multiple sampling events across time
      - Stations with fewer than three lines (hook number, size)
Dealing with differences in sampling design and protocols
  - Multiple sampling events across time
      1. Eliminate repeat sampling events at a site
      2. Incorporate repeat sampling events into a longitudinal design (mixed-effects model with site as random effect)
         - Under this option, might still be good to eliminate repeat samples within same day, or close together in time, to avoid any depletion effect
Dealing with differences in sampling design and protocols (2010-2016)
  - Stations with fewer than three lines (hook number, size)
  - Different sets of hook sizes and different bait types
      1. Keep all stations, collapse lines into station (ignore potential hook size and bait biases)
      2. Keep all stations, collapse lines into station (minimize potential hook size and bait biases by eliminating bait types and/or hook sizes not fished during all years)
      3. Keep all stations, incorporate hooks-nested-within-lines as subsamples in station, include hook size as predictor, eliminate bait types and/or hook sizes not fished during all years (mixed-effects model with individual-level random effect at the hook level; binary response)
  - For all three options, include number of hooks as measure of effort in model (offset term)
  - Eliminating all stations with fewer than three lines means dropping all/most of 2011, so not really an option
  - Additional concern with all three options is depletion effect of sequential lines in 2010
Dealing with differences in sampling design and protocols (2012-2016)
  - Stations with fewer than three lines (hook number, hook size)
      1. Eliminate stations with fewer than 3 lines, collapse lines into station (no hook size biases because of equal representation at all stations)
      2. Keep all stations, collapse lines into station (ignore potential hook size bias)
      3. Keep all stations, incorporate lines as subsamples in station, include hook size as predictor (mixed-effects model with line-nested-within-station as random effect)
  - For all three options, include number of hooks as measure of effort in model (offset term)
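For reference, an effort offset under a log link means log(expected catch) = log(hooks) + linear predictor, so expected catch scales in direct proportion to the number of hooks while the fitted coefficients describe catch per hook. A minimal numeric sketch (the function name and the value -1.5 for the linear predictor are purely illustrative):

```python
import math


def expected_catch(hooks, linear_predictor):
    """Expected catch under a log link with log(hooks) entered as an offset."""
    return math.exp(math.log(hooks) + linear_predictor)


if __name__ == "__main__":
    # Same habitat and year effects, different effort: 10 vs. 30 hooks.
    for hooks in (10, 30):
        print(hooks, expected_catch(hooks, linear_predictor=-1.5))
```

Because the offset coefficient is fixed at 1, a station that fished 30 hooks is expected to catch three times as many fish as an otherwise identical station that fished 10.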
Other issues: bait/hook status, lost and/or tangled gear, sharks, etc.
  - Which hooks should be counted in effort offset?
      - Whole bait (Y)
      - Partial bait (Y)
      - No bait (?)
      - No bait on deployment (N)
      - Damaged (N)
      - Missing (N)
      - Predation (N)
      - Double hooked fish (N)
      - Those that catch other spp. (N)
  - Should tangled gear be included?
      - Does it matter if the tangled gear caught fish or not?
      - Should the whole station be removed, or just those lines affected?
  - Should stations with large sharks present be included?
      - If not, how do we standardize the filter?
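Once these questions are settled, the rule can be written down as a lookup so that effort is computed the same way for every station. The sketch below encodes the Y/N proposals from the list above; the status labels are descriptive placeholders, not the actual BAITSTATUS codes, and the undecided "No bait" case is exposed as a parameter.

```python
# True = counts toward effort, False = excluded, None = undecided on the slide.
COUNTS_AS_EFFORT = {
    "whole bait": True,
    "partial bait": True,
    "no bait": None,
    "no bait on deployment": False,
    "damaged": False,
    "missing": False,
    "predation": False,
    "double hooked fish": False,
    "other species": False,
}


def effective_hooks(hook_statuses, count_no_bait=False):
    """Number of hooks at a station that count toward the effort offset."""
    total = 0
    for status in hook_statuses:
        counts = COUNTS_AS_EFFORT[status]  # KeyError flags an unknown status
        if counts is None:
            counts = count_no_bait
        total += int(counts)
    return total


if __name__ == "__main__":
    station = ["whole bait", "partial bait", "no bait", "missing"]
    print(effective_hooks(station), effective_hooks(station, count_no_bait=True))
```

Fitting the models under both settings of `count_no_bait` would be one way to see whether the unresolved case changes the index.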
CATCH (38722 rows): Categorical variables
  - CAMERA
      - 2422: blank
      - 432: FALSE
      - 240: TRUE
  - GEARLOC
      - 3079: blank
  - BAITSTATUS
      - 16: 0
      - 1: n
      - 1: S
  - HOOKSIZE
      - 1101: 15/0
      - 1100: Aug-00
      - 1081: Nov-00
      - 501: 08
      - 44: blank
      - 2: 1
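The HOOKSIZE variants could be normalized with a small parser. This sketch assumes that "Aug-00" and "Nov-00" are spreadsheet date auto-conversions of hook sizes 8 and 11, and that "15/0" and "08" mean 15 and 8; those readings are inferred from the hook sizes listed in the protocol and should be confirmed against field data sheets. The function name is hypothetical.

```python
# Assumed mapping for date-mangled entries (month abbreviation -> hook size).
MONTH_TO_SIZE = {"Aug": 8, "Nov": 11}


def normalize_hooksize(raw):
    """Return the hook size as an int, or None if blank or unparseable."""
    text = (raw or "").strip()
    if not text:
        return None
    if text.endswith("/0"):      # "15/0" -> "15"
        text = text[:-2]
    month = text.split("-")[0]
    if month in MONTH_TO_SIZE:   # "Aug-00" -> 8
        return MONTH_TO_SIZE[month]
    return int(text) if text.isdigit() else None


if __name__ == "__main__":
    for value in ("15/0", "Aug-00", "Nov-00", "08", "", "1"):
        print(repr(value), normalize_hooksize(value))
```

A value such as "1" parses cleanly but is not a size listed in the protocol, so parsed values would still need a range check before analysis.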
CATCH (38722 rows): COMMENT
  - Supposed to be a catch-all, but should only be used for data with no home or to clarify (or add caveats to) other codes used
  - Often used in place of BAITSTATUS (e.g. one fish caught on two hooks)
  - Need to review all comments and use them to fill in OPSCODEs, BAITSTATUS, etc.
CATCH (38722 rows): BAITSTATUS
  - No code for bait lost upon deployment (several examples of this in COMMENTs); should treat these hooks differently in analysis
  - One fish caught on multiple hooks treated inconsistently
  - Suggestions:
      - Enter fish data (including GENUS, SPECIES, BIOCODE, etc.) on one line only!
      - Use code F on line with fish data and code L for all other hooks
      - Make note in COMMENT identifying primary hook and all other hooks using their HOOKNUM
CATCH (38722 rows): Species identifiers
  - BIOCODE: 1 blank (where SPECIES & GENUS not blank)
  - SPECIES: 253 blanks (where BIOCODE not blank)
  - 10 BIOCODEs with multiple GENUS + SPECIES
      - All of these are shortened versions or different capitalizations of the correct name
CATCH (38722 rows): FISHID
  - FISHID assigned when no fish on hook (19K/30K)
  - 398 primary + 13235 dup vals of FISHID
CATCH RS only (7195 rows): Fish size data

           GONADWT      PCL       SL       FL       TL   WEIGHT
Min.         0.000    151.0    160.0    184.0     55.0    0.082
1st Qu.      3.917    281.0    305.0    368.0    396.0    0.850
Median      11.367    311.0    370.0    431.0    467.0    1.400
Mean        28.599    325.2    395.4    452.9    489.7    1.899
3rd Qu.     31.900    356.0    475.0    524.0    566.0    2.400
Max.       517.600    695.0    748.0    861.0   4310.0   13.200
NA's          3763     6834     3575       40       32       62
Biological parameters (RS only)
Missing RS weights (62 rows)
  - 27 have no COMMENT or other indication of why data are missing
      - 2 of these have BAITSTATUS indicating one fish caught on two hooks
      - 19 of these have data for length measurements
  - 13 have "weight removed" in COMMENT
      - All from LA on same day: 2015-05-01, 6 different stations
  - 14 were lost before measuring, or partly consumed
  - 7 have COMMENT indicating one fish caught on multiple hooks
      - 4 of these have BAITSTATUS indicating same
  - 1 has "data missing" COMMENT
STATION (2064 rows): Categorical variables
  - GEARCODE
      - 4: VL
      - 439: blank
  - STRUCTTYPE
      - 219: Artificial Structure
      - 189: ARTIFICIAL REEFS
      - 135: ARTIFICIAL
      - 131: Artificial reef
      - 129: artificial reef
      - 16: ARTIFICAL
      - 2: Artificial Reef
      - 1: Artificial Reef -pyramids-
      - 1: Artificial Reef -two-pile structure-
      - 1: Artificial Reef -wreck-
      - 2: Stand Pipe
      - 1: Z-Pipe
      - 648: PETROLEUM PLATFORMS
      - 32: PLATFORM
      - 62: NATURAL BOTTOM
      - 42: natural bottom
      - 30: Natural Structure
      - 25: Natural bottom
      - 7: Natural Bottom
      - 5: NATURAL
      - 17: No structure
      - 12: No Structure
      - 39: NO STRUCTURE
      - 41: Unknown
      - 24: UNKNOWN
      - 3: Unidentified Structure
      - 250: blank
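Most of the STRUCTTYPE variants differ only in case, pluralization, or spelling, so they could be folded into a handful of categories before modeling. The grouping below is an assumption for illustration (for example, it keeps platforms separate from other artificial structures and leaves "Stand Pipe" and "Z-Pipe" unmapped for manual review); the category names and function name are hypothetical.

```python
def normalize_structtype(raw):
    """Map a raw STRUCTTYPE entry to a coarse habitat category."""
    text = " ".join((raw or "").lower().split())  # lowercase, squeeze spaces
    if not text:
        return "blank"
    if text.startswith(("artificial", "artifical")):  # includes the misspelling
        return "artificial"
    if "platform" in text:
        return "platform"
    if text.startswith("natural"):
        return "natural"
    if text == "no structure":
        return "no structure"
    if text.startswith(("unknown", "unidentified")):
        return "unknown"
    return "unmapped"  # e.g. "Stand Pipe", "Z-Pipe": left for manual review


if __name__ == "__main__":
    for value in ("ARTIFICAL", "Natural bottom", "PETROLEUM PLATFORMS", "Z-Pipe", ""):
        print(repr(value), "->", normalize_structtype(value))
```

Keeping an explicit "unmapped" bucket makes it easy to list the entries that still need a decision from the state partners.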
STATION (2064 rows): COMMENT
  - Region- or lab-specific codes cause unnecessary clutter (e.g. "Treatment 1")
  - Used in place of appropriate OPSCODEs (e.g. "OP CODE L11")
  - Used to indicate protocols (e.g. "single line 12 in gangion")
      - Create new OPSCODEs for these?
STATION (2064 rows): OPSCODE
  - Of 2064 rows in STATION, only 27 have an entry in OPSCODE (all LA 2015, 2016)
      - J, K, O, S, X: each used once
      - TXX: used 21 times
      - LXX: used 2 times
      - PXX: used 1 time (not listed in Appendix 2)
  - Many rows in STATION include COMMENTs indicating an issue that should be reflected in the OPSCODE column, but is not (mostly lost gear situations)
  - OPSCODE can be a powerful tool on the analysis end of the pipeline, but it needs to be filled in retroactively, and consistently in the future, to be of any use
STATION (2064 rows): DEPTH + DEPTHFISHED
  - DEPTH: 3 zeros, 9 blanks
  - DEPTHFISHED: 47 zeros, 119 blanks
STATION (2064 rows): Temp, Sal, DO
  - Often measured at nearby station on same or different day
  - Better to leave these blank and do any substitutions as part of analysis
  - If substitutions are kept in the dataset, need a tractable flag and a column indicating the station ID of the substitute station
STATION (2064 rows): Temp, Sal, DO

           SECCHI   TEMPMAX   SALMAX   DOMAX
Min.         0.00      6.30     6.40   0.280
1st Qu.      0.00     20.37    35.50   4.420
Median       1.20     22.20    36.10   5.670
Mean         3.85     22.81    35.79   5.499
3rd Qu.      5.90     24.90    36.40   6.600
Max.        30.50     35.38    39.96   8.500
NA's         1132       702      702     730
           SECCHI    DFTEMP    DFSAL    DFDO
Min.         0.00     18.19    31.07   0.000
1st Qu.      0.00     20.80    35.53   3.850
Median       1.20     22.53    36.23   4.750
Mean         3.85     23.43    35.91   4.740
3rd Qu.      5.90     25.90    36.39   5.905
Max.        30.50     31.62    39.96   8.300
NA's         1132      1599     1599    1625
Take home
  - Many of these issues will require going back to the original field data
  - Issues from early years will require extensive data entry/re-entry to deal with no-catch rows, hook position, and hook size issues
  - We can move forward faster if we prioritize 2012 onward
  - Most of these issues can be spot checked/confirmed/fixed
  - I am more than happy to help with these tasks by providing detailed reports of issues and/or working with your data people to track down and fix problems
AL VLL statistical model
  - Collapsed hooks and drops to station
      - Most stations had same number of drops and same number/sizes of hooks per drop
      - Those that didn't were eliminated before model fitting (e.g. lost gear)
      - Also eliminated all 2010 and 2011 data due to inconsistencies in number of drops, number/size of hooks per drop, bait type, etc., and insufficient data quality to sort these out
  - No repeat stations at sites
  - Therefore, able to run fixed-effects-only model