plyr Modelling large data Hadley Wickham

1. Strategy for analysing large data.
2. Introduction to the Texas housing data.
3. What's happening in Houston?
4. Using models as a tool.
5. Using models in their own right.

Large data strategy
  Start with a single unit, and identify interesting patterns.
  Summarise patterns with a model.
  Apply model to all units.
  Look for units that don't fit the pattern.
  Summarise with a single model.

Texas housing data
For each metropolitan area (45) in Texas, for each month from 2000 to 2009 (112):
  Number of houses listed and sold
  Total value of houses, and average sale price
  Average time on market
Image credit: CC BY http://www.flickr.com/photos/imagesbywestfall/3510831277/

Strategy
  Start with a single city (Houston).
  Explore patterns & fit models.
  Apply models to all cities.

[Figure: avgprice against date, 2000 to 2008.]

[Figure: sales against date, 2000 to 2008.]

[Figure: onmarket against date, 2000 to 2008.]

Seasonal trends
Seasonal trends make it much harder to see the long-term trend. How can we remove the seasonal trend? (There are many sophisticated techniques from time series, but what's the simplest thing that might work?)

[Figure: avgprice against month.]

Challenge
What does the following function do?

    deseas <- function(var, month) {
      resid(lm(var ~ factor(month))) + mean(var, na.rm = TRUE)
    }

How could you use it in conjunction with transform to deseasonalise the data? What if you wanted to deseasonalise every city?

    houston <- transform(houston,
      avgprice_ds = deseas(avgprice, month),
      listings_ds = deseas(listings, month),
      sales_ds = deseas(sales, month),
      onmarket_ds = deseas(onmarket, month)
    )
    # `avg` is a plot layer that is not defined in this transcript.
    qplot(month, sales_ds, data = houston, geom = "line", group = year) + avg

[Figure: avgprice_ds against month.]

Models as tools
Here we're using the linear model as a tool: we don't care about the coefficients or the standard errors, we are just using it to get rid of a striking pattern. Tukey described this approach as residuals and reiteration: by removing a striking pattern we can see more subtle patterns.
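To make the "model as a tool" idea concrete, here is a small sketch on made-up data. The toy series (a sine-shaped seasonal effect plus a yearly step) is an assumption for illustration only and is not the Texas data. It suggests that deseas() flattens the monthly pattern while leaving the year-to-year change in place.

```r
# Deseasonalising function from the slides.
deseas <- function(var, month) {
  resid(lm(var ~ factor(month))) + mean(var, na.rm = TRUE)
}

# Hypothetical toy series: 3 years of monthly values with a known
# seasonal effect plus a yearly step (no noise, no missing values).
month <- rep(1:12, times = 3)
year <- rep(1:3, each = 12)
seasonal <- 10 * sin(2 * pi * month / 12)
trend <- 100 + 5 * year
y <- trend + seasonal

y_ds <- deseas(y, month)

# After adjustment every month has the same mean (the overall mean),
# while the yearly means still differ by the step of 5.
month_means <- tapply(y_ds, month, mean)
year_means <- tapply(y_ds, year, mean)
print(round(month_means, 6))
print(round(year_means, 6))
```

The residuals remove each month's average, and adding back the overall mean keeps the adjusted series on the original scale.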

[Figure: avgprice_ds against date, 2000 to 2008.]

[Figure: sales_ds against date, 2000 to 2008.]

[Figure: onmarket_ds against date, 2000 to 2008.]

Summary
Most variables seem to be a combination of a strong seasonal pattern plus a weaker long-term trend. How do these patterns hold up for the rest of Texas? We'll focus on sales.

[Figure: sales against date, one line per city.]

    tx <- read.csv("tx-house-sales.csv")
    qplot(date, sales, data = tx, geom = "line", group = city)

    tx <- ddply(tx, "city", transform, sales_ds = deseas(sales, month))
    qplot(date, sales_ds, data = tx, geom = "line", group = city)

[Figure: sales_ds against date, one line per city.]

It works, but...
It doesn't give us any insight into the similarity of the patterns across multiple cities. Are the trends the same or different? So instead of throwing the models away and just using the residuals, let's keep the models and explore them in more depth.

Two new tools
  dlply: takes a data frame, splits it up in the same way as ddply, applies a function to each piece and combines the results into a list.
  ldply: takes a list, splits it up into elements, applies a function to each piece and then combines the results into a data frame.
  dlply + ldply = ddply
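The two steps can be sketched in base R to show the mechanics. This is only a rough analogue on made-up data, not how plyr is implemented; the data frame `df` and the column names are hypothetical. The first step plays the role of dlply (data frame in, list out), the second the role of ldply (list in, data frame out).

```r
# Hypothetical toy data: two groups with different means.
df <- data.frame(
  g = rep(c("a", "b"), each = 3),
  x = c(1, 2, 3, 10, 20, 30)
)

# dlply-like step: split the data frame by group and fit a model to each
# piece, keeping the fits in a list named by group.
pieces <- split(df, df$g)
fits <- lapply(pieces, function(d) lm(x ~ 1, data = d))

# ldply-like step: turn each fit into a one-row data frame, carry the
# group label along by hand, and stack the rows.
rows <- lapply(names(fits), function(nm) {
  data.frame(g = nm, mean_x = unname(coef(fits[[nm]])[1]))
})
result <- do.call(rbind, rows)
print(result)
```

Note that the group label has to be carried along by hand here, which is the bookkeeping that plyr takes care of.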

    models <- dlply(tx, "city", function(df)
      lm(sales ~ factor(month), data = df))

    models[[1]]
    coef(models[[1]])

    ldply(models, coef)

Labelling
Notice we didn't have to do anything to have the coefficients labelled correctly. Behind the scenes plyr records the labels used for the split step, and ensures they are preserved across multiple plyr calls.

Back to the model
What are some problems with this model? How could you fix them? Is the format of the coefficients optimal? Turn to the person next to you and discuss for 2 minutes.

    qplot(date, log10(sales), data = tx, geom = "line", group = city)

    # Log transform sales to make coefficients comparable (ratios)
    models2 <- dlply(tx, "city", function(df)
      lm(log10(sales) ~ factor(month), data = df))

    # Puts coefficients in rows, so they can be plotted more easily
    coef2 <- ldply(models2, function(mod) {
      data.frame(
        month = 1:12,
        effect = c(0, coef(mod)[-1]),
        intercept = coef(mod)[1])
    })

    qplot(month, effect, data = coef2, group = city, geom = "line")

[Figure: effect against month, one line per city.]

    qplot(month, 10 ^ effect, data = coef2, group = city, geom = "line")

[Figure: 10^effect against month, one line per city.]

    qplot(month, 10 ^ effect, data = coef2, geom = "line") + facet_wrap(~ city)

[Figure: 10^effect against month, one panel per city.]
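Why does 10^effect read as a ratio? A minimal sketch on made-up numbers, assuming a city whose July sales are exactly double its January sales (the data frame `toy` is hypothetical): on the log10 scale the July coefficient is a difference from January, so back-transforming gives the multiplicative change relative to January.

```r
# Hypothetical toy data: January sales of 100, July sales exactly double.
toy <- data.frame(
  month = rep(c(1, 7), each = 4),
  sales = rep(c(100, 200), each = 4)
)

mod <- lm(log10(sales) ~ factor(month), data = toy)

# Same layout as coef2 in the slides: the baseline month gets effect 0.
effect <- c(0, coef(mod)[-1])
ratio <- 10 ^ effect
print(unname(round(ratio, 6)))
```

Because every city is expressed relative to its own baseline month, large and small cities can be compared on the same axis.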

What should we do next?
What do you think? You have 30 seconds to come up with (at least) one idea.

My ideas
  Fit a single model, log(sales) ~ city * factor(month), and look at the residuals.
  Fit individual models, log(sales) ~ factor(month) + ns(date, 3), and look for cities that don't fit.

    # One approach - fit a single model
    mod <- lm(log10(sales) ~ city + factor(month), data = tx)
    tx$sales2 <- 10 ^ resid(mod)

    qplot(date, sales2, data = tx, geom = "line", group = city)

[Figure: sales2 against date, one line per city.]

    last_plot() + facet_wrap(~ city)

[Figure: sales2 against date, one panel per city.]

    # Another approach: Essence of most cities is seasonal
    # term plus long term smooth trend. We could fit this
    # model to each city, and then look for models which don't
    # fit well.
    library(splines)
    models3 <- dlply(tx, "city", function(df) {
      lm(log10(sales) ~ factor(month) + ns(date, 3), data = df)
    })

    # Extract rsquared from each model
    rsq <- function(mod) c(rsq = summary(mod)$r.squared)
    quality <- ldply(models3, rsq)
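A self-contained sketch of the same idea on simulated data, using base R in place of plyr so it runs without extra packages. The two "cities", their noise levels and the helper `make_city` are all hypothetical. The point is only that a city that follows the seasonal-plus-smooth-trend pattern closely gets a high R-squared, and a noisy one gets a low R-squared.

```r
library(splines)  # ships with R; provides ns()

set.seed(1)
n <- 60  # five years of monthly observations

# Hypothetical generator: seasonal term + linear trend + noise on log10 scale.
make_city <- function(name, noise_sd) {
  date <- 2000 + (seq_len(n) - 1) / 12
  month <- (seq_len(n) - 1) %% 12 + 1
  log_sales <- 3 + 0.1 * sin(2 * pi * month / 12) +
    0.05 * (date - 2000) + rnorm(n, sd = noise_sd)
  data.frame(city = name, date = date, month = month,
             sales = 10 ^ log_sales)
}
toy <- rbind(make_city("Smooth", 0.01), make_city("Noisy", 0.3))

# Fit the model from the slides to each city separately.
fits <- lapply(split(toy, toy$city), function(df) {
  lm(log10(sales) ~ factor(month) + ns(date, 3), data = df)
})

# One R-squared per city, named by city.
rsq <- sapply(fits, function(mod) summary(mod)$r.squared)
print(round(rsq, 3))
```

A low R-squared flags a city worth a closer look. It does not say why the fit is poor.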

    qplot(rsq, city, data = quality)

[Figure: rsq (axis from 0.5 to 0.9) against city, cities in alphabetical order.]

    qplot(rsq, reorder(city, rsq), data = quality)

[Figure: rsq (axis from 0.5 to 0.9) against reorder(city, rsq). Axis labels as extracted, in order: San Antonio, Montgomery County, Houston, Dallas, Bryan College Station, Collin County, Denton County, Austin, Fort Bend, Fort Worth, Tyler, NE Tarrant County, Bay Area, Corpus Christi, Arlington, Waco, Temple Belton, Lubbock, Garland, Longview Marshall, Midland, Laredo, Harlingen, Abilene, Killeen Fort Hood, Brazoria County, McAllen, Brownsville, Wichita Falls, Sherman Denison, Irving, Galveston, Odessa, San Marcos, San Angelo, Amarillo, Nacogdoches, Lufkin, Victoria, Beaumont, Texarkana, Paris, Palestine, El Paso, Port Arthur.]

How are the good fits different from the bad fits?

    quality$poor <- quality$rsq < 0.7
    tx2 <- merge(tx, quality, by = "city")

    mfit <- ldply(models3, function(mod) {
      data.frame(
        resid = resid(mod),
        pred = predict(mod))
    })
    tx2 <- cbind(tx2, mfit[, -1])

Can you think of any potential problems with this line?
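One possible answer, offered as a guess at what the question is after: cbind() matches rows purely by position, so it relies on both data frames having the same number of rows in the same order. A sketch on made-up data (the data frame `toy` is hypothetical) shows how a single missing value breaks that assumption, and one way to avoid relying on it, written in base R.

```r
# Hypothetical toy data with one missing response value.
toy <- data.frame(
  city = rep(c("A", "B"), each = 4),
  x = rep(1:4, times = 2),
  y = c(1, 2, NA, 4, 2, 4, 6, 9)
)

# Default NA handling drops the incomplete row, so the residual vector is
# shorter than the data it came from (3 values for 4 rows).
n_default <- length(resid(lm(y ~ x, data = toy[toy$city == "A", ])))

# One possible fix: compute residuals and predictions inside each group
# with na.exclude, which pads the dropped rows with NA, and attach them
# to that group's own rows instead of binding by position afterwards.
pieces <- lapply(split(toy, toy$city), function(df) {
  mod <- lm(y ~ x, data = df, na.action = na.exclude)
  df$resid <- resid(mod)
  df$pred <- predict(mod)
  df
})
toy2 <- do.call(rbind, pieces)
print(toy2)
```

Keeping the residuals next to the rows they were computed from means the result stays correct even if the data are later sorted or merged.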

[Figure: Raw data. log10(sales) against date, one panel per city.]

[Figure: Predictions. pred against date, one panel per city.]

[Figure: Residuals. resid against date, one panel per city.]

Conclusions
This is a simple (and relatively small) example, but it shows how collections of models can be useful for gaining understanding. Each attempt illustrated something new about the data. plyr made it easy to create and summarise collections of models, so we didn't have to worry about the mechanics.