
rrricanes to Access Tropical Cyclone Data

Wed, 09/27/2017 - 09:00
What is rrricanes

Why Write rrricanes?

There is a tremendous amount of weather data available on the internet. Much of it is in raw format and not very easy to obtain. Hurricane data is no different. When one thinks of this data they may be inclined to think it is a bunch of map coordinates with some wind values and not much else. A deeper look will reveal structural and forecast data. An even deeper look will find millions of data points from hurricane reconnaissance, computer forecast models, ship and buoy observations, satellite and radar imagery, …

rrricanes is an attempt to bring this data together in a way that doesn't just benefit R users, but other languages as well.

I began learning R in 2015 and immediately wished I had a hurricane-specific dataset when Hurricane Patricia became a harmless but historic hurricane roaming the Pacific waters. I revisited the idea as Hurricane Matthew took aim at Florida and the southeast in 2016. Being unable to use R to study and consolidate Matthew's data led me to begin learning package development. Thus, the birth of rrricanes.

In this article, I will take you on a lengthy tour of the most important features of rrricanes and what the data means. If you have a background working with hurricane data, most of this will be redundant. My aim here is to cover the big ideas behind the package and explain them under the assumption you, the reader, are unfamiliar with the data offered.

rrricanes is not intended to be used in emergency situations. I write this article as areas where I have lived, or currently live, are under the gun from Hurricane Harvey, and rrricanes is unable to obtain data due to external issues (I will describe these later). It is designed with the intent of answering questions and exploring ideas outside of a time-sensitive environment.

rrricanes will not be available on CRAN for quite some time. The current schedule is May 15, 2018 (the "start" of the East Pacific hurricane season). This year is solely for testing under real-time conditions.

And rrricanesdata

The NHC archives text products dating back to at least 1998 (some earlier years exist but have yet to be implemented in this package). Accessing this data is a time-consuming process on any computer. A limit of 4 requests per second is put in place to avoid being banned (or restricted) from the archives. So, if a hurricane has 20 text products you wish to pull and parse, this will take 5 seconds. Most cyclones have more, and some far more.

rrricanesdata is a complement package to rrricanes. rrricanesdata contains post-scraped datasets of the archives for all available storms, with the exception of advisories issued in the current month. This means you can explore the various datasets without the wait.

rrricanesdata will be updated monthly if an advisory has been issued the previous month. There will be regular monthly updates approximately from May through November – the typical hurricane season. In some cases, a cyclone may develop in the off-season. rrricanesdata will be updated on the same schedule.

ELI5 the Data

This package covers tropical cyclones that have developed in the Atlantic basin (north Atlantic ocean) or East Pacific basin (northeast Pacific east of 140°W). Central Pacific (140°W – 180°W) may be mixed in if listed in the NHC archives.

While traditionally the hurricane season for each basin runs from mid-May or June through November, some cyclones have developed outside of this time frame.

Every tropical cyclone (any tropical low whether classified as a tropical depression, tropical storm or hurricane) contains a core set of text products officially issued from the National Hurricane Center. These products are issued every six hours.

Much of this data has changed in format over the years. Some products have been discontinued and replaced by new products or wrapped into existing products. Some of these products are returned in raw text format; it is not cleaned and may contain HTML characters. Other products are parsed with every piece of data extracted and cleaned.

I have done my best to ensure data is high quality. But I cannot guarantee it is perfect. If you do believe you have found an error, please let me know, even if it seems small. I would rather be notified of a false error than ignore a true one.

The Products

Each advisory product is listed below with an abbreviation in parentheses. Unless otherwise noted, these products are issued every six hours. Generally, the times issued are 03:00, 09:00, 15:00 and 21:00 UTC. Some products may be issued in three-hour increments and, sometimes, two-hour increments. Updates (update) can be issued at any time.

  • Storm Discussion (discus) – These are technical discussions centered on the current structure of the cyclone, satellite presentation, computer forecast model tendencies and more. These products are not parsed.

  • Forecast/Advisory (fstadv) – This data-rich product lists the current location of the cyclone, its wind structure, forecast and forecast wind structure.

  • Public Advisory (public) – These are general text statements issued for the public-at-large. Information in these products is a summary of the Forecast/Advisory product along with any watches and warnings issued, changed, or cancelled. Public Advisory products are the only regularly-scheduled product that may be issued intermittently (every three hours and, occasionally, every two hours) when watches and warnings are in effect. These products are not parsed.

  • Wind Speed Probabilities (wndprb) – These products list the probability of a minimum sustained wind speed expected in a given forecast window. This product replaces the Strike Probabilities product beginning in 2006 (see below).

  • Updates (update) – Tropical Cyclone Updates may be issued at any time if a storm is an immediate threat to land or if the cyclone undergoes a significant change of strength or structure. The information in this product is general. These products are not parsed.

Discontinued Products
  • Strike Probabilities (prblty) – List the probability of a tropical cyclone passing within 65 nautical miles of a location within a forecast window. Replaced in 2006 by the Wind Speed Probabilities product.

  • Position Estimates (posest) – Typically issued as a storm is threatening land but generally rare (see Hurricane Ike 2008, Key AL092008). It is generally just an update of the current location of the cyclone. After the 2011 hurricane season, this product was discontinued; Updates are now issued in their place. These products are not parsed.

Primary Key

Every cyclone has a Key. However, not all products contain this value (prblty, for example). Products issued during and after the 2005 hurricane season contain this variable.

Use Key to tie datasets together. If Key does not exist, you will need to use a combination of Name and Date, depending on your requirements. Keep in mind that, unless a name is retired, names are recycled every six years. For example, there are multiple cyclones named Katrina, but you may want to focus on Katrina, 2005.
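For example, isolating Katrina, 2005 might look like the sketch below. This assumes a parsed forecast/advisory dataset is loaded (such as fstadv from rrricanesdata, described later); Katrina's 2005 Key is AL122005.

library(dplyr)

# Preferred: filter on the unique Key.
katrina <- fstadv %>% filter(Key == "AL122005")

# Fallback when Key is unavailable: combine Name and Date.
katrina <- fstadv %>%
  filter(Name == "Katrina", format(Date, "%Y") == "2005")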

Installation

rrricanes will not be submitted to CRAN until just before the 2018 hurricane season. It can be installed from GitHub using devtools:

devtools::install_github("ropensci/rrricanes", build_vignettes = TRUE)

Optional Supporting Packages

rrricanesdata uses a drat repository to host the large, pre-processed datasets.

install.packages("rrricanesdata", repos = "https://timtrice.github.io/drat/", type = "source")

To use high resolution tracking charts, you may also wish to install the rnaturalearthhires package:

install.packages("rnaturalearthhires", repos = "http://packages.ropensci.org", type = "source")

Linux users may also need to install:

  • libgdal-dev
  • libproj-dev
  • libxml2-dev
Get a List of Storms

We start exploring rrricanes by finding a storm (or storms) we wish to analyze. For this, we use get_storms. There are two optional parameters:

  • years – between 1998 and the current year

  • basins – one or both of "AL" and "EP"

An empty call to the function will return storms for both the Atlantic and East Pacific basin for the current year.

library(dplyr)
library(rrricanes)

get_storms() %>% print(n = nrow(.))
## # A tibble: 33 x 4
## Year Name Basin
##
## 1 2017 Tropical Storm Arlene AL
## 2 2017 Tropical Storm Bret AL
## 3 2017 Tropical Storm Cindy AL
## 4 2017 Tropical Depression Four AL
## 5 2017 Tropical Storm Don AL
## 6 2017 Tropical Storm Emily AL
## 7 2017 Hurricane Franklin AL
## 8 2017 Hurricane Gert AL
## 9 2017 Hurricane Harvey AL
## 10 2017 Potential Tropical Cyclone Ten AL
## 11 2017 Hurricane Irma AL
## 12 2017 Hurricane Jose AL
## 13 2017 Hurricane Katia AL
## 14 2017 Hurricane Lee AL
## 15 2017 Hurricane Maria AL
## 16 2017 Tropical Storm Adrian EP
## 17 2017 Tropical Storm Beatriz EP
## 18 2017 Tropical Storm Calvin EP
## 19 2017 Hurricane Dora EP
## 20 2017 Hurricane Eugene EP
## 21 2017 Hurricane Fernanda EP
## 22 2017 Tropical Storm Greg EP
## 23 2017 Tropical Depression Eight-E EP
## 24 2017 Hurricane Hilary EP
## 25 2017 Hurricane Irwin EP
## 26 2017 Tropical Depression Eleven-E EP
## 27 2017 Tropical Storm Jova EP
## 28 2017 Hurricane Kenneth EP
## 29 2017 Tropical Storm Lidia EP
## 30 2017 Hurricane Otis EP
## 31 2017 Hurricane Max EP
## 32 2017 Hurricane Norma EP
## 33 2017 Tropical Storm Pilar EP
## # ... with 1 more variables: Link
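If you want a specific season or basin instead, both parameters can be passed explicitly. A quick sketch using the parameter names listed above:

# Atlantic storms only, for a single season.
get_storms(years = 2005, basins = "AL")

# Several seasons across both basins.
get_storms(years = 1998:2000, basins = c("AL", "EP"))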

Function get_storms returns four variables:

  • Year – year of the cyclone.

  • Name – name of the cyclone.

  • Basin – basin in which the cyclone developed (AL for Atlantic, EP for east Pacific).

  • Link – URL to the cyclone's archive page.

The variables Name and Link are the only variables that could potentially change. For example, you'll notice a Name value of Potential Tropical Cyclone Ten. If this storm became a tropical storm then it would receive a new name and the link to the archive page would change as well.

For this example we will explore Hurricane Harvey.

Text Products

Current Data

Once we have identified the storms we want to retrieve we can begin working on getting the products. In the earlier discussion of the available products, recall I used abbreviations such as discus, fstadv, etc. These are the terms we will use when obtaining data.

The easiest way to get storm data is the function get_storm_data. This function can take multiple storm archive URLs and return multiple datasets within a list.

ds <- get_storms() %>% filter(Name == "Hurricane Harvey") %>% pull(Link) %>% get_storm_data(products = c("discus", "fstadv"))

This process may take some time (particularly, fstadv products). This is because the NHC website allows no more than 80 connections every 10 seconds. rrricanes processes four links every half second.

rrricanes uses the dplyr progress bar to keep you informed of the status. You can turn this off by setting option dplyr.show_progress to FALSE.

An additional option is rrricanes.working_msg; FALSE by default. This option will show a message for each advisory currently being worked on. I primarily added it to help find products causing problems, but you may find it useful at some point.
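Both options can be set with a standard options() call before pulling data; for example:

# Hide the dplyr progress bar and show per-advisory working messages instead.
options(dplyr.show_progress = FALSE,
        rrricanes.working_msg = TRUE)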

At this point, we have a list – ds – of dataframes. Each dataframe is named after the product.

names(ds) ## [1] "discus" "fstadv"

discus is one of the products that isn't parsed; the full text of the product is returned.

str(ds$discus) ## Classes 'tbl_df', 'tbl' and 'data.frame': 43 obs. of 6 variables: ## $ Status : chr "Potential Tropical Cyclone" "Tropical Storm" "Tropical Storm" "Tropical Storm" ... ## $ Name : chr "Nine" "Harvey" "Harvey" "Harvey" ... ## $ Adv : num 1 2 3 4 5 6 7 8 9 10 ... ## $ Date : POSIXct, format: "2017-08-17 15:00:00" "2017-08-17 21:00:00" ... ## $ Key : chr "AL092017" "AL092017" "AL092017" "AL092017" ... ## $ Contents: chr "\nZCZC MIATCDAT4 ALL\nTTAA00 KNHC DDHHMM\n\nPotential Tropical Cyclone Nine Discussion Number 1\nNWS National"| __truncated__ "\nZCZC MIATCDAT4 ALL\nTTAA00 KNHC DDHHMM\n\nTropical Storm Harvey Discussion Number 2\nNWS National Hurricane"| __truncated__ "\nZCZC MIATCDAT4 ALL\nTTAA00 KNHC DDHHMM\n\nTropical Storm Harvey Discussion Number 3\nNWS National Hurricane"| __truncated__ "\nZCZC MIATCDAT4 ALL\nTTAA00 KNHC DDHHMM\n\nTropical Storm Harvey Discussion Number 4\nNWS National Hurricane"| __truncated__ ...

The fstadv dataframes, however, are parsed and contain the bulk of the information for the storm.

str(ds$fstadv) ## Classes 'tbl_df', 'tbl' and 'data.frame': 43 obs. of 117 variables: ## $ Status : chr "Potential Tropical Cyclone" "Tropical Storm" "Tropical Storm" "Tropical Storm" ... ## $ Name : chr "Nine" "Harvey" "Harvey" "Harvey" ... ## $ Adv : num 1 2 3 4 5 6 7 8 9 10 ... ## $ Date : POSIXct, format: "2017-08-17 15:00:00" "2017-08-17 21:00:00" ... ## $ Key : chr "AL092017" "AL092017" "AL092017" "AL092017" ... ## $ Lat : num 13.1 13 13 13.1 13.1 13.4 13.7 13.8 13.9 14.1 ... ## $ Lon : num -54.1 -55.8 -57.4 -59.1 -61.3 -62.9 -64.1 -65.9 -68.1 -70 ... ## $ Wind : num 30 35 35 35 35 35 35 35 35 30 ... ## $ Gust : num 40 45 45 45 45 45 45 45 45 40 ... ## $ Pressure : num 1008 1004 1005 1004 1005 ... ## $ PosAcc : num 30 30 30 30 30 40 40 30 30 30 ... ## $ FwdDir : num 270 270 270 270 270 275 275 275 275 275 ... ## $ FwdSpeed : num 15 16 16 16 18 18 16 18 19 19 ... ## $ Eye : num NA NA NA NA NA NA NA NA NA NA ... ## $ SeasNE : num NA 60 60 60 75 150 60 60 45 NA ... ## $ SeasSE : num NA 0 0 0 0 0 0 0 0 NA ... ## $ SeasSW : num NA 0 0 0 0 0 0 0 0 NA ... ## $ SeasNW : num NA 60 60 60 60 60 45 60 45 NA ... ## $ NE64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ SE64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ SW64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ NW64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ NE50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ SE50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ SW50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ NW50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ NE34 : num NA 30 50 50 60 60 60 60 0 NA ... ## $ SE34 : num NA 0 0 0 0 0 0 0 0 NA ... ## $ SW34 : num NA 0 0 0 0 0 0 0 0 NA ... ## $ NW34 : num NA 30 50 50 60 60 60 60 60 NA ... ## $ Hr12FcstDate : POSIXct, format: "2017-08-18 00:00:00" "2017-08-18 06:00:00" ... ## $ Hr12Lat : num 13.1 13.1 13.2 13.2 13.3 13.6 14 14 14.1 14.3 ... ## $ Hr12Lon : num -56.4 -58.3 -59.9 -61.7 -63.8 -65.7 -66.8 -68.7 -70.9 -73 ... ## $ Hr12Wind : num 30 35 35 35 35 35 35 35 35 30 ... ## $ Hr12Gust : num 40 45 45 45 45 45 45 45 45 40 ... ## $ Hr12NE64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr12SE64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr12SW64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr12NW64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr12NE50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr12SE50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr12SW50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr12NW50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr12NE34 : num NA 30 50 50 60 60 60 60 60 NA ... ## $ Hr12SE34 : num NA 0 0 0 0 0 0 0 0 NA ... ## $ Hr12SW34 : num NA 0 0 0 0 0 0 0 0 NA ... ## $ Hr12NW34 : num NA 30 50 50 60 60 60 60 60 NA ... ## $ Hr24FcstDate : POSIXct, format: "2017-08-18 12:00:00" "2017-08-18 18:00:00" ... ## $ Hr24Lat : num 13.2 13.4 13.6 13.5 13.6 13.9 14.3 14.3 14.4 14.6 ... ## $ Hr24Lon : num -59.8 -61.6 -63.3 -65.2 -67.3 -69.3 -70.4 -72.7 -74.9 -77 ... ## $ Hr24Wind : num 35 40 40 40 40 40 40 40 40 35 ... ## $ Hr24Gust : num 45 50 50 50 50 50 50 50 50 45 ... ## $ Hr24NE64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr24SE64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr24SW64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr24NW64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr24NE50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr24SE50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr24SW50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr24NW50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr24NE34 : num 50 40 50 50 60 60 60 60 60 60 ... 
## $ Hr24SE34 : num 0 0 0 0 30 0 0 0 0 0 ... ## $ Hr24SW34 : num 0 0 0 0 30 0 0 0 0 0 ... ## $ Hr24NW34 : num 50 40 50 50 60 60 60 60 60 60 ... ## $ Hr36FcstDate : POSIXct, format: "2017-08-19 00:00:00" "2017-08-19 06:00:00" ... ## $ Hr36Lat : num 13.5 13.7 13.9 13.9 14 14.2 14.5 14.6 14.9 15.2 ... ## $ Hr36Lon : num -63.2 -65.1 -67 -68.8 -71.1 -73 -74.3 -76.7 -78.7 -80.5 ... ## $ Hr36Wind : num 40 45 45 40 40 40 40 45 45 40 ... ## $ Hr36Gust : num 50 55 55 50 50 50 50 55 55 50 ... ## $ Hr36NE64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr36SE64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr36SW64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr36NW64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr36NE50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr36SE50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr36SW50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr36NW50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr36NE34 : num 60 60 60 60 60 60 60 70 70 60 ... ## $ Hr36SE34 : num 0 30 30 30 30 0 0 0 0 0 ... ## $ Hr36SW34 : num 0 30 30 30 30 0 0 0 0 0 ... ## $ Hr36NW34 : num 60 60 60 60 60 60 60 70 70 60 ... ## $ Hr48FcstDate : POSIXct, format: "2017-08-19 12:00:00" "2017-08-19 18:00:00" ... ## $ Hr48Lat : num 13.9 14 14.1 14.1 14.3 14.5 14.8 15.2 15.7 16 ... ## $ Hr48Lon : num -66.7 -68.8 -70.9 -72.7 -75 -76.7 -78.1 -80.1 -82.4 -83.8 ... ## $ Hr48Wind : num 45 45 50 50 50 45 45 50 50 45 ... ## $ Hr48Gust : num 55 55 60 60 60 55 55 60 60 55 ... ## $ Hr48NE50 : num NA NA 30 30 30 NA NA 30 30 NA ... ## $ Hr48SE50 : num NA NA 0 0 0 NA NA 0 0 NA ... ## $ Hr48SW50 : num NA NA 0 0 0 NA NA 0 0 NA ... ## $ Hr48NW50 : num NA NA 30 30 30 NA NA 30 30 NA ... ## $ Hr48NE34 : num 60 60 60 60 70 60 60 90 90 70 ... ## $ Hr48SE34 : num 30 30 30 30 30 30 0 50 50 0 ... ## $ Hr48SW34 : num 30 30 30 30 30 30 0 40 40 0 ... ## $ Hr48NW34 : num 60 60 60 60 70 60 60 70 70 70 ... ## $ Hr72FcstDate : POSIXct, format: "2017-08-20 12:00:00" "2017-08-20 18:00:00" ... ## $ Hr72Lat : num 14.5 14.5 14.8 15 15 15.5 16.5 17 17.5 18 ... ## $ Hr72Lon : num -74.5 -76.5 -78.6 -80.2 -82 -83.5 -84.7 -86.5 -88 -89 ... ## $ Hr72Wind : num 55 55 60 60 60 60 55 55 60 45 ... ## $ Hr72Gust : num 65 65 75 75 75 75 65 65 75 55 ... ## [list output truncated]

Each product can also be accessed on its own. For example, if you only wish to view discus products, use the get_discus function. fstadv products can be accessed with get_fstadv. Every product's specific function is prefixed with get_.

To understand the variable definitions, access the help file for each of these functions (i.e., ?get_fstadv). They contain full definitions on the variables and their purpose.
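For example, pulling only the forecast/advisory products for Hurricane Harvey might look like this; a sketch using the same pipeline as above, assuming get_fstadv accepts the archive link:

library(dplyr)
library(rrricanes)

fstadv <- get_storms() %>%
  filter(Name == "Hurricane Harvey") %>%
  pull(Link) %>%
  get_fstadv()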

As you can see, the fstadv dataframe is very wide. There may be instances you only want to focus on specific pieces of the product. I've developed tidy functions to help trim these datasets:

  • tidy_fcst

  • tidy_fcst_wr

  • tidy_fstadv

  • tidy_wr

These datasets exist in rrricanesdata as fcst, fcst_wr, adv, and wr, respectively (see below).

Most tropical cyclone forecast/advisory products will contain multiple forecast points. Initially, only three-day forecasts were issued. Beginning with the 2003 season, 96- and 120-hour (five-day) forecasts were issued.

If a storm is not expected to survive the full forecast period, then only relevant forecasts will be issued.

We use tidy_fcst to return these forecast points in a tidy fashion from fstadv.

str(tidy_fcst(ds$fstadv)) ## Classes 'tbl_df', 'tbl' and 'data.frame': 283 obs. of 8 variables: ## $ Key : chr "AL092017" "AL092017" "AL092017" "AL092017" ... ## $ Adv : num 1 1 1 1 1 1 1 2 2 2 ... ## $ Date : POSIXct, format: "2017-08-17 15:00:00" "2017-08-17 15:00:00" ... ## $ FcstDate: POSIXct, format: "2017-08-18 00:00:00" "2017-08-18 12:00:00" ... ## $ Lat : num 13.1 13.2 13.5 13.9 14.5 15.5 17 13.1 13.4 13.7 ... ## $ Lon : num -56.4 -59.8 -63.2 -66.7 -74.5 -82 -87.5 -58.3 -61.6 -65.1 ... ## $ Wind : num 30 35 40 45 55 65 65 35 40 45 ... ## $ Gust : num 40 45 50 55 65 80 80 45 50 55 ...

Wind radius values are issued for thresholds of 34, 50 and 64 knots. These values are the radius out to which minimum one-minute sustained winds of at least that speed exist or are expected.

A tropical depression will not have associated wind radius values since the maximum winds of a depression are 30 knots. If a tropical storm has winds less than 50 knots, then it will only have wind radius values for the 34-knot wind field. If winds are greater than 50 knots, then it will have wind radius values for 34 and 50 knot winds. A hurricane will have all wind radius fields.
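As a rough illustration of that rule, you can check each advisory's maximum winds against those thresholds. This is just a dplyr sketch over the ds object built earlier, not an rrricanes function:

library(dplyr)

# Which wind-radius fields should exist, given maximum sustained winds (knots)?
ds$fstadv %>%
  transmute(Adv, Wind,
            radii_34 = Wind >= 34,
            radii_50 = Wind >= 50,
            radii_64 = Wind >= 64)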

Wind radius values are further separated by quadrant: NE (northeast), SE, SW and NW. Not all quadrants will have values, particularly if the cyclone is struggling to organize. For example, you will often find a minimal hurricane only has hurricane-force winds (64 knots) in the northeast quadrant.

When appropriate, a forecast/advisory product will contain these values for the current position and for each forecast position. Use tidy_wr and tidy_fcst_wr, respectively, for these variables.

str(tidy_wr(ds$fstadv)) ## Classes 'tbl_df', 'tbl' and 'data.frame': 56 obs. of 8 variables: ## $ Key : chr "AL092017" "AL092017" "AL092017" "AL092017" ... ## $ Adv : num 2 3 4 5 6 7 8 9 15 16 ... ## $ Date : POSIXct, format: "2017-08-17 21:00:00" "2017-08-18 03:00:00" ... ## $ WindField: num 34 34 34 34 34 34 34 34 34 34 ... ## $ NE : num 30 50 50 60 60 60 60 0 100 80 ... ## $ SE : num 0 0 0 0 0 0 0 0 0 30 ... ## $ SW : num 0 0 0 0 0 0 0 0 0 20 ... ## $ NW : num 30 50 50 60 60 60 60 60 50 50 ... str(tidy_fcst_wr(ds$fstadv)) ## Classes 'tbl_df', 'tbl' and 'data.frame': 246 obs. of 9 variables: ## $ Key : chr "AL092017" "AL092017" "AL092017" "AL092017" ... ## $ Adv : num 1 1 1 1 1 2 2 2 2 2 ... ## $ Date : POSIXct, format: "2017-08-17 15:00:00" "2017-08-17 15:00:00" ... ## $ FcstDate : POSIXct, format: "2017-08-18 12:00:00" "2017-08-19 00:00:00" ... ## $ WindField: num 34 34 34 34 50 34 34 34 34 34 ... ## $ NE : num 50 60 60 80 30 30 40 60 60 80 ... ## $ SE : num 0 0 30 40 0 0 0 30 30 40 ... ## $ SW : num 0 0 30 40 0 0 0 30 30 40 ... ## $ NW : num 50 60 60 80 30 30 40 60 60 80 ...

Lastly, you may only want to focus on current storm details. For this, we use tidy_fstadv:

str(tidy_fstadv(ds$fstadv)) ## Classes 'tbl_df', 'tbl' and 'data.frame': 43 obs. of 18 variables: ## $ Key : chr "AL092017" "AL092017" "AL092017" "AL092017" ... ## $ Adv : num 1 2 3 4 5 6 7 8 9 10 ... ## $ Date : POSIXct, format: "2017-08-17 15:00:00" "2017-08-17 21:00:00" ... ## $ Status : chr "Potential Tropical Cyclone" "Tropical Storm" "Tropical Storm" "Tropical Storm" ... ## $ Name : chr "Nine" "Harvey" "Harvey" "Harvey" ... ## $ Lat : num 13.1 13 13 13.1 13.1 13.4 13.7 13.8 13.9 14.1 ... ## $ Lon : num -54.1 -55.8 -57.4 -59.1 -61.3 -62.9 -64.1 -65.9 -68.1 -70 ... ## $ Wind : num 30 35 35 35 35 35 35 35 35 30 ... ## $ Gust : num 40 45 45 45 45 45 45 45 45 40 ... ## $ Pressure: num 1008 1004 1005 1004 1005 ... ## $ PosAcc : num 30 30 30 30 30 40 40 30 30 30 ... ## $ FwdDir : num 270 270 270 270 270 275 275 275 275 275 ... ## $ FwdSpeed: num 15 16 16 16 18 18 16 18 19 19 ... ## $ Eye : num NA NA NA NA NA NA NA NA NA NA ... ## $ SeasNE : num NA 60 60 60 75 150 60 60 45 NA ... ## $ SeasSE : num NA 0 0 0 0 0 0 0 0 NA ... ## $ SeasSW : num NA 0 0 0 0 0 0 0 0 NA ... ## $ SeasNW : num NA 60 60 60 60 60 45 60 45 NA ...

In release 0.2.1, tidy_fstadv will be renamed to tidy_adv.

One final note on the data: all speed variables are measured in knots, distance variables in nautical miles, and pressure variables in millibars. Functions knots_to_mph and mb_to_in are available for speed/pressure conversions. Function nm_to_sm to convert nautical miles to survey miles will be included in release 0.2.1.
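A quick illustration of the conversion helpers named above (the results are approximate):

library(rrricanes)

knots_to_mph(65)  # roughly 74.8 mph
mb_to_in(1008)    # roughly 29.77 inches of mercury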

Archived Data

rrricanesdata was built to make it easier to get pre-processed datasets. As mentioned earlier, rrricanesdata will be updated the first of every month if any advisory was issued for the previous month. (As I am now writing this portion in September, all of Hurricane Harvey's advisories – the last one issued the morning of August 31 – exist in rrricanesdata release 0.0.1.4.)

As with rrricanes, rrricanesdata is not available on CRAN (nor will it be, due to size limitations).

I'll load all datasets with the call:

library(rrricanesdata)

data(list = data(package = "rrricanesdata")$results[,3])

All core product datasets are available. The dataframes adv, fcst, fcst_wr and wr are the dataframes created by tidy_fstadv, tidy_fcst, tidy_fcst_wr and tidy_wr, respectively.
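For example, once the datasets are loaded you could pull Hurricane Harvey's advisory history straight from adv; a sketch using the Key shown earlier (AL092017) and the columns produced by tidy_fstadv:

library(dplyr)

adv %>%
  filter(Key == "AL092017") %>%
  select(Adv, Date, Lat, Lon, Wind, Pressure)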

Tracking Charts

rrricanes also comes with helper functions to quickly generate tracking charts. These charts use rnaturalearthdata (for high resolution maps, use package rnaturalearthhires). These charts are not required – as Bob Rudis demonstrates succinctly – so feel free to experiment.

You can generate a default plot for the entire globe with tracking_chart:

tracking_chart()

You may find this handy when examining cyclones that cross basins (from the Atlantic to east Pacific such as Hurricane Otto, 2016).

tracking_chart takes three parameters (in addition to dots for other ggplot calls):

  • countries – By default, show country borders

  • states – By default, show state borders

  • res – resolution; the default is 110 (the 1:110m scale).

We do not see countries and states in the map above because of the ggplot defaults. Let's try it again:

tracking_chart(color = "black", size = 0.1, fill = "white")

We can "zoom in" on each basin with helper functions al_tracking_chart and ep_tracking_chart:

al_tracking_chart(color = "black", size = 0.1, fill = "white")

ep_tracking_chart(color = "black", size = 0.1, fill = "white")

GIS Data

GIS data exists for some cyclones and varies by year. This is a relatively new archive by the NHC and is inconsistent from storm to storm.

The "gis" functions are as follows:

  • gis_advisory

  • gis_latest

  • gis_prob_storm_surge

  • gis_windfield

  • gis_wsp

Another area of inconsistency with these products is how they are organized. For example, gis_advisory, gis_prob_storm_surge and gis_windfield can be retrieved with a storm Key (unique identifier for every cyclone; see fstadv$Key). Except for gis_prob_storm_surge, you can even pass an advisory number (see fstadv$Adv).

gis_wsp requires a datetime value; to access a specific GIS package for a storm's advisory you would need to use a variable such as fstadv$Date, subtract three hours and convert to "%Y%m%d%H" format ("%m", "%d", and "%H" are optional).

All of the above functions only return URLs to their respective datasets. This was done to allow you to validate the quantity of datasets you wish to retrieve as, in some cases, the dataset may not exist at all or there may be several available. Use gis_download with the requested URLs to retrieve your datasets.

Let's go through each of these. First, let's get the Key of Hurricane Harvey:

# Remember that ds already and only contains data for Hurricane Harvey
key <- ds$fstadv %>% pull(Key) %>% first()

gis_advisory

gis_advisory returns a dataset package containing past and forecast plot points and lines, a forecast cone (area representing where the cyclone could track), wind radius data and current watches and warnings.

gis_advisory takes two parameters:

  • Key

  • advisory (optional)

If we leave out advisory we get all related datasets for Hurricane Harvey:

x <- gis_advisory(key = key)
length(x)
## [1] 77
head(x, n = 5L)
## [1] "http://www.nhc.noaa.gov/gis/forecast/archive/al092017_5day_001.zip"
## [2] "http://www.nhc.noaa.gov/gis/forecast/archive/al092017_5day_001A.zip"
## [3] "http://www.nhc.noaa.gov/gis/forecast/archive/al092017_5day_002.zip"
## [4] "http://www.nhc.noaa.gov/gis/forecast/archive/al092017_5day_002A.zip"
## [5] "http://www.nhc.noaa.gov/gis/forecast/archive/al092017_5day_003.zip"

As you can see, there are quite a few (which is why the core gis functions only return URLs rather than the actual datasets). Let's trim this down a bit. Sneaking a peek (cheating), I find advisory 19 seems a good choice.

gis_advisory(key = key, advisory = 19) ## [1] "http://www.nhc.noaa.gov/gis/forecast/archive/al092017_5day_019.zip"

Good; there is a data package available for this advisory. Once you have confirmed the package you want to retrieve, use gis_download to get the data.

gis <- gis_advisory(key = key, advisory = 19) %>% gis_download() ## OGR data source with driver: ESRI Shapefile ## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092017-019_5day_lin" ## with 1 features ## It has 7 fields ## OGR data source with driver: ESRI Shapefile ## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092017-019_5day_pgn" ## with 1 features ## It has 7 fields ## OGR data source with driver: ESRI Shapefile ## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092017-019_5day_pts" ## with 8 features ## It has 23 fields ## OGR data source with driver: ESRI Shapefile ## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092017-019_ww_wwlin" ## with 5 features ## It has 8 fields

Let's see what we have.

str(gis) ## List of 4 ## $ al092017_019_5day_lin:Formal class 'SpatialLinesDataFrame' [package "sp"] with 4 slots ## .. ..@ data :'data.frame': 1 obs. of 7 variables: ## .. .. ..$ STORMNAME: chr "Harvey" ## .. .. ..$ STORMTYPE: chr "HU" ## .. .. ..$ ADVDATE : chr "1000 PM CDT Thu Aug 24 2017" ## .. .. ..$ ADVISNUM : chr "19" ## .. .. ..$ STORMNUM : num 9 ## .. .. ..$ FCSTPRD : num 120 ## .. .. ..$ BASIN : chr "AL" ## .. ..@ lines :List of 1 ## .. .. ..$ :Formal class 'Lines' [package "sp"] with 2 slots ## .. .. .. .. ..@ Lines:List of 1 ## .. .. .. .. .. ..$ :Formal class 'Line' [package "sp"] with 1 slot ## .. .. .. .. .. .. .. ..@ coords: num [1:8, 1:2] -94.6 -95.6 -96.5 -97.1 -97.3 -97.5 -97 -95 25.2 26.1 ... ## .. .. .. .. ..@ ID : chr "0" ## .. ..@ bbox : num [1:2, 1:2] -97.5 25.2 -94.6 29.5 ## .. .. ..- attr(*, "dimnames")=List of 2 ## .. .. .. ..$ : chr [1:2] "x" "y" ## .. .. .. ..$ : chr [1:2] "min" "max" ## .. ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot ## .. .. .. ..@ projargs: chr "+proj=longlat +a=6371200 +b=6371200 +no_defs" ## $ al092017_019_5day_pgn:Formal class 'SpatialPolygonsDataFrame' [package "sp"] with 5 slots ## .. ..@ data :'data.frame': 1 obs. of 7 variables: ## .. .. ..$ STORMNAME: chr "Harvey" ## .. .. ..$ STORMTYPE: chr "HU" ## .. .. ..$ ADVDATE : chr "1000 PM CDT Thu Aug 24 2017" ## .. .. ..$ ADVISNUM : chr "19" ## .. .. ..$ STORMNUM : num 9 ## .. .. ..$ FCSTPRD : num 120 ## .. .. ..$ BASIN : chr "AL" ## .. ..@ polygons :List of 1 ## .. .. ..$ :Formal class 'Polygons' [package "sp"] with 5 slots ## .. .. .. .. ..@ Polygons :List of 1 ## .. .. .. .. .. ..$ :Formal class 'Polygon' [package "sp"] with 5 slots ## .. .. .. .. .. .. .. ..@ labpt : num [1:2] -95.4 29.2 ## .. .. .. .. .. .. .. ..@ area : num 51.7 ## .. .. .. .. .. .. .. ..@ hole : logi FALSE ## .. .. .. .. .. .. .. ..@ ringDir: int 1 ## .. .. .. .. .. .. .. ..@ coords : num [1:1482, 1:2] -94.6 -94.7 -94.7 -94.7 -94.7 ... ## .. .. .. .. ..@ plotOrder: int 1 ## .. .. .. .. ..@ labpt : num [1:2] -95.4 29.2 ## .. .. .. .. ..@ ID : chr "0" ## .. .. .. .. ..@ area : num 51.7 ## .. ..@ plotOrder : int 1 ## .. ..@ bbox : num [1:2, 1:2] -100 24.9 -91 33 ## .. .. ..- attr(*, "dimnames")=List of 2 ## .. .. .. ..$ : chr [1:2] "x" "y" ## .. .. .. ..$ : chr [1:2] "min" "max" ## .. ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot ## .. .. .. ..@ projargs: chr "+proj=longlat +a=6371200 +b=6371200 +no_defs" ## $ al092017_019_5day_pts:Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots ## .. ..@ data :'data.frame': 8 obs. of 23 variables: ## .. .. ..$ ADVDATE : chr [1:8] "1000 PM CDT Thu Aug 24 2017" "1000 PM CDT Thu Aug 24 2017" "1000 PM CDT Thu Aug 24 2017" "1000 PM CDT Thu Aug 24 2017" ... ## .. .. ..$ ADVISNUM : chr [1:8] "19" "19" "19" "19" ... ## .. .. ..$ BASIN : chr [1:8] "AL" "AL" "AL" "AL" ... ## .. .. ..$ DATELBL : chr [1:8] "10:00 PM Thu" "7:00 AM Fri" "7:00 PM Fri" "7:00 AM Sat" ... ## .. .. ..$ DVLBL : chr [1:8] "H" "H" "M" "M" ... ## .. .. ..$ FCSTPRD : num [1:8] 120 120 120 120 120 120 120 120 ## .. .. ..$ FLDATELBL: chr [1:8] "2017-08-24 7:00 PM Thu CDT" "2017-08-25 7:00 AM Fri CDT" "2017-08-25 7:00 PM Fri CDT" "2017-08-26 7:00 AM Sat CDT" ... ## .. .. ..$ GUST : num [1:8] 90 115 135 120 85 45 45 45 ## .. .. ..$ LAT : num [1:8] 25.2 26.1 27.2 28.1 28.6 28.5 28.5 29.5 ## .. .. ..$ LON : num [1:8] -94.6 -95.6 -96.5 -97.1 -97.3 -97.5 -97 -95 ## .. .. ..$ MAXWIND : num [1:8] 75 95 110 100 70 35 35 35 ## .. .. ..$ MSLP : num [1:8] 973 9999 9999 9999 9999 ... ## .. 
.. ..$ SSNUM : num [1:8] 1 2 3 3 1 0 0 0 ## .. .. ..$ STORMNAME: chr [1:8] "Hurricane Harvey" "Hurricane Harvey" "Hurricane Harvey" "Hurricane Harvey" ... ## .. .. ..$ STORMNUM : num [1:8] 9 9 9 9 9 9 9 9 ## .. .. ..$ STORMSRC : chr [1:8] "Tropical Cyclone" "Tropical Cyclone" "Tropical Cyclone" "Tropical Cyclone" ... ## .. .. ..$ STORMTYPE: chr [1:8] "HU" "HU" "MH" "MH" ... ## .. .. ..$ TCDVLP : chr [1:8] "Hurricane" "Hurricane" "Major Hurricane" "Major Hurricane" ... ## .. .. ..$ TAU : num [1:8] 0 12 24 36 48 72 96 120 ## .. .. ..$ TCDIR : num [1:8] 315 9999 9999 9999 9999 ... ## .. .. ..$ TCSPD : num [1:8] 9 9999 9999 9999 9999 ... ## .. .. ..$ TIMEZONE : chr [1:8] "CDT" "CDT" "CDT" "CDT" ... ## .. .. ..$ VALIDTIME: chr [1:8] "25/0000" "25/1200" "26/0000" "26/1200" ... ## .. ..@ coords.nrs : num(0) ## .. ..@ coords : num [1:8, 1:2] -94.6 -95.6 -96.5 -97.1 -97.3 -97.5 -97 -95 25.2 26.1 ... ## .. .. ..- attr(*, "dimnames")=List of 2 ## .. .. .. ..$ : NULL ## .. .. .. ..$ : chr [1:2] "coords.x1" "coords.x2" ## .. ..@ bbox : num [1:2, 1:2] -97.5 25.2 -94.6 29.5 ## .. .. ..- attr(*, "dimnames")=List of 2 ## .. .. .. ..$ : chr [1:2] "coords.x1" "coords.x2" ## .. .. .. ..$ : chr [1:2] "min" "max" ## .. ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot ## .. .. .. ..@ projargs: chr "+proj=longlat +a=6371200 +b=6371200 +no_defs" ## $ al092017_019_ww_wwlin:Formal class 'SpatialLinesDataFrame' [package "sp"] with 4 slots ## .. ..@ data :'data.frame': 5 obs. of 8 variables: ## .. .. ..$ STORMNAME: chr [1:5] "Harvey" "Harvey" "Harvey" "Harvey" ... ## .. .. ..$ STORMTYPE: chr [1:5] "HU" "HU" "HU" "HU" ... ## .. .. ..$ ADVDATE : chr [1:5] "1000 PM CDT Thu Aug 24 2017" "1000 PM CDT Thu Aug 24 2017" "1000 PM CDT Thu Aug 24 2017" "1000 PM CDT Thu Aug 24 2017" ... ## .. .. ..$ ADVISNUM : chr [1:5] "19" "19" "19" "19" ... ## .. .. ..$ STORMNUM : num [1:5] 9 9 9 9 9 ## .. .. ..$ FCSTPRD : num [1:5] 120 120 120 120 120 ## .. .. ..$ BASIN : chr [1:5] "AL" "AL" "AL" "AL" ... ## .. .. ..$ TCWW : chr [1:5] "TWA" "HWA" "TWR" "TWR" ... ## .. ..@ lines :List of 5 ## .. .. ..$ :Formal class 'Lines' [package "sp"] with 2 slots ## .. .. .. .. ..@ Lines:List of 1 ## .. .. .. .. .. ..$ :Formal class 'Line' [package "sp"] with 1 slot ## .. .. .. .. .. .. .. ..@ coords: num [1:3, 1:2] -97.7 -97.4 -97.2 24.3 25.2 ... ## .. .. .. .. ..@ ID : chr "0" ## .. .. ..$ :Formal class 'Lines' [package "sp"] with 2 slots ## .. .. .. .. ..@ Lines:List of 1 ## .. .. .. .. .. ..$ :Formal class 'Line' [package "sp"] with 1 slot ## .. .. .. .. .. .. .. ..@ coords: num [1:3, 1:2] -97.2 -97.2 -97.3 26 26.1 ... ## .. .. .. .. ..@ ID : chr "1" ## .. .. ..$ :Formal class 'Lines' [package "sp"] with 2 slots ## .. .. .. .. ..@ Lines:List of 1 ## .. .. .. .. .. ..$ :Formal class 'Line' [package "sp"] with 1 slot ## .. .. .. .. .. .. .. ..@ coords: num [1:3, 1:2] -97.2 -97.2 -97.3 26 26.1 ... ## .. .. .. .. ..@ ID : chr "2" ## .. .. ..$ :Formal class 'Lines' [package "sp"] with 2 slots ## .. .. .. .. ..@ Lines:List of 1 ## .. .. .. .. .. ..$ :Formal class 'Line' [package "sp"] with 1 slot ## .. .. .. .. .. .. .. ..@ coords: num [1:10, 1:2] -95.6 -95.3 -95.1 -95.1 -94.8 ... ## .. .. .. .. ..@ ID : chr "3" ## .. .. ..$ :Formal class 'Lines' [package "sp"] with 2 slots ## .. .. .. .. ..@ Lines:List of 1 ## .. .. .. .. .. ..$ :Formal class 'Line' [package "sp"] with 1 slot ## .. .. .. .. .. .. .. ..@ coords: num [1:16, 1:2] -97.3 -97.3 -97.3 -97.4 -97.4 ... ## .. .. .. .. ..@ ID : chr "4" ## .. 
..@ bbox : num [1:2, 1:2] -97.7 24.3 -94.4 29.8 ## .. .. ..- attr(*, "dimnames")=List of 2 ## .. .. .. ..$ : chr [1:2] "x" "y" ## .. .. .. ..$ : chr [1:2] "min" "max" ## .. ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot ## .. .. .. ..@ projargs: chr "+proj=longlat +a=6371200 +b=6371200 +no_defs"

We get four spatial dataframes: the forecast track line, the forecast cone polygon, the forecast points and the watch/warning lines.

names(gis) ## [1] "al092017_019_5day_lin" "al092017_019_5day_pgn" "al092017_019_5day_pts" ## [4] "al092017_019_ww_wwlin"

With the exception of point spatial dataframes (which can be converted to a dataframe using tibble::as_data_frame), use the helper function shp_to_df to convert the spatial dataframes to dataframes.

Forecast Track

library(ggplot2)

al_tracking_chart(color = "black", size = 0.1, fill = "white") +
  geom_path(data = shp_to_df(gis$al092017_019_5day_lin), aes(x = long, y = lat))

Use geom_path instead of geom_line to keep the positions in order.

You can "zoom in" even further using ggplot2::coord_equal. For that, we need to know the limits of our objects (minimum and maximum latitude and longitude) or bounding box. Thankfully, the sp package can get us this information with the bbox function.

But, we don't want to use the "al092017_019_5day_lin" dataset. Our gis dataset contains a forecast cone which expands well beyond the lines dataset. Take a look:

sp::bbox(gis$al092017_019_5day_lin)
##     min   max
## x -97.5 -94.6
## y  25.2  29.5
sp::bbox(gis$al092017_019_5day_pgn)
##          min       max
## x -100.01842 -90.96327
## y   24.86433  33.01644

So, let's get the bounding box of our forecast cone dataset and zoom in on our map.

bb <- sp::bbox(gis$al092017_019_5day_pgn)

al_tracking_chart(color = "black", size = 0.1, fill = "white") +
  geom_path(data = shp_to_df(gis$al092017_019_5day_lin), aes(x = long, y = lat)) +
  coord_equal(xlim = c(bb[1,1], bb[1,2]), ylim = c(bb[2,1], bb[2,2]))

That's much better. For simplicity I'm going to save the base map, bp, without the line plot.

bp <- al_tracking_chart(color = "black", size = 0.1, fill = "white") +
  coord_equal(xlim = c(bb[1,1], bb[1,2]), ylim = c(bb[2,1], bb[2,2]))

Forecast Points

Forecast points identify each forecast position along with forecast winds and date. Remember that for point spatial dataframes you use tibble::as_data_frame rather than shp_to_df.

bp + geom_point(data = tibble::as_data_frame(gis$al092017_019_5day_pts), aes(x = long, y = lat))

If you ran the code above you would get an error.

Error in FUN(X[[i]], ...) : object 'long' not found

Why? The variable long does not exist as it does in other GIS datasets; it is lon. This is one of the inconsistencies I was referring to previously. Additionally, the variables are all uppercase.

names(gis$al092017_019_5day_pts) ## [1] "ADVDATE" "ADVISNUM" "BASIN" "DATELBL" "DVLBL" ## [6] "FCSTPRD" "FLDATELBL" "GUST" "LAT" "LON" ## [11] "MAXWIND" "MSLP" "SSNUM" "STORMNAME" "STORMNUM" ## [16] "STORMSRC" "STORMTYPE" "TCDVLP" "TAU" "TCDIR" ## [21] "TCSPD" "TIMEZONE" "VALIDTIME"

Let's try it again.

bp + geom_point(data = tibble::as_data_frame(gis$al092017_019_5day_pts), aes(x = LON, y = LAT))

Better.

Forecast Cone

A forecast cone identifies the probability of error in a forecast. Forecasting tropical cyclones is tricky business – errors increase the further out a forecast is issued. Theoretically, any area within a forecast cone is at risk of seeing cyclone conditions within the given period of time.

Generally, a forecast cone package contains two subsets: 72-hour forecast cone and 120-hour forecast cone. This is identified in the dataset under the variable FCSTPRD. Let's take a look at the 72-hour forecast period:

bp + geom_polygon(data = shp_to_df(gis$al092017_019_5day_pgn) %>% filter(FCSTPRD == 72), aes(x = long, y = lat, color = FCSTPRD))

Nothing there!

As mentioned earlier, these are experimental products issued by the NHC and they do contain inconsistencies. To demonstrate, I'll use Hurricane Ike advisory 42.

df <- gis_advisory(key = "AL092008", advisory = 42) %>% gis_download() ## OGR data source with driver: ESRI Shapefile ## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092008.042_5day_lin" ## with 2 features ## It has 9 fields ## OGR data source with driver: ESRI Shapefile ## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092008.042_5day_pgn" ## with 2 features ## It has 9 fields ## OGR data source with driver: ESRI Shapefile ## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092008.042_5day_pts" ## with 13 features ## It has 20 fields ## OGR data source with driver: ESRI Shapefile ## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092008.042_ww_wwlin" ## with 5 features ## It has 10 fields al_tracking_chart(color = "black", size = 0.1, fill = "white") + geom_polygon(data = shp_to_df(df$al092008_042_5day_pgn) %>% filter(FCSTPRD == 72), aes(x = long, y = lat))

We do, however, have a 120-hour forecast cone for Hurricane Harvey. Let's go ahead and plot that.

bp + geom_polygon(data = gis$al092017_019_5day_pgn, aes(x = long, y = lat), alpha = 0.15)

It's an odd-looking forecast cone, for sure. But this demonstrates the entire area that Harvey could have potentially traveled.

Watches and Warnings

Our last dataset in this package is "al092017_019_ww_wwlin". These are the current watches and warnings in effect. This is a spatial lines dataframe that needs shp_to_df. Again, we use geom_path instead of geom_line. And we want to group our paths by the variable TCWW.

bp + geom_path(data = shp_to_df(gis$al092017_019_ww_wwlin), aes(x = long, y = lat, group = group, color = TCWW), size = 1)

The paths won't follow our coastlines exactly but you get the idea. The abbreviations don't really give much information, either. Convert TCWW to factor and provide better labels for your legend.

ww_wlin <- shp_to_df(gis$al092017_019_ww_wwlin)
ww_wlin$TCWW <- factor(ww_wlin$TCWW,
                       levels = c("TWA", "TWR", "HWA", "HWR"),
                       labels = c("Tropical Storm Watch", "Tropical Storm Warning",
                                  "Hurricane Watch", "Hurricane Warning"))
bp +
  geom_path(data = ww_wlin,
            aes(x = long, y = lat, group = group, color = TCWW),
            size = 1)

See Forecast/Advisory GIS on the rrricanes website for an example of putting all of this data together in one map.

gis_prob_storm_surge

We can also plot the probabilistic storm surge for given locations. Again, you will need the storm Key for this function. There are two additional parameters:

  • products

  • datetime

products can be one or both of "esurge" and "psurge". esurge shows the probability of the cyclone exceeding the given storm surge plus tide within a given forecast period. psurge shows the probability of a given storm surge within a specified forecast period.

One or more products may not exist depending on the cyclone and advisory.

The products parameter expects a list of values for each product. For esurge products, valid values are 10, 20, 30, 40 or 50. For psurge products, valid values are 0, 1, 2, …, 20.

Let's see if any esurge products exist for Harvey.

length(gis_prob_storm_surge(key = key, products = list("esurge" = seq(10, 50, by = 10)))) ## [1] 150

And psurge:

length(gis_prob_storm_surge(key = key, products = list("psurge" = 0:20))) ## [1] 630

So, we have access to a ton of data here. When discussing gis_advisory, we were able to filter by advisory number. With gis_prob_storm_surge, this is not an option; we have to use the datetime parameter to filter. Let's find the Date for advisory 19.

(d <- ds$fstadv %>% filter(Adv == 19) %>% pull(Date))
## [1] "2017-08-25 03:00:00 UTC"

esurge

Now, let's view all esurge products for the date only (excluding the time).

gis_prob_storm_surge(key = key, products = list("esurge" = seq(10, 50, by = 10)), datetime = strftime(d, "%Y%m%d", tz = "UTC")) ## [1] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge10_2017082500.zip" ## [2] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge10_2017082506.zip" ## [3] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge10_2017082512.zip" ## [4] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge10_2017082518.zip" ## [5] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge20_2017082500.zip" ## [6] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge20_2017082506.zip" ## [7] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge20_2017082512.zip" ## [8] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge20_2017082518.zip" ## [9] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge30_2017082500.zip" ## [10] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge30_2017082506.zip" ## [11] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge30_2017082512.zip" ## [12] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge30_2017082518.zip" ## [13] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge40_2017082500.zip" ## [14] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge40_2017082506.zip" ## [15] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge40_2017082512.zip" ## [16] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge40_2017082518.zip" ## [17] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge50_2017082500.zip" ## [18] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge50_2017082506.zip" ## [19] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge50_2017082512.zip" ## [20] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge50_2017082518.zip"

That's still quite a bit. We can narrow it further by adding the hour to the datetime parameter.

gis_prob_storm_surge(key = key, products = list("esurge" = seq(10, 50, by = 10)), datetime = strftime(d, "%Y%m%d%H", tz = "UTC"))

This call will give you an error:

Error: No data available for requested storm/advisory

But this isn't entirely correct. When an advisory package is issued, it contains information for the release time. Some of the GIS datasets are based on the release time minus three hours. So, we need to subtract three hours from d.

Note: There is an additional value, not extracted as of the latest release, that records the position of the cyclone three hours prior. (As I understand it from the NHC, this is due to the time it takes to collect and prepare the data.) Per Issue #102, these values will be added in release 0.2.1. At that point, instead of subtracting three hours from the Date variable, you can simply use the PrevPosDate value for this function.

Let's try it again with the math:

gis_prob_storm_surge(key = key, products = list("esurge" = seq(10, 50, by = 10)), datetime = strftime(d - 60 * 60 * 3, "%Y%m%d%H", tz = "UTC")) ## [1] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge10_2017082500.zip" ## [2] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge20_2017082500.zip" ## [3] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge30_2017082500.zip" ## [4] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge40_2017082500.zip" ## [5] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge50_2017082500.zip"

As I don't want to get all of these datasets, I'll limit my esurge to show surge values with at least a 50% chance of being exceeded:

gis <- gis_prob_storm_surge(key = key, products = list("esurge" = 50), datetime = strftime(d - 60 * 60 * 3, "%Y%m%d%H", tz = "UTC")) %>% gis_download() ## OGR data source with driver: ESRI Shapefile ## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092017_2017082500_e50" ## with 97 features ## It has 2 fields

This will bring us a spatial polygon dataframe.

df <- shp_to_df(gis$al092017_2017082500_e50) bb <- sp::bbox(gis$al092017_2017082500_e50) str(df) ## Classes 'tbl_df', 'tbl' and 'data.frame': 161313 obs. of 9 variables: ## $ long : num -93.2 -93.2 -93.2 -93.2 -93.2 ... ## $ lat : num 29.9 29.9 29.9 29.9 29.9 ... ## $ order : int 1 2 3 4 5 6 7 8 9 10 ... ## $ hole : logi FALSE FALSE FALSE FALSE FALSE FALSE ... ## $ piece : Factor w/ 349 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ... ## $ group : Factor w/ 7909 levels "0.1","0.2","0.3",..: 1 1 1 1 1 1 1 1 1 1 ... ## $ id : chr "0" "0" "0" "0" ... ## $ POINTID: int 1 1 1 1 1 1 1 1 1 1 ... ## $ TCSRG50: num 0 0 0 0 0 0 0 0 0 0 ... al_tracking_chart(color = "black", size = 0.1, fill = "white") + geom_polygon(data = df, aes(x = long, y = lat, group = group, color = TCSRG50)) + coord_equal(xlim = c(bb[1,1], bb[1,2]), ylim = c(bb[2,1], bb[2,2]))

This plot tells us that, along the central Texas coast, the expected storm surge along with tides is greater than 7.5 feet and there is a 50% chance of this height being exceeded.

psurge

The psurge product gives us the probabilistic storm surge for a location within the given forecast period.

gis <- gis_prob_storm_surge(key = key, products = list("psurge" = 20), datetime = strftime(d - 60 * 60 * 3, "%Y%m%d%H", tz = "UTC")) %>% gis_download() ## OGR data source with driver: ESRI Shapefile ## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092017_2017082500_gt20" ## with 12 features ## It has 2 fields

This will bring us a spatial polygon dataframe.

df <- shp_to_df(gis$al092017_2017082500_gt20) bb <- sp::bbox(gis$al092017_2017082500_gt20) str(df) ## Classes 'tbl_df', 'tbl' and 'data.frame': 3293 obs. of 9 variables: ## $ long : num -96.8 -96.8 -96.8 -96.8 -96.8 ... ## $ lat : num 28.5 28.5 28.4 28.4 28.4 ... ## $ order : int 1 2 3 4 5 6 7 8 9 10 ... ## $ hole : logi FALSE FALSE FALSE FALSE FALSE FALSE ... ## $ piece : Factor w/ 54 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ... ## $ group : Factor w/ 227 levels "0.1","0.2","0.3",..: 1 1 1 1 1 1 1 1 1 1 ... ## $ id : chr "0" "0" "0" "0" ... ## $ POINTID : int 1 1 1 1 1 1 1 1 1 1 ... ## $ PSurge20c: num 1 1 1 1 1 1 1 1 1 1 ... al_tracking_chart(color = "black", size = 0.1, fill = "white") + geom_polygon(data = df, aes(x = long, y = lat, group = group, color = PSurge20c)) + coord_equal(xlim = c(bb[1,1], bb[1,2]), ylim = c(bb[2,1], bb[2,2]))

This map shows the cumulative probability that a storm surge of greater than 20 feet will be seen within the highlighted regions.

This particular map doesn't help much as we've zoomed in too far. What may help is a list of probability stations as obtained from the NHC. For this, you can use al_prblty_stations (ep_prblty_stations returns FALSE since, as of this writing, the format is invalid).

stations <- al_prblty_stations() al_tracking_chart(color = "black", size = 0.1, fill = "white") + geom_polygon(data = df, aes(x = long, y = lat, group = group, color = PSurge20c)) + geom_label(data = stations, aes(x = Lon, y = Lat, label = Location)) + coord_equal(xlim = c(bb[1,1], bb[1,2]), ylim = c(bb[2,1], bb[2,2]))

gis_windfield

When possible, there may also be a cyclone wind radius dataset for the current and forecast positions. With this function we can go back to using Key and an advisory number.

gis <- gis_windfield(key = key, advisory = 19) %>% gis_download() ## OGR data source with driver: ESRI Shapefile ## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092017_2017082503_forecastradii" ## with 15 features ## It has 13 fields ## OGR data source with driver: ESRI Shapefile ## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092017_2017082503_initialradii" ## with 3 features ## It has 13 fields names(gis) ## [1] "al092017_2017082503_forecastradii" "al092017_2017082503_initialradii"

Let's get the bounding box and plot our initialradii dataset.

bb <- sp::bbox(gis$al092017_2017082503_initialradii) al_tracking_chart(color = "black", size = 0.1, fill = "white") + geom_polygon(data = shp_to_df(gis$al092017_2017082503_initialradii), aes(x = long, y = lat, group = group, fill = factor(RADII)), alpha = 0.5) + coord_equal(xlim = c(bb[1,1], bb[1,2]), ylim = c(bb[2,1], bb[2,2]))

And add the forecast wind radii data onto the chart (modifying bb):

bb <- sp::bbox(gis$al092017_2017082503_forecastradii) al_tracking_chart(color = "black", size = 0.1, fill = "white") + geom_polygon(data = shp_to_df(gis$al092017_2017082503_initialradii), aes(x = long, y = lat, group = group, fill = factor(RADII)), alpha = 0.5) + geom_polygon(data = shp_to_df(gis$al092017_2017082503_forecastradii), aes(x = long, y = lat, group = group, fill = factor(RADII)), alpha = 0.5) + geom_label(data = stations, aes(x = Lon, y = Lat, label = Location)) + coord_equal(xlim = c(bb[1,1], bb[1,2]), ylim = c(bb[2,1], bb[2,2]))

gis_wsp

Our last GIS dataset is wind speed probabilities. This dataset is neither storm-specific nor basin-specific; you may get results for cyclones halfway across the world.

The two parameters needed are:

  • datetime – again, using the %Y%m%d%H format (not all values are required)

  • res – resolution of the probabilities: 5 km, 0.5 degrees or 0.1 degrees (passed as 5, 0.5 or 0.1).

Wind fields are for 34, 50 and 64 knots. Not all resolutions or windfields will be available at a given time.

Sticking with our variable d, let's first make sure there is a dataset that exists for that time.

gis_wsp(datetime = strftime(d - 60 * 60 * 3, format = "%Y%m%d%H", tz = "UTC")) ## [1] "http://www.nhc.noaa.gov/gis/"

For this article, we'll stick to the higher resolution plot.

# We need a temporarily fixed function to replace gis_wsp(); this will be
# fixed in the package soon.
gis_wsp_2 <- function(datetime, res = c(5, 0.5, 0.1)) {
  res <- as.character(res)
  res <- stringr::str_replace(res, "^5$", "5km")
  res <- stringr::str_replace(res, "^0.5$", "halfDeg")
  res <- stringr::str_replace(res, "^0.1$", "tenthDeg")
  year <- stringr::str_sub(datetime, 0L, 4L)
  request <- httr::GET("http://www.nhc.noaa.gov/gis/archive_wsp.php",
                       query = list(year = year))
  contents <- httr::content(request, as = "parsed", encoding = "UTF-8")
  ds <- rvest::html_nodes(contents, xpath = "//a") %>%
    rvest::html_attr("href") %>%
    stringr::str_extract(".+\\.zip$") %>%
    .[stats::complete.cases(.)]
  if (nchar(datetime) < 10) {
    ptn_datetime <- paste0(datetime, "[:digit:]+")
  } else {
    ptn_datetime <- datetime
  }
  ptn_res <- paste(res, collapse = "|")
  ptn <- sprintf("%s_wsp_[:digit:]{1,3}hr(%s)", ptn_datetime, ptn_res)
  links <- ds[stringr::str_detect(ds, ptn)]
  links <- paste0("http://www.nhc.noaa.gov/gis/", links)
  return(links)
}

gis <- gis_wsp_2(
  datetime = strftime(d - 60 * 60 * 3, format = "%Y%m%d%H", tz = "UTC"),
  res = 5) %>%
  gis_download()
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "2017082500_wsp34knt120hr_5km"
## with 11 features
## It has 1 fields
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "2017082500_wsp50knt120hr_5km"
## with 11 features
## It has 1 fields
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "2017082500_wsp64knt120hr_5km"
## with 11 features
## It has 1 fields

All of these datasets are spatial polygon dataframes. Again, we will need to convert to dataframe using shp_to_df.

bb <- sp::bbox(gis$`2017082500_wsp34knt120hr_5km`)

Examine the structure.

df <- shp_to_df(gis$`2017082500_wsp34knt120hr_5km`) str(df) ## Classes 'tbl_df', 'tbl' and 'data.frame': 24182 obs. of 8 variables: ## $ long : num -97.2 -97.2 -97.2 -97.3 -97.3 ... ## $ lat : num 20.3 20.3 20.3 20.3 20.4 ... ## $ order : int 1 2 3 4 5 6 7 8 9 10 ... ## $ hole : logi FALSE FALSE FALSE FALSE FALSE FALSE ... ## $ piece : Factor w/ 8 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ... ## $ group : Factor w/ 52 levels "0.1","0.2","0.3",..: 1 1 1 1 1 1 1 1 1 1 ... ## $ id : chr "0" "0" "0" "0" ... ## $ PERCENTAGE: chr "<5%" "<5%" "<5%" "<5%" ... al_tracking_chart(color = "black", size = 0.1, fill = "white") + geom_polygon(data = df, aes(x = long, y = lat, group = group, fill = PERCENTAGE), alpha = 0.25) + coord_equal(xlim = c(bb[1,1], bb[1,2]), ylim = c(bb[2,1], bb[2,2]))

There aren't many ways we can narrow this down other than using arbitrary longitude values. The observations in the dataset do not have a variable identifying which storm each set of values belongs to. So, I'll remove the coord_equal call so we're only focused on the Atlantic basin.

al_tracking_chart(color = "black", size = 0.1, fill = "white") + geom_polygon(data = df, aes(x = long, y = lat, group = group, fill = PERCENTAGE), alpha = 0.25)

Of course, you can narrow it down further as you see fit.

Do not confuse this GIS dataset with the wndprb product or similar prblty products, both of which only identify probabilities for given locations.

gis_latest

For active cyclones, you can retrieve all available GIS datasets using gis_latest. Note that, unlike the previous GIS functions, this function will return a list of all GIS dataframes available.

gis <- gis_latest()

Now we have a large list of GIS spatial dataframes. Two things to point out here: first, we now have a "windswath" GIS dataset. This dataset, to the best of my knowledge, does not exist on its own. Therefore, no archived "windswath" datasets are available.

Second, I have found this data fluctuates even from minute to minute. Earlier this year when attempting to develop automated reporting, I found the return value of the call would vary almost with every call.

Of course, that doesn't mean it is not valuable, and why it has been included. You can easily perform checks for specific data you are looking for. If it doesn't exist, bail and try again in a few minutes.
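As a minimal sketch of that kind of check, assuming gis is the list returned by gis_latest() above (matching on the "windswath" substring is my assumption for illustration):

# Look for a wind swath layer among the returned dataset names
swath_names <- grep("windswath", names(gis), value = TRUE)

if (length(swath_names) == 0) {
  message("No windswath dataset available right now; try again in a few minutes.")
} else {
  swath <- gis[[swath_names[1]]]
}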

Potential Issues Using rrricanes

I cannot stress enough that rrricanes is not intended for use during emergency situations, as I myself learned during Hurricane Harvey. The package currently relies on the NHC website which, I truly believe, is curated by hand.

The most common problems I've noticed are:

  • The NHC website being unable to load or slow to respond. This was a hassle in previous releases but seems to have been ironed out as of release 0.2.0.6. Nonetheless, there may be instances where response time is slow.

  • Incorrect storm archive links. For example, during Harvey the link to Harvey's archive page was listed incorrectly on the NHC's annual archives page. If I manually typed the link as it should have been, the storm's correct archive page would load.

As I become more aware of potential problems, I will look for workarounds. I would greatly appreciate any problems being reported as issues on the rrricanes repository.

I will also post known issues beyond my control (such as NHC website issues) to Twitter using the #rrricanes hashtag.

Future Plans

The following data will be added to rrricanes as time allows:

  • Reconnaissance data (release 0.2.2)

  • Computer forecast model data (release 0.2.3)

  • Archived satellite images (tentative)

  • Ship and buoy data (tentative)

Reconnaissance data itself will be a massive project. There are numerous types of products. And, as advisory product formats have changed over the years, so have these. Any help in this task would be tremendously appreciated!

Some computer forecast models are in the public domain and can certainly be of tremendous use. Some are difficult to find (especially archived). The caution in this area is that many websites post this data but may have limitations on how it can be accessed.

Additionally, data may be added as deemed fitting.

Contribute

Anyone is more than welcome to contribute to the package. I would definitely appreciate any help. See Contributions for more information.

I would ask that you follow the Tidyverse style guide. Release 0.2.1 will fully incorporate these rules.

You do not need to submit code in order to be listed as a contributor. If there is a data source (that can legally be scraped) that you feel should be added, please feel free to submit a request. Submitting bug reports and feature requests is also extremely valuable to the success of rrricanes.

Acknowledgments

I want to thank the rOpenSci community for embracing rrricanes and accepting the package into their vast portfolio. This is my first attempt at making a project part of a larger community, and the lessons learned have been outstanding.

I want to thank Maelle Salmon who, in a sense, has been like a guiding angel from start to finish during the entire onboarding and review process.

I want to give a very special thanks to Emily Robinson and Joseph Stachelek for taking the time to put rrricanes to the test, giving valuable insight and recommendations on improving it.

And a thank-you also to James Molyneux, Mark Padgham, and Bob Rudis, all of whom have offered guidance or input that has helped make rrricanes far better than it would have been on my own.


What is the appropriate population scaling of the Affordable Care Act Funding?

Wed, 09/27/2017 - 05:15

(This article was first published on Jonathan Sidi's R Blog, and kindly contributed to R-bloggers)

Analysis of the effects of the Graham-Cassidy Bill on the ACA population –

I have been trying to decipher for myself what is in the current (well, yesterday’s) Graham-Cassidy health care bill. I saw this image on many news outlets a few days ago and my inner hate for pie charts bubbled up.

This is a zoom in on the pie chart … From what I can gather, these figures are attempting to say that there are specific states that are getting relatively more of the federal health care funds under the Affordable Care Act (ACA) than their relative state population. Among the many things that are really hard to do with pie charts, comparing distributions ranks pretty high up there.

It is common practice when comparing geographical areas with different population levels to scale the measure as a rate per person (per capita) or per number of people, e.g. a rate per 1,000 people. In this case that would be population-adjusted state federal funding. The question that needs answering is: what is the relevant population?

Many charts in the last week have scaled the funding by state population (as is alluded to in the figure above), but the funds are not actually being used by everyone in each state; most people get health care from their employer. So, what is the actual population that is being serviced by the federal funding for the ACA? How much of a different picture does that paint from the original figure?

Hopefully this post will help motivate readers to start looking around for more data on the effect of the proposed bill on the appropriation of federal funds at the state level.

My sources of information are the Kaiser Family Foundation site, which has a great database on the ACA and the proposed bill, and Wikipedia for auxiliary population data. We will end up with the following figure, but along the way I learned a number of things that I didn’t know from reading online and seeing the news on TV.

A quick note before you proceed – This is not meant to be an all-encompassing analysis of the predicted effects of the Graham-Cassidy bill. As has been said before: “Healthcare is hard…”, and if I made any bad assumptions I apologize in advance and welcome any comments and suggestions to better understand the subject matter.

Saying that, let’s continue:

library(xml2)
library(rvest)
library(dplyr)
library(ggplot2)
library(geofacet)

knitr::opts_chunk$set(fig.height=7, fig.width=12, warning=FALSE, message=FALSE)

Scraping the relevant information

Kaiser Family Foundation

ACA and Graham-Cassidy federal spending by state

kf_spending <- (xml2::read_html('http://www.kff.org/health-reform/issue-brief/state-by-state-estimates-of-changes-in-federal-spending-on-health-care-under-the-graham-cassidy-bill/') %>%
  rvest::html_table())[[2]][-c(1,2,3,55),]

names(kf_spending) <- c('Location','ACA','GC','DIFF','PCT')

kf_spending$Location[which(kf_spending$Location=='DC')] = 'District of Columbia'

kf_spending <- kf_spending %>%
  mutate_at(.vars=vars(ACA,GC,DIFF,PCT), .funs=funs(as.numeric(gsub('[,%]','',.))))

ACA Medicaid expansion by state

The decision of each state to accept Medicaid expansion will have a large effect on the net impact of GC on the redistribution of federal funds. States that did not accept Medicaid expansion are expected to see a net increase in federal funds.

# http://www.kff.org/health-reform/state-indicator/state-activity-around-expanding-medicaid-under-the-affordable-care-act/?currentTimeframe=0&sortModel=%7B%22colId%22:%22Location%22,%22sort%22:%22asc%22%7D
kf_expansion <- read.csv('data/kf_expansion.csv', stringsAsFactors = FALSE, skip = 2)
kf_expansion <- kf_expansion[-c(1,53:61),-3]
names(kf_expansion)[2] <- 'Expansion'

Population of ACA enrollment by state

The target population that will be used to scale the federal funds is the total marketplace enrollment for each state. We also add the type of marketplace used in each state, to check whether that has any effect.

  • Federally-Facilitated Market
  • State-based Marketplace
  • State-based Marketplace (using HealthCare.gov)
# http://www.kff.org/health-reform/state-indicator/total-marketplace-enrollment/?currentTimeframe=0&sortModel=%7B%22colId%22:%22Total%20Individuals%20Who%20Have%20Selected%20a%20Marketplace%20Plan%22,%22sort%22:%22asc%22%7D
kf_marketplace_pop <- read.csv('data/kf_marketplace_pop.csv', stringsAsFactors = FALSE, skip = 2)
kf_marketplace_pop <- kf_marketplace_pop[-c(1,53:59),]
names(kf_marketplace_pop)[c(2,3)] <- c('Marketplace_Type','N')

Wikipedia

State characteristics (2016 elections and general population)

To gather more characteristics of each state, we scrape the 2016 general election results and the total population of each state, so the prevalent scaling (per total state population) can be used as a comparison.

wiki_elections <- (xml2::read_html('https://en.wikipedia.org/wiki/United_States_presidential_election,_2016') %>%
  rvest::xml_nodes(xpath='//*[@id="mw-content-text"]/div/div[40]/table') %>%
  rvest::html_table())[[1]][-c(1,58),c(1,3,6,23)]

names(wiki_elections) <- c('Location','Clinton','Trump','Total')
wiki_elections$Location[grep('^Nebraska',wiki_elections$Location)] <- 'Nebraska'
wiki_elections$Location[grep('^Maine',wiki_elections$Location)] <- 'Maine'

wiki_elections <- wiki_elections %>%
  mutate_at(.vars = vars(Clinton,Trump,Total), .funs=funs(as.numeric(gsub('[,]','',.)))) %>%
  group_by(Location) %>%
  summarise_at(.vars = vars(Clinton,Trump,Total), .funs = funs(sum)) %>%
  mutate(ClintonPct=Clinton/Total,
         TrumpPct=Trump/Total,
         TrumpWin=ifelse(TrumpPct>ClintonPct,'Trump Win','Clinton Win'))

wiki_pop <- (xml2::read_html('https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population') %>%
  rvest::xml_nodes(xpath='//*[@id="mw-content-text"]/div/table[1]') %>%
  rvest::html_table())[[1]][-c(30,53:64),c(3,4)]

names(wiki_pop) <- c('Location','Total_N')
wiki_pop$Total_N <- as.numeric(gsub('[,]','',wiki_pop$Total_N))

Join all the data sets

We join all the information and create a new variable – the change in federal funds from ACA to GC. This is done both for the rate per 1,000 individuals who have selected a market based plan and for the broader rate per 1,000 individuals of the state total. The former assumes that this more concise definition of the population better reflects the population serviced by the federal funding, and that it is the potential population that would be serviced by the GC bill.

kf <- kf_marketplace_pop %>%
  left_join(kf_expansion, by='Location') %>%
  left_join(wiki_pop, by='Location')

kf <- kf_spending %>%
  left_join(kf, by = c('Location')) %>%
  mutate(ratio_ACA=1000*ACA/N,
         ratio_GC=1000*GC/N,
         ratio_DIFF = ratio_GC-ratio_ACA,
         tot_ratio_ACA=1000*ACA/Total_N,
         tot_ratio_GC=1000*GC/Total_N,
         tot_ratio_DIFF = tot_ratio_GC-tot_ratio_ACA) %>%
  arrange(desc(ratio_DIFF))

kf <- kf %>% left_join(wiki_elections, by='Location')

kf$Expansion <- sprintf('Medicaid Expansion %s', kf$Expansion)
kf$Location <- factor(kf$Location, levels = kf$Location)
kf$Marketplace_Type <- factor(kf$Marketplace_Type,
                              labels=c('Federally-Facilitated Market',
                                       'State-based Marketplace',
                                       'State-based Marketplace (using HealthCare.gov)'))

Plots

Percent of state population enrolled in ACA

First we want to see the scope of the population in each state that has selected an ACA market based plan. (Note California… not quite 12% of the US population.)

kf %>%
  mutate(pop_pct=100*N/Total_N) %>%
  arrange(desc(pop_pct)) %>%
  mutate(Location=factor(Location, levels=Location)) %>%
  ggplot(aes(x=Location, y=pop_pct)) +
  geom_point() +
  coord_flip() +
  labs(y='Percent of Population that have selected an ACA market based plan')

Overall distribution by Medicaid expansion

We then check whether there really is a difference between states that did and did not expand Medicaid under the ACA, and whether being a state that voted Republican rather than Democratic makes a difference.

boxplot_dat <- kf %>%
  dplyr::select(Expansion, Marketplace_Type, TrumpWin, ratio_DIFF, tot_ratio_DIFF) %>%
  reshape2::melt(., id=c('Marketplace_Type','Expansion','TrumpWin'))

levels(boxplot_dat$variable) <- c('per 1,000 Individuals who have\nselected a market based plan',
                                  'per 1,000 Individuals')

boxplot_dat %>%
  ggplot(aes(x=Expansion, y=value, fill=TrumpWin)) +
  geom_boxplot() +
  geom_hline(aes(yintercept=0), linetype=2) +
  facet_wrap(~variable, ncol=1, scales='free_y') +
  labs(title='Change in Federal Funds ACA vs Graham-Cassidy, 2020-2026',
       y='Change in Federal Funds ($ Millions) per 1,000 individuals') +
  theme_bw() +
  theme(legend.position = 'bottom')

Drilling down to state-level figures, we show for each state the change from ACA funding to the proposed GC funding per 1,000 persons who selected a market based ACA plan. The arrows move from ACA to GC funding and the y-axis is ordered by increasing net difference. This comparison is faceted among the different characteristics scraped above.

Some things to look for:

  • New York has the largest negative net funding per 1,000 persons.
  • Kentucky has the largest negative net funding per 1,000 persons among Republican leaning states.
  • The net increase in funding per 1,000 persons for states that did not expand medicaid is mostly minimal.
p <- kf %>%
  ggplot(aes(x=Location, xend=Location, yend=ratio_GC, y=ratio_ACA, colour=ratio_DIFF)) +
  geom_segment(arrow = arrow(length = unit(0.02, "npc"))) +
  coord_flip() +
  scale_colour_gradient(low = 'red', high = 'blue', name='Difference') +
  labs(title='Change in Federal Funds per 1,000 Individuals who have\nselected a market based plan ACA vs Graham-Cassidy, 2020-2026',
       subtitle='Arrow pointing to movement from ACA to Graham-Cassidy',
       caption='Source: Kaiser Family Foundation',
       y='Federal Funds ($ Millions) per 1,000 individuals who have selected a market based plan') +
  theme_bw() +
  theme(legend.position = 'bottom')

Political Leaning

p + facet_wrap(~ TrumpWin, scales='free_y')

ACA Medicaid expansion

p + facet_wrap(~ Expansion, scales='free_y')

ACA Medicaid expansion and Political Leaning

p + facet_wrap(~ Expansion + TrumpWin, scales='free_y')

State Marketplace Type

p + facet_wrap(~ Marketplace_Type, scales='free_y')

ACA Medicaid expansion and State Marketplace Type

p + facet_wrap(~ Expansion + Marketplace_Type, scales='free_y')

Geofaceting

Lastly, we construct a geographic representation of the difference between the ACA and the GC bill using Ryan Hafen’s geofacet package.

states_facet <- state_ranks %>% left_join(kf %>% rename(name=Location), by='name')
states_facet$Expansion <- factor(states_facet$Expansion, labels=c('Expansion','No Expansion'))
states_facet$tile_lbl <- sprintf('%s\n%s', states_facet$Expansion, states_facet$TrumpWin)

Total State Population

states_facet %>%
  ggplot(aes(x='1', y='1', fill=tot_ratio_DIFF)) +
  geom_tile() +
  geom_text(aes(label=tile_lbl), size=2) +
  theme_bw() +
  facet_geo(~ state) +
  scale_fill_gradient2(low='red', mid='white', high='green', name='Difference') +
  theme(legend.position = 'bottom',
        axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank()) +
  labs(title='Change in Federal Funds per 1,000 Individuals, 2020-2026',
       caption='Source: Kaiser Family Foundation')

ACA enrollment population

states_facet %>%
  ggplot(aes(x='1', y='1', fill=ratio_DIFF)) +
  geom_tile() +
  geom_text(aes(label=tile_lbl), size=2) +
  theme_bw() +
  facet_geo(~ state) +
  scale_fill_gradient2(low='red', mid='white', high='green', name='Difference') +
  theme(legend.position = 'bottom',
        axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank()) +
  labs(title='Change in Federal Funds per 1,000 Individuals who have\nselected a market based plan ACA vs Graham-Cassidy, 2020-2026',
       caption='Source: Kaiser Family Foundation')



Data.Table by Example – Part 2

Wed, 09/27/2017 - 05:03

(This article was first published on R – Mathew Analytics, and kindly contributed to R-bloggers)

In part one, I provided an initial walk through of some nice features that are available within the data.table package. In particular, we saw how to filter data and get a count of rows by the date.

dat = fread("rows.csv")
names(dat) <- gsub(" ", "_", names(dat))
dat[1:3]

Let us now add a few columns to our dataset on reported crimes in the city of Chicago. There are many ways to do this, but they involve the use of the := operator. Since data.table updates values by reference, we do not need to save the results as another variable. This is a very desirable feature.

dat[, c("value1", "value2", "value3") := sample(1:50, nrow(dat), replace=TRUE)] dat[, `:=`(value1 = sample(1:50, nrow(dat), replace=TRUE), value2 = sample(1:50, nrow(dat), replace=TRUE), value3 = sample(1:50, nrow(dat), replace=TRUE))]

You can also just use the traditional base R solution for adding new columns as data.tables are also data frames.

dat$value1 <- sample(1:50, nrow(dat), replace=TRUE)
dat$value2 <- sample(1:50, nrow(dat), replace=TRUE)
dat$value3 <- sample(1:50, nrow(dat), replace=TRUE)

In any case, we now have three new columns with randomly selected values between 1 and 50. We can now look to summarize these values and see how they differ across the primary arrest type and other categorical variables.

dat[, .(mean = mean(value1, na.rm = TRUE),
        median = median(value1, na.rm = TRUE),
        min = min(value1, na.rm = TRUE),
        max = max(value1, na.rm = TRUE))]

dat[Primary_Type=="PROSTITUTION",
    .(mean = mean(value1, na.rm = TRUE),
      median = median(value1, na.rm = TRUE),
      min = min(value1, na.rm = TRUE),
      max = max(value1, na.rm = TRUE))]

The above code allows us to get the mean, median, min, and max values from the value1 column for all rows and for when the primary type is prostitution. To get these summaries across multiple columns, we can use lapply, .SD, and .SDcols.

dat[, lapply(.SD, mean, na.rm=TRUE), .SDcols=c("value1", "value2", "value3")]

dat[, lapply(.SD, mean, na.rm=TRUE), by=Primary_Type, .SDcols=c("value1", "value2", "value3")][1:4]

But wait, that only provides the mean of each value column. If you want to apply multiple functions to those columns, you need to write a function that can then be used within sapply or lapply.

my.summary <- function(x){
  c(mean = mean(x), median = median(x), min = min(x), max = max(x))
}

dat[, sapply(.SD, my.summary), .SDcols=c("value1", "value2", "value3")]

Perfect! As you can see, the syntax is concise and is very easy to work with.

In the next part, I’ll cover a few advanced features like get() and {}.



RcppAnnoy 0.0.10

Wed, 09/27/2017 - 04:05

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A few short weeks after the more substantial 0.0.9 release of RcppAnnoy, we have a quick bug-fix update.

RcppAnnoy is our Rcpp-based R integration of the nifty Annoy library by Erik. Annoy is a small and lightweight C++ template header library for very fast approximate nearest neighbours.

Michaël Benesty noticed that our getItemsVector() function didn’t, ahem, do much besides crashing. Simple bug, they happen–now fixed, and a unit test added.

Changes in this version are summarized here:

Changes in version 0.0.10 (2017-09-25)
  • The getItemsVector() function no longer crashes (#24)

Courtesy of CRANberries, there is also a diffstat report for this release.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



Meet the new Microsoft R Server: Microsoft ML Server 9.2

Wed, 09/27/2017 - 01:23

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Microsoft R Server has received a new name and a major update: Microsoft ML Server 9.2 is now available. ML Server provides a scalable production platform for R — and now Python — programs. The basic idea is that a local client can push R or Python code and have it operationalized on the remote server. ML Server is also included with the Data Science Virtual Machine and HDInsight Spark clusters on Azure. 

This video gives a high-level overview of the process, or you can also see details of deploying an R model or a Python model as a web service.

The related Microsoft Machine Learning Services provides similar capabilities for in-database computations within SQL Server 2017 (now with Python as well as R) and (in preview) the fully-managed Azure SQL Database. ML Services also provides real-time scoring of trained models, with predictions generated from the data in the database without invoking the R or Python engine. 

Learn more about the many updates to Microsoft ML Server 9.2 in the blog post linked below.

Cortana Intelligence and Machine Learning Blog: Introducing Microsoft Machine Learning Server 9.2 Release



Working with air quality and meteorological data Exercises (Part-4)

Tue, 09/26/2017 - 18:11

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Atmospheric air pollution is one of the most important environmental concerns in many countries around the world, and it is strongly affected by meteorological conditions. Accordingly, in this set of exercises we use the openair package to work with and analyze air quality and meteorological data. This package provides tools to directly import data from air quality measurement networks across the UK, as well as tools to analyse the data and produce reports.

In the previous exercise set we used data from the MY1 station to see how to import data and extract basic statistical information from it. Then we practiced using functions that are available in the openair package to analyze and visualize data. In this exercise set we will go through more advanced functions that are very useful in air quality data analysis.

Back trajectories are very useful in air pollution analysis and can provide important information on air mass origins. Back-trajectories show the paths air masses have travelled over specific time periods and can be used to identify sources of pollutants. However, calculating and analysing back-trajectory data is a time-consuming process. Therefore, to overcome some of these issues and expand the possibilities for data analysis, openair makes several functions available to access and analyse pre-calculated back trajectories. In this exercise set we will go through these functions.

Answers to the exercises are available here.

For other parts of this exercise set follow the tag openair

Please load the package openair before starting the exercises.

Exercise 1
In the first exercise we will use the importTraj function to import pre-calculated back-trajectories, which will also be used in the following exercises. Use importTraj to import pre-calculated back-trajectories for London for the year 2011. Save the results in a dataframe named "traj".
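A minimal sketch of the call intended here, assuming openair is loaded and using the "london" site identifier from the package documentation:

library(openair)

# Import pre-calculated back-trajectories arriving in London during 2011
traj <- importTraj(site = "london", year = 2011)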

Exercise 2
In this exercise we will use the trajPlot function to plot trajectory lines and visually see how air masses have travelled before arriving at the receptor point.

Use the trajPlot function to plot trajectory lines for the traj dataframe obtained in the previous exercise. Here, only use trajectories for the period of 10-16 January.
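One plausible sketch, assuming the traj dataframe from Exercise 1 and openair's selectByDate helper (dates in day/month/year format):

# Subset the trajectories to 10-16 January 2011 and draw the trajectory lines
traj_jan <- selectByDate(traj, start = "10/1/2011", end = "16/1/2011")
trajPlot(traj_jan)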

Exercise 3
There are also a few other ways of representing the data shown in the previous exercise. For example, we can plot the trajectories for each day.

Use the trajPlot function to plot the same trajectory lines as in the previous exercise, but separate the trajectories by day.

Exercise 4
One of the most useful approaches is to link the back trajectories with the concentrations of a pollutant.

In this exercise, first import data for North Kensington using the importAURN function, and then use the merge function to merge the pollutant data with the back-trajectories into a single dataframe.

Now, use the trajPlot function to plot the trajectories along with the pm10 level associated with each trajectory.
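A hedged sketch of this workflow; the "kc1" AURN site code for North Kensington and the pollutant argument to trajPlot follow the openair documentation, but treat the details as assumptions rather than the official answer:

# Import 2011 measurements for North Kensington and attach them to the trajectories
kc1 <- importAURN(site = "kc1", year = 2011)
traj_kc1 <- merge(traj, kc1, by = "date")

# Colour the January trajectories by their associated PM10 concentrations
trajPlot(selectByDate(traj_kc1, start = "10/1/2011", end = "16/1/2011"),
         pollutant = "pm10")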

You can use air quality data and weather patterns in combination with spatial data visualization. Learn more about spatial data in the online course
[Intermediate] Spatial Data Analysis with R, QGIS & More.
In this course you will learn how to:

  • Work with Spatial data and maps
  • Learn about different tools to develop spatial data next to R
  • And much more

Exercise 5
Another useful analysis is to identify the contribution of high-concentration back trajectories, in other words, to identify the potential sources contributing to high pollutant concentrations in the area of interest. This can be done with trajLevel. In this function we can use a number of statistic options such as frequency and difference. Another statistical method that can be applied is the potential source contribution function (PSCF), which is widely used in air mass trajectory analysis to identify potential sources or hotspots for pollution.

Use the trajLevel function to plot and identify potential sources of high pm2.5 concentrations in London in 2011.
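A minimal sketch, assuming the merged trajectory/pollutant dataframe from Exercise 4 and that trajLevel accepts statistic = "pscf" as described in the openair manual:

# Gridded PSCF surface highlighting likely source regions for high PM2.5
trajLevel(traj_kc1, pollutant = "pm2.5", statistic = "pscf")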

Related exercise sets:
  1. Working with air quality and meteorological data Exercises (Part-3)
  2. Working with air quality and meteorological data Exercises (Part-2)
  3. Working with air quality and meteorological data Exercises (Part-1)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory


Microsoft Cognitive Services Vision API in R

Tue, 09/26/2017 - 17:53

(This article was first published on R-Projects – Stoltzmaniac, and kindly contributed to R-bloggers)

Microsoft Cognitive Services Vision API in R

A little while ago I did a brief tutorial of the Google Vision API using RoogleVision, created by Mark Edmondson. I couldn’t find anything similar in R for the Microsoft Cognitive Services API, so I thought I would give it a shot. I whipped this example together quickly as a proof-of-concept, but I could certainly see myself building an R package to support this (unless someone can point to one – and please do if one exists)!

A quick example, sending this image retrieved the location of the human face and created a caption! Here’s my dog lined up next to his doppelganger:

The API is extremely easy to access using RCurl and httr. There are a lot of options which can be accessed. In this example, I’ll just cover the basics of image detection and descriptions.

If you don’t want to spend time writing a bunch of code, you can simply use the “helper_functions.R” file I created and swap out your credentials and API endpoint to get it working.

Getting Started With Microsoft Cognitive Services

In order to get started, all you need is an Azure Account which is free if you can keep yourself under certain thresholds and limits. There is even a free trial period (at the time this was written, at least).

Once that is taken care of there are a few things you need to do:

  1. Login to portal.azure.com
  2. On the lefthand menu click “Add”
  3. Click on “AI + Cognitive Services” and then the “Computer Vision API”
  4. Fill out the information required. You may have “Free Trial” under Subscription. Pay special attention to Location because this will be used in your API script
  5. In the lefthand menu, click “Keys” underneath “Resource Management” and you will find what you need for credentials. Underneath your Endpoint URL, click on “Show access keys…” – copy your key and use it in your script (do not make this publicly accessible)
  6. You’re ready to go!

Now, you can write a script to utilize the power of the Microsoft Cognitive Vision API.
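Before making any calls, define the endpoint and key that the snippets below rely on. A hedged setup sketch: the region in the URL and the environment-variable name are assumptions of mine, so substitute the Location you chose in the portal and your own key (and keep the key out of public repositories):

library(httr)

# Key copied from the "Keys" blade in the Azure portal, read from an environment variable
api_key <- Sys.getenv("AZURE_VISION_KEY")

# Endpoint depends on the Location selected when the resource was created
api_endpoint_url <- "https://westus.api.cognitive.microsoft.com/vision/v1.0/analyze"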

What data can you get?

There are a lot of features you can request. I’m only asking for: description, tags, categories, and faces. You can also return: image type, color, and adult. There are also details which can be returned such as: landmarks and celebrities.

Here is the setup and API call:

image_url = 'https://imgur.com/rapIn0u.jpg'

visualFeatures = "Description,Tags,Categories,Faces"
# options = "Categories, Tags, Description, Faces, ImageType, Color, Adult"

details = "Landmarks"
# options = Landmarks, Celebrities

reqURL = paste(api_endpoint_url,
               "?visualFeatures=", visualFeatures,
               "&details=", details,
               sep="")

APIresponse = POST(url = reqURL,
                   content_type('application/json'),
                   add_headers(.headers = c('Ocp-Apim-Subscription-Key' = api_key)),
                   body=list(url = image_url),
                   encode = "json")

df = content(APIresponse)

The dataframe returned looks messy, but isn’t too bad once you dive in. Take a look:

str(df)
## List of 5
##  $ tags       :List of 6
##   ..$ :List of 2
##   .. ..$ name      : chr "dog"
##   .. ..$ confidence: num 0.987
##   ..$ :List of 3
##   .. ..$ name      : chr "mammal"
##   .. ..$ confidence: num 0.837
##   .. ..$ hint      : chr "animal"
##   ..$ :List of 2
##   .. ..$ name      : chr "looking"
##   .. ..$ confidence: num 0.814
##   ..$ :List of 2
##   .. ..$ name      : chr "animal"
##   .. ..$ confidence: num 0.811
##   ..$ :List of 2
##   .. ..$ name      : chr "posing"
##   .. ..$ confidence: num 0.54
##   ..$ :List of 2
##   .. ..$ name      : chr "staring"
##   .. ..$ confidence: num 0.165
##  $ description:List of 2
##   ..$ tags    :List of 18
##   .. ..$ : chr "dog"
##   .. ..$ : chr "mammal"
##   .. ..$ : chr "looking"
##   .. ..$ : chr "animal"
##   .. ..$ : chr "photo"
##   .. ..$ : chr "posing"
##   .. ..$ : chr "camera"
##   .. ..$ : chr "man"
##   .. ..$ : chr "standing"
##   .. ..$ : chr "smiling"
##   .. ..$ : chr "face"
##   .. ..$ : chr "white"
##   .. ..$ : chr "holding"
##   .. ..$ : chr "close"
##   .. ..$ : chr "wearing"
##   .. ..$ : chr "laying"
##   .. ..$ : chr "head"
##   .. ..$ : chr "teeth"
##   ..$ captions:List of 1
##   .. ..$ :List of 2
##   .. .. ..$ text      : chr "a close up of Albert Einstein and a dog posing for the camera"
##   .. .. ..$ confidence: num 0.892
##  $ requestId  : chr "2143e23a-14c8-47c4-9750-9bfc82381512"
##  $ metadata   :List of 3
##   ..$ width : int 824
##   ..$ height: int 824
##   ..$ format: chr "Jpeg"
##  $ faces      :List of 1
##   ..$ :List of 3
##   .. ..$ age          : int 73
##   .. ..$ gender       : chr "Male"
##   .. ..$ faceRectangle:List of 4
##   .. .. ..$ left  : int 505
##   .. .. ..$ top   : int 241
##   .. .. ..$ width : int 309
##   .. .. ..$ height: int 309

The top 5 descriptions returned are:

description_tags = df$description$tags
description_tags_tib = tibble(tag = character())
for(tag in description_tags){
  for(text in tag){
    if(class(tag) != "list"){ ## To remove the extra caption from being included
      tmp = tibble(tag = tag)
      description_tags_tib = description_tags_tib %>% bind_rows(tmp)
    }
  }
}
knitr::kable(description_tags_tib[1:5,])

tag
dog
mammal
looking
animal
photo

The caption returned:

captions = df$description$captions
captions_tib = tibble(text = character(), confidence = numeric())
for(caption in captions){
  tmp = tibble(text = caption$text, confidence = caption$confidence)
  captions_tib = captions_tib %>% bind_rows(tmp)
}
knitr::kable(captions_tib)

text                                                             confidence
a close up of Albert Einstein and a dog posing for the camera    0.891846

The metadata returned:

metadata = df$metadata
metadata_tib = tibble(width = metadata$width,
                      height = metadata$height,
                      format = metadata$format)
knitr::kable(metadata_tib)

width   height   format
824     824      Jpeg

The locations of faces returned:

faces = df$faces
faces_tib = tibble(faceID = numeric(), age = numeric(), gender = character(),
                   x1 = numeric(), x2 = numeric(), y1 = numeric(), y2 = numeric())
n = 0
for(face in faces){
  n = n + 1
  tmp = tibble(faceID = n,
               age = face$age,
               gender = face$gender,
               x1 = face$faceRectangle$left,
               y1 = face$faceRectangle$top,
               x2 = face$faceRectangle$left + face$faceRectangle$width,
               y2 = face$faceRectangle$top + face$faceRectangle$height)
  faces_tib = faces_tib %>% bind_rows(tmp)
}
#faces_tib
knitr::kable(faces_tib)

faceID   age   gender   x1    x2    y1    y2
1        73    Male     505   814   241   550

A few more examples:

As always, you can find the code I used on my GitHub.

Side note: I used a ton of for loops to access the data – not ideal… please let me know if you have better ways of dealing with nested data like this when you have unknown numbers of levels.
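One possible alternative to the nested for loops, sketched under the assumption that df has the structure shown above, is to let purrr build one row per sub-list:

library(purrr)
library(tibble)

# Each element of df$tags is a named list; map_dfr turns each one into a tibble row
tags_tib <- map_dfr(df$tags, ~ tibble(name = .x$name, confidence = .x$confidence))

# Same idea for the captions under df$description
captions_tib <- map_dfr(df$description$captions,
                        ~ tibble(text = .x$text, confidence = .x$confidence))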



13 new R Jobs for R users (2017-09-26) – from all over the world

Tue, 09/26/2017 - 16:48
To post your R job on the next post

Just visit  this link and  post  a new R job  to the R community.

You can post a job for  free  (and there are also “featured job” options available for extra exposure).

Current R jobs

Job seekers:  please follow the links below to learn more and apply for your R job of interest:

Featured Jobs
More New Jobs
  1. Freelance
    R Programmer & Statistician Empiricus – Posted by  empiricus
    Anywhere
    25 Sep 2017
  2. Full-Time
    Data Scientist Edelweiss Business Services  – Posted by  Aakash Gupta
    Mumbai Maharashtra, India
    20 Sep 2017
  3. Full-Time
    Postdoctoral Data Scientist: GIS and mHealth Harvard T.H. Chan School of Public Health  – Posted by  Christine Choirat
    Boston Massachusetts, United States
    14 Sep 2017
  4. Full-Time
    Data Scientist @ London Hiscox  – Posted by  alan.south
    London England, United Kingdom
    13 Sep 2017
  5. Freelance
    Crafty developer: R->C++ SherandoResearch  – Posted by  feersum_monkey
    Anywhere
    8 Sep 2017
  6. Full-Time
    Solutions Engineer for RStudio @ U.S.A. (or anywhere) RStudio  – Posted by  nwstephens
    Anywhere
    8 Sep 2017
  7. Full-Time
    Senior Data Scientist @ Boston, Massachusetts, U.S. Colaberry Data Analytics  – Posted by  Colaberry_DataAnalytics
    Boston Massachusetts, United States
    8 Sep 2017
  8. Part-Time
    Cran-R Programmierer Praktikum @ Zürich, Switzerland Fisch Asset Management AG  – Posted by  HR_Fisch
    Zürich Zürich, Switzerland
    4 Sep 2017
  9. Freelance
    Data analyst at talmix @ London, UK Talmix  – Posted by  manikauluck
    London England, United Kingdom
    4 Sep 2017
  10. Full-Time
    R Developer and Open Source Community Manager at UCLA @ Los Angeles, California UCLA  – Posted by  graemeblair
    Los Angeles California, United States
    4 Sep 2017
  11. Full-Time
    Work in Bodrum: Full-stack web developer with R experience @ Muğla, Turkey PranaGEO Ltd.  – Posted by  acizmeli
    Muğla Turkey
    4 Sep 2017
  12. Full-Time
    Digital Performance & Data Science Section Manager @ Paris, France Nissan Automotive Europe  – Posted by  Estefania Rendon
    Paris Île-de-France, France
    31 Aug 2017
  13. Freelance
    Quantitative Analyst (Credit) @ Zürich, Switzerland ipub  – Posted by  gluc
    Zürich Zürich, Switzerland
    29 Aug 2017

 

In  R-users.com  you can see  all  the R jobs that are currently available.

R-users Resumes

R-users also has a  resume section  which features CVs from over 300 R users. You can  submit your resume  (as a “job seeker”) or  browse the resumes  for free.

(you may also look at  previous R jobs posts ).


R live class | R for Data Science | Oct 4-5 Milan

Tue, 09/26/2017 - 13:06

(This article was first published on R blog | Quantide - R training & consulting, and kindly contributed to R-bloggers)

 

R for Data Science is our first course of the autumn term. It takes place on October 4-5 in a location close to Milano Lima.

If you want to deepen your data analysis knowledge using the most modern R tools, or you want to figure out if R is the right solution for you, this is your class. This course is for people who are willing to learn about R and would like to get an overview of its capabilities for data science, or those who have very little knowledge of R.

R for Data Science Outline

– A bit of R history and online resources
– R and R-Studio installation and configuration
– Your first R session
– Your first R markdown document
– R objects: data and functions
– Data import from external sources: excel and database connection
– Data manipulation with tidyverse
– Tidy data with tidyr
– Data visualization using ggplot
– An overview of statistical models and data mining with R

The course is for max 6 attendees.

R for Data Science is organized by the R training and consulting company Quantide and is taught in Italian, while all the course materials are in English.

Location

The course location is 550 mt. (7 minutes on walk) from Milano central station and just 77 mt. (1 minute on walk) from Lima subway station.

Registration

If you want to reserve a seat go to: FAQ, detailed program and tickets.

Other R courses | Autumn term

You can find an overview of all our courses here. Next dates will be:

  • October 11-12: Statistics for Data Science. Develop a wide variety of statistical models with R, from the simplest Linear Regression to the most sophisticated GLM models. Reserve now!
  • October 25-26: Machine Learning with R. Find patterns in large data sets using the R tools for Dimensionality Reduction, Clustering, Classification and Prediction. Reserve now!
  • November 7-8: Data Visualization and Dashboard with R. Show the story behind your data: create beautiful effective visualizations and interactive Shiny dashboards. Reserve now!
  • November 21-22: R with Database and Big Data. From databases to distributed infrastructure, master the R techniques to handle and query Big Data. Reserve now!
  • November 29-30: Professional R Programming. Organise, document and test your code: write efficient functions, improve the code reproducibility and build R packages. Reserve now!

In case you are a group of people interested in more than one class, write us at training[at]quantide[dot]com! We can arrange together a tailor-made course, picking all the topics that are interesting for your organization and dropping the rest.

The post R live class | R for Data Science | Oct 4-5 Milan appeared first on Quantide – R training & consulting.



When are Citi Bikes Faster than Taxis in New York City?

Tue, 09/26/2017 - 12:00

(This article was first published on Category: R | Todd W. Schneider, and kindly contributed to R-bloggers)

Every day in New York City, millions of commuters take part in a giant race to determine transportation supremacy. Cars, bikes, subways, buses, ferries, and more all compete against one another, but we never get much explicit feedback as to who “wins.” I’ve previously written about NYC’s public taxi data and Citi Bike share data, and it occurred to me that these datasets can help identify who’s fastest, at least between cars and bikes. In fact, I’ve built an interactive guide that shows when a Citi Bike is faster than a taxi, depending on the route and the time of day.

The methodology and findings will be explained more below, and all code used in this post is available open-source on GitHub.

Interactive guide to when taxis are faster or slower than Citi Bikes

Pick a starting neighborhood and a time. The map shows whether you’d expect to get to each neighborhood faster with a taxi (yellow) or a Citi Bike (dark blue).

[Interactive map: pick a starting neighborhood in Manhattan, Brooklyn, or Queens (within the Citi Bike service area) and a weekday time window (8:00 AM–11:00 AM, 11:00 AM–4:00 PM, 4:00 PM–7:00 PM, 7:00 PM–10:00 PM, or 10:00 PM–8:00 AM) to see taxi vs. Citi Bike travel times to other neighborhoods; the default view shows trips from Midtown East, weekdays 8:00 AM–11:00 AM. Data via NYC TLC and Citi Bike, based on trips 7/1/2016–6/30/2017. The interactive version is available at toddwschneider.com.]

40% of weekday taxi trips—over 50% during peak hours—would be faster as Citi Bike rides

I estimate that 40% of weekday taxi trips within the Citi Bike service area would expect to be faster if switched to a Citi Bike, based on data from July 2016 to June 2017. During peak midday hours, more than 50% of taxi trips would expect to be faster as Citi Bike rides.

There are some significant caveats to this estimate. In particular, if many taxi riders simultaneously switched to Citi Bikes, the bike share system would probably hit severe capacity constraints, making it difficult to find available bikes and docks. Increased bike usage might eventually lead to fewer vehicles on the road, which could ease vehicle congestion, and potentially increase bike lane congestion. It’s important to acknowledge that when I say “40% of taxi trips would be faster if they switched to Citi Bikes”, we’re roughly considering the decision of a single able-bodied person, under the assumption that everyone else’s behavior will remain unchanged.

Heading crosstown in Manhattan? Seriously consider taking a bike instead of a car!

Crosstown Manhattan trips are generally regarded as more difficult than their north-south counterparts. There are fewer subways that run crosstown, and if you take a car, the narrower east-west streets often feel more congested than the broad north-south avenues with their synchronized traffic lights. Crosstown buses are so notoriously slow that they’ve been known to lose races against tricycles.

I divided Manhattan into the crosstown zones pictured above, then calculated the taxi vs. Citi Bike win rate for trips that started and ended within each zone. Taxis fare especially badly in the Manhattan central business district. If you take a midday taxi that both starts and ends between 42nd and 59th streets, there’s over a 70% chance that the trip would have been faster as a Citi Bike ride.

Keep in mind that’s for all trips between 42nd and 59th streets. For some of the longest crosstown routes, for example, from the United Nations on the far east side to Hell’s Kitchen on the west, Citi Bikes beat taxis 90% of the time during the day. It’s worth noting that taxis made 8 times as many trips as Citi Bikes between 42nd and 59th streets from July 2016 to June 2017—almost certainly there would be less total time spent in transit if some of those taxi riders took bikes instead.

Hourly graphs for all of the crosstown zones are available here, and here’s a summary table for weekday trips between 8:00 AM and 7:00 PM:

Manhattan crosstown zone    % taxis lose to Citi Bikes
96th–110th                  41%
77th–96th                   36%
59th–77th                   54%
42nd–59th                   69%
14th–42nd                   64%
Houston–14th                54%
Canal–Houston               60%
Below Canal                 55%

A reminder that this analysis restricts to trips that start and end within the same zone, so for example a trip from 23rd St to 57th St would be excluded because it starts and ends in different zones.

Taxis fare better for trips that stay on the east or west sides of Manhattan: 35% of daytime taxi trips that start and end west of 8th Avenue would expect to be faster as Citi Bike trips, along with 38% of taxi trips that start and end east of 3rd Avenue. Taxis also generally beat Citi Bikes on longer trips:

Taxis are losing more to Citi Bikes over time

When the Citi Bike program began in July 2013, less than half of weekday daytime taxi trips would have been faster if switched to Citi Bikes. I ran a month-by-month analysis to see how the taxi vs. Citi Bike calculus has changed over time, and discovered that taxis are getting increasingly slower compared to Citi Bikes:

Note that this month-by-month analysis restricts to the original Citi Bike service area, before the program expanded in August 2015. The initial expansion was largely into Upper Manhattan and the outer boroughs, where taxis generally fare better than bikes, and so to keep things consistent, I restricted the above graph to areas that have had Citi Bikes since 2013.

Taxis are losing more to Citi Bikes over time because taxi travel times have gotten slower, while Citi Bike travel times have remained roughly unchanged. I ran a pair of linear regressions to model travel times as a function of:

  • trip distance
  • time of day
  • precipitation
  • whether the route crosses between Manhattan and the outer boroughs
  • month of year
  • year

The regression code and output are available on GitHub: taxi, Citi Bike
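For readers who just want the general shape of such a model without opening the repo, here is a minimal sketch; the data frame and variable names (trips, duration_seconds, and so on) are placeholders of mine, not the author's actual specification:

# Log travel time as a function of distance, hour of day, precipitation,
# a Manhattan/outer-borough crossing flag, month, and year
taxi_fit <- lm(
  log(duration_seconds) ~ log(trip_distance) + factor(hour_of_day) +
    precipitation + crosses_borough_boundary + factor(month) + factor(year),
  data = trips
)

summary(taxi_fit)
# exp(coefficient) - 1 for the year terms gives the % slowdown relative to the base year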

As usual, I make no claim that this is a perfect model, but it does account for the basics, and if we look at the coefficients by year, it shows that, holding the other variables constant, a taxi trip in 2017 took 17% longer than the same trip in 2009. For example, a weekday morning trip from Midtown East to Union Square that took 10 minutes in 2009 would average 11:45 in 2017.

The same regression applied to Citi Bikes shows no such slowdown over time, in fact Citi Bikes got slightly faster. The regressions also show that:

  1. Citi Bike travel times are less sensitive to time of day than taxi travel times. A peak midday taxi trip averages 40% longer than the same trip at off-peak hours, while a peak Citi Bike trip averages 15% longer than during off-peak hours.
  2. Rainy days are associated with 2% faster Citi Bike travel times and 1% slower taxi travel times.
  3. For taxis, fall months have the slowest travel times, but for Citi Bikes, summer has the slowest travel times. For both, January has the fastest travel times.
Taxis are more prone to very bad days

It’s one thing to say that 50% of midday taxi trips would be faster as Citi Bike rides, but how much does that vary from day to day? You could imagine there are some days with severe road closures, where more nimble bikes have an advantage getting around traffic, or other days in the dead of summer, when taxis might take advantage of the less crowded roads.

I ran a more granular analysis to measure win/loss rates for individual dates. Here’s a histogram of the taxi loss rate—the % of taxi trips we’d expect to be faster if switched to Citi Bikes—for weekday afternoon trips from July 2016 to June 2017:

Many days see a taxi loss rate of just over 50%, but there are tails on both ends, indicating that some days tilt in favor of either taxis or Citi Bikes. I was curious if we could learn anything from the outliers on each end, so I looked at individual dates to see if there were any obvious patterns.

The dates when taxis were the fastest compared to Citi Bikes look like dates that probably had less traffic than usual. The afternoon with the highest taxi win rate was Monday, October 3, 2016, which was the Jewish holiday of Rosh Hashanah, when many New Yorkers would have been home from work or school. The next 3 best days for taxis were all Mondays in August, when I’d imagine a lot of people were gone from the city on vacation.

The top 4 dates where Citi Bikes did best against taxis were all rainy days in the fall of 2016. I don’t know why rainy days make bikes faster relative to taxis, maybe rain causes traffic on the roads that disproportionately affects cars, but it’s also possible that there’s a selection bias. I’ve written previously about how the weather predicts Citi Bike ridership, and not surprisingly there are fewer riders when it rains. Maybe the folks inclined to ride bikes when it’s raining are more confident cyclists, who also pedal faster when the weather is nice. It’s also possible that rainy-day cyclists are particularly motivated to pedal faster so they can get out of the rain. I don’t know if these are really the causes, but they at least sound believable, and would explain the observed phenomenon.

When the President is in town, take a bike

June 8, 2016 was a particularly good day for Citi Bikes compared to taxis. President Obama came to town that afternoon, and brought the requisite street closures with him. I poked around a bit looking for the routes that appeared to be the most impacted by the President’s visit, and came to afternoon trips from Union Square to Murray Hill. On a typical weekday afternoon, taxis beat Citi Bikes 57% of the time from Union Square to Murray Hill, but on June 8, Citi Bikes won 90% of the time. An even more dramatic way to see the Obama effect is to look at daily median travel times:

A typical afternoon taxi takes 8 minutes, but on June 8, the median was over 21 minutes. The Citi Bike median travel time is almost always 9 minutes, including during President Obama’s visit.

The same graph shows a similar phenomenon on September 19, 2016, when the annual United Nations General Assembly shut down large swathes of Manhattan’s east side, including Murray Hill. Although the impact was not as severe as during President Obama’s visit, the taxi median time doubled on September 19, while the Citi Bike median time again remained unchanged.

The morning of June 15, 2016 offers another example, this time on the west side, when an overturned tractor trailer shut down the Lincoln Tunnel for nearly seven hours. Taxi trips from the Upper West Side to West Chelsea, which normally take 15 minutes, took over 35 minutes. Citi Bikes typically take 18 minutes along the same route, and June 15 was no exception. Taxis would normally expect to beat Citi Bikes 67% of the time on a weekday morning, but on June 15, Citi Bikes won over 92% of the time.

These are of course three hand-picked outliers, and it wouldn’t be entirely fair to extrapolate from them to say that Citi Bikes are always more resilient than taxis during extreme circumstances. The broader data shows, though, that taxis are more than twice as likely as Citi Bikes to have days when a route’s median time is at least 5 minutes slower than average, and more than 3.5 times as likely to be at least 10 minutes slower, so it really does seem that Citi Bikes are better at minimizing worst-case outcomes.

Why have taxis gotten slower since 2009?

The biggest slowdowns in taxi travel times happened in 2014 and 2015. The data and regression model have nothing to say about why taxis slowed down so much over that period, though it might be interesting to dig deeper into the data to see if there are specific regions where taxis have fared better or worse since 2009.

Uber usage took off in New York starting in 2014, reaching over 10,000 vehicles dispatched per week by the beginning of 2015. There are certainly people who blame Uber—and other ride-hailing apps like Lyft and Juno—for increasing traffic, but the city’s own 2016 traffic report did not blame Uber for increased congestion.

It’s undoubtedly very hard to do an accurate study measuring ride-hailing’s impact on traffic, and I’m especially wary of people on both sides who have strong interests in blaming or exonerating the ride-hailing companies. Nevertheless, if I had to guess the biggest reasons taxis got particularly slower in 2014 and 2015, I would start with the explosive growth of ride-hailing apps, since the timing looks to align, and the publicly available data shows that they account for tens of thousands of vehicles on the roads.

On the other hand, if ride-hailing were the biggest cause of increased congestion in 2014 and 2015, it doesn’t exactly make sense that taxi travel times have stabilized a bit in 2016 and 2017, because ride-hailing has continued to grow, and while taxi usage continues to shrink, the respective rates of growth and shrinkage are not very different in 2016–17 than they were in 2014–15. One explanation could be that starting in 2016 there was a reduction in other types of vehicles—traditional black cars, private vehicles, etc.—to offset ride-hailing growth, but I have not seen any data to support (or refute) that idea.

There are also those who blame bike lanes for worsening vehicle traffic. Again, different people have strong interests arguing both sides, but it seems like there are more data points arguing that bike lanes do not cause traffic (e.g. here, here, and here) than vice versa. I wasn’t able to find anything about the timing of NYC bike lane construction to see how closely it aligns with the 2014–15 taxi slowdown.

Lots of other factors could have contributed to worsening traffic: commuter-adjusted population growth, subway usage, decaying infrastructure, construction, and presidential residences are just a few that feel like they could be relevant. I don’t know the best way to account for all of them, but it does seem like if you want to get somewhere in New York quickly, it’s increasingly less likely that a car is your best option.

How representative are taxis and Citi Bikes of all cars and bikes?

I think it’s not a terrible assumption that taxis are representative of typical car traffic in New York. If anything, maybe taxis are faster than average cars, since taxi drivers are likely more experienced—and often more aggressive—than average drivers. On the other hand, taxi drivers seem anecdotally less likely to use a traffic-enabled GPS, which maybe hurts their travel times.

Citi Bikes are probably slower than privately-owned bikes. Citi Bikes are designed to be heavy and stable, which maybe makes them safer, but lowers their speeds. Plus, I’d guess that biking enthusiasts, who might be faster riders, are more likely to ride their own higher-performance bikes. Lastly, Citi Bike riders might have to spend extra time at the end of a trip looking for an available dock, whereas privately-owned bikes have more parking options.

Weighing up these factors, I would guess that if we somehow got the relevant data to analyze the broader question of all cars vs. all bikes, the results would tip a bit in favor of bikes compared to the results of the narrower taxi vs. Citi Bike analysis. It’s also worth noting that both taxis and Citi Bikes have additional time costs that aren’t accounted for in trip durations: you have to hail a taxi, and there might not be a Citi Bike station in the near vicinity of your origin or destination.

What are the implications of all this?

One thing to keep in mind is that even though the taxi and Citi Bike datasets are the most conveniently available for analysis, New Yorkers don’t limit their choices to cars and bikes. The subway, despite its poor reputation of late, carries millions of people every day, more than taxis, ride-hailing apps, and Citi Bikes combined, so it’s not like “car vs. bike” is always the most relevant question. There are also legitimate reasons to choose a car over a bike—or vice versa—that don’t depend strictly on expected travel time.

Bike usage in New York has increased dramatically over the past decade, probably in large part because people figured out on their own that biking is often the fastest option. Even with this growth, though, the data shows that a lot of people could still save precious time—and minimize their worst-case outcomes—if they switched from cars to bikes. To the extent the city can incentivize that, it strikes me as a good thing.

When L-mageddon comes, take a bike

For any readers who might be affected by the L train’s planned 2019 closure, if you only remember one thing from this post: Citi Bikes crush taxis when traveling from Williamsburg to just about anywhere in Manhattan during morning rush hour!

GitHub

The code for the taxi vs. Citi Bike analysis is available here as part of the nyc-taxi-data repo. Note that parts of the analysis also depend on loading the data from the nyc-citibike-data repo.

The data

Taxi trip data is available since January 2009, Citi Bike data since July 2013. I filtered each dataset to make the analysis closer to an apples-to-apples comparison—see the GitHub repo for a more complete description of the filtering—but in short:

  • Restrict both datasets to weekday trips only
  • Restrict Citi Bike dataset to subscribers only, i.e. no daily pass customers
  • Restrict taxi dataset to trips that started and ended in areas with Citi Bike stations, i.e. where taking a Citi Bike would have been a viable option

Starting in July 2016, perhaps owing to privacy concerns, the TLC stopped providing latitude and longitude coordinates for every taxi trip. Instead, the TLC now divides the city into 263 taxi zones (map), and provides the pickup and drop off zones for every trip. The analysis then makes the assumption that taxis and Citi Bikes have the same distribution of trips within a single zone; see GitHub for more.

80% of taxi trips start and end within zones that have Citi Bike stations, and the filtered dataset since July 2013 contains a total of 330 million taxi trips and 27 million Citi Bike trips. From July 1, 2016 to June 30, 2017—the most recent 12 month period of available data—the filtered dataset includes 68 million taxi trips and 9 million Citi Bike trips.

Methodology

I wrote a Monte Carlo simulation in R to calculate the probability that a Citi Bike would be faster than a taxi for a given route. Every trip is assigned to a bucket, where the buckets are picked so that trips within a single bucket are fairly comparable. The bucket definitions are flexible, and I ran many simulations with different bucket definitions, but one sensible choice might be to group trips by:

  1. Starting zone
  2. Ending zone
  3. Hour of day

For example, weekday trips from the West Village to Times Square between 9:00 AM and 10:00 AM would constitute one bucket. The simulation iterates over every bucket that contains at least 5 taxi and 5 Citi Bike trips, and for each bucket, it draws 10,000 random samples, with replacement, for each of taxi and Citi Bike trips. The bucket’s estimated probability that a taxi is faster than a Citi Bike, call it the “taxi win rate”, is the fraction of samples where the taxi duration is shorter than the Citi Bike duration. You can think of this as 10,000 individual head-to-head races, with each race pitting a single taxi trip against a single Citi Bike trip.
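To make the resampling step concrete, here is a rough R sketch of the head-to-head races for a single bucket. This is not the code from the nyc-taxi-data repo; the data frame bucket and its mode and duration columns are invented for illustration, with fake durations drawn from rnorm().

# Hypothetical bucket: one row per trip, with the trip's mode and its duration in seconds
bucket <- data.frame(
  mode = c(rep("taxi", 40), rep("citibike", 12)),
  duration = c(rnorm(40, mean = 540, sd = 120), rnorm(12, mean = 560, sd = 60))
)

n_races <- 10000
taxi_samples <- sample(bucket$duration[bucket$mode == "taxi"], n_races, replace = TRUE)
bike_samples <- sample(bucket$duration[bucket$mode == "citibike"], n_races, replace = TRUE)

# Fraction of simulated races in which the taxi finishes first: the bucket's "taxi win rate"
taxi_win_rate <- mean(taxi_samples < bike_samples)
taxi_win_rate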

Different bucketing and filtering schemes allow for different types of analysis. I ran simulations that bucketed by month to see how win rates have evolved over time, simulations that used only days where it rained, and others. There are undoubtedly more schemes to be considered, and the Monte Carlo methodology should be well equipped to handle them.


To leave a comment for the author, please follow the link and comment on their blog: Todd W. Schneider.

Data.Table by Example – Part 1

Tue, 09/26/2017 - 08:02

(This article was first published on R – Mathew Analytics, and kindly contributed to R-bloggers)

For many years, I actively avoided the data.table package and preferred to utilize the tools available in either base R or dplyr for data aggregation and exploration. However, over the past year, I have come to realize that this was a mistake. Data tables are incredible and provide R users with a syntactically concise and efficient data structure for working with small, medium, or large datasets. While the package is well documented, I wanted to put together a series of posts that could be useful for those who want to get introduced to the data.table package in a more task-oriented format.

For this series of posts, I will be working with data that comes from the Chicago Police Department’s Citizen Law Enforcement Analysis and Reporting system. This dataset contains information on reported incidents of crime that occurred in the city of Chicago from 2001 to present. You can use the wget command in the terminal to download it as a CSV file.

$ wget --no-check-certificate --progress=dot "https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD" -O rows.csv

The file will be saved in your working directory as rows.csv. We will import the data into R with the fread function and look at the first few rows and the structure of the data.

library(data.table)

dat = fread("rows.csv")
dat[1:3]
str(dat)

Notice that each of the string variables in the data set was imported as a character and not a factor. With base R functions like read.csv, we have to set the stringsAsFactors argument to FALSE if we want this result.
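For comparison, a rough base R equivalent (much slower than fread on a file this size) would be:

# stringsAsFactors = FALSE keeps the string columns as characters in base R
dat_base = read.csv("rows.csv", stringsAsFactors = FALSE)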

The column names contain spaces, so we replace them with underscores to make the columns easier to reference in later steps.

names(dat) <- gsub(" ", "_", names(dat))

Let’s say that we want to see the frequency distribution of several of these variables. This can be done by using .N in conjunction with the by operator.

dat[, .N, by=.(Arrest)]

In the code below, you can also see how to chain operations together. We started by finding the count of each response in the variable, ordered the counts in descending order, and then selected only those which occurred at least 200,000 times.

dat[, .N, by=.(Primary_Type)][order(-N)][N>=200000]
dat[, .N, by=.(Location_Description)][order(-N)][N>=200000]

Let’s say that we want to get a count of prostitution incidents by month. To get the desired results, we will need to modify the date values, filter instances in which the primary type was “prostitution”, and then get a count by each date.

dat[, date2 := paste(substr(as.Date(dat$Date, format="%m/%d/%Y"),1,7), "01", sep="-")][ Primary_Type=="PROSTITUTION", .N, by=.(date2)][, date2 := as.Date(date2)][order(date2)][]

If you want to plot the results as a line graph, just add another chain which executes the visualization.

library(magrittr)  # provides the %>% pipe used below
library(ggplot2)

dat[, date2 := paste(substr(as.Date(dat$Date, format="%m/%d/%Y"),1,7), "01", sep="-")][ Primary_Type=="PROSTITUTION", .N, by=.(date2)][, date2 := as.Date(date2)][order(date2)] %>% ggplot(., aes(date2, N)) + geom_line(group=1)

I’ve obviously skipped over a lot and some of the code presented here is more verbose than needed. Even so, beginners to R will hopefully find this useful and it will pique your interest in the beauty of the data table package. Future posts will cover more of the goodies available in the data.table package such as get(), set(), {}, and so forth.

PS: As of this past week, I’ve decided to re-enter the job market. So if you are looking for a statistical analyst or data scientist with strong experience in programming in R and time series econometrics, please contact me at mathewanalytics@gmail.com or reach me through LinkedIn


To leave a comment for the author, please follow the link and comment on their blog: R – Mathew Analytics.

How to Create an Interactive Infographic

Tue, 09/26/2017 - 07:53

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

An interactive infographic can be used to communicate a lot of information in an engaging way. With the right tools, they are also relatively straightforward to create. In this post, I show step-by-step how to create this interactive infographic, using Canva, Displayr and R code. The interactive example is designed so that the user can change the country and have the infographic update automatically.

Tools used to create an interactive infographic: Canva is used to create the base infographic. The calculations, charting, and automatic text-writing are performed using the R language. It is all hooked up with Displayr.

Step 1: Create or download the infographic

I start by going to Canva, and choosing the Neat Interactive Gaming Infographic (tip: use Find Templates on the left-hand panel). You could, of course, design your own infographic, either in Canva or elsewhere. I like Canva, but the key thing is to create an infographic image one way or another. In Canva, I edited the template by deleting the bits that I wanted to replace with interactive charts and visualizations, and then I downloaded the infographic as a PNG file (2,000 pixels high by 800 wide).

 

      

Step 2: Import the infographic into Displayr

Create an account in Displayr, and then click the button that says + New Document. Set the page size to the same aspect ratio as the PNG file (Home > Page Layout > Layout > Page Size > Custom).  For this example, the page should be 20 inches high and 8 inches wide.

Next, insert the infographic into Displayr (Insert > Image), move and resize it to fit the page (tip: you can use the Properties panel on the right to type in pixels 800 x 2000 to reset the correct aspect ratio of the image).

 

Step 3: Get the data into Displayr

The data that will make the infographic interactive needs to be hooked up in Displayr. The data used to create the infographic in this example is shown to the right. There are lots of ways to import data into Displayr (e.g., importing a raw data file and creating the tables in Displayr). For this example, the data has been pasted into Displayr from Excel using the steps below.

To paste the data into Displayr:

  • Insert > Paste Table (Data), click Add data (on the right of the screen).
  • Paste in the data. Alternatively, you could type it into this screen. I first pasted in just the Age Distribution data and pressed OK. 
  • Properties > GENERAL and type AgeDistribution into the Label field and check the Automatic option (above).
  • Drag the table so that it is to the left of the infographic.
  • Hide the table (select it, Appearance > Hide). It will remain visible while you are editing, but will be invisible when you share the infographic.

Repeat this process for AverageAge, Ratio, and Multiplayer. It is important that you give each of these tables these names, as we refer to them later in our R code.

Step 4: Add the country selector

Next, I add a control so that the user can change country:

  • Insert > Control
  • Item list: China, US, Europe
  • Move it to the top of the screen and style as desired (font size, color, border)
  • Name: Click on the control and select China
  • Properties > GENERAL > Name: Country

I then inserted a text box (“GAMERS”) and placed it to the left of the control (font: Impact, size: 48, color: #ffb600).

Step 5: Create the charts and visualizations in R

Finally, create the charts and visualizations in Displayr using the following R code.

The column chart

I created the column chart using my colleague Carmen’s nifty wrapper function for plotly. Insert an R Output in Displayr (Insert > R Output), copy and paste the following code, press Calculate, and then move and resize the chart.

flipStandardCharts::Chart(AgeDistribution[, Country], type = "Column", background.fill.color = "#212121", charting.area.fill.color = "#212121", colors = "tiel", x.tick.font.color = "white", x.tick.font.size = 20, x.grid.width = 0, y.tick.font.color = "white", y.tick.font.size = 20, y.title = "%", y.title.font.color = "white", y.grid.width = 0)

Average age

The average age was also created by inserting an R Output, using the code below. While I could have written the formatting in R, I instead used the various formatting tools built into Displayr (Properties > LAYOUT and Properties > APPEARANCE).

AverageAge[, Country]

The hearts pictograph

This was also done using an R Output in Displayr, with the following code (using R GitHub packages built by my colleagues Kyle and Carmen).

women = Ratio["Women", Country] total = sum(Ratio[, Country]) flipPictographs::SinglePicto(women, total, layout = "Number of columns", number.cols = 5, image = "Heart", hide.base.image = FALSE, auto.size = TRUE, fill.icon.color = "red", base.icon.color = "cyan", background.color ="#212121" )

The R code used to create the textbox is below (tip: toggle on Wrap text output at the bottom of the Properties panel on the right)

women = Ratio["Women", Country]
total = sum(Ratio[, Country])
paste0(women, " IN ", total, " GAMERS ARE WOMEN")

The pie chart

This is the R code for the pie chart:

flipStandardCharts::Chart(Multiplayer[, Country], type = "Pie", colors = c(rgb(175/255, 224/255, 170/255), rgb(0/255, 181/255, 180/255)), data.label.font.color = "White", data.label.font.size = 18, data.label.decimals = 0, data.label.suffix = "%")

The R code used to create the textbox was:

one.player = Multiplayer["1 player", Country]
paste0(one.player, "% OF PEOPLE PLAY VIDEO GAMES ON THEIR OWN")

Create the interactive infographic yourself

Sign in to Displayr and edit the document used to create the interactive infographic here. In Edit mode you can click on each of the charts, pictographs, and text boxes to see the underlying code. The final document was published (i.e., turned into a Dashboard) using Export > Web Page. Or you can view the interactive infographic created using the instructions above.


To leave a comment for the author, please follow the link and comment on their blog: R – Displayr.

Regular Expression Searching within Shiny Selectize Objects

Tue, 09/26/2017 - 06:00

(This article was first published on Jonathan Sidi's R Blog, and kindly contributed to R-bloggers)

regexSelect is a small package that uses Shiny modules to solve a problem in Shiny selectize objects – regular expression (regex) searching. You can quickly filter the values in the selectize object, while being able to add that new regex query to the selectize list.

This is great for long lists, since you can return multiple items simultaneously without needing to endlessly click items in a list!

Install

install.packages('regexSelect')
# devtools::install_github('yonicd/regexSelect')

Below are two examples of using regular expressions to quickly populate multiple items in a ggplot and a datatable.

regexSelect with Plots

The shiny module works with two main functions:

# server side:
callModule(module = regexSelect, id = 'myId', reactive(<selectizeInput Choices>))

# ui side:
regexSelectUI(id = "myId", label = 'myLabel', choices = <selectizeInput Choices>)

regexSelect comes with controls placed in a checkbox group below the selectize object. When calling regexSelect you can show or hide the checkbox controls (hidden by default), so as to make it look like a normal selectize object and save valuable UI real estate.

While the Shiny app is running, regexSelect properties can be manipulated through the checkbox controls, giving greater flexibility to:

  • Toggle regexSelect to work as a standard selectize object.
  • Retain the regex search as a new value in the selectize object.
  • Change arguments that are passed to base::grep : ignore.case, perl, fixed, invert.
library(shiny)
library(ggplot2)
library(regexSelect)

ui <- fluidPage(
  selectInput('var', 'Choose Variable',
              choices = names(diamonds)[sapply(diamonds, function(x) inherits(x, c('character','factor')))],
              selected = 'clarity'),
  uiOutput('regexchoose'),
  plotOutput("data")
)

server <- function(input, output, session) {
  output$regexchoose <- shiny::renderUI({
    regexSelectUI(id = "a", label = input$var,
                  choices = unique(diamonds[[input$var]]),
                  checkbox.show = TRUE)
  })
  observeEvent(input$var, {
    curr_cols <- callModule(module = regexSelect, id = "a",
                            shiny::reactive(unique(diamonds[[input$var]])))
    observeEvent(curr_cols(), {
      cols_now <- curr_cols()
      output$data <- shiny::renderPlot({
        ggplot(diamonds[diamonds[[input$var]] %in% cols_now, ],
               aes_string(x = 'table', y = 'carat', colour = input$var)) +
          geom_point()
      })
    })
  })
}

shinyApp(ui, server)

regexSelect with Tables

ui <- shiny::fluidPage(
  regexSelectUI(id = "a", label = "Variable:", choices = names(iris)),
  shiny::tableOutput("data")
)

server <- function(input, output, session) {
  curr_cols <- callModule(module = regexSelect, id = "a",
                          shiny::reactive(names(iris)))
  observeEvent(curr_cols(), {
    cols_now <- curr_cols()
    if (length(cols_now) == 0) cols_now <- names(iris)
    output$data <- shiny::renderTable({
      iris[, cols_now, drop = FALSE]
    }, rownames = TRUE)
  })
}

shiny::shinyApp(ui, server)

To leave a comment for the author, please follow the link and comment on their blog: Jonathan Sidi's R Blog.

News Roundup from Microsoft Ignite

Tue, 09/26/2017 - 01:40

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

It's been a big day for the team here at Microsoft, with a flurry of announcements from the Ignite conference in Orlando. We'll provide more in-depth details in the coming days and weeks, but for now here's a brief roundup of the news related to data science:

Microsoft ML Server 9.2 is now available. This is the new name for what used to be called Microsoft R Server, and it now also includes support for operationalizing the Python language as well as R. This 2-minute video explains how to deploy models with ML Server (and with this release, real-time scoring is now also supported on Linux). Microsoft R Client 3.4.1, featuring R 3.4.1, was also released today and provides desktop capabilities for R developers with the ability to deploy computations to a production ML Server.

The next generation of Azure Machine Learning is now available. This new service includes the AML Workbench, a cross-platform client for AI-powered data wrangling and experiment management; the AML Experimentation service to help data scientists increase their rate of experimentation with big data and GPUs, and the AML Model Management service to host, version, manage and monitor machine learning models. This TechCrunch article provides an overview.

SQL Server 2017 is now generally available, bringing support for in-database R and Python to both Windows and Linux.

Azure SQL Database now supports real-time scoring for R and Python, and a preview of general in-database R services is now available as well.

Microsoft Cognitive Services offers new intelligent APIs for developing AI applications, including general availability of the text analytics API for topic extraction, language detection and sentiment analysis.

Visual Studio Code for AI, an extension for the popular open-source code editor, provides new interfaces for developing with Tensorflow, Cognitive Toolkit (CNTK), Azure ML and more.

Finally, Microsoft announced a new high-level language for quantum computing to be integrated with Visual Studio, and a simulator for quantum computers up to 32 qubits. This 2-minute video provides an overview of Microsoft's vision for Quantum Computing.   

 


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Custom Level Coding in vtreat

Mon, 09/25/2017 - 19:50

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

One of the services that the R package vtreat provides is level coding (what we sometimes call impact coding): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA "one-hot encoding"). Level coding can be computationally and statistically preferable to one-hot encoding for variables that have an extremely large number of possible levels.



Level coding is like measurement: it summarizes categories of individuals into useful numbers. Source: USGS

By default, vtreat level codes to the difference between the conditional means and the grand mean (catN variables) when the outcome is numeric, and to the difference between the conditional log-likelihood and global log-likelihood of the target class (catB variables) when the outcome is categorical. These aren’t the only possible level codings. For example, the ranger package can encode categorical variables as ordinals, sorted by the conditional expectations/means. While this is not a completely faithful encoding for all possible models (it is not completely faithful for linear or logistic regression, for example), it is often invertible for tree-based methods, and has the advantage of keeping the original levels distinct, which impact coding may not. That is, two levels with the same conditional expectation would be conflated by vtreat‘s coding. This often isn’t a problem — but sometimes, it may be.
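To make the default catN-style coding concrete, here is a minimal hand-rolled sketch on a toy data frame. The columns x and y are invented for illustration, and vtreat's real implementation also handles cross-validation and significance estimates, so this only shows the core idea.

# A toy categorical variable x and numeric outcome y
d <- data.frame(x = c("a", "a", "b", "b", "b", "c"),
                y = c(1, 3, 10, 14, 12, 7))

grand_mean <- mean(d$y)

# catN-style impact code: conditional mean of y per level, minus the grand mean
impact <- aggregate(y ~ x, data = d, FUN = function(v) mean(v) - grand_mean)
names(impact) <- c("x", "impact_code")
impact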

So the data scientist may want to use a level coding different from what vtreat defaults to. In this article, we will demonstrate how to implement custom level encoders in vtreat. We assume you are familiar with the basics of vtreat: the types of derived variables, how to create and apply a treatment plan, etc.

For our example, we will implement level coders based on partial pooling, or hierarchical/multilevel models (Gelman and Hill, 2007). We’ll leave the details of how partial pooling works to a subsequent article; for now, just think of it as a score that shrinks the estimate of the conditional mean to be closer to the unconditioned mean, and hence possibly closer to the unknown true values, when there are too few measurements to make an accurate estimate.
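As a rough intuition for the shrinkage (the follow-up article and Gelman and Hill cover the real estimator), the partially pooled estimate for a level behaves approximately like a weighted average of that level's mean and the grand mean, with a weight that grows with the number of observations in the level. Here is a toy sketch of that approximation; all numbers are invented.

# Approximate partial-pooling weight: n_j / (n_j + sigma2_y / sigma2_group),
# following the approximation in Gelman and Hill (2007)
partial_pool <- function(group_mean, grand_mean, n, sigma2_y, sigma2_group) {
  lambda <- n / (n + sigma2_y / sigma2_group)
  lambda * group_mean + (1 - lambda) * grand_mean
}

# A level with only 2 observations is pulled strongly toward the grand mean ...
partial_pool(group_mean = 2.5, grand_mean = 1.2, n = 2,  sigma2_y = 0.6, sigma2_group = 0.1)
# ... while a level with 50 observations barely moves
partial_pool(group_mean = 2.5, grand_mean = 1.2, n = 50, sigma2_y = 0.6, sigma2_group = 0.1)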

We’ll implement our partial pooling encoders using the lmer() (multilevel linear regression) and glmer() (multilevel generalized linear regression) functions from the lme4 package. For our example data, we’ll use radon levels by county for the state of Minnesota (Gelman and Hill, 2007; you can find the original data here).

The Data: Radon levels in Minnesota

library("vtreat")
library("lme4")
library("dplyr")
library("tidyr")
library("ggplot2")

# example data
srrs = read.table("srrs2.dat", header=TRUE, sep=",", stringsAsFactor=FALSE)

# target: log of radon activity (activity)
# grouping variable: county
radonMN = filter(srrs, state=="MN") %>%
  select("county", "activity") %>%
  filter(activity > 0) %>%
  mutate(activity = log(activity),
         county = base::trimws(county)) %>%
  mutate(critical = activity > 1.5)

str(radonMN)
## 'data.frame': 916 obs. of 3 variables:
##  $ county  : chr "AITKIN" "AITKIN" "AITKIN" "AITKIN" ...
##  $ activity: num 0.788 0.788 1.065 0 1.131 ...
##  $ critical: logi FALSE FALSE FALSE FALSE FALSE FALSE ...

For this example we have three columns of interest:

  • county: 85 possible values
  • activity: the log of the radon reading (numerical outcome)
  • critical: TRUE when activity > 1.5 (categorical outcome)

The goal is to level code county for either the regression problem (predict the log radon reading) or the categorization problem (predict whether the radon level is "critical").

As the graph shows, the conditional mean of log radon activity by county ranges from nearly zero to about 3, and the conditional expectation of a critical reading ranges from zero to one. On the other hand, the number of readings per county is quite low for many counties — only one or two — though some counties have a large number of readings. That means some of the conditional expectations are quite uncertain.

Implementing Level Coders for Partial Pooling

Let’s implement level coders that use partial pooling to compute the level score.

Regression

For regression problems, the custom coder should be a function that takes as input:

  • v: a string with the name of the categorical variable
  • vcol: the actual categorical column (assumed character)
  • y: the numerical outcome column
  • weights: a column of row weights

The function should return a column of scores (the level codings). For our example, the function builds a lmer model to predict y as a function of vcol, then returns the predictions on the training data.

# @param v character variable name
# @param vcol character, independent or input variable
# @param y numeric, dependent or outcome variable to predict
# @param weights row/example weights
# @return scored training data column
ppCoderN <- function(v, vcol, y, weights) {
  # regression case y ~ vcol
  d <- data.frame(x = vcol,
                  y = y,
                  stringsAsFactors = FALSE)
  m <- lmer(y ~ (1 | x), data = d, weights = weights)
  predict(m, newdata = d)
}

Categorization

For categorization problems, the function should assume that y is a logical column, where TRUE is assumed to be the target outcome. This is because vtreat converts the outcome column to a logical while creating the treatment plan.

# @param v character variable name
# @param vcol character, independent or input variable
# @param y logical, dependent or outcome variable to predict
# @param weights row/example weights
# @return scored training data column
ppCoderC <- function(v, vcol, y, weights) {
  # classification case y ~ vcol
  d <- data.frame(x = vcol,
                  y = y,
                  stringsAsFactors = FALSE)
  m = glmer(y ~ (1 | x), data = d, weights = weights, family = binomial)
  predict(m, newdata = d, type = 'link')
}

You can then pass the functions in as a named list into either designTreatmentsX or mkCrossFrameXExperiment to build the treatment plan. The format of the key is [n|c].levelName[.option]*.

The prefix picks the model type: numeric or regression coders start with ‘n.’ and categorical coders start with ‘c.’. Currently, the only supported option is ‘center’, which directs vtreat to center the codes with respect to the estimated grand mean. The catN and catB level codings are centered in this way.

Our example coders can be passed in as shown below.

customCoders = list('n.poolN.center' = ppCoderN,
                    'c.poolC.center' = ppCoderC)

Using the Custom Coders

Let’s build a treatment plan for the regression problem.

# I only want to create the cleaned numeric variables, the isBAD variables,
# and the level codings (not the indicator variables or catP, etc.)
vartypes_I_want = c('clean', 'isBAD', 'catN', 'poolN')

treatplanN = designTreatmentsN(radonMN,
                               varlist = c('county'),
                               outcomename = 'activity',
                               codeRestriction = vartypes_I_want,
                               customCoders = customCoders,
                               verbose = FALSE)

scoreFrame = treatplanN$scoreFrame
scoreFrame %>% select(varName, sig, origName, code)
##        varName          sig origName  code
## 1 county_poolN 1.343072e-16   county poolN
## 2  county_catN 2.050811e-16   county  catN

Note that the treatment plan returned both the catN variable (default level encoding) and the pooled level encoding (poolN). You can restrict to just using one coding or the other using the codeRestriction argument either during treatment plan creation, or in prepare().

Let’s compare the two level encodings.

# create a frame with one row for every county
measframe = data.frame(county = unique(radonMN$county),
                       stringsAsFactors = FALSE)

outframe = prepare(treatplanN, measframe)

# If we wanted only the new pooled level coding
# (plus any numeric/isBAD variables), we would
# use a codeRestriction:
#
# outframe = prepare(treatplanN,
#                    measframe,
#                    codeRestriction = c('clean', 'isBAD', 'poolN'))

gather(outframe, key=scoreType, value=score, county_poolN, county_catN) %>%
  ggplot(aes(x=score)) +
  geom_density(adjust=0.5) +
  geom_rug(sides="b") +
  facet_wrap(~scoreType, ncol=1, scale="free_y") +
  ggtitle("Distribution of scores")

Notice that the poolN scores are "tucked in" compared to the catN encoding. In a later article, we’ll show that the counties with the most tucking in (or shrinkage) tend to be those with fewer measurements.

We can also code for the categorical problem.

# For categorical problems, coding is catB
vartypes_I_want = c('clean', 'isBAD', 'catB', 'poolC')

treatplanC = designTreatmentsC(radonMN,
                               varlist = c('county'),
                               outcomename = 'critical',
                               outcometarget = TRUE,
                               codeRestriction = vartypes_I_want,
                               customCoders = customCoders,
                               verbose = FALSE)

outframe = prepare(treatplanC, measframe)

gather(outframe, key=scoreType, value=linkscore, county_poolC, county_catB) %>%
  ggplot(aes(x=linkscore)) +
  geom_density(adjust=0.5) +
  geom_rug(sides="b") +
  facet_wrap(~scoreType, ncol=1, scale="free_y") +
  ggtitle("Distribution of link scores")

Notice that the poolC link scores are even more tucked in compared to the catB link scores, and that the catB scores are multimodal. The smaller link scores mean that the pooled model avoids estimates of conditional expectation close to either zero or one, because, again, these estimates come from counties with few readings. Multimodal summaries can be evidence of modeling flaws, including omitted variables and un-modeled mixing of different example classes. Hence, we do not want our inference procedure to suggest such structure until there is a lot of evidence for it. And, as is common in machine learning, there are advantages to lower-variance estimators when they do not cost much in terms of bias.

Other Considerations

For this example, we used the lme4 package to create custom level codings. Once calculated, vtreat stores the coding as a lookup table in the treatment plan. This means lme4 is not needed to prepare new data. In general, using a treatment plan is not dependent on any special packages that might have been used to create it, so it can be shared with other users with no extra dependencies.
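As a small illustration of that portability, using the objects created above, the treatment plan can be serialized and then applied in a session where lme4 is not even installed, since prepare() only needs the stored lookup tables. The file name here is just an example.

# Save the treatment plan; the partial-pooling codes are stored as lookup tables
saveRDS(treatplanN, "treatplanN.RDS")

# Later, possibly on another machine where only vtreat is available:
treatplanN2 <- readRDS("treatplanN.RDS")
scored <- prepare(treatplanN2, measframe)
head(scored)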

When using mkCrossFrameXExperiment, note that the resulting cross frame will have a slightly different distribution of scores than what the treatment plan produces. This is true even for catB and catN variables. This is because the treatment plan is built using all the data, while the cross frame is built using n-fold cross validation on the data. See the cross frame vignette for more details.

Thanks to Geoffrey Simmons, Principal Data Scientist at Echo Global Logistics, for suggesting partial pooling based level coding (and testing it for us!), introducing us to the references, and reviewing our articles.

In a follow-up article, we will go into partial pooling in more detail, and motivate why you might sometimes prefer it to vtreat‘s default coding.

References

Gelman, Andrew and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

Survival Analysis with R

Mon, 09/25/2017 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

With roots dating back to at least 1662 when John Graunt, a London merchant, published an extensive set of inferences based on mortality records, survival analysis is one of the oldest subfields of Statistics [1]. Basic life-table methods, including techniques for dealing with censored data, were discovered before 1700 [2], and in the early eighteenth century, the old masters – de Moivre working on annuities, and Daniel Bernoulli studying competing risks for the analysis of smallpox inoculation – developed the modern foundations of the field [2]. Today, survival analysis models are important in Engineering, Insurance, Marketing, Medicine, and many more application areas. So, it is not surprising that R should be rich in survival analysis functions. CRAN’s Survival Analysis Task View, a curated list of the best relevant R survival analysis packages and functions, is indeed formidable. We all owe a great deal of gratitude to Arthur Allignol and Aurielien Latouche, the task view maintainers.


Looking at the Task View on a small screen, however, is a bit like standing too close to a brick wall – left-right, up-down, bricks all around. It is a fantastic edifice that gives some idea of the significant contributions R developers have made both to the theory and practice of Survival Analysis. As well-organized as it is, however, I imagine that even survival analysis experts need some time to find their way around this task view. Newcomers – people either new to R or new to survival analysis or both – must find it overwhelming. So, it is with newcomers in mind that I offer the following narrow trajectory through the task view that relies on just a few packages: survival, ggplot2, ggfortify, and ranger.

The survival package is the cornerstone of the entire R survival analysis edifice. Not only is the package itself rich in features, but the object created by the Surv() function, which contains failure time and censoring information, is the basic survival analysis data structure in R. Dr. Terry Therneau, the package author, began working on the survival package in 1986. The first public release, in late 1989, used the Statlib service hosted by Carnegie Mellon University. Thereafter, the package was incorporated directly into Splus, and subsequently into R.

ggfortify enables producing handsome, one-line survival plots with ggplot2::autoplot.

ranger might be the surprise in my very short list of survival packages. The ranger() function is well-known for being a fast implementation of the Random Forests algorithm for building ensembles of classification and regression trees. But ranger() also works with survival data. Benchmarks indicate that ranger() is suitable for building time-to-event models with the large, high-dimensional data sets important to internet marketing applications. Since ranger() uses standard Surv() survival objects, it’s an ideal tool for getting acquainted with survival analysis in this machine-learning age.

Load the data

This first block of code loads the required packages, along with the veteran dataset from the survival package that contains data from a two-treatment, randomized trial for lung cancer.

library(survival)
library(ranger)
library(ggplot2)
library(dplyr)
library(ggfortify)

#------------
data(veteran)
head(veteran)
##   trt celltype time status karno diagtime age prior
## 1   1 squamous   72      1    60        7  69     0
## 2   1 squamous  411      1    70        5  64    10
## 3   1 squamous  228      1    60        3  38     0
## 4   1 squamous  126      1    60        9  63    10
## 5   1 squamous  118      1    70       11  65    10
## 6   1 squamous   10      1    20        5  49     0

The variables in veteran are:

  • trt: 1=standard 2=test
  • celltype: 1=squamous, 2=small cell, 3=adeno, 4=large
  • time: survival time in days
  • status: censoring status
  • karno: Karnofsky performance score (100=good)
  • diagtime: months from diagnosis to randomization
  • age: in years
  • prior: prior therapy 0=no, 10=yes

Kaplan Meier Analysis

The first thing to do is to use Surv() to build the standard survival object. The variable time records survival time; status indicates whether the patient’s death was observed (status = 1) or that survival time was censored (status = 0). Note that a “+” after the time in the print out of km indicates censoring.

# Kaplan Meier Survival Curve
km <- with(veteran, Surv(time, status))
head(km, 80)
##  [1]  72  411  228  126  118   10   82  110  314  100+  42    8  144   25+
## [15]  11   30  384    4   54   13  123+  97+ 153   59  117   16  151   22
## [29]  56   21   18  139   20   31   52  287   18   51  122   27   54    7
## [43]  63  392   10    8   92   35  117  132   12  162    3   95  177  162
## [57] 216  553  278   12  260  200  156  182+ 143  105  103  250  100  999
## [71] 112   87+ 231+ 242  991  111    1  587  389   33

To begin our analysis, we use the formula Surv(time, status) ~ 1 and the survfit() function to produce the Kaplan-Meier estimates of the probability of survival over time. The times parameter of the summary() function gives some control over which times to print. Here, it is set to print the estimates for 1, 30, 60 and 90 days, and then every 90 days thereafter. This is the simplest possible model. It only takes three lines of R code to fit it, and produce numerical and graphical summaries.

km_fit <- survfit(Surv(time, status) ~ 1, data=veteran)
summary(km_fit, times = c(1,30,60,90*(1:10)))
## Call: survfit(formula = Surv(time, status) ~ 1, data = veteran)
##
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1    137       2    0.985  0.0102      0.96552       1.0000
##    30     97      39    0.700  0.0392      0.62774       0.7816
##    60     73      22    0.538  0.0427      0.46070       0.6288
##    90     62      10    0.464  0.0428      0.38731       0.5560
##   180     27      30    0.222  0.0369      0.16066       0.3079
##   270     16       9    0.144  0.0319      0.09338       0.2223
##   360     10       6    0.090  0.0265      0.05061       0.1602
##   450      5       5    0.045  0.0194      0.01931       0.1049
##   540      4       1    0.036  0.0175      0.01389       0.0934
##   630      2       2    0.018  0.0126      0.00459       0.0707
##   720      2       0    0.018  0.0126      0.00459       0.0707
##   810      2       0    0.018  0.0126      0.00459       0.0707
##   900      2       0    0.018  0.0126      0.00459       0.0707

# plot(km_fit, xlab="Days", main = 'Kaplan Meyer Plot')  # base graphics is always ready
autoplot(km_fit)

Next, we look at survival curves by treatment.

km_trt_fit <- survfit(Surv(time, status) ~ trt, data=veteran)
autoplot(km_trt_fit)

And, to show one more small exploratory plot, I’ll do just a little data munging to look at survival by age. First, I create a new data frame with a categorical variable AG that has values LT60 and OV60, which respectively describe veterans younger and older than sixty. While I am at it, I make trt and prior into factor variables. But note, survfit() and npsurv() worked just fine without this refinement.

vet <- mutate(veteran,
              AG = ifelse((age < 60), "LT60", "OV60"),
              AG = factor(AG),
              trt = factor(trt, labels=c("standard","test")),
              prior = factor(prior, labels=c("N0","Yes")))

km_AG_fit <- survfit(Surv(time, status) ~ AG, data=vet)
autoplot(km_AG_fit)

Although the two curves appear to overlap in the first fifty days, younger patients clearly have a better chance of surviving more than a year.

Cox Proportional Hazards Model

Next, I’ll fit a Cox Proportional Hazards Model that makes use of all of the covariates in the data set.

# Fit Cox Model
cox <- coxph(Surv(time, status) ~ trt + celltype + karno + diagtime + age + prior,
             data = vet)
summary(cox)
## Call:
## coxph(formula = Surv(time, status) ~ trt + celltype + karno +
##     diagtime + age + prior, data = vet)
##
##   n= 137, number of events= 128
##
##                        coef  exp(coef)   se(coef)      z Pr(>|z|)
## trttest           2.946e-01  1.343e+00  2.075e-01  1.419  0.15577
## celltypesmallcell 8.616e-01  2.367e+00  2.753e-01  3.130  0.00175 **
## celltypeadeno     1.196e+00  3.307e+00  3.009e-01  3.975 7.05e-05 ***
## celltypelarge     4.013e-01  1.494e+00  2.827e-01  1.420  0.15574
## karno            -3.282e-02  9.677e-01  5.508e-03 -5.958 2.55e-09 ***
## diagtime          8.132e-05  1.000e+00  9.136e-03  0.009  0.99290
## age              -8.706e-03  9.913e-01  9.300e-03 -0.936  0.34920
## priorYes          7.159e-02  1.074e+00  2.323e-01  0.308  0.75794
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##                   exp(coef) exp(-coef) lower .95 upper .95
## trttest              1.3426     0.7448    0.8939    2.0166
## celltypesmallcell    2.3669     0.4225    1.3799    4.0597
## celltypeadeno        3.3071     0.3024    1.8336    5.9647
## celltypelarge        1.4938     0.6695    0.8583    2.5996
## karno                0.9677     1.0334    0.9573    0.9782
## diagtime             1.0001     0.9999    0.9823    1.0182
## age                  0.9913     1.0087    0.9734    1.0096
## priorYes             1.0742     0.9309    0.6813    1.6937
##
## Concordance= 0.736 (se = 0.03 )
## Rsquare= 0.364 (max possible= 0.999 )
## Likelihood ratio test= 62.1 on 8 df, p=1.799e-10
## Wald test = 62.37 on 8 df, p=1.596e-10
## Score (logrank) test = 66.74 on 8 df, p=2.186e-11

cox_fit <- survfit(cox)
# plot(cox_fit, main = "cph model", xlab="Days")
autoplot(cox_fit)

Note that the model flags small cell type, adeno cell type and karno as significant. However, some caution needs to be exercised in interpreting these results. While the Cox Proportional Hazards model is thought to be “robust”, a careful analysis would check the assumptions underlying the model. For example, the Cox model assumes that the covariates do not vary with time (the proportional hazards assumption). In a vignette [12] that accompanies the survival package, Therneau, Crowson and Atkinson demonstrate that the Karnofsky score (karno) is, in fact, time-dependent, so the assumptions for the Cox model are not met. The vignette authors go on to present a strategy for dealing with time-dependent covariates.
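One standard way to check the proportional hazards assumption is the cox.zph() function in the survival package, which tests each covariate for a time-dependent effect. A quick sketch using the cox fit from above; a small p-value (as we would expect for karno) is evidence against proportional hazards.

# Test the proportional hazards assumption for each covariate in the Cox model
cz <- cox.zph(cox)
print(cz)

# Plot the scaled Schoenfeld residuals against time for a visual check
plot(cz)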

Data scientists who are accustomed to computing ROC curves to assess model performance should be interested in the Concordance statistic. The documentation for the survConcordance() function in the survival package defines concordance as “the probability of agreement for any two randomly chosen observations, where in this case agreement means that the observation with the shorter survival time of the two also has the larger risk score. The predictor (or risk score) will often be the result of a Cox model or other regression” and notes that: “For continuous covariates concordance is equivalent to Kendall’s tau, and for logistic regression it is equivalent to the area under the ROC curve.”
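As a rough aside (not part of the original post), the concordance reported in the Cox summary above can be computed directly from the fitted risk scores; something along these lines should agree with the 0.736 shown by summary(cox).

# A sketch, not from the original post: concordance between survival times
# and the Cox risk scores, computed directly with survConcordance().
survConcordance(Surv(time, status) ~ predict(cox), data = vet)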

To demonstrate using the survival package, along with ggplot2 and ggfortify, I’ll fit Aalen’s additive regression model for censored data to the veteran data. The documentation states: “The Aalen model assumes that the cumulative hazard H(t) for a subject can be expressed as a(t) + X B(t), where a(t) is a time-dependent intercept term, X is the vector of covariates for the subject (possibly time-dependent), and B(t) is a time-dependent matrix of coefficients.”

The plots show how the effects of the covariates change over time. Notice the steep slope and then abrupt change in slope of karno.

aa_fit <- aareg(Surv(time, status) ~ trt + celltype + karno + diagtime + age + prior,
                data = vet)
aa_fit
## Call:
## aareg(formula = Surv(time, status) ~ trt + celltype + karno +
##     diagtime + age + prior, data = vet)
##
##   n= 137
##     75 out of 97 unique event times used
##
##                        slope      coef  se(coef)      z        p
## Intercept           0.083400  3.81e-02  1.09e-02  3.490 4.79e-04
## trttest             0.006730  2.49e-03  2.58e-03  0.967 3.34e-01
## celltypesmallcell   0.015000  7.30e-03  3.38e-03  2.160 3.09e-02
## celltypeadeno       0.018400  1.03e-02  4.20e-03  2.450 1.42e-02
## celltypelarge      -0.001090 -6.21e-04  2.71e-03 -0.229 8.19e-01
## karno              -0.001180 -4.37e-04  8.77e-05 -4.980 6.28e-07
## diagtime           -0.000243 -4.92e-05  1.64e-04 -0.300 7.65e-01
## age                -0.000246 -6.27e-05  1.28e-04 -0.491 6.23e-01
## priorYes            0.003300  1.54e-03  2.86e-03  0.539 5.90e-01
##
## Chisq=41.62 on 8 df, p=1.6e-06; test weights=aalen

#summary(aa_fit)  # provides a more complete summary of results
autoplot(aa_fit)

Random Forests Model

As a final example of what some might perceive as a data-science-like way to do time-to-event modeling, I’ll use the ranger() function to fit a Random Forests Ensemble model to the data. Note, however, that there is nothing new about building tree models of survival data. Terry Therneau, along with Brian Ripley, also wrote the rpart package, R’s basic tree-modeling package. See section 8.4 of the rpart vignette [14] for a survival analysis example.
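To make that concrete, a single survival tree on the veteran data takes only a few lines with rpart; this sketch is mine rather than the vignette’s, and relies on rpart accepting a Surv() response (it then uses its exponential splitting method).

# My sketch, not from the rpart vignette: a single survival tree on the
# veteran data. rpart uses exponential ("exp") splitting for Surv() responses.
library(rpart)
tree_fit <- rpart(Surv(time, status) ~ trt + celltype + karno + diagtime + age + prior,
                  data = veteran)
plot(tree_fit, uniform = TRUE, margin = 0.1)
text(tree_fit, use.n = TRUE, cex = 0.8)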

ranger() fits an ensemble of survival trees and returns an estimated survival curve for each observation in the data set. The next block of code builds the model using the same variables used in the Cox model above, and plots twenty random curves, along with a curve that represents the global average for all of the patients. Note that I am using plain old base R graphics here.

# ranger model
r_fit <- ranger(Surv(time, status) ~ trt + celltype + karno + diagtime + age + prior,
                data = vet,
                mtry = 4,
                importance = "permutation",
                splitrule = "extratrees",
                verbose = TRUE)

# Average the survival models
death_times <- r_fit$unique.death.times
surv_prob <- data.frame(r_fit$survival)
avg_prob <- sapply(surv_prob, mean)

# Plot the survival models for each patient
plot(r_fit$unique.death.times, r_fit$survival[1,],
     type = "l",
     ylim = c(0,1),
     col = "red",
     xlab = "Days",
     ylab = "survival",
     main = "Patient Survival Curves")

# Overlay 20 randomly chosen patients and the average curve
cols <- colors()
for (n in sample(c(2:dim(vet)[1]), 20)){
  lines(r_fit$unique.death.times, r_fit$survival[n,], type = "l", col = cols[n])
}
lines(death_times, avg_prob, lwd = 2)
legend(500, 0.7, legend = c('Average = black'))

The next block of code illustrates how ranger() ranks variable importance.

vi <- data.frame(sort(round(r_fit$variable.importance, 4), decreasing = TRUE))
names(vi) <- "importance"
head(vi)
##          importance
## karno        0.0734
## celltype     0.0338
## diagtime     0.0003
## trt         -0.0007
## prior       -0.0011
## age         -0.0027

Notice that ranger() flags karno and celltype as the two most important; the same variables with the smallest p-values in the Cox model. Also note that the importance results just give variable names and not level names. This is because ranger and other tree models do not usually create dummy variables.

But ranger() does compute Harrell’s c-index (See [8] p. 370 for the definition), which is similar to the Concordance statistic described above. This is a generalization of the ROC curve, which reduces to the Wilcoxon-Mann-Whitney statistic for binary variables, which in turn, is equivalent to computing the area under the ROC curve.

cat("Prediction Error = 1 - Harrell's c-index = ", r_fit$prediction.error) ## Prediction Error = 1 - Harrell's c-index = 0.308269

A c-index of about 0.69 (1 minus the prediction error above) would normally be pretty good for a first try. But note that the ranger model doesn’t do anything to address the time varying coefficients. This apparently is a challenge. In a 2011 paper [16], Bou-Hamad observes:

However, in the context of survival trees, a further difficulty arises when time–varying effects are included. Hence, we feel that the interpretation of covariate effects with tree ensembles in general is still mainly unsolved and should attract future research.

I believe that the major use for tree-based models for survival data will be to deal with very large data sets.

Finally, to provide an “eyeball comparison” of the three survival curves, I’ll plot them on the same graph. The following code pulls out the survival data from the three model objects and puts them into a data frame for ggplot().

# Set up for ggplot
kmi <- rep("KM", length(km_fit$time))
km_df <- data.frame(km_fit$time, km_fit$surv, kmi)
names(km_df) <- c("Time", "Surv", "Model")

coxi <- rep("Cox", length(cox_fit$time))
cox_df <- data.frame(cox_fit$time, cox_fit$surv, coxi)
names(cox_df) <- c("Time", "Surv", "Model")

rfi <- rep("RF", length(r_fit$unique.death.times))
rf_df <- data.frame(r_fit$unique.death.times, avg_prob, rfi)
names(rf_df) <- c("Time", "Surv", "Model")

plot_df <- rbind(km_df, cox_df, rf_df)

p <- ggplot(plot_df, aes(x = Time, y = Surv, color = Model))
p + geom_line()

For this data set, I would put my money on a carefully constructed Cox model that takes into account the time varying coefficients. I suspect that there are neither enough observations nor enough explanatory variables for the ranger() model to do better.

This four-package excursion only hints at the Survival Analysis tools that are available in R, but it does illustrate some of the richness of the R platform, which has been under continuous development and improvement for nearly twenty years. The ranger package, which suggests the survival package, and ggfortify, which depends on ggplot2 and also suggests the survival package, illustrate how open-source code allows developers to build on the work of their predecessors. The documentation that accompanies the survival package, the numerous online resources, and the statistics such as concordance and Harrell’s c-index packed into the objects produced by fitting the models give some idea of the statistical depth that underlies almost everything in R.

Some Tutorials and Papers

For a very nice, basic tutorial on survival analysis, have a look at the Survival Analysis in R [5] and the OIsurv package produced by the folks at OpenIntro.

See the NUS course notes [10] for an exposition of the Cox Proportional Hazards Model, and the Encyclopedia of Biostatistics entry [11] for an introduction to Aalen’s Additive Regression Model.

For an elementary treatment of evaluating the proportional hazards assumption that uses the veterans data set, see the text by Kleinbaum and Klein [13].

For an exposition of the sort of predictive survival analysis modeling that can be done with ranger, be sure to have a look at Manuel Amunategui’s post and video [9].

See the 1995 paper [15] by Intrator and Kooperberg for an early review of using classification and regression trees to study survival data.

References

For convenience, I have collected the references used throughout the post here.

[1] Hacking, Ian (2006). The Emergence of Probability: A Philosophical Study of Early Ideas about Probability, Induction and Statistical Inference. Cambridge University Press, 2nd ed., p. 11.
[2] Andersen, P.K. and Keiding, N. (1998). Survival analysis. Encyclopedia of Biostatistics 6. Wiley, pp. 4452–4461.
[3] Kaplan, E.L. and Meier, P. (1958). Non-parametric estimation from incomplete observations. Journal of the American Statistical Association 53, pp. 457–481, 562–563.
[4] Cox, D.R. (1972). Regression models and life-tables (with discussion). Journal of the Royal Statistical Society (B) 34, pp. 187–220.
[5] Diez, David. Survival Analysis in R. OpenIntro.
[6] Klein, John P. and Moeschberger, Melvin L. (1997). Survival Analysis: Techniques for Censored and Truncated Data. Springer.
[7] Wright, Marvin and Ziegler, Andreas (2017). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, Vol. 77, Issue 1.
[8] Harrell, Frank, Lee, Kerry and Mark, Daniel (1996). Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors. Statistics in Medicine, Vol. 15, pp. 361–387.
[9] Amunategui, Manuel. Survival Ensembles: Survival Plus Classification for Improved Time-Based Predictions in R.
[10] NUS Course Notes. Chapter 3: The Cox Proportional Hazards Model.
[11] Encyclopedia of Biostatistics, 2nd Edition (2005). Aalen’s Additive Regression Model.
[12] Therneau, T., Crowson, C. and Atkinson, E. Using Time Dependent Covariates and Time Dependent Coefficients in the Cox Model.
[13] Kleinbaum, D.G. and Klein, M. (2005). Survival Analysis: A Self-Learning Text. Springer.
[14] Therneau, T. and Atkinson, E. An Introduction to Recursive Partitioning Using the RPART Routines.
[15] Intrator, O. and Kooperberg, C. (1995). Trees and splines in survival analysis. Statistical Methods in Medical Research.
[16] Bou-Hamad, I. (2011). A review of survival trees. Statistics Surveys, Vol. 5.

Author’s note: this post was originally published on April 26, 2017 but was subsequently withdrawn because of an error spotted by Dr. Terry Therneau. He observed that the Cox Proportional Hazards Model fitted in that post did not properly account for the time varying covariates. This revised post makes use of a different data set, and points to resources for addressing time varying covariates. Many thanks to Dr. Therneau. Any errors that remain are mine.



To leave a comment for the author, please follow the link and comment on their blog: R Views.

Super excited for R promises

Mon, 09/25/2017 - 02:00

(This article was first published on Appsilon Data Science Blog, and kindly contributed to R-bloggers)

We at Appsilon are excited that RStudio will soon introduce promises in R, which is going to be a huge step forward for programming in R. We have already used futures and similar libraries to run code asynchronously, but promises are going to be a standard, and they look like they will be very easy to use. They support chaining, which is a great way of building clean code by piping computations that take a long time.
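To make the chaining idea concrete, here is a small sketch of what the promise pipe could look like, based on the pre-release promises package running on top of future; the exact API may still change before release.

# A sketch based on the pre-release promises package; the API may change.
library(future)
library(promises)
plan(multisession)

future({ Sys.sleep(5); mtcars }) %...>%   # long computation in a background process
  subset(mpg > 25) %...>%                 # each step runs once the previous one resolves
  head() %...>%
  print()
# The R session stays responsive while the chain runs in the background.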

We’ve used futures/promises/tasks in programming languages like C#, Scala, and JavaScript, and promises have always had a big impact on the way code is structured and on the overall execution speed.

We recently spoke with Joe Cheng, and he mentioned promises at the EARL conference. Here are some links if you’re interested in reading more about promises on GitHub and Medium.

What do you think? Let us know in the comments.


To leave a comment for the author, please follow the link and comment on their blog: Appsilon Data Science Blog.

eXtremely Boost your machine learning Exercises (Part-1)

Sun, 09/24/2017 - 19:11

(This article was first published on R-exercises, and kindly contributed to R-bloggers)


eXtreme Gradient Boosting is a machine learning model which became really popular a few years ago after winning several Kaggle competitions. It is a very powerful algorithm that uses an ensemble of weak learners to obtain a strong learner. Its R implementation is available in the xgboost package, and it is well worth including in anyone’s machine learning portfolio.

This is the first part of eXtremely Boost your machine learning series. For other parts follow the tag xgboost.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Load the xgboost library and download the German Credit dataset. Your goal in this tutorial will be to predict Creditability (the first column in the dataset).

Exercise 2
Convert columns c(2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20) to factors and then encode them as dummy variables. HINT: use model.matrix()
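If model.matrix() is new to you, here is a tiny, hypothetical illustration (made-up data, not the German Credit set) of how a factor column is expanded into dummy columns:

# Tiny illustration of dummy encoding with model.matrix(); made-up data,
# not the German Credit set.
toy <- data.frame(amount  = c(1000, 2500, 800),
                  purpose = factor(c("car", "radio/tv", "car")))
model.matrix(~ . - 1, data = toy)   # "- 1" drops the intercept column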

Exercise 3
Split the data into training and test sets (700:300). Create an xgb.DMatrix for both sets with Creditability as the label.

Exercise 4
Train xgboost with a logistic objective, 30 rounds of training, and a maximal depth of 2.
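If you want a sense of the shape of the call before checking the solutions, here is a rough sketch (one possible way, not the official answer); it assumes dtrain is the xgb.DMatrix built in Exercise 3.

# One possible shape of the training call (not the official solution);
# dtrain is assumed to be the xgb.DMatrix built in Exercise 3.
model <- xgboost(data = dtrain,
                 objective = "binary:logistic",
                 nrounds = 30,
                 max_depth = 2,
                 verbose = 0)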

Exercise 5
To check model performance, calculate the test set classification error.

Exercise 6
Plot predictors importance.

Learn more about machine learning in the online course Beginner to Advanced Guide on Machine Learning with R Tool. In this course you will learn how to:

  • Create a machine learning algorithm from a beginner point of view
  • Quickly dive into more advanced methods at an accessible pace and with more explanations
  • And much more

This course shows a complete workflow start to finish. It is a great introduction and fallback when you have some experience.

Exercise 7
Use xgb.train() instead of xgboost() to add both train and test sets as a watchlist. Train the model with the same parameters, but with 100 rounds to see how it performs during training.

Exercise 8
Train the model again, adding AUC and Log Loss as evaluation metrics.

Exercise 9
Plot how AUC and Log Loss for the train and test sets changed during the training process. Use the plotting function/library of your choice.

Exercise 10
Check how setting parameter eta to 0.01 influences the AUC and Log Loss curves.

Related exercise sets:
  1. Model Evaluation 2
  2. How to prepare and apply machine learning to your dataset
  3. Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-5)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

RcppGSL 0.3.3

Sun, 09/24/2017 - 17:41

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A maintenance update RcppGSL 0.3.3 is now on CRAN. It switched the vignette to our new pinp package and its two-column pdf default.

The RcppGSL package provides an interface from R to the GNU GSL using the Rcpp package.

No user-facing new code or features were added. The NEWS file entries follow below:

Changes in version 0.3.3 (2017-09-24)
  • We also check for gsl-config at package load.

  • The vignette now uses the pinp package in two-column mode.

  • Minor other fixes to package and testing infrastructure.

Courtesy of CRANberries, a summary of changes to the most recent release is available.

More information is on the RcppGSL page. Questions, comments etc should go to the issue tickets at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

RcppCNPy 0.2.7

Sat, 09/23/2017 - 21:07

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new version of the RcppCNPy package arrived on CRAN yesterday.

RcppCNPy provides R with read and write access to NumPy files thanks to the cnpy library by Carl Rogers.
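For anyone new to the package, a typical round trip is just two calls; a minimal sketch (the file name is made up):

# Minimal round trip with RcppCNPy; the file name is made up.
library(RcppCNPy)
m <- matrix(runif(12), nrow = 3)
npySave("mat.npy", m)        # write an R matrix as a NumPy .npy file
m2 <- npyLoad("mat.npy")     # read it back into R
all.equal(m, m2)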

This version updates internals for function registration, but otherwise mostly switches the vignette over to the shiny new pinp two-page template and package.

Changes in version 0.2.7 (2017-09-22)
  • Vignette updated to Rmd and use of pinp package

  • File src/init.c added for dynamic registration

CRANberries also provides a diffstat report for the latest release. As always, feedback is welcome and the best place to start a discussion may be the GitHub issue tickets page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.
