
Algorithmic Trading: Using Quantopian’s Zipline Python Library In R And Backtest Optimizations By Grid Search And Parallel Processing

Thu, 05/31/2018 - 02:00

(This article was first published on business-science.io - Articles, and kindly contributed to R-bloggers)

We are ready to demo our new experimental package for Algorithmic Trading, flyingfox, which uses reticulate to bring Quantopian’s open source algorithmic trading Python library, Zipline, to R. The flyingfox library is part of our NEW Business Science Labs innovation lab, which is dedicated to bringing experimental packages to our followers early on so they can test them out and let us know what they think before they make their way to CRAN. This article includes a long-form code tutorial on how to perform backtest optimizations of trading algorithms via grid search and parallel processing. In this article, we’ll show you how to use the combination of tibbletime (time-based extension of tibble) + furrr (a parallel-processing complement to purrr) + flyingfox (Zipline in R) to develop a backtested trading algorithm that can be optimized via grid search and parallel processing. We are releasing this article as a complement to the R/Finance Conference presentation “A Time Series Platform For The Tidyverse”, which Matt will present on Saturday (June 2nd, 2018). Enjoy!

New: Business Science Labs

I (Davis) am excited to introduce a new open source initiative called Business Science Labs. A lot of the experimental work we do is done behind the scenes, and much of it you don’t see early on. What you do see is a “refined” version of what we think you need based on our perception, which is not always reality. We aim to change this. Starting today, we have created Business Science Labs, which is aimed at bringing our experimental software to you earlier so you can test it out and let us know your thoughts!

Our first initiative is to bring Quantopian’s Open Source algorithmic trading Python library, Zipline, to R via an experimental package called flyingfox (built using the awesome reticulate package).

What We’re Going To Learn

Introducing Business Science Labs is exciting, but we really want to educate you on some new packages! In this tutorial, we are going to go over how to backtest algorithmic trading strategies using parallel processing and Quantopian’s Zipline infrastructure in R. You’ll gain exposure to tibbletime, furrr, and our experimental flyingfox package. The general progression is:

  • tibbletime: What it is and why it’s essential to performing scalable time-based calculations in the tidyverse

  • furrr: Why you need to know this package for speeding up code by processing purrr in parallel

  • flyingfox: The story behind the package, and how you can use it to test algorithmic trading strategies

  • tibbletime + furrr + flyingfox: Putting it all together to perform parallelized algorithmic trading strategies and analyze time-based performance

Here’s an example of the grid search we perform to determine which are the best combinations of short and long moving averages for the stock symbol JPM (JP Morgan).

Here’s an example of the time series showing the order (buy/sell) points determined by the moving average crossovers, and the effect on the portfolio value.

Algorithmic Trading Strategies And Backtesting

Algorithmic trading is nothing new. Financial companies have been performing algorithmic trading for years as a way of attempting to “beat” the market. It can be very difficult to do, but some traders have successfully applied advanced algorithms to yield significant profits.

Using an algorithm to trade boils down to buying and selling. In the simplest case, when an algorithm detects an asset (a stock) is going to go higher, a buy order is placed. Conversely, when the algorithm detects that an asset is going to go lower, a sell order is placed. Positions are managed by buying and selling all or part of the portfolio of assets. To keep things simple, we’ll focus on just the full buy/sell orders.

One very basic method of algorithmic trading is using short and long moving averages to detect shifts in trend. The crossover is the point where a buy/sell order would take place. The figure below shows the price of Halliburton (symbol “HAL”), in which a trader holds an initial position of, say, 10,000 shares. In a hypothetical case, the trader could use a combination of a 20 day short moving average and a 150 day long moving average and look for buy/sell points at the crossovers. If the trader sold the position in full at the sell signal and bought it back in full at the buy signal, the trader would stand to avoid a delta loss of approximately $5/share during the downswing, or about $50,000 on the 10,000 shares.

Backtesting is a technique used to evaluate how a trading strategy would have performed in the past. It’s impossible to know what the future will bring, but a strategy that has worked in the past helps instill confidence in an algorithm.

Quantopian is a platform designed to enable anyone to develop algorithmic trading strategies. To help its community, Quantopian provides several open source tools. The one we’ll focus on is Zipline for backtesting. There’s one downside: it’s only available in Python.

With the advent of the reticulate package, which enables R to interface with Python libraries, we took it upon ourselves to test out the viability of porting Zipline to R. Our experiment is called flyingfox.

RStudio Cloud Experiment Sandbox

In this code-based tutorial, we’ll use an experimental package called flyingfox. It has several dependencies including Python that require setup time and effort. For those that want to test out flyingfox quickly, we’ve created a FREE RStudio Cloud Sandbox for running experiments. You can access the Cloud Sandbox here for FREE: https://rstudio.cloud/project/38291

Packages Needed For Backtest Optimization

The meat of this code tutorial is the section Backtest Optimization Using tibbletime + furrr + flyingfox. However, before we get to it, we’ll go over the three main packages used to do high-performance backtesting optimizations:

  1. tibbletime: What it is, and why it’s essential to performing scalable time-based calculations in the tidyverse

  2. furrr: Why you need to know this package for speeding up code by processing purrr in parallel

  3. flyingfox: How to use it to test algorithmic trading strategies

Finally, we’ll put it all together: tibbletime + furrr + flyingfox for backtest optimizations performed using parallel processing and grid search!

Install & Load Libraries

Install Packages

For this post, you’ll need to install the development version of flyingfox.

devtools::install_github("DavisVaughan/flyingfox")

If you are on Windows, you should also install the development version of furrr.

devtools::install_github("DavisVaughan/furrr")

Other packages you’ll need include tibbletime, furrr, and tidyverse. We’ll also load tidyquant mainly for the ggplot2 themes. We’ll install ggrepel to repel overlapping plot labels. You can install these from CRAN using install.packages().
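For example:

install.packages(c("tibbletime", "furrr", "tidyverse", "tidyquant", "ggrepel"))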

Load Packages

library(tidyverse)
library(tibbletime)
library(furrr)
library(flyingfox)
library(tidyquant)
library(ggrepel)

We’ll cover how a few packages work before jumping into backtesting and optimizations.

1. tibbletime

The tibbletime package is a cornerstone for future time series software development efforts at Business Science. We have major plans for this package. Here are some key benefits:

  • Time periods down to milliseconds are supported
  • Because this is a tibble under the hood, we are able to leverage existing packages for analysis without reinventing the wheel
  • Scalable grouped analysis is at your fingertips because of collapse_by() and integration with group_by()

It’s best to learn now, and we’ll go over the basics along with a few commonly used functions: collapse_by(), rollify(), filter_time(), and as_period().

First, let’s get some data. We’ll use the FANG data set that comes with tibbletime, which includes stock prices for FB, AMZN, NFLX, and GOOG. We recommend using the tidyquant package to get this or other stock data.

data("FANG")

FANG
## # A tibble: 4,032 x 8
##    symbol date        open  high   low close    volume adjusted
##  1 FB     2013-01-02  27.4  28.2  27.4  28     69846400     28
##  2 FB     2013-01-03  27.9  28.5  27.6  27.8   63140600     27.8
##  3 FB     2013-01-04  28.0  28.9  27.8  28.8   72715400     28.8
##  4 FB     2013-01-07  28.7  29.8  28.6  29.4   83781800     29.4
##  5 FB     2013-01-08  29.5  29.6  28.9  29.1   45871300     29.1
##  6 FB     2013-01-09  29.7  30.6  29.5  30.6  104787700     30.6
##  7 FB     2013-01-10  30.6  31.5  30.3  31.3   95316400     31.3
##  8 FB     2013-01-11  31.3  32.0  31.1  31.7   89598000     31.7
##  9 FB     2013-01-14  32.1  32.2  30.6  31.0   98892800     31.0
## 10 FB     2013-01-15  30.6  31.7  29.9  30.1  173242600     30.1
## # ... with 4,022 more rows

Next, you’ll need to convert this tbl_df object to a tbl_time object using the tbl_time() function.

FANG_time <- FANG %>%
    tbl_time(date) %>%
    group_by(symbol)

FANG_time
## # A time tibble: 4,032 x 8
## # Index:  date
## # Groups: symbol [4]
##    symbol date        open  high   low close    volume adjusted
##  1 FB     2013-01-02  27.4  28.2  27.4  28     69846400     28
##  2 FB     2013-01-03  27.9  28.5  27.6  27.8   63140600     27.8
##  3 FB     2013-01-04  28.0  28.9  27.8  28.8   72715400     28.8
##  4 FB     2013-01-07  28.7  29.8  28.6  29.4   83781800     29.4
##  5 FB     2013-01-08  29.5  29.6  28.9  29.1   45871300     29.1
##  6 FB     2013-01-09  29.7  30.6  29.5  30.6  104787700     30.6
##  7 FB     2013-01-10  30.6  31.5  30.3  31.3   95316400     31.3
##  8 FB     2013-01-11  31.3  32.0  31.1  31.7   89598000     31.7
##  9 FB     2013-01-14  32.1  32.2  30.6  31.0   98892800     31.0
## 10 FB     2013-01-15  30.6  31.7  29.9  30.1  173242600     30.1
## # ... with 4,022 more rows

collapse_by()

Beautiful. Now we have a time-aware tibble. Let’s test out some functions. First, let’s take a look at collapse_by(), which is used for grouped operations. We’ll collapse by “year”, and calculate the average price for each of the stocks.

FANG_time %>%
    collapse_by(period = "year") %>%
    group_by(symbol, date) %>%
    summarise(mean_adj = mean(adjusted))
## # A time tibble: 16 x 3
## # Index:  date
## # Groups: symbol [?]
##    symbol date       mean_adj
##  1 AMZN   2013-12-31    298.
##  2 AMZN   2014-12-31    333.
##  3 AMZN   2015-12-31    478.
##  4 AMZN   2016-12-30    700.
##  5 FB     2013-12-31     35.5
##  6 FB     2014-12-31     68.8
##  7 FB     2015-12-31     88.8
##  8 FB     2016-12-30    117.
##  9 GOOG   2013-12-31    442.
## 10 GOOG   2014-12-31    561.
## 11 GOOG   2015-12-31    602.
## 12 GOOG   2016-12-30    743.
## 13 NFLX   2013-12-31     35.3
## 14 NFLX   2014-12-31     57.5
## 15 NFLX   2015-12-31     91.9
## 16 NFLX   2016-12-30    102.

rollify()

Next, let’s take a look at rollify(). Remember the chart of Halliburton prices at the beginning. It was created using rollify(), which turns any function into a rolling function. Here’s the code for the chart. Notice how we create two rolling functions using mean() and supplying the appropriate window argument.

hal_tbl <- tq_get("HAL", from = "2016-01-01", to = "2017-12-31")

roll_20  <- rollify(mean, window = 20)
roll_150 <- rollify(mean, window = 150)

hal_ma_tbl <- hal_tbl %>%
    select(date, adjusted) %>%
    mutate(
        ma_20  = roll_20(adjusted),
        ma_150 = roll_150(adjusted)
    ) %>%
    gather(key = key, value = value, -date, factor_key = T)

hal_ma_tbl %>%
    ggplot(aes(date, value, color = key, linetype = key)) +
    geom_line() +
    theme_tq() +
    scale_color_manual(values = c(palette_light()[[1]], "blue", "red")) +
    labs(title = "Halliburton Asset Price (HAL)",
         subtitle = "Strategy: Short MA = 20, Long MA = 150",
         x = "Date", y = "Adjusted Stock Price") +
    annotate("point", x = as_date("2017-04-07"), y = 49.25, colour = "red", size = 5) +
    annotate("text", label = "Sell Signal",
             x = as_date("2017-04-07"), y = 49.25 + 2.5, color = "red", hjust = 0) +
    annotate("point", x = as_date("2017-10-09"), y = 44.25, colour = "blue", size = 5) +
    annotate("text", label = "Buy Signal",
             x = as_date("2017-10-09"), y = 44.25 + 2.5, color = "blue")

filter_time()

Let’s check out filter_time(), which enables easier subsetting of time-based indexes. Let’s redo the chart above, instead focusing in on sell and buy signals, which occur after February 2017. We can convert the previously stored hal_ma_tbl to a tbl_time object, group by the “key” column, and then filter using the time function format filter_time("2017-03-01" ~ "end"). We then reuse the plotting code above.

hal_ma_time <- hal_ma_tbl %>%
    as_tbl_time(date) %>%
    group_by(key)

hal_ma_time %>%
    filter_time("2017-03-01" ~ "end") %>%
    ggplot(aes(date, value, color = key, linetype = key)) +
    geom_line() +
    theme_tq() +
    scale_color_manual(values = c(palette_light()[[1]], "blue", "red")) +
    labs(title = "Halliburton Asset Price (HAL)",
         subtitle = "2016 through 2017",
         x = "Date", y = "Adjusted Stock Price") +
    annotate("point", x = as_date("2017-04-07"), y = 49.25, colour = "red", size = 5) +
    annotate("text", label = "Sell Signal",
             x = as_date("2017-04-07"), y = 49.25 + 2.5, color = "red", hjust = 0) +
    annotate("point", x = as_date("2017-10-09"), y = 44.25, colour = "blue", size = 5) +
    annotate("text", label = "Buy Signal",
             x = as_date("2017-10-09"), y = 44.25 + 2.5, color = "blue")

as_period()

We can use the as_period() function to change the periodicity to a less granular level (e.g. going from daily to monthly). Here we convert the HAL share prices from daily periodicity to monthly periodicity.

hal_ma_time %>%
    as_period(period = "monthly", side = "end") %>%
    ggplot(aes(date, value, color = key, linetype = key)) +
    geom_line() +
    theme_tq() +
    scale_color_manual(values = c(palette_light()[[1]], "blue", "red")) +
    labs(title = "Halliburton Asset Price (HAL)",
         subtitle = "Periodicity Changed To Monthly",
         x = "Date", y = "Adjusted Stock Price")

Next, let’s check out a new package for parallel processing using purrr: furrr!

2. furrr

The furrr package combines the awesome powers of future for parallel processing with purrr for iteration. Let’s break this down into its pieces: purrr, future, and furrr.

purrr

The purrr package is used for iteration over a number of different types of generic objects in R, including vectors, lists, and tibbles. The main function used is map(), which comes in several varieties (e.g. map(), map2(), pmap(), map_chr(), map_lgl(), map_df(), etc). Here’s a basic example to get the class() of the columns of the FANG_time variable. Using map() iterates over the columns of the data frame returning a list containing the contents of the function applied to each column.

FANG_time %>% map(class)
## $symbol
## [1] "character"
##
## $date
## [1] "Date"
##
## $open
## [1] "numeric"
##
## $high
## [1] "numeric"
##
## $low
## [1] "numeric"
##
## $close
## [1] "numeric"
##
## $volume
## [1] "numeric"
##
## $adjusted
## [1] "numeric"

future

The future package enables parallel processing. Here are a few important points:

  • future is a unified interface for parallel programming in R.

  • You set a “plan” for how code should be executed, call future() with an expression to evaluate, and call value() to retrieve the value. The first future() call sends the code off to another R process and is non-blocking, so you can keep running R code in this session. It only blocks once you call value(). (A short sketch follows this list.)

  • Global variables and packages are automatically identified and
    exported for you!
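Here is a minimal sketch of that workflow (assuming the future package is installed; the plan used here is illustrative):

library(future)
plan(multisession)   # choose an execution plan (this post later uses plan("multiprocess"))

f <- future({
    Sys.sleep(5)     # stand-in for a long-running computation
    42
})

# ...other R code can run here while the future evaluates in another process...

value(f)             # blocks until the result is ready, then returns 42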

Now, the major point: If you’re familiar with purrr, you can take advantage of future parallel processing with furrr!

furrr = future + purrr

furrr takes the best parts of purrr and combines them with the parallel processing capabilities of future to create a powerful package with an interface instantly familiar to anyone who has used purrr before. You only need to do two things:

  1. Set up a plan() such as plan("multiprocess")

  2. Change map() (or other purrr function) to future_map() (or other compatible furrr function)

Every purrr function has a compatible furrr function. For example,
map_df() has future_map_df(). Set a plan, and run future_map_df() and that is all there
is to it!
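For instance, squaring a few numbers in parallel is just a plan() plus a furrr verb (a toy sketch, not from the original post):

library(future)
library(furrr)

plan(multisession)   # or plan("multiprocess"), as used later in this post

# Sequential purrr version: map_dbl(1:4, ~ .x ^ 2)
future_map_dbl(1:4, ~ .x ^ 2)
## [1]  1  4  9 16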

furrr Example: Multiple Moving Averages

Say you would like to compute not a single moving average but multiple moving averages for a given data set. We can create a custom function, multi_roller(), that uses map_dfc() to iteratively map a rollify()-ed mean() based on a sequence of windows. Here’s the function and how it works when a user supplies a data frame, a column of the data frame on which to compute moving averages, and a sequence of moving average windows.

multi_roller <- function(data, col, ..f = mean,
                         window = c(10, 20, 30, 100, 200), sep = "V_") {

    col_expr <- enquo(col)

    ret_tbl <- window %>%
        set_names(~ paste0(sep, window)) %>%
        map_dfc(~ rollify(..f, window = .x)(pull(data, !! col_expr))) %>%
        bind_cols(data, .)

    return(ret_tbl)
}

We can test the function out with the FB stock prices from FANG. We’ll ungroup, filter by FB, and select the important columns, then pass the data frame to the multi_roller() function with window = seq(10, 100, by = 10). Great, it’s working!

FANG_time %>% ungroup() %>% filter(symbol == "FB") %>% select(symbol, date, adjusted) %>% multi_roller(adjusted, mean, window = seq(10, 100, by = 10), sep = "ma_") ## # A time tibble: 1,008 x 13 ## # Index: date ## symbol date adjusted ma_10 ma_20 ma_30 ma_40 ma_50 ma_60 ## ## 1 FB 2013-01-02 28 NA NA NA NA NA NA ## 2 FB 2013-01-03 27.8 NA NA NA NA NA NA ## 3 FB 2013-01-04 28.8 NA NA NA NA NA NA ## 4 FB 2013-01-07 29.4 NA NA NA NA NA NA ## 5 FB 2013-01-08 29.1 NA NA NA NA NA NA ## 6 FB 2013-01-09 30.6 NA NA NA NA NA NA ## 7 FB 2013-01-10 31.3 NA NA NA NA NA NA ## 8 FB 2013-01-11 31.7 NA NA NA NA NA NA ## 9 FB 2013-01-14 31.0 NA NA NA NA NA NA ## 10 FB 2013-01-15 30.1 29.8 NA NA NA NA NA ## # ... with 998 more rows, and 4 more variables: ma_70 , ## # ma_80 , ma_90 , ma_100

We can apply this multi_roller() function to a nested data frame. Let’s try it on our FANG_time data set. We’ll select the columns of interest (symbol, date, and adjusted), group by symbol, and use the nest() function to get the data nested.

FANG_time_nested <- FANG_time %>%
    select(symbol, date, adjusted) %>%
    group_by(symbol) %>%
    nest()

FANG_time_nested
## # A tibble: 4 x 2
##   symbol data
## 1 FB
## 2 AMZN
## 3 NFLX
## 4 GOOG

Next, we can perform a rowwise map using the combination of mutate() and map(). We apply multi_roller() as an argument to map() along with data (variable being mapped), and the additional static arguments, col and window.

FANG_time_nested %>% mutate(ma = map(data, multi_roller, col = adjusted, ..f = mean, window = seq(10, 100, by = 10), sep = "ma_")) %>% select(-data) %>% unnest() ## # A time tibble: 4,032 x 13 ## # Index: date ## symbol date adjusted ma_10 ma_20 ma_30 ma_40 ma_50 ma_60 ## ## 1 FB 2013-01-02 28 NA NA NA NA NA NA ## 2 FB 2013-01-03 27.8 NA NA NA NA NA NA ## 3 FB 2013-01-04 28.8 NA NA NA NA NA NA ## 4 FB 2013-01-07 29.4 NA NA NA NA NA NA ## 5 FB 2013-01-08 29.1 NA NA NA NA NA NA ## 6 FB 2013-01-09 30.6 NA NA NA NA NA NA ## 7 FB 2013-01-10 31.3 NA NA NA NA NA NA ## 8 FB 2013-01-11 31.7 NA NA NA NA NA NA ## 9 FB 2013-01-14 31.0 NA NA NA NA NA NA ## 10 FB 2013-01-15 30.1 29.8 NA NA NA NA NA ## # ... with 4,022 more rows, and 4 more variables: ma_70 , ## # ma_80 , ma_90 , ma_100

Great, we have our moving averages. But…

What if instead of 10 moving averages, we had 500? This would take a really long time to run on many stocks. Solution: Parallelize with furrr!

There are two ways we could do this since there are two maps:

  1. Parallelize the map() internal to the multi_roller() function
  2. Parallelize the rowwise map() applied to each symbol

We’ll choose the former (1) to show off furrr.

To make the multi_roller_parallel() function, copy the multi_roller() function and do 2 things:

  1. Add plan("multiprocess") at the beginning
  2. Change map_dfc() to future_map_dfc()

That’s it!

multi_roller_parallel <- function(data, col, ..f = mean,
                                  window = c(10, 20, 30, 100, 200), sep = "V_") {

    plan("multiprocess")

    col_expr <- enquo(col)

    ret_tbl <- window %>%
        set_names(~ paste0(sep, window)) %>%
        future_map_dfc(~ rollify(..f, window = .x)(pull(data, !! col_expr))) %>%
        bind_cols(data, .)

    return(ret_tbl)
}

In the previous rowwise map, switch out multi_roller() for multi_roller_parallel() and change window to 2:500. Sit back and let the function run in parallel using each of your computer’s cores.

FANG_time_nested %>% mutate(ma = map(data, multi_roller_parallel, col = adjusted, ..f = mean, window = 2:500, sep = "ma_")) %>% select(-data) %>% unnest() ## # A time tibble: 4,032 x 502 ## # Index: date ## symbol date adjusted ma_2 ma_3 ma_4 ma_5 ma_6 ma_7 ## ## 1 FB 2013-01-02 28 NA NA NA NA NA NA ## 2 FB 2013-01-03 27.8 27.9 NA NA NA NA NA ## 3 FB 2013-01-04 28.8 28.3 28.2 NA NA NA NA ## 4 FB 2013-01-07 29.4 29.1 28.6 28.5 NA NA NA ## 5 FB 2013-01-08 29.1 29.2 29.1 28.8 28.6 NA NA ## 6 FB 2013-01-09 30.6 29.8 29.7 29.5 29.1 28.9 NA ## 7 FB 2013-01-10 31.3 30.9 30.3 30.1 29.8 29.5 29.3 ## 8 FB 2013-01-11 31.7 31.5 31.2 30.7 30.4 30.1 29.8 ## 9 FB 2013-01-14 31.0 31.3 31.3 31.1 30.7 30.5 30.3 ## 10 FB 2013-01-15 30.1 30.5 30.9 31.0 30.9 30.6 30.4 ## # ... with 4,022 more rows, and 493 more variables: ma_8 , ## # ma_9 , ma_10 , ma_11 , ma_12 , ma_13 , ## # ma_14 , ma_15 , ma_16 , ma_17 , ma_18 , ## # ma_19 , ma_20 , ma_21 , ma_22 , ma_23 , ## # ma_24 , ma_25 , ma_26 , ma_27 , ma_28 , ## # ma_29 , ma_30 , ma_31 , ma_32 , ma_33 , ## # ma_34 , ma_35 , ma_36 , ma_37 , ma_38 , ## # ma_39 , ma_40 , ma_41 , ma_42 , ma_43 , ## # ma_44 , ma_45 , ma_46 , ma_47 , ma_48 , ## # ma_49 , ma_50 , ma_51 , ma_52 , ma_53 , ## # ma_54 , ma_55 , ma_56 , ma_57 , ma_58 , ## # ma_59 , ma_60 , ma_61 , ma_62 , ma_63 , ## # ma_64 , ma_65 , ma_66 , ma_67 , ma_68 , ## # ma_69 , ma_70 , ma_71 , ma_72 , ma_73 , ## # ma_74 , ma_75 , ma_76 , ma_77 , ma_78 , ## # ma_79 , ma_80 , ma_81 , ma_82 , ma_83 , ## # ma_84 , ma_85 , ma_86 , ma_87 , ma_88 , ## # ma_89 , ma_90 , ma_91 , ma_92 , ma_93 , ## # ma_94 , ma_95 , ma_96 , ma_97 , ma_98 , ## # ma_99 , ma_100 , ma_101 , ma_102 , ## # ma_103 , ma_104 , ma_105 , ma_106 , ## # ma_107 , ...

Bam! 500 moving averages run in parallel in a fraction of the time it would take running in series.

3. flyingfox

We have one final package we need to demo prior to jumping into our Algorithmic Trading Backtest Optimization: flyingfox.

What is Quantopian?

Quantopian is a company that has set up a community-driven platform enabling everyone (from traders to home-gamers) to develop algorithmic trading strategies. The one downside is that they only use Python.

What is Zipline?

Zipline is a Python module open-sourced by Quantopian to help traders back-test their trading algorithms. Here are some quick facts about Quantopian’s Zipline Python module for backtesting algorithmic trading strategies:

  • It is used to develop and backtest financial algorithms using Python.

  • It includes an event-driven backtester (really good at preventing look-ahead bias)

  • Algorithms consist of two main functions:

    1. initialize(): You write an initialize() function that is called once at the beginning of the backtest. This sets up variables for use in the backtest, schedules functions to be called daily, monthly, etc., and lets you set slippage or commission for the backtest.

    2. handle_data(): You then write a handle_data() function that is called once per day (or minute) and implements the trading logic. You can place orders, retrieve historical pricing data, record metrics for performance evaluation, and more.

  • Extra fact: You can use any Python module inside the handle_data() function, so you have a lot of flexibility here.

What is reticulate?

The reticulate package from RStudio is an R interface to Python. It smartly takes care of (most of the) conversion between R and Python objects.
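As a tiny, generic illustration of what reticulate does (not flyingfox-specific; it assumes a working Python installation):

library(reticulate)

py_os <- import("os")   # import a Python module as an R object
py_os$getcwd()          # call a Python function; the result is converted to an R string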

Can you combine them?

Yes, and that’s exactly what we did. We used reticulate to access the Zipline Python module from R!

What is the benefit to R users?

What if you could write your initialize() and handle_data() functions in R utilizing any financial or time series R package for your analysis and then have them called from Python and Zipline?

Introducing flyingfox: An R interface to Zipline

flyingfox integrates the Zipline backtesting module with R! Further, it takes care of the overhead of creating the main infrastructure functions initialize() and handle_data() by enabling the user to set up:

  • fly_initialize(): R version of Zipline’s initialize()
  • fly_handle_data(): R version of Zipline’s handle_data()

flyingfox takes care of passing these functions to Python and Zipline.

Why “Flying Fox”?

Zipline just doesn’t quite make for a good hex sticker. A flying fox is a synonym for a zipline, and it’s hard to argue that this majestic animal wouldn’t create a killer hex sticker.

Getting Started With flyingfox: Moving Average Crossover

Let’s do a Moving Average Crossover example using the following strategy:

  • Using JP Morgan (JPM) stock prices
  • If the 20 day short-term moving average crosses above the 150 day long-term moving average, buy 100% into JP Morgan
  • If 20 day crosses below the 150 day, sell all of the current JPM position

Setup

Setup can take a while and use a fair amount of disk space due to ingesting data (the step in which Zipline saves pricing data for every major asset to your computer). We recommend one of two options:

  1. No weight option (for people that just want to try it out): Use our flyingfox sandbox on RStudio Cloud. You can connect here: https://rstudio.cloud/project/38291

  2. Heavy weight option (for people that want to expand and really test it): Follow the instructions on my GitHub page to fly_ingest() data. The setup and data ingesting process are discussed here: https://github.com/DavisVaughan/flyingfox. Keep in mind that this is still a work in progress. We recommend doing the no weight option as a first start.

Initialize

First, write the R function for initialize(). It must take context as an argument. This is where you store variables used later, which are accessed via context$variable. We’ll store context$i = 0L to initialize the tracking of days, and context$asset = fly_symbol("JPM") to trade the JPM symbol. You can select any symbol that you’d like (provided Quantopian pulls it from Quandl).

fly_initialize <- function(context) {

    # We want to track what day we are on
    context$i = 0L

    # We want to trade JP Morgan stock
    context$asset = fly_symbol("JPM")
}

Handle Data

Next, write a handle_data() function:

  • This implements the crossover trading algorithm logic

  • In this example we also use fly_data_history() to retrieve historical data each day for JP Morgan

  • We use fly_order_target_percent() to move to a new percentage amount invested in JP Morgan (if we order 1, we want to move to be 100% invested in JP Morgan, no matter where we were before)

  • We use fly_record() to store arbitrary metrics for review later

fly_handle_data <- function(context, data) {

    # Increment day
    context$i <- context$i + 1L

    # While < 20 days of data, return
    if(context$i < 20L) {
        return()
    }

    # Calculate a short term (20 day) moving average
    # by pulling history for the asset (JPM) and taking an average
    short_hist <- fly_data_history(data, context$asset, "price", bar_count = 20L, frequency = "1d")
    short_mavg <- mean(short_hist)

    # Calculate a long term (150 day) moving average
    long_hist <- fly_data_history(data, context$asset, "price", bar_count = 150L, frequency = "1d")
    long_mavg <- mean(long_hist)

    if(short_mavg > long_mavg) {
        # If short > long, go 100% in JPM
        fly_order_target_percent(asset = context$asset, target = 1)
    } else if (short_mavg < long_mavg) {
        # Else if we hit the crossover, dump all of JPM
        fly_order_target_percent(asset = context$asset, target = 0)
    }

    # Record today's data
    # We record the current JPM price, along with the value of the short and long
    # term moving average
    fly_record(
        JPM        = fly_data_current(data, context$asset, "price"),
        short_mavg = short_mavg,
        long_mavg  = long_mavg
    )
}

Run The Algorithm

Finally, run the algorithm from 2013-2016 using fly_run_algorithm().

performance_time <- fly_run_algorithm(
    initialize  = fly_initialize,
    handle_data = fly_handle_data,
    start       = as.Date("2013-01-01"),
    end         = as.Date("2016-01-01")
)

If you got to this point, you’ve just successfully run a single backtest. Let’s review the performance output.

Reviewing The Performance

Let’s glimpse performance_time to see what the results show. It’s a tbl_time time series data frame organized by the “date” column, and there is a ton of information. We’ll focus on:

  • date: Time stamp for each point in the performance analysis
  • JPM: This is the price of the asset
  • short_mavg and long_mavg: These are our moving averages we are using for the buy/sell crossover signals
  • portfolio_value: The value of the portfolio at each time point
  • transactions: Transactions stored as a list column. The tibble contains a bunch of information that is useful in determining what happened. More information below.
glimpse(performance_time) ## Observations: 756 ## Variables: 41 ## $ date 2013-01-02 21:00:00, 2013-01-03 ... ## $ JPM NaN, NaN, NaN, NaN, NaN, NaN, NaN... ## $ algo_volatility NaN, 0.00000000, 0.00000000, 0.00... ## $ algorithm_period_return 0.00000000, 0.00000000, 0.0000000... ## $ alpha NaN, NaN, NaN, NaN, NaN, NaN, NaN... ## $ benchmark_period_return 0.003691629, 0.007396885, 0.01111... ## $ benchmark_volatility NaN, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0... ## $ beta NaN, NaN, NaN, NaN, NaN, NaN, NaN... ## $ capital_used 0.000, 0.000, 0.000, 0.000, 0.000... ## $ ending_cash 10000.0000, 10000.0000, 10000.000... ## $ ending_exposure 0.00, 0.00, 0.00, 0.00, 0.00, 0.0... ## $ ending_value 0.00, 0.00, 0.00, 0.00, 0.00, 0.0... ## $ excess_return 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... ## $ gross_leverage 0.0000000, 0.0000000, 0.0000000, ... ## $ long_exposure 0.00, 0.00, 0.00, 0.00, 0.00, 0.0... ## $ long_mavg NaN, NaN, NaN, NaN, NaN, NaN, NaN... ## $ long_value 0.00, 0.00, 0.00, 0.00, 0.00, 0.0... ## $ longs_count 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... ## $ max_drawdown 0.000000000, 0.000000000, 0.00000... ## $ max_leverage 0.0000000, 0.0000000, 0.0000000, ... ## $ net_leverage 0.0000000, 0.0000000, 0.0000000, ... ## $ orders [[], [], [], [], [], [], [], [],... ## $ period_close 2013-01-02 21:00:00, 2013-01-03 ... ## $ period_label "2013-01", "2013-01", "2013-01", ... ## $ period_open 2013-01-02 14:31:00, 2013-01-03 ... ## $ pnl 0.0000, 0.0000, 0.0000, 0.0000, 0... ## $ portfolio_value 10000.000, 10000.000, 10000.000, ... ## $ positions [[], [], [], [], [], [], [], [],... ## $ returns 0.000000000, 0.000000000, 0.00000... ## $ sharpe NaN, NaN, NaN, NaN, NaN, NaN, NaN... ## $ short_exposure 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... ## $ short_mavg NaN, NaN, NaN, NaN, NaN, NaN, NaN... ## $ short_value 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... ## $ shorts_count 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... ## $ sortino NaN, NaN, NaN, NaN, NaN, NaN, NaN... ## $ starting_cash 10000.0000, 10000.0000, 10000.000... ## $ starting_exposure 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0... ## $ starting_value 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0... ## $ trading_days 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11... ## $ transactions [[], [], [], [], [], [], [], [],... ## $ treasury_period_return 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

First, let’s plot the asset (JPM) along with the short and long moving averages. We can see there are a few crossovers.

performance_time %>%
    select(date, JPM, short_mavg, long_mavg) %>%
    gather(key = "key", value = "value", -date, factor_key = T) %>%
    ggplot(aes(date, value, color = key, linetype = key)) +
    geom_line() +
    theme_tq() +
    scale_color_manual(values = c(palette_light()[[1]], "blue", "red")) +
    scale_y_continuous(labels = scales::dollar) +
    labs(
        title = "JPM Stock Price",
        subtitle = "Strategy: 20 Day Short MA, 150 Day Long MA",
        x = "Date", y = "Share Price"
    )

Next, we can investigate the transactions. Transaction information is stored within the performance_time output as nested tibbles. We can get these values by flagging which time points contain tibbles and then filtering and unnesting. The transaction type can be determined by whether the “amount” of the transaction (the number of shares bought or sold) is positive or negative.

transactions_flagged_time <- performance_time %>% select(date, portfolio_value, transactions) %>% mutate(flag_transactions = map_lgl(transactions, is.tibble)) %>% filter(flag_transactions == TRUE) %>% unnest() %>% mutate(transaction_type = ifelse(amount >= 0, "buy", "sell")) transactions_flagged_time %>% glimpse() ## Observations: 9 ## Variables: 10 ## $ date 2013-01-31 21:00:00, 2014-05-07 20:00:... ## $ portfolio_value 9994.801, 11472.859, 11467.007, 11488.6... ## $ flag_transactions TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU... ## $ commission NA, NA, NA, NA, NA, NA, NA, NA, NA ## $ amount 212, -212, 197, 2, -199, 179, -179, 166, 3 ## $ sid [, "573c67da9fb6481c8a10cc55cddf57eb", "79... ## $ price 47.07353, 54.02298, 57.44871, 57.55877,... ## $ dt 2013-01-31 21:00:00, 2014-05-07 21:00:... ## $ transaction_type "buy", "sell", "buy", "buy", "sell", "b...

Finally, we can visualize the results using ggplot2. We can see that the ending portfolio value is just under $11.5K.

performance_time %>%
    ggplot(aes(x = date, y = portfolio_value)) +
    geom_line(color = palette_light()[[1]]) +
    geom_point(aes(color = transaction_type), size = 5, data = transactions_flagged_time) +
    geom_label_repel(
        aes(label = amount),
        vjust = -1,
        color = palette_light()[[1]],
        data  = transactions_flagged_time) +
    theme_tq() +
    scale_color_tq() +
    scale_y_continuous(labels = scales::dollar) +
    labs(title = "Portfolio Value And Transactions",
         subtitle = "JPM Strategy: Short MA = 20, Long MA = 150",
         x = "Date", y = "Portfolio Value")

Last, let’s use tibbletime to see what happened to our portfolio towards the end. We’ll use the portfolio value as a proxy for the stock price, visualizing the crossover of the 20 and 150-day moving averages of the portfolio. Note that the actual algorithm is run with moving averages based on the adjusted stock price, not the portfolio value.

performance_time %>%
    # Data manipulation with tibbletime & multi_roller() function
    select(date, portfolio_value) %>%
    multi_roller(portfolio_value, mean, window = c(20, 150), sep = "ma_") %>%
    filter_time("2015-05-01" ~ "end") %>%
    gather(key = key, value = value, -date, factor_key = T) %>%
    # Visualization with ggplot2
    ggplot(aes(date, value, color = key, linetype = key)) +
    geom_line() +
    theme_tq() +
    scale_color_manual(values = c(palette_light()[[1]], "blue", "red")) +
    scale_y_continuous(labels = scales::dollar) +
    labs(title = "Portfolio Value And Transactions",
         subtitle = "JPM Strategy: Short MA = 20, Long MA = 150",
         x = "Date", y = "Portfolio Value")

Backtest Optimization Via Grid Search

Now for the main course: Optimizing our algorithm using the backtested performance. To do so, we’ll combine what we learned from our three packages: tibbletime + furrr + flyingfox.

Let’s say we want to use backtesting to find the optimal combination, or several best combinations, of short and long term moving averages for our strategy. We can do this using Cartesian grid search, which simply means creating every combination of the possible “hyperparameters” (the parameters we wish to adjust). Recognizing that running multiple backtests can take some time, we’ll parallelize the operation too.

Preparation

Before we can do grid search, we need to adjust our fly_handle_data() function to enable our parameters to be adjusted. The two parameters we are concerned with are the short and long moving averages. We’ll add these as arguments of a new function fly_handle_data_parameterized().

fly_handle_data_parameterized <- function(context, data, short_ma = 20L, long_ma = 150L) {

    # Increment day
    context$i <- context$i + 1L

    # While < 300 days of data, return
    if(context$i < 300) {
        return()
    }

    # Calculate the short term moving average
    # by pulling history for the asset (JPM) and taking an average
    short_hist <- fly_data_history(data, context$asset, "price", bar_count = short_ma, frequency = "1d")
    short_mavg <- mean(short_hist)

    # Calculate the long term moving average
    long_hist <- fly_data_history(data, context$asset, "price", bar_count = long_ma, frequency = "1d")
    long_mavg <- mean(long_hist)

    if(short_mavg > long_mavg) {
        # If short > long, go 100% in JPM
        fly_order_target_percent(asset = context$asset, target = 1)
    } else if (short_mavg < long_mavg) {
        # Else if we hit the crossover, dump all of JPM
        fly_order_target_percent(asset = context$asset, target = 0)
    }

    # Record today's data
    # We record the current JPM price, along with the value of the short and long
    # term moving average
    fly_record(
        JPM        = fly_data_current(data, context$asset, "price"),
        short_mavg = short_mavg,
        long_mavg  = long_mavg
    )
}

Making The Grid

Next, we can create a grid of values from a list() containing the hyperparameter values. We can turn this into a tibble of all combinations (the cross product) using the cross_df() function.

ma_grid_tbl <- list( short_ma = seq(20, 100, by = 20), long_ma = seq(150, 300, by = 50) ) %>% cross_df() ma_grid_tbl ## # A tibble: 20 x 2 ## short_ma long_ma ## ## 1 20 150 ## 2 40 150 ## 3 60 150 ## 4 80 150 ## 5 100 150 ## 6 20 200 ## 7 40 200 ## 8 60 200 ## 9 80 200 ## 10 100 200 ## 11 20 250 ## 12 40 250 ## 13 60 250 ## 14 80 250 ## 15 100 250 ## 16 20 300 ## 17 40 300 ## 18 60 300 ## 19 80 300 ## 20 100 300

Now that we have the hyperparameters, let’s create a new column containing the function we wish to run. We’ll use the partial() function to partially fill the function with the hyperparameters.

ma_grid_tbl <- ma_grid_tbl %>%
    mutate(f = map2(.x = short_ma, .y = long_ma,
                    .f = ~ partial(fly_handle_data_parameterized, short_ma = .x, long_ma = .y)))

ma_grid_tbl
## # A tibble: 20 x 3
##    short_ma long_ma f
##  1       20     150
##  2       40     150
##  3       60     150
##  4       80     150
##  5      100     150
##  6       20     200
##  7       40     200
##  8       60     200
##  9       80     200
## 10      100     200
## 11       20     250
## 12       40     250
## 13       60     250
## 14       80     250
## 15      100     250
## 16       20     300
## 17       40     300
## 18       60     300
## 19       80     300
## 20      100     300

Running Grid Search In Parallel Using furrr

Now for the grid search. We use the future_map() function to process in parallel. Make sure to set up a plan() first. The following code runs fly_run_algorithm() for each handle-data function stored in the “f” column.

plan("multiprocess")

results_tbl <- ma_grid_tbl %>%
    mutate(results = future_map(
        .x = f,
        .f = ~ fly_run_algorithm(
            initialize  = fly_initialize,
            handle_data = .x,
            start       = as.Date("2013-01-01"),
            end         = as.Date("2016-01-01")
        )
    ))

results_tbl
## # A tibble: 20 x 4
##    short_ma long_ma f     results
##  1       20     150
##  2       40     150
##  3       60     150
##  4       80     150
##  5      100     150
##  6       20     200
##  7       40     200
##  8       60     200
##  9       80     200
## 10      100     200
## 11       20     250
## 12       40     250
## 13       60     250
## 14       80     250
## 15      100     250
## 16       20     300
## 17       40     300
## 18       60     300
## 19       80     300
## 20      100     300

Inspecting The Backtest Performance Results

The performance results are stored in the “results” column as tbl_time objects. We can examine the first result.

results_tbl$results[[1]] ## # A time tibble: 756 x 41 ## # Index: date ## date JPM algo_volatility algorithm_period_~ alpha ## ## 1 2013-01-02 21:00:00 NaN NaN 0 NaN ## 2 2013-01-03 21:00:00 NaN 0 0 NaN ## 3 2013-01-04 21:00:00 NaN 0 0 NaN ## 4 2013-01-07 21:00:00 NaN 0 0 NaN ## 5 2013-01-08 21:00:00 NaN 0 0 NaN ## 6 2013-01-09 21:00:00 NaN 0 0 NaN ## 7 2013-01-10 21:00:00 NaN 0 0 NaN ## 8 2013-01-11 21:00:00 NaN 0 0 NaN ## 9 2013-01-14 21:00:00 NaN 0 0 NaN ## 10 2013-01-15 21:00:00 NaN 0 0 NaN ## # ... with 746 more rows, and 36 more variables: ## # benchmark_period_return , benchmark_volatility , ## # beta , capital_used , ending_cash , ## # ending_exposure , ending_value , excess_return , ## # gross_leverage , long_exposure , long_mavg , ## # long_value , longs_count , max_drawdown , ## # max_leverage , net_leverage , orders , ## # period_close , period_label , period_open , ## # pnl , portfolio_value , positions , ## # returns , sharpe , short_exposure , ## # short_mavg , short_value , shorts_count , ## # sortino , starting_cash , starting_exposure , ## # starting_value , trading_days , transactions , ## # treasury_period_return

We can also get the final portfolio value using a combination of pull() and tail().

results_tbl$results[[1]] %>%
    pull(portfolio_value) %>%
    tail(1)
## [1] 9187.849

We can turn this into a function and map it over the “results” column to obtain the “final_portfolio_value” for each of the grid search combinations.

results_tbl <- results_tbl %>%
    mutate(final_portfolio_value = map_dbl(results, ~ pull(.x, portfolio_value) %>% tail(1)))

results_tbl
## # A tibble: 20 x 5
##    short_ma long_ma f     results final_portfolio_value
##  1       20     150                               9188.
##  2       40     150                               9277.
##  3       60     150                               9823.
##  4       80     150                              10816.
##  5      100     150                              10697.
##  6       20     200                               9065.
##  7       40     200                              10612.
##  8       60     200                              11444.
##  9       80     200                              11366.
## 10      100     200                              11473.
## 11       20     250                               9522.
## 12       40     250                              10898.
## 13       60     250                              11494.
## 14       80     250                              11494.
## 15      100     250                              11494.
## 16       20     300                              10951.
## 17       40     300                              11494.
## 18       60     300                              11494.
## 19       80     300                              11494.
## 20      100     300                              11494.

Visualizing The Backtest Performance Results

Now let’s visualize the results to see which combinations of short and long moving averages maximize the portfolio value. It’s clear that short >= 60 days and long >= 200 days maximize the return. But, why?

results_tbl %>%
    ggplot(aes(long_ma, short_ma, color = final_portfolio_value)) +
    geom_point(aes(size = final_portfolio_value)) +
    geom_label_repel(
        aes(label = scales::dollar(round(final_portfolio_value, 0))),
        vjust = -1,
        color = palette_light()[[1]]) +
    theme_tq() +
    scale_color_gradient(low = palette_light()[[1]], high = palette_light()[[3]]) +
    scale_size(range = c(1, 12)) +
    labs(
        title = "JPM Strategy: Backtest Grid Search Optimization",
        subtitle = "Short and Fast Moving Averages: Optimum = 60 days (Short MA), 300 Days (Long MA)",
        x = "Long Moving Average (Days)",
        y = "Short Moving Average (Days)",
        color = "Final Portfolio Value",
        size = "Final Portfolio Value"
    ) +
    theme(legend.direction = "vertical", legend.position = "right")

Let’s get the transaction information (buy/sell) by unnesting the results and determining which transactions are buys and sells.

transactions_tbl <- results_tbl %>% unnest(results) %>% select(short_ma:date, portfolio_value, transactions) %>% mutate(flag_transactions = map_lgl(transactions, is.tibble)) %>% filter(flag_transactions == TRUE) %>% unnest() %>% mutate(transaction_type = ifelse(amount >= 0, "buy", "sell")) %>% mutate( short_ma = as.factor(short_ma) %>% fct_rev(), long_ma = as.factor(long_ma) ) transactions_tbl %>% glimpse() ## Observations: 105 ## Variables: 13 ## $ short_ma 20, 20, 20, 20, 20, 20, 20, 20, 20,... ## $ long_ma 150, 150, 150, 150, 150, 150, 150, ... ## $ final_portfolio_value 9187.849, 9187.849, 9187.849, 9187.... ## $ date 2014-03-13 20:00:00, 2014-03-14 20... ## $ portfolio_value 9994.890, 9888.191, 9404.815, 9400.... ## $ flag_transactions TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,... ## $ commission NA, NA, NA, NA, NA, NA, NA, NA, NA,... ## $ amount 172, 2, -174, 161, 2, -163, 147, -1... ## $ sid [... ## $ order_id "d8a3551cb36c4ff985c42e4d78e00ef8",... ## $ price 57.44871, 56.82840, 54.02298, 57.44... ## $ dt 2014-03-13 21:00:00, 2014-03-14 21... ## $ transaction_type "buy", "buy", "sell", "buy", "buy",...

Finally, we can visualize the portfolio value over time for each combination of short and long moving averages. By plotting the buy/sell transactions, we can see the effect on a stock with a bullish trend. The best-performing portfolios are those that were essentially bought and held rather than repeatedly sold at moving average crossovers. For this particular stock, the benefit of downside protection via the moving average crossover comes at a cost to the portfolio during the bullish uptrend.

results_tbl %>%
    unnest(results) %>%
    mutate(
        short_ma = as.factor(short_ma) %>% fct_rev(),
        long_ma  = as.factor(long_ma)
    ) %>%
    group_by(short_ma, long_ma) %>%
    filter_time("2014" ~ "end") %>%
    ggplot(aes(date, portfolio_value)) +
    geom_line(color = palette_light()[[1]]) +
    geom_vline(aes(xintercept = date), color = "blue",
               data = filter(transactions_tbl, transaction_type == "buy")) +
    geom_vline(aes(xintercept = date), color = "red",
               data = filter(transactions_tbl, transaction_type == "sell")) +
    facet_grid(short_ma ~ long_ma, labeller = label_both) +
    theme_tq() +
    scale_y_continuous(labels = scales::dollar) +
    scale_x_datetime(date_labels = "%Y", date_breaks = "1 year") +
    labs(
        title = "JPM Strategy: Portfolio Value Over Time With Transactions",
        subtitle = "Bull Market: Buy and Hold Strategy is Optimal Vs Short/Long Moving Average Strategies",
        x = "Date", y = "Portfolio Value"
    )

Conclusions

We’ve covered a lot of ground in this article. Congrats if you’ve made it through. You’ve now been exposed to three cool packages:

  1. tibbletime: For time-series in the tidyverse
  2. furrr: Our parallel processing extension to purrr
  3. flyingfox: Our experimental package brought to you as part of our Business Science Labs initiative

Further, you’ve seen how to apply all three of these packages to perform grid search backtest optimization of your trading algorithm.

Business Science University

If you are looking to take the next step and learn Data Science For Business (DS4B), you should consider Business Science University. Our goal is to empower data scientists through teaching the tools and techniques we implement every day. You’ll learn:

  • Data Science Framework: Business Science Problem Framework
  • Tidy Eval
  • H2O Automated Machine Learning
  • LIME Feature Explanations
  • Sensitivity Analysis
  • Tying data science to financial improvement

All while solving a REAL WORLD CHURN PROBLEM: Employee Turnover!


Get 15% OFF in June!

DS4B Virtual Workshop: Predicting Employee Attrition

Did you know that an organization that loses 200 high performing employees per year is essentially losing $15M/year in lost productivity? Many organizations don’t realize this because it’s an indirect cost. It goes unnoticed. What if you could use data science to predict and explain turnover in a way that managers could make better decisions and executives would see results? You will learn the tools to do so in our Virtual Workshop. Here’s an example of a Shiny app you will create.


Get 15% OFF in June!

Shiny App That Predicts Attrition and Recommends Management Strategies, Taught in HR 301

Our first Data Science For Business Virtual Workshop teaches you how to solve this employee attrition problem in four fully integrated courses.

The Virtual Workshop is intended for intermediate and advanced R users. It’s code intensive (like these articles), but also teaches you fundamentals of data science consulting including CRISP-DM and the Business Science Problem Framework. The content bridges the gap between data science and the business, making you even more effective and improving your organization in the process.


Get 15% OFF in June!

Don’t Miss A Beat: Connect With Business Science

If you like our software (anomalize, tidyquant, tibbletime, timetk, and sweep), our courses, and our company, you can connect with us.



Quick guide for converting from JAGS or BUGS to NIMBLE

Wed, 05/30/2018 - 19:28

(This article was first published on R – NIMBLE, and kindly contributed to R-bloggers)




Converting to NIMBLE from JAGS, OpenBUGS or WinBUGS


NIMBLE is a hierarchical modeling package that uses nearly the same modeling language as the popular MCMC packages WinBUGS, OpenBUGS and JAGS. NIMBLE makes the modeling language extensible — you can add distributions and functions — and also allows customization of MCMC or other algorithms that use models. Here is a quick summary of steps to convert existing code from WinBUGS, OpenBUGS or JAGS to NIMBLE. For more information, see examples on r-nimble.org or the NIMBLE User Manual.
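For example, a user-defined function can be written as a nimbleFunction and then used inside model code; user-defined distributions are added in a similar way (this is a generic sketch, not part of the original post):

library(nimble)

timesTwo <- nimbleFunction(
    run = function(x = double(0)) {
        returnType(double(0))
        return(2 * x)    # an arbitrary example calculation
    }
)

timesTwo(1.5)            # nimbleFunctions can also be called directly from R
## [1] 3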

Main steps for converting existing code

These steps assume you are familiar with running WinBUGS, OpenBUGS or JAGS through an R package such as R2WinBUGS, R2jags, rjags, or jagsUI.

  1. Wrap your model code in nimbleCode({}), directly in R.
    • This replaces the step of writing or generating a separate file containing the model code.
    • Alternatively, you can read standard JAGS- and BUGS-formatted code and data files using
      readBUGSmodel.
  2. Provide information about missing or empty indices
    • Example: If x is a matrix, you must write at least x[,] to show it has two dimensions.
    • If other declarations make the size of x clear, x[,] will work in some circumstances.
    • If not, either provide index ranges (e.g. x[1:n, 1:m]) or use the dimensions argument to nimbleModel to provide the sizes in each dimension.
  3. Choose how you want to run MCMC.
    • Use nimbleMCMC() as the just-do-it way to run an MCMC. This will take all steps to
      set up and run an MCMC using NIMBLE’s default configuration.
    • To use NIMBLE’s full flexibility: build the model, configure and build the MCMC, and compile both the model and MCMC. Then run the MCMC via runMCMC or by calling the run function of the compiled MCMC (a minimal sketch of this workflow follows this list). See the NIMBLE User Manual to learn more about what you can do.
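A minimal sketch of that longer workflow (object names here are illustrative placeholders):

model   <- nimbleModel(code, constants = constants, data = data, inits = inits)
conf    <- configureMCMC(model, monitors = params)
mcmc    <- buildMCMC(conf)
Cmodel  <- compileNimble(model)
Cmcmc   <- compileNimble(mcmc, project = model)
samples <- runMCMC(Cmcmc, niter = 10000, nburnin = 1000)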

See below for a list of some more nitty-gritty additional steps you may need to consider for some models.

Example: An animal abundance model

This example is adapted from Chapter 6, Section 6.4 of Applied Hierarchical Modeling in Ecology: Analysis of distribution, abundance and species richness in R and BUGS. Volume I: Prelude and Static Models by Marc Kéry and J. Andrew Royle (2015, Academic Press). The book’s web site provides code for its examples.

Original code

The original model code looks like this:

cat(file = "model2.txt","
model {

# Priors
for(k in 1:3){                # Loop over 3 levels of hab or time factors
  alpha0[k] ~ dunif(-10, 10)  # Detection intercepts
  alpha1[k] ~ dunif(-10, 10)  # Detection slopes
  beta0[k] ~ dunif(-10, 10)   # Abundance intercepts
  beta1[k] ~ dunif(-10, 10)   # Abundance slopes
}

# Likelihood
# Ecological model for true abundance
for (i in 1:M){
  N[i] ~ dpois(lambda[i])
  log(lambda[i]) <- beta0[hab[i]] + beta1[hab[i]] * vegHt[i]
  # Some intermediate derived quantities
  critical[i] <- step(2-N[i]) # yields 1 whenever N is 2 or less
  z[i] <- step(N[i]-0.5)      # Indicator for occupied site
  # Observation model for replicated counts
  for (j in 1:J){
    C[i,j] ~ dbin(p[i,j], N[i])
    logit(p[i,j]) <- alpha0[j] + alpha1[j] * wind[i,j]
  }
}

# Derived quantities
Nocc <- sum(z[])         # Number of occupied sites among sample of M
Ntotal <- sum(N[])       # Total population size at M sites combined
Nhab[1] <- sum(N[1:33])  # Total abundance for sites in hab A
Nhab[2] <- sum(N[34:66]) # Total abundance for sites in hab B
Nhab[3] <- sum(N[67:100])# Total abundance for sites in hab C
for(k in 1:100){         # Predictions of lambda and p ...
  for(level in 1:3){     # ... for each level of hab and time factors
    lam.pred[k, level] <- exp(beta0[level] + beta1[level] * XvegHt[k])
    logit(p.pred[k, level]) <- alpha0[level] + alpha1[level] * Xwind[k]
  }
}
N.critical <- sum(critical[]) # Number of populations with critical size
}")

Brief summary of the model

This is known as an "N-mixture" model in ecology. The details aren't really important for illustrating the mechanics of converting this model to NIMBLE, but here is a brief summary anyway. The latent abundances N[i] at sites i = 1...M are assumed to follow a Poisson. The j-th count at the i-th site, C[i, j], is assumed to follow a binomial with detection probability p[i, j]. The abundance at each site depends on a habitat-specific intercept and coefficient for vegetation height, with a log link. The detection probability for each sampling occasion depends on a date-specific intercept and coefficient for wind speed. Kéry and Royle concocted this as a simulated example to illustrate the hierarchical modeling approaches for estimating abundance from count data on repeated visits to multiple sites.

NIMBLE version of the model code

Here is the model converted for use in NIMBLE. In this case, the only changes to the code are to insert some missing index ranges (see comments).

library(nimble)

Section6p4_code <- nimbleCode( {

# Priors
for(k in 1:3) {                # Loop over 3 levels of hab or time factors
  alpha0[k] ~ dunif(-10, 10)   # Detection intercepts
  alpha1[k] ~ dunif(-10, 10)   # Detection slopes
  beta0[k] ~ dunif(-10, 10)    # Abundance intercepts
  beta1[k] ~ dunif(-10, 10)    # Abundance slopes
}

# Likelihood
# Ecological model for true abundance
for (i in 1:M){
  N[i] ~ dpois(lambda[i])
  log(lambda[i]) <- beta0[hab[i]] + beta1[hab[i]] * vegHt[i]
  # Some intermediate derived quantities
  critical[i] <- step(2-N[i]) # yields 1 whenever N is 2 or less
  z[i] <- step(N[i]-0.5)      # Indicator for occupied site
  # Observation model for replicated counts
  for (j in 1:J){
    C[i,j] ~ dbin(p[i,j], N[i])
    logit(p[i,j]) <- alpha0[j] + alpha1[j] * wind[i,j]
  }
}

# Derived quantities; unnecessary when running for inference purposes
# NIMBLE: We have filled in indices in the next two lines.
Nocc <- sum(z[1:100])    # Number of occupied sites among sample of M
Ntotal <- sum(N[1:100])  # Total population size at M sites combined
Nhab[1] <- sum(N[1:33])  # Total abundance for sites in hab A
Nhab[2] <- sum(N[34:66]) # Total abundance for sites in hab B
Nhab[3] <- sum(N[67:100])# Total abundance for sites in hab C
for(k in 1:100){         # Predictions of lambda and p ...
  for(level in 1:3){     # ... for each level of hab and time factors
    lam.pred[k, level] <- exp(beta0[level] + beta1[level] * XvegHt[k])
    logit(p.pred[k, level]) <- alpha0[level] + alpha1[level] * Xwind[k]
  }
}
# NIMBLE: We have filled in indices in the next line.
N.critical <- sum(critical[1:100]) # Number of populations with critical size
})

Simulated data

To carry this example further, we need some simulated data. Kéry and Royle provide separate code to do this. With NIMBLE we could use the model itself to simulate data rather than writing separate simulation code. But for our goals here, we simply copy Kéry and Royle's simulation code, and we compact it somewhat:

# Code from Kery and Royle (2015)

# Choose sample sizes and prepare obs. data array y
set.seed(1)                          # So we all get same data set
M <- 100                             # Number of sites
J <- 3                               # Number of repeated abundance measurements
C <- matrix(NA, nrow = M, ncol = J)  # to contain the observed data

# Create a covariate called vegHt
vegHt <- sort(runif(M, -1, 1))       # sort for graphical convenience

# Choose parameter values for abundance model and compute lambda
beta0 <- 0                           # Log-scale intercept
beta1 <- 2                           # Log-scale slope for vegHt
lambda <- exp(beta0 + beta1 * vegHt) # Expected abundance

# Draw local abundance
N <- rpois(M, lambda)

# Create a covariate called wind
wind <- array(runif(M * J, -1, 1), dim = c(M, J))

# Choose parameter values for measurement error model and compute detectability
alpha0 <- -2                         # Logit-scale intercept
alpha1 <- -3                         # Logit-scale slope for wind
p <- plogis(alpha0 + alpha1 * wind)  # Detection probability

# Take J = 3 abundance measurements at each site
for(j in 1:J) {
  C[,j] <- rbinom(M, N, p[,j])
}

# Create factors
time <- matrix(rep(as.character(1:J), M), ncol = J, byrow = TRUE)
hab <- c(rep("A", 33), rep("B", 33), rep("C", 34)) # assumes M = 100

# Bundle data
# NIMBLE: For full flexibility, we could separate this list
# into constants and data lists. For simplicity we will keep
# it as one list to be provided as the "constants" argument.
# See comments about how we would split it if desired.
win.data <- list(
  ## NIMBLE: C is the actual data
  C = C,
  ## NIMBLE: Covariates can be data or constants
  ## If they are data, you could modify them after the model is built
  wind = wind,
  vegHt = vegHt,
  XvegHt = seq(-1, 1,, 100), # Used only for derived quantities
  Xwind = seq(-1, 1,,100),   # Used only for derived quantities
  ## NIMBLE: The rest of these are constants, needed for model definition
  ## We can provide them in the same list and NIMBLE will figure it out.
  M = nrow(C),
  J = ncol(C),
  hab = as.numeric(factor(hab))
)

Initial values

Next we need to set up initial values and choose parameters to monitor in the MCMC output. To do so we will again directly use Kéry and Royle's code.

Nst <- apply(C, 1, max) + 1  # Important to give good inits for latent N
inits <- function() list(N = Nst,
                         alpha0 = rnorm(3),
                         alpha1 = rnorm(3),
                         beta0 = rnorm(3),
                         beta1 = rnorm(3))

# Parameters monitored
# could also estimate N, bayesian counterpart to BUPs before: simply add "N" to the list
params <- c("alpha0", "alpha1", "beta0", "beta1", "Nocc", "Ntotal", "Nhab",
            "N.critical", "lam.pred", "p.pred")

Run MCMC with nimbleMCMC

Now we are ready to run an MCMC in nimble. We will run only one chain, using the same settings as Kéry and Royle.

samples <- nimbleMCMC(
  code = Section6p4_code,
  constants = win.data, ## provide the combined data & constants as constants
  inits = inits,
  monitors = params,
  niter = 22000,
  nburnin = 2000,
  thin = 10)

## defining model...
## Detected C as data within 'constants'.
## Adding C as data for building model.
## building model...
## setting data and initial values...
## running calculate on model (any error reports that follow may simply reflect missing values in model variables) ...
## checking model sizes and dimensions...
## checking model calculations...
## model building finished.
## compiling... this may take a minute. Use 'showCompilerOutput = TRUE' to see C++ compiler details.
## compilation finished.
## runMCMC's handling of nburnin changed in nimble version 0.6-11. Previously, nburnin samples were discarded *post-thinning*. Now nburnin samples are discarded *pre-thinning*. The number of samples returned will be floor((niter-nburnin)/thin).
## running chain 1...
## |-------------|-------------|-------------|-------------|
## |-------------------------------------------------------|

Work with the samples

Finally we want to look at our samples. NIMBLE returns samples as a simple matrix with named columns. There are numerous packages for processing MCMC output. If you want to use the coda package, you can convert a matrix to a coda mcmc object like this:

library(coda)
coda.samples <- as.mcmc(samples)

Alternatively, if you call nimbleMCMC with the argument samplesAsCodaMCMC = TRUE, the samples will be returned as a coda object.
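For example, a minimal sketch of the same call as above with that argument added (everything else unchanged):

samples <- nimbleMCMC(
  code = Section6p4_code,
  constants = win.data,
  inits = inits,
  monitors = params,
  niter = 22000,
  nburnin = 2000,
  thin = 10,
  samplesAsCodaMCMC = TRUE)  ## samples is now a coda mcmc object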

To show that MCMC really happened, here is a plot of N.critical:

plot(jitter(samples[, "N.critical"]), xlab = "iteration", ylab = "N.critical", main = "Number of populations with critical size", type = "l")

Next steps

NIMBLE allows users to customize MCMC and other algorithms in many ways. See the NIMBLE User Manual and web site for more ideas.

Smaller steps you may need for converting existing code

If the main steps above aren't sufficient, consider these additional steps when converting from JAGS, WinBUGS or OpenBUGS to NIMBLE.

  1. Convert any use of truncation syntax
    • e.g. x ~ dnorm(0, tau) T(a, b) should be re-written as x ~ T(dnorm(0, tau), a, b).
    • If reading model code from a file using readBUGSmodel, the x ~ dnorm(0, tau) T(a, b) syntax will work.
  2. Possibly split the data into data and constants for NIMBLE.
    • NIMBLE has a more general concept of data, so NIMBLE makes a distinction between data and constants.
    • Constants are necessary to define the model, such as nsite in for(i in 1:nsite) {...} and constant vectors of factor indices (e.g. block in mu[block[i]]).
    • Data are observed values of some variables.
    • Alternatively, one can provide a list of both constants and data for the constants argument to nimbleModel, and NIMBLE will try to determine which is which. Usually this will work, but when in doubt, try separating them (see the sketch after this list).
  3. Possibly update initial values (inits).
    • In some cases, NIMBLE likes to have more complete inits than the other packages.
    • In a model with stochastic indices, those indices should have inits values.
    • When using nimbleMCMC or runMCMC, inits can be a function, as in R packages for calling WinBUGS, OpenBUGS or JAGS. Alternatively, it can be a list.
    • When you build a model with nimbleModel for more control than nimbleMCMC, you can provide inits as a list. This sets defaults that can be over-ridden with the inits argument to runMCMC.
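
As an illustration of point 2, here is a minimal sketch (not part of the original code above; the names win.constants and win.obs are just for illustration) of how the combined win.data list built earlier could be split. Whether covariates such as wind and vegHt go in data or constants is a modelling choice, as noted above.

win.constants <- list(M = nrow(C), J = ncol(C),
                      hab = as.numeric(factor(hab)),
                      XvegHt = seq(-1, 1,, 100),  # only used for derived quantities
                      Xwind = seq(-1, 1,, 100))
win.obs <- list(C = C, wind = wind, vegHt = vegHt)

## then, e.g.:
## samples <- nimbleMCMC(code = Section6p4_code, constants = win.constants,
##                       data = win.obs, inits = inits, monitors = params,
##                       niter = 22000, nburnin = 2000, thin = 10)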



To leave a comment for the author, please follow the link and comment on their blog: R – NIMBLE.

OS Secrets Exposed: Extracting Extended File Attributes and Exploring Hidden Download URLs With The xattrs Package

Wed, 05/30/2018 - 19:19

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Most modern operating systems keep secrets from you in many ways. One of these ways is by associating extended file attributes with files. These attributes can serve useful purposes. For instance, macOS uses them to identify when files have passed through the Gatekeeper or to store the URLs of files that were downloaded via Safari (though most other browsers add the com.apple.metadata:kMDItemWhereFroms attribute now, too).

Attributes are nothing more than a series of key/value pairs. The key must be a unique character value, and it’s fairly standard practice to keep the value component under 4K. Apart from that, you can put anything in the value: text, binary content, etc.

When you’re in a terminal session you can tell that a file has extended attributes by looking for an @ sign near the permissions column:

$ cd ~/Downloads
$ ls -l
total 264856
-rw-r--r--@ 1 user staff    169062 Nov 27  2017 1109.1968.pdf
-rw-r--r--@ 1 user staff    171059 Nov 27  2017 1109.1968v1.pdf
-rw-r--r--@ 1 user staff    291373 Apr 27 21:25 1804.09970.pdf
-rw-r--r--@ 1 user staff   1150562 Apr 27 21:26 1804.09988.pdf
-rw-r--r--@ 1 user staff    482953 May 11 12:00 1805.01554.pdf
-rw-r--r--@ 1 user staff 125822222 May 14 16:34 RStudio-1.2.627.dmg
-rw-r--r--@ 1 user staff   2727305 Dec 21 17:50 athena-ug.pdf
-rw-r--r--@ 1 user staff     90181 Jan 11 15:55 bgptools-0.2.tar.gz
-rw-r--r--@ 1 user staff   4683220 May 25 14:52 osquery-3.2.4.pkg

You can work with extended attributes from the terminal with the xattr command, but do you really want to go to the terminal every time you want to examine these secret settings (now that you know your OS is keeping secrets from you)?

I didn’t think so. Thus begat the xattrs package.

Exploring Past Downloads

Data scientists are (generally) inquisitive folk and tend to accumulate things. We grab papers, data, programs (etc.) and some of those actions are performed in browsers. Let’s use the xattrs package to rebuild a list of download URLs from the extended attributes on the files located in ~/Downloads (if you’ve chosen a different default for your browsers, use that directory).

We’re not going to work with the entire package in this post (it’s really straightforward to use and has a README on the GitHub site along with extensive examples) but I’ll use one of the example files from the directory listing above to demonstrate a couple functions before we get to the main example.

First, let’s see what is hidden with the RStudio disk image:

library(xattrs)
library(reticulate) # not 100% necessary but you'll see why later
library(tidyverse)  # we'll need this later

list_xattrs("~/Downloads/RStudio-1.2.627.dmg")
## [1] "com.apple.diskimages.fsck"            "com.apple.diskimages.recentcksum"
## [3] "com.apple.metadata:kMDItemWhereFroms" "com.apple.quarantine"

There are four keys we can poke at, but the one that will help transition us to a larger example is com.apple.metadata:kMDItemWhereFroms. This is the key Apple has standardized on to store the source URL of a downloaded item. Let’s take a look:

get_xattr_raw("~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms")
##   [1] 62 70 6c 69 73 74 30 30 a2 01 02 5f 10 4c 68 74 74 70 73 3a 2f 2f 73 33 2e 61 6d 61
##  [29] 7a 6f 6e 61 77 73 2e 63 6f 6d 2f 72 73 74 75 64 69 6f 2d 69 64 65 2d 62 75 69 6c 64
##  [57] 2f 64 65 73 6b 74 6f 70 2f 6d 61 63 6f 73 2f 52 53 74 75 64 69 6f 2d 31 2e 32 2e 36
##  [85] 32 37 2e 64 6d 67 5f 10 2c 68 74 74 70 73 3a 2f 2f 64 61 69 6c 69 65 73 2e 72 73 74
## [113] 75 64 69 6f 2e 63 6f 6d 2f 72 73 74 75 64 69 6f 2f 6f 73 73 2f 6d 61 63 2f 08 0b 5a
## [141] 00 00 00 00 00 00 01 01 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 00
## [169] 00 00 00 89

Why “raw”? Well, as noted above, the value component of these attributes can store anything and this one definitely has embedded nul[l]s (0x00) in it. We can try to read it as a string, though:

get_xattr("~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms")
## [1] "bplist00\xa2\001\002_\020Lhttps://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio-1.2.627.dmg_\020,https://dailies.rstudio.com/rstudio/oss/mac/\b\vZ"

So, we can kinda figure out the URL but it’s definitely not pretty. The general practice of Safari (and other browsers) is to use a binary property list to store metadata in the value component of an extended attribute (at least for these URL references).

There will eventually be a native Rust-backed property list reading package for R, but we can work with that binary plist data in two ways: first, via the read_bplist() function that comes with the xattrs package and wraps Linux/BSD or macOS system utilities (which are super expensive since it also means writing out data to a file each time) or turn to Python which already has this capability. We’re going to use the latter.

I like to prime the Python setup with invisible(py_config()) but that is not really necessary (I do it mostly b/c I have a wild number of Python installs — don’t judge — and use the RETICULATE_PYTHON env var for the one I use with R). You’ll need to install the biplist module via pip3 install biplist or pip install biplist depending on your setup. I highly recommend using Python 3.x vs 2.x, though.

biplist <- import("biplist", as="biplist")

biplist$readPlistFromString(
  get_xattr_raw(
    "~/Downloads/RStudio-1.2.627.dmg",
    "com.apple.metadata:kMDItemWhereFroms"
  )
)
## [1] "https://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio-1.2.627.dmg"
## [2] "https://dailies.rstudio.com/rstudio/oss/mac/"

That's much better.

Let's work with metadata for the whole directory:

list.files("~/Downloads", full.names = TRUE) %>%
  keep(has_xattrs) %>%
  set_names(basename(.)) %>%
  map_df(read_xattrs, .id="file") -> xdf

xdf
## # A tibble: 24 x 4
##    file            name                                  size contents
##  1 1109.1968.pdf   com.apple.lastuseddate#PS               16
##  2 1109.1968.pdf   com.apple.metadata:kMDItemWhereFroms   110
##  3 1109.1968.pdf   com.apple.quarantine                    74
##  4 1109.1968v1.pdf com.apple.lastuseddate#PS               16
##  5 1109.1968v1.pdf com.apple.metadata:kMDItemWhereFroms   116
##  6 1109.1968v1.pdf com.apple.quarantine                    74
##  7 1804.09970.pdf  com.apple.metadata:kMDItemWhereFroms    86
##  8 1804.09970.pdf  com.apple.quarantine                    82
##  9 1804.09988.pdf  com.apple.lastuseddate#PS               16
## 10 1804.09988.pdf  com.apple.metadata:kMDItemWhereFroms   104
## # ... with 14 more rows

count(xdf, name, sort=TRUE)
## # A tibble: 5 x 2
##   name                                     n
## 1 com.apple.metadata:kMDItemWhereFroms     9
## 2 com.apple.quarantine                     9
## 3 com.apple.lastuseddate#PS                4
## 4 com.apple.diskimages.fsck                1
## 5 com.apple.diskimages.recentcksum         1

Now we can focus on the task at hand: recovering the URLs:

list.files("~/Downloads", full.names = TRUE) %>%
  keep(has_xattrs) %>%
  set_names(basename(.)) %>%
  map_df(read_xattrs, .id="file") %>%
  filter(name == "com.apple.metadata:kMDItemWhereFroms") %>%
  mutate(where_from = map(contents, biplist$readPlistFromString)) %>%
  select(file, where_from) %>%
  unnest() %>%
  filter(!where_from == "")
## # A tibble: 15 x 2
##    file                where_from
##  1 1109.1968.pdf       https://arxiv.org/pdf/1109.1968.pdf
##  2 1109.1968.pdf       https://www.google.com/
##  3 1109.1968v1.pdf     https://128.84.21.199/pdf/1109.1968v1.pdf
##  4 1109.1968v1.pdf     https://www.google.com/
##  5 1804.09970.pdf      https://arxiv.org/pdf/1804.09970.pdf
##  6 1804.09988.pdf      https://arxiv.org/ftp/arxiv/papers/1804/1804.09988.pdf
##  7 1805.01554.pdf      https://arxiv.org/pdf/1805.01554.pdf
##  8 athena-ug.pdf       http://docs.aws.amazon.com/athena/latest/ug/athena-ug.pdf
##  9 athena-ug.pdf       https://www.google.com/
## 10 bgptools-0.2.tar.gz http://nms.lcs.mit.edu/software/bgp/bgptools/bgptools-0.2.tar.gz
## 11 bgptools-0.2.tar.gz http://nms.lcs.mit.edu/software/bgp/bgptools/
## 12 osquery-3.2.4.pkg   https://osquery-packages.s3.amazonaws.com/darwin/osquery-3.2.4.p…
## 13 osquery-3.2.4.pkg   https://osquery.io/downloads/official/3.2.4
## 14 RStudio-1.2.627.dmg https://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio…
## 15 RStudio-1.2.627.dmg https://dailies.rstudio.com/rstudio/oss/mac/

(There are multiple URL entries due to the fact that some browsers preserve the path you traversed to get to the final download.)

Note: if Python is not an option for you, you can use the hack-y read_bplist() function in the package, but it will be much, much slower and you'll need to deal with an ugly list object vs some quaint text vectors.

FIN

Have some fun exploring what other secrets your OS may be hiding from you and if you're on Windows, give this a go. I have no idea if it will compile or work there, but if it does, definitely report back!

Remember that the package lets you set and remove extended attributes as well, so you can use them to store metadata with your data files (they don't always survive file or OS transfers but if you keep things local they can be an interesting way to tag your files) or clean up items you do not want stored.
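
For example, a rough sketch of tagging one of your own files. Note that the setter/remover function names and argument order below are assumptions based on the package README rather than something shown in this post, so verify them against the documentation before relying on them:

# Assumed API: set_xattr(path, name, value) / rm_xattr(path, name) -- check the
# xattrs README for the exact names and signatures
set_xattr("analysis.csv", "user.project", "q2-report")   # tag the file
list_xattrs("analysis.csv")                              # the new key shows up
get_xattr("analysis.csv", "user.project")                # read the tag back
rm_xattr("analysis.csv", "user.project")                 # clean it up again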


To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

New Version of ggplot2

Wed, 05/30/2018 - 17:30

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

I just received a note from Hadley Wickham that a new version of ggplot2 is scheduled to be submitted to CRAN on June 25. Here’s what choroplethr users need to know about this new version of ggplot2.

Choroplethr Update Required

The new version of ggplot2 introduces bugs into choroplethr. In particular, choroplethr does not pass R CMD check when the new version of ggplot2 is loaded. I am planning to submit a new version of choroplethr to CRAN that addresses this issue prior to June 25.

This change is the third or fourth time I’ve had to update choroplethr in recent years as a result of changes to ggplot2. This experience reminds me of one of the first lessons I learned as a software engineer: “More time is spent maintaining old software than writing new software.”

Simple Features Support

One of the most common questions I get about choroplethr is whether I intend to add support for interactive maps. My answer has always been “I can’t do that until ggplot2 adds support for Simple Features.” Thankfully, this new version of ggplot2 introduces that support!
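
For readers who haven’t seen it yet, here is a minimal sketch of what the new Simple Features support looks like; this example is illustrative and not from the original announcement, and uses the North Carolina data bundled with the sf package:

library(ggplot2)  # version 3.0.0 or later
library(sf)

# read an sf object and map one of its columns with the new geom_sf()
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

ggplot(nc) +
  geom_sf(aes(fill = AREA))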

Currently all maps in the choroplethr ecosystem are stored as ggplot2 “fortified” dataframes. This is a format unique to ggplot2. Storing the maps in this format makes it possible to render the maps as quickly as possible. The downside is that:

  1. ggplot2 does not support interactive graphics.
  2. The main interactive mapping library in CRAN (leaflet) cannot render fortified data frames. It can only render maps stored as Shapefiles or Simple Features.

Once ggplot2 adds support for Simple Features, I can begin work on adding interactive map support to choroplethr. The first steps will likely be:

  1. Updating choroplethr to be able to render maps stored in the Simple Features format.
  2. Migrating choroplethr’s existing maps to the Simple Features format.

After that, I can start experimenting with adding interactive graphics support to choroplethr.

Note that Simple Features is not without its drawbacks. In particular, many users are reporting performance problems when creating maps using Simple Features and ggplot2. I will likely not begin this project until these issues have been resolved.

Thoughts on the CRAN Ecosystem

This issue has caused me to reflect a bit about the stability of the CRAN ecosystem. 

ggplot2 is used by about 1,700 packages. It’s unclear to me how many of these packages will have similar problems as a result of this change to ggplot2. And of the impacted packages, how many have maintainers who will push out a new version before June 25?

And ggplot2, of course, is just one of many packages on CRAN. This issue has the potential to occur whenever any package on CRAN is updated.

This issue reminded me that CRAN has a complex web of dependencies, and that package maintainers are largely unpaid volunteers. It seems like a situation where bugs can easily creep into an end user’s code.



To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com.

Does financial support in Australia favour residents born elsewhere? Responding to racism with data

Wed, 05/30/2018 - 17:09

(This article was first published on blogR, and kindly contributed to R-bloggers)

Seeing a racist outburst made me wonder whether the Australian Government unfairly supports people based on their background. Using data from the Australian Government and Bureau of Statistics, I couldn’t find compelling evidence of this being true. Don’t believe me? Read on and see what you make of the data.

 Australian racism goes viral, again

Australian racism went viral again this year when a man was filmed abusing staff at Centrelink, which delivers social security payments and services to Australians (see story here). The man yells that he didn’t vote for multiculturalism and that Centrelink is supporting everyone except “Australians”. It is distressing to watch, especially as someone whose ancestors found a home in Australia having escaped persecution. He can’t take it back, but the man did publicly apologise and may be suffering from mental illness (see story here).

This topic is still polarising. Many of us want to vilify this man while others may applaud him. But hate begets hate, and fighting fire with fire makes things worse. As a data scientist, the best way I know to respond to the assumptions and stereotypes that fuel racism is with evidence. So, without prejudice, let us investigate the data and uncover to whom the Australian government provides support through Centrelink.

 Centrelink supports Australians, so who are we talking about?

With rare exceptions, Centrelink supports Australian Residents living in Australia (see here and here). So, the claim that Centrelink supports everyone but Australians is misguided. Perhaps the reference to “multiculturalism” can direct us to a more testable question. Centrelink offers support to Australian residents who can be born anywhere in the world. So in this article, I’ll use publicly accessible data to investigate differences in support given to residents born in Australia or elsewhere.

 Estimated Residential Population

The Figure below shows changes in Australia’s Estimated Residential Population, which is an official value published by the Australian Bureau of Statistics and used for policy formation and decision making.

The residential population has been increasing from about 17 million in 1992 to over 24 million in 2016. In contrast, the percentage of residents who are Australian-born has decreased from 76.9% to 71.5%. This will guide our sense of whether Centrelink payments are unbiased.

As a side note, Census statistics reported that the percentage of Australian-born residents in 2016 was 66.7% (4.8% lower than the official estimate above). This discrepancy is the result of the Australian Bureau of Statistics making adjustments that you can learn about here.

 All Centrelink Payments

Centrelink data is published quarterly and has included country-of-birth breakdowns since December 2016 (which aligns with the last available population data reported above). At this time, Centrelink made 15 million payments to Australian residents.

In December 2016, 71.5% of Australia’s Estimated Residential population was Australian-born. Comparably, 68.8% of all Centrelink payments went to Australian-born residents.

The data shows that Centrelink payments are made to residents born in Australia or elsewhere in approximately the same proportions as these groups are represented in the population. The difference of a couple of percent indicates that slightly fewer payments were going to Australian-born residents than we’d expect. As we’ll see in the following section, this difference can be almost wholly accounted for by the Age Pension. Still, the difference is small enough to negate the claim that Centrelink substantially favours residents born outside Australia.

 Breakdown by Type

It’s also possible to break down these total numbers into the specific payment types shown below (detailed list here).

It’s expected that these payment types, which support specific needs, will show biases in favour of certain groups. For example, ABSTUDY supports study costs and housing for Aboriginal or Torres Strait Islander residents. This should mostly go to Australian-born residents. To investigate, we can extend the Figure above to include the number of Australian-born recipients:

Looking at this Figure, most Centrelink payments fall along the dotted line, which is what we’d expect from a fair system (if 71.5% of the recipients were Australian-born).

The outlier is the Age Pension, which falls below the line. More recipients of the Age Pension are born outside Australia than is reflected in the total population. I cannot say from this data alone why there is some bias in the Age Pension and perhaps a knowledgeable reader can comment. Nonetheless, this discrepancy is large enough that removing the Age Pension from consideration results in 70.5% of all other Centrelink payments going to Australian-born residents – almost exactly the proportion in the population.

 Ignoring Total Numbers

The Figure below shows the percentage of Australian-born recipients for each payment type, ignoring totals.

At the upper end of the scale, we can see Australian-born recipients being over-represented for ABSTUDY and Youth Allowance payments. At the lower end, residents who are born outside Australia are over-represented for Wife and Widow pension and allowance.

These payments with large biases (in either direction) have some common features. They have very specific eligibility criteria and are among the least-awarded services (shown in earlier Figures). Furthermore, the granting of payments to new recipients has been stopped in some cases such as the Wife Pension.

These findings are consistent with the expectation that specific types of payments should be biased in specific ways. It also shows that substantial biases only arise for specific payments that are awarded to very few individuals.

 Concluding remarks

In response to a racist outburst, I sought out publicly available data to investigate whether there was evidence that the Australian Government unfairly supported residents based on their country of origin. I found that the percentage of residents born outside Australia has increased over time. However, with the minor exception of the Age Pension (which the outraged man was not eligible for), residents born in Australia or elsewhere were fairly represented in the total number of Centrelink payments.

I’d like to thank the Australian Government and Australian Bureau of Statistics for publicising this data and making it possible for me to respond to racism with evidence. If you’d like to reproduce this work or dig into the data yourself, everything from explaining where I got the data to create this article is freely available on GitHub. You can also keep in touch with me on LinkedIn or by following @drsimonj on Twitter.


To leave a comment for the author, please follow the link and comment on their blog: blogR.

Bio7 2.8 Released

Wed, 05/30/2018 - 16:41

(This article was first published on R – Bio7 Website, and kindly contributed to R-bloggers)

30.05.2018

A new and improved release of Bio7 is available. The new Bio7 2.8 release comes with a plethora of new R features and bugfixes.

Release Notes:

General:

  • Updated Eclipse RCP to 4.7.3
  • Windows R update to version 3.5.0
  • ImageJ plugin updated to version 1.52d7
  • Added the ImageJ Edit perspective to the default opened perspectives
  • The R perspective is now selected as the opened perspective after startup of Bio7
  • Added an option to save all editors on close (default enabled on MacOSX)
  • Improved the main ‘Scripts’ menu to allow nested custom scripts in folders and subfolders which form new submenus
  • Improved the Table and R-Shell context scripts menu
  • Added some example scripts to the new scripts menu
  • Added a script to open the default script location in the OS file explorer
  • Simplified the different script locations (reduced to four locations in the Bio7 preferences)
  • Spatial import, export scripts moved to the main ‘Scripts’ and R-Shell script context menu (Deleted import, export script locations)

R:

  • Simplified the Bio7 R-Shell interface
  • Added a new view for some R templates
  • Code completion from loaded R packages can now be imported from the R-Shell or the R editor (both completions will be updated!)
  • Improved the reload code completion by only considering R functions (cleaner separation of functions and datasets)
  • The R-Shell context menu can now be extended dynamically with custom scripts. Folders and subfolders form new submenus.
  • Improved, bugfixed and extended the ‘Evaluate Line’ action, which was renamed to ‘Evaluate Line or Selection’ and now interprets selected R code, too
  • Improved MacOSX ImageJ interaction (editor hover focus)
  • Improved the call of the default plot device (quartz) in a native R connection on MacOSX (now opened by default when, e.g., plotting, debugging, etc.)
  • Added a new menu to install Rserve for R > 3.5.0
  • New shutdown option to execute R code at shutdown (e.g., to save the current R workspace – a reload with the already existing startup option is possible, too!)
  • Improved the general error messages of the R-Shell view and the R editor evaluation. Now much simpler and printed cleaner!
  • Added an option to print the error stream of the Bio7 console to the R-Shell view (to the right output panel to get an output if the console is hidden!)
  • Fixed a bug with the shiny execution (removed annoying dialog)
  • Fixed many other bugs (unnecessary error messages, execution errors, MacOSX bugs, etc.)
  • R-Shell code completion improved (see editor code completion improvements below)
  • Removed the use of temporary files of the R-Shell and R editor code completion
  • Added a preference option to add arguments to the source command
  • Multiple selected R-Shell variables are now available in the R workspace (‘.r_shell_vars’ – a character vector) for the use in custom R scripts (e.g., to extend the R-Shell script context menu!)

R Editor:

  • Added a detection/quickfix function to install missing packages when a package load definition is in the editor file but the package is not installed

  • Added a quickfix to add a library definition when an unknown function is called (not defined and loaded but package with function is installed)
  • Added support for the magrittr piping operator in code completion to autocomplete a given dataframe in the workspace (column completion)

  • Added code completion support for dataframes columns and rows, named list elements and named arrays (of arbitrary dimension)

  • Improved code completion to display current workspace variables
  • Classes of R objects are now displayed in the code completion context dialog
  • Editor variables and functions are now emphasized in code completion, too
  • Added a directory dialog to easily create a directory string for the editor file
  • Created new submenus for the organization of the context menu
  • Improved the hover dialog (improved offset)
  • Improved the completion context of R objects (current workspace objects)
  • The styler package can now be used for an improved code formatting (requires UTF-8 Bio7 workspace on Windows – see preferences!)
  • Improved code completion for S3 and S4 objects (correctly display current cursor completion)

ImageJ:

  • Updated ImageJ plugin to version 1.52d7
  • Updated ImageJ2 libraries
  • Added new ImageJ macro functions to the code completion of the macro editor
  • Added the ImageJ Edit perspective to the default opened perspectives
  • Improved the editor hover activation on MacOSX
  • Improved the ImageJ MacOSX touchpad resizing of the ImageJ panel

Java:

  • Updated JRE to 1.8.71
  • Improved the creation of Java projects and project imports (no annoying dialog anymore)
  • Improved the ‘Add Selected Libraries’ preference to add multiple Java libraries for dynamic compilation (e.g., to compile more embedded Eclipse library functions)
  • Now ImageJ2 libs have to be added to the dynamic compilation libraries for compilation (removed as default libraries)
  • Added a key shortcut (Ctrl + ‘+’) to increase the SceneBuilder panel (2x)
Download and Installation:

Windows:

Just download the *.zip distribution file from https://bio7.org and unzip it in your preferred location. Bio7 comes bundled with a Java Runtime Environment, R and Rserve distribution and works out of the box.

Linux:

Download and extract the installation file from https://bio7.org.
For Linux you have to install R and Rserve (see Rserve installation below!).

MacOSX:

Download and extract the installation file from https://bio7.org.

If you start Bio7, a warning or error can occur because of changes in how Apple treats signatures! To allow Bio7 to start, see these instructions for Yosemite and Sierra:

OS X Yosemite: Open an app from an unidentified developer

macOS Sierra: Open an app from an unidentified developer

If you still have problems with Sierra, see this solution!

In addition for MacOSX you have to install R and Rserve (see below!).

Linux and MacOSX Rserve (compiled for cooperative mode) installation:

To install Rserve, open the R shell and then execute the menu action “Options -> Install Rserve (coop. mode) for R …” for different R versions. This will download and install Rserve in your default R library location, see video below (please make sure that your default Linux R library install location has write permissions!). In cooperative mode only one connection at a time is allowed (which is what we want for this desktop application) and all subsequent connections share the same namespace (default on Windows)!

Installation of Useful R Packages

R packages which are useful in combination with Bio7 can easily be installed with the R-Shell context menu “Packages” action:

R-Shell context menu->Packages->Install_Default_R_Packages

Bio7 Documentation

For more information about Bio7, please consult the soon-to-be-updated Bio7 User Guide.

A plethora of Bio7 video tutorials can be found on YouTube.

 


To leave a comment for the author, please follow the link and comment on their blog: R – Bio7 Website.

Demystifying life expectancy calculations by @ellis2013nz

Wed, 05/30/2018 - 14:00

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

Life expectancy and death rates

A chance comment on Twitter about historical increases in life expectancy (in the USA) with fairly constant death rates got me wondering about the exact calculations of these things so I did some Googling. Warning on what follows – I am strictly an amateur in demographics, and was deliberately working things out from basics as a learning exercise, so there are possibly better ways to do some of the things below.

According to Wikipedia which is always good at this sort of thing, life expectancy can be calculated two ways:

  • by observing a cohort until they are all dead, and taking the average age. Obviously this only works with historical data. This is referred to as “cohort life expectancy at birth”.
  • by taking a set of death rates by age bracket, and calculating the average age of death of a hypothetical cohort of people who move through life with those death rates. This is called “period life expectancy at birth” and is the method used for reporting by national statistics offices.

By convention, period life expectancy applies today’s death rates to an infant born today. Hence, there’s no estimation of those death rates changing; we assume that by the time this infant is 60 years old, the death rate of 60 year olds in 2078 will be the same as now. This has some rather obvious flaws, but it’s also easy to see why that approach is convenient, particularly for standardising data across countries. It means that the reported life expectancies aren’t really the best estimates of how long a baby born today will live conditional on standard expectations about civilization continuing at all, because we actually expect death rates to continue to decline (although some days I wonder). But it does mean that “life expectancy” is very clearly defined and with minimum discretion needed in the calculation. So long as we remember that period life expectancy is really a summary statistic of today’s death rates by age, not really how long people are expected to live but a sort of hypothetical life length, we are ok.
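
To make that concrete, here is a tiny toy illustration (with made-up numbers, not real data) of what a period calculation does: apply a fixed set of age-specific death probabilities to a hypothetical cohort and take the average age at death, weighted by the proportion dying at each age.

# Three made-up ages (0, 1, 2) with death probabilities; everyone dies by age 2
death_prob <- c(0.01, 0.02, 1)
alive  <- cumprod(c(1, 1 - death_prob))  # proportion surviving to each age
deaths <- -diff(alive)                   # proportion dying at each age
sum(deaths * 0:2)                        # "life expectancy" of about 1.96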

Life expectancy can go up while crude death rates are also going up (or vice versa) because of changing age composition in a population. All the age-specific death rates (and hence any age-adjusted death rate) might be going down, but if more and more people are in the higher-rate age brackets the overall crude death rate might be increasing. We can see this happening for Japan in the image below:

Here’s the R code that grabbed the data for that image and built it. ggplot2 aficionados might be interested in how I’ve used geom_blank to increase the scales a bit beyond the defaults, which were too narrow and had my text banging against the axes (a common problem). It’s a bit of a hack but it works nicely:

library(frs) # if necessary, install with devtools::install_github("ellisp/frs-r-package/pkg")
library(tidyverse)
library(scales)
library(WDI)

# Use WDIsearch to find the indicators we want
# View(WDIsearch("life"))
death_rate <- WDI(indicator = "SP.DYN.CDRT.IN", start = 1960, end = 2100)
life_exp <- WDI(indicator = "SP.DYN.LE00.IN", start = 1960, end = 2100)

selected_countries <- c("United Arab Emirates", "Suriname", "Georgia", "Barbados", "Kiribati",
                        "Moldova", "Ghana", "Japan", "New Zealand")

# connected scatter plot:
death_rate %>%
  filter(country %in% selected_countries) %>%
  inner_join(life_exp, by = c("iso2c", "country", "year")) %>%
  # take out NAs which confuse the geom_text call later:
  filter(!is.na(SP.DYN.LE00.IN)) %>%
  # order country by life expectancy so graphic facets make some sense:
  mutate(country = fct_reorder(country, SP.DYN.LE00.IN)) %>%
  ggplot(aes(x = SP.DYN.CDRT.IN, y = SP.DYN.LE00.IN, colour = country)) +
  geom_path() +
  # add labels of the year at max and min points
  geom_text(aes(label = ifelse(year %in% range(year), year, "")), colour = "grey30", size = 3) +
  facet_wrap(~country, scales = "free") +
  # these next two geom_blank calls are a hack to make sure enough space is given for the labels which
  # otherwise crash into the axes, and can't be easily controlled with scale limits:
  geom_blank(aes(x = SP.DYN.CDRT.IN * 1.1, y = SP.DYN.LE00.IN * 1.05)) +
  geom_blank(aes(x = SP.DYN.CDRT.IN * 0.9, y = SP.DYN.LE00.IN * 0.95)) +
  theme(legend.position = "none") +
  labs(x = "Crude death rate (per 1,000 people)",
       y = "Life Expectancy",
       caption = "Source: World Bank World Development Indicators") +
  ggtitle("Differing relationship of crude death rate and life expectancy",
          "Countries with aging populations (eg Georgia, Japan, in the below) can experience increases in both simultaneously.")

Calculating life expectancy

To be sure I understood how it worked, I had a go at estimating some life expectancies myself. I started with some French death rates per age group from 2015, because these were the most convenient to hand. Death rates are reported as deaths per 1,000 people per year. Here’s how they look when you convert them into the probability of surviving the year at any particular age:

Here’s how that was drawn. The french_death_rates_2015 object comes from the frs package where I store miscellaneous things to help out with this blog.

french_death_rates_2015 %>%
  gather(sex, value, -age) %>%
  ggplot(aes(x = age, y = (1000 - value) / 1000, colour = sex)) +
  geom_line() +
  labs(x = "Age", y = "Probability of surviving to the next year", colour = "",
       caption = "https://www.ined.fr/en/everything_about_population/data/france/deaths-causes-mortality/mortality-rates-sex-age/") +
  coord_cartesian(ylim = c(0.8, 1), xlim = c(0, 100)) +
  ggtitle("French death rates in 2015") +
  theme(legend.position = c(0.6, 0.6))

To actually convert these probabilities into a life expectancy, we need to estimate the proportion of our hypothetical population that will die at each age. There’s lots of different ways you might do this but the one that came to mind to me was:

  • Create interpolated death rates for each integer age (because typically death rates are given for a bracket of ages, not every single age)
  • Estimate the proportion still alive at any age, starting with 1 at birth and 0 at some arbitrary end point (I chose 150 which seems reasonable). This is the cumulative product of the yearly survival rates, which are of course 1 – death rate (where death rate has been converted to a probability rather than a factor out of 1,000).
  • Estimate the difference between those proportions for each year, which gives you the proportion of the total population that died in that year.
  • Take the average of all of our ages, weighted by the proportion of the population that died each year.

Now that I write it up this seems a bit more involved than I would have thought. It’s possible there’s a simpler way. Anyway, here’s the function that implements my approach

#' @param age vector of ages in years
#' @param rate death rate in deaths per 1000 people alive at the given age. Must be the same length as age.
#' @param data optionally, a data frame or matrix where the first column is used as age and the second column as rate
#' @examples
#' life_expectancy(data = french_death_rates_2015[ , c("age", "female")])
#' life_expectancy(age = french_death_rates_2015$age, rate = french_death_rates_2015$male)
life_expectancy <- function(age = data[, 1], rate = data[ , 2], data = NULL){
  if(length(age) != length(rate) | !class(age) %in% c("numeric", "integer") |
     !class(rate) %in% c("numeric", "integer")){
    stop("age and rate should be integer or numeric vectors of the same length")
  }
  if(max(rate) != 1000) {
    stop("The highest value of rate should be exactly 1000, indicating the age at which everyone is guaranteed to die at or before.")
  }

  # interpolated version so we have a death rate for every age:
  dr <- stats::approx(age, rate, xout = 0:max(age))

  prop_alive <- c(1, cumprod((1000 - dr$y) / 1000))
  deaths <- -diff(prop_alive)
  return(sum(deaths * 0:max(age)))
}

With my original data it gives plausible results: 85.3 for females and 79.1 for males. These differ minutely from the published figures for France in 2015 of 85.4 and 79.4; I imagine the difference is in how the age brackets are treated. I made some fairly cavalier simplifications in using the middle year of the age bracket as a reference point and interpolating between those points, which on reflection will cause some small problems when there are rapid changes in mortality (the most likely being from the age 0-1 bracket to the age 2-5 bracket).
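
For reference, those two numbers come from calls like these (the same calls shown in the roxygen @examples above):

life_expectancy(data = french_death_rates_2015[ , c("age", "female")])  # roughly 85.3
life_expectancy(data = french_death_rates_2015[ , c("age", "male")])    # roughly 79.1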

Impact on life expectancy of changing a single age groups death rate

It’s interesting to play with how changing death rates at a particular part of the life cycle change the life expectancy calculation. Infant mortality is the biggest driver. The intuition behind this is that everyone who is born alive gets a chance to survive the first year, so an improvement here impacts on the biggest part of our population. If you improve the odds of surviving from 110 to 111 it has minimal impact on life expectancy because most people are already dead at that point.

So here’s what happens if we make infant mortality (ie death per thousand in the first year of life) arbitrarily small or large, using French males in 2015 as a reference point. The blue dot represents the current actual death rate in first year of life and overall life expectancy; the rest of the line shows what happens for hypothetical different death rates (which for France with its low infant mortality, historically and internationally speaking, means higher ones):

Obviously, if we say 1,000 out of 1,000 people die in the first year, our life expectancy becomes zero. More realistically, if death rates in the first year went up to 250 out of 1,000 (which would be around the worst current day level, but well within historical ranges), life expectancy comes down from 79 to around 50, despite death rates at all other ages staying the same as in 2015 France.

On the other hand, what if we make a spike in deaths at age 18, perhaps due to a strange disease or social custom that makes this a uniquely hazardous age (with the hazard going down to normal levels at age 19)? Even if the entire population dies at this age, the life expectancy is still 17 or so; and improvements in mortality rates for 18-year olds accordingly have less relative impact on life expectancy than was the case when we “improved” infant mortality:

For the older population, the issue is more marked again:

Finally, consider the case where a medical advance guarantees a uniform yearly survival rate for anyone who reaches 85 until they turn 150:

Even if we make that yearly death rate 0 (ie all 85 year olds are guaranteed to reach 150), life expectancy only gets up to about 91.

The code for those simulations is below. It’s a bit repetitive, but with the fiddles I wanted for labels and so on, and with limited future re-use expected, it didn’t seem worth writing a function to do this job.

m <- french_death_rates_2015[ , c("age", "male")]
the_age <- which(m$age == 0) # row in the data frame for this age of interest
current_level <- life_expectancy(data = m)

data.frame(
  dr = 0:1000,
  le = sapply(0:1000, function(i){
    m[1, 2] <- i # m is modified only within the environment of this function
    life_expectancy(data = m)
  })
) %>%
  ggplot(aes(x = dr, y = le)) +
  geom_line() +
  annotate("point", french_death_rates_2015[the_age, "male"], y = current_level,
           size = 3, colour = "steelblue") +
  ggtitle("Impact of different infant mortality rates on life expectancy",
          "Keeping death rates at all other ages the same, at levels for French males in 2015") +
  labs(x = "Deaths in first year of life per thousand live births",
       y = "Life expectancy at birth for overall cohort") +
  ylim(c(0, 100))

the_age <- which(m$age == "17.5")

data.frame(
  dr = 0:1000,
  le = sapply(0:1000, function(i){
    m[the_age, 2] <- i
    m <- rbind(m, data.frame(age = c(16.5, 18.5), male = c(0.3, 0.3)))
    life_expectancy(data = m)
  })
) %>%
  ggplot(aes(x = dr, y = le)) +
  geom_line() +
  annotate("point", french_death_rates_2015[the_age, "male"], y = current_level,
           size = 3, colour = "steelblue") +
  ggtitle("Impact of different death rates in early adulthood on life expectancy",
          "Keeping death rates at all other ages the same, at levels for French males in 2015") +
  labs(x = "Deaths at age 17.5 per thousand people living to 16.5",
       y = "Life expectancy at birth for overall cohort") +
  ylim(c(0, 100))

the_age <- which(m$age == "85")

data.frame(
  dr = 0:1000,
  le = sapply(0:1000, function(i){
    m[the_age, 2] <- i
    m <- rbind(m, data.frame(age = c(84, 86), male = c(78, 78)))
    life_expectancy(data = m)
  })
) %>%
  ggplot(aes(x = dr, y = le)) +
  geom_line() +
  annotate("point", french_death_rates_2015[the_age, "male"], y = current_level,
           size = 3, colour = "steelblue") +
  ggtitle("Impact of different death rates in old age on life expectancy",
          "Keeping death rates at all other ages the same, at levels for French males in 2015") +
  labs(x = "Deaths at age 85 per thousand people living to 84",
       y = "Life expectancy at birth for overall cohort") +
  ylim(c(0, 100))

the_age <- which(m$age == "85")
mr <- nrow(m) - 1

data.frame(
  dr = 0:1000,
  le = sapply(0:1000, function(i){
    m[the_age:mr, 2] <- i
    life_expectancy(data = m)
  })
) %>%
  ggplot(aes(x = dr, y = le)) +
  geom_line() +
  annotate("point", french_death_rates_2015[the_age, "male"], y = current_level,
           size = 3, colour = "steelblue") +
  ggtitle("Impact of different death rates in old age on life expectancy",
          "Keeping death rates at all other ages the same, at levels for French males in 2015") +
  labs(x = "Deaths at age 85 and all older ages until 150 (when all die), per thousand people living to 84",
       y = "Life expectancy at birth for overall cohort") +
  ylim(c(0, 100))

Hmm, ok, interesting. I have some more thoughts about the arithmetic of demography, but they can come in a subsequent post.


To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.

Summertime Fun: R Graphs with LaCroixColoR, Magick Animations + More!

Wed, 05/30/2018 - 04:48

(This article was first published on Little Miss Data, and kindly contributed to R-bloggers)

Now that we are back from the Memorial day long weekend, does anyone else feel like we’ve kicked off summer?  I know that summer doesn’t begin officially until June 21, but I am already in the spirit.  It might seem strange to “be in the spirit” since I’m living in Austin, Texas and we get temperatures up to 95-99 degrees Fahrenheit (35 to 37 degrees Celsius).  But I think as a  Canadian, I’ve been programmed to believe summer is fun.  Heat is FUN darn it!   I mean what is really to complain about?  We’ve got AC, water parks, and LaCroix to keep us cool.  

Image courtesy of medium.com

Lacroix is for Summer

To get in the mood for summer, this blog is LaCroix themed!  We will be exploring the stock prices of National Beverage Corp (FIZZ) which is the LaCroix parent company.  Data was provided courtesy of nasdaq.com.  We will be exploring the data in the color palette of my favorite LaCroix flavor; Peach Pear.  This is made possible with the new LaCroixColoR package created by Johannes Björk.  We will then be adding appropriate gif stickers to the graphs, courtesy of giphy.com and the magick package created by Jeroen Ooms.  Note that when I was creating this blog, I referenced both the magick tutorial and an awesome “how to” blog on adding animations by Daniel Hadley

Set Up

The first thing we need to do is install and load all required packages for our summertime fun work with R.  These are all of the install.packages() and library() commands below.  Note that some packages can be installed directly via CRAN and some need to be installed from github via the devtools package.  I wrote a blog on navigating various R package install issues here. We then download the data set from my github profile through the fread() function. We can then create a few helper date columns which make it easier to display in the graph. 

 

######## Set Up Work ########

#Install necessary packages
install.packages("ggplot2")
install.packages("magrittr")
install.packages("magick")
install.packages("devtools")
install.packages("curl")
install.packages("data.table")
install.packages("lubridate")
library(devtools)
install_github("johannesbjork/LaCroixColoR")

#Load necessary packages
library(LaCroixColoR)
library(ggplot2)
library(magick)
library(magrittr)
library(curl)
library(data.table)
library(lubridate)

#Bring in the nasdaq data from Nasdaq.com for the LaCroix parent company:
#National Beverage Corp
#https://www.nasdaq.com/symbol/fizz/historical
df <- fread('https://raw.githubusercontent.com/lgellis/MiscTutorial/master/lacroix/FIZZ_Stock.csv')
attach(df)
df$mdy <- mdy(df$date)
df$month <- month(df$mdy, label = TRUE)
df$year <- as.factor(year(df$mdy))
df$wday <- wday(df$mdy, label = TRUE)
head(df)
attach(df)

 

Create the LaCroix plot with the LaCroix color palette  

Next, we want to create our graph which plots the FIZZ stock price over time.  To give the graph a more appropriate LaCroix-style flair, we will employ the LaCroixColoR package.  We will call on the lacroix_palette function when assigning the colors through the scale_color_manual function in ggplot2.  When calling the lacroix_palette function, we need to specify a flavor and of course I picked my favorite flavor: Peach Pear!

 

######## Image One - The LaCroix Curve ########

# Create base plot
fizz <- ggplot(data=df, aes(x=mdy, y=close, color=year)) +
  geom_point() +
  ggtitle("FIZZ Stock (LaCroix Parent Company)") +
  scale_color_manual(values=lacroix_palette("PeachPear", type = "continuous", n=11)) +
  labs(x = "Date", y="Closing Stock Price")

# Save the base plot
ggsave(fizz, file="fizz.png", width = 5, height = 4, dpi = 100)
plot <- image_read("fizz.png")
plot

 

 

Adding Animations

Now it’s time to inject even more fun into these graphs – ANIMATIONS!  In case you are wondering, yes this is the most important thing to spend your time doing.  Come on this is fun!  Who doesn’t want to have a cartoon character climbing their graph? Or have Vincent Vega pointing at your x-axis?  This type of activity is known as data nerd humor, or graph comedy.  

In this blog, we are breaking the graph into three parts of the stock journey and adding an animation for each phase: initial stock climb, climbing the mountain and the sad decline.  

To add an animation to your saved plot, you need to complete 5 key steps:  

  • Bring in your gif and apply any transformations (scale, rotation, etc)
  • Set the background image
  • Flatten the animation and background into frames
  • Turn the frame into an animation
  • Print and save
Animation 1 –  The curve starts climbing

For this image, I wanted to bring in a sticker with a lot of zest!  This lucky guy just realized his stock is climbing.  He is stoked and doing a happy dance.  

 

#Bring in the gif - scale and rotate
laCroixAnimation <- image_read("https://media.giphy.com/media/h7ZuxGCxXTRMQ/giphy.gif") %>%
  image_scale("150") %>%
  image_rotate(-30)
laCroixAnimation

# Combine the plot and animation
# Set the Background image
background <- image_background(image_scale(plot, "500"), "white", flatten = TRUE)

# Combine and flatten frames
frames <- image_composite(background, laCroixAnimation, offset = "+150+120")

# Turn frames into animation
animation <- image_animate(frames, fps = 5)
print(animation)

#Save gif
image_write(animation, "laCroixImage1.gif")

Animation 2 – Climbing the Mountain

For this image, I wanted to bring in a sticker with an earned sense of confidence.  This stock climb is steep but the little buddy is sticking with it.  He deserves to give a triumphant wave.  

 

#Use base plot previously created

#Bring in the gif - scale and rotate
laCroixAnimation <- image_read("https://media.giphy.com/media/l378xFKDBZO9Y5VUk/giphy.gif") %>%
  image_scale("300") %>%
  image_rotate(10)
laCroixAnimation

# Combine the plot and animation
# Background image
background <- image_background(image_scale(plot, "500"), "white", flatten = TRUE)

# Combine and flatten frames
frames <- image_composite(background, laCroixAnimation, offset = "+100+50")

# Turn frames into animation
animation <- image_animate(frames, fps = 5)
print(animation)

#Save gif
image_write(animation, "laCroixImage2.gif")

 

Animation 3 – The Sad Decline

For this image, I wanted to bring in a sticker that reflects the sad realization of a stock hitting somewhat of a free fall.   I am doing my part to boost the LaCroix stock prices, but I’m only one woman.  And therefore, we need a crying cat at the end of this curve to reflect the cooling off of stock prices in 2017/2018.

 

######## Image Three - Sad Decline ########

#Use base plot previously created

#Bring in the gif - scale
laCroixAnimation <- image_read("https://media.giphy.com/media/xUA7bfJ4OF11xXe4Fy/giphy.gif") %>%
  image_scale("80")
laCroixAnimation

# Background image
background <- image_background(image_scale(plot, "500"), "white", flatten = TRUE)

# Combine and flatten frames
frames <- image_composite(background, laCroixAnimation, offset = "+360+150")

# Turn frames into animation
animation <- image_animate(frames, fps = 10)
print(animation)

#Save gif
image_write(animation, "laCroixImage3.gif")

Thank You and Enjoy Your LaCroix

Thanks for reading along while we took a silly little journey through LaCroix stock prices, the LaCroixColoR package and magick animations.  Please feel free to let me know your thoughts in the comments or on Twitter. 

Note that the full code is available on my GitHub repo.  If you have trouble downloading the file from GitHub, go to the main page of the repo and select “Clone or Download” and then “Download Zip”.

 


To leave a comment for the author, please follow the link and comment on their blog: Little Miss Data.

Public Data Release of Stack Overflow’s 2018 Developer Survey

Wed, 05/30/2018 - 02:00

(This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers)

Note: Cross-posted with the Stack Overflow blog.
Starting today, you can access the public data release for Stack Overflow’s 2018 Developer Survey. Over 100,000 developers from around the world shared their opinions about everything from their favorite technologies to job preferences, and this data is now available for you to analyze yourself. This year, we are partnering with Kaggle to publish and highlight this dataset. This means you can access the data both here on our site and on Kaggle Datasets, and that on Kaggle, you can explore the dataset using Kernels.


To leave a comment for the author, please follow the link and comment on their blog: Rstats on Julia Silge.

Scico and the Colour Conundrum

Wed, 05/30/2018 - 02:00

(This article was first published on Data Imaginist, and kindly contributed to R-bloggers)

I’m happy to once again announce the release of a package. This time it happens to be a rather unplanned and quick new package, which is fun for a change. The package in question is scico, which provides access to the colour palettes developed by Fabio Crameri, as well as scale functions for ggplot2 so they can be used there. As there is not a lot to talk about in such a simple package, I’ll also spend some time discussing why the choice of colour is important beyond aesthetic considerations, and discuss how the quality of a palette might be assessed.

An overview of the package

scico provides a total of 17 different continuous palettes, all of which are available with the scico() function. For anyone having used viridis() the scico() API is very familiar:

library(scico)
scico(15, palette = 'oslo')
##  [1] "#000000" "#09131E" "#0C2236" "#133352" "#19456F" "#24588E" "#3569AC"
##  [8] "#4D7CC6" "#668CCB" "#7C99CA" "#94A8C9" "#ABB6C7" "#C4C7CC" "#E1E1E1"
## [15] "#FFFFFF"

In order to get a quick overview of all the available palettes use the scico_palette_show() function:

scico_palette_show()

As can be seen, the collection consists of both sequential and diverging palettes, both of which have their own uses depending on the data you want to show. A special mention goes to the oleron palette which is intended for topographical height data in order to produce the well known atlas look. Be sure to center this palette around 0 or else you will end up with very misleading maps.

ggplot2 support is provided with the scale_[colour|color|fill]_scico() functions, which works as expected:

library(ggplot2)

volcano <- data.frame(
  x = rep(seq_len(ncol(volcano)), each = nrow(volcano)),
  y = rep(seq_len(nrow(volcano)), ncol(volcano)),
  Altitude = as.vector(volcano)
)

ggplot(volcano, aes(x = x, y = y, fill = Altitude)) +
  geom_raster() +
  theme_void() +
  scale_fill_scico(palette = 'turku')
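Going back to the note about the oleron palette: one way to keep it centred on 0 is to give the scale symmetric limits. This is a minimal sketch with a hypothetical topo data frame holding x, y and elevation columns; it assumes scale_fill_scico() passes standard continuous-scale arguments such as limits on to ggplot2.

# Hypothetical data frame `topo` with elevation in metres relative to sea level
lim <- max(abs(topo$elevation))

ggplot(topo, aes(x = x, y = y, fill = elevation)) +
  geom_raster() +
  scale_fill_scico(palette = 'oleron', limits = c(-lim, lim))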

This is more or less all there is for this package… Now, let’s get to the meat of the discussion.

What’s in a colour?

If you’ve ever wondered why we are not all just using rainbow colours in our plots (after all, rainbows are pretty…) it’s because our choice of colour scale has a deep impact on what changes in the underlying data our eyes can perceive. The rainbow colour scale is still very common and notoriously bad – see e.g. Borland & Taylor (2007), The Rainbow Colour Map (repeatedly) considered harmful, and How The Rainbow Color Map Misleads – due to two huge problems that are fundamental to designing good colour scales: perceptual uniformity and colour-blind safety. Both of these issues have been taken into account when designing the scico palettes, but let’s tackle them one by one:

Colour blindness

Up to 10% of northern European males have the most common type of colour blindness (deuteranomaly, also known as red-green colour blindness), while the number is lower for other population groups. In addition, other, rarer types of colour blindness exist as well. In any case, the chance that a person with a colour vision deficiency will look at your plots is pretty high.

As we have to assume that the plots we produce will be looked at by people with colour vision deficiency, we must make sure that the colours we use to encode data can be clearly read by them (ornamental colours are less important as they – hopefully – don’t impact the conclusion of the graphic). Thankfully there are ways to simulate how colours are perceived by people with various types of colour blindness. Let’s look at the rainbow colour map:

library(pals)
pal.safe(rainbow, main = 'Rainbow scale')

As can be seen, there are huge areas of the scale where key tints disappear, making it impossible to correctly map colours back to their original data values. Put this in contrast to one of the scico palettes:

pal.safe(scico(100, palette = 'tokyo'), main = 'Tokyo scale')

While colour blindness certainly has an effect here, it is less detrimental, as changes along the scale can still be perceived and the same tint does not occur at multiple places.

Perceptual uniformity

While lack of colour-blind safety “only” affects a subgroup of your audience, lack of perceptual uniformity affects everyone – even you. Behind the slightly highbrow name lies the criterion that equal jumps in the underlying data should result in equal jumps in perceived colour difference. Said another way, every step along the palette should be perceived as giving the same amount of difference in colour.

One way to assess perceptual uniformity is by looking at small oscillations inside the scale. Let’s return to our favourite worst rainbow scale:

pal.sineramp(rainbow, main = 'Rainbow scale')

We can see that there are huge differences in how clearly the oscillations appear along the scale, and around the green area they even disappear. In comparison, the scico palettes produce much more even results:

pal.sineramp(scico(100, palette = 'tokyo'), main = 'Tokyo scale')

But wait – there’s more!

This is just a very short overview into the world of colour perception and how it affects information visualisation. The pals package contains more functions to assess the quality of colour palettes, some of which have been collected in an ensemble function:

pal.test(scico(100, palette = 'broc'), main = 'Broc scale')

It also has a vignette that explains in more detail how the different plots can be used to look into different aspects of the palette.

scico is also not the only package that provides well-designed, safe colour palettes. RColorBrewer has been a beloved utility for a long time, as has the more recent viridis. Still, choice is good, and using the same palettes for a prolonged time can make them seem old and boring, so the more the merrier.

A last honourable mention is the overview of palettes in R that Emil Hvitfeldt has put together. Not all of the palettes in it (the lion’s share, actually) have been designed with the issues discussed above in mind, but sometimes that’s OK – at least you now know how to assess the impact of your choice and weigh it against the other considerations you have.

Always be wary of colours


To leave a comment for the author, please follow the link and comment on their blog: Data Imaginist.

User authentication in R Shiny – sneak peek of shiny.users and shiny.admin packages

Wed, 05/30/2018 - 02:00

(This article was first published on Appsilon Data Science Blog, and kindly contributed to R-bloggers)

At Appsilon we frequently build advanced R/Shiny dashboards that need user authentication. I would like to share with you how we implement user management – user accounts and the authorization process – and how we gather usage statistics to better understand how the app is used.

For user authentication, we created the shiny.users package. For user administration and usage statistics, we created the shiny.admin package. We will show you a sneak peek of how these packages work. They are not released yet – but we opened an Early Adopter Waitlist – sign up and we will contact you as soon as these packages are ready for production testing!

Shiny.users – convenient authentication on the application level

Here is an example of how shiny.users works when you add it to your shiny application:

Below you can view the code of this example:

library(shiny)
library(shiny.users)

demo_users <- list(
  list(
    username = "demo-appsilon",
    password = "QZt85C46Ab",
    roles = list("basic", "admin")
  ),
  list(
    username = "john",
    password = "C4SRx3xJcy",
    roles = list("basic")
  )
)

ui <- shinyUI(fluidPage(
  div(class = "container", style = "padding: 4em",
    login_screen_ui('login_screen'),
    uiOutput("authorized_content")
  )
))

server <- shinyServer(function(input, output) {
  users <- initialize_users(demo_users)
  callModule(login_screen, 'login_screen', users)
  output$authorized_content <- renderUI({
    if (!is.null(users$user())) {  # note: closing parenthesis added here
      ... # application content
    }
  })
})

shinyApp(ui, server)

This is a simple demo with users defined in a list. In a real-world scenario, you would store user credentials in a database or authenticate users with an external API. This is also covered in shiny.users.
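For illustration, here is a minimal sketch of that database-backed idea, independent of the shiny.users API: keep only password hashes in a table and verify a submitted password against the stored hash. It uses the sodium package and is an assumption about how one might wire this up, not part of shiny.users.

library(sodium)

# Hypothetical credentials table - in practice this would live in a database
users_db <- data.frame(
  username = c("demo-appsilon", "john"),
  hash     = c(password_store("QZt85C46Ab"), password_store("C4SRx3xJcy")),
  stringsAsFactors = FALSE
)

# Returns TRUE only when the username exists and the password matches its hash
authenticate <- function(username, password) {
  row <- users_db[users_db$username == username, , drop = FALSE]
  nrow(row) == 1 && isTRUE(password_verify(row$hash, password))
}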

In bigger apps with more users, you will also need user management, authorization management (different roles e.g. admin, writer, reader, etc) and usage statistics. For such advanced use cases, we created shiny.admin package.

Shiny.admin – user management and usage statistics

This package is an extension to the shiny.users package. When you add it to your Shiny app, you instantly get an admin panel under the URL http://application-url/#!/admin. This panel is only accessible to users with the “admin” role.

Features of shiny.admin:

  • Adding and removing users
  • Managing existing users – their passwords, roles and metadata
  • Statistics – counting unique users and sessions (similar to Google Analytics)
  • Session recording – recording defined actions, e.g. provided input values and clicked elements. You can view all session details (similar to Hotjar)

We observed huge gains from the above features; they gave us an exact understanding of how our apps were used. Here are some screenshots from shiny.admin:

I hope you like the idea of the shiny.users and shiny.admin packages. Let us know what you would expect from user accounts and admin panel features. Do you think there is a need for developing such packages? Tell us about your use cases and needs in the comments or contact us at dev@appsilondatascience.com.

Also, don’t forget to sign up to shiny.users and shiny.admin waitlist!





Read the original post at
Appsilon Data Science Blog.


To leave a comment for the author, please follow the link and comment on their blog: Appsilon Data Science Blog.

A recipe for recipes

Tue, 05/29/2018 - 16:15

(This article was first published on That’s so Random, and kindly contributed to R-bloggers)

If you build statistical or machine learning models, the recipes package can be useful for data preparation. A recipe object is a container that holds all the steps that should be performed to go from the raw data set to the set that is fed into a model algorithm. Once your recipe is ready, it can be executed on a data set at once to perform all the steps: not only on the train set on which the recipe was created, but also on new data, such as test sets and data that should be scored by the model. This assures that new data gets the exact same preparation as the train set, and thus can be validly scored by the learned model. The author of recipes, Max Kuhn, has provided abundant material to familiarize yourself with the richness of the package, see here, here, and here. I will not dwell on how to use the package. Rather, I’d like to share what in my opinion is a good way to create new steps and checks for the package1. Use of the package is probably intuitive. Developing new steps and checks, however, does require a little more understanding of the package’s inner workings. With this procedure, or recipe if you like, I hope you will find adding (and maybe even contributing) your own steps and checks becomes easier and more organized.

S3 classes in recipes

Lets build a very simple recipe:

library(tidyverse)
library(recipes)

rec <- recipe(mtcars) %>%
  step_center(everything()) %>%
  step_scale(everything()) %>%
  check_missing(everything())

rec %>% class()
## [1] "recipe"

As mentioned a recipe is a container for the steps and checks. It is a list of class recipe on which the prep, bake, print and tidy methods do the work as described in their respective documentation files. The steps and checks added to the recipe are stored inside this list. As you can see below, each of them are of their own subclass, as well as of the generic step or check classes.

rec$steps %>% map(class)
## [[1]]
## [1] "step_center" "step"
##
## [[2]]
## [1] "step_scale" "step"
##
## [[3]]
## [1] "check_missing" "check"

Each subclass has the same four methods defined as the recipe class. When one of the methods is called on the recipe, it will call the same method on all the steps and checks that are added to the recipe. (The exception is prep.recipe, which calls not only prep on all the steps and checks, but also bake.) This means that when implementing a new step or check, you should provide these four methods. Additionally, we’ll need the function that is called by the user to add it to the recipe object, and a constructor (if you are not sure what this is, see 15.3.1 of Advanced R by Hadley Wickham).

The recipes for recipes

When writing a new step or check you will probably be inclined to copy-paste an existing step and start tweaking it from the top: first writing the function, then the constructor, and then the methods one by one. I think this is suboptimal and can get messy quickly. My preferred way is to start by not bothering about recipes at all, but to write a general function that does the preparation work on a single vector. Once you are happy with this function, you sit and think about which arguments to this function should be provided upfront. These should be added as arguments to the function that is called by the user. Next you think about which function arguments are statistics that should be learned on the train set in the prep part of the recipe. You’ll then go on and write the constructor that is called by both the main function and the prep method. Only then you’ll write the function that is called by the user. Your custom step or check is completed by writing the four methods; since the functionality is already created, these are mostly bookkeeping.

For both checks and steps I made a skeleton available, in which all the “overhead” that should be in a new step or check is present. This is more convenient than copy-pasting an existing example and trying to figure out what is step-specific and should be erased. You’ll find the skeletons on github. Note that for creating your own steps and checks you should clone the package source code, since the helper functions that are used are not exported by the package.

Putting it to practice

We are going to do two examples in which the recipe for recipes is applied.

Example 1: A signed log

First up is a signed-log (inspired by Practical Data Science with R), which takes the log of the absolute value of a variable and multiplies it by the variable’s original sign, thus enabling us to take logs of negative values. If a variable is between -1 and 1, we’ll set it to 0, otherwise things get messy.

1) preparing the function

This is what the function on a vector could look like, if we did not bother about recipes:

signed_log <- function(x, base = exp(1)) {
  ifelse(abs(x) < 1,
         0,
         sign(x) * log(abs(x), base = base))
}

2) think about the arguments

The only argument of the function is base, that should be provided upfront when adding the step to a recipe object. There are no statistics to be learned in the prep step.

3) the constructor

Now we are going to write the first of the recipes functions. This is the constructor that produces new instances of the object of class step_signed_log. The first four arguments are present in each step or check; they are therefore part of the skeletons. The terms argument will hold the information on which columns the step should be performed. For role, train and skip, please see the documentation in one of the skeletons. base is step_signed_log specific, as used in 1). In prep.step_signed_log the tidyselect arguments will be converted to a character vector holding the actual column names. columns will store these names in the step_signed_log object. This container argument will not be necessary if the column names are also present in another way. For instance, step_center has the argument means, which will be populated with the means of the variables of the train set by its prep method. In the bake method the names of the columns to be prepared are already provided by the names of the means, so there is no need to use the columns argument.

step_signed_log_new <- function(terms = NULL,
                                role = NA,
                                skip = FALSE,
                                trained = FALSE,
                                base = NULL,
                                columns = NULL) {
  step(
    subclass = "signed_log",
    terms = terms,
    role = role,
    skip = skip,
    trained = trained,
    base = base,
    columns = columns
  )
}

4) the function to add it to the recipe

Next up is the function that will be called by the user to add the step to its recipe. You’ll see the internal helper function add_step is called. It will expand the recipe with the step_signed_log object that is produced by the constructor we just created.

step_signed_log <- function(recipe,
                            ...,
                            role = NA,
                            skip = FALSE,
                            trained = FALSE,
                            base = exp(1),
                            columns = NULL) {
  add_step(
    recipe,
    step_signed_log_new(
      terms = ellipse_check(...),
      role = role,
      skip = skip,
      trained = trained,
      base = base,
      columns = columns
    )
  )
}

5) the prep method

As recognized in 2) we don’t have to do much in the prep method of this particular step, since the preparation of new sets does not depend on statistics learned on the train set. The only thing we do here is apply the internal terms_select function to replace the tidyselect selections by the actual names of the columns on which step_signed_log should be applied. We call the constructor again, indicating that the step is trained and supplying the column names in the columns argument.

prep.step_signed_log <- function(x, training, info = NULL, ...) {
  col_names <- terms_select(x$terms, info = info)
  step_signed_log_new(
    terms = x$terms,
    role = x$role,
    skip = x$skip,
    trained = TRUE,
    base = x$base,
    columns = col_names
  )
}

6) the bake method

We are now ready to apply the baking function, designed at 1), inside the recipes framework. We loop through the variables, apply the function and return the updated data frame.

bake.step_signed_log <- function(object, newdata, ...) {
  col_names <- object$columns
  for (i in seq_along(col_names)) {
    col <- newdata[[ col_names[i] ]]
    newdata[, col_names[i]] <- ifelse(abs(col) < 1,
                                      0,
                                      sign(col) * log(abs(col), base = object$base))
  }
  as_tibble(newdata)
}

7) the print method

This assures pretty printing of the recipe object to which step_signed_log is added. You use the internal printer function with a message specific for the step.

print.step_signed_log <- function(x, width = max(20, options()$width - 30), ...) {
  cat("Taking the signed log for ", sep = "")
  printer(x$columns, x$terms, x$trained, width = width)
  invisible(x)
}

8) the tidy method

Finally, tidy will add a line for this step to the data frame when the tidy method is called on a recipe.

tidy.step_signed_log <- function(x, ...) {
  if (is_trained(x)) {
    res <- tibble(terms = x$columns)
  } else {
    res <- tibble(terms = sel2char(x$terms))
  }
  res
}

Lets do a quick check to see if it works as expected

recipe(data_frame(x = 1)) %>%
  step_signed_log(x) %>%
  prep() %>%
  bake(data_frame(x = -3:3))
## # A tibble: 7 x 1
##            x
##
## 1 -1.0986123
## 2 -0.6931472
## 3  0.0000000
## 4  0.0000000
## 5  0.0000000
## 6  0.6931472
## 7  1.0986123

Example 2: A range check

Model predictions might be invalid when the range of a variable in new data is shifted from the range of the variable in the train set. Let’s do a second example, in which we check whether the range of a numeric variable is approximately equal to the range of the variable in the train set. We do so by checking that the variable’s minimum value in the new data is not smaller than its minimum value in the train set. Likewise, the variable’s maximum value in the test set should not exceed the maximum in the train set. We allow for some slack (a proportion of the variable’s range in the train set) to account for natural variation.

1) preparing the function

As mentioned, checks are about throwing informative errors if assumptions are not met. This is a function we could apply to new variables, without bothering about recipes:

range_check_func <- function(x,
                             lower,
                             upper,
                             slack_prop = 0.05,
                             colname = "x") {
  min_x <- min(x)
  max_x <- max(x)
  slack <- (upper - lower) * slack_prop

  if (min_x < (lower - slack) & max_x > (upper + slack)) {
    stop("min ", colname, " is ", min_x, ", lower bound is ", lower - slack, "\n",
         "max x is ", max_x, ", upper bound is ", upper + slack, call. = FALSE)
  } else if (min_x < (lower - slack)) {
    stop("min ", colname, " is ", min_x, ", lower bound is ", lower - slack, call. = FALSE)
  } else if (max_x > (upper + slack)) {
    stop("max ", colname, " is ", max_x, ", upper bound is ", upper + slack, call. = FALSE)
  }
}

2) think about the arguments

The slack_prop is a choice that the user of the check should make upfront; it is thus an argument of check_range. Then there are two statistics to be learned in the prep method: lower and upper. These should be arguments of the function and the constructor as well. However, when calling the function these are always NULL; they are container arguments that will be filled when calling prep.check_range.

3) the constructor

We start again with the four arguments present in every step or check. Subsequently we add the three arguments that we recognized to be part of the check.

check_range_new <- function(terms = NULL,
                            role = NA,
                            skip = FALSE,
                            trained = FALSE,
                            lower = NULL,
                            upper = NULL,
                            slack_prop = NULL) {
  check(subclass = "range",
        terms = terms,
        role = role,
        trained = trained,
        lower = lower,
        upper = upper,
        slack_prop = slack_prop)
}

4) the function to add it to the recipe

As we know by now, it is just calling the constructor and adding it to the recipe.

check_range <- function(recipe,
                        ...,
                        role = NA,
                        skip = FALSE,
                        trained = FALSE,
                        lower = NULL,
                        upper = NULL,
                        slack_prop = 0.05) {
  add_check(
    recipe,
    check_range_new(
      terms = ellipse_check(...),
      role = role,
      skip = skip,
      trained = trained,
      lower = lower,
      upper = upper,
      slack_prop = slack_prop
    )
  )
}

5) the prep method

Here the method is getting a lot more interesting, because we actually have work to do. We are calling vapply on each of the columns the check should be applied on, to derive the minimum and maximum. Again the constructor is called and the learned statistics are now populating the lower and upper arguments.

prep.check_range <- function(x, training, info = NULL, ...) {
  col_names <- terms_select(x$terms, info = info)
  lower_vals <- vapply(training[ ,col_names], min, c(min = 1), na.rm = TRUE)
  upper_vals <- vapply(training[ ,col_names], max, c(max = 1), na.rm = TRUE)
  check_range_new(
    x$terms,
    role = x$role,
    trained = TRUE,
    lower = lower_vals,
    upper = upper_vals,
    slack_prop = x$slack_prop
  )
}

6) the bake method

The hard work has been done already. We just get the columns on which to apply the check and check them with the function we created at 1).

bake.check_range <- function(object, newdata, ...) {
  col_names <- names(object$lower)
  for (i in seq_along(col_names)) {
    colname <- col_names[i]
    range_check_func(newdata[[ colname ]],
                     object$lower[colname],
                     object$upper[colname],
                     object$slack_prop,
                     colname)
  }
  as_tibble(newdata)
}

7) the print method

print.check_range <- function(x, width = max(20, options()$width - 30), ...) {
  cat("Checking range of ", sep = "")
  printer(names(x$lower), x$terms, x$trained, width = width)
  invisible(x)
}

8) the tidy method

tidy.check_range <- function(x, ...) {
  if (is_trained(x)) {
    res <- tibble(terms = x$columns)
  } else {
    res <- tibble(terms = sel2char(x$terms))
  }
  res
}

Again, we check quickly if it works

cr1 <- data_frame(x = 0:100)
cr2 <- data_frame(x = -1:101)
cr3 <- data_frame(x = -6:100)
cr4 <- data_frame(x = 0:106)

recipe_cr <- recipe(cr1) %>%
  check_range(x) %>%
  prep()

cr2_baked <- recipe_cr %>% bake(cr2)
cr3_baked <- recipe_cr %>% bake(cr3)
## Error: min x is -6, lower bound is -5
cr4_baked <- recipe_cr %>% bake(cr4)
## Error: max x is 106, upper bound is 105

Conclusion

If you would like to add your own data preparation steps and data checks to the recipes package, I advise you to do this in a structured way so you are not distracted by the bookkeeping while implementing the functionality. I propose eight subsequent parts to develop a new step or check:

1) Create a function on a vector that could be applied in the bake method, but does not bother about recipes yet.
2) Recognize which function arguments should be provided upfront and which should be learned in the prep method.
3) Create a constructor in the form step_*_new or check_*_new.
4) Create the actual function to add the step or check to a recipe, in the form step_* or check_*.
5) Write the prep method.
6) Write the bake method.
7) Write the print method.
8) Write the tidy method.

As mentioned, the source code is maintained here. Make sure to clone the latest version to access the package internal functions.

  1. Together with Max I added the checks framework to the package. Where steps are transformations of variables, checks are assertions of them. If a check passes, nothing happens. If it fails, it will break the bake method of the recipe and throw an informative error. 


To leave a comment for the author, please follow the link and comment on their blog: That’s so Random.

EARL Boston 2018 Keynote Speaker announcement: Bob Rudis

Tue, 05/29/2018 - 16:06

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Mango Solutions are delighted to announce our Keynote Speaker for the EARL Boston Conference: Bob Rudis, [Master] Chief Data Scientist at Rapid7.

Bob has more than 20 years’ experience using data to help defend global Fortune 100 companies, and in his current role at Rapid7 he specializes in research on internet-scale exposure. He was formerly a Security Data Scientist and Managing Principal at Verizon, overseeing the team that produces the annual Data Breach Investigations Report.

If you’re on Twitter, you’ll know Bob as a serial tweeter (he’s even got the blue tick) – if you don’t already follow him, you can find him at @hrbrmstr. He’s also an avid blogger at rud.is, author (Data-Driven Security) and speaker.

As a prominent and avuncular participant in the Twitter #rstats conversation and a prolific package author, Bob is a respected and valued member of the global R community.

We’re looking forward to seeing him speak on 13 November in Boston!

Take the stage with Bob

Abstract submissions for the US Roadshow close on 30 April 2018. You could be on the agenda with Bob in Boston as one of our speakers if you would like to share the R successes in your organisation.

Submit your abstract here.

Early bird tickets now available

Tickets for all EARL Conferences are now available:
London: 11-13 September
Seattle: 7 November
Houston: 9 November
Boston: 13 November


To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.

The Fix Is In: Finding infix functions inside contributed R package “utilities” files

Tue, 05/29/2018 - 14:42

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Regular readers will recall the “utility belt” post from back in April of this year. This is a follow-up to a request asking for a list of all the % infix functions in those files.

We’re going to:

  • collect up all of the sources
  • parse them
  • find all the definitions of % infix functions
  • write them to a file

We’ll start by grabbing the data from the previous post and look at it as a refresher:

library(stringi)
library(tidyverse)

utils <- read_rds(url("https://rud.is/dl/utility-belt.rds"))

utils
## # A tibble: 1,746 x 13
##    permsissions links owner     group     size  month day   year_hr path   date       pkg   fil   file_src
##  1 -rw-r--r--       0 hornik    users      1658 Jun   05    2016    AHR/R… 2016-06-05 AHR   util… "## \\int f(x)dg(x) …
##  2 -rw-r--r--       0 ligges    users     12609 Dec   13    2016    ALA4R… 2016-12-13 ALA4R util… "## some utility fun…
##  3 -rw-r--r--       0 hornik    users         0 Feb   24    2017    AWR.K… 2017-02-24 AWR.… util… ""
##  4 -rw-r--r--       0 ligges    users      4127 Aug   30    2017    Alpha… 2017-08-30 Alph… util… "#\n#' Assign API ke…
##  5 -rw-r--r--       0 ligges    users       121 Jan   19    2017    Amylo… 2017-01-19 Amyl… util… "make_decision <- fu…
##  6 -rw-r--r--       0 herbrandt herbrandt    52 Aug   10    2017    BANES… 2017-08-10 BANE… util… "#' @importFrom dply…
##  7 -rw-r--r--       0 ripley    users     36977 Jan   06    2015    BEQI2… 2015-01-06 BEQI2 util… "#' \tRemove Redunda…
##  8 -rw-r--r--       0 hornik    users     34198 May   10    2017    BGDat… 2017-05-10 BGDa… util… "# A more memory-eff…
##  9 -rwxr-xr-x       0 ligges    users      3676 Aug   14    2016    BGLR/… 2016-08-14 BGLR  util… "\n readBinMat=funct…
## 10 -rw-r--r--       0 ripley    users      2547 Feb   04    2015    BLCOP… 2015-02-04 BLCOP util… "###################…
## # ... with 1,736 more rows

Note that we somewhat expected the file source to potentially come in handy at a later date and also expected the need to revisit that post, so the R data file [←direct link to RDS] included a file_src column.

Now, let’s find all the source files with at least one infix definition, collect them together and parse them so we can do more code spelunking:

filter(utils, stri_detect_fixed(file_src, "`%")) %>% # only find sources with infix definitions
  pull(file_src) %>%
  paste0(collapse="\n\n") %>%
  parse(text = ., keep.source=TRUE) -> infix_src

str(infix_src, 1)
## length 1364 expression(dplyr::`%>%`, `%||%` <- function(a, b) if (is.null(a)) b else a, get_pkg_path <- function(ctx) { pkg_| __truncated__ ...
## - attr(*, "srcref")=List of 1364
## - attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile'
## - attr(*, "wholeSrcref")= 'srcref' int [1:8] 1 0 15768 0 0 0 1 15768
## ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile'

We can now take all of that lovely parsed source and tokenize it to work with the discrete elements in a very tidy manner:

infix_parsed <- tbl_df(getParseData(infix_src)) # tbl_df() is mainly for pretty printing

infix_parsed
## # A tibble: 118,242 x 9
##    line1  col1 line2  col2    id parent token          terminal text
##  1     1     1     1    24     1    -10 COMMENT        TRUE     #' @impor…
##  2     2     1     2    10     4    -10 COMMENT        TRUE     #' @export
##  3     3     1     3    12    10      0 expr           FALSE    ""
##  4     3     1     3     5     7     10 SYMBOL_PACKAGE TRUE     dplyr
##  5     3     6     3     7     8     10 NS_GET         TRUE     ::
##  6     3     8     3    12     9     10 SYMBOL         TRUE     `%>%`
##  7     5     1     5    49    51      0 expr           FALSE    ""
##  8     5     1     5     6    16     18 SYMBOL         TRUE     `%||%`
##  9     5     1     5     6    18     51 expr           FALSE    ""
## 10     5     8     5     9    17     51 LEFT_ASSIGN    TRUE     <-
## # ... with 118,232 more rows

We just need to find a sequence of tokens that make up a function definition, then whittle those down to ones that look like our % infix names:

pat <- c("SYMBOL", "expr", "LEFT_ASSIGN", "expr", "FUNCTION") # pattern for function definition # find all of ^^ sequences (there's a good twitter discussion on this abt a month ago) idx <- which(infix_parsed$token == pat[1]) # find location of match of start of seq # look for the rest of the sequences starting at each idx position map_lgl(idx, ~{ all(infix_parsed$token[.x:(.x+(length(pat)-1))] == pat) }) -> found f_defs <- idx[found] # starting indices of all the places where functions are defined # filter ^^ to only find infix ones infix_defs <- f_defs[stri_detect_regex(infix_parsed$text[f_defs], "^`\\%")] # there aren't too many, but remember we're just searching `util` functions length(infix_defs) ## [1] 106

Now, write it out to a file so we can peruse the infix functions:

# nuke a file and fill it with the function definition
cat("", sep="", file="infix_functions.R")

walk2(
  getParseText(infix_parsed, infix_parsed$id[infix_defs]),     # extract the infix name
  getParseText(infix_parsed, infix_parsed$id[infix_defs + 3]), # extract the function definition body
  ~{
    cat(.x, " <- ", .y, "\n\n", sep="", file="infix_functions.R", append=TRUE)
  }
)

There are 106 of them so you can find the extracted ones in this gist.

Here's an overview of what you can expect to find:

# A tibble: 39 x 2
   name              n
 1 `%||%`           47
 2 `%+%`             7
 3 `%AND%`           4
 4 `%notin%`         4
 5 `%:::%`           3
 6 `%==%`            3
 7 `%!=%`            2
 8 `%*diag%`         2
 9 `%diag*%`         2
10 `%nin%`           2
11 `%OR%`            2
12 `%::%`            1
13 `%??%`            1
14 `%.%`             1
15 `%@%`             1
16 `%&&%`            1
17 `%&%`             1
18 `%+&%`            1
19 `%++%`            1
20 `%+|%`            1
21 `%<<%`            1
22 `%>>%`            1
23 `%~~%`            1
24 `%assert_class%`  1
25 `%contains%`      1
26 `%din%`           1
27 `%fin%`           1
28 `%identical%`     1
29 `%In%`            1
30 `%inr%`           1
31 `%M%`             1
32 `%notchin%`       1
33 `%or%`            1
34 `%p%`             1
35 `%pin%`           1
36 `%R%`             1
37 `%s%`             1
38 `%sub_in%`        1
39 `%sub_nin%`       1

FIN
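For a concrete sense of what these look like, here are two of the most common patterns. The first definition appears verbatim in the parsed sources shown earlier; the body of the second is an assumption for illustration, not one extracted from the gist.

# Null-default operator, as found in the parsed utility files above
`%||%` <- function(a, b) if (is.null(a)) b else a

# A typical "not in" helper; this body is illustrative only
`%notin%` <- function(x, table) !(x %in% table)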

If any of those are useful, feel free to PR them in to https://github.com/hrbrmstr/freebase/blob/master/inst/templates/infix-helpers.R (and add yourself to the DESCRIPTION if you do).

Hopefully this provided some further inspiration to continue to use R not only as your language of choice but also as a fun data source.


To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

Daily Streamflow Trend Analysis

Tue, 05/29/2018 - 02:00

(This article was first published on The USGS OWI blog , and kindly contributed to R-bloggers)

Introduction

This document describes how to produce a set of graphics and perform the associated statistical tests that describe trends in daily streamflow at a single streamgage. The trends depicted cover the full range of quantiles of the streamflow distribution, from the lowest discharge of the year, through the median discharge, up through the highest discharge of the year. The method is built around the R package EGRET (Exploration and Graphics for RivEr Trends). It makes it possible to consider trends over any portion of the time period for which there are daily streamflow records, and any period of analysis (the portion of the year to be considered, e.g. the entire year or just the summer months).

Getting Started

First, the EGRET package needs to be installed and loaded. Then, you’ll need to create an eList object, which contains site information and daily discharge data for a streamgage. See page 13 of the EGRET user guide here.

For this post, we will use the Choptank River in Maryland as our first example. There is an example data set included in EGRET. The data set consists of site information, daily discharge data, and water quality data, but this application does not use the water quality data. Refer to the section near the end of this document called Downloading the data for your site of interest to see how you can set up an eList object for any USGS streamgage.

There are two limitations that users should know about this application before proceeding any further.

  1. The code was designed for discharge records that are complete (no gaps). This is usually not a problem with USGS discharge records unless there is a major gap (typically years in length) when the streamgage was not operating. If records have a number of small gaps, users could use some established method for filling in missing data to create a gap-free record.

  2. The discharge on every day should be a positive value (not zero or negative). The EGRET code that is used here to read in new data has a “work around” for situations where there are a very small number of non-positive discharge values. It adds a small constant to all the discharge data so they will all be positive. This should have almost no impact on the results provided the number of non-positive days is very small, say less than 0.1% of all the days. That translates to about 11 days out of 30 years (a quick way to check this for your record is sketched below). For data sets with more zero or negative flow days, different code would need to be written (we would appreciate it if a user could work on developing such a set of code).
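As referenced above, a quick check of how many non-positive discharge days a record contains might look like the following small sketch (not part of the original post; it assumes you already have an eList for your streamgage, as created below):

# Count non-positive discharge days and their fraction of the record
nonPos <- sum(eList$Daily$Q <= 0)
nonPos
nonPos / nrow(eList$Daily)  # should be well under 0.001 for the work-around to be reasonable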

To start, the following R commands are needed.

library(EGRET)
eList <- Choptank_eList

Just to get some sense about the data, we will look at a portion of the metadata (gage ID number, name, and drainage area in square kilometers) and also see a summary of the discharge data (discharge in cubic meters per second).

print(eList$INFO$site.no)
## [1] "01491000"
print(eList$INFO$shortName)
## [1] "Choptank River"
print(eList$INFO$drainSqKm)
## [1] 292.6687
print(summary(eList$Daily$Date))
##         Min.      1st Qu.       Median         Mean      3rd Qu.
## "1979-10-01" "1987-09-30" "1995-09-30" "1995-09-30" "2003-09-30"
##         Max.
## "2011-09-30"
print(summary(eList$Daily$Q))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
##   0.00991   0.93446   2.40693   4.08658   4.61565 246.35656

Loading the necessary packages and other R code

To run the analysis and produce the graphs you will need a few R functions in addition to the EGRET package. You can copy the entire block of code shown below here and paste it into your workspace (all as a single copy and paste) or you can create an .R file from the code that you will source each time you want to use it. Let’s say you call it flowTrends.R, then each time you want to use it you need to “source” the object flowTrends.R.

In addition to EGRET the functions use the packages rkt, zyp, lubridate, and Kendall. Make sure you have installed all of them.
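For example, a minimal setup might be the following (a sketch assuming you saved the code block below as flowTrends.R in your working directory):

# One-time installation of the supporting packages named above
install.packages(c("rkt", "zyp", "lubridate", "Kendall"))

# Load EGRET and the helper functions each time you want to use them
library(EGRET)
source("flowTrends.R")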


library(rkt) library(zyp) library(lubridate) ######################################################################################### ########## this is the function you will use to make a single trend graph ############## ######################################################################################### plotFlowTrend <- function (eList, istat, startDate = NA, endDate = NA, paStart = 4, paLong = 12, window = 30, qMax = NA, printTitle = TRUE, tinyPlot = FALSE, customPar = FALSE, runoff = FALSE, qUnit = 2, printStaName = TRUE, printPA = TRUE, printIstat = TRUE, cex = 0.8, cex.axis = 1.1, cex.main = 1.1, lwd = 2, col = "black", ...){ localDaily <- getDaily(eList) localINFO <- getInfo(eList) localINFO$paStart <- paStart localINFO$paLong <- paLong localINFO$window <- window start <- as.Date(startDate) end <- as.Date(endDate) if(is.na(startDate)){ start <- as.Date(localDaily$Date[1]) } if(is.na(endDate)){ end <- as.Date(localDaily$Date[length(localDaily$Date)]) } localDaily <- subset(localDaily, Date >= start & Date <= end) eList <- as.egret(localINFO,localDaily) localAnnualSeries <- makeAnnualSeries(eList) qActual <- localAnnualSeries[2, istat, ] qSmooth <- localAnnualSeries[3, istat, ] years <- localAnnualSeries[1, istat, ] Q <- qActual time <- years LogQ <- log(Q) mktFrame <- data.frame(time,LogQ) mktFrame <- na.omit(mktFrame) mktOut <- rkt::rkt(mktFrame$time,mktFrame$LogQ) zypOut <- zyp::zyp.zhang(mktFrame$LogQ,mktFrame$time) slope <- mktOut$B slopePct <- 100 * (exp(slope)) - 100 slopePct <- format(slopePct,digits=2) pValue <- zypOut[6] pValue <- format(pValue,digits = 3) if (is.numeric(qUnit)) { qUnit <- qConst[shortCode = qUnit][[1]] } else if (is.character(qUnit)) { qUnit <- qConst[qUnit][[1]] } qFactor <- qUnit@qUnitFactor yLab <- qUnit@qUnitTiny if (runoff) { qActual <- qActual * 86.4/localINFO$drainSqKm qSmooth <- qSmooth * 86.4/localINFO$drainSqKm yLab <- "Runoff in mm/day" } else { qActual <- qActual * qFactor qSmooth <- qSmooth * qFactor } localSeries <- data.frame(years, qActual, qSmooth) yInfo <- generalAxis(x = qActual, maxVal = qMax, minVal = 0, tinyPlot = tinyPlot) xInfo <- generalAxis(x = localSeries$years, maxVal = decimal_date(end), minVal = decimal_date(start), padPercent = 0, tinyPlot = tinyPlot) line1 <- localINFO$shortName nameIstat <- c("minimum day", "7-day minimum", "30-day minimum", "median daily", "mean daily", "30-day maximum", "7-day maximum", "maximum day") line2 <- paste0("\n", setSeasonLabelByUser(paStartInput = paStart, paLongInput = paLong), " ", nameIstat[istat]) line3 <- paste0("\nSlope estimate is ",slopePct,"% per year, Mann-Kendall p-value is ",pValue) if(tinyPlot){ title <- paste(nameIstat[istat]) } else { title <- paste(line1, line2, line3) } if (!printTitle){ title <- "" } genericEGRETDotPlot(x = localSeries$years, y = localSeries$qActual, xlim = c(xInfo$bottom, xInfo$top), ylim = c(yInfo$bottom, yInfo$top), xlab = "", ylab = yLab, customPar = customPar, xTicks = xInfo$ticks, yTicks = yInfo$ticks, cex = cex, plotTitle = title, cex.axis = cex.axis, cex.main = cex.main, tinyPlot = tinyPlot, lwd = lwd, col = col, ...) 
lines(localSeries$years, localSeries$qSmooth, lwd = lwd, col = col) } ######################################################################################### ###### this the the function you will use to make the Quantile Kendall Plot ############# ######################################################################################### plotQuantileKendall <- function(eList, startDate = NA, endDate = NA, paStart = 4, paLong = 12, legendLocation = "topleft", legendSize = 1.0, yMax = NA, yMin = NA) { localDaily <- eList$Daily localINFO <- eList$INFO localINFO$paStart <- paStart localINFO$paLong <- paLong start <- as.Date(startDate) end <- as.Date(endDate) if(is.na(startDate)){ start <- as.Date(localDaily$Date[1]) } if(is.na(endDate)){ end <- as.Date(localDaily$Date[length(localDaily$Date)]) } localDaily <- subset(localDaily, Date >= start & Date <= end) eList <- as.egret(localINFO,localDaily) eList <- setPA(eList, paStart=paStart, paLong=paLong) v <- makeSortQ(eList) sortQ <- v[[1]] time <- v[[2]] results <- trendSortQ(sortQ, time) pvals <- c(0.001,0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.95,0.99,0.999) zvals <- qnorm(pvals) name <- eList$INFO$shortName # ymax <- trunc(max(results$slopePct)*10) # ymax <- max(ymax + 2, 5) # ymin <- floor(min(results$slopePct)*10) # ymin <- min(ymin - 2, -5) # yrange <- c(ymin/10, ymax/10) # yticks <- axisTicks(yrange, log = FALSE) ymax <- max(results$slopePct + 0.5, yMax, na.rm = TRUE) ymin <- min(results$slopePct - 0.5, yMin, na.rm = TRUE) yrange <- c(ymin, ymax) yticks <- axisTicks(yrange, log = FALSE, nint =7) p <- results$pValueAdj color <- ifelse(p <= 0.1,"black","snow3") color <- ifelse(p < 0.05, "red", color) pvals <- c(0.001,0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.95,0.99,0.999) zvals <- qnorm(pvals) name <- paste0("\n", eList$INFO$shortName,"\n", start," through ", end, "\n", setSeasonLabelByUser(paStartInput = paStart, paLongInput = paLong)) plot(results$z,results$slopePct,col = color, pch = 20, cex = 1.0, xlab = "Daily non-exceedance probability", ylab = "Trend slope in percent per year", xlim = c(-3.2, 3.2), ylim = yrange, yaxs = "i", las = 1, tck = 0.02, cex.lab = 1.2, cex.axis = 1.2, axes = FALSE, frame.plot=TRUE) mtext(name, side =3, line = 0.2, cex = 1.2) axis(1,at=zvals,labels=pvals, las = 1, tck = 0.02) axis(2,at=yticks,labels = TRUE, las = 1, tck = 0.02) axis(3,at=zvals,labels=FALSE, las = 1, tck=0.02) axis(4,at=yticks,labels = FALSE, tick = TRUE, tck = 0.02) abline(h=0,col="blue") legend(legendLocation,c("> 0.1","0.05 - 0.1","< 0.05"),col = c("snow3", "black","red"),pch = 20, title = "p-value", pt.cex=1.0, cex = legendSize * 1.5) } ######################################################################################### ############ This next function combines four individual trend graphs (for mimimum day, ########### median day, mean day, and maximum day) along with the quantile kendall graph ######################################################################################### plotFiveTrendGraphs <- function(eList, startDate = NA, endDate = NA, paStart = 4, paLong = 12, qUnit = 2, window = 30, legendLocation = "topleft", legendSize = 1.0) { localDaily <- eList$Daily localINFO <- eList$INFO localINFO$paStart <- paStart localINFO$paLong <- paLong localINFO$window <- window start <- as.Date(startDate) end <- as.Date(endDate) if(is.na(startDate)){ start <- as.Date(localDaily$Date[1]) } if(is.na(endDate)){ end <- as.Date(localDaily$Date[length(localDaily$Date)]) } localDaily <- subset(localDaily, Date >= start & Date <= end) eList <- 
as.egret(localINFO,localDaily) eList <- setPA(eList, paStart=paStart, paLong=paLong, window=window) # this next line of code is inserted so that when paLong = 12, we always use the # climate year when looking at the trends in the annual minimum flow paStart1 <- if(paLong == 12) 4 else paStart plotFlowTrend(eList, istat = 1, qUnit = qUnit, paStart = paStart1, paLong = paLong, window = window) plotFlowTrend(eList, istat = 4, qUnit = qUnit, paStart = paStart, paLong = paLong, window = window) plotFlowTrend(eList, istat = 8, qUnit = qUnit, paStart = paStart, paLong = paLong, window = window) plotFlowTrend(eList, istat = 5, qUnit = qUnit, paStart = paStart, paLong = paLong, window = window) # now the quantile kendall plotQuantileKendall(eList, startDate = startDate, endDate = endDate, paStart = paStart, paLong = paLong, legendLocation = legendLocation, legendSize = legendSize) } ######################################################################################### ########### makeSortQ creates a matrix called Qsort. ############It sorted from smallest to largest over dimDays ############(if working with full year dimDays=365), #############and also creates other vectors that contain information about this array. ######################################################################################### makeSortQ <- function(eList){ localINFO <- getInfo(eList) localDaily <- getDaily(eList) paStart <- localINFO$paStart paLong <- localINFO$paLong # determine the maximum number of days to put in the array numDays <- length(localDaily$DecYear) monthSeqFirst <- localDaily$MonthSeq[1] monthSeqLast <- localDaily$MonthSeq[numDays] # creating a data frame (called startEndSeq) of the MonthSeq values that go into each year Starts <- seq(paStart, monthSeqLast, 12) Ends <- Starts + paLong - 1 startEndSeq <- data.frame(Starts, Ends) # trim this list of Starts and Ends to fit the period of record startEndSeq <- subset(startEndSeq, Ends >= monthSeqFirst & Starts <= monthSeqLast) numYearsRaw <- length(startEndSeq$Ends) # set up some vectors to keep track of years good <- rep(0, numYearsRaw) numDays <- rep(0, numYearsRaw) midDecYear <- rep(0, numYearsRaw) Qraw <- matrix(nrow = 366, ncol = numYearsRaw) for(i in 1: numYearsRaw) { startSeq <- startEndSeq$Starts[i] endSeq <- startEndSeq$Ends[i] startJulian <- getFirstJulian(startSeq) # startJulian is the first julian day of the first month in the year being processed # endJulian is the first julian day of the month right after the last month in the year being processed endJulian <- getFirstJulian(endSeq + 1) fullDuration <- endJulian - startJulian yearDaily <- localDaily[localDaily$MonthSeq >= startSeq & (localDaily$MonthSeq <= endSeq), ] nDays <- length(yearDaily$Q) if(nDays == fullDuration) { good[i] <- 1 numDays[i] <- nDays midDecYear[i] <- (yearDaily$DecYear[1] + yearDaily$DecYear[nDays]) / 2 Qraw[1:nDays,i] <- yearDaily$Q } else { numDays[i] <- NA midDecYear[i] <- NA } } # now we compress the matrix down to equal number of values in each column j <- 0 numGoodYears <- sum(good) dayCounts <- ifelse(good==1, numDays, NA) lowDays <- min(dayCounts, na.rm = TRUE) highDays <- max(dayCounts, na.rm = TRUE) dimYears <- numGoodYears dimDays <- lowDays sortQ <- matrix(nrow = dimDays, ncol = dimYears) time <- rep(0,dimYears) for (i in 1:numYearsRaw){ if(good[i]==1) { j <- j + 1 numD <- numDays[i] x <- sort(Qraw[1:numD, i]) # separate odd numbers from even numbers of days if(numD == lowDays) { sortQ[1:dimDays,j] <- x } else { sortQ[1:dimDays,j] <- if(odd(numD)) leapOdd(x) 
else leapEven(x) } time[j] <- midDecYear[i] } } sortQList = list(sortQ,time) return(sortQList) } ######################################################################################### ########## Another function trendSortQ needed for Quantile Kendall ######################################################################################### trendSortQ <- function(Qsort, time){ # note requires packages zyp and rkt nFreq <- dim(Qsort)[1] nYears <- length(time) results <- as.data.frame(matrix(ncol=9,nrow=nFreq)) colnames(results) <- c("slopeLog","slopePct","pValue","pValueAdj","tau","rho1","rho2","freq","z") for(iRank in 1:nFreq){ mkOut <- rkt::rkt(time,log(Qsort[iRank,])) results$slopeLog[iRank] <- mkOut$B results$slopePct[iRank] <- 100 * (exp(mkOut$B) - 1) results$pValue[iRank] <- mkOut$sl outZYP <- zyp.zhang(log(Qsort[iRank,]),time) results$pValueAdj[iRank] <- outZYP[6] results$tau[iRank] <- mkOut$tau # I don't actually use this information in the current outputs, but the code is there # if one wanted to look at the serial correlation structure of the flow series serial <- acf(log(Qsort[iRank,]), lag.max = 2, plot = FALSE) results$rho1[iRank] <- serial$acf[2] results$rho2[iRank] <- serial$acf[3] frequency <- iRank / (nFreq + 1) results$freq[iRank] <- frequency results$z[iRank] <- qnorm(frequency) } return(results) } ######################################################################################### ################################## getFirstJulian finds the julian date of first day ################################## of a given month ######################################################################################### getFirstJulian <- function(monthSeq){ year <- 1850 + trunc((monthSeq - 1) / 12) month <- monthSeq - 12 * (trunc((monthSeq-1)/12)) charMonth <- ifelse(month<10, paste0("0",as.character(month)), as.character(month)) theDate <- paste0(year,"-",charMonth,"-01") Julian1 <- as.numeric(julian(as.Date(theDate),origin=as.Date("1850-01-01"))) return(Julian1) } ######################################################################################### ########### leapOdd is a function for deleting one value ############when the period that contains Februaries has a length that is an odd number ######################################################################################### leapOdd <- function(x){ n <- length(x) m <- n - 1 mid <- (n + 1) / 2 mid1 <- mid + 1 midMinus <- mid - 1 y <- rep(NA, m) y[1:midMinus] <- x[1:midMinus] y[mid:m] <- x[mid1:n] return(y)} ######################################################################################### ########### leapEven is a function for deleting one value ############when the period that contains Februaries has a length that is an even number ######################################################################################### leapEven <- function(x){ n <- length(x) m <- n - 1 mid <- n / 2 y <- rep(NA, m) mid1 <- mid + 1 mid2 <- mid + 2 midMinus <- mid - 1 y[1:midMinus] <- x[1:midMinus] y[mid] <- (x[mid] + x[mid1]) / 2 y[mid1:m] <- x[mid2 : n] return(y) } ######################################################################################### ####### determines if the length of a vector is an odd number ########################### ######################################################################################### odd <- function(x) {(!(x %% 2) == 0)} ######################################################################################### ########### calcWY calculates the water year and inserts it into a data frame 
######################################################################################### calcWY <- function (df) { df$WaterYear <- as.integer(df$DecYear) df$WaterYear[df$Month >= 10] <- df$WaterYear[df$Month >= 10] + 1 return(df) } ######################################################################################### ##### calcCY calculates the climate year and inserts it into a data frame ######################################################################################### calcCY <- function (df){ df$ClimateYear <- as.integer(df$DecYear) df$ClimateYear[df$Month >= 4] <- df$ClimateYear[df$Month >= 4] + 1 return(df) } ######################################################################################### ######## smoother is a function does the trend in real discharge units and not logs. ######## It is placed here so that users wanting to run this alternative have it available ######## but it is not actually called by any function in this document ######################################################################################### smoother <- function(xy, window){ edgeAdjust <- TRUE x <- xy$x y <- xy$y n <- length(y) z <- rep(0,n) x1 <- x[1] xn <- x[n] for (i in 1:n) { xi <- x[i] distToEdge <- min((xi - x1), (xn - xi)) close <- (distToEdge < window) thisWindow <- if (edgeAdjust & close) (2 * window) - distToEdge else window w <- triCube(x - xi, thisWindow) mod <- lm(xy$y ~ x, weights = w) new <- data.frame(x = x[i]) z[i] <- predict(mod, new) } return(z) } Making a graph of the trend in a single flow statistic

Now all we need to do is run the plotFlowTrend function that was the first part of the code we just read in. First we will run it in its simplest form: we will use the entire discharge record, and our period of analysis will be the full climatic year. A climatic year is the year that starts on April 1 and ends on March 31. It is our default approach because it tends to avoid breaking a long low-flow period into two segments, one in each of two adjacent years. We will first run it for the annual minimum daily discharge.

To use it you will need to know about the index of flow statistics used in EGRET. They are called istat, and the statistics represented by istat are: (1) 1-day minimum, (2) 7-day minimum, (3) 30-day minimum, (4) median, (5) mean, (6) 30-day maximum, (7) 7-day maximum, and (8) 1-day maximum. So to run the annual minimum daily discharge statistic we would set istat = 1. To run the plotFlowTrend function in its simplest form, the only arguments we need are the eList (which contains the metadata and the discharge data) and istat.

plotFlowTrend(eList, istat = 1)

Explanation of the FlowTrend Plot, annual minimum day

The dots indicate the discharge on the minimum day of each climate year in the period of record. The solid curve is a smoothed representation of those data; it is specifically the smoother that is defined in the EGRET user guide (pages 16-18) with a 30-year window. For a record as short as this one (only 32 years) it will typically look like a straight line or a very smooth curve; for longer records it can display some substantial changes in slope and even be non-monotonic. At the top of the graph we see two pieces of information: a trend slope expressed in percent per year, and a p-value for the Mann-Kendall trend test of the data. The slope is computed using the Theil-Sen slope estimator. It is discussed on pages 266-274 of Helsel and Hirsch, 2002, which can be found here, although it is called the “Kendall-Theil Robust Line” in that text. It is calculated on the logarithms of the discharge data and then transformed to express the trend in percent per year. The p-value for the Mann-Kendall test is computed using the adjustment for serial correlation introduced in the zyp R package (David Bronaugh and Arelia Werner for the Pacific Climate Impacts Consortium (2013). zyp: Zhang + Yue-Pilon trends package. R package version 0.10-1. https://CRAN.R-project.org/package=zyp).
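If you want to reproduce the slope and p-value shown at the top of the plot outside of the plotting function, the calculation follows the same rkt and zyp calls used in the trendSortQ function above. Here is a minimal sketch, assuming annualQ is a vector of the annual minimum daily discharges and decYear holds the corresponding decimal years (both hypothetical names):

library(rkt)
library(zyp)
# Mann-Kendall test and Theil-Sen slope computed on the logs of the annual series
mk <- rkt::rkt(decYear, log(annualQ))
slopePct <- 100 * (exp(mk$B) - 1)                 # slope expressed in percent per year
pAdjusted <- zyp.zhang(log(annualQ), decYear)[6]  # p-value adjusted for serial correlation
c(slopePct = slopePct, pValue = mk$sl, pValueAdj = pAdjusted)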

A few more FlowTrend Plots

We can do some other similar plots. In this case we will do the annual median day istat = 4, the annual maximum day istat = 8, and the annual mean day istat = 5.

The annual median day

The median day is computed for each year in the record. It is the middle day, that is, 182 values had discharges lower than it and 182 values had discharges greater than it (for a leap year it is the average of the 183rd and 184th ranked values). Everything else about it is the same as the first plot. Here is the annual median day discharge record.

plotFlowTrend(eList, istat = 4)

The annual maximum day

The third plot is for the maximum day of the year. Otherwise it is exactly the same in its construction as the first two plots. Note that this is the maximum daily average discharge, and in general it will be smaller than the annual peak discharge, which is the maximum instantaneous discharge for the year. For very large rivers the annual maximum day discharge tends to be very close to the annual peak discharge and can serve as a rough surrogate for it in a trend study. For a small stream, where discharges may rise or fall by a factor of 2 or more in the course of a day, these maximum day values can be very different from the annual peak discharge.

At this point we will add one more concept, which is the period of analysis. In the first two plots the data were organized by climate year. But when we look at annual maximum or annual mean discharge, the standard approach is to use the water year. To do this we designate that we want our period of analysis to start with the month of October (month 10), by adding one more argument to our call to the function: paStart = 10. We did not need to specify paStart in the first two plots, because the default is paStart = 4. Notice that this change in the period of analysis is noted in the text on the graphic.

plotFlowTrend(eList, istat = 8, paStart = 10)

The fourth plot is the annual mean discharge. It is constructed in exactly the same manner as the previous three, but it represents the mean streamflow for all of the days in the year rather than a single order statistic such as minimum, median, or maximum. We will also use the water year for this analysis.

plotFlowTrend(eList, istat = 5, paStart = 10)

Some observations about these figures

It is interesting to note that the minimum discharge values have been trending downwards, although not a statistically significant trend. The other three discharge statistics are all trending upwards and this is most notable in terms of the annual maximum daily discharge, both in terms of slope +4.7% per year, and a p-value that is well below 0.001. This raises many questions about how the variability of discharge might be changing over time and what the drivers are of the trends that are happening in various parts of the probability distribution of discharge (which we commonly call the flow duration curve). We will turn our focus to a way of summarizing these changes after we look at a few more variations on this plotFlowTrend graphic.

Variations on the simplest example

The call to the function plotFlowTrend has many arguments (you can see them all in the code shown above) but here we will just focus on a few that may be most useful to the data analyst:

plotFlowTrend(eList, istat, startDate = NA, endDate = NA, paStart = 4, paLong = 12, window = 30, qMax = NA, qUnit = 2, runoff = FALSE)

Here is a list of the optional arguments that are available to the user.

startDate If we want to evaluate the trend for a shorter period than what is contained in the entire Daily data frame, then we can specify a different starting date. For example, if we wanted the analysis to start with Water Year 1981, then we would say: startDate = “1980-10-01”. By leaving out the startDate argument we are requesting the analysis to start where the data starts (which in this case is 1979-10-01).

endDate If we want to evaluate the trend for a period that ends before the end of the data set we can specify that with endDate. So if we wanted to end with Water Year 2009 we would say: endDate = “2009-09-30”.

paStart is the starting month of the period of analysis. For example if we were interested in the trends in streamflow in a series of months starting with August, we would say paStart = 8. The default is paStart = 4, which starts in April.

paLong is the duration of the period of analysis. So, an analysis that covers all 12 months would have paLong = 12 (the default). A period that runs for the months of August – November would have paStart = 8 and paLong = 4. See pages 14 and 15 of the user guide for more detail on paStart and paLong.

window The smoothness of the curve plotted depends on the size of the window (which is measured in years). If window were much smaller (say 10) then the curve shown would be rather jagged, and as it becomes longer it causes the curve to converge to a straight line. The default value of 30 works well and is also roughly consistent with the convention used in the climate science community when they discuss a 30-year normal period. The precise meaning of window is described in the User Guide (pages 16 – 18). It has no effect on the reported trend slope or p-value.

qMax is an argument that allows the user to set the maximum value on the vertical axis. If it is left out, the graphic scales itself. But for a set of similar graphics it may be useful to set qMax.

qUnit is an argument that allows the user to pick different discharge units. The default is qUnit = 2 which is cubic meters per second. The other most common unit to use is qUnit = 1 which is cubic feet per second.

runoff is a logical argument. The default is runoff = FALSE. In some studies it is nice to express discharge in runoff terms (discharge per unit area), and setting runoff = TRUE causes it to be expressed in terms of runoff.

Here is an example that shows the use of several of these arguments. Say we want to see what would happen if we only looked at annual maximum daily discharge from 1983 – 2010 (trimming off three low years at the start and one high year at the end), used a shorter smoothing window, and expressed our results as runoff in mm/day.

plotFlowTrend(eList, istat = 8, startDate = "1982-10-01", endDate = "2010-09-30", window = 15, runoff = TRUE)

It is worth noting here that compared to the previous plot of the annual maximum daily discharges, the removal of four particular years brought about a substantial change in the slope and the significance of the trend. This is a common issue in trend analysis with relatively short records.

The Quantile-Kendall Plot

Now we will look at a new kind of plot called the Quantile-Kendall plot. Each plotted point on the plot is a trend slope (computed in the same manner as in the previous plots) for a given order statistic. The point at the far left edge is the first order statistic, which is the annual minimum daily discharge. This is the result described on the first plot. The next point to its right is the trend slope for the second order statistic (second lowest daily discharge for each year), and it continues to the far right, which is the trend slope for the 365th order statistic (annual maximum daily discharge). Their placement with respect to the x-axis is based on the z-statistic (standard normal deviate) associated with that order statistic's daily non-exceedance probability. It is a scale used for convenience: it in no way assumes that the data follow any particular distribution, and it is simply used to provide the greatest resolution near the tails of the daily discharge distribution. The color represents the p-value for the Mann-Kendall test for trend as described above. Red indicates a trend that is significant at alpha = 0.05. Black indicates an attained significance between 0.05 and 0.1. The grey dots are trends that are not significant at the alpha level of 0.1.
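A short sketch may help clarify how those x-axis positions arise; this mirrors the freq and z columns computed by the trendSortQ function above, assuming a 365-day year:

nFreq <- 365
iRank <- 1:nFreq                # rank of each day, lowest to highest
freq  <- iRank / (nFreq + 1)    # daily non-exceedance probability
z     <- qnorm(freq)            # standard normal deviate used as the plotting position
head(cbind(iRank, freq, z))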

Here is what it looks like for the Choptank data set. What we see is that across the probability distribution the changes are generally not statistically significant, except for the highest and second highest days of the year. The trends are slightly negative at the low end of the distribution (the lowest 10%). Near the median they are positive and modestly large (around 1.5% per year). Then near the 90th percentile they drop to near zero slope, and only above about the 99th percentile do they seem to be particularly large.

plotQuantileKendall(eList)

There is one special manipulation of the data that is needed to account for leap years (this is a detail about the computation but is not crucial to understanding the plots). The 366 daily values observed in a leap year are reduced by one so that all years have 365 values. The elimination of one value is accomplished by replacing the two middle values in the ranked list of values with a single value that is the average of the two. A similar approach is used when the period of analysis is any set of months that contains the month of February: the number of leap year values is reduced by one, and the reduction takes place at the median value for the year.
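The leapEven function shown in the code above does exactly this. A minimal sketch of its behavior, using a hypothetical set of 366 ranked values:

x <- sort(rnorm(366))                # hypothetical ranked daily values for a leap year
y <- leapEven(x)
length(y)                            # 365
y[183] == (x[183] + x[184]) / 2      # TRUE: the two middle values are replaced by their mean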

Variations on the Quantile Kendall

We might be particularly interested in looking at trends in a certain season of the year. In the case of the Choptank we know that tropical storm generated floods have been particularly significant in the later part of this record, and these have been mostly in August, September and October. Or, if our interest were related to the flows happening in the spring and early summer, which is the critical period for delivery to the Chesapeake Bay (into which the Choptank River flows), then we might want to run our analysis on the period March through June. To do that we can use paStart = 3 and paLong = 4.

plotQuantileKendall(eList, paStart = 3, paLong = 4)

What we can see is that for this portion of the year the discharge trends are rather minimal and certainly not statistically significant, except that the 5 or so highest flow days in the season do show rather large trends, and the annual maximum trend is the largest of all (in percent per year) and is the only one that is significant.

There are a few other arguments for the plotQuantileKendall function. As mentioned in the discussion of the plotFlowTrend function, it has the startDate and endDate arguments. If they aren’t specified then it is assumed that the analysis covers the whole period in the Daily data frame.

There are two arguments that relate to the appearance of the graph.

legendLocation this argument simply determines where in the graph the legend is placed. It sometimes happens that the legend obscures the data so we may need to move it to another part of the graph. The argument can take on names such as “bottomleft”, “topright” or “bottomright”. The default is “topleft”. If you don’t want a legend (say you are making many Quantile Kendall plots in a single job and can’t go in and adjust each one), then you can specify legendLocation = NA.

legendSize this argument determines the size of the legend. The default is legendSize = 1.0. If you want to decrease the dimensions of the legend you decrease this value. For example a 25% reduction would be achieved by setting legendSize = 0.75. Typically one wouldn’t want to go below about 0.4.

yMax and yMin are two more optional arguments to set up the limits on the y-axis. If you don’t specify them (just leave them out) then the data will dictate the minimum and maximum values on the y-axis (it will extend slightly beyond the highest and lowest values). But if you are doing many graphs you may want them to have consistent axes, so a quick look at the graph will tell you if the trends are large or small; then you can set these two values. For example, say you want every graph to have a maximum value of 5 and a minimum value of -5; then the arguments would just be: yMax = 5, yMin = -5. There is a “fail safe” mechanism in the code if the trends are bigger than the specified values. Let’s say we set yMax = 5 but in one particular graph there is a trend slope of 7.2% per year. In this case the y-axis would extend out slightly beyond a value of 7.2. Thus, no value is ever out of range; all of the slopes will be plotted.

Here is an example bringing all of the arguments into play. It is for a case where we want to consider the time period 1982-08-01 through 2010-11-30, for only the months of August through October, and we want the legend to go in the bottom right and be only half the size of what we saw in the first example.

plotQuantileKendall(eList, startDate = "1982-08-01", endDate = "2010-11-30", paStart = 8, paLong = 3, legendLocation = "bottomright", legendSize = 0.5, yMax = 4, yMin = -4)

What this graph shows us is that unlike the year as a whole, this late summer to fall period has been one of very substantially increasing discharge. From about the median of the distribution to the top, the trends are greater than 3% per year and many are significant at least at the alpha level of 0.1. There is one feature of this figure that may look odd. That is the fact that at the lower end of the distribution there are many trend slopes that are exactly zero. Using the gray dot at the far left as an example, the trend slope estimation method uses all pairwise comparisons of the lowest discharge of each year. Because of the USGS reporting conventions, many of these discharge values cluster around a few specific values, so that in the comparisons, there are many ties. The slope estimate is the median of all the pairwise slopes and since a significant number of these slopes are zero, the slope estimate is zero.
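To see why ties produce a slope of exactly zero, recall that the Theil-Sen estimate is the median of all pairwise slopes. Here is a minimal sketch with hypothetical annual minimum values, several of which are reported as the same value:

q    <- c(0.28, 0.28, 0.31, 0.28, 0.28, 0.30, 0.28)  # hypothetical annual minimums
year <- 2001:2007
pairs  <- combn(length(q), 2)                        # all pairwise comparisons
slopes <- (q[pairs[2, ]] - q[pairs[1, ]]) / (year[pairs[2, ]] - year[pairs[1, ]])
median(slopes)   # 0: the many zero pairwise slopes sit in the middle of the sorted slopes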

Putting it all together into a set of five graphs

This post also provides an additional function that produces a total of five graphics from a single command. It does the four trend plots and then the Quantile-Kendall. It is called plotFiveTrendGraphs. The arguments are all ones that we have seen before: eList, startDate, endDate, paStart, paLong, qUnit, window, legendLocation, and legendSize. In its simplest form it would look like this (output not plotted here).

plotFiveTrendGraphs(eList)

Downloading the data for your site of interest

The steps for downloading data from USGS web services (or obtaining it from user supplied information) and also for creating and saving the eList are described on pages 4-13 of the EGRET user guide. Here is a simple example:

Say we want to look at USGS station number 05436500 (Sugar River near Brodhead, WI) and we want to consider data from Climate Years 1921 through 2016. Note that this much longer period of record causes the computations to take a good deal more time than the previous cases, but it should still take well under a minute to complete the whole job. The following commands would do what is needed.

library(EGRET)
sta <- "05436500"
param <- "00060" # this is the parameter code for daily discharge
startDate <- "1920-04-01"
endDate <- "2016-03-31"
INFO <- readNWISInfo(siteNumber = sta, parameterCd = param, interactive = FALSE)
Daily <- readNWISDaily(siteNumber = sta, parameterCd = param, startDate = startDate, endDate = endDate, verbose = FALSE)
eList <- as.egret(INFO, Daily)
plotFiveTrendGraphs(eList, legendLocation = "bottomleft")

Final thoughts

This last set of plots is particularly interesting. What we see is that for the lowest 90 percent of the distribution, discharges have been rising over most of this record, and particularly so since about 1950. But in the highest 1% of the distribution discharges have been falling throughout the record. As a consequence the overall variability of streamflow in this agricultural watershed has generally been declining over time. This is generally thought to be related to the increasing use of conservation practices in the watershed.

It is worth noting that we can express the percentage changes in other ways than as percent per year, for example percentage change per decade or percentage change over the entire period of record. Take, for example, the estimated trend slope for the median in this last example. It is 0.68% per year. If we were to express that as percentage change per decade it would be 7% per decade (the change can be computed as 1.0068^10, or 1.070, which is a 7% increase). If we expressed it for the entire 96 year record it would be about a 92% increase over that period (computed as 1.0068^96, which is 1.9167, or an increase of about 92%).
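These conversions are simple to check directly; a quick sketch using the 0.68 percent per year slope quoted above:

slopePerYear <- 0.0068
(1 + slopePerYear)^10 - 1    # about 0.070, i.e. roughly 7% per decade
(1 + slopePerYear)^96 - 1    # about 0.92, i.e. roughly a 92% increase over the 96-year record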

These graphical tools are one way of summarizing what we might call a signature of change for any given watershed. It shows the changes, or lack of changes, that have taken place through all parts of the probability distribution of streamflow. This has potential as a tool for evaluating hydrologic or linked climate and hydrologic models. Observing the patterns on these graphs for the actual data versus the patterns seen when simulated data are used, can provide insights on the ability of the models used to project hydrologic changes into the future.

If you have questions or comments on the concepts or the implementation please contact the author rhirsch@usgs.gov.


To leave a comment for the author, please follow the link and comment on their blog: The USGS OWI blog.

Native scoring in SQL Server 2017 using R

Mon, 05/28/2018 - 22:13

(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

Native scoring is a much overlooked feature in SQL Server 2017 (available only under Windows and only on-prem), that provides scoring and predicting in pre-build and stored machine learning models in near real-time.

Icons made by Smashicons from www.flaticon.com, licensed under CC BY 3.0

 

What counts as real-time depends on your line of business, so I will not try to pin down a definition here; but scoring 10,000 rows in about a second from a mediocre client computer (similar to mine) certainly qualifies.

Native scoring in SQL Server 2017 comes with a couple of limitations, but also with a lot of benefits. The limitations are:

  • currently supports only SQL server 2017 and Windows platform
  • trained model should not exceed 100 MiB in size
  • Native scoring with the PREDICT function supports only the following algorithms from the RevoScaleR library:
    • rxLinMod (linear model as linear regression)
    • rxLogit (logistic regression)
    • rxBTrees (Parallel external memory algorithm for Stochastic Gradient Boosted Decision Trees)
    • rxDTree (External memory algorithm for Classification and Regression Trees)
    • rxDForest (External memory algorithm for Classification and Regression Decision Trees)

Benefits of using PREDICT function for native scoring are:

  • No configuration of R or ML environment is needed (assuming that the trained models are already stored in the database),
  • Code is cleaner, more readable, and no additional R code is needed when performing scoring,
  • No R engine is called at run time, so there is a tremendous reduction of CPU and I/O costs, as well as no external calls,
  • A client or server running native scoring with the PREDICT function does not need an R engine installed, because it uses C++ libraries from Microsoft that can read the serialized model stored in a table, deserialize it, and generate predictions, all without the need for R

Overall, if you are looking for faster predictions in your enterprise and would love faster code and solution deployment, especially integration with other applications or building an API in your ecosystem, native scoring with the PREDICT function will surely be an advantage for you. Although not all predictions/scores are supported, the majority of predictions can be done using regression or decision tree models (it is estimated that these two families of algorithms, including derivatives of regression models and ensemble methods, are used in 85% of predictive analytics).

To put the PREDICT function to the test, I have deliberately taken a semi-large dataset available in the RevoScaleR package in R – AirlineDemoSmall.csv. Using a simple BULK INSERT, we get the data into the database:

BULK INSERT ArrDelay
FROM 'C:\Program Files\Microsoft SQL Server\140\R_SERVER\library\RevoScaleR\SampleData\AirlineDemoSmall.csv'
WITH ( FIELDTERMINATOR = ','
      ,ROWTERMINATOR = '0x0a'
      ,FIRSTROW = 2
      ,CODEPAGE = 'RAW');

Once the data is in the database, I will split it into training and test subsets.

SELECT TOP 20000 *
  INTO ArrDelay_Train
  FROM ArrDelay
 ORDER BY NEWID()
-- (20000 rows affected)

SELECT *
  INTO ArrDelay_Test
  FROM ArrDelay AS AR
 WHERE NOT EXISTS (SELECT *
                     FROM ArrDelay_Train AS ATR
                    WHERE ATR.arrDelay = AR.arrDelay
                      AND ATR.[DayOfWeek] = AR.[DayOfWeek]
                      AND ATR.CRSDepTime = AR.CRSDepTime)
-- (473567 rows affected)

And the structure of the dataset is relatively simple:

ArrDelay   CRSDepTime   DayOfWeek
       1     9,383332           3
       4    18,983334           4
       0    13,883333           4
      65    21,499998           7
      -3     6,416667           1

Creating models

So we will create two essentially identical models using the rxLinMod function with the same formula, but one with the additional parameter for real-time scoring set to TRUE.

-- regular model creation
DECLARE @model VARBINARY(MAX);
EXECUTE sp_execute_external_script
  @language = N'R'
 ,@script = N'
    arrDelay.LM <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime, data = InputDataSet)
    model <- rxSerializeModel(arrDelay.LM)'
 ,@input_data_1 = N'SELECT * FROM ArrDelay_Train'
 ,@params = N'@model varbinary(max) OUTPUT'
 ,@model = @model OUTPUT
INSERT [dbo].arrModels([model_name], [native_model])
VALUES ('arrDelay.LM.V1', @model);

-- Model for Native scoring
DECLARE @model VARBINARY(MAX);
EXECUTE sp_execute_external_script
  @language = N'R'
 ,@script = N'
    arrDelay.LM <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime, data = InputDataSet)
    model <- rxSerializeModel(arrDelay.LM, realtimeScoringOnly = TRUE)'
 ,@input_data_1 = N'SELECT * FROM ArrDelay_Train'
 ,@params = N'@model varbinary(max) OUTPUT'
 ,@model = @model OUTPUT
INSERT [dbo].arrModels([model_name], [native_model])
VALUES ('arrDelay.LM.NativeScoring.V1', @model);

Both models will have the same training set and will be stored in a table for future scoring. Upon first inspection, we can see there is a difference in the model size:

Scoring Models

Both models took relatively the same amount of time to train and to store in the table. Both can also be created on R Machine Learning Server and stored in the same way (with or without the argument realtimeScoringOnly). The model size gives you an idea of why and how real-time scoring can be achieved: keep your model as small as possible. Both models will give you exactly the same prediction scores, just that the native scoring will be much faster. Note also, if you are planning to do any text analysis with real-time scoring, keep in mind the 100 MiB limitation, as text prediction models often exceed it.
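If you want to check the size difference on the R side before storing the models, here is a minimal sketch, assuming the arrDelay.LM model from the code above is in memory on a machine with the RevoScaleR library:

library(RevoScaleR)
# rxSerializeModel returns a raw vector, so length() gives the size in bytes
fullModel     <- rxSerializeModel(arrDelay.LM)
realtimeModel <- rxSerializeModel(arrDelay.LM, realtimeScoringOnly = TRUE)
length(fullModel)        # full serialized model
length(realtimeModel)    # real-time-scoring-only model, typically much smaller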

To compare the execution of the scoring models, I will score both in the “traditional way”, using the external procedure sp_execute_external_script, and using the PREDICT function.

------------------------------------
-- Using sp_execute_external_script
------------------------------------
DECLARE @model VARBINARY(MAX) = (SELECT native_model FROM arrModels WHERE model_name = 'arrDelay.LM.V1')

EXEC sp_execute_external_script
  @language = N'R'
 ,@script = N'
    modelLM <- rxUnserializeModel(model)
    OutputDataSet <- rxPredict(model = modelLM,
                               data = ArrDelay_Test,
                               type = "link",
                               predVarNames = "ArrDelay_Pred",
                               extraVarsToWrite = c("ArrDelay","CRSDepTime","DayOfWeek"))'
 ,@input_data_1 = N'SELECT * FROM dbo.ArrDelay_Test'
 ,@input_data_1_name = N'ArrDelay_Test'
 ,@params = N'@model VARBINARY(MAX)'
 ,@model = @model
WITH RESULT SETS ((
   AddDelay_Pred FLOAT
  ,ArrDelay INT
  ,CRSDepTime NUMERIC(16,5)
  ,[DayOfWeek] INT
))
-- (473567 rows affected)
-- Duration 00:00:08

---------------------------
-- Using Real Time Scoring
---------------------------
DECLARE @model VARBINARY(MAX) = (SELECT native_model FROM arrModels WHERE model_name = 'arrDelay.LM.NativeScoring.V1');

SELECT NewData.*, p.*
  FROM PREDICT(MODEL = @model, DATA = dbo.ArrDelay_Test AS NewData)
  WITH (ArrDelay_Pred FLOAT) AS p;
GO
-- (473567 rows affected)
-- Duration 00:00:04

The two examples differ in structure, but the PREDICT function looks much more readable and neater. Time performance is also on the PREDICT function’s side, as it returns the predictions much faster.

In addition, I have mentioned that the PREDICT function does not need the R engine or the Launchpad service to be running in the same environment where the code is executed. To put this to the test, I will simply stop the SQL Server Launchpad service:

After executing the first set of predictions using sp_execute_external_script, SQL Server or Machine Learning Server will notify you that the service is not running:

whereas the PREDICT function will work flawlessly.

Verdict

For sure, faster predictions will be very welcome in the gaming industry, in transport, in the utility and metal industries, in finance, as well as in any other field where real-time predictions against OLTP systems are appreciated. With light-weight models and good algorithm support, I would certainly give it an additional thought, especially if you see good potential in faster and near real-time predictions.

As always, the complete code and data sample are available on GitHub. Happy coding!


To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.

Take Screenshot of Webpage using R

Mon, 05/28/2018 - 21:12

(This article was first published on ListenData, and kindly contributed to R-bloggers)

Programmatically taking screenshots of a web page is essential in a testing environment to see how the page renders. The same capability can also be used for automation, like getting a screenshot of a news website into your inbox every morning or generating a report of candidates’ GitHub activities. But this wasn’t possible from the command line until the rise of headless browsers and the JavaScript libraries supporting them. Even when such JavaScript libraries were made available, R programmers did not have any option to integrate that functionality into their code.
That is where webshot comes in: an R package that helps R programmers take web screenshots programmatically, with the help of PhantomJS running in the backend.

Take Screenshot from R
What is PhantomJS?

PhantomJS is a headless webkit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG.

PhantomJS is an optimal solution for the following:

  • Headless website testing
  • Screen Capture
  • Page Automation
  • Network Monitoring

Webshot : R Package 

The webshot package allows users to take screenshots of web pages from R with the help of PhantomJS. It also can take screenshots of R Shiny App and R Markdown Documents (both static and interactive).

Install and Load Package

The stable version of webshot is available on CRAN hence can be installed using the below code:

install.packages('webshot')
library('webshot')

Also, the latest development version of webshot is hosted on github and can be installed using the below code:

# install.packages('devtools')
devtools::install_github('wch/webshot')

Initial Setup

As we saw above, the R package webshot works with PhantomJS in the backend, hence it is essential to have PhantomJS installed on the local machine where the webshot package is used. To assist with that, webshot itself has an easy function to get PhantomJS installed on your machine.

webshot::install_phantomjs()

The above function automatically downloads PhantomJS from its website and installs it. Please note this is only a first-time setup; once both webshot and PhantomJS are installed, these two steps can be skipped when using the package as described in the sections below.

Now, webshot package is installed and setup and is ready to use. To start with let us take a PDF copy of a web page.

Screenshot Function

The webshot package provides one simple function, webshot(), that takes a webpage URL as its first argument and saves the screenshot under the file name given as its second argument. It is important to note that the filename includes a file extension, like ‘.jpg’, ‘.png’ or ‘.pdf’, based on which the output file is rendered. Below is the basic structure of the function call:

library(webshot)

#webshot(url, filename.extension)
webshot("https://www.listendata.com/", "listendata.png")

If no folder path is specified along with the filename, the file is downloaded in the current working directory which can be checked with getwd().
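If you prefer to save the file somewhere else, you can simply pass a full path as the filename; here is a small sketch with a hypothetical folder:

# saves the screenshot to a specific (hypothetical) folder instead of getwd()
webshot("https://www.listendata.com/", "C:/screenshots/listendata.png")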

Now that we understand the basics of the webshot() function, it is time to begin with our cases – starting with downloading/converting a webpage as a PDF copy.

Case #1: PDF Copy of WebPage

Let us assume, we would like to download Bill Gates’ notes on Best Books of 2017 as a PDF copy.

#loading the required library
 library(webshot)

#PDF copy of a web page / article
webshot("https://www.gatesnotes.com/About-Bill-Gates/Best-Books-2017",
        "billgates_book.pdf",
        delay = 2)

The above code generates a PDF whose (partial) screenshot is below:

Snapshot of PDF Copy

Dissecting the above code, we can see that the webshot() function has three arguments supplied to it.

  1. URL from which the screenshot has to be taken. 
  2. Output Filename along with its file extensions. 
  3. Time to wait before taking screenshot, in seconds. Sometimes a longer delay is needed for all assets to display properly.

Thus, a webpage can be converted/downloaded as a PDF programmatically in R.

Case #2: Webpage Screenshot (Viewport Size)

Now, I’d like to get an automation script running to take a screenshot of a news website and probably send it to my inbox so I can see the headlines without going to the browser. Here we will see how to get a simple screenshot of livemint.com, an Indian news website.

#Screenshot of Viewport
webshot('https://www.livemint.com/', 'livemint.png', cliprect = 'viewport')

While the first two arguments are similar to the above function, there’s a new third argument cliprect which specifies the size of the Clipping rectangle.

If cliprect is unspecified, the screenshot of the complete web page is taken (like in the above case). Since we are interested in only the latest news (which is usually at the top of the website), we use cliprect with the value viewport, which clips only the viewport part of the browser, as below.

Screenshot of Viewport of Browser

Case #3: Multiple Selector Based Screenshots

So far we have taken simple screenshots of whole pages and dealt with one screenshot and one file, but that is not what usually happens when you are dealing with automation or performing something programmatically. In most cases we end up performing more than one action, hence this case deals with taking multiple screenshots and saving multiple files. Instead of taking screenshots of different URLs (which is quite straightforward), we will take screenshots of different sections of the same web page, using different CSS selectors, and save them in respective files.

#Multiple Selector Based Screenshots
webshot("https://github.com/hadley",
        file = c("organizations.png", "contributions.png"),
        selector = list("div.border-top.py-3.clearfix", "div.js-contribution-graph"))

In the above code, we take screenshots of two CSS selectors from the GitHub profile page of Hadley Wickham and save them in two PNG files – organizations.png and contributions.png.

Contributions.png

Organizations.png

Thus, we have seen how to use the R package webshot for taking screenshots programmatically in R. Hope this post helps fuel your automation needs and helps your organisation improve its efficiency.


About Author:

Deepanshu founded ListenData with a simple objective – Make analytics easy to understand and follow. He has over 7 years of experience in data science and predictive modeling. During his tenure, he has worked with global clients in various domains.

Let's Get Connected: LinkedIn


To leave a comment for the author, please follow the link and comment on their blog: ListenData.

ShinyProxy 1.1.1 released!

Mon, 05/28/2018 - 15:00

(This article was first published on Open Analytics, and kindly contributed to R-bloggers)

ShinyProxy is a novel, open source platform to deploy Shiny apps for the enterprise
or larger organizations.

Theming

ShinyProxy 1.1.1 is in essence a maintenance release, but there is one new feature that has been on the wish list of our users for a long time: the possibility of theming the landing page of ShinyProxy which displays the overview of the Shiny apps.

The standard display when using the ShinyProxy demo image from the Getting Started guide is a plain listing:

With an HTML-template makeover, it can look like this:

or maybe you prefer to have the apps displayed in two columns?

As one can see, it is possible to use images or icons for the apps and the display and text can be entirely
customized with CSS. The templating technology used for the customization of the landing page is Thymeleaf, a mature and amply documented framework for natural HTML templates.

To help you get started customizing your own ShinyProxy instance, a fully worked theming example is available in our shinyproxy-configuration-examples repository on Github.

Miscellaneous fixes

In our previous release we made a major leap to run Shiny apps at datacenter scale by adding support for a Kubernetes back-end (see this blog post). We have used it for customers that roll out internet-facing Shiny apps with high numbers of concurrent users and needs for automated deployment. Following community feedback, this release has an important fix related to deletion of Kubernetes services when shutting down containers.

For the users that embed ShinyProxy in larger application frameworks, we released a fix for passing arguments to Shiny apps.
Some other small fixes include improved browser display of the apps and a more strict adherence to YAML for the ShinyProxy configuration file.

For the new features, documentation has been updated on https://shinyproxy.io and as always community support on this new release is available at

https://support.openanalytics.eu

Don’t hesitate to send in questions or suggestions and have fun with ShinyProxy!


To leave a comment for the author, please follow the link and comment on their blog: Open Analytics.

Statistics Sunday: Two R Packages to Check Out

Sun, 05/27/2018 - 16:27

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)

I’m currently out of town and not spending as much time on my computer as I have over the last couple months. (It’s what happens when you’re the only one in your department at work and also most of your hobbies involve a computer.) But I wanted to write up something for Statistics Sunday and I recently discovered two R packages I need to check out in the near future.

The first is called echor, which allows you to search and download data directly from the US Environmental Protection Agency (EPA) Enforcement and Compliance History Online (ECHO) database, using the ECHO API. According to the vignette, linked above, “ECHO provides data for:

  • Stationary sources permitted under the Clean Air Act, including data from the National Emissions Inventory, Greenhouse Gas Reporting Program, Toxics Release Inventory, and Clean Air Markets Division Acid Rain Program and Clean Air Interstate Rule. 
  • Public drinking water systems permitted under the Safe Drinking Water Act, including data from the Safe Drinking Water Information System. 
  • Hazardous Waste Handlers permitted under the Resource Conservation and Recovery Act, with data drawn from the RCRAInfo data system. 
  • Facilities permitted under the Clean Water Act and the National Pollutant Discharge Elimination Systems (NPDES) program, including data from EPA’s ICIS-NPDES system and possibly waterbody information from EPA’s ATTAINS data system.”
The second package is papaja, or Preparing APA Journal Articles, which uses RStudio and R Markdown to create APA-formatted papers. Back when I wrote APA style papers regularly, I had Word styles set up to automatically format headers and subheaders, but properly formatting tables and charts was another story. This package promises to do all of that. It’s still in development, but you can find out more about it here and here.

I have some fun analysis projects in the works! Stay tuned.


To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.

Intersecting points and overlapping polygons

Sun, 05/27/2018 - 15:01

(This article was first published on R – Insights of a PhD, and kindly contributed to R-bloggers)

I’ve been doing some spatial stuff of late, and the next little step will involve intersecting points with possibly many overlapping polygons. The sp package has a function called over which returns the polygons that a point intersects with. The catch, though, is that it only returns the last (highest numerical value) polygon a point overlaps with. So it’s not so useful if you have many overlapping polygons. A little playing, and I’ve overcome that problem…

Here’s a toy example.

Create a couple of polygons and put them into a SpatialPolygons object.

library(sp)

p1 <- matrix(c(1,1, 2,1, 4,2, 3,2), ncol = 2, byrow = TRUE)
p2 <- matrix(c(2.2,1, 3,1, 3,2, 3,3, 2.8,3), ncol = 2, byrow = TRUE)

p1s <- Polygons(list(Polygon(p1)), 3)
p2s <- Polygons(list(Polygon(p2)), 4)

sps <- SpatialPolygons(list(p1s, p2s))

Define a few points and put them in a SpatialPoints object

point <- matrix(c(2.5, 1.5, 3.2, 1.75, 2,3, 1.5, 1.25, 2.75, 2.5), ncol = 2, byrow = TRUE)
points <- SpatialPoints(point)

We can plot them…(not shown)

plot(sps, border = c("black", "blue"))
plot(points, add = TRUE)

As here we have the issue:

over(points, sps)
 1  2  3  4  5 
 2  1 NA  1  2 

only returning a single “hit” per point (point 1 overlaps with both polygons 1 and 2).

To get around this, we can rip the individual polygons out of the SpatialPolygons object and put them in a list, converting the individual polygons into SpatialPolygons along the way:

sps2 <- lapply(sps@polygons, function(x) SpatialPolygons(list(x)))

Then lapply-ing over sps2 we can see which polygons each point intersects…

lapply(sps2, function(x) over(points, x))
[[1]]
 1  2  3  4  5 
 1  1 NA  1 NA 

[[2]]
 1  2  3  4  5 
 1 NA NA NA  1 

And now we see that point one overlaps with both polygons.
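If you prefer a single object rather than a list, the same idea can be wrapped with sapply to give a logical point-by-polygon matrix; here is a minimal sketch:

hits <- sapply(sps2, function(x) !is.na(over(points, x)))  # rows = points, columns = polygons
hits
which(hits[1, ])   # polygons intersected by point 1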

Code >>>here<<<

 

 


To leave a comment for the author, please follow the link and comment on their blog: R – Insights of a PhD.
