R-bloggers: R news and tutorials contributed by 750 R bloggers

GEDCOM Reader for the R Language: Analysing Family History


(This article was first published on R Language – The Lucid Manager, and kindly contributed to R-bloggers)

Understanding who you are is strongly related to understanding your family history. Discovering ancestors is now a popular hobby, as many archives are available on the internet. The GEDCOM format provides a standardised way to store information about ancestors. This article shows how to develop a GEDCOM reader using the R language.

Download the Code

The text in this article explains the basic principles to write a GEDCOM reader. You can download the latest version from GitHub.

The GEDCOM format

The GEDCOM format is not an ideal way to store information, but it has become the de-facto standard for family history. This format includes metadata and two sets of data. The file contains a list of the individuals, and it lists the families to which they belong.

The basic principle is that each line has a level, indicated by the first digit. At level zero, we find metadata and the individuals and their family. At level one, we see the various types of data, such as births, deaths and marriages. The deeper levels provide the data for these events.
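To make the level structure concrete, here is a tiny, invented GEDCOM fragment held in an R character vector (the name, date and place are made up for illustration); the level of each line is simply its leading digit:

library(stringr)

## A minimal, made-up GEDCOM fragment: an individual record at level 0,
## a birth event at level 1 and the event details at level 2.
ged <- c("0 @I1@ INDI",
         "1 NAME Jane /Doe/",
         "1 BIRT",
         "2 DATE 12 MAR 1901",
         "2 PLAC Amsterdam")

## Extract the level from the first character of each line.
data.frame(level = as.integer(str_sub(ged, 1, 1)), content = ged)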

Heiner Eichmann maintains a website that explains the format and provides some examples of files to help you understand the syntax.

The GEDCOM format is not only old in the way it stores data, but it is also limited in the types of human relationships. These files only store genetic relationships between people and assume that these relationships are marriages between a wife and a husband. Human relationships are, however, a lot more complicated than the genetic relationships between children and their parents, grandparents and ancestors.

These issues aside, all genealogical software can export a file to GEDCOM. The next section shows how to create a basic GEDCOM reader using the stringr, tibble and dplyr packages from the Tidyverse.

GEDCOM Reader

The read.gedcom() function takes a GEDCOM file as input and delivers a data frame (tibble) with basic information:

  • ID
  • Full name
  • Gender
  • Birthdate and place
  • Father
  • Mother
  • Death date and place

This code can easily be expanded to include further fields by adding lines to the while loops and including the fields in the data frame.

The first lines read the file and set up the data frame. The extract() function extracts an individual’s ID from a line in the file. The for loop runs through each line of the GEDCOM file. When the start of a new individual is detected, the GEDCOM reader collects the relevant information.

Births and christenings are considered equal to simplify the data. In older data, we often only know one or the other. The function looks for the start of a family. It extracts the husband and wife and assigns these as parents to each of the children. The last section cleans the data and returns the result.

## Read GEDCOM file
## The Devil is in the Data
## lucidmanager.org/data-science
## Dr Peter Prevos

read.gedcom <- function(gedcom.loc) {
    require(stringr)
    require(tibble)
    require(dplyr)
    gedcom <- str_squish(readLines(gedcom.loc))
    idv <- sum(grepl("^0.*INDI$", gedcom))
    fam <- sum(grepl("^0.*FAM$", gedcom))
    cat(paste("Individuals: ", idv, "\n"))
    cat(paste("Families: ", fam, "\n"))
    family <- tibble(id = NA, Full_Name = NA, Gender = NA,
                     Birth_Date = NA, Birth_Place = NA,
                     Father_id = NA, Mother_id = NA,
                     Death_Date = NA, Death_Place = NA)
    ## Extract data
    extract <- function(line, type) {
        str_trim(str_sub(line, str_locate(line, type)[2] + 1))
    }
    id <- 0
    for (l in 1:length(gedcom)) {
        if (str_detect(gedcom[l], "^0") & str_detect(gedcom[l], "INDI$")) {
            id <- id + 1
            family[id, "id"] <- unlist(str_split(gedcom[l], "@"))[2]
            l <- l + 1
            while (!str_detect(gedcom[l], "^0")) {
                if (grepl("NAME", gedcom[l]))
                    family[id, "Full_Name"] <- extract(gedcom[l], "NAME")
                if (grepl("SEX", gedcom[l]))
                    family[id, "Gender"] <- extract(gedcom[l], "SEX")
                l <- l + 1
                if (grepl("BIRT|CHR", gedcom[l])) {
                    l <- l + 1
                    while (!str_detect(gedcom[l], "^1")) {
                        if (grepl("DATE", gedcom[l]))
                            family[id, "Birth_Date"] <- extract(gedcom[l], "DATE")
                        if (grepl("PLAC", gedcom[l]))
                            family[id, "Birth_Place"] <- extract(gedcom[l], "PLAC")
                        l <- l + 1
                    }
                }
                if (grepl("DEAT|BURI", gedcom[l])) {
                    l <- l + 1
                    while (!str_detect(gedcom[l], "^1")) {
                        if (grepl("DATE", gedcom[l]))
                            family[id, "Death_Date"] <- extract(gedcom[l], "DATE")
                        if (grepl("PLAC", gedcom[l]))
                            family[id, "Death_Place"] <- extract(gedcom[l], "PLAC")
                        l <- l + 1
                    }
                }
            }
        }
        if (str_detect(gedcom[l], "^0") & str_detect(gedcom[l], "FAM")) {
            l <- l + 1
            while (!str_detect(gedcom[l], "^0")) {
                if (grepl("HUSB", gedcom[l]))
                    husband <- unlist(str_split(gedcom[l], "@"))[2]
                if (grepl("WIFE", gedcom[l]))
                    wife <- unlist(str_split(gedcom[l], "@"))[2]
                if (grepl("CHIL", gedcom[l])) {
                    child <- which(family$id == unlist(str_split(gedcom[l], "@"))[2])
                    family[child, "Father_id"] <- husband
                    family[child, "Mother_id"] <- wife
                }
                l <- l + 1
            }
        }
    }
    family %>%
        mutate(Full_Name = gsub("/", "", str_trim(Full_Name)),
               Birth_Date = as.Date(family$Birth_Date, format = "%d %b %Y"),
               Death_Date = as.Date(family$Death_Date, format = "%d %b %Y")) %>%
        return()
}

Analysing the data

There are many websites with GEDCOM files of family histories of famous and not so famous people. The Famous GEDCOMs website has a few useful examples to test the GEDCOM reader.

Once the data is in a data frame, you can analyse it any way you please. The code below downloads a file with the presidents of the US, with their ancestors and descendants. The alive() function filters the people who were alive on a certain date. For people without a birth date, it assumes a maximum age of 100 years.

The histogram shows the distribution of the birth years of all the people in the presidents file.

These are just some random examples of how to analyse family history data with this GEDCOM reader. The next article will explain how to plot a population pyramid using this data. A future article will discuss how to visualise the structure of family history.

Birth years of people in the presidents file

## Basic family history statistics
library(tidyverse)
library(lubridate)
source("read.gedcom.R")

presidents <- read.gedcom("https://webtreeprint.com/tp_downloader.php?path=famous_gedcoms/pres.ged")

filter(presidents, grepl("Jefferson", Full_Name))

mutate(presidents, Year = year(Birth_Date)) %>%
    ggplot(aes(Year)) +
    geom_histogram(binwidth = 10, fill = "#6A6A9D", col = "white") +
    labs(title = "Birth years in the presidents file")
ggsave("../../Genealogy/years.png")

alive <- function(population, census_date) {
    max_date <- census_date + 100 * 365.25
    filter(population, (is.na(Birth_Date) &
                          (Death_Date <= max_date & Death_Date >= census_date)) |
                       (Birth_Date <= census_date & Death_Date >= census_date)) %>%
        arrange(Birth_Date) %>%
        mutate(Age = as.numeric(census_date - Birth_Date) / 365.25) %>%
        return()
}

alive(presidents, as.Date("1840-03-07"))

The post GEDCOM Reader for the R Language: Analysing Family History appeared first on The Lucid Manager.



Adding Syntax Highlight


(This article was first published on R on notast, and kindly contributed to R-bloggers)

Syntax highlighting

Previously, I posted entries without any syntax highlighting, as I was satisfied using basic blogdown and Hugo functions, until a Disqus commenter on the previous post suggested using syntax highlighting. Thus, I tasked myself to learn more about syntax highlighting and to implement it in future posts. Now I’d like to share what I’ve learned.

There are various ways to highlight syntax in Hugo, but the preferred approach for blogdown is to use Highlight.js. Interestingly, Yihui himself is not too particular about the specifics nor the excessiveness of syntax highlighting.

Themes with embedded syntax highlight feature

Some Hugo themes have built-in syntax highlighting functions. For instance, blogdown’s default theme, hugo-lithium, has a Highlight.js option readily available.

For the hugo-lithium theme, you can access the Highlight.js settings in the config.toml file.

highlightjsVersion = "9.12.0" highlightjsCDN = "//cdnjs.cloudflare.com/ajax/libs" highlightjsLang = ["r", "yaml"] highlightjsTheme = "github"

highlightjsVersion refers to the version of Highlight.js.

highlightjsCDN refers to the CDN provider of Highlight.js. For each CDN provider and version number, you can check whether the coding language you intend to highlight and the style of highlighting you want are available.

highlightjsLang refers to the coding languages to be highlighted. By default, r and yaml are highlighted. You can add other languages to be highlighted, as long as the CDN provider and version number support the highlighting of those languages.
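For example, to also highlight Python (assuming the CDN provider and version you point to include a Python language definition), the list could be extended like this:

highlightjsLang = ["r", "yaml", "python"]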

highlightjsTheme refers to the colour theme for the highlighted syntax. You can preview the various themes for different languages on the Highlight.js demo page (https://highlightjs.org/static/demo/) before deciding which theme to adopt for your own site.

Themes without embedded syntax highlight feature

There are many themes which do not have built-in highlighting functions, including the Mainroad theme which I’m using. There are two approaches to adding syntax highlighting for these Hugo themes.

blogdown textbook approach
  1. In the book, Xie, Thomas and Hill recommend adding a short Highlight.js snippet (see the sketch after these steps)

to the head_custom.html file. For the Mainroad theme, I added the script to the bottom of the head.html file. You can change the colour theme by replacing the github theme with your desired theme.

  2. Next, they require you to add

hljs.configure({languages: []});
hljs.initHighlightingOnLoad();

to the foot_custom.html file. For the Mainroad theme, I added the script to the bottom of the footer.html file.
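For reference, the two additions presumably look something like the following, based on the blogdown book; the CDN host, version number and github theme below are assumptions that should be checked against the book. In head_custom.html (head.html for Mainroad):

<link rel="stylesheet" href="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/styles/github.min.css">
<script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/highlight.min.js"></script>
<script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/languages/r.min.js"></script>

And in foot_custom.html (footer.html for Mainroad), the configuration lines from step 2 go inside a script tag:

<script>
hljs.configure({languages: []});
hljs.initHighlightingOnLoad();
</script>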

Amber Thomas’s approach
  1. Download Highlight.js for R.

  2. From the file you downloaded, copy the highlight.pack.js file and paste it into the js folder for your Hugo theme. For the Mainroad theme, I accessed it via themes -> Mainroad -> static -> js.

  3. From the file you downloaded, go to the styles subfolder, copy the css file of your desired syntax colour theme and paste it into the css folder for your Hugo theme. For the Mainroad theme, I found it via themes -> Mainroad -> static -> css.

  4. Add this
hljs.initHighlightingOnLoad();

to the header.html file. For the Mainroad theme, I discovered the file via themes -> Mainroad -> layouts -> partials. You can change the github-gist theme to your selected theme (a sketch of the full addition, including the surrounding HTML tags, is given below). As the Hugo Mainroad theme displays code chunks in a faint shade of grey, I chose the routeros highlighter, as it has a similar light grey background which complements the Mainroad theme.
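For step 4, the full addition to header.html presumably includes the surrounding HTML tags as well; a sketch, assuming the js and css locations from steps 2 and 3 and the default github-gist stylesheet:

<script src="/js/highlight.pack.js"></script>
<link rel="stylesheet" href="/css/github-gist.css">
<script>hljs.initHighlightingOnLoad();</script>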

Verify syntax highlighting

Apart from visually checking the changes on your site, you can go geek mode and verify that the highlighted syntax is based on your selected Highlight.js theme.
If you are using Microsoft Edge, select a paragraph of plain text, right-click and choose “Inspect element”. You will see two boxes; examine the one which has Styles as one of its sub-tabs. In that sub-tab, you will notice that the style comes from the site’s own style.css. Now do the same for a chunk of R script and you will see that the style comes from the Highlight.js theme; for this blog, it is routeros.css.



Germination data and time-to-event methods: comparing germination curves


(This article was first published on R on The broken bridge between biologists and statisticians, and kindly contributed to R-bloggers)

Very often, seed scientists need to compare the germination behaviour of different seed populations, e.g., different plant species, or one single plant species submitted to different temperatures, light conditions, priming treatments and so on. How should such a comparison be performed?

Let’s take a practical approach and start from an appropriate example: a few years ago, some colleagues studied the germination behaviour of seeds of a plant species (Verbascum arcturus, BTW…) in different conditions. In detail, they considered the factorial combination of two storage periods (LONG and SHORT storage) and two temperature regimes (FIX: constant daily temperature of 20°C; ALT: alternating daily temperature regime, with 25°C during daytime and 15°C during night time, with a 12:12h photoperiod). If you are a seed scientist and are interested in this experiment, you’ll find the details in Catara et al. (2016).

If you are not a seed scientist, you may wonder why my colleagues made such an assay; well, there is evidence that, for some plant species, the germination ability improves over time after seed maturation. Therefore, if we take seeds and store them for different periods of time, there might be an effect on their germination traits. Likewise, there is also evidence that some seeds may not germinate if they are not submitted to daily temperature fluctuations. These mechanisms are very interesting, as they permit the seed to recognise that the environmental conditions are favourable for seedling survival. My colleagues wanted to discover whether this was the case for Verbascum.

Let’s go back to our assay: the experimental design consisted of four combinations (LONG-FIX, LONG-ALT, SHORT-FIX and SHORT-ALT) and four replicates for each combination. One replicate consisted of a Petri dish, that is, a small plastic box containing humid blotting paper, with 25 seeds of Verbascum. In all, there were 16 Petri dishes, put in climatic chambers with the appropriate conditions. During the assay, my colleagues made daily inspections: germinated seeds were counted and removed from the dishes. Inspections were made for 15 days, until no more germinations could be observed.

The dataset is available from a GitHub repository: let’s load it and have a look.

dataset <- read.csv("https://raw.githubusercontent.com/OnofriAndreaPG/agroBioData/master/TempStorage.csv",
                    header = T, check.names = F)
head(dataset)
##   Dish Storage Temp 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 1    1     Low  Fix 0 0 0 0 0 0 0 0 3  4  6  0  1  0  3
## 2    2     Low  Fix 0 0 0 0 1 0 0 0 2  7  2  3  0  5  1
## 3    3     Low  Fix 0 0 0 0 1 0 0 1 3  5  2  4  0  1  3
## 4    4     Low  Fix 0 0 0 0 1 0 3 0 0  3  1  1  0  4  4
## 5    5    High  Fix 0 0 0 0 0 0 0 0 1  2  5  4  2  3  0
## 6    6    High  Fix 0 0 0 0 0 0 0 0 2  2  7  8  1  2  1

We have one row per Petri dish; the first three columns show, respectively, dish number, storage and temperature conditions. The next 15 columns represent the inspection times (from 1 to 15) and contain the counts of germinated seeds. The research question is:

Is germination behaviour affected by storage and temperature conditions?

Response feature analyses

One possible line of attack is to take a summary measure for each dish, e.g. the total number of germinated seeds. Taking a single value for each dish brings us back to more common methods of data analysis: for example, we can fit some sort of GLM to test the significance of effects (storage, temperature and their interaction), within a fully factorial design.
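As a concrete illustration of this summary-measure approach (a minimal sketch, not part of the original analysis, assuming the wide 'dataset' object loaded above), we could model the total number of germinated seeds out of 25 per dish with a binomial GLM:

## Total germinated seeds per dish (columns 4 to 18 hold the daily counts)
germTot <- rowSums(dataset[, 4:18])

## Binomial GLM with a fully factorial design
glmFit <- glm(cbind(germTot, 25 - germTot) ~ Storage * Temp,
              family = binomial, data = dataset)
anova(glmFit, test = "Chisq")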

Although the above method is not wrong, it may undoubtedly be sub-optimal. Indeed, dishes may contain the same total number of germinated seeds but, nonetheless, differ in some other germination traits, such as velocity or uniformity. We do not want to express a judgment about one specific characteristic of the seed lot; we would like to express a judgment about the whole seed lot. In other words, we are not specifically asking: “do the seed lots differ for their germination capability?”. We are, more generally, asking “are the seed lots different?”.

In order to get a general assessment, a different method of analysis should be sought, one which considers the entire time series (from 1 to 15 days) and not only one single summary measure. This method exists and is available within the time-to-event platform, which has proven very useful and appropriate for seed germination studies (Onofri et al., 2011; Ritz et al., 2013; Onofri et al., 2019).

The germination time-course

It is necessary to re-organise the dataset in a more useful way. A good format can be obtained by using the ‘makeDrm()’ function in the ‘drcSeedGerm’ package, which can be installed from gitHub (see the code at: this link). The function needs to receive a dataframe storing the counts (dataset[,4:18]), a dataframe storing the factor variables (dataset[,2:3]), a vector with the number of seeds in each Petri dish (rep(25, 16)) and a vector of monitoring times.

library(drcSeedGerm)
datasetR <- makeDrm(dataset[,4:18], dataset[,2:3], rep(25, 16), 1:15)
head(datasetR, 16)
##      Storage Temp Dish timeBef timeAf count nCum propCum
## 1        Low  Fix    1       0      1     0    0    0.00
## 1.1      Low  Fix    1       1      2     0    0    0.00
## 1.2      Low  Fix    1       2      3     0    0    0.00
## 1.3      Low  Fix    1       3      4     0    0    0.00
## 1.4      Low  Fix    1       4      5     0    0    0.00
## 1.5      Low  Fix    1       5      6     0    0    0.00
## 1.6      Low  Fix    1       6      7     0    0    0.00
## 1.7      Low  Fix    1       7      8     0    0    0.00
## 1.8      Low  Fix    1       8      9     3    3    0.12
## 1.9      Low  Fix    1       9     10     4    7    0.28
## 1.10     Low  Fix    1      10     11     6   13    0.52
## 1.11     Low  Fix    1      11     12     0   13    0.52
## 1.12     Low  Fix    1      12     13     1   14    0.56
## 1.13     Low  Fix    1      13     14     0   14    0.56
## 1.14     Low  Fix    1      14     15     3   17    0.68
## 1.15     Low  Fix    1      15    Inf     8   NA      NA

The snippet above shows the first dish. Roughly speaking, we have gone from a WIDE format to a LONG format. The column ‘timeAf’ contains the time when the inspection was made and the column ‘count’ contains the number of germinated seeds (e.g. 3 seeds were counted at day 9). These seeds did not germinate exactly at day 9; they germinated within the interval between two inspections, that is between day 8 and day 9. The beginning of the interval is given in the variable ‘timeBef’. Apart from these columns, we have additional columns, which we are not going to use for our analyses. The cumulative counts of germinated seeds are in the column ‘nCum’; these cumulative counts have been converted into cumulative proportions by dividing by 25 (i.e., the total number of seeds in a dish; see the column ‘propCum’).

We can use a time-to-event model to parameterise the germination time-course for this dish. This is easily done by using the ‘drm()’ function in the ‘drc’ package (Ritz et al., 2013):

modPre <- drm(count ~ timeBef + timeAf, fct = LL.3(),
              data = datasetR, type = "event",
              subset = c(Dish == 1))
plot(modPre, log = "", xlab = "Time",
     ylab = "Proportion of germinated seeds", xlim = c(0, 15))

Please, note the following:

  1. we are using the counts (‘count’) as the dependent variable
  2. as the independent variable, we are using the extremes of the inspection interval within which germinations were observed (count ~ timeBef + timeAf)
  3. we have assumed a log-logistic distribution of germination times (fct = LL.3()). A three-parameter model is necessary, because there is a final fraction of ungerminated seeds (truncated distribution); the parameterisation is sketched just after this list
  4. we have set the argument ‘type = “event”’. Indeed, we are fitting a time-to-event model, not a nonlinear regression model, which would be incorrect in this setting (see this link).
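For reference, the three-parameter log-logistic curve fitted by LL.3() in the ‘drc’ package is commonly written as

F(t) = d / (1 + exp(b * (log(t) - log(e))))

where d is the upper asymptote (the final fraction of germinated seeds), e is the time at which half of that maximum is reached, and b is a slope parameter; the exact sign convention for b is best checked in the ‘drc’ documentation.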

As we have determined the germination time-course for dish 1, we can do the same for all dishes. However, we have to instruct ‘drm()’ to define a different curve for each combination of storage and temperature. It is necessary to make an appropriate use of the ‘curveid’ argument. Please, see below.

mod1 <- drm(count ~ timeBef + timeAf, fct = LL.3(),
            data = datasetR, type = "event",
            curveid = Temp:Storage)
plot(mod1, log = "", legendPos = c(2, 1))

It appears that there are visible differences between the curves (the legend lists the curves in alphabetical order, i.e. 1: Fix-Low, 2: Fix-High, 3: Alt-Low and 4: Alt-High). We can test whether the curves are similar by coding a reduced model, with only one pooled curve for all treatment levels. It is enough to remove the ‘curveid’ argument.

modNull <- drm(count ~ timeBef + timeAf, fct = LL.3(),
               data = datasetR, type = "event")
anova(mod1, modNull, test = "Chisq")
## 
## 1st model
##  fct:      LL.3()
##  pmodels: Temp:Storage (for all parameters)
## 2nd model
##  fct:      LL.3()
##  pmodels: 1 (for all parameters)
## 
## ANOVA-like table
## 
##           ModelDf  Loglik Df LR value p value
## 1st model     244 -753.54
## 2nd model     253 -854.93  9   202.77       0

Now, we can compare the full model (four curves) with the reduced model (one common curve) by using a Likelihood Ratio Test, which is approximately distributed as a Chi-square. The test is highly significant. Of course, we can also test some other hypotheses. For example, we can code a model with different curves for storage times, assuming that the effect of temperature is irrelevant:

mod2 <- drm(count ~ timeBef + timeAf, fct = LL.3(),
            data = datasetR, type = "event",
            curveid = Storage)
anova(mod1, mod2, test = "Chisq")
## 
## 1st model
##  fct:      LL.3()
##  pmodels: Temp:Storage (for all parameters)
## 2nd model
##  fct:      LL.3()
##  pmodels: Storage (for all parameters)
## 
## ANOVA-like table
## 
##           ModelDf  Loglik Df LR value p value
## 1st model     244 -753.54
## 2nd model     250 -797.26  6   87.436       0

We see that such an assumption (temperature effect is irrelevant) is not supported by the data: the temperature effect cannot be removed without causing a significant decrease in the likelihood of the model. Similarly, we can test the effect of storage:

mod3 <- drm(count ~ timeBef + timeAf, fct = LL.3(),
            data = datasetR, type = "event",
            curveid = Temp)
anova(mod1, mod3, test = "Chisq")
## 
## 1st model
##  fct:      LL.3()
##  pmodels: Temp:Storage (for all parameters)
## 2nd model
##  fct:      LL.3()
##  pmodels: Temp (for all parameters)
## 
## ANOVA-like table
## 
##           ModelDf  Loglik Df LR value p value
## 1st model     244 -753.54
## 2nd model     250 -849.48  6   191.87       0

Again, we get significant results. So, we need to conclude that temperature and storage time had a significant influence on the germination behaviour of the species under study.

Before concluding, it is necessary to mention that, in general, the above LR tests should be taken with care: the results are only approximate and the observed data are not totally independent, as multiple observations are taken in each experimental unit (Petri dish). In order to restore independence, we would need to add to the model a random effect for the Petri dish, which is not an easy task in a time-to-event framework (Onofri et al., 2019). However, we got very low p-values, which leave us rather confident about the significance of the effects. It may be a good suggestion, in general, to avoid formal hypothesis testing and to compare the models by using the Akaike Information Criterion (AIC: the lowest is the best), which confirms that the complete model with four curves is, indeed, the best one.

AIC(mod1, mod2, mod3, modNull)
##          df      AIC
## mod1    244 1995.088
## mod2    250 2094.524
## mod3    250 2198.961
## modNull 253 2215.862

For those who are familiar with linear model parameterisation, it is possible to reach even a higher degree of flexibility by using the ‘pmodels’ argument, within the ‘drm()’ function. However, this will require another post. Thanks for reading!

References
  1. Catara, S., Cristaudo, A., Gualtieri, A., Galesi, R., Impelluso, C., Onofri, A., 2016. Threshold temperatures for seed germination in nine species of Verbascum (Scrophulariaceae). Seed Science Research 26, 30–46.
  2. Onofri, A., Mesgaran, M.B., Tei, F., Cousens, R.D., 2011. The cure model: an improved way to describe seed germination? Weed Research 51, 516–524.
  3. Onofri, A., Piepho, H.-P., Kozak, M., 2019. Analysing censored data in agricultural research: A review with examples and software tips. Annals of Applied Biology 174, 3–13. https://doi.org/10.1111/aab.12477
  4. Ritz, C., Pipper, C.B., Streibig, J.C., 2013. Analysis of germination data from agricultural experiments. European Journal of Agronomy 45, 1–6.


Analysis of a Flash Flood

Fri, 07/19/2019 - 22:32

(This article was first published on RClimate, and kindly contributed to R-bloggers)

Flash floods seem to be increasing in many areas. This post will show how to download local USGS flow and precipitation data and generate a 3-panel chart of flow, gage height and precipitation.

There was a tragic flash flood incident in Berks County Pennsylvania on July 11-12, 2019 that caused the death of 3 people, including a pregnant woman and her young son. This newspaper article (link) provides some details on this tragic event.

The timing of this flash flooding was interesting to me because the woman and her son were trapped in their car by a small tributary to the Schuylkill River at 4:30 PM on Thursday, July 11, while flooding in Philadelphia started at about 1:30 AM on Friday, July 12. The tragic drowning occurred on the Manatawny Creek, about 35 miles northwest of Philadelphia.

I plan to write a series of posts to document the rainfall patterns and timing of the rainfall and subsequent flooding.

This initial post presents the flow hydrograph, cumulative precipitation and gage height for the USGS Schuylkill River gaging station #01474500 at Fairmount Dam.

Follow-up posts will review the upstream USGS data as well as national weather service rainfall data.

Here is the 3 panel chart showing the hydrograph, precipitation data and gage height data for the period July 9 – 16, 2019.

Here’s the R script for analyzing and plotting the CSV file downloaded from the USGS site.

#################################################################################
## Get source data file link
library(lubridate)   # for mdy_hm(), month(), day() and year()

link <- "C:\\Users\\Kelly O'Day\\Desktop\\Schuylkill_flood\\CSV_files\\Schuylkil_Fairmont_Dam_7_days_july_2019.csv"
# (link <- choose.files())
df <- read.csv(link, as.is = T)
df$dt <- mdy_hm(df$datetime)

dt_min <- df$dt[1]
dt_max <- df$dt[nrow(df)]
peak_flow <- max(df$cfs, na.rm = T)
peak_flow_comma <- format(peak_flow, big.mark = ',')
peak_dt <- df$dt[which(df$cfs == peak_flow)]
from_to <- paste(month(dt_min), "/", day(dt_min), " to ",
                 month(dt_max), "/", day(dt_max), "/", year(dt_max), sep = "")

# Create 3-panel charts for flow, precipitation and gage height
par(mfrow = c(3, 1), ps = 10, pty = "m", xaxs = "r", yaxs = "r")
par(las = 1, oma = c(2.5, 2, 3, 1), mar = c(2, 5, 3, 0))
par(mgp = c(4, 1, 0))  # number of margin lines for title, axis labels and axis line

## Flow chart
plot_title <- paste("Flow - cfs")
plot(df$dt, df$cfs, type = "l", xlab = "", ylab = "Flow -cfs",
     xlim = c(dt_min, dt_max), yaxt = "n", main = plot_title)
axis(side = 2, at = axTicks(2),
     labels = formatC(axTicks(2), format = "d", big.mark = ','))
points(peak_dt, peak_flow, pch = 16, col = "red")
spacer <- 4 * 60 * 60
note <- paste("Peak @ ", peak_flow_comma, " cfs \n", peak_dt)
abline(h = seq(10000, 50000, 10000), col = "grey")
abline(v = seq(dt_min, dt_max, by = "day"), col = "grey")
text(peak_dt + spacer, peak_flow * 0.975, note, adj = 0, cex = 1)

# Sheet title annotation
mtext("Flood Conditions @ Fairmount Dam (July 9-16, 2019)",
      side = 3, line = 3, adj = 0.5, cex = 1.2)

## Precipitation data analysis & chart
df$cum_precip <- cumsum(df$inches)
tot_precip <- df$cum_precip[nrow(df)]
precip_st_row <- length(subset(df$cum_precip, df$cum_precip == 0))
precip_st_dt <- df$dt[precip_st_row]
precip_note <- paste0("Total Precip - ", tot_precip, " inches")
precip_subset <- subset(df, df$cum_precip == tot_precip)
precip_end_dt <- mdy_hm(precip_subset[1, 1])
plot(df$dt, df$cum_precip, type = "l", xlab = "",
     ylab = "Precipitation -inches", xlim = c(dt_min, dt_max),
     main = "Cumulative Precipitation - Inches")
points(precip_st_dt, 0, pch = 12, col = "blue")
points(precip_end_dt, tot_precip, pch = 12, col = "blue")
abline(v = seq(dt_min, dt_max, by = "day"), col = "grey")
text(dt_min, tot_precip, precip_note, adj = 0)
precip_st_note <- paste0(" Precipitation Starts @ ", precip_st_dt)
dur <- precip_end_dt - precip_st_dt
precip_end_note <- paste0(" Precipitation ends @ ", precip_end_dt, "\n ", dur, " hours")
text(precip_end_dt, tot_precip * 0.9, precip_end_note, adj = 0, cex = 1)
text(precip_st_dt, 0, precip_st_note, adj = 0, cex = 1)

## Gage height chart
gage_act_df <- subset(df, df$gage >= 10)
gage_act_dt <- gage_act_df$dt[1]
gage_act_note <- paste0("Gage Action Level @ \n", gage_act_dt)
plot(df$dt, df$gage, type = "l", xlab = "", ylab = "Gage Height - Ft",
     xlim = c(dt_min, dt_max), main = "Gage height - ft")
abline(h = 10, col = "brown")
abline(h = 11, col = "red", lwd = 1)
abline(v = seq(dt_min, dt_max, by = "day"), col = "grey")
points(gage_act_dt, gage_act_df[1, 3], pch = 11, col = "black")
text(gage_act_dt - 1 * 4 * 3600, 10, gage_act_note, adj = 1)
min_gage <- min(df$gage, na.rm = T)
max_gage <- max(df$gage, na.rm = T)
delta_gage <- max_gage - min_gage
gage_note <- paste("Max Gage @ ", max_gage, " ft\nGage Increase ", delta_gage, " ft")
text(dt_min, max_gage * 0.97, gage_note, adj = 0, cex = 1)
flood_act_note <- "Flood Action stage @ 10-ft"
flood_stage_note <- "Flood stage @ 11-ft"
text(dt_max - 60 * 60, 10.25, flood_act_note, adj = 1, cex = 0.95)
text(dt_max - 1 * 1 * 60 * 60, 11.25, flood_stage_note, adj = 1, cex = 0.95)

# Sheet annotation - footer notes
mtext("K O'Day - 7/19/19", side = 1, line = 3, adj = 0, cex = 0.9)
mtext("Data Source: USGS: https://waterdata.usgs.gov/nwis/inventory/?site_no=01474500",
      side = 1, line = 2, adj = 0, cex = 0.9)
# png(file="C:\\Users\\Kelly O'Day\\Desktop\\Schuylkill_flood\\myplot.png", bg="white")
dev.off()
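As a side note (not shown in the original post), the same gauge data can also be pulled directly from the USGS web services with the dataRetrieval package; a minimal sketch, assuming the instantaneous-value parameter codes 00060 (discharge, cfs), 00065 (gage height, ft) and 00045 (precipitation, in) are all recorded at this station:

library(dataRetrieval)

## Instantaneous values for the Fairmount Dam station over the flood period
flood <- readNWISuv(siteNumbers = "01474500",
                    parameterCd = c("00060", "00065", "00045"),
                    startDate = "2019-07-09", endDate = "2019-07-16",
                    tz = "America/New_York")
head(flood)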


Watch keynote presentations from the useR!2019 conference

Fri, 07/19/2019 - 20:46

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The keynote presentations from last week's useR!2019 conference in Toulouse are now available for everyone to view on YouTube. (The regular talks were also recorded and video should follow soon, and slides for most talks are available for download now at the conference website.) Here are links to the videos, indexed to the start of each presentation:

All of the presentations are excellent, but if I had to choose one to watch first, it would be Julia Stewart Lowndes' presentation, which is an inspiring example of how R has enabled marine researchers to collaborate and learn from data (like a transponder-equipped squid!). 

The videos have been made available thanks to sponsorship by the R Consortium. If you're not familiar with the R Consortium, you can learn more in the short presentation below from Joe Rickert:

The R Consortium is funded by its member organizations, so if you'd like to see more of the above, consider asking your company to become a member.

YouTube (R Consortium): 2019 Keynotes



Time series forecast cross-validation by @ellis2013nz

Fri, 07/19/2019 - 16:00

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

Time series cross-validation is an important part of the toolkit for good evaluation of forecasting models. forecast::tsCV makes it straightforward to implement, even with different combinations of explanatory regressors in the different candidate models for evaluation.

Spurious correlation between time series is a well documented and mocked problem, with Tyler Vigen’s educational website on the topic (“per capita cheese consumption correlated with number of people dying by becoming entangled in their bedsheets”) even spawning a whole book of humorous examples.

Identifying genuinely-correlated series can be immensely helpful for time series forecasting. Forecasting is hard, and experience generally shows that complex causal models don’t do as well as much simpler methods. However, a well chosen small set of “x regressors” can improve forecasting performance in many situations. I have been investigating one of those situations for a future blog post on forecasting unemployment rates. One method I’ll be using is time series cross-validation, as well described in this Rob Hyndman post. The implementation was sufficiently non-trivial in my real-data case that I’m writing today a separate post with simulated data to be sure I’m doing it right.

Simulated data

So here’s my simulated data. I have made three time series which are causally related to each other:

  • x is an ARIMA(1, 1, 1) process, which means that if you took its first difference (the change from day to day rather than the original observation) it would be an ARMA(1, 1) process with an autoregression coefficient of 0.5 and a moving average coefficient of 0.5
  • y has a random component which is generated in a similar way to x, and a structural component which is a simple multiplier of x
  • z has a similar random component again and a structural component which is a simple multiplier of the lagged values of y.

In other words, x causes y and y causes z via a delay. Our job is to forecast y. We would expect the best model to do this to be one that uses x as an explanatory variable and which ignores z. We would expect z to be a tempting but ultimately unhelpful explanatory variable in a model of y.

Here’s what the data looks like:

Generated with this:

library(forecast)
library(ggplot2)

# Simulate data:
set.seed(125)
n <- 100
x <- ts(cumsum(arima.sim(list(ar = 0.5, ma = 0.5), n = n)))
y <- ts(0.6 * x + cumsum(arima.sim(list(ar = 0.5, ma = 0.5), n = n)))
z <- ts(0.6 * y + cumsum(arima.sim(list(ar = 0.5, ma = 0.5), n = n)))
z <- c(0, z[1:(n - 1)])
d <- cbind(x, y, z)

# Exploratory plot:
autoplot(d) +
  ggtitle("Three simulated, related, time-series",
          "x causes y and y causes z; our job is to forecast y.") +
  labs(colour = "", y = "")

If this were a real analysis, I would always start with the partial autocorrelations and cross correlations. Auto-correlation means the correlation of a time series with lagged versions of itself. The “partial” means we look at these correlations after controlling for the shorter lags (for example, we look at the autocorrelation at a lag of 2 time periods, after controlling for the autocorrelation at lag 1). Cross correlations are the same statistics, but with the variously lagged versions of another variable instead of with itself. Here are the partial autocorrelations of our three simulated series:

To the experienced eye it is immediately obvious from this PACF plot, if not from the original simple plot, that these time series are non-stationary and heavily correlated with themselves; that is, knowing a series’ value at time t gives you a strong indication of its location at time t + 1. The giveaway in the PACF plots is the tall line indicating high autocorrelation (close to 1.0) at lag 1 for each of the three series. This non-stationarity hides the other relationships in the data and gets in the way of effective modelling, and the standard response is to try “differencing” the series. This technique alone would eliminate the great majority of the mis-inferences in the “spurious correlation” time series collection.

Here are the PACF plots for the first-differenced versions of these series:

The series are still heavily correlated with themselves – each with an autocorrelation of about 0.6 at lag 1 – but no longer so much so that they are obviously non-stationary. We can now see some of the secondary relationships, including correlations between the various variables at several lags that suggest there could be something going on between them (as in fact we know is the case).

Those PACF plots use the handy ggPacf function that comes with Hyndman’s forecast R package.

ggPacf(d) +
  theme(panel.grid = element_blank()) +
  ggtitle("Partial autocorrelations and cross correlations", "Original series")

ggPacf(diff(d)) +
  theme(panel.grid = element_blank()) +
  ggtitle("Partial autocorrelations and cross correlations", "First differenced series")

Modelling

When I first learnt time series analysis in the 1990s we used to use PACF plots to try to infer the order of the auto-regressive and moving-average components of a worthwhile model to fit. These days we just use highly effective and tested automated algorithms such as that behind the forecast::auto.arima function. I’m going to fit four models for my hypothetical forecast job:

  1. univariate model just using y
  2. x as an explanatory variable
  3. z as an explanatory variable
  4. both x and z as explanatory variables

For each model I am going to manually specify the level of differencing as one lag only, so my resulting models can all be compared on the same basis.

Because I made up the data, I happen to know that model 2 is the “correct” specification. I was pleased to see that fitting the four models and comparing their AIC agreed with this foreknown correct answer. The AIC of mod2 is lowest, correctly implying that any reduction in deviance from including z in the model is not justified by the use of an additional degree of freedom, whether z is included by itself or in addition to x:

        df      AIC
mod1     3 313.1799
mod2     5 290.6071
mod3     4 315.0472
mod4     6 292.3268

The models were fit with the code below – the perfect example of fitting enormously complex models with a single line of code each:

mod1 <- auto.arima(y, d = 1)
mod2 <- auto.arima(y, xreg = x, d = 1)
mod3 <- auto.arima(y, xreg = z, d = 1)
mod4 <- auto.arima(y, xreg = cbind(x, z), d = 1)

knitr::kable(AIC(mod1, mod2, mod3, mod4))

Time series cross-validation

OK, so the simple expedient of comparing AIC values worked in this case, but my actual motivation for today was to check that time series cross-validation would similarly pick the known-best model in a situation comparing time series forecasting models with different numbers of (or no) explanatory variables. Cross-validation for time series is more complex than for cross-sectional data because we can’t simply divide the data into training and test sets without taking into account the intrinsic connections of data in each time period with its neighbours on either side. It is well explained in the Hyndman blog post I linked to earlier. The intuition is well presented in this image from that post:

Basically you use the data from the blue dots to forecast the red dots.

The forecast::tsCV function implements this approach. It takes as its main argument a function that returns an object of class forecast. The complication (such as it is) in my case comes from the need to write a function that combines fitting an auto.arima model and then converting it to a forecast object. Building on an example provided by (again) Hyndman, here’s my version, which I think I will find myself re-using and hence will put in the frs package for future use:

#---------Cross validation---------------
#' auto.arima forecast function for time series cross validation
#'
#' adapted from https://gist.github.com/robjhyndman/d9eb5568a78dbc79f7acc49e22553e96
aafc <- function(y, h, xreg = NULL, ...){
  if(!is.null(xreg)){
    ncol <- NCOL(xreg)
    X <- matrix(xreg[1:length(y), ], ncol = ncol)
    if(NROW(xreg) < length(y) + h)
      stop("Not enough xreg data for forecasting")
    newX <- matrix(xreg[length(y) + (1:h), ], ncol = ncol)
    fit <- auto.arima(y, xreg = X, ...)
    return(forecast(fit, xreg = newX, h = h))
  } else {
    fit <- auto.arima(y, ...)
    return(forecast(fit, h = h))
  }
}

# this CV takes about 50 seconds
system.time({
  aa1_cv <- tsCV(y, aafc, d = 1)
  aa2_cv <- tsCV(y, aafc, xreg = x, d = 1)
  aa3_cv <- tsCV(y, aafc, xreg = z, d = 1)
  aa4_cv <- tsCV(y, aafc, xreg = cbind(x, z), d = 1)
})

The tsCV function returns a numerical time series object containing the forecast errors as a vector (if forecasting is for only one period forward, as is the case in my example). Interpreting and presenting the results takes a bit of care. All my models return some NAs in the forecast errors after cross-validation, including the final observation in each case. The tsCV helpfile includes the helpful comment that in the output of tsCV:

The time index corresponds to the last period of the training data

So if we want to line up the forecast errors to the actual values they apply to, we need to be careful! In my case, I need to knock out the first value of the original time series, and the last value of the forecast errors, if I want to line them up (eg for use in the accuracy function). Here’s how I did that:

library(dplyr)   # for the pipe, mutate() and select()

rbind(accuracy(ts(y[-1] - aa1_cv[-n]), ts(y[-1])),
      accuracy(ts(y[-1] - aa2_cv[-n]), ts(y[-1])),
      accuracy(ts(y[-1] - aa3_cv[-n]), ts(y[-1])),
      accuracy(ts(y[-1] - aa4_cv[-n]), ts(y[-1]))) %>%
  as.data.frame() %>%
  mutate(model = c("Univariate", "With x regressor", "With z regressor",
                   "Both x and z as regressors")) %>%
  dplyr::select(model, everything()) %>%
  knitr::kable()

I’m not sure this is the best way to evaluate those forecast errors (ie by using forecast::accuracy) but it’s a method that fits in with how I think about it and returns lots of performance metrics at once. Here are the results:

model                         ME          RMSE      MAE        MPE       MAPE      ACF1       Theil's U
Univariate                    -0.0238600  1.249621  0.9706499  28.28211  47.15084  0.0725987  0.4626243
With x regressor              -0.0560274  1.177935  0.9507106  20.91153  45.24956  0.1673370  0.3950919
With z regressor              -0.0156043  1.245264  0.9824904  29.08816  47.88510  0.0453596  0.5016008
Both x and z as regressors    -0.0909384  1.212325  0.9823389  22.85555  46.54281  0.1491128  0.4872987

Model 2, with just the x regressor, has the lowest root mean square error, mean absolute error, mean percentage error, mean absolute percentage error and Theil’s U. The autocorrelation of its errors is relatively high, but I don’t think that matters. It has more biased forecasts (mean error) than the models that don’t include x but depending on context we might accept that in return for being closer on average. Overall, this method comes up with good support for the correct underlying model.

Just to check I haven’t mangled anything, gotten the signs the wrong way around, etc I calculated a few of these stats by hand directly from the tsCV output:

> # mean error of univariate model:
> mean(aa1_cv, na.rm = TRUE)
[1] -0.02386004
>
> # root mean square error of x regressor model
> sqrt(mean(aa2_cv ^ 2, na.rm = TRUE))
[1] 1.177935
>
> # mean absolute error of z regressor model
> mean(abs(aa3_cv), na.rm = TRUE)
[1] 0.9824904

Final word on this is that I tried running this program with various other settings (eg just changing the random seed from 125 to another number, or making the relationships between the variables a little weaker) and sometimes the AIC was slightly better at picking the correct model than was the cross-validation. Neither method is fool-proof – remember, forecasting is hard, there is always a lot of noise around the signal!

OK, c’est tout. Today’s key thought again: time series cross-validation is an important part of the toolkit, and forecast::tsCV makes it straightforward to implement.



How to make 3D Plots in R (from 2D Plots of ggplot2)

Fri, 07/19/2019 - 15:46

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)


3D plots built in the right way for the right purpose are always stunning. In this article, we’ll see how to make stunning 3D plots with R using ggplot2 and rayshader. While ggplot2 might be familiar to anyone in data science, rayshader may not. So, let’s start with a small introduction to rayshader.

Package Description:

rayshader is an open source package for producing 2D and 3D data visualizations in R. rayshader uses elevation data in a base R matrix and a combination of raytracing, spherical texture mapping, overlays, and ambient occlusion to generate beautiful topographic 2D and 3D maps. In addition to maps, rayshader also allows the user to translate ggplot2 objects into beautiful 3D data visualizations.

Required Libraries/Packages:

The latest version of rayshader is available on GitHub and can be installed using either devtools or remotes.

devtools::install_github("tylermorganwall/rayshader")

And make sure you’ve got the most recent version of ggplot2. If you are a tidyverse person, then get the latest version of it.

install.packages('tidyverse')

Data

To keep the requirements of this article minimal, we’ll use faithfuld, one of the built-in datasets in the ggplot2 library.

glimpse(faithfuld)
Observations: 5,625
Variables: 3
$ eruptions  1.600000, 1.647297, 1.694595, 1.741…
$ waiting    43, 43, 43, 43, 43, 43, 43, 43, 43,…
$ density    0.003216159, 0.003835375, 0.0044355…

As you can see, faithfuld has got 3 continuous variables which we’ll use for plotting.

2D Plot

Our journey to a 3D plot begins with a normal 2D ggplot2 plot. We’ll build a density plot using geom_raster between waiting and eruptions to see how the data looks.

faithful_dd <- ggplot(faithfuld, aes(waiting, eruptions)) +
  geom_raster(aes(fill = density)) +
  ggtitle("3D Plotting in R from 2D_ggplot_graphics") +
  labs(caption = "Package: rayshader") +
  theme(axis.text = element_text(size = 12),
        title = element_text(size = 12, face = "bold"),
        panel.border = element_rect(size = 2, color = "black", fill = NA))

faithful_dd

Gives this plot:

Journey from 2D Plot to 3D Plot — One Line!

The journey from a 2D plot to a 3D plot is just one extra line of code, which comes from the rayshader package: the function plot_gg(), which takes a bunch of arguments that define how the 3D plot should look.

plot_gg(faithful_dd, multicore = TRUE, width = 8, height = 8,
        scale = 300, zoom = 0.6, phi = 60,
        background = "#afceff", shadowcolor = "#3a4f70")

faithful_dd is the ggplot2 object that we generated in the previous step. Most of the arguments are self-explanatory, such as multicore, which activates all the cores of the computer while rendering. Arguments like zoom and phi set where the 3D camera view should be.

Imagine, you get to explain Gradient descent or some optimization algorithm like this which is more intuitive and self-explanatory to get a mental model.

This doesn’t end with just a 3D plot: the developer (Tyler Morgan-Wall) was kind enough to give us another function, render_movie(), which places a camera and revolves it around the 3D plot that we just built, giving us a stunning video of the 3D plot. render_movie() internally uses the av package to make the video.

render_movie("faithful_3d.mp4", frames = 480)
Video: https://datascienceplus.com/wp-content/uploads/2019/07/faithful_3d.mp4

Summary

Thanks to Tyler, we can now make stunning 3D plots from 2D ggplots using just one extra function, plot_gg(), and even make a 360-degree video of the 3D plot. Learn more about data visualization in R here and in the rayshader documentation. The code used and the sessionInfo are available here.




What NOT to do when building a shiny app (lessons learned the hard way)

Fri, 07/19/2019 - 14:00

(This article was first published on R on Adi Sarid's personal blog, and kindly contributed to R-bloggers)

I’ve been building R shiny apps for a while now, and ever since I started working with shiny, it has significantly increased the set of services I offer my clients.

Here’s a documentation of some of the many lessons I learned in previous projects. Hopefully, others can avoid these mistakes in the future.

Background

Shiny is a really great tool that allows data scientists to communicate their analysis in an appealing and effective way. However, as data scientists we are used to thinking about things a certain way. Our way of thinking and our practices are different from those of a software developer or a DevOps engineer.

Here are some of the things I learned along the path of my Shiny app development experience: some things that you should and should not do.

Don’t skip the planning phase

Do a mockup, research your implementation options.

Mockup

Do a mockup, even if it’s just a piece of paper which took you 5 minutes to draw. It will be worth it.

Shiny is very tempting in the sense that once you understand the concept of reactive programming, you can go from idea to a full app in a few days’ work. Why invest time in preparing a mockup or planning, when you can just go ahead and do the actual thing?

My experience tells me that the app is much more successful in capturing the customer’s needs when the customer is part of the technical planning phase (when you share your dilemmas with the client). It sets expectations, frames what you can and can’t (or won’t) do for the customer, and enables you to find solutions together.

Also, when you’re looking at a mockup (even if it’s just a simple drawing or a non-interactive slide), it helps in the next stages of building the app’s UI.

Here is an example of how a mockup might look when I’m drawing it on a piece of paper. Note how I’ve already written down the purpose of some of the elements and their expected element ids. It helps to build the UI when you’re actually looking at one of these.

Example for a mockup drawing

Research

When you encounter a requirement you have not encountered before and are wondering how to accomplish it, research.

  • Is there more than a single way to accomplish what you’re trying?
  • What are the pros and cons of each method?

For example, when I needed to show a table and incorporate data intake into the table, I was researching two options: one with the DT (DataTables) package (via the editable = TRUE argument) and the other with the rhandsontable package.

Both provide data editing. Eventually I chose rhandsontable, which had some limitations (e.g., slower rendering than DT, no search box) but provided more features out-of-the-box (e.g., data validation, displaying factors as a list, checkboxes, etc.).
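For reference, here is a minimal sketch of the DT option mentioned above (not the app I built; the table and output ids are invented): an editable table is just a datatable() with editable = TRUE, and cell edits come back through the *_cell_edit input.

library(shiny)
library(DT)

ui <- fluidPage(DTOutput("tbl"))

server <- function(input, output, session) {
  output$tbl <- renderDT(
    datatable(head(iris), editable = TRUE)
  )
  ## Each confirmed cell edit arrives as a small data frame (row, col, value).
  observeEvent(input$tbl_cell_edit, {
    print(input$tbl_cell_edit)
  })
}

shinyApp(ui, server)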

Be sure you can live up to your promises

This is more of a broad issue (you could say it’s true for anything).

In my case, in the past I promised some clients that I’d provide “realtime” dashboards. However, as it turned out, I was reading from a csv data dump which provided the data with delays of up to 15-30 minutes.

In most projects I do, 15 minutes and realtime are pretty much equivalent from a practical standpoint, but in a specific project I did recently, I had a client who wanted to check the data as it was changing minute by minute.

This gap in expectations caused some confusion and disappointment. We eventually learned from this, and in the future, when realtime is a requirement, we will use a better data source (i.e., a database instead of the delayed data dump).

Don’t forget to plan your budget

Make sure you consider all the elements you need for the project. Plan the budget accordingly, and understand the ramifications of scaling the app.

For example, if you’re using shinyapps.io, get familiar with the pricing packages and figure out what you will need to provide a good SLA (relative to the number of users of the app).

The same goes for other cloud services, e.g., using a database: how many users? How many connections? What size of database?

In most cloud providers you can also set up billing alerts which lets you know when something is exceeding a predetermined threshold.

All of these are very important when you’re building your quote, and obviously when going into production with your App.

Don’t skip testing and staging on your way to production

In software development there are various levels of environments, starting from your desktop (“local”), through development server, integration, testing, staging/acceptance, and production. See Wikipedia’s Deployment environment entry.

When building an app, make sure you go through these steps, specifically testing, staging, and production. What I found particularly useful is to upload the app twice (to two separate locations/URLs):

  1. Deploy as a beta app (client acceptance/demo) in which I demonstrate additional features and discuss them with the client, before incorporating them into production.
  2. Deploy as a production/live app.

As you iterate and improve the app, fix bugs, and add new features, you are also at the risk of breaking things. Thus, you should first update the beta app, share the new additions, and let the client experiment with the app. This way you can double check you didn’t break anything else.

Only when the client approves the changes should you redeploy the new version to production.
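Here is a minimal sketch of what that two-deployment workflow could look like with rsconnect and shinyapps.io; the app directory and the two app names are made up for illustration:

library(rsconnect)

# Beta / client-acceptance deployment
deployApp(appDir = "my_app", appName = "my-app-beta")

# Production deployment, run only after the client signs off on the beta
deployApp(appDir = "my_app", appName = "my-app")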

Conclusions

As data scientists using Shiny, we’ve also become software developers. We’re developing not just for ourselves or for other useRs in our community.

With Shiny we’re building for end-users. We’re building customer facing apps, and we need to keep that in mind. We should make sure that we adopt and use best practices of software development.



Statistical matching, or when one single data source is not enough

Fri, 07/19/2019 - 02:00

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)


I was recently asked how to go about matching several datasets where different samples of
individuals were interviewed. This sounds like a big problem: say that you have datasets A and B,
where A contains one sample of individuals and B another sample; how could you possibly match
the datasets? Matching datasets requires a common identifier. If, for instance, A contains
socio-demographic information on a sample of individuals I, while B contains information on wages
and hours worked for the same sample of individuals I, then yes, it is possible to
match/merge/join both datasets.
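For that easy case, a minimal sketch with made-up data (the identifiers and column names are purely illustrative) is just a join on the shared identifier:

library(dplyr)

# Two tiny toy datasets covering the same individuals
A <- data.frame(id = 1:3, age = c(25, 40, 31), sex = c("F", "M", "F"))
B <- data.frame(id = 1:3, wage = c(2100, 3500, 2800), hours = c(40, 38, 35))

# Merge on the common identifier
left_join(A, B, by = "id")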

But that was not what I was asked about; I was asked about a situation where the same population
gets sampled twice, and each sample answers a different survey. For example, survey A is about
labour market information and survey B is about family structure. Would it be possible to
combine the information from both datasets?

To me, this sounded a bit like a missing data imputation problem, but one where all the information
about the variables of interest is missing! I started digging a bit and found that not only is there
already quite some literature on the subject, there is even a package for it, called {StatMatch}, with
a very detailed vignette.
The vignette is so detailed that I will not write any code; I just wanted to share this package!

Hope you enjoyed! If you found this blog post useful, you might want to follow
me on twitter for blog post updates and
buy me an espresso or paypal.me.




An R Users Guide to JSM 2019

Fri, 07/19/2019 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

If you are like me, and rather last minute about making a plan to get the most out of a large conference, you are just starting to think about JSM 2019 which will begin in just a few days. My plans always begin with an attempt to sleuth out the R-related sessions. While in the past it took quite a bit of work to identify talks that were likely backed by R-based calculations, this is clearly no longer the case. In fact, because Stanford Professor Trevor Hastie will be delivering the prestigious Wald Lectures this year, R-backed work will be front and center.

Professor Hastie has made numerous important contributions to statistical learning, machine learning, data science, and statistical computing. Among the latter is the glmnet package he co-authored with Jerome Friedman, Rob Tibshirani, Noah Simon, Balasubramanian Narasimhan and Junyang Qian, which has become a fundamental resource.

The Wald Lectures will be delivered over three days in room CC Four Seasons 1 according to the following schedule:
* Lecture 1: Mon, 7/29/2019, 10:30 AM – 12:20 PM
* Lecture 2: Tue, 7/30/2019, 2:00 PM – 3:50 PM
* Lecture 3: Wed, 7/31/2019, 10:30 AM – 12:20 PM

If you want to do some preparation for the lectures, you might have a look at the book Statistical Learning with Sparsity: The Lasso and Generalizations by Hastie, Tibshirani and Wainwright.

The rest of this post lists some R-related talks that can help you fill your days at JSM! I am sure my list is not complete. Please feel free to add anything I may have missed to the comments section following this post.

Sunday, July 28, 2019

Findings from Analysis and Visualization of the New York City Housing and Vacancy Survey Data – CC 501 – 3:20 PM – Nels Grevstad, Metropolitan State University of Denver; Rachel Rosebrook, Metropolitan State University of Denver; Lance Barto, Metropolitan State University of Denver; Gil Leibovich, Metropolitan State University of Denver; Elizabeth Foster, Metropolitan State University of Denver; ThienNgo Le, Metropolitan State University of Denver; Kelsey Smith, Metropolitan State University of Denver; Nathanael Whitney, Metropolitan State University of Denver; Zoe Girkin, Metropolitan State University of Denver; Ahern Nelson, Metropolitan State University of Denver; Karan Bhargava, Metropolitan State University of Denver; Alex Whalen-Wagner, Metropolitan State University of Denver; Gemma Hoeppner, Metropolitan State University of Denver; Larry Breeden, Metropolitan State University of Denver; Ayako Zrust, Metropolitan State University of Denver; Travis Rebhan, Metropolitan State University of Denver; Anayeli Ochoa, Metropolitan State University of Denver

Bayesian Uncertainty Estimation Under Complex Sampling – Speed: CC 502 – 3:00 PM –
Matthew Williams, National Science Foundation; Terrance Savitsky, Bureau of Labor Statistics

Measuring Gentrification Over Time with the NYCHVS – Poster: CC Hall C – 4:00 PM – 4:45 PM
Robert Montgomery, NORC; Quentin Brummet, NORC; Nola du Toit, NORC at the University of Chicago; Peter Herman, NORC at the University of Chicago; Edward Mulrow, NORC at the University of Chicago

A SHINY Markov Machine for Decision-Making in Major League Baseball – Part 1: CC105 – 2:45 PM and Part 2: CC Hall C – 4:00 PM to 4:45 PM
Jason Osborne, North Carolina State University

Measuring Gentrification Over Time with the NYCHVS – CC 501 – 2:55 PM –
Robert Montgomery, NORC; Quentin Brummet, NORC; Nola du Toit, NORC at the University of Chicago; Peter Herman, NORC at the University of Chicago; Edward Mulrow, NORC at the University of Chicago

A New Tidy Data Structure to Support Exploration and Modeling of Temporal Data – CC 301 – 3:25 PM –
Earo Wang, Monash University; Dianne Cook, Monash University; Rob J Hyndman, Monash University

TensorFlow Versus H2O, Predicting the S&P 500 – CC 504 – 4:50 PM – Kenneth Davis

Model-Based Clustering Using Adjacent-Categories Logit Models via Finite Mixture Model – CC 504 – 5:05 PM –
Lingyu Li, Victoria University of Wellington; Ivy Liu, Victoria University of Wellington; Richard Arnold, Victoria University of Wellington

The Estimable Luke Tierney – and Estimability in R – CC 501 – 5:20 PM –
Russell V. Lenth, University of Iowa

Monday, July 29, 2019

Training Students Concurrently in Data Science and Team Science: Results and Lessons Learned from Multi-Institutional Interdisciplinary Student-Led Research Teams 2012-2018 -Poster: CC Hall C- 2:00 PM to 3:50 PM –
Brent Ladd, Purdue University; Mark Ward, Purdue University

A Natural Language Processing Algorithm for Medication Extraction from Electronic Health Records Using the R Programming Language: MedExtractR – Poster: CC Hall C – 2:00 PM to 3:50 PM – Hannah L Weeks, Vanderbilt University; Cole Beck, Vanderbilt University Medical Center; Elizabeth McNeer, Vanderbilt University; Joshua C Denny, Vanderbilt University; Cosmin A Bejan, Vanderbilt University; Leena Choi, Vanderbilt University Medical Center

Conditional Probability and SQL for Data Science – Poster: CC Hall C – 10:30 AM to 12:20 PM – Eric Suess, CSU East Bay

R Markdown: a Software Ecosystem for Reproducible Publications – CC 107- 11:55 PM – Yihui Xie, RStudio, Inc.

Infusing Bayesian Strategies for Pharmaceutical Manufacturing and Development – CC 109- 12:05 PM –
Bill Pikounis, Johnson & Johnson; Dwaine Banton, Janssen R&D; John Oleynick, Johnson & Johnson; Jyh-Ming Shoung, Janssen R&D

Tuesday, July 30, 2019

Controlling the False Discovery Proportion: a Simulation Study – Poster: CC Hall C – 10:30 AM to 12:20 PM
HARLAN MCCAFFERY, University of Michigan; Chi Chang, Michigan State University

Give Your Statistician Colleague Iris Bulbs for Their House Warming! – CC 605 – 11:05 AM – Dianne Cook, Monash University

From Prediction Models to Shiny App: Creating a Tool for Contaminated Food Source Prediction in Salmonella and STEC Outbreaks – CC Hall C – 11:35 AM to 12:20 PM – Caroline Ledbetter, University of Colorado; Alice White, Colorado School of Public Health; Elaine Scallan Walter, Colorado School of Public Health; David Weitzenkamp, Colorado School of Public Health

Stats for Data Science – H-Centennial Ballroom G-H – Round Table: 12:30 PM to 1:50 PM – Daniel Kaplan, Macalester College

Experiences with Incorporating R into a Second-Level Biostatistics Course for MPH Students – CC Hall C – 2:00 PM to 2:45 PM – Christine Mauro, Columbia University; Nicholas Williams, Columbia University; Anjile An, Columbia University

From Prediction Models to Shiny App: Creating a Tool for Contaminated Food Source Prediction in Salmonella and STEC Outbreaks – CC 501 – 8:40 AM
Caroline Ledbetter, University of Colorado; Alice White, Colorado School of Public Health; Elaine Scallan Walter, Colorado School of Public Health; David Weitzenkamp, Colorado School of Public Health

Tools for Evaluating Quality of State and Local Administrative Data – CC708 – 9:15AM –
Zachary H Seeskin, NORC at the University of Chicago; Gabriel Ugarte, NORC at the University of Chicago; Rupa Datta, NORC at the University of Chicago

Wednesday, July 31, 2019

Ggvoronoi: Voronoi Tessellations in R – CC 105 – 11:20 AM -Thomas J Fisher, Miami University; Robert C Garrett, Miami University; Karsten Maurer, Miami University

Using R to Conduct Retrospective Analyzes of EHR and Imaging Data: a Case Study in MS – Poster: CC Hall C – 10:30 AM – 12:20 PM – Melissa Martin, University of Pennsylvania; Russell Shinohara, University of Pennsylvania

Generalized Causal Mediation and Path Analysis and Its R Package gmediation Talk: – CC 501 – 8:45 AM – and Poster: CC Hall C – 11:35 AM – 12:20 PM –
Jang Ik Cho, Eli Lilly and Company; Jeffrey M Albert, Case Western Reserve University

Tidi_MIBI: a Tidy Pipeline for Microbiome Analysis and Visualization in R – Speed Talk: CC 501 – 10:15 AM and Poster: CC Hall C – 11:35 AM – 12:20 PM –
Charlie Carpenter, University of Colorado-Biostatistics

Incorporating Spatial Statistics into Routine Analysis of Agricultural Field Trials – CC Hall C – 11:35 AM – 12:20 PM –
Julia Piaskowski, University of Idaho; Chad Jackson, University of Idaho; Juliet Marshall, University of Idaho; William J Price, University of Idaho

Incorporating Spatial Statistics into Routine Analysis of Agricultural Field Trials – CC 501 – 10:05 AM –
Julia Piaskowski, University of Idaho; Chad Jackson, University of Idaho; Juliet Marshall, University of Idaho; William J Price, University of Idaho

DemoR: Tools for Teaching and Presenting R Code – CC 302 – 10:35 AM –
Kelly Bodwin, California Polytechnic State University; Hunter Glanz, California Polytechnic State University

Ghclass: An R Package for Managing Classes with GitHub – CC 302 – 10:50 AM –
Colin Rundel, Duke University

Using and Building Shiny Apps for Teaching Introductory Biostatistics CC 504 – 11:05 AM –
Adam Ciarleglio, The George Washington University

Using GitHub and RStudio to Facilitate Authentic Learning Experiences in a Regression Analysis Course – CC 302 – 11:05 AM –
Maria Tackett, Duke University

A Generalized Additive Cox Model with L1-Penalty for Heart Failure Time-To-Event Outcomes and Comparison to Other Machine Learning Approaches – CC 712 – 3:20 PM – Matthias Kormaksson

Thursday, August 1, 2019

A Journey Teaching Applied Statistics for Health Sciences in an Asynchronous Team Based Learning Format Using Data Science Ideas – CC 110 – 8:50 AM – Ben Barnard

Noncentral Algorithm Assessments – CC 104 – 9:20 AM
Jerry Lewis, Biogen Idec

Supplementary Code

In case you are wondering how I produced the plot above, here is the code which uses the cranly and dlstats packages to investigate CRAN.

library(tidyverse)
library(cranly)
library(dlstats)

# Get clean copy of CRAN
p_db <- tools::CRAN_package_db()
package_db <- clean_CRAN_db(p_db)

# Build package network
package_network <- build_network(package_db)

# Find Hastie packages
pkgs <- package_by(package_network, "Trevor Hastie")

# Find most downloaded Hastie packages
dstats <- cran_stats(pkgs)
topdown <- group_by(dstats, package) %>%
  summarize(n = sum(downloads)) %>%
  arrange(desc(n)) %>%
  filter(n > 100000)

# Plot the monthly downloads for Hastie's top 5 packages
shortlist <- select(topdown, package) %>% slice(1:5)
toppkgs <- cran_stats(as.vector(shortlist$package))

ggplot(toppkgs, aes(end, downloads, group = package, color = package)) +
  geom_line() +
  geom_point(aes(shape = package)) +
  xlab("Monthly Downloads") +
  ggtitle("Trevor Hastie Packages")




Dotplot – the single most useful yet largely neglected dataviz type

Fri, 07/19/2019 - 02:00

(This article was first published on Ilya Kashnitsky, and kindly contributed to R-bloggers)

I have to confess that the core message of this post is not really a fresh insight. But if I were given a chance to deliver one piece of dataviz advice to every (ha-ha-ha) listening mind, I’d choose this: forget multi-category bar plots and use dotplots instead.

I was converted several years ago after reading this brilliant post. Around the same time, as I figured out later, demographer Griffith Feeney wrote a similar post. Among fresher examples, there are chapters discussing dotplot visualizations in Claus Wilke’s book and Kieran Healy’s book. So, what are dotplots and what’s so special about them?

Basically, dotplots are regular barplots rotated 90 degrees and optimized for “ink usage”: dots represent values along the continuous horizontal axis. This leaves the vertical axis for the categorical variable, allowing us to write out long category labels as normal, horizontally aligned text (nobody likes reading those awful 45-degree labels). The use of dots instead of bars also makes it possible to show several comparable values for each category.
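To make the idea concrete, here is a minimal sketch with made-up data (the regions, years, and values are purely illustrative): the categorical variable sits on the vertical axis and two comparable values share each row.

library(ggplot2)

df <- data.frame(
  region = rep(c("North", "South", "East", "West"), each = 2),
  year   = rep(c("2010", "2015"), times = 4),
  value  = c(12, 15, 9, 11, 20, 18, 7, 13)
)

ggplot(df, aes(x = value, y = region, colour = year)) +
  geom_point(size = 3) +                     # dots instead of bars
  labs(x = "Value", y = NULL, colour = "Year")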

Here is a sample of ugly plots that I found in recent papers from my personal library.

It was not difficult to compose such a selection, as these plots are, unfortunately, super popular. I think the tool to blame here is (surprise, surprise) Excel. I fail to understand how such a basic plot type could have been left out.

Just to be clear, I’m guilty too. Before the happy switch to rstats and ggplot, I produced the same Excel plots. Here is an example from my 2014 working paper on internal youth migration in the central regions of Russia.

These two figures became one in the final paper, recently published in GeoJournal. I think it’s a nice example of how information dense, pretty looking (very subjective here), and easy to read a dotplot can be.

The figure shows the discrepancy between migration estimates based on the statistical record and indirect census estimates.

The code to replicate this figure is in this gist. A couple of coding solutions might be interesting here. First, the trick of vertically offsetting the dots for the two cohorts; I realise that the way I implemented it is a bit clumsy, so if you know a more elegant solution please let me know. Also, I believe that in some cases composing the legend manually pays off, especially when dealing with a two-dimensional legend. If the plot has some empty area, it’s always a good idea to move the legend into the plotting area, thus cutting away the margins.



Wordcloud of conference abstracts – FOSS4G Edinburgh

Thu, 07/18/2019 - 16:24

(This article was first published on R – scottishsnow, and kindly contributed to R-bloggers)

I’m helping run a conference this September – FOSS4GUK. To help promote the event I’ve created a wordcloud of conference abstracts, in R!

The conference is taking place at Dynamic Earth in Edinburgh, Scotland. It’s focused on free and open source software for geospatial (FOSS4G) and, as such, covers the full stack: everything from backend databases, ETL, and analysis to web publication. Tools featured include QGIS, R, Python, PostGIS, leaflet and many others.

Tickets: https://uk.osgeo.org/foss4guk2019/

Follow on twitter: https://twitter.com/foss4guk

Back to the wordcloud. Working with plain text files, I followed Julia and David’s excellent instructions and then added decoration in GIMP. R script below.

library(tidyverse)
library(tidytext)
library(wordcloud)

# ----------------------------

data("stop_words")

f = list.files("~/dir/dir")

abstracts = lapply(f, function(i){
  read_table(paste0("~/dir/dir/", i), col_names = F) %>%
    gather(key, word) %>%
    select(-key) %>%
    add_column(author = str_remove(i, ".txt")) %>%
    unnest_tokens(word, word) %>%
    anti_join(stop_words)
})
abstracts = do.call("rbind.data.frame", abstracts)

png("~/dir/dir/abstract_wordcloud.png", width=1200, height=800, res=150)
par(mar = rep(0, 4))
abstracts %>%
  drop_na() %>%
  filter(!str_detect(word, "[0-9]") & word != "e.g") %>%
  count(word) %>%
  with(wordcloud(word, n,
                 random.order = FALSE,
                 max.words = 150,
                 colors = c("#497fbf", "#f49835"),
                 rot.per = 0,
                 fixed.asp = F))
dev.off()


RStudio Trainer Directory Launches

Thu, 07/18/2019 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

Several dozen people have taken part in RStudio’s instructor training and certification program since it was announced earlier this year. Since our last update, many of them have completed certification, so we are pleased to announce a preview of our trainers’ directory. Each of the people listed there has completed an exam on modern evidence-based teaching practices, as well as technical exams on the Tidyverse or Shiny, and they would welcome inquiries from anyone who is looking for high-quality training on these topics.

We plan to fold this directory into the main RStudio website in the near future, and would be grateful for comments about how to make it more useful. We are also now taking applications for instructor training in September and October 2019; if you are interested, you can find details here or register here. If you have questions or suggestions, please email Greg Wilson.



Plotting Bayes Factors for multiple comparisons using ggsignif

Thu, 07/18/2019 - 02:00

(This article was first published on Posts on R Lover ! a programmer, and kindly contributed to R-bloggers)

This week my post is relatively short and very focused. What makes it interesting (at least to me) is whether it will be seen as a useful “bridge” between frequentist and Bayesian methods or as an abomination to both! There’s some reasonably decent code and explanation in this post, but before I spend much more time on the functionality I definitely want to hear some feedback.

For the small number of you who have been following along: I have been trying to elevate my own use of Bayesian methods as well as share what I hope are practical, time-saving pieces of code that are relatively simple to use and reuse across different datasets. I’ve been focusing on some of the fastest male runners on record, and while not strictly necessary, you may benefit from reading some of the earlier postings.

Today the exercise is to follow up on what I did conceptually last post, but add plotting, using ggplot2 and ggsignif, of the Bayes factors associated with the multiple comparisons that arise from what is traditionally referred to as a one-way ANOVA. Last post I was able to produce output analogous to that generated by pairwise.t.test, with Bayes factors in place of “p values”. My thinking is that one of the reasons Bayesian methods are not more broadly adopted is that the unfamiliarity is both conceptual (compared to NHST) and practical: the actual output looks strikingly different. My hope is that being able to print and plot objects that at least feel familiar in form(at) will help more people engage with Bayesian methods.

Read on! I’ll say it again below, but comments and critique are always welcome via disqus or email; you’ll have no trouble finding the icon links in a couple of places. Please comment: have I done something useful here, or is this a Frankenstein’s monster, a terrible idea of a hybrid creature?

I’m tired of running!

I’ve been using the same dataset for the last few posts, all about Usain Bolt and 100m sprinters. It’s nice data, but I’m tired of it. I recently ran across a post that made use of the data from the 2016 US Census American Community Survey (ACS) Public Use Microdata Sample (PUMS). While that post was about ensemble models, there was one table that caught my eye because it summarized results that surprised me. The chart was about income in 2016 by sector of employment, and the means were not in an order I would have imagined.

Let’s grab the data set from the WinVector Github site and begin our quest for a plotting solution. We’ll load the packages we need (suppressing the load messages), set my favorite ggplot theme, and grab the data.

It’s a funny thing to say, but I love R’s factor methods. That’s more than a little geeky and perhaps a bit antiquated, since in many places across the web you’ll find stern admonitions to never ever use factors and just stick to the character type. I embrace factors and, as a consequence, really appreciate the forcats package that is part of the tidyverse. After we throw away the gp column about “test” versus “training”, which we don’t need, let’s use forcats to do a little cleanup and reordering. I’ve put comments in the code, but I want to especially highlight forcats::fct_reorder, which allows us to trivially relevel our employment factor by mean income.

library(tidyverse) library(BayesFactor) library(ggsignif) theme_set(theme_bw()) # set theme location <- "https://github.com/WinVector/PDSwR2/raw/master/PUMS/incomedata.rds" incomedata <- readRDS(url(location)) # remove the test vs training column as unnecessary. incomedata$gp <- NULL # just cleaning uo the factor so it will make shorter better labels on the plot incomedata$employment <- fct_recode(incomedata$employment, "Self Not Inc" = "Self employed not incorporated", "Self Incorporated" = "Self employed incorporated", "Private for Profit" = "Employee of a private for profit", "Private Non Profit" = "Private not-for-profit employee", "Federal Government" = "Federal government employee", "State Government" = "State government employee", "Local Government" = "Local government employee" ) # In case I want to reduce down to 3 more fundamental # categories Private, Government, or self employed incomedata$empcategory <- fct_collapse(incomedata$employment, "Self" = c("Self Not Inc", "Self Incorporated"), "Private" = c("Private for Profit", "Private Non Profit"), "Government" = c("Federal Government", "State Government", "Local Government") ) incomedata$employment <- forcats::fct_reorder( .f = incomedata$employment, .x = incomedata$income, .fun = mean ) str(incomedata) ## 'data.frame': 22241 obs. of 6 variables: ## $ income : num 22000 21000 25800 25000 20200 36000 20000 30000 23000 5000 ... ## $ age : num 24 31 26 27 27 47 24 41 43 21 ... ## $ sex : Factor w/ 2 levels "Male","Female": 1 2 2 2 2 1 1 1 2 2 ... ## $ employment : Factor w/ 7 levels "Self Not Inc",..: 2 4 2 2 4 2 2 2 2 2 ... ## $ education : Factor w/ 9 levels "no high school diploma",..: 1 4 4 6 4 2 1 1 6 2 ... ## $ empcategory: Factor w/ 3 levels "Private","Government",..: 1 1 1 1 1 1 1 1 1 1 ...

Okay, we have the data we need. What I want to do is something like this example, taken straight from the help file for geom_signif, only with Bayes factors displayed instead of p values.

ggplot(mpg, aes(class, hwy)) +
  geom_boxplot() +
  geom_signif(comparisons = list(
    c("compact", "pickup"),
    c("subcompact", "suv")
  ))

We pass a list of the paired comparisons we want to make, and by default geom_signif applies the wilcox.test() function to the mpg data frame we passed. Add a boxplot geom and voila, we have a nice display. ggsignif is a very nice package designed to spare us the tedium of doing the math to figure out where to put our comparisons on the plot. We could write our own custom geom, or even just try using existing geoms after calculating everything out, but I’m too lazy for that. I’d much rather fool ggsignif into doing what we want.

Why even bother?

Before we plunge into writing code, let’s review why we might want to do this at all. As I mentioned in previous posts, “NHST” has been around a long time and is unlikely to go away in many disciplines any time soon. If I had to pick one and only one reason why I think replacing p values with Bayes factors on this plot is important and the “right thing to do” (because make no mistake, Bayes factors are not miracles and not without their own challenges), it would be expressed in paragraph 2.5 of this excerpt from Wagenmakers, Lee, Lodewyckx, and Iverson, 2008. I would like our plot to provide an assessment of how our data influence our thinking about both the evidence for our hypothesis (BF10) and the evidence against it (BF01).
or to be even briefer…

“… frequentist inference does not allow probability statements …”

which is a shame because very often that’s exactly what we want…

“What are the odds of that happening?”

On with the show!

What I want to accomplish:

  1. Be able to show all the possible pairwise comparisons in ggplot

  2. Apply the ttestBF test from the BayesFactor package to the data piped
    into ggplot

  3. Display the BF10 and/or BF01 information in a variety of formats, e.g. text
    or numeric and with various levels of precision.

  4. Not modify or replicate a single line of code from ggsignif, i.e. use it as
    is.

The first parameter I need to pass to geom_signif is a list of comparisons where each element of the list is itself a character vector with two elements. As the example above shows, we retain the ability to pass just the comparisons we choose, e.g., comparisons = list(c("compact", "pickup"), c("subcompact", "suv")), but we also want a function that can take the 7 levels of class or employment and make all 21 pairs for us.

Here’s a function called comparisons_list that takes our data frame of interest, with x being the grouping (x axis) variable, and returns what we need.

comparisons_list <- function(data, x) {
  # creating a dataframe with just the columns of interest
  # make sure the grouping variable x is indeed a factor
  # has no unused levels
  data <- dplyr::select(
    .data = data,
    x = !!rlang::enquo(x)
  ) %>%
    dplyr::mutate(.data = ., x = droplevels(as.factor(x)))
  grouplevels <- levels(data$x)
  g1_list <- combn(grouplevels, 2)[1, ]
  g2_list <- combn(grouplevels, 2)[2, ]
  comparisons_list <- lapply(
    1:length(g1_list),
    function(i) c(
      combn(grouplevels, 2)[2, i],
      combn(grouplevels, 2)[1, i]
    )
  )
  return(comparisons_list)
}

head(comparisons_list(data = incomedata, x = employment))
## [[1]]
## [1] "Private for Profit" "Self Not Inc"
##
## [[2]]
## [1] "State Government" "Self Not Inc"
##
## [[3]]
## [1] "Private Non Profit" "Self Not Inc"
##
## [[4]]
## [1] "Local Government" "Self Not Inc"
##
## [[5]]
## [1] "Federal Government" "Self Not Inc"
##
## [[6]]
## [1] "Self Incorporated" "Self Not Inc"

length(comparisons_list(data = incomedata, x = employment))
## [1] 21

comp.list <- comparisons_list(incomedata, employment)

It looks like it’s working, so let’s try it. For now we’ll just run the default wilcox.test. We’ll use step_increase so we don’t have to think about spacing.

ggplot(data = incomedata,
       aes(x = employment, y = income, fill = employment, group = employment)) +
  geom_boxplot(show.legend = FALSE) +
  geom_signif(comparisons = comp.list, step_increase = .1)

Okay, that part works. Now we need to write a “test” that ggsignif understands. That took me a while to sort through, but in the end it requires very little code. There are two things to remember, both hidden in plain sight in the geom_signif documentation.

  1. “test = the name of the statistical test that is applied to the values of the 2 columns (e.g. ‘t.test’, ‘wilcox.test’ etc.).” The key here is values, which is why I bolded it. What gets passed to the test is not a reference to the columns in the data frame or a formula to be processed; the actual values (in our case, income for two particular levels of the employment factor) are passed as two distinct vectors.

  2. “If you implement a custom test make sure that it returns a list that has an entry called ‘p.value’.” Whatever we do internally, we must make sure that p.value can be found in the output.

Our test, called pairwise_bf, accepts two numeric vectors as input, applies ttestBF to them, and uses extractBF to create a dataframe (dataframes are, after all, just very special cases of lists) with the results. The only other thing we have to do is copy the Bayes factor from the column bf into a column called p.value to “fool” geom_signif. If that loses you, please go back to the earlier posts where I describe extracting the BF in more detail.

The code below shows a test of “Private for Profit” against “Self Not Inc”, which generates a BF10 of 3,694,058,115; this shows up on the plot in the right place, in scientific notation, as 3.7e+09. It’s the very first (bottom) paired comparison.

pairwise_bf <- function(x = NULL, y = NULL) {
  results <- ttestBF(x = x, y = y) %>%
    extractBF() %>%
    mutate(p.value = bf)
  return(results)
}

# make two vectors to test function
x1 <- incomedata %>%
  filter(employment == "Private for Profit") %>%
  droplevels() %>%
  pull(income)
y1 <- incomedata %>%
  filter(employment == "Self Not Inc") %>%
  droplevels() %>%
  pull(income)

# try our function
pairwise_bf(x1, y1)
## bf error time code p.value
## 1 3694058115 2.391095e-17 Thu Jul 18 14:27:48 2019 36fb194375e3 3694058115

ggplot(data = incomedata,
       aes(x = employment, y = income, fill = employment, group = employment)) +
  geom_boxplot(show.legend = FALSE) +
  geom_signif(comparisons = comp.list,
              test = "pairwise_bf",
              step_increase = .1)

Objectives 1 & 2 above are complete without violating #4. It’s time
for…

Customization

Plotting is working, but it can definitely use some improvement. I’m not talking about the color scheme or spacing (although they are important); how about just being able to easily read and interpret those numbers! So I’d like to make at least the following available:

  1. For the purists who simply want the BF10 values we’ll at least enable
    rounding to a select number of digits via a k parameter.

  2. One of the benefits of a Bayesian approach is that we can quantify support for our research hypothesis (usually labelled BF10) AND quantify support for the converse, or null, hypothesis (usually labelled BF01). Our custom pairwise_bf currently returns BF10, which can take any value greater than zero. Some of the pairs are very large (2.1e+32) and some quite small (.047), which I personally find confusing. If we remember that BF01 = 1 / BF10, we can write a case statement that displays BF10 or BF01 depending on which is larger. That way we know immediately which way the support trends.

  3. As noted earlier, BF values can be very large, so we’ll give the user the ability to display the natural log value via log(), so that, for example, log(2.1e+32) = 74.4246603, rounded to whatever precision they like.

  4. At some point, talking about odds over 100:1 (in my humble opinion) loses the need for precision. After all, is there really much difference between odds of 1,000:1 and 1,001:1? We’ll give the user an option to display BF10/BF01 values that are between 1 and 100 as-is, and then to bin larger values into ranges: 100 to 1,000, 1,000 to one million, and greater than one million.

  5. Finally, one of the strengths of using Bayes factors is that the user can draw their own conclusion about the strength of the evidence. There are, however, some suggested rubrics for describing the evidence that are gaining in popularity. While not a replacement for judgment, there have been some attempts to quantify the standards of evidence that would be considered meaningful in a scientific context. One that is widely used is from Wagenmakers, Wetzels, Borsboom, and Van Der Maas (2011). The article summarizes it in Table 1, but you may prefer the large graphic at https://media.springernature.com/full/springer-static/image/art%3A10.1186%2Fs12888-018-1761-4/MediaObjects/12888_2018_1761_Fig1_HTML.png.

So we’ll add to the function a parameter for the display_type as well as the rounding k. Remember that the object returned from pairwise_bf is a dataframe with just one row and a column named bf that contains our BF10. We need to provide geom_signif with a column named p.value, so we can add more columns with variations of how to display bf and then, before we return, choose which column to copy into p.value.

The first mutate statement implements #5 above and we’ll call it "support".
The second mutate implements #3 a conversion to a logged and rounded value,
we’ll call it "logged". The final mutate statement implements #4 above, and
in a twist of irony we’ll call it "human" because I find it most human
readable.

At this point our dataframe results looks something like this:

'data.frame': 1 obs. of 7 variables:
 $ bf     : num 5.16
 $ error  : num 8.1e-09
 $ time   : Factor w/ 1 level "Thu Jul 18 09:12:05 2019": 1
 $ code   : Factor w/ 1 level "15f524885ef9": 1
 $ support: chr "moderate BF10"
 $ logged : chr "log(BF10) = 1.64"
 $ human  : chr "BF01 = 5.16 : 1"

The final set of if logic simply determines which one of the display formats
is used based upon the user’s choice from display_type.

pairwise_bf <- function(x = NULL, y = NULL, display_type = "bf", k = 2) {
  results <- ttestBF(x = x, y = y) %>%
    extractBF() %>%
    mutate(support = case_when(
      bf < .01 ~ "extreme BF01",
      bf < .03 & bf >= .01 ~ "very strong BF01",
      bf < .1 & bf >= .03 ~ "strong BF01",
      bf < 1 / 3 & bf >= .1 ~ "moderate BF01",
      bf < 1 & bf >= 1 / 3 ~ "anecdotal BF01",
      bf >= 1 & bf < 3 ~ "anecdotal BF10",
      bf >= 3 & bf < 10 ~ "moderate BF10",
      bf >= 10 & bf < 30 ~ "strong BF10",
      bf >= 30 & bf < 100 ~ "very strong BF10",
      bf >= 100 ~ "extreme BF10"
    )) %>%
    mutate(logged = case_when(
      bf < 1 ~ paste("log(BF01) = ", round(log(1 / bf), k)),
      bf >= 1 ~ paste("log(BF10) = ", round(log(bf), k))
    )) %>%
    mutate(human = case_when(
      bf < .000001 ~ "BF01 >= 1,000,000 : 1",
      bf < .001 & bf >= .000001 ~ "BF01 >= 1,000 : 1",
      bf < .01 & bf >= .001 ~ "BF01 >= 100 : 1",
      bf < 1 & bf >= .01 ~ paste("BF01 = ", round(1 / bf, k), ": 1"),
      bf >= 1 & bf < 100 ~ paste("BF01 = ", round(bf, k), ": 1"),
      bf >= 100 & bf < 1000 ~ "BF10 >= 100 : 1",
      bf >= 1000 & bf < 1000000 ~ "BF10 >= 1,000 : 1",
      bf >= 1000000 ~ "BF10 >= 1,000,000 : 1"
    ))
  if (display_type == "support") {
    results <- mutate(results, p.value = support)
  } else if (display_type == "log") {
    results <- mutate(results, p.value = logged)
  } else if (display_type == "human") {
    results <- mutate(results, p.value = human)
  } else {
    results <- mutate(results, p.value = bf)
  }
  return(results)
}

# pairwise_bf(incomedata$employment, incomedata$income)

To make use of our new “features” we use the test.args parameter when we invoke geom_signif. Something like test.args = list(c(display_type = "human"), c(k = 1)) will work. For clarity’s sake, let’s limit ourselves to just a few comparisons (instead of all of them) by manually passing our pairs to comparisons. Since we’re getting close to our solution, I’ll also take the time to do what I should have done all along: add title information and load the scales package so the y axis can be properly formatted.

geom_boxplot automatically displays the median and outliers, which makes it
clear that as usual with income data we have a long upper tail, so I decided to
add the mean as a plotted point in a different shape and color.

Enough with the formatting, though; let’s see the results.

library(scales) ggplot(data = incomedata, aes( x = employment, y = income, fill = employment, group = employment )) + geom_boxplot(show.legend = FALSE) + stat_summary(fun.y = "mean", geom = "point", color = "dark red", shape = 15, size = 3, show.legend = FALSE) + geom_signif( comparisons = list( c("Self Not Inc", "Private for Profit"), c("Self Not Inc", "State Government"), c("Self Not Inc", "Private Non Profit"), c("Private for Profit", "State Government"), c("Private for Profit", "Private Non Profit"), c("Private for Profit", "Local Government") ), test = "pairwise_bf", test.args = list( c(display_type = "human"), c(k = 1) ), step_increase = .1 ) + ggtitle(label = "ACS 2016 Income by Employer Type", subtitle = "With selected multiple comparisons non directional hypothesis using bayes factors") + scale_y_continuous(label = dollar)

Looks good! For me it’s also very illuminating. Until I saw the data I had no idea there was such a large disparity between those who were self-employed and those self-employed who had taken the time and effort to incorporate. For our data, the odds as expressed by the BF10 are greater than a million to one against the self-employed unincorporated making the same amount as any of the other categories.

I also deliberately selected these 6 pairings so I could reinforce (hopefully not belabor) the utility of BF01. I will confess that before I saw the results I labored under the impression that people employed in the private sector would be better paid than government workers. Let’s consider the BF01 values for the “Private for Profit” pairings. Notice that, unlike traditional NHST rejection of the null, we get actual information value from the Bayes factors. A BF01 = 13.8 : 1 doesn’t just say we have insufficient evidence to reject the null H0; it actually provides odds of 13.8:1 that the averages are the same. Whereas BF01 = 1.5 : 1 expresses a great deal more uncertainty about whether they are the same or different: our data don’t provide strong evidence in either direction.

Perhaps using our “support” option across all the pairings is a good thing to
try at this juncture. Just to vary it a little we’ll also shift from box plots
to violin plots. Since we have a lot of pairwise comparisons we’ll also
change some parameters in our call to geom_signif with
step_increase = .07, textsize = 3.0, tip_length = 0.01 to allow us to make
better use of plot real estate.

ggplot(incomedata, aes( x = employment, y = income, fill = employment, group = employment )) + geom_violin(show.legend = FALSE) + stat_summary(fun.y="mean", geom="point", show.legend = FALSE) + stat_summary(fun.y="median", geom="point", color = "red", shape = 13, show.legend = FALSE) + geom_signif( comparisons = comp.list, test = "pairwise_bf", test.args = list( c(display_type = "support"), c(k = 0) ), step_increase = .07, textsize = 3.0, tip_length = 0.01 ) + ggtitle(label = "ACS 2016 Income by Employer Type", subtitle = "With selected multiple comparisons non directional hypothesis using bayes factors") + scale_y_continuous(label = dollar, breaks = seq(from = 0, to = 250000, by = 50000) )

Finally, just for completeness, I thought I’d show the results of using display_type = "log" with 3 digits. While I’m at it, I’ll highlight one of the joys of the open source ecosystem around modular frameworks like ggplot2, rather than indulge in additional tweaking of the theme (I know only a little about themes). Besides being able to pick and choose from a wide assortment of geom_* functions, as I did with geom_signif, you can also avail yourself of a wide variety of “ready made” themes; in this case hrbrthemes::theme_ipsum() from Bob Rudis.

I think the end result looks very nice.

theme_set(hrbrthemes::theme_ipsum()) ggplot(incomedata, aes( x = employment, y = income, fill = employment, group = employment )) + geom_violin(show.legend = FALSE) + stat_summary(fun.y="mean", geom="point", show.legend = FALSE) + stat_summary(fun.y="median", geom="point", color = "red", shape = 13, show.legend = FALSE) + geom_signif( comparisons = comp.list, test = "pairwise_bf", test.args = list( c(display_type = "log"), c(k = 3) ), step_increase = .09, textsize = 3.0, tip_length = 0.01 ) + ggtitle(label = "ACS 2016 Income by Employer Type", subtitle = "With selected multiple comparisons non directional hypothesis using bayes factors") + scale_y_continuous(label = dollar, breaks = seq(from = 0, to = 250000, by = 50000) )

Acknowledgements

There’s a quote attributed to Sir Isaac Newton that applies here:

If I have seen further than others, it is by standing upon the shoulders of giants.

While I hope the post and the code in it are useful, I’d be remiss if I didn’t acknowledge how much they rely on the work of others.

Done

I’m really very interested in feedback on whether people find this functionality useful. I’m tempted to, and would be happy to, expand it further (for example, automatically choosing which comparisons are displayed based upon a criterion the user sets). But as I acknowledged at the outset, I’m not sure whether this is a Frankenstein’s monster that will be abhorrent to frequentists and Bayesians alike, or whether someone has already implemented this with better code.

I do personally think it is a nice bridge that might be especially useful for
those like me who were taught frequentist methods early on and are still
learning their way around new methods. On the one hand the display is somehow
familiar and comforting while at the same time more informative and less mistake
prone.

Chuck

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License



Processing satellite image collections in R with the gdalcubes package

Thu, 07/18/2019 - 02:00

(This article was first published on r-spatial, and kindly contributed to R-bloggers)

[view raw Rmd]

The problem

Scientists working with collections and time series of satellite imagery
quickly run into some of the following problems:

  • Images from different areas of the world have different spatial
    reference systems (e.g., UTM zones).
  • The pixel size of a single image sometimes differs among its
    spectral bands / variables.
  • Spatially adjacent image tiles often overlap.
  • Time series of images are often irregular when the area of interest
    covers spatial areas larger than the extent of a single image.
  • Images from different data products or different satellites are
    distributed in diverse data formats and structures.

GDAL and the rgdal R
package
can solve most of
these difficulties by reading all relevant data formats and implementing
image warping to reproject, rescale, resample, and crop images.
However, GDAL does not know about image time series and hence there is
a lot of manual work needed before data scientists can actually work
with these data. Instead of collections of images, data users in many
cases desire a regular data structure such as a four-dimensional
raster data cube with dimensions
x, y, time, and band.
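As a toy illustration of what such a regular structure looks like in plain R (made-up numbers, nothing to do with gdalcubes internals), a four-dimensional array already has the shape described above:

# A regular cube with dimensions x, y, time, and band
cube <- array(rnorm(10 * 10 * 4 * 3),
              dim = c(x = 10, y = 10, time = 4, band = 3))
dim(cube)  # 10 10 4 3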

In R, there is currently no implementation to build regular data cubes
from image collections. The stars
package
provides a generic
implementation for processing raster and vector data cubes with an
arbitrary number of dimensions, but assumes that the data are already
organized as an array.

Introduction and overview of gdalcubes

This blog post introduces the gdalcubes R package, which aims to make working with collections and time series of satellite imagery easier and more interactive.

The core features of the package are:

  • build regular dense data cubes from large satellite image collections based on a user-defined data cube view (spatiotemporal extent, resolution, and map projection of the cube)
  • apply and chain operations on data cubes
  • allow for the execution of user-defined functions on data cubes
  • export data cubes as netCDF files, making it easy to process them further, e.g., with stars or raster.

Technically, the R package is a relatively lightweight wrapper around
the gdalcubes C++ library. The
library strongly builds on GDAL,
netCDF, and
SQLite (the full list of C++
libraries used by gdalcubes is found
here).

This blog post focuses on how to use the R package gdalcubes, more
details about the underlying ideas can be found in our recent open
access paper in the datacube special issue of the DATA
journal
.

Installation

The package can be installed directly from
CRAN:

install.packages("gdalcubes")

Some features and functions used in this blog post are still in the
development version (which will be submitted to CRAN as version 0.2.0),
which currently needs a source install:

# install.packages("remotes") remotes::install_git("https://github.com/appelmar/gdalcubes_R", ref="dev", args="--recursive")

If this fails with error messages like “no rule to make target …”,
please read here.

Demo dataset

We use a collection of 180 Landsat
8
surface reflectance images,
covering a small part of the Brazilian Amazon forest. If you would like
to work with the dataset on your own (and maybe reproduce some parts of
this blog post), you have two options:

Option A: Download the dataset with full resolution (30 meter pixel
size) here
(67 gigabytes compressed; 190 gigabytes unzipped)

Option B: Download a downsampled version of the dataset that
contains all images at a coarse spatial resolution (300 meter pixel
size) here
(740 megabytes compressed; 2 gigabytes unzipped)

Except for the spatial resolution of the images, the datasets are
identical, meaning that all code samples work for both options but
result images may look blurred when using option B. Here is the code to
download and unzip the dataset to your current working directory (which
might take some time):

# Option A
#download.file("https://uni-muenster.sciebo.de/s/SmiqWcCCeFSwfuY/download", destfile = "L8_Amazon.zip")

# Option B
if (!file.exists("L8_Amazon.zip")) {
  download.file("https://uni-muenster.sciebo.de/s/e5yUZmYGX0bo4u9/download", destfile = "L8_Amazon.zip")
  unzip("L8_Amazon.zip", exdir = "L8_Amazon")
}

Creating image collections

After extracting the zip archive, we get one directory per image, where each image contains 10 GeoTIFF files representing the spectral bands plus additional per-pixel quality information. As a first step, we must scan all available images once, extract some metadata (e.g. acquisition date/time, the spatial extent of each image, and how the files relate to bands), and store this information in a simple image collection index file. This file does not store any pixel values, only metadata and references to where the images can be found.

First, we simply need to find available GeoTIFF files in all
subdirectories of our demo dataset:

files = list.files("L8_Amazon", recursive = TRUE, full.names = TRUE, pattern = ".tif") length(files) ## [1] 1800 head(files) ## [1] "L8_Amazon/LC082260632014071901T1-SC20190715045926/LC08_L1TP_226063_20140719_20170421_01_T1_pixel_qa.tif" ## [2] "L8_Amazon/LC082260632014071901T1-SC20190715045926/LC08_L1TP_226063_20140719_20170421_01_T1_radsat_qa.tif" ## [3] "L8_Amazon/LC082260632014071901T1-SC20190715045926/LC08_L1TP_226063_20140719_20170421_01_T1_sr_aerosol.tif" ## [4] "L8_Amazon/LC082260632014071901T1-SC20190715045926/LC08_L1TP_226063_20140719_20170421_01_T1_sr_band1.tif" ## [5] "L8_Amazon/LC082260632014071901T1-SC20190715045926/LC08_L1TP_226063_20140719_20170421_01_T1_sr_band2.tif" ## [6] "L8_Amazon/LC082260632014071901T1-SC20190715045926/LC08_L1TP_226063_20140719_20170421_01_T1_sr_band3.tif"

To understand the structure of particular data products, the package
comes with a set of predefined rules (called collection formats) that
define how required metadata can be derived from the data. These include
formats for some Sentinel-2, MODIS, and Landsat products. We can list
all available formats with:

library(gdalcubes) ## Loading required package: Rcpp ## Loading required package: RcppProgress ## Loading required package: jsonlite ## Loading required package: ncdf4 ## Using gdalcubes library version 0.1.9999 collection_formats() ## CHIRPS_v2_0_daily_p05_tif | Image collection format for CHIRPS v 2.0 ## | daily global precipitation dataset (0.05 ## | degrees resolution) from GeoTIFFs, expects ## | list of .tif or .tif.gz files as input. ## | [TAGS: CHIRPS, precipitation] ## CHIRPS_v2_0_monthly_p05_tif | Image collection format for CHIRPS v 2.0 ## | monthly global precipitation dataset (0.05 ## | degrees resolution) from GeoTIFFs, expects ## | list of .tif or .tif.gz files as input. ## | [TAGS: CHIRPS, precipitation] ## L8_L1TP | Collection format for Landsat 8 Level 1 TP ## | product [TAGS: Landsat, USGS, Level 1, NASA] ## L8_SR | Collection format for Landsat 8 surface ## | reflectance product [TAGS: Landsat, USGS, ## | Level 2, NASA, surface reflectance] ## MxD10A2 | Collection format for selected bands from ## | the MODIS MxD10A2 (Aqua and Terra) v006 Snow ## | Cover product [TAGS: MODIS, Snow Cover] ## MxD11A1 | Collection format for selected bands from ## | the MODIS MxD11A2 (Aqua and Terra) v006 Land ## | Surface Temperature product [TAGS: MODIS, ## | LST] ## MxD11A2 | Collection format for selected bands from ## | the MODIS MxD11A2 (Aqua and Terra) v006 Land ## | Surface Temperature product [TAGS: MODIS, ## | LST] ## MxD13A3 | Collection format for selected bands from ## | the MODIS MxD13A3 (Aqua and Terra) product ## | [TAGS: MODIS, VI, NDVI, EVI] ## Sentinel2_L1C | Image collection format for Sentinel 2 Level ## | 1C data as downloaded from the Copernicus ## | Open Access Hub, expects a list of file ## | paths as input. The format works on original ## | ZIP compressed as well as uncompressed ## | imagery. [TAGS: Sentinel, Copernicus, ESA, ## | TOA] ## Sentinel2_L1C_AWS | Image collection format for Sentinel 2 Level ## | 1C data in AWS [TAGS: Sentinel, Copernicus, ## | ESA, TOA] ## Sentinel2_L2A | Image collection format for Sentinel 2 Level ## | 2A data as downloaded from the Copernicus ## | Open Access Hub, expects a list of file ## | paths as input. The format should work on ## | original ZIP compressed as well as ## | uncompressed imagery. [TAGS: Sentinel, ## | Copernicus, ESA, BOA, Surface Reflectance]

The “L8_SR” format is what we need for our demo dataset. Next, we must
tell gdalcubes to scan the files and build an image collection. Below,
we create an image collection from the set of GeoTIFF files, using the
“L8_SR” collection format and store the resulting image collection
under “L8.db”.

L8.col = create_image_collection(files, "L8_SR", "L8.db") L8.col ## A GDAL image collection object, referencing 180 images with 17 bands ## Images: ## name left top bottom ## 1 LC08_L1TP_226063_20140719_20170421_01_T1 -54.15776 -3.289862 -5.392073 ## 2 LC08_L1TP_226063_20140820_20170420_01_T1 -54.16858 -3.289828 -5.392054 ## 3 LC08_L1GT_226063_20160114_20170405_01_T2 -54.16317 -3.289845 -5.392064 ## 4 LC08_L1TP_226063_20160724_20170322_01_T1 -54.16317 -3.289845 -5.392064 ## 5 LC08_L1TP_226063_20170609_20170616_01_T1 -54.17399 -3.289810 -5.392044 ## 6 LC08_L1TP_226063_20170711_20170726_01_T1 -54.15506 -3.289870 -5.392083 ## right datetime ## 1 -52.10338 2014-07-19T00:00:00 ## 2 -52.11418 2014-08-20T00:00:00 ## 3 -52.10878 2016-01-14T00:00:00 ## 4 -52.10878 2016-07-24T00:00:00 ## 5 -52.11958 2017-06-09T00:00:00 ## 6 -52.09798 2017-07-11T00:00:00 ## srs ## 1 +proj=utm +zone=22 +datum=WGS84 +units=m +no_defs ## 2 +proj=utm +zone=22 +datum=WGS84 +units=m +no_defs ## 3 +proj=utm +zone=22 +datum=WGS84 +units=m +no_defs ## 4 +proj=utm +zone=22 +datum=WGS84 +units=m +no_defs ## 5 +proj=utm +zone=22 +datum=WGS84 +units=m +no_defs ## 6 +proj=utm +zone=22 +datum=WGS84 +units=m +no_defs ## [ omitted 174 images ] ## ## Bands: ## name offset scale unit nodata image_count ## 1 AEROSOL 0 1 180 ## 2 B01 0 1 -9999.000000 180 ## 3 B02 0 1 -9999.000000 180 ## 4 B03 0 1 -9999.000000 180 ## 5 B04 0 1 -9999.000000 180 ## 6 B05 0 1 -9999.000000 180 ## 7 B06 0 1 -9999.000000 180 ## 8 B07 0 1 -9999.000000 180 ## 9 EVI 0 1 -9999.000000 0 ## 10 MSAVI 0 1 -9999.000000 0 ## 11 NBR 0 1 -9999.000000 0 ## 12 NBR2 0 1 -9999.000000 0 ## 13 NDMI 0 1 -9999.000000 0 ## 14 NDVI 0 1 -9999.000000 0 ## 15 PIXEL_QA 0 1 180 ## 16 RADSAT_QA 0 1 180 ## 17 SAVI 0 1 -9999.000000 0

Internally, the output file is a simple SQLite database. Notice that our collection does not contain data for all possible bands (see the image_count column). Depending on the particular download request, Landsat 8 surface reflectance data may come with additional post-processed bands (such as vegetation indexes), which can be used if available.
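Because the collection is stored on disk, it can be reloaded in a later session without rescanning the imagery. The lines below are a small sketch of this, assuming the gdalcubes image_collection() loader and the "L8.db" file created above:

library(gdalcubes)

# Reopen the previously created collection file instead of
# rescanning all GeoTIFFs (assumes "L8.db" exists in the working directory).
L8.col <- image_collection("L8.db")
L8.col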

Creating and processing data cubes

To create a raster data cube, we need (i) an image collection and (ii) a data cube view that defines how we look at the data, i.e., at which spatiotemporal resolution, over which window, and in which spatial reference system. For a quick look at the data, we define a cube view with 1 km x 1 km pixel size, yearly temporal resolution, covering the full spatiotemporal extent of the image collection, and using the Web Mercator spatial reference system.

v.overview = cube_view(extent=L8.col, dt="P1Y", dx=1000, dy=1000, srs="EPSG:3857", aggregation = "median", resampling = "bilinear") raster_cube(L8.col, v.overview) ## A GDAL data cube proxy object ## Dimensions: ## name low high size chunk_size ## 1 t 2013.0 2019.0 7 16 ## 2 y -764014.4 -205014.4 559 256 ## 3 x -6582280.1 -5799280.1 783 256 ## ## Bands: ## name type offset scale nodata unit ## 1 AEROSOL float64 0 1 NaN ## 2 B01 float64 0 1 NaN ## 3 B02 float64 0 1 NaN ## 4 B03 float64 0 1 NaN ## 5 B04 float64 0 1 NaN ## 6 B05 float64 0 1 NaN ## 7 B06 float64 0 1 NaN ## 8 B07 float64 0 1 NaN ## 9 EVI float64 0 1 NaN ## 10 MSAVI float64 0 1 NaN ## 11 NBR float64 0 1 NaN ## 12 NBR2 float64 0 1 NaN ## 13 NDMI float64 0 1 NaN ## 14 NDVI float64 0 1 NaN ## 15 PIXEL_QA float64 0 1 NaN ## 16 RADSAT_QA float64 0 1 NaN ## 17 SAVI float64 0 1 NaN

As specified in our data cube view, the time dimension of the resulting data cube has only 7 values, representing the years 2013 to 2019. The aggregation parameter in the data cube view defines how values from multiple images in the same year are combined, whereas the selected resampling algorithm is applied when reprojecting and rescaling individual images.
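To try a different combination of these two parameters, a modified view can be derived from an existing one; the sketch below is only an illustration, with "mean" and "near" as arbitrary choices:

# Sketch: derive a new view from v.overview with different aggregation/resampling.
v.mean = cube_view(view = v.overview, aggregation = "mean", resampling = "near")
raster_cube(L8.col, v.mean)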

If we are interested in a smaller area at a higher temporal resolution, we simply need to define a data cube view with different parameters, including a specific spatiotemporal extent, by passing a list as the extent argument to cube_view. Below, we define a data cube view for a 100 km x 100 km area with 50 m pixel size at monthly temporal resolution.

v.subarea = cube_view(extent=list(left=-6320000, right=-6220000, bottom=-600000, top=-500000, t0="2014-01-01", t1="2018-12-31"), dt="P1M", dx=50, dy=50, srs="EPSG:3857", aggregation = "median", resampling = "bilinear") raster_cube(L8.col, v.subarea) ## A GDAL data cube proxy object ## Dimensions: ## name low high size chunk_size ## 1 t 201401 201812 60 16 ## 2 y -600000 -500000 2000 256 ## 3 x -6320000 -6220000 2000 256 ## ## Bands: ## name type offset scale nodata unit ## 1 AEROSOL float64 0 1 NaN ## 2 B01 float64 0 1 NaN ## 3 B02 float64 0 1 NaN ## 4 B03 float64 0 1 NaN ## 5 B04 float64 0 1 NaN ## 6 B05 float64 0 1 NaN ## 7 B06 float64 0 1 NaN ## 8 B07 float64 0 1 NaN ## 9 EVI float64 0 1 NaN ## 10 MSAVI float64 0 1 NaN ## 11 NBR float64 0 1 NaN ## 12 NBR2 float64 0 1 NaN ## 13 NDMI float64 0 1 NaN ## 14 NDVI float64 0 1 NaN ## 15 PIXEL_QA float64 0 1 NaN ## 16 RADSAT_QA float64 0 1 NaN ## 17 SAVI float64 0 1 NaN

The raster_cube function always returns a proxy object, meaning that no expensive computations or data reads from disk are started. Instead, gdalcubes delays execution until the data is actually needed, for example when users call plot() or write_ncdf(). However, the result of our call to raster_cube can be passed to data cube operators. For example, the code below drops all bands except the visible RGB bands and, again, returns a proxy object.

L8.cube.rgb = select_bands(raster_cube(L8.col, v.overview), c("B02","B03","B04")) L8.cube.rgb ## A GDAL data cube proxy object ## Dimensions: ## name low high size chunk_size ## 1 t 2013.0 2019.0 7 16 ## 2 y -764014.4 -205014.4 559 256 ## 3 x -6582280.1 -5799280.1 783 256 ## ## Bands: ## name type offset scale nodata unit ## 1 B02 float64 0 1 NaN ## 2 B03 float64 0 1 NaN ## 3 B04 float64 0 1 NaN

Calling plot() will eventually start the computation, and hence might take some time:

system.time(plot(L8.cube.rgb, rgb=3:1, zlim=c(0,1200)))

## user system elapsed ## 16.367 0.605 17.131

Chaining data cube operations

For the remaining examples, we use multiple threads to process data
cubes by setting:

gdalcubes_options(threads=4)

We can chain many of the provided data cube operators (e.g., using the
pipe %>%). The following code will derive the median values of the RGB
bands over time, producing a single RGB overview image for our selected
subarea.

suppressPackageStartupMessages(library(magrittr)) # use the pipe raster_cube(L8.col, v.subarea) %>% select_bands(c("B02","B03","B04")) %>% reduce_time("median(B02)", "median(B03)", "median(B04)") %>% plot(rgb=3:1, zlim=c(0,1200))

Implemented data cube operators include the following (a short sketch chaining several of them follows this list):

  • apply_pixel applies one or more arithmetic expressions to individual data cube pixels, e.g., to derive vegetation indexes.
  • reduce_time applies one or more reducers over pixel time series.
  • reduce_space applies one or more reducers over spatial slices.
  • select_bands subsets the available bands.
  • window_time applies an aggregation function or convolution kernel over moving time series windows.
  • join_bands combines the bands of two identically shaped data cubes.
  • filter_pixel filters pixels by a logical expression on band values, e.g., to select all pixels with NDVI larger than 0.
  • write_ncdf exports a data cube as a netCDF file.
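As a concrete (and purely illustrative) sketch that is not part of the original post, the chain below derives NDVI, keeps only pixels with NDVI above 0.5, takes the per-pixel median over time, and writes the result to a netCDF file; the file name "ndvi_median.nc" is a placeholder.

# Sketch: chaining several data cube operators on the overview view.
raster_cube(L8.col, v.overview) %>%
  select_bands(c("B04", "B05")) %>%
  apply_pixel("(B05-B04)/(B05+B04)", names = "NDVI") %>%  # derive NDVI
  filter_pixel("NDVI > 0.5") %>%                          # keep vegetated pixels only
  reduce_time("median(NDVI)") %>%                         # median NDVI per pixel over time
  write_ncdf("ndvi_median.nc")                            # export as netCDF (placeholder name)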

In a second example, we compute the normalized difference vegetation index (NDVI) with apply_pixel and derive its maximum value over time:

suppressPackageStartupMessages(library(viridis)) # use colors from viridis package raster_cube(L8.col, v.subarea) %>% select_bands(c("B04","B05")) %>% apply_pixel(c("(B05-B04)/(B05+B04)"), names="NDVI") %>% reduce_time("max(NDVI)") %>% plot(zlim=c(-0.2,1), col=viridis, key.pos=1)

User-defined functions

Previous examples used character expressions to define reducer and arithmetic functions. Operations like apply_pixel and filter_pixel take character arguments because these expressions are translated to C++ functions, so all computations run purely in C++. However, to give users more flexibility and allow the definition of user-defined functions, reduce_time and apply_pixel also accept arbitrary R functions as an argument. In the example below, we derive the 0.25 and 0.75 quantiles of NDVI time series. There is of course no limitation on what the provided reducer function does, so it is also possible to use functions from other packages.

v.16d = cube_view(view=v.overview, dt="P16D") raster_cube(L8.col, v.16d) %>% select_bands(c("B04", "B05")) %>% apply_pixel(c("(B05-B04)/(B05+B04)"), names="NDVI") %>% reduce_time(names = c("q1","q3"), FUN = function(x) { quantile(x["NDVI",], probs = c(0.25, 0.75), na.rm = TRUE) }) %>% plot(col=viridis, zlim=c(-0.2,1), key.pos=1)

However, there are some things users need to keep in mind when working with user-defined functions (a small example follows the list below):

  1. Users should provide names for the output bands and make sure that the function always returns the same number of elements.
  2. When executed, the function runs in a new R session, meaning that it cannot access variables in the current workspace, and packages must be loaded within the function if needed.
  3. Ideally, users should carefully check for errors. A frequent cause of errors is the presence of NA values, which are abundant in raster data cubes built from irregular image collections.
  4. In the current version, only apply_pixel and reduce_time allow passing user-defined functions.
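As a minimal illustration of points 1–3 (our own sketch, not code from the original post), the reducer below names its output bands, always returns exactly two values, and handles NA values explicitly:

raster_cube(L8.col, v.16d) %>%
  select_bands(c("B04", "B05")) %>%
  apply_pixel("(B05-B04)/(B05+B04)", names = "NDVI") %>%
  reduce_time(names = c("ndvi_median", "n_obs"), FUN = function(x) {
    # if additional packages were needed, load them here with library(),
    # because the function runs in a fresh R session
    v <- x["NDVI", ]
    v <- v[!is.na(v)]                     # drop NA values explicitly
    if (length(v) == 0) return(c(NA, 0))  # always return two elements
    c(median(v), length(v))
  })
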
Interfacing with stars

The stars package is much more generic: it supports higher-dimensional arrays and hence, for example, data from climate model output. It also does not assume the data to be orthorectified, i.e., it works with curvilinear grids as well and therefore supports data such as that from Sentinel-5P. In contrast, gdalcubes concentrates on (four-dimensional) multispectral image time series only.

gdalcubes currently comes with a simple as_stars() function, which writes a data cube to a (temporary) netCDF file that is then opened with read_stars. The resulting stars object holds the bands as attributes. If needed (e.g., for ggplot below), st_redimension converts the attributes into a new dimension.

suppressPackageStartupMessages(library(ggplot2)) suppressPackageStartupMessages(library(dplyr)) suppressPackageStartupMessages(library(stars)) x = raster_cube(L8.col, v.overview) %>% select_bands(c("B02","B03","B04")) %>% as_stars() x ## stars object with 3 dimensions and 3 attributes ## attribute(s), summary of first 1e+05 cells: ## B02 B03 B04 ## Min. : 123.8 Min. : 174.1 Min. : 112.1 ## 1st Qu.: 209.9 1st Qu.: 397.3 1st Qu.: 247.2 ## Median : 295.0 Median : 497.9 Median : 380.3 ## Mean : 457.6 Mean : 653.0 Mean : 563.5 ## 3rd Qu.: 476.7 3rd Qu.: 686.5 3rd Qu.: 662.6 ## Max. :9419.0 Max. :9780.7 Max. :10066.2 ## NA's :79497 NA's :79497 NA's :79497 ## dimension(s): ## from to offset delta refsys point ## x 1 783 -6582280 1000 +proj=merc +a=6378137 +b=... FALSE ## y 1 559 -205014 -1000 +proj=merc +a=6378137 +b=... FALSE ## time 1 7 NA NA POSIXct FALSE ## values ## x NULL [x] ## y NULL [y] ## time 2013-01-01,...,2019-01-01 ggplot() + geom_stars(data = slice(st_redimension(x), "time", 5)) + facet_wrap(~new_dim) + theme_void() + coord_fixed() + scale_fill_viridis(limits=c(0, 2000))

Future work

There are many things we didn't cover in this blog post, such as applying masks during the construction of data cubes. More importantly, a question we are very much interested in at the moment is how far we can go with gdalcubes and stars in cloud computing environments, where huge image collections such as the full Sentinel-2 archive are already stored. This will be the topic of another blog post.

We also have a lot of ideas for future developments but no fixed schedule. If you would like to contribute, get involved, or have further ideas, please get in touch! Development and discussion take place on GitHub (both for the R package and for the C++ library).



rOpenSci Hiring for New Position in Statistical Software Testing and Peer Review

Thu, 07/18/2019 - 02:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Are you passionate about statistical methods and software? If so we would love for you to join our team to dig deep into the world of statistical software packages. You’ll develop standards for evaluating and reviewing statistical tools, publish, and work closely with an international team of experts to set up a new software review system.

We are seeking a creative, dedicated, and collaborative software research scientist to support a two-year project in launching a new software peer-review initiative. The software research scientist will work on the Sloan Foundation supported rOpenSci project, with rOpenSci staff and a statistical methods editorial board. They will research and develop standards and review guidelines for statistical software, publish findings, and develop R software to test packages against those standards. The software research scientist will work with staff and the board to collaborate broadly with the statistical and software communities to gather input, refine and promote the standards, and recruit editors and peer reviewers. The candidate must be self-motivated, proactive, collaborative and comfortable working openly and reproducibly with a broad online community.

For more details and how to apply see https://ropensci.org/careers/.



Combining momentum and value into a simple strategy to achieve higher returns

Wed, 07/17/2019 - 11:53

(This article was first published on Data based investing, and kindly contributed to R-bloggers)

In this post I'll introduce a simple investing strategy that is well diversified and has been shown to work across different markets. In short, buying cheap and uptrending stocks has historically led to notably higher returns. The strategy is a combination of two different investment styles, value and momentum.

In a previous post I explained how the range of possible outcomes when investing in a single market is excessively wide. Global diversification is therefore key to making sure you achieve your investment objective. This strategy is diversified across styles, markets and individual stocks. Its benefits are low implementation costs, a high level of diversification, higher expected returns and lower drawdowns.

We'll use data from Barclays for the CAPEs, which represent valuations, and returns (excluding dividends) from Yahoo Finance via quantmod, which we'll use as absolute momentum. Let's take a look at the paths of valuation and momentum for the U.S. stock market over the last seven years: the two corrections are easy to spot, because momentum was low and valuations decreased. The U.S. stock market currently has strong momentum as measured by six-month absolute return, but the valuation level is really high. Therefore the U.S. is not the optimal country to invest in.

So, which market is the optimal place to be? Looking at just the current values of the different markets, there is only one market that is in exactly the right spot: Russia. It has the highest momentum and the second-lowest valuation of all the countries in this sample. In emerging markets things happen faster and more intensely, which leads to more opportunities and makes investing in them more interesting. Different markets also tend to be in different cycles, which makes this combination strategy even more attractive. Let's discuss these strategies in more detail and why they work well together.

Research on the topic
Value and momentum factors are negatively correlated, which means that when one of them has low returns, the other's returns tend to be higher. Both have been found to lead to excess returns, and they are two of the most researched so-called anomalies. Researchers have tried to explain both strategies with risk-based and behavioral factors, but no single explanation has been agreed on for either. The fact that there are multiple explanations for the superior performance can rather be viewed as a good thing for the strategies.

In their book "Your Complete Guide to Factor-Based Investing", Berkin and Swedroe found that the yearly returns of the two anomalies using a long-short strategy were 4.8 percent for value and 9.6 percent for momentum. This corresponds to the return of the factor itself and can be compared directly to the market beta factor, which had a historical annual return of 8.3 percent over the same period. This means that investing in just the momentum factor, and therefore hedging against the market, would have led to a higher return than investing in the market alone. It is important to note that a momentum strategy without shorting gives exposure to both the market beta and momentum factors, which leads to a higher return than investing in either factor alone.

Andreu et al. examined momentum at the country level and found that the return of the momentum factor has been about 6 percent per annum for a holding period of six months. For a holding period of twelve months, the return was cut in half (source). A short holding period therefore seems to work best for this momentum strategy. They looked at investing in one or three countries at a time while shorting the same number of countries; the smaller number of countries led to higher returns, but no risk measures were presented in the study. As a short-term strategy I'd suggest equally weighting some of the countries with high momentum and low valuation.

I've also tested the combination of value and momentum in the U.S. stock market, and it seems that momentum does not affect returns at all over longer periods of time. Value, on the other hand, tends to correlate strongly with future returns only over much longer periods, and over shorter periods the correlation is close to zero, as I demonstrated in a previous post. However, the short-term CAGR of the value strategy at the country level in the U.S. has still been rather impressive at 14.5 percent for a CAPE ratio of 5 to 10, as shown by Faber (source, figure 3A). I chose to show this specific valuation level because countries such as Turkey and Russia are currently trading at these valuation levels (source).

The 10-year cyclically adjusted price-to-earnings ratio discussed above, also known as CAPE, has been shown to be among the best variables for explaining the future returns of the stock market. It has a logarithmic relationship with future 10-15 year returns, and an r-squared as high as 0.49 across 17 country-level indices (source, page 11). A lower CAPE has also led to smaller maximum and average drawdowns (source).

Faber has shown that investing in countries with a low CAPE has returned 14 percent annually since 1993, and the risk-adjusted return has also been really good (source). The strategy, and value investing as a whole, has however underperformed for the last ten years or so (source). This is good news if you believe in mean reversion in the stock market.

The two strategies also work well together at the stock level, as shown by Keimling (source). According to the study, the quintile with the highest momentum has led to a yearly excess return of 2.7 percent, and the one with the lowest valuation to a yearly excess return of 3 percent globally. Choosing stocks with the highest momentum and lowest valuations has more than doubled the excess return, to 7.6 percent. O'Shaughnessy has shown that the absolute return for the quintile with the highest momentum was 11.6 percent, and 11.8 percent for value; combining the two led to a return of 18.5 percent (source).

Lastly, let's take a closer look at some selected countries and their paths. As expected, the returns of the emerging markets vary a lot compared to the U.S. market. The U.S. has performed extremely well, but historical earnings haven't kept up with prices. Israel, on the other hand, has gotten cheaper while its momentum has been good. Even though U.S. momentum is higher than at any other point in this sample, Russia's momentum is currently higher still, and Turkey's momentum has been far higher. Both Russia's and Turkey's valuations are less than a third of U.S. valuations, which makes these markets very interesting.

In conclusion, combining value and momentum investing into a medium-term strategy is likely to lead to excess returns, as shown by previous research. The strategy can easily be implemented using country-specific exchange traded funds, and the data is readily available. Currently only Russia is in the sweet spot for this strategy, and Turkey might be once it gains some momentum. Investing in just one country is however risky, and I suggest diversifying between the markets with high momentum and low valuations. Be sure to follow me on Twitter for updates about new blog posts! The R code used in the analysis can be found here.
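As a rough illustration of the momentum leg described above (our own sketch, not the author's code linked above), a six-month absolute return can be computed from Yahoo Finance prices with quantmod along these lines; the "SPY" ticker and the 126-trading-day window are placeholders:

library(quantmod)

# Sketch: six-month absolute momentum from (dividend-excluding) closing prices.
prices <- getSymbols("SPY", src = "yahoo", auto.assign = FALSE)
close  <- Cl(prices)                    # closing prices, no dividends
n      <- 126                           # roughly six months of trading days
momentum_6m <- as.numeric(last(close)) / as.numeric(close[NROW(close) - n]) - 1
momentum_6m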



An Ad-hoc Method for Calibrating Uncalibrated Models

Wed, 07/17/2019 - 03:54

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

In the previous article in this series, we showed that common ensemble models like random forest and gradient boosting are uncalibrated: they are not guaranteed to estimate aggregates or rollups of the data in an unbiased way. However, they can be preferable to calibrated models such as linear or generalized linear regression, when they make more accurate predictions on individuals. In this article, we’ll demonstrate one ad-hoc method for calibrating an uncalibrated model with respect to specific grouping variables. This "polishing step" potentially returns a model that estimates certain rollups in an unbiased way, while retaining good performance on individual predictions.

Example: Predicting income

We’ll continue the example from the previous posts in the series: predicting income from demographic variables (sex, age, employment, education). The data for our example is from the 2016 US Census American Community Survey (ACS) Public Use Microdata Sample (PUMS). More information about the data can be found here. First, we’ll get the training and test data, and show how the expected income varies across different groupings (by sex, by employment, and by education):

library(zeallot) library(wrapr) incomedata <- readRDS("incomedata.rds") c(test, train) %<-% split(incomedata, incomedata$gp) # get the rollups (mean) by grouping variable show_conditional_means <- function(d, outcome = "income") { cols <- qc(sex, employment, education) lapply( cols := cols, function(colname) { aggregate(d[, outcome, drop = FALSE], d[, colname, drop = FALSE], FUN = mean) }) } display_tables <- function(tlist) { for(vi in tlist) { print(knitr::kable(vi)) } } display_tables( show_conditional_means(train)) sex income Male 55755.51 Female 47718.52 employment income Employee of a private for profit 51620.39 Federal government employee 64250.09 Local government employee 54740.93 Private not-for-profit employee 53106.41 Self employed incorporated 66100.07 Self employed not incorporated 41346.47 State government employee 53977.20 education income no high school diploma 31883.18 Regular high school diploma 38052.13 GED or alternative credential 37273.30 some college credit, no degree 42991.09 Associate’s degree 47759.61 Bachelor’s degree 65668.51 Master’s degree 79225.87 Professional degree 97772.60 Doctorate degree 91214.55 A random forest model

For this post, we’ll train a random forest model to predict income.

library(randomForest) model_rf_1stage <- randomForest(income ~ age+sex+employment+education, data=train) train$pred_rf_raw <- predict(model_rf_1stage, newdata=train, type="response") # doesn't roll up display_tables( show_conditional_means(train, qc(income, pred_rf_raw))) sex income pred_rf_raw Male 55755.51 55292.47 Female 47718.52 48373.40 employment income pred_rf_raw Employee of a private for profit 51620.39 51291.36 Federal government employee 64250.09 61167.31 Local government employee 54740.93 55425.30 Private not-for-profit employee 53106.41 54174.31 Self employed incorporated 66100.07 63714.20 Self employed not incorporated 41346.47 46415.34 State government employee 53977.20 55599.89 education income pred_rf_raw no high school diploma 31883.18 41673.91 Regular high school diploma 38052.13 42491.11 GED or alternative credential 37273.30 43037.49 some college credit, no degree 42991.09 44547.89 Associate’s degree 47759.61 46815.79 Bachelor’s degree 65668.51 63474.48 Master’s degree 79225.87 69953.53 Professional degree 97772.60 76861.44 Doctorate degree 91214.55 75940.24

As we observed before, the random forest model predictions do not match the true rollups, even on the training data.

Polishing the model

Suppose that we wish to make individual predictions of subjects’ incomes, and estimate mean income as a function of employment type. An ad-hoc way to do this is to adjust the predictions from the random forest, depending on subjects’ employment type, so that the resulting polished model is calibrated with respect to employment. Since linear models are calibrated, we might try fitting a linear model to the random forest model’s predictions, along with employment.

(Of course, we could use a Poisson model as well, but for this example we’ll just use a simple linear model for the polishing step).

One caution: we shouldn’t use the same data to fit both the random forest model and the polishing model. This leads to nested-model bias, a potential source of overfit. Either we must split the training data into two sets: one to train the random forest model and another to train the polishing model; or we have to use cross-validation to simulate having two sets. This second procedure is the same procedure used when stacking multiple models; you can think of this polishing procedure as being a form of stacking, where some of the sub-learners are simply single variables.

Let’s use 5-fold cross-validation to "stack" the random forest model and the employment variable. We’ll use vtreat to create the cross-validation plan.

set.seed(2426355) # build a schedule for 5-way crossval crossplan <- vtreat::kWayCrossValidation(nrow(train), 5)

The crossplan is a list of five splits of the data (described by row indices); each split is itself a list of two disjoint index vectors: split$train and split$app. For each fold, we want to train a model using train[split$train, , drop=FALSE] and then apply the model to train[split$app, , drop=FALSE].

train$pred_uncal <- 0 # use cross validation to get uncalibrated predictions for(split in crossplan) { model_rf_2stage <- randomForest(income ~ age+sex+employment+education, data=train[split$train, , drop=FALSE]) predi <- predict(model_rf_2stage, newdata=train[split$app, , drop=FALSE], type="response") train$pred_uncal[split$app] <- predi }

The vector train$pred_uncal is now a vector of random forest predictions on the training data; every prediction is made using a model that was not trained on the datum in question.

Now we can use these random forest predictions to train the linear polishing model.

# learn a polish/calibration for employment rf_polish <- lm(income - pred_uncal ~ employment, data=train) # get rid of pred_uncal, as it's no longer needed train$pred_uncal <- NULL

Now, take the predictions from the original random forest model (the one trained on all the data, earlier), and polish them with the polishing model.

# get the predictions from the original random forest model train$pred_rf_raw <- predict(model_rf_1stage, newdata=train, type="response") # polish the predictions so that employment rolls up correctly train$pred_cal <- train$pred_rf_raw + predict(rf_polish, newdata=train, type="response") # see how close the rollups get to ground truth rollups <- show_conditional_means(train, qc(income, pred_cal, pred_rf_raw)) display_tables(rollups) sex income pred_cal pred_rf_raw Male 55755.51 55343.35 55292.47 Female 47718.52 48296.93 48373.40 employment income pred_cal pred_rf_raw Employee of a private for profit 51620.39 51640.44 51291.36 Federal government employee 64250.09 64036.19 61167.31 Local government employee 54740.93 54739.80 55425.30 Private not-for-profit employee 53106.41 53075.76 54174.31 Self employed incorporated 66100.07 66078.76 63714.20 Self employed not incorporated 41346.47 41341.37 46415.34 State government employee 53977.20 53946.07 55599.89 education income pred_cal pred_rf_raw no high school diploma 31883.18 41526.88 41673.91 Regular high school diploma 38052.13 42572.57 42491.11 GED or alternative credential 37273.30 43104.09 43037.49 some college credit, no degree 42991.09 44624.38 44547.89 Associate’s degree 47759.61 46848.84 46815.79 Bachelor’s degree 65668.51 63468.93 63474.48 Master’s degree 79225.87 69757.13 69953.53 Professional degree 97772.60 76636.17 76861.44 Doctorate degree 91214.55 75697.59 75940.24

Note that the rolled up predictions from the polished model almost match the true rollups for employment, but not for the other grouping variables (sex and education). To see this better, let’s look at the total absolute error of the estimated rollups.

err_mag <- function(x, y) { sum(abs(y-x)) } preds = qc(pred_rf_raw, pred_cal) errframes <- lapply(rollups, function(df) { lapply(df[, preds], function(p) err_mag(p, df$income)) %.>% as.data.frame(.) }) errframes <- lapply(rollups, function(df) { gp = names(df)[[1]] errs <- lapply(df[, preds], function(p) err_mag(p, df$income)) as.data.frame(c(grouping=gp, errs)) }) display_tables(errframes) grouping pred_rf_raw pred_cal sex 1117.927 990.5685 grouping pred_rf_raw pred_cal employment 14241.51 323.2577 grouping pred_rf_raw pred_cal education 70146.37 70860.7

We can reduce the rollup errors substantially for the variables that the polishing model was exposed to. For variables that the polishing model is not exposed to, there is no improvement; it’s likely that those estimated rollups will in many cases be worse.

Model performance on holdout data

Let’s see the performance of the polished model on test data.

# get the predictions from the original random forest model test$pred_rf_raw <- predict(model_rf_1stage, newdata=test, type="response") # polish the predictions so that employment rolls up correctly test$pred_cal <- test$pred_rf_raw + predict(rf_polish, newdata=test, type="response") # compare the rollups on employment preds <- qc(pred_rf_raw, pred_cal) employment_rollup <- show_conditional_means(test, c("income", preds))$employment knitr::kable(employment_rollup) employment income pred_rf_raw pred_cal Employee of a private for profit 50717.96 51064.25 51413.32 Federal government employee 66268.05 61401.94 64270.82 Local government employee 52565.89 54878.96 54193.47 Private not-for-profit employee 52887.52 54011.64 52913.09 Self employed incorporated 67744.61 63664.51 66029.07 Self employed not incorporated 41417.25 46215.42 41141.44 State government employee 51314.92 55395.96 53742.14 # see how close the rollups get to ground truth for employment lapply(employment_rollup[, preds], function(p) err_mag(p, employment_rollup$income)) %.>% as.data.frame(.) %.>% knitr::kable(.) pred_rf_raw pred_cal 21608.9 8764.302

The polished model estimates rollups with respect to employment better than the uncalibrated random forest model. Its performance on individual predictions (as measured by root mean squared error) is about the same.

# predictions on individuals rmse <- function(x, y) { sqrt(mean((y-x)^2)) } lapply(test[, preds], function(p) rmse(p, test$income)) %.>% as.data.frame(.) %.>% knitr::kable(.) pred_rf_raw pred_cal 31780.39 31745.12

Conclusion

We’ve demonstrated a procedure that mitigates bias issues with ensemble models, or any other uncalibrated model. This potentially allows the data scientist to balance the requirement for highly accurate predictions on individuals with the need to correctly estimate specific aggregate quantities.

This method is ad-hoc, and may be somewhat brittle. In addition, it requires that the data scientist knows ahead of time which rollups will be desired in the future. However, if you find yourself in a situation where you must balance accurate individual prediction with accurate aggregate estimates, this may be a useful trick to have in your data science toolbox.

Loglinear models

Jelmer Ypma has pointed out to us that for the special case of loglinear models (that is, a linear model for log(y)), there are other techniques for mitigating bias in predictions of y. More information on these methods can be found in chapter 6.4 of Introductory Econometrics: A Modern Approach by Jeffrey Wooldridge (2014).

These methods explicitly assume that y is lognormally distributed (an assumption that is often valid for monetary amounts), and try to estimate the true standard deviation of log(y) in order to adjust the estimates of y. They do not completely eliminate the bias, because this true standard deviation is unknown, but they do reduce it, while making predictions on individuals with RMSE performance competitive with the performance of linear or (quasi)Poisson models fit directly to y. However, they do not give the improvements on relative error that the naive adjustment we showed in the first article of this series will give.
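As a rough sketch of the kind of adjustment these methods make (our own illustration on simulated data, assuming log(y) is approximately normal; not code from Wooldridge or the original post): fit a linear model to log(y), estimate the residual standard deviation, and scale the back-transformed predictions by exp(sigma^2 / 2).

# Sketch: bias adjustment for a loglinear model, assuming log(y) is roughly normal.
# d, x and y below are simulated placeholders for illustration only.
set.seed(1)
d <- data.frame(x = runif(500))
d$y <- exp(1 + 2 * d$x + rnorm(500, sd = 0.5))   # lognormal-style outcome

fit   <- lm(log(y) ~ x, data = d)
sigma <- summary(fit)$sigma                      # estimated sd of the log-scale residuals

pred_naive    <- exp(predict(fit, newdata = d))               # biased low for E[y]
pred_adjusted <- pred_naive * exp(sigma^2 / 2)                # lognormal mean correction

c(mean(d$y), mean(pred_naive), mean(pred_adjusted))           # adjusted mean is closer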



Three Strategies for Working with Big Data in R

Wed, 07/17/2019 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

For many R users, it’s obvious why you’d want to use R with big data, but not so obvious how. In fact, many people (wrongly) believe that R just doesn’t work very well for big data.

In this article, I’ll share three strategies for thinking about how to use big data in R, as well as some examples of how to execute each of them.

By default R runs only on data that can fit into your computer’s memory. Hardware advances have made this less of a problem for many users since these days, most laptops come with at least 4-8Gb of memory, and you can get instances on any major cloud provider with terabytes of RAM. But this is still a real problem for almost any data set that could really be called big data.

The fact that R runs on in-memory data is the biggest issue that you face when trying to use Big Data in R. The data has to fit into the RAM on your machine, and it’s not even 1:1. Because you’re actually doing something with the data, a good rule of thumb is that your machine needs 2-3x the RAM of the size of your data.

Another big issue for doing Big Data work in R is that data transfer speeds are extremely slow relative to the time it takes to actually process the data once it has been transferred. For example, a call over the internet from San Francisco to New York City takes over 4 times longer than reading from a standard hard drive and over 200 times longer than reading from a solid state drive.1 This is an especially big problem early in a modeling or analytical project, when data might have to be pulled repeatedly.

Nevertheless, there are effective methods for working with big data in R. In this post, I’ll share three strategies. It is important to note that these strategies aren’t mutually exclusive – they can be combined as you see fit!

Strategy 1: Sample and Model

To sample and model, you downsample your data to a size that can be easily downloaded in its entirety and create a model on the sample. Downsampling to thousands – or even hundreds of thousands – of data points can make model runtimes feasible while also maintaining statistical validity.2

If maintaining class balance is necessary (or one class needs to be over- or under-sampled), it’s reasonably simple to stratify the data set during sampling, as the sketch below illustrates.
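A minimal sketch of stratified downsampling with dplyr (our illustration, not from the post; big_df and outcome_class are hypothetical placeholder names):

library(dplyr)

# Sketch: draw an equal-sized sample from each outcome class,
# assuming every class has at least 10,000 rows.
sampled <- big_df %>%
  group_by(outcome_class) %>%
  sample_n(10000) %>%   # 10,000 rows per class; adjust to taste
  ungroup()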

Illustration of Sample and Model

Advantages
  • Speed Relative to working on your entire data set, working on just a sample can drastically decrease run times and increase iteration speed.
  • Prototyping Even if you’ll eventually have to run your model on the entire data set, this can be a good way to refine hyperparameters and do feature engineering for your model.
  • Packages Since you’re working on a normal in-memory data set, you can use all your favorite R packages.
Disadvantages
  • Sampling Downsampling isn’t terribly difficult, but does need to be done with care to ensure that the sample is valid and that you’ve pulled enough points from the original data set.
  • Scaling If you’re using sample and model to prototype something that will later be run on the full data set, you’ll need to have a strategy (such as pushing compute to the data) for scaling your prototype version back to the full data set.
  • Totals Business Intelligence (BI) tasks frequently answer questions about totals, like the count of all sales in a month. One of the other strategies is usually a better fit in this case.
Strategy 2: Chunk and Pull

In this strategy, the data is chunked into separable units, and each chunk is pulled separately and operated on serially, in parallel, or after recombining. This strategy is conceptually similar to the MapReduce algorithm. Depending on the task at hand, the chunks might be time periods, geographic units, or logical units like separate businesses, departments, products, or customer segments.

Chunk and Pull Illustration

Advantages
  • Full data set The entire data set gets used.
  • Parallelization If the chunks are run separately, the problem is easy to treat as embarrassingly parallel and to use parallelization to speed up runtimes.
Disadvantages
  • Need Chunks Your data needs to have separable chunks for chunk and pull to be appropriate.
  • Pull All Data Eventually have to pull in all data, which may still be very time and memory intensive.
  • Stale Data The data may require periodic refreshes from the database to stay up-to-date since you’re saving a version on your local machine.
Strategy 3: Push Compute to Data

In this strategy, the data is compressed on the database, and only the compressed data set is moved out of the database into R. It is often possible to obtain significant speedups simply by doing summarization or filtering in the database before pulling the data into R.

Sometimes, more complex operations are also possible, including computing histogram and raster maps with dbplot, building a model with modeldb, and generating predictions from machine learning models with tidypredict.

Push Compute to Data Illustration

Advantages
  • Use the Database Takes advantage of what databases are often best at: quickly summarizing and filtering data based on a query.
  • More Info, Less Transfer By compressing before pulling data back to R, the entire data set gets used, but transfer times are far less than moving the entire data set.
Disadvantages
  • Database Operations Depending on what database you’re using, some operations might not be supported.
  • Database Speed In some contexts, the limiting factor for data analysis is the speed of the database itself, and so pushing more work onto the database is the last thing analysts want to do.
An Example

I’ve preloaded the flights data set from the nycflights13 package into a PostgreSQL database, which I’ll use for these examples.

Let’s start by connecting to the database. I’m using a config file here to connect to the database, one of RStudio’s recommended database connection methods:

library(DBI) library(dplyr) library(ggplot2) db <- DBI::dbConnect( odbc::odbc(), Driver = config$driver, Server = config$server, Port = config$port, Database = config$database, UID = config$uid, PWD = config$pwd, BoolsAsChar = "" )

The dplyr package is a great tool for interacting with databases, since I can write normal R code that is translated into SQL on the backend. I could also use the DBI package to send queries directly, or a SQL chunk in the R Markdown document.
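For instance, a direct query through DBI might look something like the sketch below (the exact SQL depends on your database):

# Sketch: send SQL directly through DBI instead of relying on dplyr's translation.
delays_by_carrier <- DBI::dbGetQuery(
  db,
  "SELECT carrier, COUNT(*) AS n_flights
     FROM flights
    GROUP BY carrier"
)
head(delays_by_carrier)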

df <- dplyr::tbl(db, "flights") tally(df) ## # A tibble: 1 x 1 ## n ## ## 1 336776

With only a few hundred thousand rows, this example isn’t close to the kind of big data that really requires a Big Data strategy, but it’s rich enough to demonstrate on.

Sample and Model

Let’s say I want to model whether flights will be delayed or not. This is a great problem to sample and model.

Let’s start with some minor cleaning of the data

# Create is_delayed column in database df <- df %>% mutate( # Create is_delayed column is_delayed = arr_delay > 0, # Get just hour (currently formatted so 6 pm = 1800) hour = sched_dep_time / 100 ) %>% # Remove small carriers that make modeling difficult filter(!is.na(is_delayed) & !carrier %in% c("OO", "HA")) df %>% count(is_delayed) ## # A tibble: 2 x 2 ## is_delayed n ## ## 1 FALSE 194078 ## 2 TRUE 132897

These classes are reasonably well balanced, but since I’m going to be using logistic regression, I’m going to load a perfectly balanced sample of 40,000 data points.

For most databases, random sampling methods don’t work super smoothly with R, so I can’t use dplyr::sample_n or dplyr::sample_frac. I’ll have to be a little more manual.

set.seed(1028) # Create a modeling dataset df_mod <- df %>% # Within each class group_by(is_delayed) %>% # Assign random rank (using random and row_number from postgres) mutate(x = random() %>% row_number()) %>% ungroup() # Take first 20K for each class for training set df_train <- df_mod %>% filter(x <= 20000) %>% collect() # Take next 5K for test set df_test <- df_mod %>% filter(x > 20000 & x <= 25000) %>% collect() # Double check I sampled right count(df_train, is_delayed) count(df_test, is_delayed) ## # A tibble: 2 x 2 ## is_delayed n ## ## 1 FALSE 20000 ## 2 TRUE 20000 ## # A tibble: 2 x 2 ## is_delayed n ## ## 1 FALSE 5000 ## 2 TRUE 5000

Now let’s build a model – let’s see if we can predict whether there will be a delay or not by the combination of the carrier, the month of the flight, and the time of day of the flight.

mod <- glm(is_delayed ~ carrier + as.character(month) + poly(sched_dep_time, 3), family = "binomial", data = df_train) # Out-of-Sample AUROC df_test$pred <- predict(mod, newdata = df_test) auc <- suppressMessages(pROC::auc(df_test$is_delayed, df_test$pred)) auc ## Area under the curve: 0.6425

As you can see, this is not a great model and any modelers reading this will have many ideas of how to improve what I’ve done. But that wasn’t the point!

I built a model on a small subset of a big data set. Including sampling time, this took my laptop less than 10 seconds to run, making it easy to iterate quickly as I want to improve the model. After I’m happy with this model, I could pull down a larger sample or even the entire data set if it’s feasible, or do something with the model from the sample.

Chunk and Pull

In this case, I want to build another model of on-time arrival, but I want to do it per-carrier. This is exactly the kind of use case that’s ideal for chunk and pull. I’m going to separately pull the data in by carrier and run the model on each carrier’s data.

I’m going to start by just getting the complete list of the carriers.

# Get all unique carriers carriers <- df %>% select(carrier) %>% distinct() %>% pull(carrier)

Now, I’ll write a function that

  • takes the name of a carrier as input
  • pulls the data for that carrier into R
  • splits the data into training and test
  • trains the model
  • outputs the out-of-sample AUROC (a common measure of model quality)
carrier_model <- function(carrier_name) { # Pull a chunk of data df_mod <- df %>% dplyr::filter(carrier == carrier_name) %>% collect() # Split into training and test split <- df_mod %>% rsample::initial_split(prop = 0.9, strata = "is_delayed") %>% suppressMessages() # Get training data df_train <- split %>% rsample::training() # Train model mod <- glm(is_delayed ~ as.character(month) + poly(sched_dep_time, 3), family = "binomial", data = df_train) # Get out-of-sample AUROC df_test <- split %>% rsample::testing() df_test$pred <- predict(mod, newdata = df_test) suppressMessages(auc <- pROC::auc(df_test$is_delayed ~ df_test$pred)) auc }

Now, I’m going to actually run the carrier model function across each of the carriers. This code runs pretty quickly, and so I don’t think the overhead of parallelization would be worth it. But if I wanted to, I would replace the lapply call below with a parallel backend.3

set.seed(98765) mods <- lapply(carriers, carrier_model) %>% suppressMessages() names(mods) <- carriers
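For reference, a parallel variant of the call above could look roughly like the sketch below, using the base parallel package; the forked backend and mc.cores = 4 assume a Unix-alike machine, and this is not part of the original post.

library(parallel)

# Sketch: run the per-carrier models on 4 forked workers (Unix-alike only).
# Note: reproducible RNG across forked workers needs extra care (see footnote 3).
mods_par <- mclapply(carriers, carrier_model, mc.cores = 4) %>%
  suppressMessages()
names(mods_par) <- carriers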

Let’s look at the results.

mods ## $UA ## Area under the curve: 0.6408 ## ## $AA ## Area under the curve: 0.6041 ## ## $B6 ## Area under the curve: 0.6475 ## ## $DL ## Area under the curve: 0.6162 ## ## $EV ## Area under the curve: 0.6419 ## ## $MQ ## Area under the curve: 0.5973 ## ## $US ## Area under the curve: 0.6096 ## ## $WN ## Area under the curve: 0.6968 ## ## $VX ## Area under the curve: 0.6969 ## ## $FL ## Area under the curve: 0.6347 ## ## $AS ## Area under the curve: 0.6906 ## ## $`9E` ## Area under the curve: 0.6071 ## ## $F9 ## Area under the curve: 0.625 ## ## $YV ## Area under the curve: 0.7029

So these models (again) are a little better than random chance. The point was that we used the chunk and pull strategy to pull the data separately by logical units and build a model on each chunk.

Push Compute to the Data

In this case, I’m doing a pretty simple BI task – plotting the proportion of flights that are late by the hour of departure and the airline.

Just by way of comparison, let’s run this first the naive way – pulling all the data to my system and then doing my data manipulation to plot.

system.time( df_plot <- df %>% collect() %>% # Change is_delayed to numeric mutate(is_delayed = ifelse(is_delayed, 1, 0)) %>% group_by(carrier, sched_dep_time) %>% # Get proportion per carrier-time summarize(delay_pct = mean(is_delayed, na.rm = TRUE)) %>% ungroup() %>% # Change string times into actual times mutate(sched_dep_time = stringr::str_pad(sched_dep_time, 4, "left", "0") %>% strptime("%H%M") %>% as.POSIXct())) -> timing1

Now that wasn’t too bad, just 2.366 seconds on my laptop.

But let’s see how much of a speedup we can get from pushing the computation to the database. The conceptual change here is significant – I’m doing as much work as possible on the Postgres server now instead of locally. But using dplyr means that the code change is minimal. The only difference in the code is that the collect call got moved down by a few lines (to below ungroup()).

system.time( df_plot <- df %>% # Change is_delayed to numeric mutate(is_delayed = ifelse(is_delayed, 1, 0)) %>% group_by(carrier, sched_dep_time) %>% # Get proportion per carrier-time summarize(delay_pct = mean(is_delayed, na.rm = TRUE)) %>% ungroup() %>% collect() %>% # Change string times into actual times mutate(sched_dep_time = stringr::str_pad(sched_dep_time, 4, "left", "0") %>% strptime("%H%M") %>% as.POSIXct())) -> timing2

It might have taken you the same time to read this code as the last chunk, but this took only 0.269 seconds to run, almost an order of magnitude faster!4 That’s pretty good for just moving one line of code.

Now that we’ve done a speed comparison, we can create the nice plot we all came for.

df_plot %>% mutate(carrier = paste0("Carrier: ", carrier)) %>% ggplot(aes(x = sched_dep_time, y = delay_pct)) + geom_line() + facet_wrap("carrier") + ylab("Proportion of Flights Delayed") + xlab("Time of Day") + scale_y_continuous(labels = scales::percent) + scale_x_datetime(date_breaks = "4 hours", date_labels = "%H")

It looks to me like flights later in the day might be a little more likely to experience delays, but that’s a question for another blog post.

  1. https://blog.codinghorror.com/the-infinite-space-between-words/

  2. This isn’t just a general heuristic. You’ll probably remember that the error in many statistical processes shrinks on the order of \(\frac{1}{\sqrt{n}}\) for sample size \(n\), so a lot of the statistical power in your model is driven by adding the first few thousand observations compared to the final millions.

  3. One of the biggest problems when parallelizing is dealing with random number generation, which you use here to make sure that your test/training splits are reproducible. It’s not an insurmountable problem, but requires some careful thought.

  4. And lest you think the real difference here is offloading computation to a more powerful database, this Postgres instance is running on a container on my laptop, so it’s got exactly the same horsepower behind it.




101 Machine Learning Algorithms for Data Science with Cheat Sheets

Tue, 07/16/2019 - 20:58

(This article was first published on R Programming - Data Science Blog | AI, ML, big data analytics , and kindly contributed to R-bloggers)

Think of this as the one-stop-shop/dictionary/directory for your machine learning algorithms. The algorithms have been sorted into 9 groups: Anomaly Detection, Association Rule Learning, Classification, Clustering, Dimensional Reduction, Ensemble, Neural Networks, Regression, Regularization. In this post, you’ll find 101 machine learning algorithms, including useful infographics to help you know when to use each one (if available).

101 Machine Learning Algorithms

Each of the accordion drop-downs is embeddable if you want to take it with you. All you have to do is click the little 'embed' button in the lower left-hand corner and copy/paste the iframe. All we ask is that you link back to this post.

By the way, if you have trouble with Medium/TDS, just throw your browser into incognito mode.

Classification Algorithms

Any of these classification algorithms can be used to build a model that predicts the outcome class for a given dataset. The datasets can come from a variety of domains. Depending upon the dimensionality of the dataset, the attribute types, sparsity, and missing values, etc., one algorithm might give better predictive accuracy than most others. Let’s briefly discuss these algorithms. (18)

Regression Analysis

Regression Analysis is a statistical method for examining the relationship between two or more variables. There are many different types of Regression analysis, of which a few algorithms can be found below. (20)

Neural Networks

A neural network is an artificial model based on the human brain. These systems learn tasks by example without being told any specific rules. (11)

Anomaly Detection

Also known as outlier detection, anomaly detection is used to find rare occurrences or suspicious events in your data. The outliers typically point to a problem or rare event. (5)

Dimensionality Reduction

With some problems, especially classification, there can be so many variables, or features, that it is difficult to visualize your data. Correlation amongst your features creates redundancies, and that’s where dimensionality reduction comes in. Dimensionality Reduction reduces the number of random variables you’re working with. (17)

Ensemble

Ensemble learning methods are meta-algorithms that combine several machine learning methods into a single predictive model to increase the overall performance. (11)

Clustering

In supervised learning, we know the labels of the data points and their distribution. However, the labels may not always be known. Clustering is the practice of assigning labels to unlabeled data using the patterns that exist in it. Clustering can either be semi parametric or probabilistic. (14)

Association Rule Analysis

Association rule analysis is a technique to uncover how items are associated with each other. (2)

Regularization

Regularization is used to prevent overfitting. Overfitting means that a machine learning algorithm has fit the data set too closely, so that it has high accuracy on that data but does not perform well on unseen data. (3)
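As a quick illustration of regularization in R (our example on simulated data, not part of the original list), the glmnet package fits lasso and ridge penalties that shrink coefficients and help prevent overfitting:

library(glmnet)

# Sketch: lasso-regularized linear regression on simulated data.
set.seed(42)
x <- matrix(rnorm(100 * 20), nrow = 100)     # 100 observations, 20 features
y <- x[, 1] - 2 * x[, 2] + rnorm(100)

cv_fit <- cv.glmnet(x, y, alpha = 1)         # alpha = 1 is lasso, alpha = 0 is ridge
coef(cv_fit, s = "lambda.min")               # most coefficients are shrunk to zero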

Scikit-Learn Algorithm Cheat Sheet

First and foremost is the Scikit-Learn cheat sheet. If you click the image, you’ll be taken to the same graphic except it will be interactive. We suggest saving this site as it makes remembering the algorithms, and when best to use them, incredibly simple and easy.

SAS: The Machine Learning Algorithm Cheat Sheet

Many of the same algorithms as above can also be found on SAS’s machine learning cheat sheet. The SAS website (click the pic) also gives great descriptions of how, when, and why to use each algorithm.

Microsoft Azure Machine Learning: Algorithm Cheat Sheet

Microsoft Azure’s cheat sheet is the simplest cheat sheet by far. Even though it is simple, Microsoft was still able to pack a ton of information into it. Microsoft also made their algorithm sheet available to download.

There you have it, 101 machine learning algorithms with cheat sheets, descriptions, and tutorials! We hope you are able to make good use of this list. If there are any algorithms that you think should be added, go ahead and leave a comment with the algorithm and a link to a tutorial. Thanks!


