R news and tutorials contributed by hundreds of R bloggers
Updated: 3 hours 38 min ago

### Image Convolution in R using Magick

Thu, 11/02/2017 - 01:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Release 1.4 of the magick package introduces
a new feature called image convolution that
was requested by Thomas L. Pedersen. In this post we explain what this is all about.

Kernel Matrix

The new image_convolve() function applies a kernel over the image. Kernel convolution means that each pixel value is recalculated using the weighted neighborhood sum defined in the kernel matrix. For example lets look at this simple kernel:

library(magick) kern <- matrix(0, ncol = 3, nrow = 3) kern[1, 2] <- 0.25 kern[2, c(1, 3)] <- 0.25 kern[3, 2] <- 0.25 kern ## [,1] [,2] [,3] ## [1,] 0.00 0.25 0.00 ## [2,] 0.25 0.00 0.25 ## [3,] 0.00 0.25 0.00

This kernel changes each pixel to the mean of its horizontal and vertical neighboring pixels, which results in a slight blurring effect in the right-hand image below:

img <- image_read('logo:') img_blurred <- image_convolve(img, kern) image_append(c(img, img_blurred))

Standard Kernels

Many operations in magick such as blurring, sharpening, and edge detection are
actually special cases of image convolution. The benefit of explicitly using
image_convolve() is more control. For example, we can blur an image and then blend
it together with the original image in one step by mixing a blurring kernel with the
unit kernel:

img %>% image_convolve('Gaussian:0x5', scaling = '60,40%')

The above requires a bit of explanation. ImageMagick defines several common
standard kernels such as the
gaussian kernel. Most of the standard kernels take one or more parameters,
e.g. the example above used a gaussian kernel with 0 radius and 5 sigma.

In addition, scaling argument defines the magnitude of the kernel, and possibly
how much of the original picture should be mixed in. Here we mix 60% of the
blurring with 40% of the original picture in order to get a diffused lightning effect.

Edge Detection

Another area where kernels are of use is in edge detection. A simple example of
a direction-aware edge detection kernel is the Sobel kernel.
As can be seen below, vertical edges are detected while horizontals are not.

img %>% image_convolve('Sobel') %>% image_negate()

Something less apparent is that the result of the edge detection is truncated.
Edge detection kernels can result in negative color values which get truncated to zero.
To combat this it is possible to add a bias to the result. Often you’ll end up with
scaling the kernel to 50% and adding 50% bias to move the midpoint of the result to 50%
grey:

img %>% image_convolve('Sobel', scaling = '50%', bias = '50%')

Sharpening

ImageMagick has many more edge detection kernels, some of which are insensitive to
the direction of the edge. To emulate a classic high-pass filter from photoshop use
difference of gaussians kernel:

img %>% image_convolve('DoG:0,0,2') %>% image_negate()

As with the blurring, the original image can be blended in with the transformed one, effectively sharpening the image along edges.

img %>% image_convolve('DoG:0,0,2', scaling = '100, 100%')

The ImageMagick documentation has more examples of convolve with various avaiable kernels.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Le Monde [last] puzzle [#1026]

Thu, 11/02/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

The last and final Le Monde puzzle is a bit of a disappointment, to wit:

A 4×4 table is filled with positive and different integers. A 3×3 table is then deduced by adding four adjacent [i.e. sharing a common corner] entries of the original table. Similarly with a 2×2 table, summing up to a unique integer. What is the minimal value of this integer? And by how much does it increase if all 29 integers in the tables are different?

For the first question, the resulting integer writes down as the sum of the corner values, plus 3 times the sum of the side values, plus 9 times the sum of the 4 inner values [of the 4×4 table]. Hence, minimising the overall sum means taking the inner values as 1,2,3,4, the side values as 5,…,12, and the corner values as 13,…,16. Resulting in a total sum of 352. As checked in this computer code in APL by Jean-Louis:

This configuration does not produce 29 distinct values, but moving one value higher in one corner does: I experimented with different upper bounds on the numbers and 17 always provided with the smallest overall sum, 365.

firz=matrix(0,3,3)#second level thirz=matrix(0,2,2)#third level for (t in 1:1e8){ flor=matrix(sample(1:17,16),4,4) for (i in 1:3) for (j in 1:3) firz[i,j]=sum(flor[i:(i+1),j:(j+1)]) for (i in 1:2) for (j in 1:2) thirz[i,j]=sum(firz[i:(i+1),j:(j+1)]) #last if (length(unique(c(flor,firz,thirz)))==29) solz=min(solz,sum(thirz))}

and a further simulated annealing attempt did not get me anywhere close to this solution.

Filed under: Books, Kids, R Tagged: competition, Kapla, Le Monde, mathematical puzzle, R, simulated annealing

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### R: the least disliked programming language

Wed, 11/01/2017 - 23:59

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

According to a recent analysis of Stack Overflow "Developer Stories", where programmer candidates list the technologies the would and would not like to work with, R is the least disliked programming language:

This is probably related to the fact that there's high demand in the job market for fast-growing technologies, which is a disincentive for listing them on the "would not work with" section of an online resume. As author David Robinson notes:

If you’ve read some of our other posts about the growing and shrinking programming languages, you might notice that the least disliked tags tend to be fast-growing ones. R, Python, Typescript, Go, and Rust are all fast-growing in terms of Stack Overflow activity (we’ve specifically explored Python and R before) and all are among the least polarizing languages.

Read the complete analysis linked below for more insights, including the "most liked" languages and rivalries between languages (where liking one and disliking the other go hand-in-hand).

Stack Overflow blog: What are the Most Disliked Programming Languages?

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### A ggplot-based Marimekko/Mosaic plot

Wed, 11/01/2017 - 18:00

(This article was first published on That’s so Random, and kindly contributed to R-bloggers)

One of my first baby steps into the open source world, was when I answered this SO question over four years ago. Recently I revisited the post and saw that Z.Lin did a very nice and more modern implementation, using dplyr and facetting in ggplot2. I decided to merge here ideas with mine to create a general function that makes MM plots. I also added two features: counts, proportions, or percentages to the cells as text and highlighting cells by a condition.

For those of you unfamiliar with this type of plot, it graphs the joint distribution of two categorical variables. x is plotted in bins, with the bin widths reflecting its marginal distribution. The fill of the bins is based on y. Each bin is filled by the co-occurence of its x and y values. When x and y are independent, all the bins are filled (approximately) in the same way. The nice feature of the MM plot, is that is shows both the joint distribution and the marginal distributions of x and y.

ggmm

To demonstrate the function, I’ll take a selection of the emergency data set from the padr package. Such that it has three types of incidents in four parts of town. We also do some relabelling for prettier plot labels.

em_sel <- padr::emergency %>% dplyr::filter( title %in% c("Traffic: VEHICLE ACCIDENT -", "Traffic: DISABLED VEHICLE -", "Fire: FIRE ALARM"), twp %in% c("LOWER MERION", "ABINGTON", "NORRISTOWN", "UPPER MERION")) %>% mutate(twp = factor(twp, levels = c("LOWER MERION", "ABINGTON", "NORRISTOWN", "UPPER MERION"), labels = c("Low Mer.", "Abing.", "Norris.", "Upper Mer.")))

The function takes a data frame and the bare (unquoted) column names of the x and y variables. It will then create a ggplot object. The variables don’t have to be factors or characters, the function coerces them to character.

ggmm(em_sel, twp, title)

Now I promised you two additional features. First, adding text to the cells. The add_text argument takes either “n”, to show the absolute counts

ggmm(em_sel, twp, title, add_text = "n")

“prop” to show the proportions of each cell with respect to the joint distribution

ggmm(em_sel, twp, title, add_text = "prop")

or “perc”, which reflects the percentages of the joint.

ggmm(em_sel, twp, title, add_text = "perc")

An argument is provided to control the rounding of the text.

Secondly, the alpha_condition argument takes an unevaluated expression in terms of the column names of x and y. The cells for which the expression yields TRUE will be highlighted (or rather the others will be downlighted). This is useful when you want to stress an aspect of the distribution, like a value of y that varies greatly over x.

ggmm(em_sel, twp, title, alpha_condition = title == "Traffic: DISABLED VEHICLE -")

I hope you find this function useful. The source code is shared below. Also it is in the package accompanying this blog. Which you can install by running devtools::install_github(EdwinTh/thatssorandom).

library(tidyverse) ggmm <- function(df, x, y, alpha_condition = 1 == 1, add_text = c(NA, "n", "prop", "perc"), round_text = 2) { stopifnot(is.data.frame(df)) add_text <- match.arg(add_text) x_q <- enquo(x) y_q <- enquo(y) a_q <- enquo(alpha_condition) plot_set <- df %>% add_alpha_ind(a_q) %>% x_cat_y_cat(x_q, y_q) %>% add_freqs_col() plot_return <- mm_plot(plot_set, x_q, y_q) plot_return <- set_alpha(df, plot_return, a_q) if (!is.na(add_text)) { plot_set$text <- make_text_vec(plot_set, add_text, round_text) plot_set$freq <- calculate_coordinates(plot_return) text_part <- geom_text(data = plot_set, aes(label = text)) } else { text_part <- NULL } plot_return + text_part } add_alpha_ind <- function(df, a_q) { df %>% mutate(alpha_ind = !!a_q) } x_cat_y_cat <- function(df, x_q, y_q) { df %>% mutate(x_cat = as.character(!!x_q), y_cat = as.character(!!y_q)) } add_freqs_col <- function(df) { stopifnot(all(c('x_cat', 'y_cat', 'alpha_ind') %in% colnames(df))) df %>% group_by(x_cat, y_cat) %>% summarise(comb_cnt = n(), alpha_ind = as.numeric(sum(alpha_ind) > 0)) %>% mutate(freq = comb_cnt /sum(comb_cnt), y_cnt = sum(comb_cnt)) %>% ungroup() } mm_plot <- function(plot_set, x_q, y_q) { plot_set %>% ggplot(aes(x_cat, freq, width = y_cnt, fill = y_cat, alpha = alpha_ind)) + geom_bar(stat = "identity", position = "fill", color = "black") + facet_grid(~x_cat, scales = "free_x", space = "free_x", switch = "x") + theme( axis.text.x = element_blank(), axis.ticks.x = element_blank(), panel.spacing = unit(0.1, "lines"), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(), strip.background = element_blank() ) + guides(alpha = FALSE) + labs(fill = quo_name(y_q)) + xlab(quo_name(x_q)) } set_alpha <- function(df, plot_return, a_q) { if (mutate(df, !!a_q) %>% pull() %>% unique() %>% length() %>% ==(1)) { plot_return + scale_alpha_continuous(range = c(1)) } else { plot_return + scale_alpha_continuous(range = c(.4, 1)) } } make_text_vec <- function(plot_set, add_text, round_text) { if (add_text == "n") return(get_counts(plot_set)) text_col <- get_props(plot_set) if (add_text == "perc") { text_col <- round(text_col * 100, round_text) return(paste0(text_col, "%")) } round(text_col, round_text) } get_counts <- function(plot_set) { plot_set %>% pull(comb_cnt) } get_props <- function(plot_set){ plot_set %>% mutate(text_col = comb_cnt / sum(plot_set$comb_cnt)) %>% pull() } calculate_coordinates <- function(plot_return) { ggplot_build(plot_return)$data[[1]] %>% split(.$PANEL) %>% map(y_in_the_middle) %>% unlist() } y_in_the_middle <- function(x) { y_pos <- c(0, x$y) rev(y_pos[-length(y_pos)] + (y_pos %>% diff()) / 2) } var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: That’s so Random. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### RTutor: Wall Street and the Housing Bubble

Wed, 11/01/2017 - 10:00

(This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers)

It is widely recognized that aggressive securitization of housing loans played a decisive role in the creating of a house price bubble in the US and the subsequent financial crisis. An open question is in how far financial experts on securitization were aware of that fragile house price bubble they helped creating.

Ing-Haw Cheng, Sahil Raina, and Wei Xiong shed some light on this question in their very interesting article Wall Street and the Housing Bubble (American Economic Review, 2014). They have collected a unique data set on private real estate transactions for a sample of 400 securitization agents, and two control groups consisting of laywers and financial analysts not working in the real estate sector. They use this data set to study whether or not in their private investments the securitization agents seem to have been aware of the bubble and where more successful in the timing of their real estate investments.

As part of his Bachelor Thesis at Ulm University, Marius Wentz has generated a RTutor problem set that allows you to replicate the main insights of the article in an interactive fashion. You learn about R, econometrics and the housing bubble at the same time.

Here is a screenshoot:

Like in previous RTutor problem sets, you can enter free R code in a web based shiny app. The code will be automatically checked and you can get hints how to proceed. In addition you are challenged by many multiple choice quizzes.

To install the problem set the problem set locally, first install RTutor as explained here:

https://github.com/skranz/RTutor

and then install the problem set package:

https://github.com/mwentz93/RTutorWallStreet

There is also an online version hosted by shinyapps.io that allows you explore the problem set without any local installation. (The online version is capped at 25 hours total usage time per month. So it may be greyed out when you click at it.)

https://mwentz93.shinyapps.io/RTutorWallStreet/

If you want to learn more about RTutor, to try out other problem sets, or to create a problem set yourself, take a look at the RTutor Github page

https://github.com/skranz/RTutor

You can also install RTutor as a docker container:
https://hub.docker.com/r/skranz/rtutor/

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Economics and R - R posts. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Promises and Closures in R

Wed, 11/01/2017 - 09:50

(This article was first published on INWT-Blog-RBloggers, and kindly contributed to R-bloggers)

At the moment I try to improve my knowledge about functional programming in R. Luckily there are some explanations on the topic in the web (adv-r and Cartesian Faith). Beginning to (re)discover the usefulness of closures, I remember some (at first sight) very strange behaviour. Actually it is consistent within the scoping rules of R, but until I felt to be on the same level of consistency it took a while.

What is a Promise?

Every argument you pass to a function is a promise until the moment R evaluates it. Consider a function g with arguments x and y. Let’s leave out one argument in the function call:

g <- function(x, y) x g(1) ## [1] 1

R will be forgiving (lazy) until the argument y is actually needed. Until then y exists in the environment of the function call as a ‘name without a value’. Only when R needs to evaluate y a value is searched for. This means that we can pass some non-existent objects as arguments to the function g and R won’t care until the argument is needed in the functions body.

g(1, nonExistentObject) ## [1] 1

Have a look at the figure ‘Environment Path 1’. Your workspace is also called the Global Environment and you can access it explicitly using the internal variable .GlobalEnv. There is one variable in my workspace, the function g(x, y). When g is called a new environment is created in which it’s body will be evaluated. This is denoted by the solid line. In this new environment of g there exist two variables, x and y. As long as those variables are not needed, no values are bound to those names only a promise that a value can be found at the time of evaluation. Since x is evaluated, the value 1 is bound to x in the environment of the function g. y, however, is not evaluated, so the promised value of y is never searched for and we can promise anything.

Environment Path 1.

The dashed line indicates the direction in which R will try to find objects. Meaning that if the function g does not find a variable in its own ‘evaluation environment’, it will continue its search in the global environment. The question where this dashed line is pointing to is really important if you try to understand closures. Just to give you a heads up: The parent environment (environment where the dashed line is pointing to) of a ‘functions evaluation environment’ is always the environment in which the function was created – and not the environment from which the function is called. In the case of g that is the global environment. In the case of a function living in a package it is the packages namespace.

What is a closure?

A closure is a function which has an enclosing environment. As far as my understanding of these things goes, by that definition every function can be considered a closure. This suspicion is supported by R’s constant complaint, that I try to subset closures. Anyway, typically the term closure is used for functions which will have a function as return value:

fClosure <- function(p) function(x) x^p f1 <- fClosure(1) f2 <- fClosure(2) cbind(f1(1:10), f2(1:10)) ## [,1] [,2] ## [1,] 1 1 ## [2,] 2 4 ## [3,] 3 9 ## [4,] 4 16 ## [5,] 5 25 ## [6,] 6 36 ## [7,] 7 49 ## [8,] 8 64 ## [9,] 9 81 ## [10,] 10 100

Here I created fClosure as a function of p which will return a function of x. Then I assign values to f1 and f2 which are the functions f(x)=x1 and f(x)=x2. The reason this works can be answered by looking at the figure ‘Environment Path 2’ with all environments and connections between them.

Environment Path 2.

The solid line indicates that f1 is called from the .GlobalEnv; the dashed line the direction in which R will search for values (an exception is the promise, x, which will reference to the .GlobalEnv). The enclosing environment of f1 is the environment in which it was created, which was the environment of the call to fClosure. So f1 has an own environment which can be seen when you print the function to the console.

f1 ## function(x) x^p ## <environment: 0x24488e8>

This environment can even be accessed, to check what is going on inside.

ls(environment(f1)) ## [1] "p" get("p", envir = environment(f1)) ## [1] 1

So in the enclosing environment of f1 lives a variable p with value 1. Whenever R searches for a variable which is not part of the argument list, it will first check the environment created when called, then the enclosing environment and then the .GlobalEnv followed by the search path.

Why are those two related?

When I read about the scoping rules in R I never really understood the implications of the word lazy. It needed a couple of hours of utter confusion and experiments with closures that I got it. Consider the case where I want to construct an arbitrary number of functions like in the above example. Copy-pasting fClosure will quickly reach limits and is more frustrating than coding.

# Creating f1-f5 and store them in a list # This will actually work using lapply in the most recent R version (3.4) # I enforce it by using a for-loop instead of lapply... # funList <- lapply(1:5, fClosure) funList <- list() for (i in 1:5) funList[[i]] <- fClosure(i) # Call f1-f5 with the argument x = 1:10 resultList <- lapply(funList, do.call, args = list(x = 1:10)) # Cbind the results do.call(cbind, resultList) ## [,1] [,2] [,3] [,4] [,5] ## [1,] 1 1 1 1 1 ## [2,] 32 32 32 32 32 ## [3,] 243 243 243 243 243 ## [4,] 1024 1024 1024 1024 1024 ## [5,] 3125 3125 3125 3125 3125 ## [6,] 7776 7776 7776 7776 7776 ## [7,] 16807 16807 16807 16807 16807 ## [8,] 32768 32768 32768 32768 32768 ## [9,] 59049 59049 59049 59049 59049 ## [10,] 100000 100000 100000 100000 100000

Ups, what happened? The resulting matrix looks like every column was created using the same function! Just to be clear, the above code works just fine. It does exactly as intended. In this case I was tricked by the promises in the enclosing environments, and that in those enclosing environments there live variables p with values 1 to 5. This is not so. Remember, the arguments of a function are evaluated when they are first needed. Until then they are promises. The concept of a promise was surprising because it’s one of the very few objects which have reference semantics in base-R. So a promise is just a pointer to a variable name in an environment (the environment from which the function is called) – they are not pointing to values! If the value of the variable pointed to changes before the promise is evaluated inside the function, the behaviour of the function will change too. This leads to the question: what is the value of p inside this list of functions?

sapply(funList, function(fun) get("p", envir = environment(fun))) ## [1] 5 5 5 5 5

Okay, fine, so in the loop where I created the functions f1 to f5, I did pass the numbers 1 to 5 to the closure, however, they do not get evaluated but point to the iterator which is 5 at the moment the promises are evaluated. How do we fix this? Evaluate p in the enclosing environment at the moment of assignment. Actually we could just write p in the functions body (not the function which is returned, it needs to be evaluated in the enclosing environment), but that may be considered bad style because in two weeks time you will see it as a redundant and useless line of code. Actually there is a function for this. force forces the evaluation of arguments in the enclosing environment. This means that the variable p will be bound to a value at the moment the closure is called.

# Fix fClosure <- function(p) { force(p) function(x) x^p } # And again, with a new definition of fClosure: for(i in 1:5) funList[[i]] <- fClosure(i) resultList <- lapply(funList, do.call, args = list(x = 1:10)) do.call(cbind, resultList) ## [,1] [,2] [,3] [,4] [,5] ## [1,] 1 1 1 1 1 ## [2,] 2 4 8 16 32 ## [3,] 3 9 27 81 243 ## [4,] 4 16 64 256 1024 ## [5,] 5 25 125 625 3125 ## [6,] 6 36 216 1296 7776 ## [7,] 7 49 343 2401 16807 ## [8,] 8 64 512 4096 32768 ## [9,] 9 81 729 6561 59049 ## [10,] 10 100 1000 10000 100000

And that made all the difference.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: INWT-Blog-RBloggers. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Thinking about different ways to analyze sub-groups in an RCT

Wed, 11/01/2017 - 01:00

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

Here’s the scenario: we have an intervention that we think will improve outcomes for a particular population. Furthermore, there are two sub-groups (let’s say defined by which of two medical conditions each person in the population has) and we are interested in knowing if the intervention effect is different for each sub-group.

And here’s the question: what is the ideal way to set up a study so that we can assess (1) the intervention effects on the group as a whole, but also (2) the sub-group specific intervention effects?

This is a pretty straightforward, text-book scenario. Sub-group analysis is common in many areas of research, including health services research where I do most of my work. It is definitely an advantage to know ahead of time if you want to do a sub-group analysis, as you would in designing a stratified randomized controlled trial. Much of the criticism of these sub-group analyses arises when they are not pre-specified and conducted in an ad hoc manner after the study data have been collected. The danger there, of course, is that the assumptions underlying the validity of a hypothesis test are violated. (It may not be easy to convince folks to avoid hypothesis testing.) In planning ahead for these analyses, researchers are less likely to be accused of snooping through data in search of findings.

So, given that you know you want to do these analyses, the primary issue is how they should be structured. In particular, how should the statistical tests be set up so that we can draw reasonable conclusions? In my mind there are a few ways to answer the question.

Three possible models

Here are three models that can help us assess the effect of an intervention on an outcome in a population with at least two sub-groups:

$\text{Model 1: } Y_i = \beta_0 + \beta_1 D_i$

$\text{Model 2: } Y_i = \beta_0^{\prime} + \beta_1^{\prime} D_i + \beta^{\prime}_2 T_i$

$\text{Model 3: } Y_i = \beta_0^{\prime\prime} + \beta_1^{\prime\prime} D_i +\beta^{\prime\prime}_2 T_i +\beta^{\prime\prime}_3 T_i D_i$

where $$Y_i$$ is the outcome for subject $$i$$, $$T_i$$ is an indicator of treatment and equals 1 if the subject received the treatment, and $$D_i$$ is an indicator of having the condition that defines the second sub-group. Model 1 assumes the medical condition can only affect the outcome. Model 2 assumes that if the intervention does have an effect, it is the same regardless of sub-group. And Model 3 allows for the possibility that intervention effects might vary between sub-groups.

1. Main effects

In the first approach, we would estimate both Model 2 and Model 3, and conduct a hypothesis test using the null hypothesis $$\text{H}_{01}$$: $$\beta_2^{\prime} = 0$$. In this case we would reject $$\text{H}_{01}$$ if the p-value for the estimated value of $$\beta_2^{\prime}$$ was less than 0.05. If in fact we do reject $$\text{H}_{01}$$ (and conclude that there is an overall main effect), we could then (and only then) proceed to a second hypothesis test of the interaction term in Model 3, testing $$\text{H}_{02}$$: $$\beta_3^{\prime\prime} = 0$$. In this second test we can also evaluate the test using a cutoff of 0.05, because we only do this test if we reject the first one.

This is not a path typically taken, for reasons we will see at the end when we explore the relative power of each test under different effect size scenarios.

2. Interaction effects

In the second approach, we would also estimate just Models 2 and 3, but would reverse the order of the tests. We would first test for interaction in Model 3: $$\text{H}_{01}$$: $$\beta_3^{\prime\prime} = 0$$. If we reject $$\text{H}_{01}$$ (and conclude that the intervention effects are different across the two sub-groups), we stop there, because we have evidence that the intervention has some sort of effect, and that it is different across the sub-groups. (Of course, we can report the point estimates.) However, if we fail to reject $$\text{H}_{01}$$, we would proceed to test the main effect from Model 2. In this case we would test $$\text{H}_{02}$$: $$\beta_2^{\prime} = 0$$.

In this approach, we are forced to adjust the size of our tests (and use, for example, 0.025 as a cutoff for both). Here is a little intuition for why. If we use a cutoff of 0.05 for the first test and in fact there is no effect, 5% of the time we will draw the wrong conclusion (by wrongly rejecting $$\text{H}_{01}$$). However, 95% of the time we will correctly fail to reject the (true) null hypothesis in step one, and thus proceed to step two. Of all the times we proceed to the second step (which will be 95% of the time), we will err 5% of the time (again assuming the null is true). So, 95% of the time we will have an additional 5% error due to the second step, for an error rate of 4.75% due to the second test (95% $$\times$$ 5%). In total – adding up the errors from steps 1 and 2 – we will draw the wrong conclusion almost 10% of the time. However, if we use a cutoff of 0.025, then we will be wrong 2.5% of the time in step 1, and about 2.4% (97.5% $$\times$$ 2.5%) of the time in the second step, for a total error rate of just under 5%.

In the first approach (looking at the main effect first), we need to make no adjustment, because we only do the second test when we’ve rejected (incorrectly) the null hypothesis. By definition, errors we make in the second step will only occur in cases where we have made an error in the first step. In the first approach where we evaluate main effects first, the errors are nested. In the second, they are not nested but additive.

3. Global test

In the third and last approach, we start by comparing Model 3 with Model 1 using a global F-test. In this case, we are asking the question of whether or not a model that includes treatment as a predictor does “better” than a model that only adjust for sub-group membership. The null hypothesis can crudely be stated as $$\text{H}_{01}$$: $$\text{Model }3 = \text{Model }1$$. If we reject this hypothesis (and conclude that the intervention does have some sort of effect, either generally or differentially for each sub-group), then we are free to evaluate Models 2 and 3 to see if the there is a varying affect or not.

Here we can use cutoffs of 0.05 in our hypothesis tests. Again, we only make errors in the second step if we’ve made a mistake in the first step. The errors are nested and not additive.

Simulating error rates

This first simulation shows that the error rates of the three approaches are all approximately 5% under the assumption of no intervention effect. That is, given that there is no effect of the intervention on either sub-group (on average), we will draw the wrong conclusion about 5% of the time. In these simulations, the outcome depends only on disease status and not the treatment. Or, in other words, the null hypothesis is in fact true:

library(simstudy) # define the data def <- defData(varname = "disease", formula = .5, dist = "binary") # outcome depends only on sub-group status, not intervention def2 <- defCondition(condition = "disease == 0", formula = 0.0, variance = 1, dist = "normal") def2 <- defCondition(def2, condition = "disease == 1", formula = 0.5, variance = 1, dist = "normal") set.seed(1987) # the year I graduated from college, in case # you are wondering ... pvals <- data.table() # store simulation results # run 2500 simulations for (i in 1: 2500) { # generate data set dx <- genData(400, def) dx <- trtAssign(dx, nTrt = 2, balanced = TRUE, strata = "disease", grpName = "trt") dx <- addCondition(def2, dx, "y") # fit 3 models lm1 <- lm(y ~ disease, data = dx) lm2 <- lm(y ~ disease + trt, data = dx) lm3 <- lm(y ~ disease + trt + trt*disease, data = dx) # extract relevant p-values cM <- coef(summary(lm2))["trt", 4] cI <- coef(summary(lm3))["disease:trt", 4] fI <- anova(lm1, lm3)$Pr(>F)[2] # store the p-values from each iteration pvals <- rbind(pvals, data.table(cM, cI, fI)) } pvals ## cM cI fI ## 1: 0.72272413 0.727465073 0.883669625 ## 2: 0.20230262 0.243850267 0.224974909 ## 3: 0.83602639 0.897635326 0.970757254 ## 4: 0.70949192 0.150259496 0.331072131 ## 5: 0.85990787 0.449130976 0.739087609 ## --- ## 2496: 0.76142389 0.000834619 0.003572901 ## 2497: 0.03942419 0.590363493 0.103971344 ## 2498: 0.16305568 0.757882365 0.360893205 ## 2499: 0.81873930 0.004805028 0.018188997 ## 2500: 0.69122281 0.644801480 0.830958227 # Approach 1 pvals[, mEffect := (cM <= 0.05)] # cases where we would reject null pvals[, iEffect := (cI <= 0.05)] # total error rate pvals[, mean(mEffect & iEffect)] + pvals[, mean(mEffect & !iEffect)] ## [1] 0.0496 # Approach 2 pvals[, iEffect := (cI <= 0.025)] pvals[, mEffect := (cM <= 0.025)] # total error rate pvals[, mean(iEffect)] + pvals[, mean((!iEffect) & mEffect)] ## [1] 0.054 # Approach 3 pvals[, fEffect := (fI <= 0.05)] pvals[, iEffect := (cI <= 0.05)] pvals[, mEffect := (cM <= 0.05)] # total error rate pvals[, mean(fEffect & iEffect)] + pvals[, mean(fEffect & !(iEffect) & mEffect)] ## [1] 0.05 If we use a cutoff of 0.05 for the second approach, we can see that the overall error rate is indeed inflated to close to 10%: # Approach 2 - with invalid cutoff pvals[, iEffect := (cI <= 0.05)] pvals[, mEffect := (cM <= 0.05)] # total error rate pvals[, mean(iEffect)] + pvals[, mean((!iEffect) & mEffect)] ## [1] 0.1028 Exploring power Now that we have established at least three valid testing schemes, we can compare them by assessing the power of the tests. For the uninitiated, power is simply the probability of concluding that there is an effect when in fact there truly is an effect. Power depends on a number of factors, such as sample size, effect size, variation, and importantly for this post, the testing scheme. The plot below shows the results of estimating power using a range of assumptions about an intervention’s effect in the two subgroups and the different approaches to testing. (The sample size and variation were fixed across all simulations.) The effect sizes ranged from -0.5 to +0.5. (I have not included the code here, because it is quite similar to what I did to assess the error rates. If anyone wants it, please let me know, and I can post it on github or send it to you.) The estimated power reflects the probability that the tests correctly rejected at least one null hypothesis. So, if there was no interaction (say both group effects were +0.5) but there was a main effect, we would be correct if we rejected the hypothesis associated with the main effect. Take a look a the plot: What can we glean from this power simulation? Well, it looks like the global test that compares the interaction model with the null model (Approach 3) is the way to go, but just barely when compared to the approach that focuses solely on the interaction model first. And, we see clearly that the first approach suffers from a fatal flaw. When the sub-group effects are offsetting, as they are when the effect is -0.5 in subgroup 1 and +0.5 in subgroup 2, we will fail to reject the null that says there is no main effect. As a result, we will never test for interaction and see that in fact the intervention does have an effect on both sub-groups (one positive and one negative). We don’t get to test for interaction, because the rule was designed to keep the error rate at 5% when in fact there is no effect, main or otherwise. Of course, things are not totally clear cut. If we are quite certain that the effects are going to be positive for both groups, the second approach is not such a disaster. In fact, if we suspect that one of the sub-group effects will be large, it may be preferable to go with this approach. (Look at the right-hand side of the bottom plot to see this.) But, it is still hard to argue (though please do if you feel so inclined), at least based on the assumptions I used in the simulation, that we should take any approach other than the global test. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: ouR data generation. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Recent R Data Packages Wed, 11/01/2017 - 01:00 (This article was first published on R Views, and kindly contributed to R-bloggers) It has never been easier to access data from R. Not only does there seem to be a constant stream of new packages that access the APIs of data providers, but it is also becoming popular for package authors to wrap up fairly large datasets into R packages. Below are 44 R packages concerned with data in one way or another that have made it to CRAN over the past two months. alphavantager v0.1.0: Implements an interface to the Alpha Vantage API to fetch historical data on stocks, physical currencies, and digital/crypto currencies. See the README to get the key. AmesHousing v0.0.2: Contains raw and processed versions of the Ames Iowa Housing Data. See De Cock (2011). billboard v0.1.0: Contains data sets regarding songs on the Billboard Hot 100 list from 1960 to 2016, including ranks for the given year, musical features, and lyrics. BIS v0.1.0: Provides an interface to data provided by the Bank for International Settlements. There is a vignette. bomrang v0.0.8: Provides functions to interface with Australian Government Bureau of Meteorology data. There is an Introduction and an Example. brazilmaps v0.1.0: Enables users to obtain Brazilian map spatial objects of varying region types, e.g., cities, states, microregions, and mesoregions. census v0.2.0: Provides functions to scrape US Census data from the American Community Survey (ACS) data and metadata. There is a brief vignette. cptec v0.1.0: Retrieves data from the CPTEC/INPE, Centro de Previsão de Tempo e Estudos Climáticos Instituto Nacional de Pesquisas Espaciais, weather and climate forecasting center in Latin America. coinmarketcapr v0.1 Provides functions to extract and monitor price and market cap of ‘Crypto currencies’ from Coin Market Cap. copulaData v0.0-1: Contains data sets used for copula modeling in addition to those in the package copula. CytobankAPIstats v1.0: Provides tools to access and process cytometry data from Cytobank. data360r v1.0.1: Provides an interface to the API for the World Bank’s TCdata360 and Govdata360 platforms. DescriptiveStats.OBeu v1.2.1: Provides functions to estimate and return the needed parameters for visualizations designed for OpenBudgets.eu datasets. There is a Getting Started Guide and two other vignettes. dobson v0.1: Contains example data sets from the book An Introduction to Generalised Linear Models by Dobson and Barnett. ensemblepp v0.1-0: Contains temperature and precipitation ensemble weather forecasts and observations taken at Innsbruck, Austria, which are contained in the forthcoming book (Elsevier) “Statistical Post processing of Ensemble Forecasts” by Vannitsem, Wilks, and Messner. fedreporter v0.2.1: Implements an API to the Federal RePORTER, allowing users to search job projects from different government agencies. fingertipsR v0.1.3: Provides an interface to the API for Fingertips, which contains data for many indicators of public health in England. There are vignettes for plotting Life Expentancy and for Interactively Selecting Indicators. GetITRData v0.6: Provides functions to read quarterly and annual financial reports – including assets, liabilities, income, and cash flow statements – from Bovespa’s ITR (informacoes trimestrais) system. There is a vignette. GetLattesData v0.6: Implements an API for downloading and reading XML data directly from Lattes. There is a vignette. IIS v1.0: Contains the datasets and functions form the book Intuitive Introductory Statistics. jaod v0.1.0: Provides a client for the Directory of Open Access Journals (DOAJ). See the API documentation and README file. LAGOSNE v1.00: Provides a client for programmatic access to the Lake Multi-scaled Geospatial and Temporal database, with functions for accessing lake water quality and ecological context data for the US. There is a vignette on the Structure of LAGOSNE and another for Working with LAGOSNE. The following map shows chlorophyll concentrations in Pennsylvania. lexiconPT v0.1.0: Provides access to Portuguese lexicons for sentiment analysis. mapsapi v0.1.0: Provides an interface to the Google Maps API. See the vignette to get started. matchbook v1.0.7: Provides a wrapper for the Matchbook API. microdemic v0.1.0: Provides programmatic access to scholarly articles in the Microsoft Academic Graph. mozzie v0.1.0: Contains weekly dengue cases in 25 districts of Sri Lanka from 2008/ week-52 to 2014/ week-21. nyctaxi v0.0.1: Provides an interface to New York City’s Taxi and Limousine Commission Trip Data. The vignette describes how to get started. opendotaR v0.1.4: Provides access to the OpenDota API. PakPC2017 v0.3.0: Provides data sets and functions for exploration of the Pakistan Population Census 2017. PakPMICS2014Ch v0.1.0: Provides a data set and function for exploring the Multiple Indicator Cluster Survey: 2014 Children Punjab, Pakistan. PakPMICS2014HH v0.1.0: Provides a data set and function for exploring the Multiple Indicator Cluster Survey: 2014 Households Punjab, Pakistan. PakPMICS2014HL v0.1.0: Provides a data set and function for exploring the Multiple Indicator Cluster Survey: 2014 Household Listings Punjab, Pakistan. PakPMICS2014Wm v0.1.0: Provides a data set and function for exploring the Multiple Indicator Cluster Survey: 2014 Women (age 15-49 years) Punjab, Pakistan. pder v1.0-0: Provides data sets for the Panel Date Econometrics with R book. realestateDK v0.1.0: Provides quarterly information on Denmark Housing Market Statistics, including average square meter prices and the number of free trades for parcel and terraced houses, condominiums, and holiday homes in Denmark since 1992. rcongresso v0.1.3: Wraps the API for the Brazilian Chamber of Deputies. There is an Introduction and a vignette for using the package with purrr. repurrrsive v0.1.0: Contains recursive lists in the form of R objects, ‘JSON’, and ‘XML’, for use in teaching with examples, including color palettes, Game of Thrones characters, GitHub repositories, entities from the Star Wars universe, and more. spatstat.data v1.1-1: Contains datasets for the spatstat package. statsDK v0.1.1: Provides a wrapper for the API to Statistics Denmark. Have a look at the API Console and the vignette to get started with the package. SPYvsSPY: Contains historical data from the legendary Mad magazine comic strip. UdderQuarterInfectionData v1.0.0: Provides the udder quarter infection data set, which contains infection times of individual cow udder quarters with Corynebacterium bovis. See Laevens et al. 1997. USCF v0.1.3: Provides a function to retrieve information from the U.S. Chess Federation website. XKCDdata v0.1.0: Provides a function to download data from individual XKCD comics, by Randall Munroe. _____='https://rviews.rstudio.com/2017/11/01/r-data-packages/'; var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### An example of how to use the new R promises package Wed, 11/01/2017 - 01:00 (This article was first published on Appsilon Data Science Blog, and kindly contributed to R-bloggers) The long awaited promises will be released soon! Being as impatient as I am when it comes to new technology, I decided to play with currently available implementation of promises that Joe Cheng shared and presented recently in London at EARL conference. From this article you’ll get to know the upcoming promises package, how to use it and how it is different from the already existing future package. Promises/Futures are a concept used in almost every major programming language. We’ve used Tasks in C#, Futures in Scala, Promises in Javascript and they all adhere to a common understanding of what a promise is. If you are not familiar with the concept of Promises, asynchronous tasks or Futures, I advise you to take a longer moment and dive into the topic. If you’d like to dive deeper and achieve a higher level of understanding, read about Continuation Monads in Haskell. We’ll be comparing the new promises package with the future package, which has been around for a while so I suggest you take a look at it https://cran.r-project.org/web/packages/future/vignettes/future-1-overview.html first if you haven’t used it before. Citing Joe Cheng, our aim is to: 1. Execute long-running code asynchronously on separate thread. 2. Be able to do something with the result (if success) or error (if failure), when the task completes, back on the main R thread. A promise object represents the eventual result of an async task. A promise is an R6 object that knows: 1. Whether the task is running, succeeded, or failed 2. The result (if succeeded) or error (if failed) Without further ado, let’s get our hands on the code! You should be able to just copy-paste code into RStudio and run it. R is single threaded. This means that user cannot interact with your shiny app if there is a long running task being executed on the server. Let’s take a look at an example: longRunningFunction <- function(value) { Sys.sleep(5) return(value) } a <- longRunningFunction(1) b <- longRunningFunction(2) print("User interaction") # Time: 10s c <- longRunningFunction(10) print(a) print(b) sumAC <- a + c sumBC <- b + c print("User interaction") # Time: 15s print(sumAC + sumBC) print("User interaction") # Time: 15s We’ll use a simplified version of user interaction while there are some additional computations happening on the server. Let’s assume that we can’t just put all the computations in a separate block of code and just run it separately using the future package. There are many cases when it is very difficult or even almost impossible to just gather all computations and run them elsewhere as one big long block of code. User cannot interact with the app for 10 seconds until the computations are finished and then the user has to wait another 5 seconds for next interaction. This is not a place where we would like to be in. User interactions should be as fast as possible and the user shouldn’t have to wait if it is not required. Let’s fix that using R future package that we know. install.packages("future") library(future) plan(multiprocess) longRunningFunction <- function(value) { Sys.sleep(5) return(value) } a <- future(longRunningFunction(1)) b <- future(longRunningFunction(2)) print("User interaction") # Time: 0s c <- future(longRunningFunction(10)) print(value(a)) print(value(b)) sumAC <- value(a) + value(c) sumBC <- value(b) + value(c) print("User interaction") # Time: 5s print(sumAC + sumBC) print("User interaction") # Time: 5s Nice, now the first user interaction can happen in parallel! But the second interaction is still blocked – we have to wait for the values, to print their sum. In order to fix that we’d like to chain the computation into the summing function instead of waiting synchronously for the result. We can’t do that using pure futures though (assuming we can’t just put all these computations in one single block of code and run it in parallel). Ideally we’d like to be able to write code similar to the one below: library(future) plan(multiprocess) longRunningFunction <- function(value) { Sys.sleep(5) return(value) } a <- future(longRunningFunction(1)) b <- future(longRunningFunction(2)) print("User interaction") # Time: 0s c <- future(longRunningFunction(10)) future(print(value(a))) future(print(value(b))) sumAC <- future(value(a) + value(c)) sumBC <- future(value(b) + value(c)) print("User interaction") # Time: 0s future(print(value(sumAC) + value(sumBC))) print("User interaction") # Time: 0s Unfortunately future package won’t allow us to do that. What we can do, is use the promises package from RStudio! devtools::install_github("rstudio/promises") Let’s play with the promises! I simplified our example to let us focus on using promises first: library(future) plan(multiprocess) library(tibble) longRunningFunction <- function(value) { Sys.sleep(5) return(value) } a <- future(longRunningFunction(tibble(number = 1:100))) print(value(a)) print("User interaction") # Time: 5s We’d like to chain the result of longRunningFunction to a print function so that once the longRunningFunction is finished, its results are printed. We can achieve that by using %…>% operator. It works like the very popular %>% operator from magrittr. Think of %...>% as “sometime in the future, once I have the result of the operation, pass the result to the next function”. The three dots symbolise the fact that we have to wait and that the result will be passed in future, it’s not happening now. library(future) plan(multiprocess) library(promises) library(tibble) longRunningFunction <- function(value) { Sys.sleep(5) return(value) } a <- future(longRunningFunction(tibble(number = 1:100))) a %...>% print() # Time: 5s print("User interaction") # Time: 0s Pure magic. But what if I want to filter the result first and then print the processed data? Just keep on chaining: library(future) plan(multiprocess) library(promises) library(tibble) library(dplyr) longRunningFunction <- function(value) { Sys.sleep(5) return(value) } a <- future(longRunningFunction(tibble(number = 1:100))) a %...>% filter(number %% 2 == 1) %...>% sum() %...>% print() print("User interaction") Neat. But, how can I print the result of filtering and pass it to the sum function? There is a tee operator, the same as the one magrittr provides (but one that operates on a promise). It will pass the result of the function to the next function. If you chain it further, it will not pass the result of print() function but previous results. Think of it as splitting a railway, printing the value on a side track and ending the run, then getting back to the main track: library(future) plan(multiprocess) library(promises) library(tibble) library(dplyr) longRunningFunction <- function(value) { Sys.sleep(5) return(value) } a <- future(longRunningFunction(tibble(number = 1:100))) a %...>% filter(number %% 2 == 1) %...T>% print() %...>% sum() %...>% print() print("User interaction") What about errors? They are being thrown somewhere else than in the main thread, how can I catch them? You guessed it – there is an operator for that as well. Use %...!% to handle errors: library(future) plan(multiprocess) library(promises) library(tibble) library(dplyr) longRunningFunction <- function(value) { stop("ERROR") return(value) } a <- future(longRunningFunction(tibble(number = 1:100))) a %...>% filter(number %% 2 == 1) %...T>% print() %...>% sum() %...>% print() %...!% (function(error) { print(paste("Unexpected error: ", error$message)) }) print("User interaction")

But in our example we’re not just chaining one computation. There is a longRunningFunction call that eventually returns 1 and another one that eventually returns 2. We need to somehow join the two. Once both of them are ready, we’d like to use them and return the sum. We can use promise_all function to achieve that. It takes a list of promises as an argument and returns a promise that eventually resolves to a list of results of each of the promises.

Perfect. We know the tools that we can use to chain asynchronous functions. Let’s use them in our example then:

library(future) plan(multiprocess) library(promises) library(purrr) longRunningFunction <- function(value) { Sys.sleep(5) return(value) } a <- future(longRunningFunction(1)) b <- future(longRunningFunction(2)) print("User interaction") # Time: 0s c <- future(longRunningFunction(10)) a %...>% print() b %...>% print() sumAC <- promise_all(a, c) %...>% reduce(+) sumBC <- promise_all(b, c) %...>% reduce(+) print("User interaction") # Time: 0s promise_all(sumAC, sumBC) %...>% reduce(+) %...>% print() print("User interaction") # Time: 0s

A task for you – in line sumAC <- promise_all(a, c) %...>% reduce(+), print the list of values from promises a and c before they are summed up.

Handful of useful information:

[1] There is support for promises implemented in shiny but neither CRAN nor GitHub master branch versions of Shiny support promises. Until support is merged, you’ll have to install from async branch:

devtools::install_github("rstudio/shiny@async")

[2] Beta-quality code at https://github.com/rstudio/promises

[3] Early drafts of docs temporarily hosted at: https://medium.com/@joe.cheng

[4] Joe Cheng talk on EARL 2017 in London – https://www.dropbox.com/s/2gf6tfk1t345lyf/async-talk-slides.pdf?dl=0

[5] The plan is to release everything on CRAN by end of this year.

I hope you have as much fun playing with the promises as I did! I’m planning to play with shiny support for promises next.

Till next time!

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Appsilon Data Science Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### 2017 Beijing Workshop on Forecasting

Wed, 11/01/2017 - 01:00

(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

Later this month I’m speaking at the 2017 Beijing Workshop on Forecasting, to be held on Saturday 18 November at the Central University of Finance and Economics.
I’m giving four talks as part of the workshop. Other speakers are Junni Zhang, Lei Song, Hui Bu, Feng Li and Yanfei Kang.
Full program details are available online.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R on Rob J Hyndman. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Survey of Kagglers finds Python, R to be preferred tools

Tue, 10/31/2017 - 20:14

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Competitive predictive modeling site Kaggle conducted a survey of participants in prediction competitions, and the 16,000 responses provide some insights about that user community. (Whether those trends generalize to the wider community of all data scientists is unclear, however.) One question of interest asked what tools Kagglers use at work. Python is the most commonly-used tool within this community, and R is second. (Respondents could select more than one tool.)

Interestingly, the rankings varied according to the job title of the respondent. R and Python received top-ranking for every job-title subgroup except one (database administrators, who preferred SQL), according to the following division:

• R: Business Analyst, Data Analyst, Data Miner, Operations Researcher, Predictive Modeler, Statistician
• Python: Computer Scientist, Data Scientist, Engineer, Machine Learning Engineer, Other, Programmer, Researcher, Scientist, Software Developer

You can find summaries of the other questions in the survey at the link below. An anonymized dataset of survey responses is also available, as is the "Kaggle Kernel" (a kind of notebook) of the R code behind the survey analysis.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Free Software Foundation “Social Benefit” Award Nominations

Tue, 10/31/2017 - 16:00

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

Ezra Haber Glenn, the author of the acs package in R, recently posted about the Free Software Foundation’s “Social Benefit” Award on the acs mailing list:

acs.R Community:

The Free Software Foundation is now accepting nominations for the 2017
“Project of Social Benefit Award,” presented to the project or team
responsible for applying free software, or the ideas of the free
software movement, in a project that intentionally and significantly
benefits society in other aspects of life.

If anyone is willing to nominate the acs package, the recognition
would be much appreciated — the package has been generously supported
by MIT and the Puget Sound Regional Council, as well as a great deal
of user-feedback and creative development on the part of the
ACS/Census/R community.

The nomination form is quick and easy — see
https://my.fsf.org/projects-of-social-benefit-award-nomination.

Thanks!

I’m reposting this here for a few reasons.

The first is that I only learned about this award from Ezra’s post, and I think that it’s worth raising awareness of the award itself.

The second is that, in my opinion, the acs package does “intentionally and significantly benefit society.” I have used the acs package over several years to learn more about US demographics. Choroplethr, my R package for creating statistical maps, also uses the acs package to retrieve data from the Census Bureau. Several thousand people have taken my free course on Choroplethr, and each of those people has benefitted from the acs package as well.

Finally, I’m mentioning this award to point out that R package developers receive compensation in different ways. None of us receive monetary compensation when people use our packages. However, Ezra has indicated that getting nominated for this award would be useful to him.

For all these reasons, I was happy to nominate the acs package for the Free software Foundation’s “Social Benefit” Award. It took me less than 5 minutes to fill out the form. If you are a user of choroplethr, and you enjoy its integration with US Census Data, then I encourage you to nominate the acs package as well. You can do so here.

The post Free Software Foundation “Social Benefit” Award Nominations appeared first on AriLamstein.com.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### linl 0.0.2: Couple improvements

Tue, 10/31/2017 - 13:34

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Following up on the initial 0.0.1 release of linl, Aaron and I are happy to announce release 0.0.2 which reached the CRAN network on Sunday in a smooth ‘CRAN-pretest-publish’ auto-admittance. linl provides a simple-yet-powerful Markdown—and RMarkdown—wrapper around the venerable LaTeX letter class; see below for an expanded example also included as the package vignette.

This versions sets a few sensible default values for font, font size, margins, signature (non-)indentation and more; it also expands the documentation.

The NEWS entry follows:

Changes in tint version 0.0.2 (2017-10-29)
• Set a few defaults for a decent-looking skeleton and template: font, fontsize, margins, left-justify closing (#3)

• Blockquote display is now a default as well (#4).

• Updated skeleton.Rmd and vignette source accordingly

• Documented new default options (#5 and #6).

• Links are now by default printed as footnotes (#9).

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the tint page. For questions or comments use the issue tracker off the GitHub repo.

For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### pinp 0.0.3: More docs, more features

Tue, 10/31/2017 - 02:31

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Our pinp package for snazzier one or two column vignette received it second update. Now at version 0.0.3, it arrived on CRAN on Saturday with minimal fuzz as an ‘CRAN-pretest-publish’ transition.

We added more frontmatter options, documented more, and streamlined some internals of the LaTeX class borrowed from PNAS. A screenshot of the (updated) vignette can be seen below. Additional screenshots of are at the pinp page.

The NEWS entry for this release follows.

Changes in tint version 0.0.3 (2017-10-28)
• Section ‘Acknowledgements’ now conditional on a frontmatter setting, section ‘Matmethods’ has been removed, pnasbreak no longer used which stabilizes LaTeX float formatting. References are now shown in the column just like other content (Dirk in #36).

• Vignette now uses new numbered sections frontmatter switch which improves the pdf outline.

• New front-matter options for title/section header colors, and link colors (Dirk in #39).

• YAML frontmater options are now documented in the help page for pinp as well (Dirk in #41).

• Some typos were fixed (Michael in #42 and #43).

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the tint page. For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### gg_tweet’ing Power Outages

Tue, 10/31/2017 - 02:16

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

As many folks know, I live in semi-rural Maine and we were hit pretty hard with a wind+rain storm Sunday to Monday. The hrbrmstr compound had no power (besides a generator) and no stable/high-bandwidth internet (Verizon LTE was heavily congested) since 0500 Monday and still does not as I write this post.

I’ve played with scraping power outage data from Central Maine Power but there’s a great Twitter account — PowerOutage_us — that has done much of the legwork for the entire country. They don’t cover everything and do not provide easily accessible historical data (likely b/c evil folks wld steal it w/o payment or credit) but they do have a site you can poke at and do provide updates via Twitter. As you’ve seen in a previous post, we can use the rtweet package to easily read Twitter data. And, the power outage tweets are regular enough to identify and parse. But raw data is so…raw.

While one could graph data just for one’s self, I decided to marry this power scraping capability with a recent idea I’ve been toying with adding to hrbrthemes or ggalt: gg_tweet(). Imagine being able to take a ggplot2 object and “plot” it to Twitter, fully conforming to Twitter’s stream or card image sizes. By conforming to these size constraints, they don’t get cropped in the timeline view (if you allow images to be previewed in-timeline). This is even more powerful if you have some helper functions for proper theme-ing (font sizes especially need to be tweaked). Enter gg_tweet().

Power Scraping

We’ll cover scraping @PowerOutage_us first, but we’ll start with all the packages we’ll need and a helper function to convert power outage estimates to numeric values:

library(httr) library(magick) library(rtweet) library(stringi) library(hrbrthemes) library(tidyverse) words_to_num <- function(x) { map_dbl(x, ~{ val <- stri_match_first_regex(.x, "^([[:print:]]+) #PowerOutages")[,2] mul <- case_when( stri_detect_regex(val, "[Kk]") ~ 1000, stri_detect_regex(val, "[Mm]") ~ 1000000, TRUE ~ 1 ) val <- stri_replace_all_regex(val, "[^[:digit:]\\.]", "") as.numeric(val) * mul }) }

Now, I can’t cover setting up rtweet OAuth here. The vignette and package web site do that well.

The bot tweets infrequently enough that this is really all we need (though, bump up n as you need to):

outage <- get_timeline("PowerOutage_us", n=300)

Yep, that gets the last 300 tweets from said account. It’s amazingly simple.

Now, the outage tweets for the east coast / northeast are not individually uniform but collectively they are (there’s a pattern that may change but you can tweak this if they do):

filter(outage, stri_detect_regex(text, "\\#(EastCoast|NorthEast)")) %>% mutate(created_at = lubridate::with_tz(created_at, 'America/New_York')) %>% mutate(number_out = words_to_num(text)) %>% ggplot(aes(created_at, number_out)) + geom_segment(aes(xend=created_at, yend=0), size=5) + scale_x_datetime(date_labels = "%Y-%m-%d\n%H:%M", date_breaks="2 hours") + scale_y_comma(limits=c(0,2000000)) + labs( x=NULL, y="# Customers Without Power", title="Northeast Power Outages", subtitle="Yay! Twitter as a non-blather data source", caption="Data via: @PowerOutage_us " ) -> gg

That pipe chain looks for key hashtags (for my area), rejiggers the time zone, and calls the helper function to, say, convert 1.2+ Million to 1200000. Finally it builds a mostly complete ggplot2 object (you should make the max Y limit more dynamic).

You can plot that on your own (print gg). We’re here to tweet, so let’s go into the next section.

Magick Tweeting

@opencpu made it possible shunt plot output to a magick device. This means we have really precise control over ggplot2 output size as well as the ability to add other graphical components to a ggplot2 plot using magick idioms. One thing we need to take into account is “retina” plots. They are — essentially — double resolution plots (72 => 144 pixels per inch). For the best looking plots we need to go retina, but that also means kicking up base plot theme font sizes a bit. Let’s build on hrbrthemes::theme_ipsum_rc() a bit and make a theme_tweet_rc():

theme_tweet_rc <- function(grid = "XY", style = c("stream", "card"), retina=TRUE) { style <- match.arg(tolower(style), c("stream", "card")) switch( style, stream = c(24, 18, 16, 14, 12), card = c(22, 16, 14, 12, 10) ) -> font_sizes theme_ipsum_rc( grid = grid, plot_title_size = font_sizes[1], subtitle_size = font_sizes[2], axis_title_size = font_sizes[3], axis_text_size = font_sizes[4], caption_size = font_sizes[5] ) }

Now, we just need a way to take a ggplot2 object and shunt it off to twitter. The following gg_tweet() function does not (now) use rtweet as I’ll likely add it to either ggalt or hrbrthemes and want to keep dependencies to a minimum. I may opt-in to bypass the current method since it relies on environment variables vs an RDS file for app credential storage. Regardless, one thing I wanted to do here was provide a way to preview the image before tweeting.

So you pass in a ggplot2 object (likely adding the tweet theme to it) and a Twitter status text (there’s a TODO to check the length for 140c compliance) plus choose a style (stream or card, defaulting to stream) and decide on whether you’re cool with the “retina” default.

Unless you tell it to send the tweet it won’t, giving you a chance to preview the image before sending, just in case you want to tweak it a bit before committing it to the Twitterverse. It als returns the magick object it creates in the event you want to do something more with it:

We can combine the power outage scraper & plotter with the tweeting code and just do:

gg_tweet( gg + theme_tweet_rc(grid="Y"), status = "Progress! #rtweet #gg_tweet", send=TRUE )

That was, in-fact, the last power outage tweet I sent.

Next Steps

Ironically, given current levels of U.S. news and public “discourse” on Twitter and some inane machinations in my own area of domain expertise (cyber), gg_tweet() is likely one of the few ways I’ll be interacting with Twitter for a while. You can ping me on Keybase — hrbrmstr — or join the rstats Keybase team via keybase team request-access rstats if you need to poke me for anything for a while.

FIN

Kick the tyres and watch for gg_tweet() ending up in ggalt or hrbrthemes. Don’t hesitate to suggest (or code up) feature requests. This is still an idea in-progress and definitely not ready for prime time without a bit more churning. (Also, words_to_num() can be optimized, it was hastily crafted).

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Building Communities Together at ozunconf, 2017

Tue, 10/31/2017 - 01:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Just last week we organised the 2nd rOpenSci ozunconference, the sibling rOpenSci unconference, held in Australia. Last year it was held in Brisbane, this time around, the ozunconf was hosted in Melbourne, from October 27-27, 2017.
At the ozunconf, we brought together 45 R-software users and developers, scientists, and open data enthusiasts from academia, industry, government, and non-profits. Participants travelled from far and wide, with people coming from 6 cities around Australia, 2 cities in New Zealand, and one city in the USA.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### computational methods for numerical analysis with R [book review]

Tue, 10/31/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

This is a book by James P. Howard, II, I received from CRC Press for review in CHANCE. (As usual, the customary warning applies: most of this blog post will appear later in my book review column in CHANCE.) It consists in a traditional introduction to numerical analysis with backup from R codes and packages. The early chapters are setting the scenery, from basics on R to notions of numerical errors, before moving to linear algebra, interpolation, optimisation, integration, differentiation, and ODEs. The book comes with a package cmna that reproduces algorithms and testing. While I do not find much originality in the book, given its adherence to simple resolutions of the above topics, I could nonetheless use it for an elementary course in our first year classes. With maybe the exception of the linear algebra chapter that I did not find very helpful.

“…you can have a solution fast, cheap, or correct, provided you only pick two.” (p.27)

The (minor) issue I have with the book and that a potential mathematically keen student could face as well is that there is little in the way of justifying a particular approach to a given numerical problem (as opposed to others) and in characterising the limitations and failures of the presented methods (although this happens from time to time as e.g. for gradient descent, p.191). [Seeping in my Gallic “mal-être”, I am prone to over-criticise methods during classing, to the (increased) despair of my students!, but I also feel that avoiding over-rosy presentations is a good way to avoid later disappointments or even disasters.] In the case of this book, finding [more] ways of detecting would-be disasters would have been nice.

An uninteresting and highly idiosyncratic side comment is that the author preferred the French style for long division to the American one, reminding me of my first exposure to the latter, a few months ago! Another comment from a statistician is that mentioning time series inter- or extra-polation without a statistical model sounds close to anathema! And makes extrapolation a weapon without a cause.

“…we know, a priori, exactly how long the [simulated annealing] process will take since it is a function of the temperature and the cooling rate.” (p.199)

Unsurprisingly, the section on Monte Carlo integration is disappointing for a statistician/probabilistic numericist like me,  as it fails to give a complete enough picture of the methodology. All simulations seem to proceed there from a large enough hypercube. And recommending the “fantastic” (p.171) R function integrate as a default is scary, given the ability of the selected integration bounds to misled its users. Similarly, I feel that the simulated annealing section is not providing enough of a cautionary tale about the highly sensitive impact of cooling rates and absolute temperatures. It is only through the raw output of the algorithm applied to the travelling salesman problem that the novice reader can perceive the impact of some of these factors. (The acceptance bound on the jump (6.9) is incidentally wrongly called a probability on p.199, since it can take values larger than one.)

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Recent updates to the Team Data Science Process

Mon, 10/30/2017 - 20:25

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

It's been over a year since we first introduced introduced the Team Data Science Process (TDSP). The data, technology and practices behind Data Science continue to evolve, and the TDSP has evolved in parallel. Over the past year, several new facets have been added, including:

For an example of applying the TDSP to effective data science projects, check out Buck Woody's 10-part series walking through every stage of a typical data science project

As the practice of data science changes, the TDSP continues to evolve. The TDSP is an open project hosted on Github, and your contributions are welcome.

Cortana Intelligence and Machine Learning Blog: The Microsoft Team Data Science Process (TDSP) – Recent Updates

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### R live class | Data Visualization and Dashboard with R | Nov 7-8 Milan

Mon, 10/30/2017 - 17:31

(This article was first published on R blog | Quantide - R training & consulting, and kindly contributed to R-bloggers)

Data Visualization and Dashboard with R is our fourth course of the autumn term. It takes place in November 7-8 in a location close to Milano Lima.
This course will teach you how to build beautiful, effective and flexible plots using the most modern R tools for data visualization. Then you will discover how to embed visualizations and tables in a powerful Shinyapp, to make your data easily navigable and let their insights emerge.
You should take this course if you have some knowledge of R and would like to deepen your knowledge in data visualization with R, both static data visualization and dashboards.

Data Visualization and Dashboard with R: Outlines

– ggplot2 grammar
– Creating plots with ggplot (Scatter Plot, Line Plot, Bar Plot, Histogram, Box Plot, Surface Plot)
– Customizing Plots (aesthetics, legend, axes, faceting and themes)
– Specialised visualisation tools: ggmap and ggally
– Basic shiny concepts
– The structure of a shiny app
– Shiny: the server side and the user side
– Understanding reactivity in shiny
– An overview of html widgets

Data Visualization and Dashboard with R is organized by the R training and consulting company Quantide and is taught in Italian, while all the course materials are in English.

This course is for max 6 attendees.

Location

The course location is 550 mt. (7 minutes on walk) from Milano central station and just 77 mt. (1 minute on walk) from Lima subway station.

Registration

If you want to reserve a seat go to: FAQ, detailed program and tickets.

Other R courses | Autumn term

You can find an overview of all our courses here. Next dates will be:

• November 21-22R with Database and Big Data. From databases to distributed infrastructure, master the R techniques to handle and query Big Data. Reserve now!
• November 29-30Professional R Programming. Organise, document and test your code: write efficient functions, improve the code reproducibility and build R packages. Reserve now!

In case you are a group of people interested in more than one class, write us at training[at]quantide[dot]com! We can arrange together a tailor-made course, picking all the topics that are interesting for your organization and dropping the rest.

The post R live class | Data Visualization and Dashboard with R | Nov 7-8 Milan appeared first on Quantide – R training & consulting.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R blog | Quantide - R training & consulting. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### heatmaply: an R package for creating interactive cluster heatmaps for online publishing

Mon, 10/30/2017 - 15:07

(This article was first published on R – R-statistics blog, and kindly contributed to R-bloggers)

This post on the heatmaply package is based on my recent paper from the journal bioinformatics (a link to a stable DOI). The paper was published just last week, and since it is released as CC-BY, I am permitted (and delighted) to republish it here in full. My co-authors for this paper are Jonathan Sidi, Alan O’Callaghan, and Carson Sievert.

Summary: heatmaply is an R package for easily creating interactive cluster heatmaps that can be shared online as a stand-alone HTML file. Interactivity includes a tooltip display of values when hovering over cells, as well as the ability to zoom in to specific sections of the figure from the data matrix, the side dendrograms, or annotated labels.  Thanks to the synergistic relationship between heatmaply and other R packages, the user is empowered by a refined control over the statistical and visual aspects of the heatmap layout.

Availability: The heatmaply package is available under the GPL-2 Open Source license. It comes with a detailed vignette, and is freely available from: http://cran.r-project.org/package=heatmaply

Contact: Tal.Galili@math.tau.ac.il

Introduction

A cluster heatmap is a popular graphical method for visualizing high dimensional data. In it, a table of numbers is scaled and encoded as a tiled matrix of colored cells. The rows and columns of the matrix are ordered to highlight patterns and are often accompanied by dendrograms and extra columns of categorical annotation. The ongoing development of this iconic visualization, spanning over more than a century, has provided the foundation for one of the most widely used of all bioinformatics displays (Wilkinson and Friendly, 2009). When using the R language for statistical computing (R Core Team, 2016), there are many available packages for producing static heatmaps, such as: stats, gplots, heatmap3, fheatmap, pheatmap, and others. Recently released packages also allow for more complex layouts; these include gapmap, superheat, and ComplexHeatmap (Gu et al., 2016). The next evolutionary step has been to create interactive cluster heatmaps, and several solutions are already available. However, these solutions, such as the idendro R package (Sieger et al., 2017), are often focused on providing an interactive output that can be explored only on the researcher’s personal computer. Some solutions do exist for creating shareable interactive heatmaps. However, these are either dependent on a specific online provider, such as XCMS Online, or require JavaScript knowledge to operate, such as InCHlib. In practice, when publishing in academic journals, the reader is left with a static figure only (often in a png or pdf format).

To fill this gap, we have developed the heatmaply R package for easily creating a shareable HTML file that contains an interactive cluster heatmap. The interactivity is based on a client-side JavaScript code that is generated based on the user’s data, after running the following command:

install.packages("heatmaply") library(heatmaply) heatmaply(data, file = "my_heatmap.html")

The HTML file contains a publication-ready, interactive figure that allows the user to zoom in as well as see values when hovering over the cells. This self-contained HTML file can be made available to interested readers by uploading it to the researcher’s homepage or as a supplementary material in the journal’s server. Concurrently, this interactive figure can be displayed in RStudio’s viewer pane, included in a Shiny application, or embedded in a knitr/RMarkdown HTML documents.

The rest of this paper offers guidelines for creating effective cluster heatmap visualization. Figure 1 demonstrates the suggestions from this section on data from project Tycho (van Panhuis et al., 2013), while the online supplementary information includes the interactive version, as well as several examples of using the package on real-world biological data.

Fig. 1. The (square root) number of people infected by Measles in 50 states, from 1928 to 2003. Vaccines were introduced in 1963

click the image for the online interactive version of the plot

An interactive version of the measles heatmap (embedded in the post using iframe)

I uploaded the measles_heatmaply.html to github and then used the following code to embed it in the post:

Here is the result:

heatmaply – a simple example

The generation of cluster heatmaps is a subtle process (Gehlenborg and Wong, 2012; Weinstein, 2008), requiring the user to make many decisions along the way. The major decisions to be made deal with the data matrix and the dendrogram. The raw data often need to be transformed in order to have a meaningful and comparable scale, while an appropriate color palette should be picked. The clustering of the data requires us to decide on a distance measure between the observation, a linkage function, as well as a rotation and coloring of branches that manage to highlight interpretable clusters. Each such decision can have consequences on the patterns and interpretations that emerge. In this section, we go through some of the arguments in the function heatmaply, aiming to make it easy for the user to tune these important statistical and visual parameters. Our toy example visualizes the effect of vaccines on measles infection. The output is given in the static Fig. 1, while an interactive version is available online in the supplementary file “measles.html”. Both were created using:

heatmaply(x = sqrt(measles), color = viridis, # the default Colv = NULL, hclust_method = "average", k_row = NA, # ... file = c("measles.html", "measles.png") )

The first argument of the function (x) accepts a matrix of the data. In the measles data, each row corresponds with a state, each column with a year (from 1928 to 2003), and each cell with the number of people infected with measles per 100,000 people. In this example, the data were scaled twice – first by not giving the raw number of cases with measles, but scaling them relatively to 100,000 people, thus making it possible to more easily compare between states. And second by taking the square root of the values. This was done since all the values in the data represent the same unit of measure, but come from a right-tailed distribution of count data with some extreme observations. Taking the square root helps with bringing extreme observations closer to one another, helping to avoid an extreme observation from masking the general pattern. Other transformations that may be considered come from Box-Cox or Yeo-Johnson family of power transformations. If each column of the data were to represent a different unit of measure, then leaving the values unchanged will often result in the entire figure being un-usable due to the column with the largest range of values taking over most of the colors in the figure. Possible per-column transformations include the scale function, suitable for data that are relatively normal. normalize, and percentize functions bring data to the comparable 0 to 1 scale for each column. The normalize function preserves the shape of each column’s distribution by subtracting the minimum and dividing by the maximum of all observations for each column. The percentize function is similar to ranking but with the simpler interpretation of each value being replaced by the percent of observations that have that value or below. It uses the empirical cumulative distribution function of each variable on its own values. The sparseness of the dataset can be explored using is.na10.

Once the data are adequately scaled, it is important to choose a good color palette for the data. Other than being pretty, an ideal color palette should have three (somewhat conflicting) properties: (1) Colorful, spanning as wide a palette as possible so as to make differences easy to see; (2) Perceptually uniform, so that values close to each other have similar-appearing colors compared with values that are far away, consistently across the range of values; and (3) Robust to colorblindness, so that the above properties hold true for people with common forms of colorblindness, as well as printing well in grey scale. The default passed to the color argument in heatmaply is viridis, which offers a sequential color palette, offering a good balance of these properties. Divergent color scale should be preferred when visualizing a correlation matrix, as it is important to make the low and high ends of the range visually distinct. A helpful divergent palette available in the package is cool_warm (other alternatives in the package include RdBu, BrBG, or RdYlBu, based on the RColorBrewer package). It is also advisable to set the limits argument to range from -1 to 1.

Passing NULL to the Colv argument, in our example, removed the column dendrogram (since we wish to keep the order of the columns, relating to the years). The row dendrogram is automatically calculated using hclust with a Euclidean distance measure and the average linkage function. The user can choose to use an alternative clustering function (hclustfun), distance measure (dist_method), or linkage function (hclust_method), or to have a dendrogram only in the rows/columns or none at all (through the dendrogram argument). Also, the users can supply their own dendrogram objects into the Rowv (or Colv) arguments. The preparation of the dendrograms can be made easier using the dendextend R package (Galili, 2015) for comparing and adjusting dendrograms. These choices are all left for the user to decide. Setting the k_col/k_row argument to NA makes the function search for the number of clusters (between from 2 to 10) by which to color the branches of the dendrogram. The number picked is the one that yields the highest average silhouette coefficient (based on the find_k function from dendextend). Lastly, the heatmaply function uses the seriation package to find an “optimal” ordering of rows and columns (Hahsler et al., 2008). This is controlled using the seriation argument where the default is “OLO” (optimal-leaf-order) – which rotates the branches so that the sum of distances between each adjacent leaf (label) will be minimized (i.e.: optimize the Hamiltonian path length that is restricted by the dendrogram structure). The other arguments in the example were omitted since they are self-explanatory – the exact code is available in the supplementary material.

In order to make some of the above easier, we created the shinyHeatmaply package (available on CRAN) which offers a GUI to help guide the researcher with the heatmap construction, with the functionality to export the heatmap as an html file and summaries parameter specifications to reproduce the heatmap with heatmaply. For more detailed step-by-step demonstration of using heatmaply on biological datasets, you should explore the heatmaplyExamples package (at github.com/talgalili/heatmaplyExamples).

The following biological examples are available and fully reproducible from within the package. You may also view them online in the following links (the html files also include the R code for producing the figures):

• Introduction to heatmaply
• General biological examples
• Reproducing heatmaps from papers published in Nature
• Using heatmaply with gene expression data
• General examples
Acknowledgements

The heatmaply package was made possible by leveraging many wonderful R packages, including ggplot2 (Wickham, 2009), plotly (Sievert et al., 2016), dendextend (Galili, 2015) and many others. We would also like to thank Yoav Benjamini, Madeline Bauer, and Marilyn Friedes for their helpful comments, as well as Joe Cheng for initiating the collaboration with Tal Galili on d3heatmap, which helped lay the foundation for heatmaply.

Funding: This work was supported in part by the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 604102 (Human Brain Project).

Conflict of Interest: none declared.

References
• Galili,T. (2015) dendextend: an R package for visualizing, adjusting and comparing trees of hierarchical clustering. Bioinformatics, 31, 3718–3720.
• Gehlenborg,N. and Wong,B. (2012) Points of view: Heat maps. Nat. Methods, 9, 213–213.
• Gu,Z. et al. (2016) Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics, 32, 2847–2849.
• Hahsler,M. et al. (2008) Getting Things in Order : An Introduction to the R Package seriation. J. Stat. Softw., 25, 1–27.
• van Panhuis,W.G. et al. (2013) Contagious Diseases in the United States from 1888 to the Present. N. Engl. J. Med., 369, 2152–2158.
• R Core Team,(R Foundation for Statistical Computing) (2016) R: A Language and Environment for Statistical Computing.
• Sieger,T. et al. (2017) Interactive Dendrograms: The R Packages idendro and idendr0. J. Stat. Softw., 76.
• Sievert,C. et al. (2016) plotly: Create Interactive Web Graphics via ‘plotly.js’.
• Weinstein,J.N. (2008) BIOCHEMISTRY: A Postgenomic Visual Icon. Science (80-. )., 319, 1772–1773.
• Wickham,H. (2009) ggplot2 Elegant Graphics for Data Analysis.
• Wilkinson,L. and Friendly,M. (2009) The History of the Cluster Heat Map. Am. Stat., 63, 179–184.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – R-statistics blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...