R-bloggers
R news and tutorials contributed by hundreds of R bloggers

Improved Python-style Logging in R

Thu, 03/16/2017 - 00:53
This entry is part 21 of 21 in the series Using R

Last August, in Python-style Logging in R, we described using an R script as a wrapper around the futile.logger package to generate log files for an operational R data processing script. Today, we highlight an improved, documented version that can be sourced by your R scripts or dropped into your package’s R/ directory to provide easy file and console logging.

The improved pylogging.R script supports the following use cases:

  1. set up log files for different log levels
  2. set the console log level
  3. choose among six log levels: TRACE, DEBUG, INFO, WARN, ERROR, FATAL

All of these capabilities depend upon the excellent futile.logger package (CRAN or GitHub). This script simply wraps that package to provide Python-style file logging. Please see futile.logger’s documentation for details on output formatting, etc.
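For readers unfamiliar with futile.logger itself, here is a minimal sketch of the kind of file-plus-console logging it provides (the function names are from futile.logger's documented API; the file name and messages are made up for illustration):

```r
library(futile.logger)

# Log DEBUG and above; send output both to the console and to a file
flog.threshold(DEBUG)
flog.appender(appender.tee("demo.log"))

flog.info("Starting processing ...")
flog.warn("Input has missing values")
```

pylogging.R layers the per-level log files and setup checks on top of these primitives.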

The pylogging.R script is fully documented with roxygen2 comments and can be incorporated into packages as long as their DESCRIPTION file adds a dependency on futile.logger. For those developing operational processing pipelines in R, Python-style logging can be very useful.

To demonstrate log files and console output you can download pylogging.R and the following sleepy.R script:

# sleepy.R"Getting sleepy ...")
Sys.sleep(1)
logger.warn("Getting really tired ...")
Sys.sleep(2)
logger.error("Having trouble staying awake ...")
Sys.sleep(3)
logger.fatal("Not gonna marzlmurrrzzzzz ...")
stop("Snap out of it!")

The following R session demonstrates the general functionality:

> list.files()
[1] "pylogger.R" "sleepy.R"
> # Nothing up my sleeve
>
> source("pylogger.R")
> source("sleepy.R")
Error: You must initialize with 'logger.setup()' before issuing logger statements.
> # Setup is required
>
> logger.setup()
> source("sleepy.R")
FATAL [2017-03-15 16:34:15] Not gonna marzlmurrrzzzzz ...
Error in eval(expr, envir, enclos) : Snap out of it!
> # The console log level is set to FATAL by default
>
> list.files()
[1] "pylogger.R" "sleepy.R"
> # No log files created
>
> # Now modify console log level
> logger.setLevel(ERROR)
> source("sleepy.R")
ERROR [2017-03-15 16:35:29] Having trouble staying awake ...
FATAL [2017-03-15 16:35:32] Not gonna marzlmurrrzzzzz ...
Error in eval(expr, envir, enclos) : Snap out of it!
> # Got ERROR and higher
>
> logger.setLevel(DEBUG)
> source("sleepy.R")
INFO [2017-03-15 16:35:42] Getting sleepy ...
WARN [2017-03-15 16:35:43] Getting really tired ...
ERROR [2017-03-15 16:35:45] Having trouble staying awake ...
FATAL [2017-03-15 16:35:48] Not gonna marzlmurrrzzzzz ...
Error in eval(expr, envir, enclos) : Snap out of it!
> # Got DEBUG and higher
>
> list.files()
[1] "pylogger.R" "sleepy.R"
> # Still no log files
>
> # Set up log files for two levels
> logger.setup(debugLog="debug.log", errorLog="error.log")
> logger.setLevel(FATAL)
> source("sleepy.R")
FATAL [2017-03-15 16:36:43] Not gonna marzlmurrrzzzzz ...
Error in eval(expr, envir, enclos) : Snap out of it!
> # Expected console output
>
> list.files()
[1] "debug.log"  "error.log"  "pylogger.R" "sleepy.R"
> readr::read_lines("debug.log")
[1] "INFO [2017-03-15 16:36:37] Getting sleepy ..."
[2] "WARN [2017-03-15 16:36:38] Getting really tired ..."
[3] "ERROR [2017-03-15 16:36:40] Having trouble staying awake ..."
[4] "FATAL [2017-03-15 16:36:43] Not gonna marzlmurrrzzzzz ..."
> readr::read_lines("error.log")
[1] "ERROR [2017-03-15 16:36:40] Having trouble staying awake ..."
[2] "FATAL [2017-03-15 16:36:43] Not gonna marzlmurrrzzzzz ..."
> # Got two log files containing DEBUG-and-higher and ERROR-and-higher

Best Wishes for Better Logging!



Puts as Protection

Wed, 03/15/2017 - 21:10

Many asset management firms are happily enjoying record revenue and profits driven not by inorganic growth or skillful portfolio management but by a seemingly endless increase in US equity prices. These firms are effectively commodity producers entirely dependent on the price of an index over which the firm has no control. The options market presents an easy, cheap, and liquid form of protection.

Ensemble Methods are Doomed to Fail in High Dimensions

Wed, 03/15/2017 - 20:28

(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)

Ensemble methods

By ensemble methods, I (Bob, not Andrew) mean approaches that scatter points in parameter space and then make moves by interpolating or extrapolating among subsets of them. Two prominent examples are:

There are extensions and computer implementations of these algorithms. For example, the Python package emcee implements Goodman and Weare’s walker algorithm and is popular in astrophysics.

Typical sets in high dimensions

If you want to get the background on typical sets, I’d highly recommend Michael Betancourt’s video lectures on MCMC in general and HMC in particular; they both focus on typical sets and their relation to the integrals we use MCMC to calculate.

It was Michael who made a doughnut in the air, pointed at the middle of it and said, “It’s obvious ensemble methods won’t work.” This post just fleshes out the details with some code for the rest of us who lack such sharp geometric intuitions.

MacKay’s information theory book is another excellent source on typical sets. Don’t bother with the Wikipedia on this one.

Why ensemble methods fail: Executive summary

  1. We want to draw a sample from the typical set
  2. The typical set is a thin shell at a fixed radius from the mode of a multivariate normal
  3. Interpolating or extrapolating two points in this shell is unlikely to fall in this shell
  4. The only steps that get accepted will be near one of the starting points
  5. The samplers devolve to a random walk with poorly biased choice of direction
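Point 2 can be checked numerically with a few lines of R (my sketch, not from the original post): in K dimensions, the distance of a standard normal draw from the mode concentrates tightly around sqrt(K).

```r
set.seed(42)
K <- 100
# Euclidean distance from the mode (the origin) for 1000 independent draws
r <- replicate(1000, sqrt(sum(rnorm(K)^2)))
summary(r)  # concentrates near sqrt(K) = 10, far from the mode at radius 0
sd(r)       # the shell is thin: standard deviation well under 1
```

So essentially no draws land near the mode, and essentially none land far outside radius 10 either.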

Several years ago, Peter Li built the Goodman and Weare walker methods for Stan (all they need is log density) on a branch for evaluation. They failed in practice exactly the way the theory says they will fail. Which is too bad, because the ensemble methods are very easy to implement and embarrassingly parallel.

Why ensemble methods fail: R simulation

OK, so let’s see why they fail in practice. I’m going to write some simple R code to do the job for us. Here’s an R function to generate a 100-dimensional standard isotropic normal variate (each element is generated normal(0, 1) independently):

normal_rng <- function(K) rnorm(K);

This function computes the log density of a draw:

normal_lpdf <- function(y) sum(dnorm(y, log=TRUE));

Next, generate two draws from a 100-dimensional version:

K <- 100; y1 <- normal_rng(K); y2 <- normal_rng(K);

and then interpolate by choosing a point between them:

lambda <- 0.5; y12 <- lambda * y1 + (1 - lambda) * y2;

Now let's see what we get:

print(normal_lpdf(y1), digits=1);
print(normal_lpdf(y2), digits=1);
print(normal_lpdf(y12), digits=1);

[1] -153
[1] -142
[1] -123

Hmm. Why is the log density of the interpolated vector so much higher? Given that it's a multivariate normal, the answer is that it's closer to the mode. That should be a good thing, right? No, it's not. The typical set is defined as an area within "typical" density bounds. When I take a random draw from a 100-dimensional standard normal, I expect log densities that hover between -140 and -160 or so. That interpolated vector y12 with a log density of -123 isn't in the typical set!!! It's a bad draw, even though it's closer to the mode. Still confused? Watch Michael's videos above. Ironically, there's a description in the Goodman and Weare paper in a discussion of why they can use ensemble averages that also explains why their own sampler doesn't scale---the variance of averages is lower than the variance of individual draws; and we want to cover the actual posterior, not get closer to the mode.

So let's put this in a little sharper perspective by simulating thousands of draws from a multivariate normal and thousands of draws interpolating between pairs of draws and plot them in two histograms. First, draw them and print a summary:

lp1 <- vector();
for (n in 1:1000) lp1[n] <- normal_lpdf(normal_rng(K));
print(summary(lp1));
lp2 <- vector();
for (n in 1:1000) lp2[n] <- normal_lpdf((normal_rng(K) + normal_rng(K))/2);
print(summary(lp2));

from which we get:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   -177    -146    -141    -142    -136    -121
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   -129    -119    -117    -117    -114    -108

That's looking bad. It's even clearer with a faceted histogram:

library(ggplot2);
df <- data.frame(list(log_pdf = c(lp1, lp2),
                      case = c(rep("Y1", 1000), rep("(Y1 + Y2)/2", 1000))));
plot <- ggplot(df, aes(log_pdf)) +
  geom_histogram(color="grey") +
  facet_grid(case ~ .);

Here's the plot:

The bottom plot shows the distribution of log densities in independent draws from the standard normal (these are pure Monte Carlo draws). The top plot shows the distribution of the log density of the vector resulting from interpolating two independent draws from the same distribution. Obviously, the log densities of the averaged draws are much higher. In other words, they are atypical of draws from the target standard normal density.


As an exercise, check out what happens as (1) the number of dimensions K varies, and (2) lambda varies within or outside of [0, 1].

Hint: What you should see is that as lambda approaches 0 or 1, the draws get more and more typical, and more and more like random walk Metropolis with a small step size. As dimensionality increases, the typical set becomes more attenuated and the problem becomes worse (and vice-versa as it decreases).
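A quick sketch of the first part of this exercise (my code, not the post's): the interpolant lambda * y1 + (1 - lambda) * y2 of two independent draws has elementwise variance lambda^2 + (1 - lambda)^2 < 1, so its log density drifts toward the mode unless lambda is near 0 or 1.

```r
set.seed(123)
K <- 100
normal_lpdf <- function(y) sum(dnorm(y, log = TRUE))
# Mean log density of interpolants for several values of lambda
mean_lp <- sapply(c(0.5, 0.9, 0.99), function(lambda) {
  mean(replicate(500,
    normal_lpdf(lambda * rnorm(K) + (1 - lambda) * rnorm(K))))
})
round(mean_lp)  # approaches the typical value (about -141) as lambda -> 1
```

At lambda = 0.5 the interpolant sits far above the typical band; at lambda = 0.99 it is nearly typical, but then the move is barely distinguishable from a small random-walk step.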

Does Hamiltonian Monte Carlo (HMC) have these problems?

Not so much. It scales much better with dimension. It'll slow down, but it won't break and devolve to a random walk like ensemble methods do.

The post Ensemble Methods are Doomed to Fail in High Dimensions appeared first on Statistical Modeling, Causal Inference, and Social Science.

To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Modeling, Causal Inference, and Social Science. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Jobs for R users – 7 R jobs from around the world (2017-03-15)

Wed, 03/15/2017 - 19:00
To post your R job on the next post

Just visit this link and post a new R job to the R community. You can post a job for free (and there are also two “featured job” options available for extra exposure).

Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

Featured Jobs
More New Jobs
  1. Full-Time
    R Shiny Developer
    Summit Consulting LLC – Posted by jdieterle
    District of Columbia, United States
    14 Mar 2017
  2. Full-Time
    Data Scientist for MDClone
    MDClone – Posted by MDClone
    Be’er Sheva
    South District, Israel
    14 Mar 2017
  3. Freelance
    Authoring a training course : Machine learning with R – for Packt
    Koushik Sen – Posted by Koushik Sen
    7 Mar 2017
  4. Full-Time
    Quantitative Research Associate
    The Millburn Corporation – Posted by themillburncorporation
    New York
    New York, United States
    6 Mar 2017
  5. Full-Time
    Data Scientist – Analytics @ – Posted by work_at_booking
    4 Mar 2017
  6. Full-Time
    Data Manager and Data Analysis Expert @ Leipzig, Sachsen, Germany
    Max Planck Institute for Cognitive and Brain Sciences – Posted by Mandy Vogel
    Sachsen, Germany
    3 Mar 2017
  7. Full-Time
    Postdoctoral fellow @ Belfast, Northern Ireland, United Kingdom
    Queen’s University Belfast – Posted by Reinhold
    Northern Ireland, United Kingdom
    2 Mar 2017

On R-users you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(you may also look at previous R jobs posts).

New screencast: using R and RStudio to install and experiment with Apache Spark

Wed, 03/15/2017 - 17:40

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

I have new short screencast up: using R and RStudio to install and experiment with Apache Spark.

More material from my recent Strata workshop Modeling big data with R, sparklyr, and Apache Spark can be found here.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

RevoScaleR package dependencies with graph visualization

Wed, 03/15/2017 - 17:10

(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

MRAN currently holds 7,520 R packages. We can see this using the following command (assuming you are running the MRAN R version):

library(tools)
df_ap <- data.frame(available.packages())
head(df_ap)

By importing the package tools, we get many useful functions for finding additional information on packages.

Function package.dependencies() parses and checks the dependencies of a package in the current environment. Function package_dependencies() (with an underscore rather than a dot) finds all dependent and reverse-dependent packages.

With the following code I can extract the packages and their dependencies (this also normalizes the data):

net <- data.frame(df_ap[, c(1, 4)])
library(dplyr)
library(tidyr)  # for unnest()
netN <- net %>%
        mutate(Depends = strsplit(as.character(Depends), ",")) %>%
        unnest(Depends)
netN

And the result is:

Source: local data frame [14,820 x 2]

   Package        Depends
    (fctr)          (chr)
1       A3 R (>= 2.15.0)
2       A3         xtable
3       A3        pbapply
4   abbyyR   R (>= 3.2.0)
5      abc    R (>= 2.10)
6      abc
7      abc           nnet
8      abc       quantreg
9      abc           MASS
10     abc         locfit
..     ...            ...

The data presented this way still needs further cleaning and preparation.

Once you have the data normalized, you can use any of the network packages to visualize it. Using the igraph package, I created a visual presentation of the RevoScaleR package's dependencies and imported packages.

With the following code I filter out the RevoScaleR package and create the visual:

library(igraph)
netN_g <-[edges$src %in% c('RevoScaleR', deptree), ])
plot(netN_g)


Happy R-ing!



To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.

Data Structures Exercises

Wed, 03/15/2017 - 17:06

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

There are 5 important basic data structures in R: vector, matrix, array, list and data frame. They can be 1-dimensional (vector and list), 2-dimensional (matrix and data frame) or multidimensional (array). They also differ in the homogeneity of the elements they can contain: while all elements of a vector, matrix or array must be of the same type, a list or data frame can contain multiple types.
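A small illustration of these homogeneity rules (my own example, not part of the exercises below):

```r
v <- c(1, "a")               # atomic vector: elements are coerced to one type
class(v)                     # "character"
l <- list(1, "a", TRUE)      # list: each element keeps its own type
sapply(l, class)             # "numeric" "character" "logical"
df <- data.frame(x = 1:2, y = c("a", "b"))
sapply(df, class)            # columns of a data frame may differ in type
```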

In this set of exercises we shall practice casting between different types of these data structures, together with some basic operations on them. You can find more about data structures on Advanced R – Data structures page.

Answers to the exercises are available here.

If you have a different solution, feel free to post it.

Exercise 1

Create a vector named v which contains 10 random integer values between -100 and +100.

Exercise 2

Create a two-dimensional 5×5 array named a comprising a sequence of even integers greater than 25.

Create a list named s containing a sequence of 20 capital letters, starting with ‘C’.

Exercise 3

Create a list named l and put all previously created objects in it. Name them a, b and c respectively. How many elements are there in the list? Show the structure of the list. Count all elements recursively.

Exercise 4

Without running commands in R, answer the following questions:

  1. What is the result of l[[3]]?
  2. How would you access a random letter in the list element c?
  3. If you convert list l to a vector, what will be the type of its elements?
  4. Can this list be converted to an array? What will be the data type of the elements in the array?

Check the results with R.

Exercise 5

Remove letters from the list l. Convert the list l to a vector and check its class. Compare it with the result from exercise 4, question #3.

Exercise 6

Find the difference between the elements in l[["a"]] and l[["b"]]. Find the intersection between them. Is the number 33 in their union?

Exercise 7

Create 5×5 matrix named m and fill it with random numeric values rounded to two decimal places, ranging from 1.00 to 100.00.

Exercise 8

Answer the following question without running R command, then check the result.

What will be the class of data structure if you convert matrix m to:

  • vector
  • list
  • data frame
  • array?

Exercise 9

Transpose array l$b and then convert it to matrix.

Exercise 10

Get union of matrix m and all elements in list l and sort it ascending.

Related exercise sets:
  1. Matrix exercises
  2. Array exercises
  3. Mode exercises
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

Why I love R Notebooks

Wed, 03/15/2017 - 17:00

(This article was first published on RStudio, and kindly contributed to R-bloggers)

by Nathan Stephens

Note: R Notebooks requires RStudio Version 1.0 or later

I’m a big fan of the R console. During my early years with R, that’s all I had, so I got very comfortable with pasting my code into the console. Since then I’ve used many code editors for R, but they all followed the same paradigm – script in one window and get output in another window. Notebooks on the other hand combine code, output, and narrative into a single document. Notebooks allow you to interactively build narratives around small chunks of code and then publish the complete notebook as a report.

R Notebooks is a new feature of RStudio which combines the benefits of other popular notebooks (such as Jupyter, Zeppelin, and Beaker) with the benefits of R Markdown documents. As a long time R user I was skeptical that I would like this new paradigm, but after a few months I became a big fan. Here are my top three reasons why I love R Notebooks.

Number 3: Notebooks are for doing science

If scripting is for writing software then notebooks are for doing data science. In high school science class I used a laboratory notebook that contained all my experiments. When I conducted an experiment, I drew sketches and wrote down my results. I also wrote down my ideas and thoughts. The process had a nice flow which helped me improve my thinking. Doing science with physical notebooks is an idea that is centuries old.

Electronic notebooks follow the same pattern as physical notebooks but apply it to code. With notebooks you break your script into manageable code chunks. You add narrative and output around each code chunk, which puts it into context and makes it reproducible. When you are done, you have an elegant report that can be shared with others. Here is the thought process for doing data science with notebooks:

  • I have a chunk of code that I want to tell you about.
  • I am going to execute this chunk of code and show you the output.
  • I am going to share all chunks of code with you in a single, reproducible document.

If you do data science with R scripts, on the other hand, you develop your code as a single script. You add comments to the code, but the comments tend to be terse or nonexistent. Your output may or may not be captured at all. Sharing your results in a report requires a separate, time consuming process. Here is the thought process for doing data science with scripts:

  • I have a thousand lines of code and you get to read my amazing comments!
  • Hold onto your hats while I batch execute this entire script!
  • You can find my code and about 50 plots under the project directory (I hope you have permissions).
Number 2: R Notebooks have great features

R Notebooks are based on R Markdown documents. That means they are written in plain text and work well with version control. They can be used to create elegantly formatted output in multiple formats (e.g. HTML, PDF, and Word).

R Notebooks have some features that are not found in traditional notebooks. These are not necessarily inherent differences, but differences of emphasis. For example, R Markdown documents give you many options when selecting graphics, templates, and formats for your output.

Feature                                   R Notebooks           Traditional Notebooks
Plain text representation                 ✓
Same editor/tools used for R scripts      ✓
Works well with version control           ✓
Create elegantly formatted output         ✓
Output inline with code                   ✓                     ✓
Output cached across sessions             ✓                     ✓
Share code and output in a single file    ✓                     ✓
Emphasized execution model                Interactive & Batch   Interactive

When you execute a code chunk in an R Notebook, the output is cached and rendered inside the IDE. When you save the notebook, the same cache is rendered inside a document. The HTML output of R Notebooks is a dual file format that contains both the HTML and the R Markdown source code. The dual format gives you a single file that can be viewed in a browser or opened in the RStudio IDE.
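For reference, an R Notebook is just an R Markdown file whose YAML header requests the html_notebook output format (a minimal example; the title is arbitrary):

```yaml
---
title: "My analysis"
output: html_notebook
---
```

Saving a file with this header in RStudio produces the dual-format .nb.html output described above.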

Number 1: R Notebooks make it easy to create and share reports

My favorite part of R Notebooks is having the ability to easily share my work. With R notebooks a high quality report is an automatic byproduct of a completed analysis. If I write down my thoughts while I analyze my code chunks, then all I have to do is push a button to render a report. I can share this report by publishing it to the web, emailing it to my colleagues, or presenting it with slides. This video shows how easy it is to create a report from an R Notebook.

R Notebooks for data science

The following table summarizes the differences between notebooks and scripts.

Activity: Building a narrative
  R Notebook: Rich text is added throughout the analytic process describing the motivation and the conclusions for each chunk of the code.
  R Script: Comments are added to the script, and a report that describes the entire analysis is drafted after the script is completed.

Activity: Organizing plots, widgets, and tables
  R Notebook: All output is embedded in a single document and collocated with the narrative and code chunk to which it belongs.
  R Script: Each individual output is sent to file and is collected later into a report.

Activity: Creating reports
  R Notebook: Rendering the final report is instant. The same document can be published to multiple formats (e.g. HTML, PDF, Word). Since the document is based on code, future changes are easy to implement and the document is reproducible by others.
  R Script: Creating a report is a separate, time consuming step. Any changes to the report can be time consuming and prone to error. Since the report is not tied to code, it is not reproducible.

R Notebooks are not designed for all the work you do in R. If you are writing software for a new package or building a Shiny app you will want to use an R script. However, if you are doing data science you might try R Notebooks. They are great for tasks like exploratory data analysis, model building, and communicating insights. Notebooks are useful for data science because they organize narrative, code, and text around manageable code chunks; and creating quality, reproducible reports is easy.

References: For an introduction to R Notebooks see this video or blog post. For more detailed information, see this workflow presentation or the reference site.

To leave a comment for the author, please follow the link and comment on their blog: RStudio.

Adding figure labels (A, B, C, …) in the top left corner of the plotting region

Wed, 03/15/2017 - 16:39

I decided to submit a manuscript using only R with knitr, pandoc and make. Actually, it went quite well. Revisions of a manuscript with complex figures did not require much manual work once the R code for the figures had been created. The manuscript ended up as a Word file (for the sake of co-authors), looking no different than any other manuscript. However, you can look up precisely how all the figures were generated and, with a single command, re-create the manuscript (with all figures and supplementary data) after changing a parameter.

One of the small problems I faced was adding labels to pictures. You know — like A, B, C… in the top left corner of each panel of a composite figure. Here is the output I was striving for:

Doing it proved to be more tedious than I thought at first. By default, you can only plot things in the plotting region, everything else gets clipped — you cannot put arbitrary text anywhere outside the rectangle containing the actual plot:

plot(rnorm(100))
text(-20, 0, "one two three four", cex=2)

This is because the plotting area is the red rectangle on the figure below, and everything outside it will not be shown by default:

One can use the function mtext to put text on the margins. However, there is no easy way to say “put the text in the top left corner of the figure”, and the results I was able to get were never perfect. Anyway, to push the label really to the very left of the figure region using mtext, you first need to have the user coordinate of that region (to be able to use option ‘at’). However, if you know these coordinates, it is much easier to achieve the desired effect using text.

However, we need to figure out a few things. First, to avoid clipping of the region, one needs to change the parameter xpd:
par(xpd=NA)


Then, we need to know where to draw the label. We can get the coordinates of the device (in inches), and then we can translate these to user coordinates with appropriate functions:

plot(rnorm(100))
di <- dev.size("in")
x <- grconvertX(c(0, di[1]), from="in", to="user")
y <- grconvertY(c(0, di[2]), from="in", to="user")

x[1] and y[2] are the coordinates of the top left corner of the device… but not of the figure. Since we might have used, for example, par(mar=...) or layout to put multiple plots on the device, and we would like to always label the current plot only (i.e. put the label in the corner of the current figure, not of the whole device), we have to take this into account as well:

fig <- par("fig")
x <- x[1] + (x[2] - x[1]) * fig[1:2]
y <- y[1] + (y[2] - y[1]) * fig[3:4]

However, before plotting, we have to adjust this position by half of the text string width and height, respectively:

txt <- "A"
x <- x[1] + strwidth(txt, cex=3) / 2
y <- y[2] - strheight(txt, cex=3) / 2
text(x, y, txt, cex=3)

Looks good! That is exactly what I wanted:

Below you will find an R function that draws a label in one of the three regions — figure (default), plot or device. You specify the position of the label using the labels also used by legend: “topleft”, “bottomright” etc.

fig_label <- function(text, region="figure", pos="topleft", cex=NULL, ...) {

  region <- match.arg(region, c("figure", "plot", "device"))
  pos <- match.arg(pos, c("topleft", "top", "topright",
                          "left", "center", "right",
                          "bottomleft", "bottom", "bottomright"))

  if(region %in% c("figure", "device")) {
    ds <- dev.size("in")
    # xy coordinates of device corners in user coordinates
    x <- grconvertX(c(0, ds[1]), from="in", to="user")
    y <- grconvertY(c(0, ds[2]), from="in", to="user")

    # fragment of the device we use to plot
    if(region == "figure") {
      # account for the fragment of the device that
      # the figure is using
      fig <- par("fig")
      dx <- (x[2] - x[1])
      dy <- (y[2] - y[1])
      x <- x[1] + dx * fig[1:2]
      y <- y[1] + dy * fig[3:4]
    }
  }

  # much simpler if in plotting region
  if(region == "plot") {
    u <- par("usr")
    x <- u[1:2]
    y <- u[3:4]
  }

  sw <- strwidth(text, cex=cex) * 60/100
  sh <- strheight(text, cex=cex) * 60/100

  x1 <- switch(pos,
    topleft     = x[1] + sw,
    left        = x[1] + sw,
    bottomleft  = x[1] + sw,
    top         = (x[1] + x[2])/2,
    center      = (x[1] + x[2])/2,
    bottom      = (x[1] + x[2])/2,
    topright    = x[2] - sw,
    right       = x[2] - sw,
    bottomright = x[2] - sw)

  y1 <- switch(pos,
    topleft     = y[2] - sh,
    top         = y[2] - sh,
    topright    = y[2] - sh,
    left        = (y[1] + y[2])/2,
    center      = (y[1] + y[2])/2,
    right       = (y[1] + y[2])/2,
    bottomleft  = y[1] + sh,
    bottom      = y[1] + sh,
    bottomright = y[1] + sh)

  old.par <- par(xpd=NA)
  on.exit(par(old.par))

  text(x1, y1, text, cex=cex, ...)
  return(invisible(c(x, y)))
}
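For quick use without the full helper, the manual steps condense into a short standalone function (my own sketch of the approach described above, hard-wired to the top left corner):

```r
label_panel <- function(txt, cex = 2) {
  # corners of the device, in user coordinates
  di <- dev.size("in")
  x <- grconvertX(c(0, di[1]), from = "in", to = "user")
  y <- grconvertY(c(0, di[2]), from = "in", to = "user")
  # restrict to the current figure region
  fig <- par("fig")
  x <- x[1] + (x[2] - x[1]) * fig[1:2]
  y <- y[1] + (y[2] - y[1]) * fig[3:4]
  # draw without clipping, nudged in by half the label size
  op <- par(xpd = NA); on.exit(par(op))
  text(x[1] + strwidth(txt, cex = cex) / 2,
       y[2] - strheight(txt, cex = cex) / 2, txt, cex = cex)
}

par(mfrow = c(1, 2))
plot(rnorm(20)); label_panel("A")
plot(rnorm(20)); label_panel("B")
```

Because par("fig") is consulted at call time, each label lands in the corner of the current panel rather than the whole device.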

New Course: Unsupervised Learning in R

Wed, 03/15/2017 - 15:13

(This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

Hi there – today we’re launching a new machine learning course on Unsupervised Learning in R by Hank Roark!

Many times in machine learning, the goal is to find patterns in data without trying to make predictions. This is called unsupervised learning. One common use case of unsupervised learning is grouping consumers based on demographics and purchasing history to deploy targeted marketing campaigns. Another example is wanting to describe the unmeasured factors that most influence crime differences between cities. This course provides a basic introduction to clustering and dimensionality reduction in R from a machine learning perspective so that you can get from data to insights as quickly as possible.

Start for free

Unsupervised Learning in R features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you a master at machine learning in R!

What you’ll learn:

The k-means algorithm is one common approach to clustering. Learn how the algorithm works under the hood, implement k-means clustering in R, visualize and interpret the results, and select the number of clusters when it’s not known ahead of time. By the end of the first chapter, you’ll have applied k-means clustering to a fun “real-world” dataset!
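As a small taste of that workflow (my own example on the built-in iris data, not the course's dataset):

```r
set.seed(1)
# cluster the four numeric measurements into 3 groups
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)
table(km$cluster)  # cluster sizes
km$tot.withinss    # total within-cluster sum of squares
```

Repeating this for several values of centers and comparing tot.withinss is one common way to pick the number of clusters.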

In chapter 2, you’ll learn about hierarchical clustering which is another popular method for clustering. The goal of this chapter is to go over how it works, how to use it, and how it compares to k-means clustering.

Chapter 3 covers principal component analysis, or PCA, which is a common approach to dimensionality reduction. Learn exactly what PCA does, visualize the results of PCA with biplots and scree plots, and deal with practical issues such as centering and scaling the data before performing PCA.
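A minimal PCA sketch along the same lines (again on iris, with the centering and scaling the chapter recommends):

```r
pr <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pr)                     # proportion of variance per component
pr$sdev^2 / sum(pr$sdev^2)      # the quantity a scree plot displays
```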

The goal of the final chapter is to guide you through a complete analysis using the unsupervised learning techniques covered in the first three chapters. You’ll extend what you’ve learned by combining PCA as a preprocessing step to clustering using data that consist of measurements of cell nuclei of human breast masses.

About Hank Roark: Hank is a Senior Data Scientist at Boeing and a long time user of the R language. Prior to his current role, he led the Customer Data Science team at a leading provider of machine learning and predictive analytics services.

Start course for free

To leave a comment for the author, please follow the link and comment on their blog: DataCamp Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

New version of imager package for image processing

Wed, 03/15/2017 - 13:42

(This article was first published on R – dahtah, and kindly contributed to R-bloggers)

A new version of imager is now available on CRAN. This release brings a lot of new features, including a whole new set of functions dealing with pixel sets, better support for videos, and new, faster reduction functions.

The most significant change is the introduction of a “pixset” class, which deals with sets of pixels (for instance, the set of all pixels with a certain brightness, the foreground, the left-hand side of an image, an ROI, etc.). The pixset class includes many tools for working with pixsets, described in a new vignette.

The following piece of code illustrates some of the new features. We take a picture of coins on a table (from scikit-image), segment the coins from the background and highlight the largest coin:

library(imager)
library(dplyr)

im <- load.example("coins")
d <- as.data.frame(im)  # pixel values as a data frame (assignment was truncated in the source; a data frame is assumed since sample_n() is applied below)

## Subsample, fit a linear model to remove trend in illumination, threshold
px <- sample_n(d, 1e4) %>%
  lm(value ~ x*y, data = .) %>%
  predict(d) %>%
  { im - . } %>%
  threshold

## Clean up
px <- clean(px, 3) %>% imager::fill(7)
plot(im)
highlight(px)

## Split into connected components (individual coins)
pxs <- split_connected(px)
## Compute their respective area
area <- sapply(pxs, sum)
## Highlight largest coin in green
highlight(pxs[[which.max(area)]], col = "green", lwd = 2)

The website’s also been overhauled, and new tutorials are available. These include a tutorial on quad-trees (using recursive data structures to represent an image), imager as an image editor (what’s the equivalent of a circular selection in imager? How about the bucket tool?), and image unshredding (how to shuffle an image and put it back together).

See here for a complete list.

Below, the full changelog for imager v0.40:

  • added pixset class to represent sets of pixels in an image (implemented as binary images). A test on an image (e.g., im > 0) results in a pixset object. Pixsets come with many convenience functions for plotting, manipulation, morphology, etc. They are described in the "pixsets" vignette.
  • improved reductions (parmax, parmin, etc.). Most are now implemented in C++ and some run in parallel using OpenMP. A median combine operation (parmedian) has been added.
  • added,, functions. Loading and saving videos used to be a bit fragile and platform-dependent, it should now be easier and more robust (also slower, sorry). It’s now possible to load individual frames from a video. You still need ffmpeg.
  • To load images from URLs, imager now uses the downloader package, which is more robust than the previous solution.
  • it’s now possible to import images from the raster package and from the magick package.
  • unified RGB representations so that every function expects RGB values to be in the [0-1] range. There used to be a conflict in expectations here, with R expecting [0-1] and CImg [0-255]. This might break existing code (albeit in minor ways).
  • new function implot, lets you draw on an image using R’s base graphics.
  • improved interactive functions for grabbing image regions (grabRect, grabLine, etc.)
  • improved grayscale conversion
  • improved plotting. The default is now to have a constant aspect ratio that matches the aspect ratio of the image.
  • save.image now accepts a quality argument when saving JPEGs.
  • improved native TIFF support, including non-integer values in TIFF files
  • rm.alpha removes alpha channel, flatten.alpha flattens it
  • imfill now accepts colour names, e.g. imfill(10,10,val='red')
  • improved documentation and examples
  • added functions for conversion to/from CIELAB

To leave a comment for the author, please follow the link and comment on their blog: R – dahtah.

An update to the nhmrcData R package

Wed, 03/15/2017 - 12:03

(This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers)

Just pushed an updated version of my nhmrcData R package to Github. A quick summary of the changes:

  • In response to feedback, added the packages required for vignette building as dependencies (Imports) – commit
  • Added 8 new datasets with funding outcomes by gender for 2003 – 2013, created from a spreadsheet that I missed first time around – commit and see the README

The vignette is not yet updated with new examples.

So now you can generate even more depressing charts of funding rates for even more years, such as the one featured on the right (click for full-size).

Enjoy and as ever, let me know if there are any issues.

Filed under: R, statistics Tagged: data, nhmrc, package, rstats

To leave a comment for the author, please follow the link and comment on their blog: R – What You're Doing Is Rather Desperate.

sasMap: static code analysis for SAS

Wed, 03/15/2017 - 11:54

(This article was first published on Mango Solutions » R Blog, and kindly contributed to R-bloggers)

Ava Yang

You may drop your weapons, this is not going to be about SAS vs R. If you work with a large amount of SAS legacy code, sasMap, an R package with a Shiny app, is for you. It evolved from our experience in migrating SAS to R; see Mark Sellors' post about production R at ONS for an example.

Disclaimer: there’s no such thing as a SAS-to-R auto translator, yet. sasMap is a map that can keep you on track by making it easy to find a path through a wild land.


Often multiple macros are nested to construct main SAS analyses. User macros are held in sub-folders and are called in top level scripts. sasMap calculates summary statistics of SAS scripts and helps to understand macro and script dependency. The key functionalities of the package are:

  • Extract summary statistics such as procs and data steps
  • Draw a barplot of proc calls
  • Visualize static and interactive network of script dependency

And to accomplish this the package provides the following functions:

  • parseSASscript Parse a SAS script
  • parseSASfolder Parse a SAS folder
  • listProcs List frequency of various proc calls
  • drawProcs Draw frequency of various proc calls in a bar plot
  • plotSASmap Draw script dependency in static plot
  • plotSASmapJS Draw script dependency in interactive way

The package includes some dummy SAS code in the \examples\SAScode folder. The folder contains one high level script MainAnalysis.SAS and a subfolder called Macros where the user’s macros live. The main assumption is that each macro corresponds to a script of the same name. Some macros are called but don’t have a named script. For example, %summary in Util2.SAS is not displayed in the static network representation, whereas it belongs to the internal macros group in the interactive network graph.

The summary statistics include measures such as number of lines (nLines), Procs, number of data step (Data_step), macro calls (Macro_call) and macro defined (Macro_define).

# Install sasMap from github
devtools::install_github("MangoTheCat/sasMap")

# Load libraries
library(sasMap)
library(knitr)  # for kable(), used below

# Navigate to target directory (placeholder path; point this at your own SAS code)
sasDir <- "examples/SAScode"

# Parse SAS folder
kable(parseSASfolder(sasDir))

# Draw frequency of proc calls
drawProcs(sasDir)

# Draw network of SAS scripts. A pdf file can be created by specifying the file name.
net <- renderNetwork(sasDir)
# plotSASmap(net, width=10, height=10, pdffile='sasMap.pdf')
plotSASmap(net, width=10, height=10)

## Alternatively, draw it interactively (not run here)
plotSASmapJS(sasDir)

Put them together

The sasMap package is accompanied by a shiny app which you can run by executing the following line of code:

library(shiny)
runApp(system.file('shiny', package='sasMap'))

Once the “I want to specify a local directory (Warning: It only works when running the shiny app from a local machine).” box is ticked, a “Choose directory” button is exposed which makes it straightforward to navigate to your SAS folder (thanks to the shinyFiles package). You can also view a demo version of the app here. For demo purposes, the deployed version has the dummy SAS code hard-coded.


At Mango we have benefited greatly from this way of working with SAS code (see this blogpost for more information). If you want to know more about the sasMap package or about SAS to R migration, feel free to contact us by phone (+44 (0)1249 705 450) or email. The code for this post is available on github, as is the code for the package.

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions » R Blog.

Plotly for R workshop at Plotcon 2017

Wed, 03/15/2017 - 07:25

Carson Sievert, the lead developer of the Plotly package for R, will be hosting a workshop at Plotcon 2017. Here’s an outline of the material he will be covering during the workshop.

More details here. The workshop will be based on Carson’s Plotly for R book.

  • A tale of 2 interfaces
    • Converting ggplot2 via ggplotly()
    • Directly interfacing to plotly.js with plot_ly()
    • Augmenting ggplotly() via layout() and friends
    • Accessing and leveraging ggplot2 internals
    • Accessing any plot spec via plotly_json()
  • The plot_ly() cookbook
    • Scatter traces
    • Bars & histograms
    • 2D frequencies
    • 3D plots
  • Arranging multiple views
    • Arranging htmlwidget objects
    • Merging plotly objects into a subplot
    • Navigating many views
  • Multiple linked views
    • Linking views with shiny
    • Linking views without shiny
    • Linking views “without shiny, inside shiny”
  • Animating views
    • Key frame animations
    • Linking animated views
  • Advanced topics
    • Adding custom behavior with the plotly.js API and htmlwidgets::onRender()
    • Translating ggplot2 geoms to plotly

Neural Networks: How they work, and how to train them in R

Tue, 03/14/2017 - 23:01

With the current focus on deep learning, neural networks are all the rage again. (Neural networks have been described for more than 60 years, but it wasn't until the power of modern computing systems became available that they were successfully applied to tasks like image recognition.) Neural networks are the fundamental predictive engine in deep learning systems, but it can be difficult to understand exactly what they do. To help with that, Brandon Rohrer has created this from-the-basics guide to how neural networks work:

In R, you can train a simple neural network with just a single hidden layer using the nnet package, which comes pre-installed with every R distribution. It's a great place to start if you're new to neural networks, but deep learning applications call for more complex networks. R has several packages to check out here, including MXNet, darch, deepnet, and h2o: see this post for a comparison. The tensorflow package can also be used to implement various kinds of neural networks.
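A minimal nnet sketch on the built-in iris data (the hidden layer size, weight decay, and iteration count below are illustrative choices, not recommendations):

```r
library(nnet)  # single-hidden-layer networks; ships with standard R distributions
set.seed(42)

# One hidden layer with 5 units; decay regularizes the weights
fit <- nnet(Species ~ ., data = iris, size = 5, decay = 0.01,
            maxit = 200, trace = FALSE)

# In-sample accuracy on the training data
acc <- mean(predict(fit, iris, type = "class") == iris$Species)
acc
```

For a held-out estimate of accuracy you would of course split the data into training and test sets first; this sketch only shows the fitting interface.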

Data Science and Robots Blog: How neural networks work

Data Visualization – Part 1

Tue, 03/14/2017 - 22:00

(This article was first published on R-Projects – Stoltzmaniac, and kindly contributed to R-bloggers)

Introduction to Data Visualization – Theory, R & ggplot2

The topic of data visualization is very popular in the data science community. The market size for visualization products is valued at $4 Billion and is projected to reach $7 Billion by the end of 2022 according to Mordor Intelligence. While we have seen amazing advances in the technology to display information, the understanding of how, why, and when to use visualization techniques has not kept up. Unfortunately, people are often taught how to make a chart before even thinking about whether or not it’s appropriate.

In short, are you adding value to your work or are you simply adding this to make it seem less boring? Let’s take a look at some examples before going through the Stoltzmaniac Data Visualization Philosophy.

I have to give credit to Junk Charts – it inspired a lot of this post.

One author at Vox wanted to show the causes of death in all of Shakespeare's plays.

Is this not insane!?!?!

Using a legend instead of data callouts is the only thing that could have made this worse. The author could easily have used any number of other tools to get the point across. While wordles are not ideal for work requiring exact proportions, a wordle does make for a great visual in this article. Junk Charts Article.

To be clear, I’m not close to being perfect when it comes to visualizations in my blog. The sizes, shapes, font colors, etc. tend to get out of control and I don’t take the time in R to tinker with all of the details. However, when it comes to displaying things professionally, it has to be spot on! So, I’ll walk through my theory and not worry too much about aesthetics (save that for a time when you’re getting paid).

The Good, The Bad, The Ugly

“The Good” visualizations:

  • Clearly illustrate a point
  • Are tailored to the appropriate audience
    • Analysts may want detail
    • Executives may want a high-level view
  • Are tailored to the presentation medium
    • A piece in an academic journal can be analyzed slowly and carefully
    • A slide in front of 5,000 people in a conference will be glanced at quickly
  • Are memorable to those who care about the material
  • Make an impact which increases the understanding of the subject matter

“The Bad” visualizations:

  • Are difficult to interpret
  • Are unintentionally misleading
  • Contain redundant and boring information

“The Ugly” visualizations:

  • Are almost impossible to interpret
  • Are filled with completely worthless information
  • Are intentionally created to mislead the audience
  • Are inaccurate
Coming soon:
  • Introduction to the ggplot2 in R and how it works
  • Determining whether or not you need a visualization
  • Choosing the type of plot to use depending on the use case
  • Visualization beyond the standard charts and graphs

As always, the code used in this post is on my GitHub

To leave a comment for the author, please follow the link and comment on their blog: R-Projects – Stoltzmaniac.

Unit Testing in R

Tue, 03/14/2017 - 17:47

(This article was first published on R Blog, and kindly contributed to R-bloggers)

Software testing describes several means of investigating program code with regard to its quality. The underlying approaches provide means to handle errors once they occur. Furthermore, software testing also covers techniques to reduce the probability of errors occurring in the first place.

R is becoming an increasingly prominent programming language. This includes not only pure statistical settings but also machine learning, dashboards via Shiny, and beyond. This development is simultaneously fueled by business schools teaching R to their students. While software testing is usually covered from a theoretical viewpoint, our slides teach the basics of software testing in an easy-to-understand fashion with the help of R.

Our slide deck aims at bridging R programming and software testing. The slides outline the need for software testing and describe general approaches, such as the V model. In addition, we present the built-in features for error handling in R and also show how to do unit testing with the help of the “testthat” package.
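As a small taste of R's built-in error handling, here is a base-R sketch using tryCatch() (the function and its error message are invented purely for illustration):

```r
# tryCatch() routes errors (and warnings) to handler functions
safe_divide <- function(x, y) {
  tryCatch({
    if (y == 0) stop("division by zero")
    x / y
  },
  error = function(e) {
    # Recover gracefully instead of aborting the script
    message("caught: ", conditionMessage(e))
    NA_real_
  })
}

safe_divide(10, 2)  # 5
safe_divide(1, 0)   # NA, after reporting the caught error
```

In a testthat unit test, the same contract would be expressed with expectations such as expect_equal() and expect_error().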

We hope that the slide deck supports practitioners to unleash the power of unit testing in R. Moreover, it should equip scholars in business schools with knowledge on software testing.

Download the slides here

The content was republished on with permission.

To leave a comment for the author, please follow the link and comment on their blog: R Blog.

How Reproducible Data Analysis Scripts Can Help You Route Around Data Sharing Blockers

Tue, 03/14/2017 - 16:20

For aaaagggggggeeeeeeesssssss now, I’ve been wittering on about how just publishing “open data” is okay insofar as it goes, but it’s often not that helpful, or at least, not as useful as it could be. Yes, it’s a Good Thing when a dataset is published in support of a report; but have you ever tried reproducing the charts, tables, or summary figures mentioned in the report from the data supplied along with it?

If a report is generated “from source” using something like Rmd (RMarkdown), which can blend text with analysis code, a means of importing the data used in the analysis, and the automatically generated outputs (such as charts, tables, or summary figures) obtained by executing the code over the loaded-in data, then third parties can see exactly how the data was turned into reported facts. And if you need to run the analysis again with a more recent dataset, you can do. (See here for an example.)

But publishing details about how to do the lengthy first mile of any piece of data analysis – finding the data, loading it in, and then cleaning and shaping it enough so that you can actually start to use it – has additional benefits too.

In the above linked example, the Rmd script links to a local copy of a dataset I’d downloaded onto my local computer. But if I’d written a properly reusable, reproducible script, I should have done at least one of the following two things:

  • either added a local copy of the data to the repository and checked that the script linked to it correctly with a relative path;
  • and/or provided the original download link for the datafile (and the HTML web page on which the link could be found) and loaded the data in from that URL.

Where the license of a dataset allows sharing, the first option is always a possibility. But where the license does not allow sharing on, the second approach provides a de facto way of sharing the data without actually sharing it directly yourself. I may not be giving you a copy of the data, but I am giving you some of the means by which you can obtain a copy of the data for yourself.
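One way to sketch the second approach in R (the helper name and the URL below are hypothetical, purely for illustration):

```r
# "Download once, then load locally" helper for a reproducible script.
# The URL in the usage line is a placeholder, not a real dataset location.
fetch_csv <- function(url, dest) {
  if (!file.exists(dest)) {
    # Only hits the network on the first run; later runs reuse the local copy
    download.file(url, dest, mode = "wb")
  }
  read.csv(dest, stringsAsFactors = FALSE)
}

# Usage in an Rmd chunk (placeholder URL):
# dat <- fetch_csv("https://example.com/open-data/spending.csv",
#                  "data/spending.csv")
```

The script itself never redistributes the data; it just records exactly where and how a reader can obtain their own copy.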

As well as getting round licensing requirements that limit sharing of a dataset but allow downloading of it for personal use, this approach can also be handy in other situations.

For example, where a dataset is available from a particular URL but authentication is required to access it. (This often needs a few more tweaks when trying to write a reusable downloader! A stop-gap is to provide the URL in the reproducible report document and explicitly instruct the reader to download the dataset locally using their own credentials, then load it in from the local copy.)

Or as Paul Bivand pointed out via Twitter, in situations “where data is secure like pupil database, so replication needs independent ethical clearance”. In a similar vein, we might add where data is commercial, and replication may be forbidden, or where additional costs may be incurred. And where the data includes personally identifiable information, such as data published under a DPA exemption as part of a public register, it may be easier all round not to publish your own copy or copies of data from such a register.

Sharing recipes also means you can share pathways to the inclusion of derived datasets, such as named entity tags extracted from a text using free but license-key-restricted services whose outputs may not be shareable (or at least must be attributed), such as the named entity extraction services operated by Thomson Reuters OpenCalais, Microsoft Cognitive Services, IBM Alchemy or Associated Press. That is, rather than tagging your dataset and then sharing and analysing the tagged data, publish a recipe that will allow a third party to tag the original dataset themselves and then analyse it.

Happy pi day!

Tue, 03/14/2017 - 15:39

(This article was first published on Rbloggers – The Analytics Lab, and kindly contributed to R-bloggers)

Just something funny because it’s π day. Enjoy!

# clear your environment
rm(list = ls())

# load the necessary libraries
library(png)
library(plotrix)

# lab colours
oranje <- rgb(228/255, 86/255, 65/255)
donkergrijs <- rgb(75/255, 75/255, 74/255)
lichtblauw <- rgb(123/255, 176/255, 231/255)

# read the image of pi
img = readPNG("C:/Users/j.schoonemann/Desktop/pi.png")
# read the logo of The Analytics Lab
logo = readPNG("C:/Users/j.schoonemann/Desktop/Lab.png")

# define the x-position of the pie charts
x_position <- c(2, 4, 8, 14, 22)
# define the y-position of the pie charts
y_position <- c(4, 6, 8, 10, 12)
# define the size of the pie charts
pie_size <- c(0.5, 1.0, 1.5, 2.0, 2.5)
# create PacMan pie-charts
pacman <- list(c(20,80), c(20,80), c(20,80), c(20,80), c(20,80))

# calculate the chart limits for the x-axis
x_axis <- c(min(x_position - pie_size), max(x_position + pie_size))
# calculate the chart limits for the y-axis
y_axis <- c(min(y_position - pie_size), max(y_position + pie_size))

# define the colors of the PacMan pie-charts
sector_col <- c("black", "yellow")
# define the startposition of the first slice of the pie in the charts
start_position <- c(-0.1, -0.2, -0.3, -0.4, -0.5)

# create the canvas for the plot
plot(0, xlim = x_axis, ylim = y_axis, type = "n", axes = F, xlab = "", ylab = "")
# add a title and subtitle to the plot, adjust size and color
title(main = "Eating Pi makes PacMan grow!\nHappy pi(e) day!", col.main = lichtblauw, cex.main = 2,
      sub = "Powered by: The Analytics Lab", col.sub = oranje, cex.sub = 1)

# plot all the PacMan pie-charts
for(bubble in 1:length(x_position)){
  floating.pie(xpos = x_position[bubble], ypos = y_position[bubble], x = pacman[[bubble]],
               radius = pie_size[bubble], col = sector_col, startpos = start_position[bubble])
}

# add the logo of The Analytics Lab to the plot
rasterImage(image = logo, xleft = 0, ybottom = 12, xright = 5, ytop = 16)

# add pi multiple times to the plot
# pi between 1st and 2nd
rasterImage(image = img, xleft = 2.5, ybottom = 4.5, xright = 3.5, ytop = 5)
# pi between 2nd and 3rd
rasterImage(image = img, xleft = 5, ybottom = 6.5, xright = 6, ytop = 7)
rasterImage(image = img, xleft = 5.8, ybottom = 7, xright = 6.8, ytop = 7.5)
# pi between 3rd and 4th
rasterImage(image = img, xleft = 10, ybottom = 8.5, xright = 11, ytop = 9)
rasterImage(image = img, xleft = 11, ybottom = 9, xright = 12, ytop = 9.5)
# pi between 4th and 5th
rasterImage(image = img, xleft = 16.2, ybottom = 10, xright = 17.2, ytop = 10.5)
rasterImage(image = img, xleft = 17, ybottom = 10.5, xright = 18, ytop = 11)
rasterImage(image = img, xleft = 18, ybottom = 11, xright = 19, ytop = 11.5)


To leave a comment for the author, please follow the link and comment on their blog: Rbloggers – The Analytics Lab.

4 tricks for working with R, Leaflet and Shiny

Tue, 03/14/2017 - 13:39

I recently worked on a dataviz project involving Shiny and the Leaflet library. In this post I give 4 handy tricks we used to improve the app: 1/ how to use leaflet native widgets 2/ how to trigger an action when the user clicks on the map 3/ how to add a search bar to your map 4/ how to provide a “geolocalize me” button. For each trick, a reproducible code snippet is provided, so you just have to copy and paste it to reproduce the image.

Trick 1: use leaflet native control widget

Shiny is used to add interactivity to your dataviz. Working on maps, it’s great to add a widget to allow users to switch between datasets, using one layer or another… Of course, this can be achieved using a regular RadioButton or any other shiny widget, building a new map each time.

But Leaflet enables you to build the control widget directly while making the map. Every layer you create must be added to the same map and attributed to a “group”. The control widget will then allow you to switch from one group to another. The code below should help you understand the concept, and everything is explained in more detail in the amazing RStudio tutorial.

Trick 2: make a graph depending on click position

It is possible to code the map so that clicking on a certain point opens particular information. In this example, the user chooses a marker on the map, which makes either a barplot OR a scatterplot. By adding an ID to every marker of the map, using the layerId argument, the click information is saved in a reactive value that can be used to run specific functions!

Trick 3: add a search bar

The ggmap library allows you to get the coordinates of any place in the world based on a google search. You can use this feature to create a search bar in addition to the leaflet map! The textInput is passed to the geocode function of ggmap. It returns coordinates that are given to leaflet to zoom on the map.

Trick 4: use geolocalization

When the user clicks on the “localize me” button, the leaflet map automatically zooms to the user’s position. This trick uses a Javascript function (but no Javascript knowledge needed). This example is inspired from here.


This post has been first published on the R graph gallery, a website that displays hundreds of R charts, always with the reproducible code! You can follow the gallery on Twitter: @R_Graph_Gallery

Code | Trick1 | Widget

# Load libraries
library(shiny)
library(leaflet)

# Make data with several positions
data_red  <- data.frame(LONG=42+rnorm(10), LAT=23+rnorm(10), PLACE=paste("Red_place_", seq(1,10)))
data_blue <- data.frame(LONG=42+rnorm(10), LAT=23+rnorm(10), PLACE=paste("Blue_place_", seq(1,10)))

# Initialize the leaflet map:
leaflet() %>%
  setView(lng=42, lat=23, zoom=8) %>%
  # Add two tiles
  addProviderTiles("Esri.WorldImagery", group="background 1") %>%
  addTiles(options = providerTileOptions(noWrap = TRUE), group="background 2") %>%
  # Add 2 marker groups
  addCircleMarkers(data=data_red, lng=~LONG, lat=~LAT, radius=8, color="black",
                   fillColor="red", stroke = TRUE, fillOpacity = 0.8, group="Red") %>%
  addCircleMarkers(data=data_blue, lng=~LONG, lat=~LAT, radius=8, color="black",
                   fillColor="blue", stroke = TRUE, fillOpacity = 0.8, group="Blue") %>%
  # Add the control widget
  addLayersControl(overlayGroups = c("Red","Blue"),
                   baseGroups = c("background 1","background 2"),
                   options = layersControlOptions(collapsed = FALSE))


Code | Trick2 | Graph depends on click

library(shiny)
library(leaflet)

server <- function(input, output) {

    # build data with 2 places
    data <- data.frame(x=c(130, 128), y=c(-22,-26), id=c("place1", "place2"))

    # create a reactive value that will store the click position
    data_of_click <- reactiveValues(clickedMarker=NULL)

    # Leaflet map with 2 markers
    output$map <- renderLeaflet({
        leaflet() %>%
          setView(lng=131, lat=-25, zoom=4) %>%
          addTiles(options = providerTileOptions(noWrap = TRUE)) %>%
          addCircleMarkers(data=data, ~x, ~y, layerId=~id, popup=~id, radius=8,
                           color="black", fillColor="red", stroke = TRUE, fillOpacity = 0.8)
    })

    # store the click
    observeEvent(input$map_marker_click, {
        data_of_click$clickedMarker <- input$map_marker_click
    })

    # Make a barplot or scatterplot depending on the selected point
    output$plot <- renderPlot({
        my_place <- data_of_click$clickedMarker$id
        if(is.null(my_place)){ my_place <- "place1" }
        if(my_place == "place1"){
            plot(rnorm(1000), col=rgb(0.9,0.4,0.1,0.3), cex=3, pch=20)
        }else{
            barplot(rnorm(10), col=rgb(0.1,0.4,0.9,0.3))
        }
    })
}

ui <- fluidPage(
    br(),
    column(8, leafletOutput("map", height="600px")),
    column(4, br(), br(), br(), br(), plotOutput("plot", height="300px")),
    br()
)

shinyApp(ui = ui, server = server)

Code | Trick3 | Search Bar

library(shiny)
library(leaflet)
library(ggmap)

server <- function(input, output) {
    output$map <- renderLeaflet({

        # Get latitude and longitude
        if(input$target_zone == "Ex: Bamako"){
            ZOOM <- 2
            LAT <- 0
            LONG <- 0
        }else{
            target_pos <- geocode(input$target_zone)
            LAT <- target_pos$lat
            LONG <- target_pos$lon
            ZOOM <- 12
        }

        # Plot it!
        leaflet() %>%
          setView(lng=LONG, lat=LAT, zoom=ZOOM) %>%
          addProviderTiles("Esri.WorldImagery")
    })
}

ui <- fluidPage(
    br(),
    leafletOutput("map", height="600px"),
    absolutePanel(top=20, left=70, textInput("target_zone", "", "Ex: Bamako")),
    br()
)

shinyApp(ui = ui, server = server)

Code | Trick4 | Geolocalization

# ==== libraries
library(shiny)
library(leaflet)
library(shinyjs)

# ==== function allowing geolocalisation
jsCode <- '
shinyjs.geoloc = function() {
    navigator.geolocation.getCurrentPosition(onSuccess, onError);
    function onError (err) {
        Shiny.onInputChange("geolocation", false);
    }
    function onSuccess (position) {
        setTimeout(function () {
            var coords = position.coords;
            console.log(coords.latitude + ", " + coords.longitude);
            Shiny.onInputChange("geolocation", true);
            Shiny.onInputChange("lat", coords.latitude);
            Shiny.onInputChange("long", coords.longitude);
        }, 5)
    }
};
'

# ==== server
server <- function(input, output) {
    # Basic map
    output$map <- renderLeaflet({
        leaflet() %>%
          setView(lng=0, lat=0, zoom=2) %>%
          addProviderTiles("Esri.WorldImagery")
    })

    # Find geolocalisation coordinates when user clicks
    observeEvent(input$geoloc, {
        js$geoloc()
    })

    # zoom on the corresponding area
    observe({
        if(!is.null(input$lat)){
            map <- leafletProxy("map")
            dist <- 0.2
            lat <- input$lat
            lng <- input$long
            map %>% fitBounds(lng - dist, lat - dist, lng + dist, lat + dist)
        }
    })
}

# ==== UI
ui <- fluidPage(
    # Tell shiny we will use some Javascript
    useShinyjs(),
    extendShinyjs(text = jsCode),

    # One button and one map
    br(),
    actionButton("geoloc", "Localize me", class="btn btn-primary", onClick="shinyjs.geoloc()"),
    leafletOutput("map", height="600px")
)

shinyApp(ui = ui, server = server)