R news and tutorials contributed by 750 R bloggers

Lesser known purrr tricks

Fri, 03/24/2017 - 12:00

purrr is a package that extends R’s functional programming capabilities. It brings a lot of new tools to the table, and in this post I show you some of the functions I find most useful.

Getting rid of loops with map()

library(purrr)
numbers <- list(11, 12, 13, 14)
map_dbl(numbers, sqrt)
## [1] 3.316625 3.464102 3.605551 3.741657

You might wonder why this is preferable to a for loop. It’s a lot less verbose, and you do not need to initialise any structure to hold the result; if you google “create empty list in R” you will see how common that question is. With the map() family of functions, no initial structure is needed. map_dbl() returns an atomic vector of doubles, whereas map() returns a list. Try them all out!
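For comparison, here is a base-R sketch of the same idea (my addition, not from the original post): vapply() plays the role of map_dbl() by type-checking each result, while lapply() is the list-returning analogue of map().

```r
# vapply() is base R's type-stable cousin of map_dbl(): it insists
# that sqrt() return a single double for every element.
numbers <- list(11, 12, 13, 14)
vapply(numbers, sqrt, numeric(1))

# lapply() is the analogue of map(): it always returns a list.
lapply(numbers, sqrt)
```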

Map conditionally

map_if()

# Create a helper function that returns TRUE if a number is even
is_even <- function(x){
  !as.logical(x %% 2)
}

map_if(numbers, is_even, sqrt)
## [[1]]
## [1] 11
## 
## [[2]]
## [1] 3.464102
## 
## [[3]]
## [1] 13
## 
## [[4]]
## [1] 3.741657

map_at()

map_at(numbers, c(1, 3), sqrt)
## [[1]]
## [1] 3.316625
## 
## [[2]]
## [1] 12
## 
## [[3]]
## [1] 3.605551
## 
## [[4]]
## [1] 14

map_if() and map_at() take one argument more than map(): for map_if(), a predicate function (a function that returns TRUE or FALSE), and for map_at(), a vector of positions. This allows you to apply your function only when certain conditions are met, which is also something a lot of people google for.

Map a function with multiple arguments

numbers2 <- list(1, 2, 3, 4)
map2(numbers, numbers2, `+`)
## [[1]]
## [1] 12
## 
## [[2]]
## [1] 14
## 
## [[3]]
## [1] 16
## 
## [[4]]
## [1] 18

You can map two lists to a function that takes two arguments using map2(). You can even map an arbitrary number of lists to any function using pmap().
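As a point of reference (a sketch of mine, not from the original post), base R’s Map() covers similar ground to map2() and pmap(): it zips any number of lists together and applies a function elementwise, though it lacks purrr’s typed variants.

```r
numbers  <- list(11, 12, 13, 14)
numbers2 <- list(1, 2, 3, 4)

# Map() applies `+` elementwise across both lists, like map2()
Map(`+`, numbers, numbers2)
```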

By the way, try `+`(1, 3) in the console and see what happens.

Don’t stop execution of your function if something goes wrong

possible_sqrt <- possibly(sqrt, otherwise = NA_real_)
numbers_with_error <- list(1, 2, 3, "spam", 4)
map(numbers_with_error, possible_sqrt)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1.414214
## 
## [[3]]
## [1] 1.732051
## 
## [[4]]
## [1] NA
## 
## [[5]]
## [1] 2

Another very common need is to keep a loop running even when something goes wrong. By default, the loop simply stops at the error, but often you would like it to continue and show you where it failed. Google “skip error in a loop” or some variation of it and you’ll see that a lot of people want just that. Combining map() and possibly() makes it easy. Most other solutions involve tryCatch(), which I personally do not find very easy to use.
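For contrast, here is roughly what the tryCatch() version looks like. This is my sketch of the pattern that possibly() wraps up for you, not code from the original post:

```r
# Hand-rolled equivalent of possibly(sqrt, otherwise = NA_real_):
# on error, swallow the condition and return NA instead.
possible_sqrt2 <- function(x) {
  tryCatch(sqrt(x), error = function(e) NA_real_)
}

lapply(list(1, 4, "spam", 9), possible_sqrt2)
```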

Don’t stop execution of your function if something goes wrong and capture the error

safe_sqrt <- safely(sqrt, otherwise = NA_real_)
map(numbers_with_error, safe_sqrt)
## [[1]]
## [[1]]$result
## [1] 1
## 
## [[1]]$error
## NULL
## 
## 
## [[2]]
## [[2]]$result
## [1] 1.414214
## 
## [[2]]$error
## NULL
## 
## 
## [[3]]
## [[3]]$result
## [1] 1.732051
## 
## [[3]]$error
## NULL
## 
## 
## [[4]]
## [[4]]$result
## [1] NA
## 
## [[4]]$error
## <simpleError in .f(...): non-numeric argument to mathematical function>
## 
## 
## [[5]]
## [[5]]$result
## [1] 2
## 
## [[5]]$error
## NULL

safely() is very similar to possibly(), but it returns a list of lists: each element is itself a list holding the result and the accompanying error. If there is no error, the error component is NULL; if there is one, it contains the error message.

Transpose a list

safe_result_list <- map(numbers_with_error, safe_sqrt)
transpose(safe_result_list)
## $result
## $result[[1]]
## [1] 1
## 
## $result[[2]]
## [1] 1.414214
## 
## $result[[3]]
## [1] 1.732051
## 
## $result[[4]]
## [1] NA
## 
## $result[[5]]
## [1] 2
## 
## 
## $error
## $error[[1]]
## NULL
## 
## $error[[2]]
## NULL
## 
## $error[[3]]
## NULL
## 
## $error[[4]]
## <simpleError in .f(...): non-numeric argument to mathematical function>
## 
## $error[[5]]
## NULL

Here we transposed the above list. This means that we still have a list of lists, but now the first list holds all the results (which you can access with transpose(safe_result_list)$result) and the second list holds all the errors (transpose(safe_result_list)$error). This can be quite useful!
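One base-R illustration of why this shape is handy (the variable names below are made up for the example): once results and errors live in parallel lists, keeping just the successful results is simple subsetting.

```r
# Stand-ins for the $result and $error components of a transposed list
results <- list(1, 1.414214, NA, 2)
errors  <- list(NULL, NULL, "non-numeric argument", NULL)

# TRUE where the call succeeded (no error recorded)
succeeded <- vapply(errors, is.null, logical(1))

# Keep only the results whose calls succeeded
results[succeeded]
```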

Apply a function to a lower depth of a list

transposed_list <- transpose(safe_result_list)
transposed_list %>%
  at_depth(2, is_null)
## $result
## $result[[1]]
## [1] FALSE
## 
## $result[[2]]
## [1] FALSE
## 
## $result[[3]]
## [1] FALSE
## 
## $result[[4]]
## [1] FALSE
## 
## $result[[5]]
## [1] FALSE
## 
## 
## $error
## $error[[1]]
## [1] TRUE
## 
## $error[[2]]
## [1] TRUE
## 
## $error[[3]]
## [1] TRUE
## 
## $error[[4]]
## [1] FALSE
## 
## $error[[5]]
## [1] TRUE

Sometimes working with lists of lists can be tricky, especially when we want to apply a function to the sub-lists. This is easily done with at_depth()!

Set names of list elements

name_element <- c("sqrt()", "ok?")
set_names(transposed_list, name_element)
## $`sqrt()`
## $`sqrt()`[[1]]
## [1] 1
## 
## $`sqrt()`[[2]]
## [1] 1.414214
## 
## $`sqrt()`[[3]]
## [1] 1.732051
## 
## $`sqrt()`[[4]]
## [1] NA
## 
## $`sqrt()`[[5]]
## [1] 2
## 
## 
## $`ok?`
## $`ok?`[[1]]
## NULL
## 
## $`ok?`[[2]]
## NULL
## 
## $`ok?`[[3]]
## NULL
## 
## $`ok?`[[4]]
## <simpleError in .f(...): non-numeric argument to mathematical function>
## 
## $`ok?`[[5]]
## NULL

Reduce a list to a single value

reduce(numbers, `*`)
## [1] 24024

reduce() applies the function `*` iteratively to the list of numbers, collapsing it to a single value (11 * 12 * 13 * 14 = 24024). There’s also accumulate():

accumulate(numbers, `*`)
## [1] 11 132 1716 24024

which keeps the intermediary results.
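Base R offers the same pair of behaviours through a single function, Reduce(), whose accumulate argument mirrors purrr’s accumulate() (a comparison sketch of mine, not from the original post):

```r
numbers <- list(11, 12, 13, 14)

# Fold the list down to one value: 11 * 12 * 13 * 14
Reduce(`*`, numbers)

# Keep the intermediary products as well
Reduce(`*`, numbers, accumulate = TRUE)
```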

This function is very general, and you can reduce anything:


mat1 <- matrix(rnorm(10), nrow = 2)
mat2 <- matrix(rnorm(10), nrow = 2)
mat3 <- matrix(rnorm(10), nrow = 2)

list_mat <- list(mat1, mat2, mat3)
reduce(list_mat, `+`)
##            [,1]       [,2]       [,3]       [,4]      [,5]
## [1,] -0.5228188  0.4813357  0.3808749 -1.1678164 0.3080001
## [2,] -3.8330509 -0.1061853 -3.8315768  0.3052248 0.3486929

even data frames:

df1 <- df2 <- df3 <-
list_df <- list(df1, df2, df3)
reduce(list_df, dplyr::full_join)
## Joining, by = c("V1", "V2", "V3", "V4", "V5")
## Joining, by = c("V1", "V2", "V3", "V4", "V5")
##            V1         V2          V3         V4         V5
## 1  0.01587062  0.8570925  1.04330594 -0.5354500  0.7557203
## 2 -0.46872345  0.3742191 -1.88322431  1.4983888 -1.2691007
## 3 -0.60675851 -0.7402364 -0.49269182 -0.4884616 -1.0127531
## 4 -1.49619518  1.0714251  0.06748534  0.6650679  1.1709317
## 5  0.06806907  0.3644795 -0.16973919 -0.1439047  0.5650329
## 6 -1.86813223 -1.5518295 -2.01583786 -1.8582319  0.4468619

Hope you enjoyed this list of useful functions! If you enjoy the content of my blog, you can follow me on twitter.

RApiDatetime 0.0.1

Fri, 03/24/2017 - 02:30

Very happy to announce a new package of mine is now up on the CRAN repository network: RApiDatetime.

It provides six entry points for C-level functions of the R API for Date and Datetime calculations: asPOSIXlt and asPOSIXct convert between the long and compact datetime representations, formatPOSIXlt and Rstrptime convert to and from character strings, and POSIXlt2D and D2POSIXlt convert between Date and POSIXlt datetimes. These six functions are all fairly essential and useful, but none of them was previously exported by R. Hence the need to put them together in this package to complete the accessible API somewhat.

These should be helpful for fellow package authors, as many of us either keep our own partial copies of some of this code or farm the work back out to R to get it done.

As a simple (yet real!) illustration, here is an actual Rcpp function which we could now cover at the C level rather than having to go back up to R (via Rcpp::Function()):

inline Datetime::Datetime(const std::string &s, const std::string &fmt) {
    Rcpp::Function strptime("strptime");    // we cheat and call strptime() from R
    Rcpp::Function asPOSIXct("as.POSIXct"); // and we need to convert to POSIXct
    m_dt = Rcpp::as<double>(asPOSIXct(strptime(s, fmt)));
    update_tm();
}

I had taken a first brief stab at this about two years ago, but never finished it. With the recent emphasis on C-level function registration, coupled with a possible use case from anytime, I more or less put this together last weekend.

It currently builds and tests fine on POSIX-alike operating systems. If someone with some skill and patience in working on Windows would like to help complete the Windows side of things then I would certainly welcome help and pull requests.

For questions or comments please use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

QR Decomposition with the Gram-Schmidt Algorithm

Thu, 03/23/2017 - 21:00

(This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers)

QR decomposition is another technique for decomposing a matrix into a form that is easier to work with in further applications. The QR decomposition technique decomposes a square or rectangular matrix, which we will denote as A, into two components, Q and R.
A = QR

Where Q is an orthogonal matrix, and R is an upper triangular matrix. Recall an orthogonal matrix is a square matrix with orthonormal row and column vectors such that Q^T Q = I, where I is the identity matrix. The term orthonormal implies the vectors are of unit length and are perpendicular (orthogonal) to each other.
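This property is easy to check numerically in R. Here is a small sketch of mine (not from the original article) using a 2-D rotation matrix, which is orthogonal by construction:

```r
theta <- pi / 6
Q <- matrix(c(cos(theta), sin(theta),
              -sin(theta), cos(theta)), nrow = 2)

# Q^T Q should equal the identity matrix, up to floating point
all.equal(t(Q) %*% Q, diag(2))
```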

QR decomposition is often used in linear least squares estimation and is, in fact, the method used by R in its lm() function. Signal processing and MIMO systems also employ QR decomposition. There are several methods for performing QR decomposition, including the Gram-Schmidt process, Householder reflections, and Givens rotations. This post is concerned with the Gram-Schmidt process.

The Gram-Schmidt Process

The Gram-Schmidt process is used to find an orthogonal basis from a non-orthogonal basis. An orthogonal basis has many properties that are desirable for further computations and expansions. As noted previously, an orthogonal matrix has row and column vectors of unit length:

||a_n|| = \sqrt{a_n \cdot a_n} = \sqrt{a_n^T a_n} = 1

Where a_n is a linearly independent column vector of a matrix. In an orthogonal basis, the vectors are also perpendicular to each other. The Gram-Schmidt process works column by column: from each column vector a_n it subtracts the projections onto the previously computed vectors q_j, and the resulting vector is then divided by its length to produce a unit vector.

Consider a matrix A with n column vectors such that:

A = \left[ a_1 | a_2 | \cdots | a_n \right]

The Gram-Schmidt process proceeds by finding the orthogonal projection of the first column vector a_1.

v_1 = a_1, \qquad e_1 = \frac{v_1}{||v_1||}

Because a_1 is the first column vector, there are no preceding projections to subtract. From the second column a_2, we subtract its projection onto the first unit vector:

v_2 = a_2 - proj_{v_1} (a_2) = a_2 - (a_2 \cdot e_1) e_1, \qquad e_2 = \frac{v_2}{||v_2||}

This process continues up to the n column vectors, where each incremental step k + 1 is computed as:

v_{k+1} = a_{k+1} - (a_{k+1} \cdot e_1) e_1 - \cdots - (a_{k+1} \cdot e_k) e_k, \qquad e_{k+1} = \frac{v_{k+1}}{||v_{k+1}||}

The || \cdot || is the L_2 norm which is defined as:

||v_k|| = \sqrt{\sum^m_{j=1} v_{kj}^2}

The projection of a vector a onto a unit vector e can also be written as:

proj_{e} (a) = (a \cdot e) e

Thus the matrix A can be factorized into the QR matrix as the following:

A = \left[a_1 | a_2 | \cdots | a_n \right] = \left[e_1 | e_2 | \cdots | e_n \right] \begin{bmatrix}a_1 \cdot e_1 & a_2 \cdot e_1 & \cdots & a_n \cdot e_1 \\\ 0 & a_2 \cdot e_2 & \cdots & a_n \cdot e_2 \\\ \vdots & \vdots & & \vdots \\\ 0 & 0 & \cdots & a_n \cdot e_n\end{bmatrix} = QR

Gram-Schmidt Process Example

Consider the matrix A:

\begin{bmatrix} 2 & -2 & 18 \\\ 2 & 1 & 0 \\\ 1 & 2 & 0 \end{bmatrix}

We would like to orthogonalize this matrix using the Gram-Schmidt process. The resulting orthogonalized vector is also equivalent to Q in the QR decomposition.

The Gram-Schmidt process on the matrix A proceeds as follows:

v_1 = a_1 = \begin{bmatrix}2 \\\ 2 \\\ 1\end{bmatrix} \qquad e_1 = \frac{v_1}{||v_1||} = \frac{\begin{bmatrix}2 \\\ 2 \\\ 1\end{bmatrix}}{\sqrt{\sum{\begin{bmatrix}2 \\\ 2 \\\ 1\end{bmatrix}^2}}} = \begin{bmatrix} \frac{2}{3} \\\ \frac{2}{3} \\\ \frac{1}{3} \end{bmatrix}
v_2 = a_2 - (a_2 \cdot e_1) e_1 = \begin{bmatrix}-2 \\\ 1 \\\ 2\end{bmatrix} - \left(\begin{bmatrix}-2 \\\ 1 \\\ 2\end{bmatrix} \cdot \begin{bmatrix} \frac{2}{3} \\\ \frac{2}{3} \\\ \frac{1}{3} \end{bmatrix}\right)\begin{bmatrix} \frac{2}{3} \\\ \frac{2}{3} \\\ \frac{1}{3} \end{bmatrix}
v_2 = \begin{bmatrix}-2 \\\ 1 \\\ 2\end{bmatrix} \qquad e_2 = \frac{v_2}{||v_2||} = \frac{\begin{bmatrix}-2 \\\ 1 \\\ 2\end{bmatrix}}{\sqrt{\sum{\begin{bmatrix}-2 \\\ 1 \\\ 2\end{bmatrix}^2}}}
e_2 = \begin{bmatrix} -\frac{2}{3} \\\ \frac{1}{3} \\\ \frac{2}{3} \end{bmatrix}
v_3 = a_3 - (a_3 \cdot e_1) e_1 - (a_3 \cdot e_2) e_2
v_3 = \begin{bmatrix}18 \\\ 0 \\\ 0\end{bmatrix} - \left(\begin{bmatrix}18 \\\ 0 \\\ 0\end{bmatrix} \cdot \begin{bmatrix} \frac{2}{3} \\\ \frac{2}{3} \\\ \frac{1}{3} \end{bmatrix}\right)\begin{bmatrix} \frac{2}{3} \\\ \frac{2}{3} \\\ \frac{1}{3} \end{bmatrix} - \left(\begin{bmatrix}18 \\\ 0 \\\ 0\end{bmatrix} \cdot \begin{bmatrix} -\frac{2}{3} \\\ \frac{1}{3} \\\ \frac{2}{3} \end{bmatrix} \right)\begin{bmatrix} -\frac{2}{3} \\\ \frac{1}{3} \\\ \frac{2}{3} \end{bmatrix}
v_3 = \begin{bmatrix}2 \\\ -4 \\\ 4 \end{bmatrix} \qquad e_3 = \frac{v_3}{||v_3||} = \frac{\begin{bmatrix}2 \\\ -4 \\\ 4\end{bmatrix}}{\sqrt{\sum{\begin{bmatrix}2 \\\ -4 \\\ 4\end{bmatrix}^2}}}
e_3 = \begin{bmatrix} \frac{1}{3} \\\ -\frac{2}{3} \\\ \frac{2}{3} \end{bmatrix}

Thus, the orthogonalized matrix resulting from the Gram-Schmidt process is:

\begin{bmatrix} \frac{2}{3} & -\frac{2}{3} & \frac{1}{3} \\\ \frac{2}{3} & \frac{1}{3} & -\frac{2}{3} \\\ \frac{1}{3} & \frac{2}{3} & \frac{2}{3} \end{bmatrix}

The component R of the QR decomposition can also be found from the calculations made in the Gram-Schmidt process as defined above.

R = \begin{bmatrix}a_1 \cdot e_1 & a_2 \cdot e_1 & \cdots & a_n \cdot e_1 \\\ 0 & a_2 \cdot e_2 & \cdots & a_n \cdot e_2 \\\ \vdots & \vdots & & \vdots \\\ 0 & 0 & \cdots & a_n \cdot e_n \end{bmatrix} = \begin{bmatrix} \begin{bmatrix} 2 \\\ 2 \\\ 1 \end{bmatrix} \cdot \begin{bmatrix} \frac{2}{3} \\\ \frac{2}{3} \\\ \frac{1}{3} \end{bmatrix} & \begin{bmatrix} -2 \\\ 1 \\\ 2 \end{bmatrix} \cdot \begin{bmatrix} \frac{2}{3} \\\ \frac{2}{3} \\\ \frac{1}{3} \end{bmatrix} & \begin{bmatrix} 18 \\\ 0 \\\ 0 \end{bmatrix} \cdot \begin{bmatrix} \frac{2}{3} \\\ \frac{2}{3} \\\ \frac{1}{3} \end{bmatrix} \\\ 0 & \begin{bmatrix} -2 \\\ 1 \\\ 2 \end{bmatrix} \cdot \begin{bmatrix} -\frac{2}{3} \\\ \frac{1}{3} \\\ \frac{2}{3} \end{bmatrix} & \begin{bmatrix} 18 \\\ 0 \\\ 0 \end{bmatrix} \cdot \begin{bmatrix} -\frac{2}{3} \\\ \frac{1}{3} \\\ \frac{2}{3} \end{bmatrix} \\\ 0 & 0 & \begin{bmatrix} 18 \\\ 0 \\\ 0 \end{bmatrix} \cdot \begin{bmatrix} \frac{1}{3} \\\ -\frac{2}{3} \\\ \frac{2}{3} \end{bmatrix}\end{bmatrix}
R = \begin{bmatrix} 3 & 0 & 12 \\\ 0 & 3 & -12 \\\ 0 & 0 & 6 \end{bmatrix}

The Gram-Schmidt Algorithm in R

We use the same matrix A to verify our results above.

A <- rbind(c(2, -2, 18), c(2, 1, 0), c(1, 2, 0))
A
##      [,1] [,2] [,3]
## [1,]    2   -2   18
## [2,]    2    1    0
## [3,]    1    2    0

The following function is an implementation of the Gram-Schmidt algorithm using the modified version of the algorithm. A good comparison of the classical and modified versions of the algorithm can be found here. The modified Gram-Schmidt algorithm was used due to its improved numerical stability, which yields columns closer to orthogonal than the classical algorithm does.

gramschmidt <- function(x) {
  x <- as.matrix(x)
  # Get the number of rows and columns of the matrix
  n <- ncol(x)
  m <- nrow(x)

  # Initialize the Q and R matrices
  q <- matrix(0, m, n)
  r <- matrix(0, n, n)

  for (j in 1:n) {
    v = x[,j] # Step 1 of the Gram-Schmidt process: v_1 = a_1
    # Skip the first column
    if (j > 1) {
      for (i in 1:(j-1)) {
        r[i,j] <- t(q[,i]) %*% x[,j] # Find the inner product (noted to be q^T a earlier)
        # Subtract the projection from v, which makes v perpendicular to all columns of Q
        v <- v - r[i,j] * q[,i]
      }
    }
    # Find the L2 norm of the jth diagonal of R
    r[j,j] <- sqrt(sum(v^2))
    # The orthogonalized result is found and stored in the jth column of Q
    q[,j] <- v / r[j,j]
  }

  # Collect the Q and R matrices into a list and return
  qrcomp <- list('Q'=q, 'R'=r)
  return(qrcomp)
}

Perform the Gram-Schmidt orthogonalization process on the matrix A using our function.

gramschmidt(A)
## $Q
##           [,1]       [,2]       [,3]
## [1,] 0.6666667 -0.6666667  0.3333333
## [2,] 0.6666667  0.3333333 -0.6666667
## [3,] 0.3333333  0.6666667  0.6666667
## 
## $R
##      [,1] [,2] [,3]
## [1,]    3    0   12
## [2,]    0    3  -12
## [3,]    0    0    6

The results of our function match those of our manual calculations!
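As an extra sanity check (my addition to the walkthrough), we can type in the hand-computed factors and confirm that Q %*% R reproduces A and that Q is orthogonal:

```r
A <- rbind(c(2, -2, 18), c(2, 1, 0), c(1, 2, 0))
Q <- cbind(c(2, 2, 1), c(-2, 1, 2), c(1, -2, 2)) / 3
R <- rbind(c(3, 0, 12), c(0, 3, -12), c(0, 0, 6))

# The factorization should reproduce A (up to floating point)
all.equal(Q %*% R, A)

# And Q should satisfy t(Q) %*% Q = I
all.equal(t(Q) %*% Q, diag(3))
```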

The built-in qr() function also computes the QR decomposition, though it uses Householder reflections rather than the Gram-Schmidt process, which is why some signs are flipped in its output. The qr() function does not output the Q and R matrices directly; they must be extracted by calling qr.Q() and qr.R(), respectively, on the qr object.

A.qr <- qr(A)
A.qr.out <- list('Q'=qr.Q(A.qr), 'R'=qr.R(A.qr))
A.qr.out
## $Q
##            [,1]       [,2]       [,3]
## [1,] -0.6666667  0.6666667  0.3333333
## [2,] -0.6666667 -0.3333333 -0.6666667
## [3,] -0.3333333 -0.6666667  0.6666667
## 
## $R
##      [,1] [,2] [,3]
## [1,]   -3    0  -12
## [2,]    0   -3   12
## [3,]    0    0    6

Thus, up to the signs of the first two columns of Q and rows of R, the qr() function matches our function and manual calculations as well.


The post QR Decomposition with the Gram-Schmidt Algorithm appeared first on Aaron Schlegel.

To leave a comment for the author, please follow the link and comment on their blog: R – Aaron Schlegel. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Announcing R Tools 1.0 for Visual Studio 2015

Thu, 03/23/2017 - 19:16

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Shahrokh Mortazavi, Partner PM, Visual Studio Cloud Platform Tools at Microsoft

I’m delighted to announce the general availability of R Tools 1.0 for Visual Studio 2015 (RTVS). This release will be shortly followed by R Tools 1.0 for Visual Studio 2017 in early May.

RTVS is a free and open source plug-in that turns Visual Studio into a powerful and productive R development environment. Check out this video for a quick tour of its core features:

Core IDE Features

RTVS builds on Visual Studio, which means you get numerous features for free: from using multiple languages to world-class editing and debugging to over 7,000 extensions for every need:

  • A polyglot IDE – VS supports R, Python, C++, C#, Node.js, SQL, etc. projects simultaneously.
  • Editor – complete editing experience for R scripts and functions, including detachable/tabbed windows, syntax highlighting, and much more.
  • IntelliSense – (aka auto-completion) available in both the editor and the Interactive R window.
  • R Interactive Window – work with the R console directly from within Visual Studio.
  • History window – view, search, select previous commands and send to the Interactive window.
  • Variable Explorer – drill into your R data structures and examine their values.
  • Plotting – see all of your R plots in a Visual Studio tool window.
  • Debugging – breakpoints, stepping, watch windows, call stacks and more.
  • R Markdown – R Markdown/knitr support with export to Word and HTML.
  • Git – source code control via Git and GitHub.
  • Extensions – over 7,000 Extensions covering a wide spectrum from Data to Languages to Productivity.
  • Help – use ? and ?? to view R documentation within Visual Studio.

It’s Enterprise-Grade

RTVS includes various features that address the needs of individual as well as Data Science teams, for example:

SQL Server 2016

RTVS integrates with SQL Server 2016 R Services and SQL Server Tools for Visual Studio 2015. These separate downloads enhance RTVS with support for syntax coloring and Intellisense, interactive queries, and deployment of stored procedures directly from Visual Studio.

Microsoft R Client

Use the stock CRAN R interpreter, or the enhanced Microsoft R Client and its ScaleR functions that support multi-core and cluster computing for practicing data science at scale.

Visual Studio Team Services

Integrated support for git, continuous integration, agile tools, release management, testing, reporting, bug and work-item tracking through Visual Studio Team Services. Use our hosted service or host it yourself privately.


Whether it’s data governance, security, or running large jobs on a powerful server, RTVS workspaces enable setting up your own R server or connecting to one in the cloud.

The road ahead

We’re very excited to officially bring another language to the Visual Studio family!  Along with Python Tools for Visual Studio, you have the two main languages for tackling most any analytics and ML related challenge.  In the near future (~May), we’ll release RTVS for Visual Studio 2017 as well. We’ll also resurrect the “Data Science workload” in VS2017 which gives you R, Python, F# and all their respective package distros in one convenient install. Beyond that, we’re looking forward to hearing from you on what features we should focus on next! R package development? Mixed R+C debugging? Model deployment? VS Code/R for cross-platform development? Let us know at the RTVS Github repository!

Thank you!


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Survminer Cheatsheet to Create Easily Survival Plots

Thu, 03/23/2017 - 16:46

We recently released survminer version 0.3, which includes many new features to help in visualizing and summarizing survival analysis results.

In this article, we present a cheatsheet for survminer, created by Przemysław Biecek, and provide an overview of main functions.

survminer cheatsheet

The cheatsheet can be downloaded from STHDA and from RStudio. It contains selected important functions, such as:

  • ggsurvplot() for plotting survival curves
  • ggcoxzph() and ggcoxdiagnostics() for assessing the assumptions of the Cox model
  • ggforest() and ggcoxadjustedcurves() for summarizing a Cox model

Additional functions, that you might find helpful, are briefly described in the next section.

survminer overview

The main functions, in the package, are organized in different categories as follows.

Survival Curves

  • ggsurvplot(): Draws survival curves with the ‘number at risk’ table, the cumulative number of events table and the cumulative number of censored subjects table.

  • arrange_ggsurvplots(): Arranges multiple ggsurvplots on the same page.

  • ggsurvevents(): Plots the distribution of event times.

  • surv_summary(): Summary of a survival curve. Compared to the default summary() function, surv_summary() creates a data frame containing a nice summary from survfit results.

  • surv_cutpoint(): Determines the optimal cutpoint for one or multiple continuous variables at once. Provides the value of a cutpoint that corresponds to the most significant relation with survival.

  • pairwise_survdiff(): Multiple comparisons of survival curves. Calculates pairwise comparisons between group levels with corrections for multiple testing.

Diagnostics of Cox Model

  • ggcoxzph(): Graphical test of proportional hazards. Displays a graph of the scaled Schoenfeld residuals, along with a smooth curve using ggplot2. Wrapper around plot.cox.zph().

  • ggcoxdiagnostics(): Displays diagnostics graphs presenting goodness of Cox Proportional Hazards Model fit.

  • ggcoxfunctional(): Displays graphs of continuous explanatory variables against the martingale residuals of a null Cox proportional hazards model. It helps to properly choose the functional form of a continuous variable in the Cox model.

Summary of Cox Model

  • ggforest(): Draws forest plot for CoxPH model.

  • ggcoxadjustedcurves(): Plots adjusted survival curves for coxph model.

Competing Risks

  • ggcompetingrisks(): Plots cumulative incidence curves for competing risks.

Find out more, and check out the documentation and usage examples of each of the functions, in the survminer package.


This analysis has been performed using R software (ver. 3.3.2).



Make the [R] Kenntnis-Tage 2017 your stage

Thu, 03/23/2017 - 14:17

(This article was first published on eoda english R news, and kindly contributed to R-bloggers)

At the [R] Kenntnis-Tage 2017 on November 8 and 9, 2017, you will get the chance to benefit not only from the exchange about the programming language R in a business context and from practical tutorials, but also from the audience: use the [R] Kenntnis-Tage 2017 as your platform and hand in your topic for the call for papers.

[R] Kenntnis-Tage 2017: Call for Papers

The topics can be as diverse as the event itself: Whether you share your data science use case with the participants, tell them your personal lessons learned with R in the business environment or show how your company uses data science and R. As a speaker at the [R] Kenntnis-Tage 2017, you will get free admission to the event.

Companies present themselves as big data pioneers

At last year’s event, many speakers already took their chance: Julia Flad from Trumpf Laser talked about the application possibilities of data science in the industry and shared her experiences with the participants. In his talk “Working efficiently with R – faster to the data product” Julian Gimbel from Lufthansa Industry Solutions gave a vivid example for working with the open source programming language.

Take your chance to become part of the [R] Kenntnis-Tage 2017 and hand in your topic related to data science or R. For more information, e.g. on the length of the presentation or the choice of topic, visit our website.

You don’t want to give a talk but still want to participate? If you register by May 5, you can profit from our early bird tickets. Data science goes professional – join in.

To leave a comment for the author, please follow the link and comment on their blog: eoda english R news.

The Tidyverse Curse

Thu, 03/23/2017 - 14:08

(This article was first published on R –, and kindly contributed to R-bloggers)

I’ve just finished a major overhaul to my widely read article, Why R is Hard to Learn. It describes the main complaints I’ve heard from the participants to my workshops, and how those complaints can often be mitigated. Here’s the only new section:

The Tidyverse Curse

There’s a common theme in many of the sections above: a task that is hard to perform using a base R function is made much easier by a function in the dplyr package. That package, and its relatives, are collectively known as the tidyverse. Its functions help with many tasks, such as selecting, renaming, or transforming variables, filtering or sorting observations, combining data frames, and doing by-group analyses. dplyr is so helpful that it ranks as the single most popular R package (as of 3/23/2017). As much of a blessing as these commands are, they’re also a curse to beginners, since they add more to learn. The main packages of dplyr, tibble, tidyr, and purrr contain a few hundred functions, though I use “only” around 60 of them regularly. As people learn R, they often comment that base R functions and tidyverse ones feel like two separate languages. The tidyverse functions are often the easiest to use, but not always; its pipe operator is usually simpler to use, but not always; tibbles are usually accepted by non-tidyverse functions, but not always; grouped tibbles may help do what you want automatically, but not always (i.e. you may need to ungroup or group_by higher levels). Navigating the balance between base R and the tidyverse is a challenge to learn.

A demonstration of the mental overhead required to use tidyverse functions involves the usually simple process of printing data. I mentioned this briefly in the Identity Crisis section above. Let’s look at an example using the built-in mtcars data set and R’s built-in print function:

> print(mtcars)
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
...

We see the data, but the variable names actually ran off the top of my screen when viewing the entire data set, so I had to scroll backwards to see what they were. The dplyr package adds several nice new features to the print function. Below, I’m taking mtcars and sending it using the pipe operator “%>%” into dplyr’s as_data_frame function to convert it to a special type of tidyverse data frame called a “tibble” which prints better. From there I send it to the print function (that’s R’s default function, so I could have skipped that step). The output all fits on one screen since it stopped at a default of 10 observations. That allowed me to easily see the variable names that had scrolled off the screen using R’s default print method.  It also notes helpfully that there are 22 more rows in the data that are not shown. Additional information includes the row and column counts at the top (32 x 11), and the fact that the variables are stored in double precision (<dbl>).

> library("dplyr")
> mtcars %>%
+   as_data_frame() %>%
+   print()
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
*  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1   21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
2   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
3   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
4   21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
5   18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
6   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
7   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
9   22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# ... with 22 more rows

The new print format is helpful, but we also lost something important: the names of the cars! It turns out that row names get in the way of the data wrangling that dplyr is so good at, so tidyverse functions replace row names with 1, 2, 3…. However, the names are still available if you use the rownames_to_column() function:

> library("dplyr")
> mtcars %>%
+   as_data_frame() %>%
+   rownames_to_column() %>%
+   print()
Error in function_list[[i]](value) : 
  could not find function "rownames_to_column"

Oops, I got an error message: the function wasn’t found. I had remembered the right command, and the dplyr package did cause the car names to vanish, but the solution lives in the tibble package, which I “forgot” to load. So let’s load that too (dplyr is already loaded, but I’m listing it again here just to make each example stand alone).

> library("dplyr")
> library("tibble")
> mtcars %>%
+   as_data_frame() %>%
+   rownames_to_column() %>%
+   print()
# A tibble: 32 × 12
             rowname   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
               <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1          Mazda RX4  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
2      Mazda RX4 Wag  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
3         Datsun 710  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
4     Hornet 4 Drive  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
5  Hornet Sportabout  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
6            Valiant  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
7         Duster 360  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
8          Merc 240D  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
9           Merc 230  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10          Merc 280  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# ... with 22 more rows

Another way I could have avoided that problem is by loading the package named tidyverse, which includes both dplyr and tibble, but that’s another detail to learn.

In the above output, the row names are back! What if we now decided to save the data for use with a function that would automatically display row names? It would not find them, because they’re now stored in a variable called rowname, not in the row names position! Therefore, we would need to use either the built-in rownames function or the tibble package’s column_to_rownames function to restore the names to their previous position.
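The same round trip can be sketched in base R alone (built-in functions only, so it sidesteps the tibble package entirely; the column name rowname matches what rownames_to_column() uses by default):

```r
# Move mtcars' row names into an ordinary column, as rownames_to_column() does
df <- data.frame(rowname = rownames(mtcars), mtcars, row.names = NULL)
head(df$rowname, 2)  # "Mazda RX4" "Mazda RX4 Wag"

# And restore them to the row-name position, as column_to_rownames() does
restored <- df
rownames(restored) <- restored$rowname
restored$rowname <- NULL
```

After the restore, `restored` is once again an ordinary data frame with the car names back in the row-name slot.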

Most other data science software requires row names to be stored in a standard variable, e.g. rowname. You then supply its name to procedures with something like SAS’s “ID rowname;” statement. That’s less to learn.

This isn’t a defect of the tidyverse, it’s the result of an architectural decision on the part of the original language designers; it probably seemed like a good idea at the time. The tidyverse functions are just doing the best they can with the existing architecture.

Another example of the difference between base R and the tidyverse can be seen when dealing with long text strings. Here I have a data frame in tidyverse format (a tibble). I’m asking it to print the lyrics for the song American Pie. Tibbles normally print in a nicer format than standard R data frames, but for long strings, they only display what fits on a single line:

> songs_df %>%
+   filter(song == "american pie") %>%
+   select(lyrics) %>%
+   print()
# A tibble: 1 × 1
                                                         lyrics
                                                          <chr>
1 a long long time ago i can still remember how that music used

The whole song can be displayed by converting the tibble to a standard R data frame by routing it through the function:

> songs_df %>%
+   filter(song == "american pie") %>%
+   select(lyrics) %>%
+ %>%
+   print()
... <truncated>
1 a long long time ago i can still remember how that music used to make me smile and i knew if i had my chance that i could make those people dance and maybe theyd be happy for a while but february made me shiver with every paper id deliver bad news on the doorstep i couldnt take one more step i cant remember if i cried ...

These examples demonstrate a small slice of the mental overhead you’ll need to deal with as you learn base R and the tidyverse packages, such as dplyr. Since this section has focused on what makes R hard to learn, it may make you wonder why dplyr is the most popular R package. You can get a feel for that by reading the Introduction to dplyr. Putting in the time to learn it is well worth the effort.

To leave a comment for the author, please follow the link and comment on their blog: R – offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

10 Million Dots: Mapping European Population

Thu, 03/23/2017 - 11:02

(This article was first published on R –, and kindly contributed to R-bloggers)

Dot density maps are an increasingly popular way of showing population datasets. The technique has its limitations (see here for more info), but it can create some really nice visual outputs, particularly when trying to show density. I have tried to push this to the limit by using a high-resolution population grid for Europe (GEOSTAT 2011) to plot a dot for every 50 people in Europe (interactive). The detailed data aren’t published for all European countries – modelled values exist for the missing ones, but I have decided to stick to only those countries with actual counts here.

Giant dot density maps with R

I produced this map as part of an experiment to see how R could handle generating millions of dots from spatial data and then plotting them. Even though the final image is 3.8 metres by 3.8 metres – this is as big as I could make it without causing R to crash – many of the grid cells in urban areas are saturated with dots so a lot of detail is lost in those areas. The technique is effective in areas that are more sparsely populated. It would probably work better with larger spatial units that aren’t necessarily grid cells.

Generating the dots (using the spsample function) and then plotting them in one go would require way more RAM than I have access to. Instead I wrote a loop to select each grid cell of population data, generate the requisite number of points and then plot them. This created a much lighter load as my computer happily chugged away for a few hours to produce the plot. I could have plotted a dot for each person (500 million+) but this would have completely saturated the map, so instead I opted for 1 dot for every 50 people.
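The cell-by-cell strategy can be sketched in plain base R with a toy grid (no spatial packages; the three cells, their populations, and the unit-square geometry below are invented purely for illustration):

```r
set.seed(1)
# A hypothetical 3-cell "population grid": each cell is a unit square
cells <- data.frame(xmin = c(0, 1, 2), ymin = 0, pop = c(5000, 120, 900))

dots <- list()
for (i in seq_len(nrow(cells))) {
  # One dot per 50 people, generated cell by cell to keep memory use low
  n <- 1 + round(cells$pop[i] / 50)
  dots[[i]] <- data.frame(
    x = runif(n, cells$xmin[i], cells$xmin[i] + 1),
    y = runif(n, cells$ymin[i], cells$ymin[i] + 1)
  )
}
total <- sum(sapply(dots, nrow))
```

Because each iteration only ever holds one cell’s worth of points, peak memory stays proportional to the densest single cell rather than to the half-billion-person total.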

I have provided full code below. T&Cs on the data mean I can’t share that directly but you can download here.

Load rgdal and the spatial object required. In this case it’s the GEOSTAT 2011 population grid. In addition I am using the rainbow function to generate a rainbow palette for each of the countries.

library(rgdal)
Input <- readOGR(dsn = ".", layer = "SHAPEFILE")
Cols <- data.frame(Country = unique(Input$Country),
                   Cols = rainbow(nrow(data.frame(unique(Input$Country)))))

Create the initial plot. This is empty, but it enables the dots to be added iteratively. I have specified a 380 cm by 380 cm PNG at 150 dpi – this is as big as I could get it without crashing R.

png("europe_density.png", width = 380, height = 380, units = "cm", res = 150)
par(bg = 'black')
plot(t(bbox(Input)), asp = T, axes = F, ann = F)

Loop through each of the individual spatial units – grid cells in this case – and plot the dots for each. The number of dots is specified in spsample as the grid cell’s population value / 50.

for (j in 1:nrow(Cols)) {
  Subset <- Input[Input$Country == Cols[j, 1], ]
  for (i in 1:nrow(Subset@data)) {
    Poly <- Subset[i, ]
    pts1 <- spsample(Poly, n = 1 + round(Poly@data$Pop / 50), "random", iter = 15)
    # The colours are taken from the Cols object.
    plot(pts1, pch = 16, cex = 0.03, add = T, col = as.vector(Cols[j, 2]))
  }
}
# Close the graphics device so the PNG file is written out
dev.off()


To leave a comment for the author, please follow the link and comment on their blog.

New mlr Logo

Thu, 03/23/2017 - 01:00

(This article was first published on mlr-org, and kindly contributed to R-bloggers)

We at mlr are currently deciding on a new logo, and in the spirit of open-source, we would like to involve the community in the voting process!

You can vote for your favorite logo on GitHub by reacting to the logo with a +1.

Thanks to Hannah Atkin for designing the logos!

To leave a comment for the author, please follow the link and comment on their blog: mlr-org.

Euler Problem 17: Number Letter Counts

Wed, 03/22/2017 - 22:09

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

Euler Problem 17 asks us to count the letters in numbers written out as words. This is a skill we all learnt in primary school, mainly useful when writing cheques (for those that still use them).

Each language has its own rules for writing numbers. My native language, Dutch, follows very different logic to English. Both Dutch and English use unique words for the numbers up to twelve. Linguists have theorised that this is evidence that early Germanic numbers were duodecimal. This idea is supported by the importance of a “dozen” as a counting word and the twelve hours on the clock. There is even a Dozenal Society that promotes the use of a number system based on 12.

The English language changes the rules when reaching the number 21. While we say eight-teen in English (unit first), we do not say “one-twenty”. Dutch stays consistent, and the last digit is always spoken first. For example, 37 in English is “thirty-seven”, while in Dutch it is written as “zevenendertig” (seven and thirty).
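The two word orders are easy to mimic in a few lines of R for the numbers 21–99 (a toy sketch with a deliberately minimal vocabulary, not the full spelling function):

```r
units_en <- c("one","two","three","four","five","six","seven","eight","nine")
tens_en  <- c("twenty","thirty","forty","fifty","sixty","seventy","eighty","ninety")
units_nl <- c("een","twee","drie","vier","vijf","zes","zeven","acht","negen")
tens_nl  <- c("twintig","dertig","veertig","vijftig","zestig","zeventig","tachtig","negentig")

n <- 37
t <- n %/% 10  # tens digit
u <- n %%  10  # units digit
word_en <- paste0(tens_en[t - 1], "-", units_en[u])   # English: tens first
word_nl <- paste0(units_nl[u], "en", tens_nl[t - 1])  # Dutch: unit first
word_en  # "thirty-seven"
word_nl  # "zevenendertig"
```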

Euler Problem 17 Definition

If the numbers 1 to 5 are written out in words: one, two, three, four, five, then there are 3 + 3 + 5 + 4 + 4 = 19 letters used in total. If all the numbers from 1 to 1000 (one thousand) inclusive were written out in words, how many letters would be used?

NOTE: Do not count spaces or hyphens. For example, 342 (three hundred and forty-two) contains 23 letters and 115 (one hundred and fifteen) contains 20 letters. The use of “and” when writing out numbers is in compliance with British usage.
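Before writing the full function, the two letter counts quoted in the problem can be verified directly (stripping spaces and hyphens as instructed):

```r
# Count only the letters, dropping spaces and hyphens
count_letters <- function(s) nchar(gsub("[ -]", "", s))

count_letters("three hundred and forty-two")  # 23
count_letters("one hundred and fifteen")      # 20
```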


The first piece of code provides a function that generates the words for numbers 1 to 999,999. This is more than the problem asks for, but it might be a useful function for another application. The last line concatenates all words together and removes the spaces.

numword.en <- function(x) {
  if (x > 999999) return("Error: Outside my vocabulary")
  # Vocabulary
  single <- c("one", "two", "three", "four", "five",
              "six", "seven", "eight", "nine")
  teens <- c("ten", "eleven", "twelve", "thirteen", "fourteen",
             "fifteen", "sixteen", "seventeen", "eighteen", "nineteen")
  tens <- c("ten", "twenty", "thirty", "forty", "fifty",
            "sixty", "seventy", "eighty", "ninety")
  # Translation
  numword.10 <- function(y) {
    a <- y %% 100
    if (a != 0) {
      and <- ifelse(y > 100, "and", "")
      if (a < 20)
        return(c(and, c(single, teens)[a]))
      else
        return(c(and, tens[floor(a / 10)], single[a %% 10]))
    }
  }
  numword.100 <- function(y) {
    a <- (floor(y / 100) %% 100) %% 10
    if (a != 0) return(c(single[a], "hundred"))
  }
  numword.1000 <- function(y) {
    a <- (1000 * floor(y / 1000)) / 1000
    if (a != 0) return(c(numword.100(a), numword.10(a), "thousand"))
  }
  numword <- paste(c(numword.1000(x), numword.100(x), numword.10(x)), collapse = " ")
  return(trimws(numword))
}

answer <- nchar(gsub(" ", "", paste0(sapply(1:1000, numword.en), collapse = "")))
print(answer)

Writing Numbers in Dutch

I went beyond Euler Problem 17 by translating the code to spell numbers in Dutch. An interesting bit of trivia: it takes 307 fewer characters to spell the numbers 1 to 1000 in Dutch than it does in English.

It would be good if other people could submit functions for other languages in the comment section. Perhaps we can create an R package with a multilingual function for spelling numbers.

numword.nl <- function(x) {  # the function name was lost in extraction; "numword.nl" is assumed
  if (x > 999999) return("Error: Getal te hoog.")
  single <- c("een", "twee", "drie", "vier", "vijf",
              "zes", "zeven", "acht", "negen")
  teens <- c("tien", "elf", "twaalf", "dertien", "veertien",
             "vijftien", "zestien", "zeventien", "achttien", "negentien")
  tens <- c("tien", "twintig", "dertig", "veertig", "vijftig",
            "zestig", "zeventig", "tachtig", "negentig")
  numword.10 <- function(y) {
    a <- y %% 100
    if (a != 0) {
      if (a < 20)
        return(c(single, teens)[a])
      else
        return(c(single[a %% 10], "en", tens[floor(a / 10)]))
    }
  }
  numword.100 <- function(y) {
    a <- (floor(y / 100) %% 100) %% 10
    if (a == 1) return("honderd")
    if (a > 1) return(c(single[a], "honderd"))
  }
  numword.1000 <- function(y) {
    a <- (1000 * floor(y / 1000)) / 1000
    if (a == 1) return("duizend ")
    if (a > 0) return(c(numword.100(a), numword.10(a), "duizend "))
  }
  numword <- paste(c(numword.1000(x), numword.100(x), numword.10(x)), collapse = "")
  return(trimws(numword))
}

antwoord <- nchar(gsub(" ", "", paste0(sapply(1:1000, numword.nl), collapse = "")))
print(antwoord)
print(answer - antwoord)

The post Euler Problem 17: Number Letter Counts appeared first on The Devil is in the Data.

To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data.

Data Visualization – Part 2

Wed, 03/22/2017 - 20:50

(This article was first published on R-Projects – Stoltzmaniac, and kindly contributed to R-bloggers)

A Quick Overview of the ggplot2 Package in R

While it will be important to focus on theory, I want to explain the ggplot2 package because I will be using it throughout the rest of this series. Knowing how it works will keep the focus on the results rather than the code. It’s an incredibly powerful package, and once you wrap your head around what it’s doing, your life will change for the better! There are a lot of tools out there which provide better charts, graphs and ease of use (e.g., d3.js, Qlik, Tableau), but ggplot2 is still a fantastic resource and I use it all of the time.

In case you missed it, here’s a link to Data Visualization – Part 1

Why would you use ggplot2?
  1. More robust plotting than the base plot package
  2. Better control over aesthetics – colors, axes, background, etc.
  3. Layering
  4. Variable Mapping (aes)
  5. Automatic aggregation of data
  6. Built in formulas & plotting (geom_smooth)
  7. The list goes on and on…

Basically, ggplot2 allows for a lot more customization of plots with a lot less code (the rest of it is behind the scenes). Once you are used to the syntax, there’s no going back. It’s faster and easier.

Why wouldn’t you use ggplot2?
  1. A bit of a learning curve
  2. Lack of user interactivity with the plots

Fundamentally, ggplot2 gives the user the ability to start a plot and layer everything in. There are many ways to accomplish the same thing, so figure out what makes sense for you and stick to it.

A Basic Example: Unemployment Over Time

library(dplyr)
library(ggplot2)

# Load the economics data from ggplot2
data(economics, package = 'ggplot2')

# Take a look at the format of the data
head(economics)

## # A tibble: 6 × 6
##         date   pce    pop psavert uempmed unemploy
##       <date> <dbl>  <int>   <dbl>   <dbl>    <int>
## 1 1967-07-01 507.4 198712    12.5     4.5     2944
## 2 1967-08-01 510.5 198911    12.5     4.7     2945
## 3 1967-09-01 516.3 199113    11.7     4.6     2958
## 4 1967-10-01 512.9 199311    12.5     4.9     3143
## 5 1967-11-01 518.1 199498    12.5     4.7     3066
## 6 1967-12-01 525.8 199657    12.1     4.8     3018

# Create the plot
ggplot(data = economics) +
  geom_line(aes(x = date, y = unemploy))

What happened to get that?
  • ggplot(economics) loaded the data frame
  • + tells ggplot() that there is more to be added to the plot
  • geom_line() defined the type of plot
  • aes(x = date, y = unemploy) mapped the variables

The aes() portion is what typically throws new users off but is my favorite feature of ggplot2. In simple terms, this is what “auto-magically” brings your plot to life. You are telling ggplot2, “I want ‘date’ to be on the x-axis and ‘unemploy’ to be on the y-axis.” It’s pretty straightforward in this case but there are more complex use cases as well.

Side Note: you could have achieved the same result by mapping the variables in the ggplot() function rather than in geom_line():
ggplot(data = economics, aes(x = date, y = unemploy)) + geom_line()

Here’s the basic formula for success:
  • Everything in ggplot2 starts with ggplot(data) and utilizes + to add on every element thereafter
  • Include your data frame (economics) in a ggplot function: ggplot(data = economics)
  • Input the type of plot you would like (i.e. line chart of unemployment over time): + geom_line(aes(x = date, y = unemploy))
    • “geom” stands for “geometric object” and determines the type of object (there can be more than one type per plot)
    • There are a lot of types of geometric objects – check them out here
  • Add in layers and utilize fill and col parameters within aes()

I’ll go through some of the examples from the Top 50 ggplot2 Visualizations Master List. I will be using their examples but I will also explain what’s going on.

Note: I believe the intention of the author of the Top 50 ggplot2 Visualizations Master List was to illustrate how to use ggplot2 rather than doing a full demonstration of what important data visualization techniques are – so keep that in mind as I go through these examples. Some of the visuals do not line up with my best practices addressed in my first post on data visualization.

As usual, some packages must be loaded.

library(reshape2)
library(lubridate)
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
library(gridExtra)

The Scatterplot

This is one of the most visually powerful tools for data analysis. However, you have to be careful when using it, because it’s primarily used by people doing analysis rather than reporting (depending on what industry you’re in).

The author of this chart was looking for a correlation between area and population.

# Use the "midwest" data from ggplot2
data("midwest", package = "ggplot2")
head(midwest)

## # A tibble: 6 × 28
##     PID    county state  area poptotal popdensity popwhite popblack
##   <int>     <chr> <chr> <dbl>    <int>      <dbl>    <int>    <int>
## 1   561     ADAMS    IL 0.052    66090  1270.9615    63917     1702
## 2   562 ALEXANDER    IL 0.014    10626   759.0000     7054     3496
## 3   563      BOND    IL 0.022    14991   681.4091    14477      429
## 4   564     BOONE    IL 0.017    30806  1812.1176    29344      127
## 5   565     BROWN    IL 0.018     5836   324.2222     5264      547
## 6   566    BUREAU    IL 0.050    35688   713.7600    35157       50
## # ... with 20 more variables: popamerindian <int>, popasian <int>,
## #   popother <int>, percwhite <dbl>, percblack <dbl>, percamerindan <dbl>,
## #   percasian <dbl>, percother <dbl>, popadults <int>, perchsd <dbl>,
## #   percollege <dbl>, percprof <dbl>, poppovertyknown <int>,
## #   percpovertyknown <dbl>, percbelowpoverty <dbl>,
## #   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## #   percelderlypoverty <dbl>, inmetro <int>, category <chr>

Here’s the most basic version of the scatter plot

This can be called by geom_point() in ggplot2

# Scatterplot
ggplot(data = midwest, aes(x = area, y = poptotal)) +
  geom_point()

Here’s a version with some additional features

While the size and color of the points don’t add value here, they do show the level of customization that’s possible with ggplot2.

ggplot(data = midwest, aes(x = area, y = poptotal)) +
  geom_point(aes(col = state, size = popdensity)) +
  geom_smooth(method = "loess", se = F) +
  xlim(c(0, 0.1)) +
  ylim(c(0, 500000)) +
  labs(subtitle = "Area Vs Population",
       y = "Population",
       x = "Area",
       title = "Scatterplot",
       caption = "Source: midwest")


ggplot(data = midwest, aes(x = area, y = poptotal)) +
Inputs the data and maps x and y variables as area and poptotal.

geom_point(aes(col=state, size=popdensity)) +
Creates a scatterplot and maps the color and size of points to state and popdensity.

geom_smooth(method="loess", se=F) +
Creates a smoothing curve to fit the data. method is the type of fit and se determines whether or not to show the confidence band around it.

xlim(c(0, 0.1)) +
Sets the x-axis limits.

ylim(c(0, 500000)) +
Sets the y-axis limits.

labs(subtitle="Area Vs Population", y="Population", x="Area", title="Scatterplot", caption = "Source: midwest")
Changes the labels of the subtitle, y-axis, x-axis, title and caption.

Notice that the legend was automatically created and placed on the left-hand side. This is also highly customizable and can be changed easily.

The Density Plot

Density plots are a great way to see how data are distributed. They are similar to histograms, but instead of raw counts they show each distribution as a smooth curve scaled so that its total area is 1, which makes groups of different sizes comparable. In this example, the author used the mpg data set and is looking at the different distributions of city mileage based on the number of cylinders the car has.
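That scaling can be checked with base R's density() function, using the built-in mtcars data so the snippet stands alone (mpg ships with ggplot2): the estimated curve encloses an area of roughly 1.

```r
# Kernel density estimate of miles per gallon
d <- density(mtcars$mpg)

# Approximate the area under the curve with the trapezoid rule
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
round(area, 2)  # close to 1
```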

# Examine the mpg data set
head(mpg)

## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl      trans   drv   cty   hwy    fl
##          <chr> <chr> <dbl> <int> <int>      <chr> <chr> <int> <int> <chr>
## 1         audi    a4   1.8  1999     4   auto(l5)     f    18    29     p
## 2         audi    a4   1.8  1999     4 manual(m5)     f    21    29     p
## 3         audi    a4   2.0  2008     4 manual(m6)     f    20    31     p
## 4         audi    a4   2.0  2008     4   auto(av)     f    21    30     p
## 5         audi    a4   2.8  1999     6   auto(l5)     f    16    26     p
## 6         audi    a4   2.8  1999     6 manual(m5)     f    18    26     p
## # ... with 1 more variables: class <chr>

Sample Density Plot

g = ggplot(mpg, aes(cty))
g + geom_density(aes(fill = factor(cyl)), alpha = 0.8) +
  labs(title = "Density plot",
       subtitle = "City Mileage Grouped by Number of cylinders",
       caption = "Source: mpg",
       x = "City Mileage",
       fill = "# Cylinders")

You’ll notice one immediate difference here. The author decided to create the object g to hold ggplot(mpg, aes(cty)) – this is a nice trick and will save you some time if you plan on keeping ggplot(mpg, aes(cty)) as the fundamental plot and simply exploring other visualizations on top of it. It is also handy if you need to save the output of a chart to an image file.

ggplot(mpg, aes(cty)) loads the mpg data and aes(cty) assumes aes(x = cty)

g + geom_density(aes(fill=factor(cyl)), alpha=0.8) +
geom_density kicks off a density plot and the mapping of cyl is used for the fill colors. alpha is the transparency/opacity of the area under the curve.

labs(title="Density plot", subtitle="City Mileage Grouped by Number of cylinders", caption="Source: mpg", x="City Mileage", fill="# Cylinders")
Labeling is cleaned up at the end.

How would you use your new knowledge to see the density by class instead of by number of cylinders?

**Hint:** g = ggplot(mpg, aes(cty)) has already been established.

g + geom_density(aes(fill = factor(class)), alpha = 0.8) +
  labs(title = "Density plot",
       subtitle = "City Mileage Grouped by Class",
       caption = "Source: mpg",
       x = "City Mileage",
       fill = "Class")

Notice how I didn’t have to write out ggplot() again because it was already stored in the object g.

The Histogram

How could we show the city mileage in a histogram?

g = ggplot(mpg, aes(cty))
g + geom_histogram(bins = 20) +
  labs(title = "Histogram",
       caption = "Source: mpg",
       x = "City Mileage")

geom_histogram(bins=20) plots the histogram. If bins isn’t set, ggplot2 will automatically pick a default for you.

The Bar/Column Chart

For all intents and purposes, bar and column charts are essentially the same. Technically, the term “column chart” is used when the bars run vertically. The author of this chart was simply looking at the frequency of the vehicles listed in the data set.

# Data Preparation
freqtable <- table(mpg$manufacturer)
# The conversion call was lost in extraction; as.data.frame() is assumed,
# since it produces the Var1/Freq columns shown below.
df <- as.data.frame(freqtable)
head(df)

## Var1 Freq ## 1 audi 18 ## 2 chevrolet 19 ## 3 dodge 37 ## 4 ford 25 ## 5 honda 9 ## 6 hyundai 14

# Set a theme
theme_set(theme_classic())

g <- ggplot(df, aes(Var1, Freq))
g + geom_bar(stat = "identity", width = 0.5, fill = "tomato2") +
  labs(title = "Bar Chart",
       subtitle = "Manufacturer of vehicles",
       caption = "Source: Frequency of Manufacturers from 'mpg' dataset") +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6))

The addition of theme_set(theme_classic()) adds a preset theme to the chart. You can create your own or select from a large list of themes. This can help set your work apart from others and save a lot of time.

However, theme_set() is different from the theme(axis.text.x = element_text(angle=65, vjust=0.6)) call used inside the plot itself. In this case, the author decided to tilt the text along the x-axis; vjust=0.6 changes how far it is spaced away from the axis line.

Within geom_bar() there is another new piece of information: stat="identity", which tells ggplot to use the actual value of Freq rather than counting rows.

You may also notice that ggplot arranged all of the data in alphabetical order by manufacturer. If you want to change the order, it’s best to use the reorder() function. This next chart will use reorder() on Freq and coord_flip() to orient the chart differently.
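reorder() comes from base R's stats package, so its effect on factor levels can be inspected without plotting anything (the counts below are a small made-up frequency table):

```r
# Hypothetical manufacturer counts
counts <- c(audi = 18, dodge = 37, honda = 9)

# reorder() re-sorts the factor levels by the second argument (ascending)
f <- reorder(factor(names(counts)), counts)
levels(f)  # "honda" "audi" "dodge"
```

ggplot2 draws categories in level order, which is why reordering the factor reorders the bars.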

g <- ggplot(df, aes(reorder(Var1, Freq), Freq))
g + geom_bar(stat = "identity", width = 0.5, fill = "tomato2") +
  labs(title = "Bar Chart",
       x = 'Manufacturer',
       subtitle = "Manufacturer of vehicles",
       caption = "Source: Frequency of Manufacturers from 'mpg' dataset") +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6)) +
  coord_flip()

Let’s continue with bar charts – what if we wanted to see what hwy looked like by manufacturer and in terms of cyl?

g = ggplot(mpg, aes(x = manufacturer, y = hwy, col = factor(cyl), fill = factor(cyl)))
g + geom_bar(stat = 'identity', position = 'dodge') +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6))

position='dodge' had to be used because the default setting is to stack the bars; 'dodge' places them side by side for comparison.

Despite the fact that the chart did what I wanted, it is very difficult to read due to how many manufacturers there are. This is where the facet_wrap() feature comes in handy.

theme_set(theme_bw())
g = ggplot(mpg, aes(x = factor(cyl), y = hwy, col = factor(cyl), fill = factor(cyl)))
g + geom_bar(stat = 'identity', position = 'dodge') +
  facet_wrap(~manufacturer)

This created a much nicer view of the information. It “auto-magically” split everything out by manufacturer!

Spatial Plots

Another nice feature of ggplot2 is the integration with maps and spatial plotting. In this simple example, I wanted to plot a few cities in Colorado and draw a border around them. Other than the addition of the map, ggplot simply places the dots directly on the locations via their longitude and latitude “auto-magically.”

This map is created with ggmap, which utilizes the Google Maps API.

library(ggmap)
library(ggalt)

foco <- geocode("Fort Collins, CO")  # get longitude and latitude

# Get the Map ----------------------------------------------
colo_map <- qmap("Colorado, United States", zoom = 7, source = "google")

# Get Coordinates for Places ---------------------
colo_places <- c("Fort Collins, CO",
                 "Denver, CO",
                 "Grand Junction, CO",
                 "Durango, CO",
                 "Pueblo, CO")
places_loc <- geocode(colo_places)  # get longitudes and latitudes

# Plot Open Street Map -------------------------------------
colo_map +
  geom_point(aes(x = lon, y = lat),
             data = places_loc, alpha = 0.7, size = 7, color = "tomato") +
  geom_encircle(aes(x = lon, y = lat),
                data = places_loc, size = 2, color = "blue")

Final Thoughts

I hope you learned a lot about the basics of ggplot2 here. It’s extremely powerful yet easy to use once you get the hang of it. The best way to really learn it is to try it out: find some data on your own and try to manipulate it and get it plotted. Without a doubt, you will run into all kinds of errors – data you expect to be plotted won’t show up, colors and fills will come out differently, etc. However, your visualizations will be leveled-up!

Coming soon:
  • Determining whether or not you need a visualization
  • Choosing the type of plot to use depending on the use case
  • Visualization beyond the standard charts and graphs

I made some modifications to the code, but almost all of the examples here were from Top 50 ggplot2 Visualizations – The Master List.

As always, the code used in this post is on my GitHub

To leave a comment for the author, please follow the link and comment on their blog: R-Projects – Stoltzmaniac.

Datashader is a big deal

Wed, 03/22/2017 - 19:38

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

I recently got back from Strata West 2017 (where I ran a very well received workshop on R and Spark). One thing that really stood out for me at the exhibition hall was Bokeh plus datashader from Continuum Analytics.

I had the privilege of having Peter Wang himself demonstrate datashader for me and answer a few of my questions.

I am so excited about datashader capabilities I literally will not wait for the functionality to be exposed in R through rbokeh. I am going to leave my usual knitr/rmarkdown world and dust off Jupyter Notebook just to use datashader plotting. This is worth trying, even for diehard R users.


Every plotting system has two important ends: the grammar where you specify the plot, and the rendering pipeline that executes the presentation. Switching plotting systems means switching how you specify plots and can be unpleasant (this is one of the reasons we wrap our most re-used plots in WVPlots to hide or decouple how the plots are specified from the results you get). Given the convenience of the ggplot2 grammar, I am always reluctant to move to other plotting systems unless they bring me something big (and even then sometimes you don’t have to leave: for example the absolutely amazing adapter plotly::ggplotly).

Currently, to use datashader you must talk directly to Python and Bokeh (i.e. learn a different language). But what that buys you is massive: in-pixel analytics. Let me clarify that.

datashader makes points and pixels first class entities in the graphics rendering pipeline. It admits they exist (many plotting systems render to an imaginary infinite resolution abstract plane) and allows the user to specify scale dependent calculations and re-calculations over them. It is easiest to show by example.

Please take a look at these stills from the datashader US Census example. We can ask pixels to be colored by the majority race in the region of Lake Michigan:

If we were to use the interactive version of this graph we could zoom in on Chicago and the majorities are re-calculated based on the new scale:

What is important to understand is that this is vastly more powerful than zooming in on a low-resolution rendering:

and even more powerful than zooming out on a static high-resolution rendering:

datashader can redo aggregations and analytics on the fly. It can recompute histograms and renormalize them relative to what is visible to maintain contrast. It can find patterns that emerge as we change scale: think of zooming in on a grey pixel that resolves into a black and white checkerboard.
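The aggregation step at the heart of this can be sketched in base R: bin points onto a coarse "pixel" grid and count per bin, then re-bin only the visible subset when the viewport changes. (The random data and 8×8 grid are invented for illustration; this mimics only datashader's count aggregation, not its rendering or normalization.)

```r
set.seed(42)
pts <- data.frame(x = rnorm(10000), y = rnorm(10000))

# Aggregate counts onto an nx-by-ny "pixel" grid over a viewport
shade <- function(pts, xlim, ylim, nx = 8, ny = 8) {
  inside <- pts$x >= xlim[1] & pts$x <= xlim[2] &
            pts$y >= ylim[1] & pts$y <= ylim[2]
  px <- cut(pts$x[inside], seq(xlim[1], xlim[2], length.out = nx + 1),
            include.lowest = TRUE)
  py <- cut(pts$y[inside], seq(ylim[1], ylim[2], length.out = ny + 1),
            include.lowest = TRUE)
  table(px, py)  # one count per pixel
}

full <- shade(pts, c(-3, 3), c(-3, 3))  # the whole picture
zoom <- shade(pts, c(0, 1), c(0, 1))    # "zooming in" re-aggregates at the new scale
```

Each zoom recomputes the per-pixel counts from the raw points, which is exactly why patterns can emerge that a static rendering would have frozen at one resolution.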

You need to run datashader to really see the effect. The html exports, while interactive, sometimes do not correctly perform in all web browsers.

An R example

I am going to share a simple datashader example here. Again, to see the full effect you would have to copy it into a Jupyter notebook and run it. But I will use it to show my point.

After going through the steps to install Anaconda and Jupyter notebook (plus some more conda install steps to include the necessary packages), we can make a plot of ggplot2’s example data set diamonds.

ggplot2 renderings of diamonds typically look like the following (and show off the power and convenience of the grammar):

A datashader rendering looks like the following:

If we use the interactive rectangle selector to zoom in on the apparently isolated point around $18300 and 3.025 carats we get the following dynamic re-render:

Notice the points shrunk (and didn’t subdivide) and there are some extremely faint points. There is something wrong with that as a presentation; but it isn’t because of datashader! It is something unexpected in the data which is now jumping out at us.

datashader is shading proportional to aggregated count. So the small point staying very dark (and being so dark it causes other points to render near transparent) means there are multiple observations in this tiny neighborhood. Going back to R we can look directly at the data:

> library("dplyr")
> diamonds %>% filter(carat >= 3, carat <= 3.05, price >= 18200, price <= 18400)
# A tibble: 5 × 10
  carat     cut color clarity depth table price     x     y     z
  <dbl>   <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  3.01 Premium     I     SI2  60.2    59 18242  9.36  9.31  5.62
2  3.01    Fair     I     SI2  65.8    56 18242  8.99  8.94  5.90
3  3.01    Fair     I     SI2  65.8    56 18242  8.99  8.94  5.90
4  3.01    Good     I     SI2  63.9    60 18242  9.06  9.01  5.77
5  3.01    Good     I     SI2  63.9    60 18242  9.06  9.01  5.77

There are actually 5 rows with the exact carat and price indicated by the chosen point. The point stood out at fine scale because it indicated something subtle in the data (repetitions) that the analyst may not have known about or expected. The “ugly” presentation was an important warning. Staying hands-on with the data is the quickest path to correct results.
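The same repeat check can be run without dplyr. The data frame below is a hypothetical stand-in mirroring the five filtered rows above; base R's duplicated() flags rows that exactly repeat an earlier row.

```r
# Base-R check for exact repeats; `d` is a small stand-in for the
# filtered diamonds rows shown above (hypothetical mini data frame).
d <- data.frame(
  carat = c(3.01, 3.01, 3.01, 3.01, 3.01),
  cut   = c("Premium", "Fair", "Fair", "Good", "Good"),
  price = c(18242, 18242, 18242, 18242, 18242)
)
sum(duplicated(d))   # rows that exactly repeat an earlier row: 2
d[duplicated(d), ]   # the repeated observations themselves
```

This is the kind of quick sanity check worth running whenever a single "point" renders with unexpectedly heavy shading.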

For some web browsers, you don’t always see proper scaling, yielding artifacts like the following:

The Jupyter notebooks always work, and web browsers usually work (I am assuming it is security or ad-blocking that is causing the effect, not a datashader issue).


datashader brings resolution-dependent per-pixel analytics to production. This is a very powerful style of interaction that is going to appear in more and more places. It is something that the Continuum Analytics team has written about before, and it requires some interesting cross-compiling (Numba) to implement at scale. Now that analysts have seen this in action they are going to want it and ask for it.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Running your R code on Azure with mrsdeploy

Wed, 03/22/2017 - 18:09

(This article was first published on Revolutions, and kindly contributed to R-bloggers)


by John-Mark Agosta, data scientist manager at Microsoft

Let’s say you’ve built a model in R that is larger than you can conveniently run locally, and you want to take advantage of Azure’s resources simply to run it on a larger machine. This blog explains how to provision and run an Azure virtual machine (VM) for this, using the mrsdeploy library that comes installed with Microsoft’s R Server. We will work specifically with the Ubuntu Linux version of the VM, so you’ll need to be familiar with working with superuser privileges at the command line in Linux, and of course, familiar with R.

The fundamental architecture consists of your local machine as the client, for which you create a server machine in the Cloud. You’ll set up a service on the remote machine — the one in the cloud. Once you do this, you needn’t interact directly with the remote machine; instead you issue commands to it and see the results returned at the client. This is one approach; there are many ways this can be done in Azure, depending on your choice of language, reliance on coding, capabilities of the service, and complexity and scale of the task. A data scientist typically works first interactively to explore data on an individual machine, then puts the model thus built into production at scale — in this example, in the Cloud. The purpose of this posting is to clarify the deployment process or, as it is called in a mouthful, operationalization. In short, using a VM running the mrsdeploy library in R Server lets you operationalize your code with little effort, at modest expense.

Alternatively, instead of setting up a service with R Server, one could, unadvisedly, just provision a bare virtual machine and log into it as one would any remote machine, with the manual encumbrance of having to work with multiple machines, load application software, and move data and code back and forth. But that’s what we avoid. The point of the Cloud is making large data and compute as much as possible like working on your local computer.

Deploying Microsoft R Server (MRS) on an Azure VM

Azure Marketplace offers a Linux VM (Ubuntu version 16.04) preconfigured with R Server 2016. Additionally the Linux VM with R Server comes with mrsdeploy, a new R package for establishing a remote session in a console application and for publishing and managing a web service written in R. In order to use the R Server’s deployment and operationalization features, one needs to configure R Server for operationalization after installation, to act as a deployment server and host analytic web services.

Alternatively, there are other Azure platforms for operationalization using R Server in the Marketplace, with other operating systems and platforms, including HDInsight, Microsoft’s Hadoop offering. Or, equivalently, one could use the Data Science VM available in the Marketplace, since it has a copy of R Server installed. Configuration of these platforms is similar to the example covered in this posting.

Provisioning an R Server VM, as referenced in the documentation, takes a few steps that are detailed here, which consist of configuring the VM and setting up the server account to authorize remote access. To set up the server you’ll use the system account you set up as a user of the Linux machine. The server account is used for client interaction with the R Server, and should not be confused with the Linux system account. This is a major difference from the Windows version of the R Server VM, which uses Active Directory services for authentication.

Provisioning a machine from the Marketplace

You will want to install an Ubuntu Marketplace VM with R Server preinstalled. The best way to find it is to search for “r server”:

R Server in the Marketplace

Select the Ubuntu version. Do a conventional deployment—let’s say you name yours mymrs. Take note of the mymrs-ip public address and the mymrs-nsg network security group resources created for it, since you will want to customize them.

Log in to the VM using the system account you set up in the Portal, and add these aliases: one for the path to the version of the R executable, MRS (aka Revo64), and one for the mrsdeploy menu-driven administration tool.

alias rserver='/usr/local/bin/Revo64-9.0'
alias radmin='sudo /usr/local/bin/dotnet \
  /usr/lib64/microsoft-deployr/9.0.1/Microsoft.DeployR.Utils.AdminUtil/Microsoft.DeployR.Utils.AdminUtil.dll'

The following are a set of steps to bring up on the VM a combined web-compute server (a “one-box” server) that can be accessed remotely.

1. Check if you can run Microsoft R Server (MRS).

Just use the alias for MRS

$ rserver
[Note a line in the banner saying "Loading Microsoft R Server packages, ..."]

Here’s a simple test that MRS runs and that the MRS libraries (the “rx” functions) are preloaded.

> rxSummary(formula = ~., data = iris)

2. Set up the MRS server for mrsdeploy

mrsdeploy operationalization runs two services: the web node and one or more compute nodes. In the simplest configuration, the one described here, both “nodes” are services running on the same VM. Alternatively, by making these separate, larger loads can be handled with one web node and one or more compute nodes.

Use the alias you created for the admin tool.

$ radmin

This utility brings up a menu

*************************************
Administration Utility (v9.0.1)
*************************************
1. Configure R Server for Operationalization
2. Set a local admin password
3. Stop and start services
4. Change service ports
5. Encrypt credentials
6. Run diagnostic tests
7. Evaluate capacity
8. Exit

Web node endpoint: http://localhost:12800/
Please enter an option: 1
Set the admin password: *************
Confirm this password: *************

Configuration for Operationalization:
A. One-box (web + compute nodes)
B. Web node
C. Compute node
D. Reset machine to default install state
E. Return to main menu
Please enter an option: A
Success! Web node running (PID: 4172)
Success! Compute node running (PID: 4172)

At this point the setup should be complete. Running diagnostics with the admin tool can check that it is.

Run Diagnostic Tests:
A. Test Configuration
Please enter an option: 6
Preparing to run diagnostics...
***********************
DIAGNOSTIC RESULTS:
***********************
Overall Health: pass
Web Node Details:
  Logs: /usr/lib64/microsoft-deployr/9.0.1/Microsoft.DeployR.Server.WebAPI/logs
  Available compute nodes: 1
Compute Node Details:
  Health of 'http://localhost:12805/': pass
  Logs: /usr/lib64/microsoft-deployr/9.0.1/Microsoft.DeployR.Server.BackEnd/logs
Authentication Details:
  A local admin account was found. No other form of authentication is configured.
Database Details:
  Health: pass
  Type: sqlite

Code Execution Test: PASS
Code: ‘y <- cumprod(c(1500, 1+(rnorm(n=25,mean=.05, sd = 1.4)/100)))’

Yes, it even tests that the MRS interpreter runs! If the web or compute service has stopped, the following test will complain loudly. Note the useful links to the log directories for failure details. Services can be stopped and started from selection 3 in the top-level menu.

Run Diagnostic Tests:
B. Raw Server Status
**********************
SERVICE STATE (raw):
**********************
Please authenticate...
Username: admin
Password: *************
Server:
  Health: pass
  Details:
    logPath: /usr/lib64/microsoft-deployr/9.0.1/Microsoft.DeployR.Server.WebAPI/logs
backends:
  Health: pass
  http://localhost:12805/:
    Health: pass
    Details:
      maxPoolSize: 80
      activeShellCount: 1
      currentPoolSize: 5
      logPath: /usr/lib64/microsoft-deployr/9.0.1/Microsoft.DeployR.Server.BackEnd/logs
database:
  Health: pass
  Details:
    type: sqlite
    name: main
    state: Open

3. Verify that the MRS server is running from the server Linux prompt

The R Server web services can also be checked by looking at the machine’s open ports, without going into the admin tool. This command reveals the ports the Linux machine is listening on:

$ netstat -tupln
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp  0 0* LISTEN 42527/mdsd
tcp  0 0* LISTEN 2001/mdsd
tcp  0 0* LISTEN 1265/sshd
tcp  0 0* LISTEN 55348/Rserve
tcp  0 0* LISTEN 55348/Rserve
tcp6 0 0 :::12805 :::* LISTEN 55327/dotnet
tcp6 0 0 :::22    :::* LISTEN 1265/sshd
tcp6 0 0 :::12800 :::* LISTEN 55285/dotnet
udp  0 0* 1064/dhclient

We can see that port 12800 is active for the web service. 12805 is the compute server, running here on the same machine as the web service.

The next thing to do is see if you can connect to the service with R Server running locally, and load mrsdeploy.

4. Check the MRS server is running by logging in from the server itself.

Do this by running a remote mrsdeploy session from the server as localhost. This is the way one would “run MRS as R Client,” even though the full set of MRS features is available. Running MRS as both a client and a server on the same machine is possible, but I see no purpose other than to test that the web service is accessible. The sequence of steps is:

$ rserver
[ MRS banner...]
> endpoint <- "localhost:12800"   # The forum shows this format for logins.
> library(mrsdeploy)
> remoteLogin(endpoint)
Username: admin
Password: *************   # The password you set in the admin tool.
[...]
REMOTE>

If authentication is failing, you can look at the tail of the system log file for the error, like this:

$ cd /usr/lib64/microsoft-deployr/9.0.1/Microsoft.DeployR.Server.WebAPI/logs
$ sudo tail $(ls -t1 | head -1)   # Look at the end of the most recent logfile
...
"Message":"The username doesn't belong to the admin user",...

Then, to end the remote session, the command is exit.

REMOTE> exit

5. Finish VM Configuration for remote access

Two more steps are needed before you can use the server over the network. You should set the public DNS (i.e. domain) address, since the VM’s public IP address is dynamic and may change when the machine is restarted. And as a matter of security, the Azure firewall (the “network security gateway” resource) needs to be configured.

Go back to the portal and find these resources associated with the VM:
– Public DNS address
– Open incoming service ports

Public IP

To set the public DNS name, go to the portal’s VM overview pane and click on the public-IP item, for instance, “mymrs-ip”:

until you get to the configuration blade:

This will send you to the mymrs-ip component where you can change the DNS label.

Network Security Group

If you don’t do this, a remote mrsdeploy login attempt will fail with a message

Error: Couldn't connect to server

since only port 22 (ssh) is allowed by default by the VM’s network security gateway. One option is to use ssh to set up port forwarding; I won’t explain that here. The alternative is to configure remote access on the server. For this you’ll need to open the port the admin tool reported as the web endpoint, typically 12800. The inbound security rules blade is buried in the VM choices -> Network Interfaces -> Network Security Group -> Inbound Security Rules. Choose “Add” to create a custom inbound rule for TCP port 12800. The result looks like this:

Now the server is ready for use!

6. Check that the MRS server is running from another machine

You’ll need a local copy of MRS to do this. Copies are available from a few sources, including a “client side only” copy called, naturally, R Client, which is licensed for free. R Client gives you all the remoting capabilities of R Server, as well as the same custom learning algorithms available with R Server, but unlike R Server it is limited to datasets that fit in memory.

There are several sources of R Server:

  • MSDN subscription downloads include R Server for different platforms.
  • Also R Client is a free download on MSDN.
  • Microsoft SQL Server comes with R Server as an option. You can install R Server “standalone” with the SQL Server installer in addition to installing it as part of SQL Server.
  • If you have installed R Tools for Visual Studio (RTVS), the R Tools menu has an item to install R Client.
  • Of course any VM that comes with R Server will work too. Notably, the Data Science VM, which hosts an exhaustive collection of data science tools, includes a copy of R Server.

To remotely login from your local machine, the MRS commands are the same as before, except use the domain name of the server from your local client:

> endpoint <- "<your-server-DNS-name>:12800"   # placeholder; use the DNS label you set above
> library(mrsdeploy)
> remoteLogin(endpoint)

If, as shown, you do not include the admin account and password as arguments to remoteLogin, the command will bring up a modal dialog asking you for them. Be advised that this dialog may be hidden and not come to the front, and you’ll have to look for it.

The server will kindly return a banner with the differences between your client and the server MRS environments. Here’s what a proper remote session returns on initiation:

Diff report between local and remote R sessions...
Warning! R version mismatch
local:  R version 3.3.2 (2016-10-31)
remote: R version 3.2.3 (2015-12-10)
These R packages installed on the local machine are not on the remote R instance:
   Missing Packages
1  checkpoint
2  CompatibilityAPI
3  curl
...
23 RUnit
The versions of these installed R packages differ:
   Package Local Remote
1  base    3.3.2 3.2.3
...
23 utils   3.3.2 3.2.3
Your REMOTE R session is now active.
Commands:
- pause() to switch to local session & leave remote session on hold.
- resume() to return to remote session.
- exit to leave (and terminate) remote session.

Once at the REMOTE> prompt you can explore the remote interpreter environment. These handy R functions let you explore the remote environment further:

Sys.getenv()   # will show the machine's OS environment variables on the server.
Sys.info()     # returns a character string with machine and user descriptions

Environment differences: Adding custom packages to the server

The comparative listing of packages when you log into the remote should alert you to the need to accommodate the differences between local and remote environments. Different R versions generate this warning:

Warning! R version mismatch

Different versions will limit which packages are available to both.

Compatible but missing packages can be installed on the server. To install packages when the available packages differ, the remote session needs permission to write to one of the directories identified by .libPaths() on the remote server; this is not granted by default. If you feel comfortable letting the remote user make modifications to the server, you could grant this permission by making the directory writable by everyone:

$ sudo chmod a+w /usr/local/lib/R/site-library/

Then, to install a package, for example glmnet, into this directory, use:

REMOTE> install.packages("glmnet", lib="/usr/local/lib/R/site-library")

These installations will persist from one remote session to another, and the “missing packages” warning at login will be updated correctly. Strangely, though, intellisense for package names always refers to the local list of packages, so it will make suggestions that are unavailable on the remote.
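Before opening a library directory to everyone, you might first check from the remote session which library paths are already writable. A small sketch of my own (file.access() with mode = 2 tests write permission):

```r
# Which of the session's library paths can we install into?
libs <- .libPaths()
writable <- libs[file.access(libs, mode = 2) == 0]
writable   # pass one of these as lib= to install.packages()
```

If the result is empty, the chmod step above (or an administrator) is needed before install.packages() can succeed.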

Running batch R jobs on the server

Congratulations! Now you can run large R jobs on a VM in the cloud!

There are various ways to use the server, in addition to running interactively at the REMOTE> prompt. A simple case is to take advantage of the remote server to run large, time-consuming jobs. For instance, this iteration, which computes a regression’s leave-one-out r-squared values—

rsqr <- c()
system.time(
  for (k in 1:nrow(mtcars)) {
    rsqr[k] <- summary(lm(mpg ~ . , data=mtcars[-k,]))$r.squared
  })
print(summary(rsqr))

—can be run the same way remotely:

remoteExecute("rsqr <- c()\
system.time(\
for (k in 1:nrow(mtcars)) {\
  rsqr[k] <- summary(lm(mpg ~ . , data=mtcars[-k,]))$r.squared\
})")

We’ll need to recall the results separately, since only the last value in the remote expression output is printed:


For larger chunks of code, you can include them in script files and execute the file remotely by using mrsdeploy::remoteScript("myscript.R"), which is simply a wrapper around mrsdeploy::remoteExecute("myscript.R", script=TRUE), where myscript.R is found in your local working directory.

Note that the mrsdeploy library is not needed in the script running remotely. Indeed, the VM with preinstalled Microsoft R Server 2016 (version 9.0.1) for Linux (Ubuntu version 16.04) runs R version 3.2.3, which does not include the mrsdeploy library. So both library(mrsdeploy) and install.packages("mrsdeploy") will generate an error in the remote session. If you’ve included these statements to enable your local script, be sure to remove them if you execute the script remotely, or the script will fail! If you want to use the same script in both places, a simple workaround is to avoid making the library call when the script runs in the remote session:

if (Sys.info()["user"] != "rserve2") { library(mrsdeploy) }

The ability of mrsdeploy to execute a script remotely is just the tip of the iceberg. It also enables moving files and variables back and forth between local and remote, and most importantly, configuring R functions as production web services. This set of deployment features merits another entire blog posting.

For more information

For details about different configuration options see Configuring R Server Operationalization. Libraries as required in the Operationalization instructions are already configured on the VM.

To see what you can do with a remote session, have a look here. And, for a general overview, see this.

Go to the R Server documentation for the full API reference.


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

February 2017 New Package Picks

Wed, 03/22/2017 - 17:00

(This article was first published on RStudio, and kindly contributed to R-bloggers)

by Joseph Rickert

One hundred and forty-five new packages were added to CRAN in February. Here are 47 interesting packages organized into five categories: Biostatistics, Data, Data Science, Statistics and Utilities.

  • BaTFLED3D v0.1.7: Implements a machine learning algorithm to make predictions and determine interactions in data that varies along three independent modes. It was developed to predict the growth of cell lines when treated with drugs at different doses. The vignette shows an example with simulated data.

  • DClusterm v0.1: Implements methods for the model-based detection of disease clusters. Look at the JSS paper for details.

Data

Data Science
  • autothresholdr v0.2.0: Is an R port of the ImageJ image processing program. The vignette shows how to use it.

  • dlib v1.0: Provides an Rcpp interface to dlib, the C++ toolkit containing machine learning algorithms and computer vision tools.

  • liquidSVM v1.0.1: Provides several functions to support a fast support vector machine implementation. There is a demo vignette and supplemental installation documentation.

  • OOBCurve v0.1: Provides a function to calculate the out-of-bag learning curve for random forest models built with the randomForest or ranger packages.

  • opusminer v0.1-0: Provides an interface to the OPUS Miner algorithm for finding the top-k, non-redundant itemsets from transaction data.

  • BayesCombo v1.0: Implements Bayesian meta-analysis methods to combine diverse evidence from multiple studies. The vignette provides a detailed example.

  • BayesianTools v0.1.0: Implements various Metropolis MCMC variants (including adaptive and/or delayed rejection MH), the T-walk, differential evolution MCMCs, DREAM MCMCs, and a sequential Monte Carlo particle filter, along with diagnostic and plot functions. The vignette will get you started.

  • FRK v0.1.1: Implements the Fixed Rank Kriging methods presented by Cressie and Johannesson in their 2008 paper. An extended vignette explains the math and provides several examples.

  • glmmTMB v0.1.1: Provides functions to fit Generalized Linear Mixed Models using the Template Model Builder (TMB) package. There are vignettes for getting started, Covariance Structures, post-hoc MCMC, simulation, and troubleshooting.

  • IMIFA v1.1.0: Provides flexible Gibbs sampler functions for fitting Infinite Mixtures of Infinite Factor Analysers and related models, introduced by Murphy et al. in a 2017 paper. The vignette shows examples.

  • ImputeRobust v1.1-1: Provides new imputation methods for the mice package, based on generalized additive models for location, scale, and shape (GAMLSS) as described in de Jong, van Buuren and Spiess.

  • lmvar v1.0.0: Provides functions to run linear regressions in which both the expected value and variance can vary by observation. Vignettes provide an introduction and explain the details of the math.

  • metaviz v0.1.0: Implements the rainforest plots proposed by Schild & Voracek as a variant of the forest plots used for meta-analysis. The vignette describes their use.

  • prophet v0.1: Implements a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly and weekly seasonality, plus holidays. There is a Quick Start guide.

  • robustarima v0.2.5: Provides functions for fitting a linear regression model with ARIMA errors, using a filtered tau-estimate.

  • rpgm v0.1.3: Provides functions that use the Ziggurat Method to generate Normal random variates quickly.

  • sarima v0.4-3: Provides functions for simulating and predicting with seasonal ARIMA models. The vignette presents a use case.

  • sppmix v1.0.0.0: Implements classes and methods for modeling spatial point patterns using inhomogeneous Poisson point processes, where the intensity surface is assumed to be analogous to a finite additive mixture of normal components, and the number of components is a finite, fixed or random integer.

Look here for documentation.

  • walkr v0.3.4: Provides functions to sample from the interaction of a hyperplane and an N Simplex. The vignette describes the math and the algorithms involved.

  • odbc v1.0.1: Uses the DBI interface to implement a connection to ODBC compatible databases.

  • sonify v0.1-0: Contains a function to transform univariate data, sampled at regular or irregular intervals, into a continuous sound with time-varying frequency. The function is intended as a substitute for R’s plot function to simplify data analysis for the visually impaired.

  • widgetframe v0.1.0: Provides two functions for working with htmlwidgets and iframes, which may be useful when working with WordPress or R Markdown. There is a vignette.

  • wrapr v0.1.1: Contains the debugging functions DebugFnW to capture function context on error for debugging, and let to convert non-standard evaluation interfaces to standard evaluation interfaces.

To leave a comment for the author, please follow the link and comment on their blog: RStudio.

The Hitchhiker’s Guide to Ggplot2 in R

Wed, 03/22/2017 - 17:00

(This article was first published on Pachá (Batteries Included), and kindly contributed to R-bloggers)

Published: 2016-11-30
Updated: 2017-03-23

“Any bleeder knows that books are never finished, only abandoned.”
Why Information Grows About the book

You can find the book here.

This is a book that may look complete, but changes in R packages constantly demand changes in the examples contained within the book. This is why the electronic format is perfect for the purpose of this work. Trapping it inside a dead-tree book is ultimately a waste of time and resources, in my own view.

Aside from being my first book, this is also my first collaborative work. I wrote it in a 50-50 collaboration with Jodie Burchell. Jodie is an amazing data scientist. I highly recommend reading her blog Standard Error where you can find really good material on Reproducible Research and more.

This is a technical book. The scope of the book is to go straight to the point and the writing style is similar to a recipe with detailed instructions. It is assumed that you know the basics of R and that you want to learn how to create beautiful plots.

Each chapter will explain how to create a different type of plot, and will take you step by step from a basic plot to a highly customised graph. The chapters are ordered by degree of difficulty.

Every chapter is independent of the others. You can read the whole book or go to a section of interest, and we are sure that it will be easy to understand the instructions and reproduce our examples without reading the first chapters.

In total this book contains 237 pages (letter paper size) of different recipes to obtain an acceptable aesthetic result. You can download the book for free (yes, really!) from Leanpub.

How did the book start?

Almost a year ago I finished writing the eleventh tutorial in a series on using ggplot2 I created with Jodie Burchell.

I asked Jodie to co-author some blog entries when I found her blog and realised that my interest in Data Science was reflected in her writing. The book comes after those entries on our blogs.

A few weeks later those tutorials evolved into the shape of an ebook. The reason was that what we started to write had unexpected success. We even had RTs from important people in the R community, such as Hadley Wickham. Finally the book was released on Leanpub.

We also included a pack that contains the Rmd files that we used to generate every chart that is displayed in the book.

Why Leanpub?

Leanpub is a platform where you can easily write your book using MS Word, among other writing software, and it even has GitHub and Dropbox integration. We went for R Markdown with LaTeX output, which means that Leanpub is both easy to use and flexible at the same time.

Even better, Leanpub enables the audience to download your books for free, if you allow it, or you can define a price range with a suggested price. The website gives authors a royalty of 90% minus 50 cents per sale (compared to other platforms, this is convenient for authors). You can also sell your books with additional exercises, video lessons, etc.

For example, last year I updated all the examples contained in the book just a few days after ggplot2 2.2 was released, and my readers received a notification email just after I uploaded the new version. People who do or do not pay for your books can download the newer versions for free.

If that’s not enough, Leanpub allows you to create bundles and sell your books as a set, or you can charge another price for your book plus additional material such as R Markdown notebooks, instructional videos and more.

What did I learn from my first book?

At the moment I am teaching Data Visualization, and from my students I learned that good visualizations come after they learn the visualization concepts. Coding clearly helps, but coding comes after the fundamentals.

It would be better to teach visualization fundamentals first, rather than in parallel with coding, and this applies especially when part of your audience has never written code before.

I got a lot of feedback from my students last term. That was really helpful for improving the book and dividing some steps into smaller pieces to facilitate the understanding of the Grammar of Graphics.

The interested reader may find some remarkable books that can be read before mine. I highly recommend:

Those are really good books that show the fundamentals of Data Visualisation and provide the key concepts and rules needed to communicate effectively with data.

To leave a comment for the author, please follow the link and comment on their blog: Pachá (Batteries Included).

Suggests != Depends

Wed, 03/22/2017 - 16:16

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A number of packages on CRAN use Suggests: casually.

They list other packages as "not required" in Suggests: — as opposed to absolutely required via Imports: or the older Depends: — yet do not test for their use in either examples or, more commonly, unit tests.

So e.g. the unit tests are bound to fail because, well, Suggests != Depends.

This has been accommodated for many years by all parties involved by treating Suggests as a Depends and installing unconditionally. As I understand it, CRAN appears to flip a switch to automatically install all Suggests from major repositories, glossing over what I consider to be a packaging shortcoming. (As an aside, treatment of Additional_repositories: is indeed optional; Brooke Anderson and I have a fine paper under review on this.)

I spend a fair amount of time with reverse dependency ("revdep") checks of packages I maintain, and I will no longer accommodate these packages.

These revdep checks take long enough as it is, so I will now blacklist these packages that are guaranteed to fail when their "optional" dependencies are not present.

Writing R Extensions says in Section 1.1.3

All packages that are needed to successfully run R CMD check on the package must be listed in one of ‘Depends’ or ‘Suggests’ or ‘Imports’. Packages used to run examples or tests conditionally (e.g. via if(require(pkgname))) should be listed in ‘Suggests’ or ‘Enhances’. (This allows checkers to ensure that all the packages needed for a complete check are installed.)

In particular, packages providing “only” data for examples or vignettes should be listed in ‘Suggests’ rather than ‘Depends’ in order to make lean installations possible.


It used to be common practice to use require calls for packages listed in ‘Suggests’ in functions which used their functionality, but nowadays it is better to access such functionality via :: calls.

and continues in Section

Note that someone wanting to run the examples/tests/vignettes may not have a suggested package available (and it may not even be possible to install it for that platform). The recommendation used to be to make their use conditional via if(require("pkgname")): this is fine if that conditioning is done in examples/tests/vignettes.
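A minimal sketch of the conditional pattern the manual describes; the suggested package here (jsonlite) is only an illustration, not something from the original post:

```r
# Guard optional functionality behind a runtime check, so examples and
# tests skip cleanly when the suggested package is not installed.
if (requireNamespace("jsonlite", quietly = TRUE)) {
  out <- jsonlite::toJSON(list(a = 1))
} else {
  message("jsonlite not installed; skipping this example")
}
```

Using requireNamespace() plus :: avoids attaching the package to the search path, which is the behaviour the manual now recommends over bare require() calls.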

I will now exercise my option to use ‘lean installations’ as discussed here. If you want your package included in tests I run, please make sure it tests successfully when only its required packages are present.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

San Francisco EARL: First round of speakers announced

Wed, 03/22/2017 - 13:19

(This article was first published on Mango Solutions » R Blog, and kindly contributed to R-bloggers)

We’re excited to announce the first round of gReat speakers for San Francisco EARL. Alongside our keynote speakers, Hilary Parker and Ricardo Bion, R users from a range of industries will share their R stories. Take a look at our line up so far:

Szilard Pafka, Epoch
Szilard studied physics in the ’90s and obtained a PhD by using statistical methods to analyze the risk of financial portfolios. He has worked in a bank quantifying and managing market risk. His main tool for data analysis for the last 10 years has been R. He is also the founder/organizer of the Los Angeles R meetup and a data science community website.

Gergely Daroczi,
Gergely is an enthusiastic R user and package developer. He is also the founder of an R-based web application, a PhD candidate in Sociology, and works as Lead R Developer and Director of Analytics in Los Angeles. He has a strong interest in designing a scalable data platform built on top of a dozen APIs, stream processing and a lot of R.

Matt Dancho, Business Science
Matt is a management professional with a background in engineering, sales and data science. He leads development efforts for products and services at Business Science, open-source software for business and financial analysis. He has degrees in Mechanical Engineering, Business Administration and Industrial Engineering. He loves working on software projects and traveling the world with his wife!

JS Tan, Microsoft
JS is a Program Manager on Microsoft’s Azure Big Compute Team. His focus is on providing HPC capabilities in Azure, as well as supporting other kinds of compute-intensive workloads. He is currently working on enabling distributed computing frameworks in both R and Python on their infrastructure and has an interest in workloads like financial modeling and risk analytics.

Ari Lamstein, Lamstein Consulting LLC
Ari is an R trainer and consultant located in San Francisco. He is the author of several popular R packages, most notably choroplethr. He also runs, a website that helps data scientists build portfolios that improve their careers.

Ali Marami, R-Brain
Ali has a PhD in Finance from the University of Neuchâtel in Switzerland and a degree in electrical engineering. He has extensive experience in financial and quantitative modeling and financial risk management in several US banks. Currently, he is a Data Science Advisor at R-Brain Inc.

Dr Mohammed Zakaria, Catapult Sports
Mohammed is a Data Scientist at Catapult Sports, a global leader in wearable sports technology. He works on harnessing the power of data obtained from Catapult devices to create solutions for sports, such as baseball and football, through machine learning. Mohammed holds a PhD in particle physics, and as part of his postdoctoral training, he studied the strong nuclear force in collisions made at the Large Hadron Collider.

Dr Chris Boosalis, Sacramento State University
Chris has worked in higher education for 20 years. His focus is on building better programmatic and curricular assessment methods using open-source software, shared code, and non-proprietary tools. He is currently working with R, RStudio and Shiny to create dashboards that offer faculty and administrators insights into how changes affect both student self-perceptions and objective performance.

Want to join the line up? It’s not too late to submit your abstract – the deadline has been extended to 14 April – submit here.

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions » R Blog.

Simulating Unown encounter rates in Pokémon Go

Wed, 03/22/2017 - 06:00

(This article was first published on Higher Order Functions, and kindly contributed to R-bloggers)

Pokémon Go is an augmented reality game where people with
smartphones walk around and catch Pokémon. As in the classic games, players are
Pokémon “trainers” who have to travel around and collect creatures. Some types
are rarer than others, some have regional habitats or biomes, and so players
explore the game world to find and collect them. Pokémon Go
“augments reality” by having these Pokémon spawn in the game world as players
travel around the real world. I like the game; I’ve reached level 30.

On February 16, 2017, a second “generation” of Pokémon were released into the
wild, adding dozens of new species to the game. The rarest among this new
generation is Unown. As part of a cruel scheme to torture
completionists, the incredibly rare Unown comes in 26 varieties—one for each
letter of the alphabet. Which brings us to the statistical question behind this
blog post: How long will it take to encounter all 26 types of Unowns (taking
into account repeats)?
Somewhat more formally, how many random encounters
(i.e., draws while sampling with replacement) will it take until we have
encountered all 26 Unowns?

This general problem is called the coupon collector’s problem—hat tip
to r/TheSilphRoad. The Wikipedia
article for the problem includes a table which says that the expected number
of encounters for n = 26 is = 101. (The analytic solution is actually
100.2, but we round up because there are no fractional encounters.) So problem
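For readers who do want the closed form, that table entry can be reproduced in one line of R (my own check, not code from the post):

```r
# Coupon collector: expected draws to see all n types is n times the
# n-th harmonic number, n * (1/1 + 1/2 + ... + 1/n).
n <- 26
expected <- n * sum(1 / seq_len(n))
round(expected, 1)
#> [1] 100.2
```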

Well, not quite. Suppose we had never heard of the coupon collector’s problem.
Also, suppose that we want to get a sense of the uncertainty around the
expected value. For example, how many encounters will it take for the
unfortunate trainers in the 95th percentile? How might we tackle this problem?

When analytic solutions are difficult or tedious, we can write simulations and
get very good approximations. That’s what we’re going to do in R.

We will write a function to simulate a single exhaustive set of Unown
encounters. The main workhorse is sample(..., replace = TRUE) to sample with
replacement. Using R’s built-in LETTERS constant, we can simulate a batch of
Unown encounters.

sample(LETTERS, size = 1, replace = TRUE)
#> [1] "F"

sample(LETTERS, size = 5, replace = TRUE)
#> [1] "H" "U" "C" "G" "D"

sample(LETTERS, size = 10, replace = TRUE)
#> [1] "T" "K" "K" "D" "Z" "O" "Y" "K" "Q" "Q"

The question now is how many samples does it take to get the 26 different
Unowns. An absolute, and frankly miraculous, lower bound on this number would be
26, so let’s draw 26 samples.

set.seed(252)  # For reproducible blogging

n_unique <- function(xs) length(unique(xs))

# Unowns in the batch
first_batch <- sample(LETTERS, size = 26, replace = TRUE)
first_batch
#> [1] "X" "S" "I" "T" "R" "J" "L" "N" "F" "I" "P" "Q" "K" "D" "Q" "Q" "K"
#> [18] "O" "S" "C" "F" "C" "P" "X" "V" "L"

# Number of unique ones
n_unique(first_batch)
#> [1] 16

# Number of remaining Unowns we have not encountered yet
leftover <- 26 - n_unique(first_batch)
leftover
#> [1] 10

We encountered 16 unique Unowns from the first batch of
samples. The best-case lower bound for the number of the encounters remaining is
now 10, so let’s take 10 more samples.

second_batch <- sample(LETTERS, size = leftover, replace = TRUE)
second_batch
#> [1] "Y" "X" "S" "F" "P" "Z" "F" "F" "Y" "G"

# Combine the batches
both_batches <- c(first_batch, second_batch)
n_unique(both_batches)
#> [1] 19

leftover <- 26 - n_unique(both_batches)
leftover
#> [1] 7

We found 3 new Unowns in this
batch, and we have encountered 19 unique ones so far
from 36 total samples. That means the lower bound is now 7, so we take 7 more samples.

third_batch <- sample(LETTERS, size = leftover, replace = TRUE)
third_batch
#> [1] "U" "A" "G" "P" "K" "L" "E"

all_batches <- c(both_batches, third_batch)
n_unique(all_batches)
#> [1] 22

leftover <- 26 - n_unique(all_batches)
leftover
#> [1] 4

We found 3 new Unowns in this—

Actually, this is getting tedious. We all know where this process is going: Take
a sample, see how many you have left to find, take another sample of that size,
etc. until you have 0 left to find. Pretty simple? Great! Now, I don’t have to
explain how the while loop inside the function works.

simulate_unown <- function() {
  # Use a sequence of numbers instead of LETTERS
  n_unowns <- 26
  unown_set <- seq_len(n_unowns)

  n_unique <- 0
  encountered <- character(0)

  # Take 26 samples on first iteration, 26 - n_unique on next iteration, etc.
  while (n_unowns - n_unique > 0) {
    batch <- sample(unown_set, size = n_unowns - n_unique, replace = TRUE)
    encountered <- c(encountered, batch)
    n_unique <- length(unique(encountered))
  }

  length(encountered)
}

Each call of the function simulates a process of encountering all 26 Unowns,
returning how many encounters were required to find them all.

simulate_unown()
#> [1] 111

simulate_unown()
#> [1] 81

simulate_unown()
#> [1] 116

We use replicate() to call this function thousands of times.

simulation <- replicate(10000, simulate_unown())

We can get summary statistics and other quantiles for these simulations.

summary(simulation)
#>  Min. 1st Qu. Median   Mean 3rd Qu.   Max.
#>  41.0    78.0   95.0  100.3   116.0  373.0

quantile(simulation, probs = c(.05, .10, .25, .50, .75, .90, .95))
#>  5% 10% 25% 50% 75% 90% 95%
#>  61  67  78  95 116 141 161

The mean in our simulations, 100.3, is very close to
the analytic solution of 100.2. The median, 95, is less than
the mean, which is a bit of good news: More than half of players will hit 26
in less than the expected 100 encounters. The bad news is that there is a long
tail to these simulations, as the RNG gods have cursed one player by requiring
373 encounters.

We can visualize the distribution with a histogram.

library(ggplot2)

p1 <- ggplot(data.frame(x = simulation)) +
  aes(x = x) +
  geom_histogram(binwidth = 5, color = "white", center = 102.5) +
  labs(x = "Num. Unowns encountered until 26 unique Unowns encountered",
       y = "Num. samples in 10,000 simulations") +
  theme(axis.title.x = element_text(hjust = .995),
        axis.title.y = element_text(hjust = .995)) +
  ggtitle("A long-tail of unfortunate RNG for Unown completionists")
p1

I haven’t seen any of these Pokémon in the month since they were added to the
game. Assuming I see an Unown every month, I can expect to see them all in 100
encounters over the course of 8.3 years. As I said, a cruel
joke for completionists.

What? SIMULATE_UNOWN is evolving!

One nice thing about using simulations to compute statistics is that we can
easily modify the simulation function to answer related questions. For example:

  • The above simulation assumed that we can catch the Unowns 100% of the time.
    What if we fail on 5% of the encounters?
  • Two more Unowns were added to the series in the third Pokémon generation.
    What is the expected number of encounters for 28 Unowns?
  • Suppose we already have 20 Unowns. How many more encounters are required to
    collect the remaining 6?

We can add some parameters to our function to address these questions. To
simulate catching, we sample a TRUE or FALSE value for each Unown sampled. The
prob argument for sample() lets us assign probabilities to elements in the
sample, so we can use c(p_catch, 1 - p_catch) as probabilities for sampling
TRUE or FALSE—that is, catching a Pokémon.
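As a small illustration of that prob argument (my own sketch, not code from the post):

```r
# Draw 10 catch outcomes where TRUE (a catch) has probability 0.95
# and FALSE (a failed catch) has probability 0.05.
set.seed(1)
catches <- sample(c(TRUE, FALSE), size = 10, replace = TRUE,
                  prob = c(0.95, 0.05))
catches
```

Over many draws, mean(catches) converges on 0.95, the assumed catch rate.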

To handle already-captured Unowns, we add this information to the initial
values for encountered and n_unique to give those values a head start
before the encounter/catch loop. Afterwards, we have to adjust our encounter
count by that head start.

# Defaults to 26 unowns, 0 already caught, and 100% success rate
simulate_unown_catches <- function(n_unowns = 26, n_already = 0, p_catch = 1) {
  unown_set <- seq_len(n_unowns)
  catch_probs <- c(p_catch, 1 - p_catch)

  # Take into account any previously caught ones
  n_unique <- n_already
  already_encountered <- seq_len(n_already)
  already_caught <- already_encountered
  encountered <- already_encountered
  caught <- already_caught

  # Encounter/catch loop
  while (n_unowns - n_unique > 0) {
    batch <- sample(unown_set, size = n_unowns - n_unique, replace = TRUE)
    # Simulate catching success for each Unown
    catches <- sample(c(TRUE, FALSE), size = n_unowns - n_unique,
                      replace = TRUE, prob = catch_probs)
    caught_in_batch <- batch[catches]
    encountered <- c(encountered, batch)
    caught <- c(caught, caught_in_batch)
    n_unique <- length(unique(caught))
  }

  length(encountered) - length(already_encountered)
}

With the default settings, the function should reproduce the original behavior
and give us similar results.

simulation2 <- replicate(10000, simulate_unown_catches())
summary(simulation2)
#>  Min. 1st Qu. Median   Mean 3rd Qu.   Max.
#>  38.0    78.0   94.0  100.2   115.0  369.0

We should expect the average number of required encounters to catch all 26
Unowns to increase by a factor of about 1.05 (that is, 1/0.95) when there is a
95% catch rate. Our simulations confirm this intuition.

simulation_95_rate <- replicate(
  n = 10000,
  expr = simulate_unown_catches(n_unowns = 26, p_catch = .95))
summary(simulation_95_rate)
#>  Min. 1st Qu. Median   Mean 3rd Qu.   Max.
#>  41.0    82.0  100.0  106.2   124.0  338.0
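The 1/p scaling can also be checked against the analytic coupon-collector expectation (my own check, not from the original post):

```r
# Expected encounters with catch probability p: (n * H_n) / p,
# since each successful catch now needs 1/p encounters on average.
n <- 26; p <- 0.95
round(n * sum(1 / seq_len(n)) / p, 1)
#> [1] 105.5
```

That is within sampling error of the simulated mean of 106.2.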

We can also simulate the case for 28 Unowns with a 100% encounter rate, and see
that the average number of required encounters increases by approximately 10.

simulation_28_unowns <- replicate(
  n = 10000,
  expr = simulate_unown_catches(n_unowns = 28, p_catch = 1))
summary(simulation_28_unowns)
#>  Min. 1st Qu. Median   Mean 3rd Qu.   Max.
#>  44.0    85.0  104.0  109.9   127.0  322.0

Finally, we can simulate the home stretch of Unown completion where we
have 20 Unowns with 6 more to go.

simulation_last_6 <- replicate(
  n = 10000,
  expr = simulate_unown_catches(n_unowns = 26, n_already = 20, p_catch = 1))
summary(simulation_last_6)
#>  Min. 1st Qu. Median   Mean 3rd Qu.   Max.
#>  8.00   42.00  57.00  63.39   79.00 282.00

This result is interesting: It says that we will spend the majority of our time
working on the last 6 Unowns.
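That matches the analytic expectation for the home stretch (again my own check, not from the post): with k types still missing out of n, the expected number of further encounters is n/k + n/(k-1) + ... + n/1.

```r
# Expected extra encounters to find the last k of n types
n <- 26; k <- 6
round(sum(n / seq_len(k)), 1)
#> [1] 63.7
```

63.7 out of an expected 100.2 total encounters, so roughly 60% of the effort goes into the final six.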

Simulate ‘em all

A natural next step is to run many simulations over a range of parameter
values. Here, we can study the home-stretch behavior by running a simulation
for each value of n_already from 0 to 25.

We will create a function to simulate 2000 home-stretch samples of required
encounters for a given number of starting Unowns. This function will return a
dataframe with one row per simulation sample. We use Map() to apply this
function for each value of n_already from 0 to 25.

library(dplyr, warn.conflicts = FALSE)

simulate_2k <- function(n_already) {
  n_sims <- 2000
  sim_results <- replicate(
    n = n_sims,
    expr = simulate_unown_catches(n_unowns = 26, n_already, p_catch = 1))

  # Package everything together in a dataframe
  data_frame(
    n_already = rep(n_already, times = n_sims),
    simulation = seq_len(n_sims),
    n_encounters = sim_results)
}

results <- Map(simulate_2k, n_already = 0:25)
df_results <- bind_rows(results)
df_results
#> # A tibble: 52,000 × 3
#>    n_already simulation n_encounters
#>        <int>      <int>        <int>
#> 1          0          1          157
#> 2          0          2          125
#> 3          0          3          129
#> 4          0          4          107
#> 5          0          5           69
#> 6          0          6           65
#> 7          0          7           70
#> 8          0          8          147
#> 9          0          9           69
#> 10         0         10           77
#> # ... with 51,990 more rows

We can now plot the expected number of encounters for each starting value.
Variance from random simulation keeps the points from following a neat curve,
so let’s just appreciate the main trends in the results.

ggplot(df_results) +
  aes(x = n_already, y = n_encounters) +
  stat_summary(fun.y = mean, geom = "point") +
  labs(x = "Num. Unown types already encountered",
       y = "Expected num. encounters to find all 26") +
  theme(axis.title.x = element_text(hjust = .995),
        axis.title.y = element_text(hjust = .995)) +
  ggtitle("The painful home stretch of Unown completion")

Because the analytic solution for finding 26 Unowns starting from 0 is 100, we
can also read the y-axis as a percentage. In other words, 60% of the work
(number of encounters out of 100) will be spent on the last 5 Unowns.

Simulation as a way to learn statistics

An underlying theme for this post is best summarized by a line from a talk
called Statistics for Hackers:

“If you can write for loops, you can do
statistics.” — Statistics for Hackers

Simulation provides a way for “hackers” to leverage one skill (programming) to
learn another domain (statistics). I myself find that I have trouble learning
statistics from equations alone. I need code to play with a problem and develop
intuitions about it.

All of these points about the nice features of simulation must be painfully
obvious to professional statisticians. But strangely, I only used simulations
once or twice in my statistics courses, so it wasn’t part of my statistical tool
kit. Indeed, I had the usefulness of simulation impressed upon me last year by
Gelman and Hill’s textbook. The book advises the
reader to not worry about analytically computing statistics for tricky
calculations—for example, trying to estimate a 95% prediction interval for the
income difference between two groups when the underlying statistical model used
log-dollars. Just have the model generate thousands of predictions, run the
transformation, and find the interval on the transformed data. The book also
uses simulation as a sneaky backdoor into informal Bayesian statistics: Measure
uncertainty by simulating new data from perturbed model parameters. This idea
made enough sense to me to lure me into the chapters on formal Bayesian
inference and learn that statistical framework.
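A toy version of that log-dollars example (entirely my own sketch; the group means and standard deviations are made up):

```r
# Simulate incomes on the log-dollar scale, transform back to dollars,
# and read a 95% interval for the group difference off the quantiles.
set.seed(42)
log_a <- rnorm(10000, mean = 9.5, sd = 0.4)  # group A log-incomes
log_b <- rnorm(10000, mean = 9.3, sd = 0.4)  # group B log-incomes
diff_dollars <- exp(log_a) - exp(log_b)
quantile(diff_dollars, probs = c(0.025, 0.975))
```

No algebra on lognormal differences required: the simulated draws carry the uncertainty through the transformation for us.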

To leave a comment for the author, please follow the link and comment on their blog: Higher Order Functions.

anytime 0.2.2

Wed, 03/22/2017 - 02:50

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A bugfix release of the anytime package arrived at CRAN earlier today. This is the tenth release since the inaugural version late last summer, and the second (bugfix / feature) release this year.

anytime is a very focused package aiming to do just one thing really well: to convert anything in integer, numeric, character, factor, ordered, … format to either POSIXct or Date objects — and to do so without requiring a format string. See the anytime page, or the GitHub repo, for a few examples.
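For instance, a small sketch of the package's advertised behaviour (outputs depend on your local timezone, so none are shown):

```r
# anytime guesses the input format: no format string needed
library(anytime)
anytime("2017-03-22 02:50:00")  # character -> POSIXct
anydate("2017/03/22")           # character -> Date
anydate(20170322L)              # integer yyyymmdd -> Date
```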

This release addresses an annoying bug related to British TZ settings and the particular impact of a change in 1971, and it generalizes input formats to accept integer or numeric values in two specific ranges. Details follow below:

Changes in anytime version 0.2.2 (2017-03-21)
  • Address corner case of integer-typed (large) values corresponding to POSIXct time (PR #57 closing #56)

  • Add special case for ‘Europe/London’ and 31 Oct 1971 BST change to avoid a one-hour offset error (#58 fixing #36 and #51)

  • Address another corner case of numeric values corresponding to Date types which are now returned as Date

  • Added file init.c with calls to R_registerRoutines() and R_useDynamicSymbols(); already used .registration=TRUE in useDynLib in NAMESPACE

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the anytime page.

For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Use mlrMBO to optimize via command line

Wed, 03/22/2017 - 01:00

(This article was first published on mlr-org, and kindly contributed to R-bloggers)

Many people who want to apply Bayesian optimization want to use it to optimize an algorithm that is not implemented in R but runs on the command line as a shell script or an executable.

We recently published mlrMBO on CRAN.
Like any R package, it normally operates inside an R session, but with this post I want to demonstrate how mlrMBO can be used to optimize an external application.
At the same time I will highlight some issues you can likely run into.

First of all we need a bash script that we want to optimize.
This tutorial will only run on Unix systems (Linux, OS X, etc.) but should also be informative for Windows users.
The following code writes a tiny bash script that uses bc to calculate sin(x1 - 1) + (x1^2 + x2^2) and writes the result "hidden" in a sentence (The result is 12.34!) to a result.txt text file.

The bash script

# write the bash script; "fun.sh" is a stand-in file name, since the
# script's original name was lost in aggregation
lines = '#!/bin/bash
fun () {
  x1=$1
  x2=$2
  command="(s($x1-1) + ($x1^2 + $x2^2))"
  result=$(bc -l <<< $command)
}
echo "Start calculation."
fun $1 $2
echo "The result is $result!" > "result.txt"
echo "Finish calculation."
'
writeLines(lines, "fun.sh")

# make it executable:
system("chmod +x fun.sh")

Running the script from R

Now we need a R function that starts the script, reads the result from the text file and returns it.

library(stringi)

runScript = function(x) {
  # "fun.sh" stands in for the script written above
  command = sprintf("./fun.sh %f %f", x[['x1']], x[['x2']])
  error.code = system(command)
  if (error.code != 0) {
    stop("Simulation had error.code != 0!")
  }
  result = readLines("result.txt")
  # the pattern matches 12 as well as 12.34 and .34
  # the ?: makes the decimals a non-capturing group.
  result = stri_match_first_regex(result, pattern = "\\d*(?:\\.\\d+)?(?=\\!)")
  as.numeric(result)
}

This function uses stringi and regular expressions to match the result within the sentence.
Depending on the output different strategies to read the result make sense.
XML files can usually be accessed with XML::xmlParse, XML::getNodeSet, XML::xmlAttrs etc. using XPath queries.
Sometimes the good old read.table() is also sufficient.
If, for example, the output is written in a file like this:

value1 = 23.45 value2 = 13.82

You can easily use source() like this:

EV = new.env()
source(file = "result.txt", local = EV)
res = as.list(EV)
rm(EV)

which will return a list with the entries $value1 and $value2.

Define bounds, wrap function.

To evaluate the function from within mlrMBO it has to be wrapped in a smoof function.
The smoof function also contains information about the bounds and scales of the domain of the objective function, defined in a ParameterSet.

library(mlrMBO)

# Defining the bounds of the parameters:
par.set = makeParamSet(
  makeNumericParam("x1", lower = -3, upper = 3),
  makeNumericParam("x2", lower = -2.5, upper = 2.5)
)

# Wrapping everything in a smoof function:
fn = makeSingleObjectiveFunction(
  id = "fun.sh",
  fn = runScript,
  par.set = par.set,
  has.simple.signature = FALSE
)

# let's see if the function is working
des = generateGridDesign(par.set, resolution = 3)
des$y = apply(des, 1, fn)
des
##   x1   x2         y
## 1 -3 -2.5 16.006802
## 2  0 -2.5  5.408529
## 3  3 -2.5 16.159297
## 4 -3  0.0  9.756802
## 5  0  0.0  0.841471
## 6  3  0.0  9.909297
## 7 -3  2.5 16.006802
## 8  0  2.5  5.408529
## 9  3  2.5 16.159297

If you run this locally, you will see that the console output generated by our shell script directly appears in the R-console.
This can be helpful but also annoying.

Redirecting output

If a lot of output is generated during a single call of system() it might even crash R.
To avoid that I suggest to redirect the output into a file.
This way no output is lost and the R console does not get flooded.
We can simply achieve that by replacing the command in the function runScript from above with the following code:

# console output file, e.g. output_1490030005_1.1_2.4.txt
output_file = sprintf("output_%i_%.1f_%.1f.txt",
                      as.integer(Sys.time()), x[['x1']], x[['x2']])

# redirect output with ./fun.sh 1.1 2.4 > output.txt
# alternative: ./fun.sh 1.1 2.4 > /dev/null to drop it
command = sprintf("./fun.sh %f %f > %s", x[['x1']], x[['x2']], output_file)

Start the Optimization

Now everything is set so we can proceed with the usual MBO setup:

ctrl = makeMBOControl()
ctrl = setMBOControlInfill(ctrl, crit = crit.ei)
ctrl = setMBOControlTermination(ctrl, iters = 10)
configureMlr(show.info = FALSE, show.learner.output = FALSE)
run = mbo(fun = fn, control = ctrl)
## Computing y column(s) for design. Not provided.
## [mbo] 0: x1=-1.58; x2=-1.64 : y = 4.65 : 0.0 secs : initdesign
## [mbo] 0: x1=-0.251; x2=0.0593 : y = 0.883 : 0.0 secs : initdesign
## [mbo] 0: x1=1.04; x2=2.05 : y = 5.3 : 0.0 secs : initdesign
## [mbo] 0: x1=-2.39; x2=-0.345 : y = 6.07 : 0.0 secs : initdesign
## [mbo] 0: x1=0.608; x2=-0.742 : y = 0.538 : 0.0 secs : initdesign
## [mbo] 0: x1=2.85; x2=0.9 : y = 9.87 : 0.0 secs : initdesign
## [mbo] 0: x1=2.07; x2=-2.11 : y = 9.58 : 0.0 secs : initdesign
## [mbo] 0: x1=-0.967; x2=1.25 : y = 1.58 : 0.0 secs : initdesign
## [mbo] 1: x1=0.301; x2=-2.08 : y = 3.79 : 0.0 secs : infill_ei
## [mbo] 2: x1=0.375; x2=-0.191 : y = 0.408 : 0.0 secs : infill_ei
## [mbo] 3: x1=0.189; x2=-0.586 : y = 0.346 : 0.0 secs : infill_ei
## [mbo] 4: x1=0.397; x2=-0.516 : y = 0.144 : 0.0 secs : infill_ei
## [mbo] 5: x1=-0.702; x2=-0.45 : y = 0.296 : 0.0 secs : infill_ei
## [mbo] 6: x1=-0.476; x2=-0.536 : y = 0.481 : 0.0 secs : infill_ei
## [mbo] 7: x1=-0.982; x2=-0.258 : y = 0.115 : 0.0 secs : infill_ei
## [mbo] 8: x1=-0.94; x2=0.0274 : y = 0.0493 : 0.0 secs : infill_ei
## [mbo] 9: x1=-1.09; x2=0.238 : y = 0.364 : 0.0 secs : infill_ei
## [mbo] 10: x1=-0.897; x2=-0.102 : y = 0.132 : 0.0 secs : infill_ei

# The resulting optimal configuration:
run$x
## $x1
## [1] -0.9395412
##
## $x2
## [1] 0.02737539

# The best reached value:
run$y
## [1] 0.04929388

Execute the R script from a shell

Also, you might not want to be bothered with having to start R and run this script manually, so what I would recommend is saving all of the above as an R script, plus some lines that write the output to a JSON file like this:

library(jsonlite)
write_json(run[c("x", "y")], "mbo_res.json")

Let’s assume we saved all of that above as an R-script under the name runMBO.R (actually it is available as a gist).

Then you can simply run it from the command line:

Rscript runMBO.R

As an extra the script in the gist also contains a simple handler for command line arguments.
In this case you can define the number of optimization iterations and the maximal allowed time in seconds for the optimization.
You can also define the seed to make runs reproducible:

Rscript runMBO.R iters=20 time=10 seed=3

If you want to build a more advanced command line interface you might want to have a look at docopt.

Clean up

To clean up all the files generated by this script you can run:

file.remove("result.txt")
file.remove("fun.sh")  # the stand-in script name from above
file.remove("mbo_res.json")
output.files = list.files(pattern = "output_\\d+_[0-9_.-]+\\.txt")
file.remove(output.files)

To leave a comment for the author, please follow the link and comment on their blog: mlr-org.