RcppCNPy 0.2.9
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
Another minor maintenance release of the RcppCNPy package arrived on CRAN this evening.
RcppCNPy provides R with read and write access to NumPy files thanks to the cnpy library by Carl Rogers.
There is only small code change: a path is now checked before an attempt to save. Thanks to Wush for the suggestion. I also added a short new vignette showing how reticulate can be used for NumPy data.
Changes in version 0.2.9 (20180322)
The npySave function has a new option to check the path in the given filename.

A new vignette was added showing how the reticulate package can be used instead.
CRANberries also provides a diffstat report for the latest release. As always, feedback is welcome and the best place to start a discussion may be the GitHub issue tickets page.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in thirdparty forprofit settings.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
The most prolific package maintainers on CRAN
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
During a discussion with some other members of the R Consortium, the question came up: who maintains the most packages on CRAN? DataCamp maintains a list of most active maintainers by downloads, but in this case we were interested in the total number of packages by maintainer. Fortunately, this is pretty easy to figure thanks to the CRAN repository tools now included in R, and a little dplyr (see the code below) gives the answer quickly[*].
And the answer? The most prolific maintainer is Scott Chamberlain from ROpenSci, who is currently the maintainer of 77 packages. Here's a list of the top 20:
Maint n 1 Scott Chamberlain 77 2 Dirk Eddelbuettel 53 3 Jeroen Ooms 40 4 Hadley Wickham 39 5 Gábor Csárdi 37 6 ORPHANED 37 7 Thomas J. Leeper 29 8 Bob Rudis 28 9 Henrik Bengtsson 28 10 Kurt Hornik 28 11 Oliver Keyes 28 12 Martin Maechler 27 13 Richard Cotton 27 14 Robin K. S. Hankin 24 15 Simon Urbanek 24 16 Kirill Müller 23 17 Torsten Hothorn 23 18 Achim Zeileis 22 19 Paul Gilbert 21 20 Yihui Xie 21(That list of orphaned packages with no current maintainer includes XML, d3heatmap, and flexclust, to name just 3 of the 37.) Here's the R code used to calculate the top 20:
[*]Well, it would have been quick, until I noticed that some maintainers had two forms of their name in the database, one with surrounding quotes and one without. It seemed like it was going to be trivial to fix with a regular expression, but it took me longer than I hoped to come up with the final regexp on line 4 above, which is now barely distinguishable from line noise. As usual, there an xkcd for this situation:
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Comparing additive and multiplicative regressions using AIC in R
(This article was first published on R – Modern Forecasting, and kindly contributed to Rbloggers)
One of the basic things the students are taught in statistics classes is that the comparison of models using information criteria can only be done when the models have the same response variable. This means, for example, that when you have \(\log(y_t)\) and calculate AIC, then this value is not comparable with AIC from a model with \(y_t\). The reason for this is because the scales of variables are different. But there is a way to make the criteria in these two cases comparable: both variables need to be transformed into the original scale, and we need to understand what are the distributions of these variables in that scale. In our example, if we assume that \(\log(y_t) \sim \mathcal{N}(0, \sigma^2_{l}) \) (where \(\sigma^2_{l}\) is the variance of the residuals of the model in logarithms), then the exponent of this variable will have lognormal distribution:
\begin{equation}
y_t \sim \text{log}\mathcal{N}(0, \sigma^2_{l})
\end{equation}
Just as a reminder, all the information criteria rely on the loglikelihood. For example, here’s the formula of AIC:
\begin{equation} \label{eq:AIC}
AIC = 2k 2\ell ,
\end{equation}
where \(k\) is the number of all the estimated parameters and \(\ell\) is the value of the loglikelihood.
If we use the likelihood of lognormal distribution instead of the likelihood of the normal in \eqref{eq:AIC} for the variable \(y_t\), then the information criteria will become comparable. In order to understand what needs to be done for this transformation, we need to compare the formulae for the two distributions: normal and lognormal. Here’s normal for the variable \(\log y_t\):
\begin{equation} \label{eq:normal}
f(y_t  \theta, \sigma^2_{l}) = \frac{1}{\sqrt{2 \pi \sigma^2_{l}}} e ^{\frac{\left(\log y_t \log \mu_{t} \right)^2}{2 \sigma^2_{l}}}
\end{equation}
and here’s the lognormal for the variable \(y_t = \exp(\log(y_t))\) (the multiplicative model in the original scale):
\begin{equation} \label{eq:lognormal}
f(y_t  \theta, \sigma^2_{l}) = \frac{1}{y_t} \frac{1}{\sqrt{2 \pi \sigma^2_{l}}} e ^{\frac{\left(\log y_t \log \mu_{t} \right)^2}{2 \sigma^2_{l}}} ,
\end{equation}
where \(\theta\) is the vector of parameters of our model. The main difference between the two distributions is in \(\frac{1}{y_t}\). If we derive the loglikelihood based on \eqref{eq:lognormal}, here is what we get:
\begin{equation} \label{eq:loglikelihoodlognormal}
\ell(\theta, \sigma^2_{l}  Y) = \frac{1}{2} \left( T \log \left( 2 \pi {\sigma}^2_{l} \right) +\sum_{t=1}^T \frac{\left(\log y_t \log \mu_{t} \right)^2}{2\sigma^2_{l}} \right) – \sum_{t=1}^T \log y_t ,
\end{equation}
where \(Y\) is the vector of all the actual values in the sample. When we extract likelihood of the model in logarithms, we calculate only the first part of \eqref{eq:loglikelihoodlognormal}, before the “\(\sum_{t=1}^T \log y_t \)”, which corresponds to the normal distribution. So, in order to produce the likelihood of the model with the variable in the original scale, we need to subtract the sum of logarithms of the response variable from the extracted likelihood.
The function
AIC() in R, applied to the model in logarithms, will extract the value based on that first part of \eqref{eq:loglikelihoodlognormal}. As a result, in order to fix this and get AIC in the same scale as the variable \(y_t\) we need to take the remaining part into account, modifying equation \eqref{eq:AIC}:
\begin{equation} \label{eq:AICNew}
AIC^{\prime} = 2k 2\ell + 2 \sum_{t=1}^T \log y_t = AIC + 2 \sum_{t=1}^T \log y_t,
\end{equation}
Let’s see an example in R. We will use
longleydata from
datasetspackage. First we construct additive and multiplicative models:
modelAdditive < lm(GNP~Employed,data=longley) modelMultiplicative < lm(log(GNP)~Employed,data=longley)And extract the respective AICs:
AIC(modelAdditive) > 142.7824 AIC(modelMultiplicative) > 44.5661As we see, the values are not comparable. Now let’s modify the second AIC:
AIC(modelMultiplicative) + 2*sum(log(longley$GNP)) > 145.118So, now the values are comparable, and we can conclude that the first model (additive) is better than the second one in terms of AIC.
Similar technique can be used for the other transformed response variables (square root, BoxCox transformation), but the respective distributions of the variables would need to be derived, which is not always a simple task.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Modern Forecasting. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Machine Learning Modelling in R : : Cheat Sheet
(This article was first published on R – The R Trader, and kindly contributed to Rbloggers)
I came across this excellent article lately “Machine learning at central banks” which I decided to use as a basis for a new cheat sheet called Machine Learning Modelling in R. The cheat sheet can be downloaded from RStudio cheat sheets repository.
As the R ecosystem is now far too rich to present all available packages and functions, this cheat sheet is by no means exhaustive . It’s rather a guide to what I consider being the most useful tools available in R for modelling. It also contains a standard modelling workflow which represents my view on how to approach a generic modelling exercise. Any comments comments welcome!
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: R – The R Trader. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
spacings on a torus
(This article was first published on R – Xi'an's Og, and kindly contributed to Rbloggers)
While in Brussels last week I noticed an interesting question on X validated that I considered in the train back home and then more over the weekend. This is a question about spacings, namely how long on average does it take to cover an interval of length L when drawing unit intervals at random (with a torus handling of the endpoints)? Which immediately reminded me of Wilfrid Kendall (Warwick) famous gif animation of coupling from the past via leaves covering a square region, from the top (forward) and from the bottom (backward)…
The problem is rather easily expressed in terms of uniform spacings, more specifically on the maximum spacing being less than 1 (or 1/L depending on the parameterisation). Except for the additional constraint at the boundary, which is not independent of the other spacings. Replacing this extra event with an independent spacing, there exists a direct formula for the expected stopping time, which can be checked rather easily by simulation. But the exact case appears to be add a few more steps to the draws, 3/2 apparently. The following graph displays the regression of the Monte Carlo number of steps over 10⁴ replicas against the exact values:
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
R Tip: Break up Function Nesting for Legibility
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
There are a number of easy ways to avoid illegible code nesting problems in R.
In this R tip we will expand upon the above statement with a simple example.
At some point it becomes illegible and undesirable to compose operations by nesting them, such as in the following code.
head(mtcars[with(mtcars, cyl == 8), c("mpg", "cyl", "wt")]) # mpg cyl wt # Hornet Sportabout 18.7 8 3.44 # Duster 360 14.3 8 3.57 # Merc 450SE 16.4 8 4.07 # Merc 450SL 17.3 8 3.73 # Merc 450SLC 15.2 8 3.78 # Cadillac Fleetwood 10.4 8 5.25One popular way to break up nesting is to use magrittr‘s “%>%” in combination with dplyr transform verbs as we show below.
library("dplyr") mtcars %>% filter(cyl == 8) %>% select(mpg, cyl, wt) %>% head # mpg cyl wt # 1 18.7 8 3.44 # 2 14.3 8 3.57 # 3 16.4 8 4.07 # 4 17.3 8 3.73 # 5 15.2 8 3.78 # 6 10.4 8 5.25Note: the above code lost (without warning) the row names that are part of mtcars. We also pass over the details of how pipe notation works. It is sufficient to say the notational convention is: each stage is approximately treated as an altered function call with a new inserted first argument set to the value of the pipeline up to the current point.
Many R users already routinely avoid nested notation problems through a convention I call “name reuse.” Such code looks like the following.
result < mtcars result < filter(result, cyl == 8) result < select(result, mpg, cyl, wt) head(result)The above convention is enough to get around all problems of nesting. It also has the great advantage that it is stepdebuggable.
I like a variation I call “dot intermediates”, which looks like the code below (notice we are switching back from dplyr verbs, to base R operators).
. < mtcars . < subset(., cyl == 8) . < .[, c("mpg", "cyl", "wt")] result < . head(result) # mpg cyl wt # Hornet Sportabout 18.7 8 3.44 # Duster 360 14.3 8 3.57 # Merc 450SE 16.4 8 4.07 # Merc 450SL 17.3 8 3.73 # Merc 450SLC 15.2 8 3.78 # Cadillac Fleetwood 10.4 8 5.25The dot intermediate convention is very succinct, and we can use it with base R transforms to get a correct (and performant) result. Like all conventions: it is just a matter of teaching, learning, and repetition to make this seem natural, familiar and legible.
library("dplyr") library("microbenchmark") library("ggplot2") timings < microbenchmark( base = { . < mtcars . < subset(., cyl == 8) . < .[, c("mpg", "cyl", "wt")] nrow(.) }, dplyr = { mtcars %>% filter(cyl == 8) %>% select(mpg, cyl, wt) %>% nrow }) print(timings) ## Unit: microseconds ## expr min lq mean median uq max neval ## base 122.948 136.948 167.2253 159.688 179.924 349.328 100 ## dplyr 1570.188 1654.700 2537.2912 1699.744 1785.611 50759.770 100 autoplot(timings)
Durations for related tasks, smaller is better.
Contrary to what many repeat, base R is often faster than the dplyr alternative. In this case the base R is 15 times faster (possibly due to magrittr overhead and the small size of this example). We also see, with some care, base R can be quite legible. dplyr is a useful tool and convention, however it is not the only allowed tool or only allowed convention.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Automate R processes
(This article was first published on bnosac :: open analytical helpers, and kindly contributed to Rbloggers)
Last week we updated the cronR R package and released it to CRAN allowing you to schedule any R code on whichever timepoint you like. The package was updated in order to comply to more stricter CRAN policies regarding writing to folders. Along the lines, the RStudio addin of the package was also updated. It now looks as shown below and is tailored to Data Scientists that want to automate basic R scripts.
The cronR (https://github.com/bnosac/cronR) and taskscheduleR (https://github.com/bnosac/taskscheduleR) R packages are distributed on CRAN and provide the basic functionalities to schedule R code at your specific timepoints of interest. The taskscheduleR R package is designed to schedule processes on Windows, the cronR R package allows to schedule your jobs on Linux or Mac. Hope you enjoy the packages.
If you need support in automating and integrating more complex R flows in your current architecture, feel free to get into contact here.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: bnosac :: open analytical helpers. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Regression Analysis Essentials For Machine Learning
(This article was first published on Easy Guides, and kindly contributed to Rbloggers)
Regression analysis consists of a set of machine learning methods that allow us to predict a continuous outcome variable (y) based on the value of one or multiple predictor variables (x).
Briefly, the goal of regression model is to build a mathematical equation that defines y as a function of the x variables. Next, this equation can be used to predict the outcome (y) on the basis of new values of the predictor variables (x).
Linear regression is the most simple and popular technique for predicting a continuous variable. It assumes a linear relationship between the outcome and the predictor variables.
The linear regression equation can be written as y = b0 + b*x + e, where:
 b0 is the intercept,
 b is the regression weight or coefficient associated with the predictor variable x.
 e is the residual error
Technically, the linear regression coefficients are detetermined so that the error in predicting the outcome value is minimized. This method of computing the beta coefficients is called the Ordinary Least Squares method.
When you have multiple predictor variables, say x1 and x2, the regression equation can be written as y = b0 + b1*x1 + b2*x2 +e. In some situations, there might be an interaction effect between some predictors, that is for example, increasing the value of a predictor variable x1 may increase the effectiveness of the predictor x2 in explaining the variation in the outcome variable.
Note also that, linear regression models can incorporate both continuous and categorical predictor variables.
When you build the linear regression model, you need to diagnostic whether linear model is suitable for your data.
In some cases, the relationship between the outcome and the predictor variables is not linear. In these situations, you need to build a nonlinear regression, such as polynomial and spline regression.
When you have multiple predictors in the regression model, you might want to select the best combination of predictor variables to build an optimal predictive model. This process called model selection, consists of comparing multiple models containing different sets of predictors in order to select the best performing model that minimize the prediction error. Linear model selection approaches include best subsets regression and stepwise regression
In some situations, such as in genomic fields, you might have a large multivariate data set containing some correlated predictors. In this case, the information, in the original data set, can be summarized into few new variables (called principal components) that are a linear combination of the original variables. This few principal components can be used to build a linear model, which might be more performant for your data. This approach is know as principal componentbased methods, which include: principal component regression and partial least squares regression.
An alternative method to simplify a large multivariate model is to use penalized regression, which penalizes the model for having too many variables. The most well known penalized regression include ridge regression and the lasso regression.
You can apply all these different regression models on your data, compare the models and finally select the best approach that explains well your data. To do so, you need some statistical metrics to compare the performance of the different models in explaining your data and in predicting the outcome of new test data.
The best model is defined as the model that has the lowest prediction error. The most popular metrics for comparing regression models, include:
 Root Mean Squared Error, which measures the model prediction error. It corresponds to the average difference between the observed known values of the outcome and the predicted value by the model. RMSE is computed as RMSE = mean((observeds  predicteds)^2) %>% sqrt(). The lower the RMSE, the better the model.
 Adjusted Rsquare, representing the proportion of variation (i.e., information), in your data, explained by the model. This corresponds to the overall quality of the model. The higher the adjusted R2, the better the model
Note that, the above mentioned metrics should be computed on a new test data that has not been used to train (i.e. build) the model. If you have a large data set, with many records, you can randomly split the data into training set (80% for building the predictive model) and test set or validation set (20% for evaluating the model performance).
One of the most robust and popular approach for estimating a model performance is kfold crossvalidation. It can be applied even on a small data set. kfold crossvalidation works as follow:
 Randomly split the data set into ksubsets (or kfold) (for example 5 subsets)
 Reserve one subset and train the model on all other subsets
 Test the model on the reserved subset and record the prediction error
 Repeat this process until each of the k subsets has served as the test set.
 Compute the average of the k recorded errors. This is called the crossvalidation error serving as the performance metric for the model.
Taken together, the best model is the model that has the lowest crossvalidation error, RMSE.
In this Part, you will learn different methods for regression analysis and we’ll provide practical example in R.
The content is organized as follow:
 Regression Analysis
 Regression Model Diagnostics
 Regression Model Validation
 Model Selection Essentials in R
To leave a comment for the author, please follow the link and comment on their blog: Easy Guides. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
ggplot2: How Geoms & Aesthetics ≈ Whipped Cream
(This article was first published on R – TRinker's R Blog, and kindly contributed to Rbloggers)
In this post I have a few goals:
1. Become (re)familiar with available geoms
2. Become (re)familiar with aesthetic mappings in geoms (stroke who knew?)
3. Answer these questions:
 How often do various geoms appear and how often do they have required aesthetics?
 How often do various aesthetics appear and how often are they required?
 What geoms are most similar based on mappings?
Two weeks go I made whipped cream for the first time. In a tweet I lamented not having tried making it earlier in my life:
This was an error that resulted in missing out because of a lack of confidence. Now the opposite tale. Missing out because of false confidence. I know ggplot2…there’s nothing new to learn. Recently I realized it’s been too long since I’ve reread the documentation. I’m betting my time making this blog post that there’s a few more like me.
I’m teaching an upcoming analysis course. To prepare I’m reading and rereading many important texts including R for Data Science. In my close read I noticed that some ggplot2 functions have a stroke aesthetic.
Didn’t know that…I figure I needed to spend a bit more time with the documentation and really get to know geoms and aesthetics that I may have overlooked.
What’s an aesthetic Anyways?ggplot2’s author, Hadley, states:
In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars)….aesthetics…are the properties that can be perceived on the graphic. Each aesthetic can be mapped to a variable, or set to a constant value. pp. 4; 176
So without aesthetics, geoms can’t represent data.
Getting the Geoms and Their AestheticsBefore getting started note that wordpress.com destroys code. To get the undestroyed version of code from this blog use this gist: https://gist.github.com/trinker/a03c24a7fe9603d3091f183ffe4781ab
I wasn’t sure how I was going to accomplish this task but a little digging showed that it was pretty easy. First I used Dason Kurkiewicz & I’s pacman package to get all the geoms from ggplot2. This is where I first hit a road block. How do I get the aesthetics for a geom? I recalled that the documentation tells you the aesthetics for each geom. I noticed that in the roxygen2 markup the following line that creates the aesthetics reference in the documentation:
@eval rd_aesthetics("geom", "bar")That allowed me to follow the bread crumbs to make the following code to grab the aesthetics per geom and a flag for when an aesthetic is required:
if (!require("pacman")) install.packages("pacman") pacman::p_load(pacman, ggplot2, dplyr, textshape, numform, tidyr, lsa, viridis) ## get the geoms geoms < gsub('^geom_', '', grep('^geom_', p_funs(ggplot2), value = TRUE)) ## function to grab the aesthetics get_aesthetics < function(name, type){ obj < switch(type, geom = ggplot2:::find_subclass("Geom", name, globalenv()), stat = ggplot2:::find_subclass("Stat", name, globalenv())) aes < ggplot2:::rd_aesthetics_item(obj) req < obj$required_aes data_frame( aesthetic = union(req, sort(obj$aesthetics())), required = aesthetic %in% req, ) } ## loop through and grab the aesthetics per geom aesthetics_list < lapply(geoms, function(name){ type < 'geom' name < switch(name, jitter = 'point', freqpoly = 'line', histogram = 'bar', name ) out < try(get_aesthetics(name, 'geom'), silent = TRUE) if (inherits(out, 'tryerror')) out % setNames(geoms) ## convert the list of data.frames to one tidy data.frame aesthetics % tidy_list('geom') %>% tbl_df() aesthetics geom aesthetic required 1 abline slope TRUE 2 abline intercept TRUE 3 abline alpha FALSE 4 abline colour FALSE 5 abline group FALSE 6 abline linetype FALSE 7 abline size FALSE 8 area x TRUE 9 area y TRUE 10 area alpha FALSE # ... with 341 more rows Geoms: Getting AcquaintedFirst I wanted to get to know geoms again. There are currently 44 of them.
length(unique(aesthetics$geom)) ## 44This bar plot details the geoms and how many aesthetics are optional/required.
geom_ord % count(geom) %>% arrange(n) %>% pull(geom) aesthetics %>% mutate(geom = factor(geom, levels = geom_ord)) %>% ggplot(aes(geom, fill = required)) + geom_bar() + coord_flip() + scale_y_continuous(expand = c(0, 0), limits = c(0, 15)) + scale_fill_manual(name = 'Required', values = c('gray88', '#1C86EE'), labels = f_response) + theme_minimal() + theme( panel.grid.major.y = element_blank(), axis.text = element_text(color = 'grey55', size = 10), legend.key.size = unit(.35, 'cm'), legend.title = element_text(color = 'grey30', size = 10), legend.position = c(.76, .1), axis.title = element_text(color = 'gray55') ) + labs( x = 'Geom', y = 'Count', title = 'Count of Geom Aesthetics', subtitle = 'Distribution of geom aesthetic count filled by requirement flag.' )Some interesting things come out. Most geoms have 2ish required aesthetics. The boxplot geom has the most required and unrequired aesthetics. Sensibly, a blank geom requires no aesthetics. I wanted to see what all of these aesthetics were for the boxplot. Some quick dplyring has us there in no time.
aesthetics %>% filter(geom == 'boxplot') geom aesthetic required 1 boxplot x TRUE 2 boxplot lower TRUE 3 boxplot upper TRUE 4 boxplot middle TRUE 5 boxplot ymin TRUE 6 boxplot ymax TRUE 7 boxplot alpha FALSE 8 boxplot colour FALSE 9 boxplot fill FALSE 10 boxplot group FALSE 11 boxplot linetype FALSE 12 boxplot shape FALSE 13 boxplot size FALSE 14 boxplot weight FALSESeems that x is the only aesthetic that is truly required. The other “required” ones are computed if you just supply x.
Aesthetics: May I join You?Now time to get to know aesthetics. Are there others like stroke that I’ve overlooked? There are 36 aesthetics currently. I see right away that there is a weight aesthetic I’ve never seen before.
length(unique(aesthetics$aesthetic)) ## 36 aes_ord % count(aesthetic) %>% arrange(n) %>% pull(aesthetic) aesthetics %>% mutate(aesthetic = factor(aesthetic, levels = aes_ord)) %>% ggplot(aes(aesthetic, fill = required)) + geom_bar() + scale_y_continuous(expand = c(0, 0), limits = c(0, 45), breaks = seq(0, 45, by = 5)) + scale_fill_manual(name = 'Required', values = c('gray88', '#1C86EE'), labels = f_response) + theme_minimal() + theme( panel.grid.major.y = element_blank(), panel.grid.minor = element_blank(), axis.text = element_text(color = 'grey55', size = 10), legend.key.size = unit(.35, 'cm'), legend.title = element_text(color = 'grey30', size = 10), legend.position = c(.83, .1), axis.title = element_text(color = 'gray55') ) + labs( x = 'Aesthetic', y = 'Count', title = 'Count of Aesthetics', subtitle = 'Distribution of aesthetics filled by requirement flag.' ) + coord_flip()It seems from this plot that almost every geom requires an x/y position aesthetic. That makes sense. What doesn’t make sense are cases besides geom_blank that don’t require x/y.
aesthetics %>% filter(aesthetic %in% c('x', 'y') & !required) geom aesthetic required 1 count y FALSE 2 qq x FALSE 3 qq y FALSE 4 rug x FALSE 5 rug y FALSEThis dplyr summarize/filter shows where an x/y are possible aesthetics but not required. Sensible. But are there some geoms that x/y aren’t even in their mapping?
aesthetics %>% group_by(geom) %>% summarize( has_x = 'x' %in% aesthetic, has_y = 'y' %in% aesthetic ) %>% filter(!has_x  ! has_y) %>% arrange(has_x) geom has_x has_y 1 abline FALSE FALSE 2 blank FALSE FALSE 3 hline FALSE FALSE 4 map FALSE FALSE 5 rect FALSE FALSE 6 vline FALSE FALSE 7 boxplot TRUE FALSE 8 errorbar TRUE FALSE 9 linerange TRUE FALSE 10 ribbon TRUE FALSEDig into the documentation if you want to make sense of why these geoms don’t require x/y positions.
Geoms & Aesthetics IntersectOK so we’ve explored variability in geoms and aesthetics a bit…let’s see how they covary. The heatmap below provides an understanding of what geoms utilize what aesthetics and if they are required. NA means that the aesthetic is not a part of the geom’s mapping.
boolean % count(geom, aesthetic) %>% spread(aesthetic, n, fill = 0) boolean %>% gather(aesthetic, value, geom) %>% left_join(aesthetics, by = c('geom', 'aesthetic')) %>% mutate( aesthetic = factor(aesthetic, levels = rev(aes_ord)), geom = factor(geom, levels = rev(geom_ord)) ) %>% ggplot(aes(y = aesthetic, x = geom, fill = required)) + geom_tile(color = 'grey95') + scale_fill_manual(name = 'Required', values = c('gray88', '#1C86EE'), labels = f_response) + theme_minimal() + theme( axis.ticks = element_blank(), axis.text.y = element_text(size = 8, margin = margin(r = 3)), axis.text.x = element_text(size = 8, hjust = 1, vjust = 1, angle = 45, margin = margin(t = 3)), legend.key.size = unit(.35, 'cm'), legend.title = element_text(color = 'grey30', size = 10), panel.grid = element_blank(), axis.title = element_text(color = 'gray55') ) + labs( title = 'Geom & Aesthetic Cooccurrence', subtitle = NULL, y = 'Aesthetic', x = 'Geom' )Here we can see that weight aesthetic again. Hmmm, boxplot has it and other, mostly univariate functions as well. The documentation is a bit sparse on it. Hadley says: https://github.com/tidyverse/ggplot2/issues/1893
Also there are two geoms that are just now catching my eye…geom_count & geom_spoke. I’ll come back to them later.
Also, there are a few aesthetics like height and intercept that are one time use only aesthetics. Also, the bottom 10 aesthetics are used, by far, the most frequently. For beginners, these are the ones that will really pay to learn quickly.
Geom SimilarityI thought it my be fun to use the geoms aesthetics to see if we could cluster aesthetically similar geoms closer together. The heatmap below uses cosine similarity and heirarchical clustering to reorder the matrix that will allow for like geoms to be found closer to one another (note that today I learned from “R for Data Science” about the seriation package [https://cran.rproject.org/web/packages/seriation/index.html] that may make this matrix reordering task much easier).
boolean %>% column_to_rownames() %>% as.matrix() %>% t() %>% cosine() %>% cluster_matrix() %>% tidy_matrix('geom', 'geom2') %>% mutate( geom = factor(geom, levels = unique(geom)), geom2 = factor(geom2, levels = unique(geom2)) ) %>% group_by(geom) %>% ggplot(aes(geom, geom2, fill = value)) + geom_tile() + scale_fill_viridis(name = bquote(cos(theta))) + theme( axis.text.y = element_text(size = 8) , axis.text.x = element_text(size = 8, hjust = 1, vjust = 1, angle = 45), legend.position = 'bottom', legend.key.height = grid::unit(.2, 'cm'), legend.key.width = grid::unit(.7, 'cm'), axis.title = element_text(color = 'gray55') ) + labs( title = "ggplot2 Geom Cosine Similarity", subtitle = 'Geoms with similar aesthetics are clustered together along the diagonal.', x = 'Geom', y = 'Geom' )Looking at the bright square clusters along the diagonal is a good starting place for understanding which geoms tend to aesthetically cluster together. Generally, this ordering seems pretty reasonable.
Aesthetic SimilarityI performed the same analysis of aesthetics, asking which ones tended to be used within the same geoms.
boolean %>% column_to_rownames() %>% as.matrix() %>% cosine() %>% {x < .; diag(x) % cluster_matrix() %>% tidy_matrix('aesthetic', 'aesthetic2') %>% mutate( aesthetic = factor(aesthetic, levels = unique(aesthetic)), aesthetic2 = factor(aesthetic2, levels = unique(aesthetic2)) ) %>% group_by(aesthetic) %>% ggplot(aes(aesthetic, aesthetic2, fill = value)) + geom_tile() + scale_fill_viridis(name = bquote(cos(theta))) + theme( axis.text.y = element_text(size = 8) , axis.text.x = element_text(size = 8, hjust = 1, vjust = 1, angle = 45), legend.position = 'bottom', legend.key.height = grid::unit(.2, 'cm'), legend.key.width = grid::unit(.7, 'cm'), axis.title = element_text(color = 'gray55') ) + labs( title = "ggplot2 Aesthetics Cosine Similarity", subtitle = f_wrap(c( 'Aesthetics that tend to be used together within the same geoms are clustered together along the diagonal.' ), width = 95, collapse = TRUE ), x = 'Aesthetic', y = 'Aesthetic' )The result is pretty sensible. The lower left corner has the largest cluster which seems to be related to text based geoms. The next cluster up and to the right one, has group, size, x, y, etc. This seems to be the most common set of typically geometric aesthetics. The upper, lower, middle cluster is specific to the boxplot summary stat. Stroke and shape as a cluster are related to geoms that are point based.
Geom Similarity: Required AestheticsThe last clustering activity I wanted was to reduce the seahetics to jsut required (as we might assume these are the truest attributes of a geom) and see which geoms cluster from that analysis.
boolean2 % filter(required) %>% count(geom, aesthetic) %>% spread(aesthetic, n, fill = 0) boolean2 %>% column_to_rownames() %>% as.matrix() %>% t() %>% cosine() %>% cluster_matrix() %>% tidy_matrix('geom', 'geom2') %>% mutate( geom = factor(geom, levels = unique(geom)), geom2 = factor(geom2, levels = unique(geom2)) ) %>% group_by(geom) %>% ggplot(aes(geom, geom2, fill = value)) + geom_tile() + scale_fill_viridis(name = bquote(cos(theta))) + theme( axis.text.y = element_text(size = 8) , axis.text.x = element_text(size = 8, hjust = 1, vjust = 1, angle = 45), legend.position = 'bottom', legend.key.height = grid::unit(.2, 'cm'), legend.key.width = grid::unit(.7, 'cm'), axis.title = element_text(color = 'gray55') ) + labs( title = "ggplot2 Geom Cosine Similarity", subtitle = 'Geoms with similar aesthetics are clustered together along the diagonal.', x = 'Geom', y = 'Geom' )This seemed less interesting. I didn’t really have a plausible explanation for what patterns did show up and for the most part, clusters became really large or really small. I’m open to others’ interpretations.
A Few Un/Rediscovered ggplot2 FeaturesI did want to learn more about geom_count & geom_spoke in the remaining exploration.
?geom_count ggplot(diamonds, aes(x = cut, y = clarity)) + geom_count(aes(size = ..prop.., group = 1)) + scale_size_area(max_size = 10)Oh yeah! Now I remember. A shortcut to make the geom_point bubble plot for investigating categorical variable covariance as an alternative to the heatmap.
?geom_spoke df < expand.grid(x = 1:10, y=1:10) df$angle < runif(100, 0, 2*pi) df$speed < runif(100, 0, sqrt(0.1 * df$x)) ggplot(df, aes(x, y)) + geom_point() + geom_spoke(aes(angle = angle), radius = 0.5)Interesting. The documentation says:
…useful when you have variables that describe direction and distance.
Not sure if I have a use case for my own work. But I’ll store it in the vault (but better than I remembered geom_count).
Learning More About Geoms & AestheticsI wanted to leave people with a quick reference guide that RStudio has kindly provided to help give quick reference to geoms and aesthetics and whe to use them.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – TRinker's R Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Power analysis for longitudinal multilevel models: powerlmm 0.2.0 is now out on CRAN
(This article was first published on R Psychologist  R, and kindly contributed to Rbloggers)
My R packge powerlmm 0.2.0 is now out on CRAN. It can be installed from CRAN https://cran.rproject.org/package=powerlmm or GitHub https://github.com/rpsychologist/powerlmm.
Changes in version 0.2.0 New features Analytical power calculations now support using Satterthwaite’s degrees of
freedom approximation.  Simulate.plcp will now automatically create lme4 formulas if none is
supplied, see ?create_lmer_formula.  You can now choose what alpha level to use.
 Treat cluster sizes as a random variable, uneqal_clusters now accepts
a function indicating the distribution of cluster sizes, via the new argument
func, e.g. rpois or rnorm could be used to draw cluster sizes.  Expected power for designs with parameters that are random variables,
can be calculated by averaging over multiple realizations, using the
argument R.  Support for parallel computations on Microsoft Windows, and in GUIs/interactive
environments, using parallel::makeCluster (PSOCK). Forking is still used for
noninteractive Unix environments.
 Calculations of the variance of the treatment effect is now much faster for
designs with unequal clusters and/or missing data, when cluster sizes are
large. The calculations now use the much faster implementation used by lme4.  Cleaner printmethods for plcp_multiobjects.
 Multiple power calculations can no be performed in parallel, via the
argument cores.  simulate.plcp_multi now have more options for saving intermediate results.
 print.plcp_multi_power now has better support for subsetting via either [],
head(), or subset().
 icc_pre_subject is now defined as (u_0^2 + v_0^2) / (u_0^2 + v_0^2 + error^2),
instead of (u_0^2) / (u_0^2 + v_0^2 + error^2). This would be the subjectlevel ICC,
if there’s no random slopes, i.e. correlation between time points for the same subject.  study_parameters(): 0 and NA now means different things. If 0 is passed, the parameters
is kept in the model, if you want to remove it specify it as NA instead.  study_parameters(): is now less flexible, but more robust. Previously a large
combination if raw and relative parameters could be combined, and the individual
parameters was solved for. To make the function less bug prone and easier to maintain,
it is now only possible to specify the clusterlevel variance components as relative values,
if the other parameters as passed as raw inputs.
 Output from simulate_data() now includes a column y_c that contains the full outcome vector,
without missing values added. This makes it easy to compare the complete and incomplete
data set, e.g. via simulate().  simulate() new argument batch_progress enables showing progress when doing
multiple simulations.  Fix bug in summary.plcp_sim where the wrong % convergence was calculated.
 Simulation function now accepts lme4 formulas containing “”.
 The clusterlevel intercept variance is now also set to zero in the control
group, when a partially nested design is requested.  Fix incorrect error message from study_parameters when
icc_cluster_pre = NULL and all inputs are standardized.  Fix bug that would cause all slopes to be zero when var_ratio argument was
passed a vector of values including a 0, e.g. var_ratio = c(0, 0.1, 0.2).  Fix bug for multisim objects that caused the wrong class the be used for,
e.g. res[[1]]$paras, and thus the single simulation would not print
correctly.  Results from multisim objects can now be summarized for all random effects
in the model.  More support for summarizing random effects from partially nested formulas,
e.g. cluster_intercept and cluster_slope is now correctly extracted from
(0 + treatment + treatment:time  cluster).  When Satterthwaite’s method fails the between clusters/subjects DFs
are used to calculate pvalues.  Power.plcp_multi is now exported.
 get_power.plcp_multi now shows a progress bar.
 Fix a bug that caused dropout to be wrong when one condition had 0 dropout, and
deterministic_dropout = FALSE.
To leave a comment for the author, please follow the link and comment on their blog: R Psychologist  R. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
NIMBLE webinar Friday April 13
(This article was first published on R – NIMBLE, and kindly contributed to Rbloggers)
We’ll be presenting a webinar on NIMBLE, hosted by the Eastern North America Region of the International Biometric Society. Details are as follows. Programming with hierarchical statistical models: An introduction to the BUGScompatible NIMBLE system for MCMC and more Friday, April 13, 2018 11:00 a.m. – 1:00 p.m. EST Must register before April 12. You can register here. (You’ll need to create an account on the ENAR website and there is a modest fee – from $25 for ENAR student members up through $85 for nonIBS members.) This webinar will introduce attendees to the NIMBLE system for programming with hierarchical models in R. NIMBLE (rnimble.org) is a system for flexible programming and dissemination of algorithms that builds on the BUGS language for declaring hierarchical models. NIMBLE provides analysts with a flexible system for using MCMC, sequential Monte Carlo and other techniques on userspecified models. It provides developers and methodologists with the ability to write algorithms in an Rlike syntax that can be easily disseminated to users. C++ versions of models and algorithms are created for speed, but these are manipulated from R without any need for analysts or algorithm developers to program in C++.While analysts can use NIMBLE as a dropin replacement for WinBUGS or JAGS, NIMBLE provides greatly enhanced functionality in a number of ways. The webinar will first show how to specify a hierarchical statistical model using BUGS syntax (including userdefined function and distributions) and fit that model using MCMC (including user customization for better performance). We will demonstrate the use of NIMBLE for biostatistical methods such as semiparametric random effects models and clustering models. We will close with a discussion of how to use the system to write algorithms for use with hierarchical models, including building and disseminating your own methods.
Presenter:
Chris Paciorek
Adjunct Professor, Statistical Computing Consultant
Department of Statistics, University of California, Berkeley
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: R – NIMBLE. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Multiple Versions of R
(This article was first published on R Views, and kindly contributed to Rbloggers)
Data scientists prefer using the latest R packages to analyze their data. To ensure a good user experience, you will need a recent version of R running on a modern operating system. If you run R on an production server – and especially if you use RStudio Connect – plan to support multiple versions of R side by side so that your code, reports, and apps remain stable over time. You can support multiple versions of R concurrently by building R from source. Plan to install a new version of R at least once per year on your servers.
A solid foundation for RAdministering R on the desktop is relatively easy, because desktops are designed for a single user at a specific time. Desktop users upgrade R versions and R packages as new software becomes available, leaving old versions and packages behind. Servers, on the other hand, are designed to support multiple people who want to access content across time. Servers are increasingly used for building data science labs in R, deploying R in production, and running R in the cloud. You may find that the same strategies you use to administer R on your desktop do not work as well on a server. In particular, upgrading your version of R must be handled differently.
If you upgrade R on your server as you do on your desktop, you could easily break some apps and disrupt your teams. Administrators should exercise caution when upgrading to a new version of R on a Linux server. Consider the following situations:
 You are hosting apps on RStudio Connect and Shiny Server for more than a year. When you upgrade R, you break many of your older apps.
 Your team is developing code on a shared instance of RStudio Server. When you upgrade R, you disrupt people’s work and break their code.
Instead of upgrading your existing version of R, a better solution to these problems is to run multiple versions of R side by side. This strategy preserves past versions of R so you can manage upgrades and keep your code, apps, and reports stable over time.
Building R from sourceThe best way to run multiple versions of R side by side is to build R from source. If you are running R on a Linux server – and particularly in the enterprise – you should always build R from source, because it will help you:
 Run multiple versions of R side by side
 Guarantee that R will work on your unique server configuration
 Potentially speed up certain lowlevel computations used by R
 Build technical expertise that will help you administer R at scale
Most enterprise IT departments will be comfortable building software from source. If you have never built R from source, it is very straightforward. First, you need the build dependencies for R. If you’ve already installed R from a binary source like CRAN or EPEL, you may already have these dependencies installed; otherwise, you can run sudo yumbuilddep R on RedHat or sudo aptget builddep rbase on Ubuntu. Second, you should obtain and unpack the source tarball for the version of R you want to install from CRAN. Third, from within the extracted source directory, build R from source using configure, make, and make install commands. For example:
# BUILD R FROM SOURCE ON REDHAT LINUX # R3.4.3 # Install Linux dependencies $ sudo yumbuilddep R # Download and extract source code $ wget https://cran.rproject.org/src/base/R3/R3.4.3.tar.gz $ tar xzvf R3.4.3.tar.gz $ cd R3.4.3 # Build R from source $ ./configure prefix=/opt/R/$(cat VERSION) enableRshlib withblas withlapack $ make $ sudo make installThis script installs R version 3.4.3 into /opt/R/3.4.3, but you can install R into any of the recommended directories. The enableRshlib option is required to make the shared libraries known to RStudio. The withblas and withlapack options are not required, but are commonly included. These options install the system BLAS and LAPACK libraries, which are used to speed up certain lowlevel math computations (e.g., multiplying and inverting matrices). These libraries will not speed up R itself, but can significantly speed up the underlying code execution.
If you run into problems installing R from source, you can always remove the installation directory and start over. However, once the installation succeeds, you should never move the installation directory – in other words, always install into the final destination directory. If you run into problems with dependencies, make sure you are able to identify and install all of the required Linux libraries (e.g., the X11 library is commonly overlooked). Building R from source will be much easier with a modern operating system that is connected to the Internet.
For further details about building R from source, see the RStudio Server Admin Guide.
RStudio professional productsRStudio professional products automatically support multiple versions of R and provide additional features, such as having administrators control access to multiple versions, or allowing users to choose for themselves. RStudio Connect automatically provides R version matching. Running multiple versions of R side by side with RStudio Connect will ensure that your content persists over time.
References Installing multiple versions of R on Linux
 Upgrading to a new version of R
 Multiple versions of R with RStudio Server Pro
 Multiple versions of R with Shiny Server Pro
 Multiple versions of R with RStudio Connect
_____='https://rviews.rstudio.com/2018/03/21/multipleversionsofr/';
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R Views. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
R and Docker
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
If you regularly have to deal with specific versions of R, or different package combinations, or getting R set up to work with other databases or applications then, well, it can be a pain. You could dedicate a special machine for each configuration you need, I guess, but that's expensive and impractical. You could set up virtual machines in the cloud which works well for oneoff situations, but gets tedious having to reconfigure a new VM each time. Or, you could use Docker containers, which were expressly designed to make it quick easy to configure and launch an independent and secure collection of software and services.
If you're new to the concept of Docker containers, here's a docker tutorial for data scientists. But the concepts are pretty simple. At Docker hub, you can search "images" — basically, bundles of software with preconfigured settings — contributed by the community and by vendors. (You'll be referring to the images by name, for example: rocker/rbase.) You can then create a "container" (a running instance of that image) on your machine with the docker application, or in the cloud using the tools offered by your provider of choice.
For R users, there's a wide array of preconfigured Docker images for R available since 2014, thanks to the Rocker project. You can browse the rocker repository at Docker Hub to see everything available, but it includes:
 Simple images containing just the latest official R release or the latest daily R build.
 Images containing both R and RStudio Server.
 Images with the tidyverse suite of packages preinstalled.
 Versionstable images, snapshotted to specific R (and RStudio versions) and the R package ecosystem at specific points in time. If you retrieve one of these images using a tag, your docker image will always include the same software, even months or years down the line. These are perfect for production instances, where reproducibility is paramount.
I find the images containing RStudio Server super convenient whenever I need to try out something in a specific R version. All I need to do is provide the image name to Azure Container Instances, and make sure port 8787 is open:
Creating a container for R 3.4.1 with the tidyverse packages and RStudio Server.Be sure to open port 8787 here for browser access.
That's it for the configuration, and after the instance is ready (about 2 minutes) I can use a web browser to visit http://40.121.205.121:8787/ to find a completely fresh R instance and the RStudio IDE. (The actual IP address will be provided for you by Container Instances, and can be found in the Overview section for your instance in the Azure Portal.)
You can of course use other cloud providers as well: Andre Heiss provides this guide for setting up a rocker image in Digital Ocean, and also provides some handy tips for creating your own Docker Files to create custom images of your own design. For more on the Rocker project, follow the link below.
The Rocker Project: Docker Containers for the R Environment
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Building the Olympics blog: tidy data preparation
(This article was first published on That’s so Random, and kindly contributed to Rbloggers)
Last week I did an analysis of gold diggers at the Olympics. Here I take you along how I scraped the data from wikepedia and subsequently used the tools of the tidyverse to get the data in a format in which they can be analyzed. You are invited to copypaste the code and see how the data is gradually getting into a shape in which we can analyze it. I will often not show the data in the intermediate steps, because this will clog the blog. Copy the code to find out for yourself.
Scraping the dataWeb scraping in R can be done using the rvest package. There are several introductions to this package, such as this one. You can use either the Inspector Gadget to find out the html tags of the data you want to scrape, or inspect elements in the web browser. I prefer the latter. These data are published on Wikipedia, for each sport you will find the results in a separate table. It appears that the names of the sports all have h2 tags, the data of the medal winners all have a table class. First we obtain the names of the sports
library(rvest) library(tidyverse) library(stringr) Olympics_2018_wiki < read_html("https://en.wikipedia.org/wiki/List_of_2018_Winter_Olympics_medal_winners") sports < html_nodes(Olympics_2018_wiki, "h2") %>% html_text() sports %>% head(3) ## [1] "Alpine skiing[edit]" "Biathlon[edit]" "Bobsleigh[edit]"Note that you have to use html_text to convert the xml nodes to regular R characters. I am not showing the full output here, but it appears that the first fifteen elements contain the sport’s names. Also, we still need to get rid of the “[edit]” part of the strings.
sports < sports %>% `[`(1:15) %>% str_split("\\[") %>% map_chr(1)In the above I call the subsetting operator [ as a prefix function, rather than its usual usage object[index]. (Remember everything in R is a function, even when it appears not be). This to make it pipeable. Every infix operator can be used in this way. If you find this unaesthetic, magrittr has alternative functions, such as extract to do subsetting. I personally find this a nice use of infixes. Next, we need to get rid of the “[edit]” after each name. We use stringr’s str_split to split on each “[”. This results in a list of character vectors, each vector of length 2. purrr::map_chr will select the first element of each vector and store the result in a single character vector.
Now, the tables with the sports
medals < html_nodes(Olympics_2018_wiki, "table") %>% html_table() %>% `[`(3:26)I checked manually that the 3rd up until the 26th table contain the results. Now, if you visit the wiki page you will notice that some sports have a single results table (such as Curling), while others have several (for women’s, for men’s, and some even for mixed events). I counted on the website the number of tables for each sport.
tables_by_sport < c(3, 3, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2)The next step is flatten these little data frames to one, and add the sport of each event as a column.
medals_tbl < medals %>% map(~select(.x, Gold, Silver, Bronze)) %>% map2_df(rep(sports, tables_by_sport), ~mutate(.x, sport = .y)) %>% as_data_frame()At the first step we apply dplyr’s select to obtain the three columns we are interested in, with map this is applied to each of the little data frames. Next we use map2_df, this will iterate over two vectors instead of one, to add the sports names to each of the data frames with mutate. Note that by using map2_df, we bind all the little data frames into one big data frame right away. The result is of class data.frame. Since I like to work with tibbles, I coerce it at the final line.
medals_tbl %>% select(Gold) %>% slice(c(4, 16)) ## # A tibble: 2 x 1 ## Gold ## ## 1 "Andr\u00e9 Myhrer\u00a0Sweden" ## 2 "Sweden\u00a0(SWE)Peppe FemlingJesper NelinSebastian SamuelssonFredrik LindNow we have a little challenge. We want to distill the country names from the strings. However, the names are sometimes at the start of the string and other times at the end. Splitting and selecting the nth element, like before, won’t work here. To solve this we match every string to every country that won a medal, like so.
medals_tbl %>% select(Gold) %>% slice(c(4, 16)) %>% unlist() %>% str_detect("Sweden") ## [1] TRUE TRUEFirst we need to get all the countries that won a medal from the wiki page with the total medal table. The scraping and cleaning is very similar to the ones we already did.
medal_table_site < read_html("https://en.wikipedia.org/wiki/2018_Winter_Olympics_medal_table") medal_table < medal_table_site %>% html_nodes("table") %>% `[`(2) %>% html_table(fill = TRUE) %>% `[[`(1) countries_with_medal < medal_table %>% pull(NOC) %>% `[`(31) %>% # last element is not a country str_sub(end = 7) # It got an asterix added to the name countries_with_medal[7] < "South Korea"Now, str_detect is vectorized over the characters, but it can match to only one pattern at the time. We can wrap over all country names with map_lgl. Getting a logical vector that is TRUE for the country present, and FALSE for all the others. This vector can then subsequently be used to subset the country name.
detect_country < function(string_with_country, name_vec) { ind < map_lgl(name_vec, ~str_detect(string_with_country, pattern = .x)) countries_with_medal[ind] } detect_country("Andr MyhrerSweden", countries_with_medal) ## [1] "Sweden"This works for a single character, like in the example, in order to get it to work on an entire vector we have to wrap it in map_chr.
detect_country_vec < function(country_vec, name_vec = countries_with_medal) { map_chr(country_vec, detect_country, name_vec) }Now, that is a nice and clean function to extract the country names. However, there are some lines that spoil the party.
medals_tbl %>% slice(23:24) ## # A tibble: 2 x 4 ## Gold Silver ## ## 1 "Canada\u00a0(CAN)Justin KrippsAlexander Kopacz" Not awarded ## 2 "Germany\u00a0(GER)Francesco FriedrichThorsten Margis" Not awarded ## # ... with 2 more variables: Bronze , sportWe have a shared Gold in the bobsleigh, spread over two lines. For the Gold itself it doesn’t cause a problem, but for the Silver the function breaks, and for the Bronze Latvia would be counted twice for one medal.
medals_tbl %>% slice(62) ## # A tibble: 1 x 4 ## Gold ## ## 1 "Tobias Wendl\nand Tobias Arlt\u00a0(GER)" ## # ... with 3 more variables: Silver , Bronze , sportAnother spoiler. For some reason in the luge abbreviations are used instead of the full names. Pfff, we have no match here. There are several more exceptions that make our function break.
During this type of analyses you are almost always confronted with the choice between manual labor and automation (writing a general purpose function) several times. I use the following heuristics for this choice:
1) How much more work takes automation compared to manual labor? If little, automate.
2) Is the code likely to be run on data other than the current? If yes, probably automate.
3) Is a general function portable to and useful in other projects? If yes, most definitively automate.
In this case, should we incorporate the exceptions in the function, or do we just do them by hand? It is a lot more work to automate because of the many different exceptions. No, we are not expecting new data to flow through this. And finally, these problems seem very specific for this problem, a general purpose function is not likely to make our future life easier. Manual labor it is. By trial and error we find the problem lines, discard them, apply the function and add the countries for the problem lines manually.
problem_lines < c(23:26, 34, 41, 45, 46, 62) happy_flow < medals_tbl[problem_lines, ] %>% mutate_at(.vars = vars(Gold, Silver, Bronze), .funs = detect_country_vec) hand_work < medals_tbl[problem_lines, ] %>% mutate(Gold = c("Canada", "Germany", "Germany", NA, "Norway", "Sweden", "Germany", "Canada", "Germany"), Silver = c(NA, NA, "South Korea", "Germany", "Sweden", "South Korea", "China", "France", "Austria"), Bronze = c("Latvia", NA, NA, NA, "Norway", "Japan", "Canada", "United States", "Germany")) %>% bind_rows(data_frame(Gold = NA, Silver = NA, Bronze = "Finland", sport = "Crosscountry skiing")) medal_set < bind_rows(happy_flow, hand_work) %>% gather(medal, country, sport) %>% filter(!is.na(country))Note the use of the nice mutate_at function. In one single line we replace the original strings by the values extracted by our function for all three columns. In the final lines gather (tidyr package) is applied to transform the data from wide to long. Each row is now a medal won.
medal_set %>% head(2) ## # A tibble: 2 x 3 ## sport medal country ## ## 1 Alpine skiing Gold Norway ## 2 Alpine skiing Gold AustriaFinally, since we are interested in the number of medals per sport per country. We can already aggregated.
medal_set_sum < medal_set %>% count(sport, medal, country) %>% rename(medal_count = n) medal_set_sum %>% head() ## # A tibble: 6 x 4 ## sport medal country medal_count ## ## 1 Alpine skiing Bronze Austria 2 ## 2 Alpine skiing Bronze France 2 ## 3 Alpine skiing Bronze Italy 1 ## 4 Alpine skiing Bronze Liechtenstein 1 ## 5 Alpine skiing Bronze Norway 2 ## 6 Alpine skiing Bronze Switzerland 2I showed you how the great tidyverse toolbox can be used to get raw data from the web, and transform it into a clean set that is ready for analysis.
Some remarks on web scrapingA disadvantage of using web data as a source, is that the layout of the data might change. My pipeline broke several times, because changes were made to the wiki. Because of this, it is not assured this code will run in future times. For this example I kept the pipeline live, because I wanted to do this blog post including the scraping. However, it would have saved me a good deal of trouble if I stored the set in a csv the first time I had a proper version of the data.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: That’s so Random. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Introducing tabr: guitar tabs with R
(This article was first published on rbloggers – SNAP tech blog, and kindly contributed to Rbloggers)
This post introduces a new R package I am working on called tabr for creating guitar tablature (“tabs”) from R code. The tabr package provides programmatic music notation and a wrapper around LilyPond for creating quality guitar tablature.
[Click here for original post] (in process of switching where I post to RBloggers from)
tabr offers functions for describing and organizing musical structures and wraps around the LilyPond backend. LilyPond is an open source music engraving program for generating high quality sheet music based on markup syntax. tabr generates files following the LilyPond markup syntax to be subsequently processed by LilyPond into sheet music.
A standalone LilyPond (.ly) file can be created or the package can make a system call to LilyPond directly to render the guitar tablature output (pdf or png). While LilyPond caters to sheet music in general, tabr is focused on leveraging it specifically for creating quality guitar tablature.
While music can be quite complex and a full score will be much longer, something as simple as the following code snippet produces the music notation in the accompanying image.
p("r a2 c f d a f c4", "4 8*6 1") %>% track %>% score %>% tab("out.pdf")Installation
You can install tabr from GitHub with:
# install.packages('devtools') devtools::install_github("leonawicz/tabr") Basic exampleAs a brief example, recreate the tablature shown in the tabr logo, which is almost the same as the first measure in the code example above. It has a tiny bit more in the form of metadata and doesn’t take as many shortcuts, but it’s still short. Here are the steps.
 Define a musical phrase with phrase or the shorthand alias p.
 Add the phrase to a track.
 Add the track to a score.
 Render the score to pdf with tab.
A phrase here does not require a strict definition. Think of it as the smallest piece of musical structure you intend to string together. The first argument to phrase is a string describing notes of a specific pitch (or rests: “r”), separated in time by spaces. For chords, just remove spaces to indicate simultaneous notes. Integers are appended to indicate the octave number so that the pitch is unique. For example, a rest followed by a sequence of notes might be given by notes = "r a2 c3 f3 d3 a3 f3".
The second argument is a similar string giving note metadata. In this example there is nothing to add but the time durations. Whole notes taking up an entire measure of music are given by 1, half notes by 2, quarter notes 4, eighth notes 8, and so on. To specify a quarter note rest followed by a sequence of eighth notes, use info = "4 8 8 8 8 8 8". This basic example does not require specifying additional note information such as dotted notes for different fractions of time, staccato notes, ties/slurs, slides, bends, hammer ons and pull offs, etc. These specifications are currently available in tabr to varying degrees of development and are covered in the vignette tutorials.
The third argument, string, is optional but generally important for guitar tablature. In similar format, it specifies the strings of the guitar on which notes are played. Providing this information fixes the fretstring combinations so that LilyPond does not have to guess what position on the neck of the guitar to play a specific note. An inability to specify this in various tablature notation software (or laziness by the user), is a common cause of inaccurate tabs scouring the internet, where even when the notes are correct they are written in the tab suggesting they be played in positions no one would sensibly use. Note that the x shown below is just a placeholder indicating no need to specify a string for the quarter note rest.
Score metadata and accessing LilyPondFinally, specify some song metadata to reproduce the original staff: the key of D minor, common time, and the tempo. If LilyPond is installed on your system (and added to your system PATH variable on Windows systems), tab should call it successfully. Alternatively, on Windows, it can be added explicitly by calling tabr_options. This option to specify the LilyPond path is still available on other systems. An example of this is commented out below.
R code library(tabr) # path < 'C:/Program Files (x86)/LilyPond/usr/bin/lilypond.exe' # tabr_options(lilypond = path) p1 < p("r a2 c3 f3 d3 a3 f3", "4 8 8 8 8 8 8", "x 5 5 4 4 3 4") track1 < track(p1) song < score(track1) tab(song, "phrase.pdf", key = "dm", time = "4/4", tempo = "4 = 120") #> #### Engraving score to phrase.pdf #### #> GNU LilyPond 2.18.2 #> Processing `./phrase.ly' #> Parsing... #> Interpreting music... #> Preprocessing graphical objects... #> Interpreting music... #> MIDI output to `./phrase.mid'... #> Finding the ideal number of pages... #> Fitting music on 1 page... #> Drawing systems... #> Layout output to `./phrase.ps'... #> Converting to `./phrase.pdf'... #> Success: compilation successfully completedSee the pdf result embedded at the tabr website.
Development status, context and caveatsFirst, why LilyPond? LilyPond is an exceptional sheet music engraving program. It produces professional, high quality output. It is open source. It offers an access point for a programmatic approach to music notation. It is developed and utilized by a large community. Most GUIbased applications are WYSIWYG and force a greater limitation on what you can do and what it will look like after you do it. On the other hand, I have zero interest in writing LilyPond files. tabr has made it more enjoyable, a bit less ugly, and enables me to stick with LilyPond for its quality as I try to shield myself from its native input structures. I’m sure there are far more LilyPond users who don’t mind it at all and have never heard of R; to each their own.
The tabr package is in early development. Breaking changes could occur in a later version. Many capabilities are missing. Others are incompletely implemented. Others in the R developer community who are probably much better musicians than myself are welcome to contribute. This is the type of package that will only develop in response to specific needs of its contributor(s). There are many things that tabr does not address at this stage of development. For example, tabr assumes standard guitar tuning. It has no ability to recognize or handle nonstandard tunings or instruments like bass with a different number of strings. There are essentially countless other aspects of music notation available in LilyPond that tabr does not wrap around. The aim is not to do it all, but certainly to do much more than is currently in place.
I am not an expert in music theory, or in music notation and transcription, or in LilyPond. In fact, my skill in music notation is ironically low enough that I do not find it any more challenging or an impediment to describe a song in R code rather than to just tab it out by hand. The main intent with tabr, however, is simply to be able to generate markup files that LilyPond accepts and understands, without having to write that markup directly.
Finally, there are nonetheless limitations to LilyPond itself. It has been developed for sheet music in general and guitar tablature features were added as a relative afterthought. There are plenty of features I have not yet developed R wrappers around. Then there are other features like string bending that are technically available, but not fully developed yet on the LilyPond side either. Case in point, LilyPond’s bend engraver is still under development; specifying something as common as a bendreleasepulloff is, to put it mildly, challenging.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: rbloggers – SNAP tech blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
A rentrez paper, and how to use the NCBI’s new API keys
I am happy to say that the latest issue of The R Journal includes a paper
describing rentrez,
the rOpenSci package for retrieving data from the National Center for Biotechnology Information
(NCBI).
The NCBI is one of the most important sources of biological data. The centre
provides access to information on 28 million scholarly articles through PubMed and 250
million DNA sequences through GenBank. More importantly, records in the 50 public
databases maintained by the NCBI are strongly crossreferenced. As a result, it is
possible to pinpoint searches using almost 2 million taxonomic names or a
controlled vocabulary with 270,000 terms.
rentrez has been designed to make it easy to search for and download NCBI
records and download them from within an R session.
The paper and the package vignette
both describe typical usages of rentrez. I though it might be fun to use this
post to find out where papers describing R packages are published these days.
Although PubMed only covers journals in the biological sciences, searching that
database will at least give us an idea of which journals like to publish these
sorts of papers. Here we use the entrez_search and entrez_summary functions
to get some information on all of the papers published in 2017 with the term
‘R package’ in their title:
As you can tell from the output above, you can get a lot of information from
these summary records. In this case, we are interested in the journals in which
these papers appear. We can use the helper function extract_from_esummary
to isolate the ‘source’ of each paper, then use table to count up the frequency
of each journal.
So, it looks like Bioinformatics, BMC Bionformatics and Molecular Ecology
Resources are popular destinations for papers describing R packages, but these
appear in journals all the way across the biological sciences.
The R Journal article describes some more typical uses of rentrez, and also
describes some of decisions that went into the design of the package. If this
example has whetted your appetite, then please check out the article or the
package documentation.
The publication of this paper gives me a chance to thank the
many people that have helped make rentrez into a useful package. I was very
lucky to have this code included in rOpenSci at an early stage. Being part of
the wider project made sure rentrez kept pace with the bestpractices for code
and documentation developed by the R community and got the package out to a wider
audience than would have otherwise been possible. I am thankful to everyone who has
filed an issue or contributed code to rentrez. I also have to
single out Scott Chamberlain, who has done a great deal to make sure the code
meets community standards and is useful to as many people as possible.
To celebrate the publication of this paper I am going to speed up rentrez by a
factor of three!
Well, the timing is coincidental, but the latest release of rentrez does make it
possible to send and receive information from the NCBI at a greater rate than
was previously possible. The NCBI now gives users the opportunity to register for an access
key
that will allow them to make up to 10 requests per second (nonregistered users are limited
to 3 requests per second per IP address). As of the latest release, rentrez
supports the use of these access keys while enforcing the appropriate rate limits.
For oneoff cases, this is as simple as adding the api_key argument to a given
function call.
It most cases you will want to use your key for each of several calls to the
NCBI. rentrez makes this easy by allowing you to set an environment variable,
ENTREZ_KEY. Once this value is set to your key rentrez will use it for all
requests to the NCBI. To set the value for a single R session you can use the
function set_entrez_key(). Here we set the value and confirm it is now
available as an environment variable.
If you use rentrez often you should edit your .Renviron file (see
help(Startup) for a description of this file) to include your key. Doing so will
mean all requests you send will take advantage of your API key. Here’s the line
to add:
As long as an API key is set by one of these methods, rentrez will allow you
to make up to ten requests per second.
The publication in the R Journal is not the end of development for rentrez.
Though the package is now featurecomplete and stable, I am very keen to make sure
it keeps pace with the API it wraps and squash any bugs that might arise. I also
appreciate usecases that demonstrate how the package can take advantage of NCBI
data. So, please, file issues at the project’s repository if you have any
questions about it!
Truncated Poisson distributions in R and Stan by @ellis2013nz
(This article was first published on Peter's stats stuff  R, and kindly contributed to Rbloggers)
Data that’s been only partially observedI’ve been updating my skills in fitting models to truncated data and was pleased to find that, like so much else in statistics, it’s easier than it used to be.
First, some definitions:
 censored data is where some observations are cut off at some maximum or minimum level; those data points are “off the scale” but at least we know they exist and we know which direction they are off the scale. For example, if we were analysing the life span of people born in 1980, anyone who has survived to the end of 2017 has their age at death recorded as “37 or higher”. We know they’re in the data, and their value is at least some minimum amount, but we don’t know with precision what it will end up being.
 truncated data is where data beyond some maximum or minimum level is just missing. Typically this is because of some feature of the measurement process eg anything smaller than X just doesn’t show up.
I’ve got some future blog posts on a more substantive real life issue where I have count data for which, in some situations, I only see the observations with a count higher than some threshold. Let’s imagine, for example, we are looking at deaths per vehicle crash, and are dependent for measurement on newspapers that only report crashes with at least two deaths, even though many crashes have one or zero deaths.
Here’s a greatly simplified example. I generate 1,000 observations of counts, with an average value of 1.3. Then I compare that original distribution with what I’d get if only those of two or higher were observed. It looks like this:
…generated by this code:
library(tidyverse) library(scales) library(fitdistrplus) library(rstan) library(truncdist) # original data: set.seed(321) a < rpois(1000, 1.3) # truncated version of data: b < a[ a > 1] # graphic: data_frame(value = c(a, b), variable = rep(c("Original data", "Truncated so only observations of 2 or more show up"), c(length(a), length(b)))) %>% ggplot(aes(x = value)) + geom_histogram(binwidth = 1, colour = "white") + facet_wrap(~variable, ncol = 1) + ggtitle("Comparing full and truncated datasets from a Poisson distribution") + labs(y = "Number of observations") # fitting a model to original works well: mean(a) fitdistr(a, "Poisson") # but obviously not if naively done to the truncated version: mean(b) fitdistr(b, "Poisson")Estimating the key parameter lambda for the full data (a) works well, giving an estimate of 1.347 that is just over one standard error from the true value of 1.3. The fitdistr function from the MASS package distributed with base R does a nice job in such circumstances.
But the mean value of b is badly biased upwards if used to estimate lambda; at 2.6 the mean of b is roughly twice the correct value of the mean of the underlying distribution. Obviously, removing a whole bunch of data at one end of the distribution is going to make naive estimation methods biased. So we need specialist methods that try to estimate lambda on the assumption that the data come from a Poisson distribution, but only the rightmost part of it.
Maximum likelihoodThe fitdistrplus package by Aurélie Siberchicot, Marie Laure DelignetteMuller and Christophe Dutang in combination with truncdist by Frederick Novomestky and Saralees Nadarajah gives a straightforward way to implement maximum likelihood estimation of a truncated distribution. Methods other than maximum likelihood are also available if required.
You need to make truncated versions of the dpois and ppois functions (or their equivalents for whatever distribution you are modelling) and use these within fitdistrplus::fitdist, which has some added functionality over MASS::fitdistr used in the previous chunk of code.
#MLE fitting in R # adapted from https://stackoverflow.com/questions/16947799/fittingalognormaldistributiontotruncateddatainr dtruncated_poisson < function(x, lambda) { dtrunc(x, "pois", a = 1.5, b = Inf, lambda) } ptruncated_poisson < function(x, lambda) { ptrunc(x, "pois", a = 1.5, b = Inf, lambda) } fitdist(b, "truncated_poisson", start = c(lambda = 0.5))Note that to do this I specified the lower threshold as 1.5; as all the data are integers this effectively means we only observe the observations of 2 or more, as is the case. We also needed to specify a reasonably starting value for the estimate of lambda – getting this too far out will lead to errors.
This method gives us an estimate of 1.34 with a standard error of 0.08, which is pretty good given we’ve only got 398 observations now. Of course, we’ve got the luxury of knowing for sure the true data generating process was Poisson.
BayesianFor an alternative Bayesian method, Stan makes it easy to describe data and probability distributions as truncated. The Stan manual has an entire chapter on truncated or censored data. Here’s an example Stan program to estimate the mean of the original Poisson distribution from our truncated data. As well as the original data, which I call x in this program, we need to tell it how many observations (n), the lower_limit that it was truncated by, and whatever is needed to characterise a prior distribution for the parameter we’re estimating.
The key bits of the program below are:
 In the data chunk, specify that the data for x has a lower limit of lower_limit
 In the model chunk, specify that distribution of x is truncated via T[lower_limit, ]
With a little more effort it’s possible to extend this by making Stan estimate lower_limit from the data; not necessary in my hypothetical example because I know where the minimum cutoff point of observed data lies.
Here’s how the data are fed to Stan from R:
#Calling Stan from R data < list( x = b, lower_limit = 2, n = length(b), lambda_start_mu = 2, lambda_start_sigma = 1 ) fit < stan("0120trunc.stan", data = data, cores = 4) fit plot(fit) + ggtitle("Credibility interval for lambda, estimated by Stan from truncated data", "(Correct value is 1.3)") + labs(y = "Estimated parameters") + theme_minimal(base_family = "myfont")This gives us a posterior distribution for lambda that matches that from the fitdistrplus method: 1.35 with a standard deviation of 0.08. The rstan package automatically turns this into a ggplot2 image of a credibility interval:
So, nice. Two simple ways to estimate the original distribution from truncated data.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff  R. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
xts 0.102 on CRAN
(This article was first published on FOSS Trading, and kindly contributed to Rbloggers)
This xts release contains mostly bugfixes, but there are a few noteworthy features. Some of these features were added in version 0.101, but I forgot to blog about it. Anyway, in no particular order:
 endpoints() gained subsecond accuracy on Windows (#202)!
 na.locf.xts() now honors ‘x‘ and ‘xout‘ arguments by dispatching to the next method (#215). Thanks to Morten Grum for the report.
 na.locf.xts() and na.omit.xts() now support character xts objects. Thanks to Ken Williams and Samo Pahor for the reports (#42).
Many of the bug fixes were related to the new plot.xts() introduced in 0.100. And a handful of bugfixes were to make xts more consistent with zoo in some edge cases.
As always, I’m looking forward to your questions and feedback! If you have a question, please ask on Stack Overflow and use the [r] and [xts] tags. Or you can send an email to the RSIGFinance mailing list (you must subscribe to post). Open an issue on GitHub if you find a bug or want to request a feature, but please read the contributing guide first!
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: FOSS Trading. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
RcppSMC 0.2.1: A few new tricks
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
A new release, now at 0.2.1, of the RcppSMC package arrived on CRAN earlier this afternoon (and once again as a very quick pretestpublish within minutes of submission).
RcppSMC provides Rcppbased bindings to R for the Sequential Monte Carlo Template Classes (SMCTC) by Adam Johansen described in his JSS article. Sequential Monte Carlo is also referred to as Particle Filter in some contexts .
This releases contains a few bug fixes and one minor rearrangment allowing headeronly use of the package from other packages, or via a Rcpp plugin. Many of these changes were driven by new contributors, which is a wonderful thing to see for any open source project! So thanks to everybody who helped with. Full details below.
Changes in RcppSMC version 0.2.1 (20180318)
The sampler now has a copy constructor and assignment overload (Brian Ni in #28).

The SMC library component can now be used in headeronly mode (Martin Lysy in #29).

Plugin support was added for use via cppFunction() and other Rcpp Attributes (or inline functions (Dirk in #30).

The sampler copy ctor/assigment operator is now copyconstructor safe (Martin Lysy In #32).

A bug in state variance calculation was corrected (Adam in #36 addressing #34).

History getter methods are now more userfriendly (Tiberiu Lepadatu in #37).

Use of pow with atomic types was disambiguated to std::pow) to help the Solaris compiler (Dirk in #42).
Courtesy of CRANberries, there is a diffstat report for this release.
More information is on the RcppSMC page. Issues and bugreports should go to the GitHub issue tracker.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in thirdparty forprofit settings.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Exploring the underlying theory of the chisquare test through simulation – part 1
(This article was first published on ouR data generation, and kindly contributed to Rbloggers)
Kids today are so sophisticated (at least they are in New York City, where I live). While I didn’t hear about the chisquare test of independence until my first stint in graduate school, they’re already talking about it in high school. When my kids came home and started talking about it, I did what I usually do when they come home asking about a new statistical concept. I opened up R and started generating some data. Of course, they rolled their eyes, but when the evening was done, I had something that might illuminate some of what underlies the theory of this ubiquitous test.
Actually, I created enough simulations to justify two posts – so this is just part 1, focusing on the \(\chi^2\) distribution and its relationship to the Poisson distribution. Part 2 will consider contingency tables, where we are often interested in understanding the nature of the relationship between two categorical variables. More on that the next time.
The chisquare distributionThe chisquare (or \(\chi^2\)) distribution can be described in many ways (for example as a special case of the Gamma distribution), but it is most intuitively characterized in relation to the standard normal distribution, \(N(0,1)\). The \(\chi^2_k\) distribution has a single parameter \(k\) which represents the degrees of freedom. If \(U\) is standard normal, (i.e \(U \sim N(0,1)\)), then \(U^2\) has a \(\chi^2_1\) distribution. If \(V\) is also standard normal, then \((U^2 + V^2) \sim \chi^2_2\). That is, if we add two squared standard normal random variables, the distribution of the sum is chisquared with 2 degrees of freedom. More generally, \[\sum_{j=1}^k X^2_j \sim \chi^2_k,\]
where each \(X_j \sim N(0,1)\).
The following code defines a data set with two standard normal random variables and their sum:
library(simstudy) def < defData(varname = "x", formula = 0, variance = 1, dist = "normal") def < defData(def, "chisq1df", formula = "x^2", dist = "nonrandom") def < defData(def, "y", formula = 0, variance = 1, dist = "normal") def < defData(def, "chisq2df", formula = "(x^2) + (y^2)", dist = "nonrandom") set.seed(2018) dt < genData(10000, def) dt[1:5,] ## id x chisq1df y chisq2df ## 1: 1 0.42298398 0.178915450 0.05378131 0.181807879 ## 2: 2 1.54987816 2.402122316 0.70312385 2.896505464 ## 3: 3 0.06442932 0.004151137 0.07412058 0.009644997 ## 4: 4 0.27088135 0.073376707 1.09181873 1.265444851 ## 5: 5 1.73528367 3.011209400 0.79937643 3.650212075The standard normal has mean zero and variance one. Approximately 95% of the values will be expected to fall within two standard deviations of zero. Here is your classic “bell” curve:
Since the statistic \(X^2\) (try not to confuse \(X^2\) and \(\chi^2\), unfortunate I know) is the sum of the squares of a continuous random variable and is always greater or equal to zero, the \(\chi^2\) is a distribution of positive, continuous measures. Here is a histogram of chisq1df from the data set dt, which has a \(\chi^2_1\) distribution:
And here is a plot of chisq2df, which has two degrees of freedom, and has a \(\chi^2_2\) distribution. Unsurprisingly, since we are adding positive numbers, we start to see values further away from zero:
Just to show that the data we generated by adding two squared standard normal random variables is actually distributed as a \(\chi^2_2\), we can generate data from this distribution directly, and overlay the plots:
actual_chisq2 < rchisq(10000, 2) Recycling and the Poisson distributionWhen we talk about counts, we are often dealing with a Poisson distribution. An example I use below is the number of glass bottles that end up in an apartment building’s recycling bin every day (as I mentioned, I do live in New York City). The Poisson distribution is a nonnegative, discrete distribution that is characterized by a single parameter \(\lambda\). If \(H \sim Poisson(\lambda)\), then \(E(H) = Var(H) = \lambda\).
def < defData(varname = "h", formula = 40, dist = "poisson") dh < genData(10000, def) round(dh[, .(avg = mean(h), var = var(h))], 1) ## avg var ## 1: 40.1 40To standardize a normally distributed variable (such as \(W \sim N(\mu,\sigma^2)\)), we subtract the mean and divide by the standard deviation:
\[ W_i^{s} = \frac{W_i – \mu}{\sigma},\]
and \(W^s \sim N(0,1)\). Analogously, to standardize a Poisson variable we do the same, since \(\lambda\) is the mean and the variance:
\[ S_{i} = \frac{H_i – \lambda}{\sqrt{\lambda}}\]
The distribution of this standardized variable \(S\) will be close to a standard normal. We can generate some data and check this out. In this case, the mean and variance of the Poisson variable is 40:
defA < defDataAdd(varname = "s", formula = "(h40)/sqrt(40)", dist = "nonrandom") dh < addColumns(defA, dh) dh[1:5, ] ## id h s ## 1: 1 34 0.9486833 ## 2: 2 44 0.6324555 ## 3: 3 37 0.4743416 ## 4: 4 46 0.9486833 ## 5: 5 42 0.3162278The mean and variance of the standardized data do suggest a standardized normal distribution:
round(dh[ , .(mean = mean(s), var = var(s))], 1) ## mean var ## 1: 0 1Overlaying the plots of the standardized poisson distribution with the standard normal distribution, we can see that they are quite similar:
Since the standardized Poisson is roughly standard normal, the square of the standardized Poisson should be roughly \(\chi^2_1\). If we square normalized Poisson, this is what we have:
\[ S_i^2 = \frac{(H_i – \lambda)^2}{\lambda}\]
Or maybe in a more familiar form (think Pearson):
\[ S_i^2 = \frac{(O_i – E_i)^2}{E_i},\]
where \(O_i\) is the observed value and \(E_i\) is the expected value. Since \(\lambda\) is the expected value (and variance) of the Poisson random variable, the two formulations are equivalent.
Adding the transformed data to the data set, and calculating the mean and variance, it is apparent that these observations are close to a \(\chi^2_1\) distribution:
defA < defDataAdd(varname = "h.chisq", formula = "(h40)^2/40", dist = "nonrandom") dh < addColumns(defA, dh) round(dh[, .(avg = mean(h.chisq), var = var(h.chisq))], 2) ## avg var ## 1: 1 1.97 actual_chisq1 < rchisq(10000, 1) round(c(avg = mean(actual_chisq1), var = var(actual_chisq1)), 2) ## avg var ## 0.99 2.04Once again, an overlay of the two distributions based on the data we generated shows that this is plausible:
Just for fun, let’s repeatedly generate 10 Poisson variables each with its own value of \(\lambda\) and calculate \(X^2\) for each iteration to compare with data generated from a \(\chi^2_{10}\) distribution:
nObs < 10000 nMeasures < 10 lambdas < rpois(nMeasures, 50) poisMat < matrix(rpois(n = nMeasures*nObs, lambda = lambdas), ncol = nMeasures, byrow = T) poisMat[1:5,] ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] ## [1,] 48 51 49 61 59 51 67 35 43 39 ## [2,] 32 58 49 67 57 35 69 40 57 55 ## [3,] 44 50 60 56 57 49 68 49 48 32 ## [4,] 44 44 42 49 52 50 63 39 51 38 ## [5,] 42 38 62 57 62 40 68 34 41 58Each column (variable) has its own mean and variance:
rbind(lambdas, mean = apply(poisMat, 2, function(x) round(mean(x), 0)), var = apply(poisMat, 2, function(x) round(var(x), 0)) ) ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] ## lambdas 43 45 51 61 55 46 62 35 48 47 ## mean 43 45 51 61 55 46 62 35 48 47 ## var 43 46 51 61 55 46 62 35 47 47Calculate \(X^2\) for each iteration (i.e. each row of the matrix poisMat), and estimate mean and variance across all estimated values of \(X^2\):
X2 < sapply(seq_len(nObs), function(x) sum((poisMat[x,]  lambdas)^2 / lambdas)) round(c(mean(X2), var(X2)), 1) ## [1] 10.0 20.2The true \(\chi^2\) distribution with 10 degrees of freedom:
chisqkdf < rchisq(nObs, nMeasures) round(c(mean(chisqkdf), var(chisqkdf)), 1) ## [1] 10.0 19.8These simulations strongly suggest that summing across independent standardized Poisson variables generates a statistic that has a \(\chi^2\) distribution.
The consequences of conditioningIf we find ourselves in the situation where we have some number of bins or containers or cells into which we are throwing a fixed number of something, we are no longer in the realm of independent, unconditional Poisson random variables. This has implications for our \(X^2\) statistic.
As an example, say we have those recycling bins again (this time five) and a total of 100 glass bottles. If each bottle has an equal chance of ending up in any of the five bins, we would expect on average 20 bottles to end up in each. Typically, we highlight the fact that under this constraint (of having 100 bottles) information about about four of the bins is the same as having information about all five. If I tell you that the first four bins contain a total of 84 bottles, we know that the last bin must have exactly 16. Actually counting those bottles in the fifth bin provides no additional information. In this case (where we really only have 4 pieces of information, and not the the 5 we are looking at), we say we have lost 1 degree of freedom due to the constraint. This loss gets translated into the chisquare test.
I want to explore more concretely how the constraint on the total number of bottles affects the distribution of the \(X^2\) statistic and ultimately the chisquare test.
Unconditional countingConsider a simpler example of three glass recycling bins in three different buildings. We know that, on average, the bin in building 1 typically has 20 bottles deposited daily, the bin in building 2 usually has 40, and the bin in building 3 has 80. These number of bottles in each bin is Poisson distributed, with \(\lambda_i, \ i \in \{1,2, 3\}\) equal to 20, 40, and 80, respectively. Note, while we would expect on average 140 total bottles across the three buildings, some days we have fewer, some days we have more – all depending on what happens in each individual building. The total is also Poisson distributed with \(\lambda_{total} = 140\).
Let’s generate 10,000 days worth of data (under the assumption that bottle disposal patterns are consistent over a very long time, a dubious assumption).
library(simstudy) def < defData(varname = "bin_1", formula = 20, dist = "poisson") def < defData(def, "bin_2", formula = 40, dist = "poisson") def < defData(def, "bin_3", formula = 80, dist = "poisson") def < defData(def, varname = "N", formula = "bin_1 + bin_2 + bin_3", dist = "nonrandom") set.seed(1234) dt < genData(10000, def) dt[1:5, ] ## id bin_1 bin_2 bin_3 N ## 1: 1 14 44 59 117 ## 2: 2 21 36 81 138 ## 3: 3 21 34 68 123 ## 4: 4 16 43 81 140 ## 5: 5 22 44 86 152The means and variances are as expected:
round(dt[ ,.(mean(bin_1), mean(bin_2), mean(bin_3))], 1) ## V1 V2 V3 ## 1: 20 39.9 80.1 round(dt[ ,.(var(bin_1), var(bin_2), var(bin_3))], 1) ## V1 V2 V3 ## 1: 19.7 39.7 80.6This plot shows the actual numbers of bottles in each bin in each building over the 10,000 days:
There is also quite a lot of variability in the daily totals calculated by adding up the bins across three buildings. (While it is clear based on the mean and variance that this total has a \(Poisson(140)\) distribution, the plot looks quite symmetrical. It is the case that as \(\lambda\) increases, the Poisson distribution becomes well approximated by the normal distribution.)
round(dt[, .(avgN = mean(N), varN = var(N))], 1) ## avgN varN ## 1: 140 139.6 Conditional countingNow, let’s say that the three bins are actually in the same (very large) building, located in different rooms in the basement, just to make it more convenient for residents (in case you are wondering, my bins are right next to the service elevator). But, let’s also make the assumption (and condition) that there are always between 138 and 142 total bottles on any given day. The expected values for each bin remain 20, 40, and 80, respectively
We calculate the total number of bottles every day and identify all cases where the sum of the three bins is within the fixed range. For this subset of the sample, we see that the means are unchanged:
defAdd < defDataAdd(varname = "condN", formula = "(N >= 138 & N <= 142)", dist = "nonrandom") dt < addColumns(defAdd, dt) round(dt[condN == 1, .(mean(bin_1), mean(bin_2), mean(bin_3))], 1) ## V1 V2 V3 ## 1: 20.1 40 79.9However, and this is really the key point, the variance of the sample (which is conditional on the sum being between 138 and 142) is reduced:
round(dt[condN == 1, .(var(bin_1), var(bin_2), var(bin_3))], 1) ## V1 V2 V3 ## 1: 17.2 28.3 35.4The red points in the plot below represent all daily totals \(\sum_i bin_i\) that fall between 138 and 142 bottles. The spread from top to bottom is contained by the rest of the (unconstrained) sample, an indication that the variance for this conditional scenario is smaller:
Not surprisingly, the distribution of the totals across the bins is quite narrow. But, this is almost a tautology, since this is how we defined the sample:
Biased standardizationAnd here is the grand finale of part 1. When we calculate \(X^2\) using the standard formula under a constrained data generating process, we are not dividing by the proper variance. We just saw that the conditional variance within each bin is smaller than the variance of the unconstrained Poisson distribution. So, \(X^2\), as defined by
\[ X^2 = \sum_{i=1}^{k \ bins} {\frac{(O_i – E_i)^2}{E_i}}\]
is not a sum of approximately standard normal variables – the variance used in the formula is too high. \(X^2\) will be smaller than a \(\chi^2_k\). How much smaller? Well, if the constraint is even tighter, limited to where the total equals exactly 140 bottles every day, \(X^2\) has a \(\chi^2_{k1}\) distribution.
Even using our slightly looser constraint of fixing the total between 138 and 142, the distribution is quite close to a \(chi^2_2\) distribution:
defA < defDataAdd(varname = "X2.1", formula = "(bin_120)^2 / 20", dist = "nonrandom") defA < defDataAdd(defA, "X2.2", formula = "(bin_240)^2 / 40", dist = "nonrandom") defA < defDataAdd(defA, "X2.3", formula = "(bin_380)^2 / 80", dist = "nonrandom") defA < defDataAdd(defA, "X2", formula = "X2.1 + X2.2 + X2.3", dist = "nonrandom") dt < addColumns(defA, dt)Comparison with \(\chi^2_3\) shows clear bias:
Here it is with a \(\chi^2_2\) distribution:
Recycling more than glassPart 2 will extend this discussion to the contingency table, which is essentially a 2dimensional array of bins. If we have different types of materials to recycle – glass bottles, plastic containers, cardboard boxes, and metal cans – we need four bins at each location. We might be interested in knowing if the distribution of these four materials is different across the 3 different locations – this is where the chisquare test for independence can be useful.
As an added bonus, you can expect to see lots of code that allows you to simulate contingency tables under different assumptions of conditioning. I know my kids are psyched.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: ouR data generation. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...