R news and tutorials contributed by hundreds of R bloggers

### Hacking Highcharter: observations per group in boxplots

Mon, 07/24/2017 - 14:46

(This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers)

Highcharts has long been a favourite visualisation library of mine, and I’ve written before about Highcharter, my preferred way to use Highcharts in R.

Highcharter has a nice simple function, hcboxplot(), to generate boxplots. I recently generated some for a project at work and was asked: can we see how many observations make up the distribution for each category? This is a common issue with boxplots, and there are a few solutions: overlay the box on a jitter plot to get some idea of the number of points, or try a violin plot or a so-called bee-swarm plot. In Highcharts, I figured there should be a method to get the number of observations, which could then be displayed in a tooltip on mouse-over.

There wasn’t, so I wrote one like this.

First, you’ll need to install highcharter from Github to make it work with the latest dplyr.

Next, we generate a reproducible dataset using the wakefield package. For some reason, we want to look at age by gender, but only for redheads:

```
library(dplyr)
library(tidyr)
library(highcharter)
library(wakefield)
library(tibble)

set.seed(1001)
sample_data <- r_data_frame(
  n = 1000,
  age(x = 10:90),
  gender,
  hair
) %>%
  filter(hair == "Red")

sample_data %>% count(Gender)
## # A tibble: 2 x 2
##   Gender     n
##   <fctr> <int>
## 1   Male    62
## 2 Female    48
```

Giving us 62 male and 48 female redheads. The tibble package is required because later on, our boxplot function calls the function has_name from that package.

The standard hcboxplot function shows us, on mouse-over, the summary data used in the boxplot, as in the image below.

```
hcboxplot(x = sample_data$Age, var = sample_data$Gender) %>%
  hc_chart(type = "column")
```

To replace that with the number of observations per group, we need to edit the function. In RStudio, View(hcboxplot) will open a tab with the (read-only) code, which can be copied and edited. Look for the function named get_box_values, which uses the R boxplot.stats function to generate a data frame:

```
get_box_values <- function(x) {
  boxplot.stats(x)$stats %>%
    t() %>%
    as.data.frame() %>%
    setNames(c("low", "q1", "median", "q3", "high"))
}
```

Edit it to look like this – the new function just adds a column obs with the number of observations:

```
get_box_values <- function(x) {
  boxplot.stats(x)$stats %>%
    t() %>%
    cbind(boxplot.stats(x)$n) %>%
    as.data.frame() %>%
    setNames(c("low", "q1", "median", "q3", "high", "obs"))
}
```

Save the new function as, for example, my_hcboxplot. Now we can customise the tooltip to use the obs property of the point object:

```
my_hcboxplot(x = sample_data$Age, var = sample_data$Gender) %>%
  hc_chart(type = "column") %>%
  hc_tooltip(pointFormat = 'n = {point.obs}')
```

Voilà.

Filed under: R, statistics

To leave a comment for the author, please follow the link and comment on their blog: R – What You're Doing Is Rather Desperate. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

### Random Forests in R

Mon, 07/24/2017 - 12:55

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Ensemble learning is a type of supervised learning technique in which the basic idea is to generate multiple models on a training dataset and then simply combine (average) their output rules or hypotheses $$H_x$$ to generate a strong model which performs well, does not overfit, and balances the bias-variance trade-off too.
The idea is that instead of producing a single complicated and complex model, which might have high variance leading to overfitting or be too simple with high bias leading to underfitting, we generate lots of models on the training set and combine them at the end. Random Forests are one such technique: a popular ensembling method used to improve the predictive performance of decision trees by averaging the trees, thereby reducing their variance. Decision trees are simple, easily interpretable and understandable modelling techniques, but a major drawback is their poor predictive performance and poor generalization on a test set. They are often referred to as weak learners: learners that perform only somewhat better than chance, with an error just below 50%.

Random Forests

Random Forests are similar to a famous ensemble technique called bagging, but with a different tweak. In Random Forests the idea is to decorrelate the trees that are grown on different bootstrapped samples of the training data, and then reduce the variance by averaging them. Averaging the trees reduces variance, improves the performance of decision trees on the test set, and eventually helps avoid overfitting. The idea is to build lots of trees in a way that keeps the correlation between them small. The other major difference is that at each split we consider only a random subset of $$m$$ predictors, whereas in ordinary trees we evaluate all the predictors at every split and choose the best among them. Typically $$m=\sqrt{p}$$, where $$p$$ is the number of predictors. It may seem crazy to throw away lots of predictors, but it makes sense: the effect is that each tree uses different predictors to split the data at various points.
So by this trick of throwing away predictors we have decorrelated the trees, and the resulting average is a little better.

Implementation in R

Loading the required packages:

```
require(randomForest)
require(MASS)  # contains the Boston housing dataset
attach(Boston)
set.seed(101)
dim(Boston)
## [1] 506  14
```

Separating training and test sets:

```
# training sample with 300 observations
train = sample(1:nrow(Boston), 300)
?Boston  # to read about the dataset
```

We are going to use the variable ′medv′ as the response variable, which is the median housing value. We will fit 500 trees.

Fitting the Random Forest

We will use all the predictors in the dataset.

```
Boston.rf = randomForest(medv ~ . , data = Boston , subset = train)
Boston.rf
##
## Call:
##  randomForest(formula = medv ~ ., data = Boston, subset = train)
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
##
##           Mean of squared residuals: 12.62686
##                     % Var explained: 84.74
```

The above mean squared error and variance explained are calculated using out-of-bag error estimation. In this, $$\frac23$$ of the training data is used for training and the remaining $$\frac13$$ is used to validate the trees. Also, the number of variables randomly selected at each split is 4.

Plotting the Error vs Number of Trees graph:

```
plot(Boston.rf)
```

This plot shows the error against the number of trees. We can easily see how the error drops as we keep adding more trees and averaging them.

Now we can compare the out-of-bag errors and the error on the test set. The above Random Forest model chose 4 variables at random to be considered at each split. We could now try all 13 possible predictors at each split:

```
oob.err = double(13)
test.err = double(13)

# mtry is the number of variables randomly chosen at each split
for(mtry in 1:13) {
  rf = randomForest(medv ~ . , data = Boston , subset = train, mtry = mtry, ntree = 400)
  oob.err[mtry] = rf$mse[400]  # error of all trees fitted

  pred <- predict(rf, Boston[-train,])  # predictions on the test set for each tree
  test.err[mtry] = with(Boston[-train,], mean((medv - pred)^2))  # mean squared test error

  cat(mtry, " ")  # printing the output to the console
}
## 1  2  3  4  5  6  7  8  9  10  11  12  13
```

Test Error

```
test.err
##  [1] 26.06433 17.70018 16.51951 14.94621 14.51686 14.64315 14.60834
##  [8] 15.12250 14.42441 14.53687 14.89362 14.86470 15.09553
```

Out of Bag Error Estimation

```
oob.err
##  [1] 19.95114 13.34894 13.27162 12.44081 12.75080 12.96327 13.54794
##  [8] 13.68273 14.16359 14.52294 14.41576 14.69038 14.72979
```

What happens is that we are growing 400 trees 13 times, i.e. once for each value of mtry (the number of predictors considered at each split) from 1 to 13.

Plotting both the test error and the out-of-bag error:

```
matplot(1:mtry, cbind(oob.err, test.err), pch = 19, col = c("red", "blue"),
        type = "b", ylab = "Mean Squared Error",
        xlab = "Number of Predictors Considered at each Split")
legend("topright", legend = c("Out of Bag Error", "Test Error"),
       pch = 19, col = c("red", "blue"))
```

What we observe is that the red line is the out-of-bag error estimate and the blue line is the error calculated on the test set. Both curves are quite smooth and the error estimates are somewhat correlated. The error tends to be minimized at around $$mtry = 4$$.
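We can read the minimising value directly off the two error vectors. A small self-contained check, re-using the printed values from above so the snippet stands alone:

```r
# Locate the mtry value that minimises each error curve, re-using the
# printed values from the loop above.
test.err <- c(26.06433, 17.70018, 16.51951, 14.94621, 14.51686, 14.64315,
              14.60834, 15.12250, 14.42441, 14.53687, 14.89362, 14.86470, 15.09553)
oob.err  <- c(19.95114, 13.34894, 13.27162, 12.44081, 12.75080, 12.96327,
              13.54794, 13.68273, 14.16359, 14.52294, 14.41576, 14.69038, 14.72979)

which.min(oob.err)   # 4: out-of-bag error is lowest at mtry = 4
which.min(test.err)  # 9: test error bottoms out later, but that part of the curve is flat
```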

On the extreme right-hand side of the above plot, we considered all 13 possible predictors at each split, which is simply bagging.

Conclusion

In this article I gave a simple overview of Random Forests, how they differ from other ensemble learning techniques, and how to implement this complex and powerful modelling technique in R with the simple package randomForest. Random Forests are a very nice technique for fitting a more accurate model, averaging lots of decision trees to reduce variance and avoid the overfitting problem of trees. Decision trees themselves are poor performance-wise, but when used with ensembling techniques like bagging and Random Forests their predictive performance improves a lot. There are, of course, various other R packages that can be used to implement Random Forests.

I hope the tutorial is enough to get you started with implementing Random Forests in R, or at least to understand the basic idea behind how this amazing technique works.

Thanks for reading the article and make sure to like and share it.



To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

### Stippling and TSP art in R: emulating StippleGen

Mon, 07/24/2017 - 12:06

(This article was first published on R – dahtah, and kindly contributed to R-bloggers)

Stippling is the creation of a pattern simulating varying degrees of solidity or shading by using small dots (Wikipedia). StippleGen is a piece of software that renders images using stipple patterns, which I discovered on Xi’an’s blog a couple of days ago.

Stippled version of a sketch by Paul Cézanne, rendered in R

StippleGen uses an algorithm by Adrian Secord (described here) that turns out to be related to a problem in spatial statistics, specifically how to mess with high-order statistics of point processes while controlling density. The algorithm is a variant of k-means and is extremely easy to implement in R.

```
library(imager)
library(dplyr)
library(purrr)

stipple <- function(im, nPoints = 1e3, gamma = 2, nSteps = 10) {
  dens <- (1 - im)^gamma
  xy <- sample(nPix(im), nPoints, replace = TRUE, prob = dens) %>%
    coord.index(im, .) %>%
    select(x, y)
  for (ind in 1:nSteps) {
    xy <- cvt(xy, dens)
    plot(im); points(xy, col = "red")
  }
  xy
}

plot.stipple <- function(im, out, cex = .25) {
  g <- imgradient(im, "xy") %>% map(~ interp(., out))
  plot(out, ylim = c(height(im), 1), cex = cex, pch = 19,
       axes = FALSE, xlab = "", ylab = "")
}

## Compute Voronoi diagram of point set xy,
## and return center of mass of each cell (with density given by image im)
cvt <- function(xy, im) {
  voronoi(xy, width(im), height(im)) %>%
    as.data.frame %>%
    mutate(vim = c(im)) %>%
    group_by(value) %>%
    dplyr::summarise(x = weighted.mean(x, w = vim),
                     y = weighted.mean(y, w = vim)) %>%
    select(x, y) %>%
    filter(x %inr% c(1, width(im)), y %inr% c(1, height(im)))
}

## Compute Voronoi diagram for points xy over image of size (w,h)
## Uses a distance transform followed by watershed
voronoi <- function(xy, w, h) {
  v <- imfill(w, h)
  ind <- round(xy) %>% index.coord(v, .)
  v[ind] <- seq_along(ind)
  d <- distance_transform(v > 0, 1)
  watershed(v, -d, fill_lines = FALSE)
}

# image from original paper
im <- load.image("http://dahtah.github.io/imager/images/stippling_leaves.png")
out <- stipple(im, 1e4, nSteps = 5, gamma = 1.5)
plot.stipple(im, out)
```

TSP art is a variant where you solve a TSP problem to connect all the dots.

```
library(TSP)

## im is the original image (used only for its dimensions)
## out is the output of the stipple function (dot positions)
draw.tsp <- function(im, out) {
  tour <- out %>% ETSP %>% solve_TSP
  plot(out[tour,], type = "l", ylim = c(height(im), 1),
       axes = FALSE, xlab = "", ylab = "")
}

## Be careful, this is memory-heavy (also, slow)
out <- stipple(im, 4e3, gamma = 1.5)
draw.tsp(im, out)
```

I’ve written a more detailed explanation on the imager website, with other variants like stippling with line segments, and a mosaic filter.


To leave a comment for the author, please follow the link and comment on their blog: R – dahtah.

### Beneath the canvas

Mon, 07/24/2017 - 02:00

(This article was first published on Data Imaginist - R posts, and kindly contributed to R-bloggers)

Recently a blog post made the rounds on the internet describing how it is possible to speed up plot creation in ggplot2 by first creating a blank canvas and then adding the plot elements on top of it. The main takeaway plot is reproduced below:

The blog post is generally well reasoned and I love how it moves from a curious observation to an investigation and ends with a solid recommendation. Alas, I don’t agree with the recommendation (that you should create a canvas for subsequent use). Most of the misunderstanding in the blog post comes from the fact that ggplot2 in many ways seems to be fueled by magic and unicorn blood: what arises when you write ggplot() and hit enter is far from clear. I would like to spend most of the time on this point, so I’m just going to get a more general point out of the way first.

Premature optimisation is premature

When looking for ways to optimise your code, one must always ask whether the code needs optimisation in the first place, and then whether the changes made actually have a meaningful impact. What the plot above shows is that caching the ggplot() call leads to a statistically significant performance improvement measured in less than 10 ms. This means that in order to get a perceivable runtime difference, it would be necessary to generate hundreds or thousands of plots. My own rule of thumb is that you should not give up coding conventions unless there’s a tangible result, and in this case I don’t see any. Does this mean you should never strive for millisecond improvements? No: if you expect your piece of code to be called thousands of times, the compounded effect would make this worthwhile. This is why you sometimes see code where the square root of a variable is saved in a new variable rather than being computed on the fly every time. In this case you should ask yourself whether you mean to generate 1000 plots with your code in one go, and if so, whether an additional second is really worth it.
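The square-root caching pattern mentioned above looks like this in miniature (a toy sketch, not code from the original post):

```r
# Hoist an invariant computation out of a hot loop instead of
# recomputing it on every iteration.
x <- 2
sx <- sqrt(x)            # computed once...
total <- 0
for (i in 1:1000) {
  total <- total + sx    # ...reused 1000 times, no repeated sqrt() call
}
total                    # equals 1000 * sqrt(2)
```

Whether the hoisting pays off is exactly the question posed above: it only matters when the loop is hot enough for the saved calls to add up.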

There is no spoon canvas

The notion that ggplot() creates a canvas for subsequent calls to add onto is a sensible one, supported by the ggplot2 API where layers are added to the initial plot. Further, if we simply write ggplot() and hit enter we get this:

```
library(ggplot2)
ggplot()
```

Which sure looks like a blank canvas. This is all magic and unicorns though –
the call to ggplot() doesn’t actually draw or render anything on the device.
In order to understand what is going on, let’s have a look at the code
underneath it all:

```
ggplot
#> function (data = NULL, mapping = aes(), ..., environment = parent.frame())
#> {
#>     UseMethod("ggplot")
#> }
```

So, ggplot() is an S3 generic. As it is dispatching on the data argument, and
that defaults to NULL I’ll take the wild guess and say we’re using the default
method:

```
ggplot2:::ggplot.default
#> function (data = NULL, mapping = aes(), ..., environment = parent.frame())
#> {
#>     ggplot.data.frame(fortify(data, ...), mapping, environment = environment)
#> }
```

Huh, so even if we’re not passing in a data.frame as data we’re ending up with
a call to the data.frame ggplot method (this is actually the reason why you
can write your own fortify methods for custom objects and let ggplot2 work with
them automatically). Just for completeness let’s have a look at a fortified
NULL value:

```
fortify(NULL)
#> list()
#> attr(,"class")
#> [1] "waiver"
```

We get a waiver object, which is an internal ggplot2 approach to saying: “I’ve
got nothing right now but let’s worry about that later” (grossly simplified).

With that out of the way, let’s dive into ggplot.data.frame():

```
ggplot2:::ggplot.data.frame
#> function (data, mapping = aes(), ..., environment = parent.frame())
#> {
#>     if (!missing(mapping) && !inherits(mapping, "uneval")) {
#>         stop("Mapping should be created with aes() or aes_().",
#>             call. = FALSE)
#>     }
#>     p <- structure(list(data = data, layers = list(), scales = scales_list(),
#>         mapping = mapping, theme = list(), coordinates = coord_cartesian(),
#>         facet = facet_null(), plot_env = environment), class = c("gg",
#>         "ggplot"))
#>     p$labels <- make_labels(mapping)
#>     set_last_plot(p)
#>     p
#> }
```

This is actually a pretty simple piece of code. There are some argument checks to make sure the mappings are provided in the correct way, but other than that it simply constructs a gg object (a ggplot subclass). The set_last_plot() call makes sure that this new plot object is now retrievable with the last_plot() function. In the end it simply returns the new plot object. We can validate this by looking into the return value of ggplot():

```
str(ggplot())
#> List of 9
#>  $ data       : list()
#>   ..- attr(*, "class")= chr "waiver"
#>  $ layers     : list()
#>  $ scales     : Classes 'ScalesList', 'ggproto', 'gg'
#>     add: function
#>     clone: function
#>     find: function
#>     get_scales: function
#>     has_scale: function
#>     input: function
#>     n: function
#>     non_position_scales: function
#>     scales: NULL
#>     super:
#>  $ mapping    : list()
#>  $ theme      : list()
#>  $ coordinates: Classes 'CoordCartesian', 'Coord', 'ggproto', 'gg'
#>     aspect: function
#>     distance: function
#>     expand: TRUE
#>     is_linear: function
#>     labels: function
#>     limits: list
#>     modify_scales: function
#>     range: function
#>     render_axis_h: function
#>     render_axis_v: function
#>     render_bg: function
#>     render_fg: function
#>     setup_data: function
#>     setup_layout: function
#>     setup_panel_params: function
#>     setup_params: function
#>     transform: function
#>     super:
#>  $ facet      : Classes 'FacetNull', 'Facet', 'ggproto', 'gg'
#>     compute_layout: function
#>     draw_back: function
#>     draw_front: function
#>     draw_labels: function
#>     draw_panels: function
#>     finish_data: function
#>     init_scales: function
#>     map_data: function
#>     params: list
#>     setup_data: function
#>     setup_params: function
#>     shrink: TRUE
#>     train_scales: function
#>     vars: function
#>     super:
#>  $ plot_env   :
#>  $ labels     : list()
#>  - attr(*, "class")= chr [1:2] "gg" "ggplot"
```

We see our waiver data object in the data element. As expected we don’t have
any layers but, perhaps surprisingly, we do have a coordinate system and a
facet specification. These are the defaults getting added to every plot and in
effect until overwritten by something else (facet_null() is simply a one-panel
plot, cartesian coordinates are a standard coordinate system, so the defaults
are sensible). While there’s a default theme in ggplot2 it is not part of the
plot object in the same way as the other defaults. The reason for this is that
it needs to be possible to change the theme defaults and have these changes
applied to all plot objects already in existence. So, instead of carrying the
full theme around, a plot object only keeps explicit changes to the theme and
then merges these changes into the current default (available with
theme_get()) during plotting.
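That merge-at-plot-time behaviour can be mimicked with modifyList() in base R. This is a simplified sketch of the idea, not ggplot2's actual internals: the plot object stores only the explicit changes, and they are merged into whatever the current default is when the plot is drawn:

```r
# Simplified sketch of deferred theme merging (not ggplot2's real code).
default_theme <- list(text_size = 11, background = "grey92")
plot_changes  <- list(background = "white")   # only explicit changes are stored

# At plot time the stored changes are merged into the *current* default,
# so later edits to the default still reach existing plot objects:
effective <- modifyList(default_theme, plot_changes)
effective$background  # "white": the explicit change wins
effective$text_size   # 11: inherited from the current default
```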

All in all ggplot() simply creates an adorned list ready for adding stuff
onto (you might call this a virtual canvas but I think this is stretching
it…).

So how come something pops up on your plotting device when you hit enter? (for
a fun effect read this while sounding as Carrie from Sex and the City)

This is due to the same reason you get a model summary when hitting enter on a lm() call etc.: the print() method. The print() method is called automatically by R every time a variable is queried and, for a ggplot object, it draws the content of your object on your device. An interesting side-effect of this is that ggplots are only rendered within a loop when explicitly print()ed/plot()ed, as only the last return value in a sequence of calls gets its print method invoked. This also means that the benchmarks in the original blog post were only measuring plot object creation, and not actual plot rendering, as rendering is never made explicit in the benchmarked function (a point later emphasized in the original post as well). For fun, let’s see if doing that changes anything:

```
canv_mt <- ggplot(mtcars, aes(hp, mpg, color = cyl)) +
  coord_cartesian()

# test speed with microbenchmark
test <- microbenchmark::microbenchmark(
  without_canvas = plot(ggplot(mtcars, aes(hp, mpg, color = cyl)) +
                          coord_cartesian() +
                          geom_point()),
  with_canvas = plot(canv_mt + geom_point()),
  times = 100
)
test
#> Unit: milliseconds
#>            expr      min       lq     mean   median       uq      max
#>  without_canvas 256.1470 293.1915 330.2396 311.9711 360.8715 500.6386
#>     with_canvas 256.4203 287.1507 321.6902 303.2688 334.9503 535.2007
#>  neval cld
#>    100   a
#>    100   a

autoplot(test) +
  scale_y_continuous('Time [milliseconds]') # to get axis ticks
```

So it appears any time difference is hidden by the actual complexity of rendering the plot. This is sensible, as the time scale has now increased considerably and a difference of 1 ms will not be visible.

Strange trivia that I couldn’t fit in anywhere else

Prior to ggplot2 v2, simply plotting the result of ggplot() would result in an error as the plot had no layers to draw. ggplot2 did not gain the ability to draw layerless plots in v2; instead it got an invisible default layer (geom_blank) that gets dropped once new layers are added. This just goes to show the amount of logic going into plot generation in ggplot2 and why it sometimes feels magical…


To leave a comment for the author, please follow the link and comment on their blog: Data Imaginist - R posts.

### Runtime vs. Success (Using IMDB)

Sun, 07/23/2017 - 19:22

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

The content in this blog comes from a shiny application proof of concept using IMDB movie data.  To view the application:

1. IMDB Movie Data App on Shiny.io (Created in R / Shiny Dashboard)
2. Source Code on Github
Background and Inspiration:

Years ago, students of film writing were advised to make their feature-length movie scripts between 90 and 129 pages. A page is expected to be one minute of runtime, and today these numbers have been revised to a simple target of 110 minutes. A fun fact connected with this: my professor tells the story of an insider’s view of development, where one of the decision makers at one of the giants used to pick up a screenplay, weigh it in his hands, and say “too light” or “too heavy”. The script was then rejected without a single page ever being read. Are these page-length numbers arbitrary, or is there something to these guidelines?

A Walkthrough of The Shiny App And What It Can Tell Us

Movie Releases by year – Small Bins

Movie Releases by Years – Big Bins

With the bin size set very small in the above visualizations, we can see movie releases increasing over time, with the rate accelerating much more from the 1970s onward. Increasing the bin size from 2 to 28 further illustrates how many more movies there are in recent years than in the past.

A quick look at the history may shed a little light on what’s going on here.  The first motion pictures were created in the 1890’s and were little more than short, black and white moving images.  As the novelty of this waned, next came serial shorts from which the term “cliff hanger” originates.  When real full-length stories started to evolve, they were  expensive and hard to produce.  As a practical matter, only small numbers could be made per year. Today, the bar has been raised but technology keeps lowering the cost of what it takes to meet that bar.  Independent and ancillary markets have widened the distribution channels and even YouTube is becoming a part of this expanding network.  It’s just a matter of time before low budget movies get made on smart phones if they haven’t been already.  Nothing earth shattering here but the visualization does help show that the run-away escalation started during the time when “Star Wars”, “Jaws”, and “Close Encounters of The Third Kind” all made their debut.  Many see this time as “the beginning of the blockbuster.”

As shown here, the data used in this application is organized into 5 subsets of data:

IMDB Shiny App – Data Tab

There is some overlap in the data sets shown on the application’s “Data” tab:

1. The 5270 “Kaggle 5000 Project IDs” data set contains the results of merging 5000 movies whose IDs were obtained from a “Kaggle” project with the unique IDs from all the other data sets described above.  The “Kaggle Project IDs” included a sample of releases from 1916 up until 2016.
2. The other data sets came from Top and Bottom lists on IMDB.  Most of the Top 250 English Movies were also in the “Top 250 Movies”.  Some of the “Top 250 Indian Movies” are also in the “Top 250 Movies”.  The Bottom 100 should only overlap with the “Kaggle” data set.  Movies in these lists include titles from 2017.

Click on the “Visualization” tab of the app to obtain basic stats on each of the aforementioned  datasets (Min, Max, Mean, Median and Quartiles) for these 4 numeric variables:

1. Year Released
2. Runtime (in minutes)
3. Number of Reviews
4. Movie Rating

This tab also provides a line plot with a loess regression smoother (local, not linear, regression), which we can analyze for the general trend in movie runtimes that we are looking for.
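As a reminder of what such a smoother does, here is a minimal base-R sketch with made-up stand-ins for the app's year and runtime columns (the data here is synthetic, not the IMDB data):

```r
# Hypothetical data standing in for the IMDB year/runtime columns.
set.seed(42)
year    <- 1950:2016
runtime <- 110 + 5 * sin((year - 1950) / 10) + rnorm(length(year), sd = 3)

fit   <- loess(runtime ~ year)                 # local regression smoother
trend <- predict(fit, data.frame(year = year)) # smoothed runtime trend

plot(year, runtime)
lines(year, trend, col = "red")
```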

“Kaggle 5000 Project IDs” Record Set (All of Our Data)

If we start with the plot for “all the data” in our application, outliers are mostly clustered pretty close to the general confidence interval for the model.  No outliers outside this range appear after 1990, and only a small number of points barely outside the confidence interval appear from 1960 to 1990.

Since there is a limited outlier effect, the mean seems like a reasonable metric. It is 109.2 minutes. Of more interest to the original question, this plot essentially reflects a 5000+ record sample of what got made. The hills and valleys of the line range between 105 and 120 minutes up through the 1970s. The line then slopes slightly downward up to the present, with our data points mostly appearing at around the 110 minute mark. Though anecdotal in nature, the original 110 minute recommendation for movie scripts would seem to be supported by the data. The confidence interval around the line, though, might suggest a range from about 100 to 120 minutes. If our movie gets made, we may then be concerned with its chances of making it into the top or bottom. Starting with the Top 250 Movies:

Top 250 Movies (On IMDB)

The mean for the Top 250 movies was 130 minutes. The curve trends upwards overall, with a range in recent years (1990 to the present) that fell between 130 and 140 minutes. There is a larger scattering of outliers on this plot, but based on how the outliers mostly cluster not too far from the confidence interval after 1990, the mean still seems reasonable to use. If we think about why these better-received movies are often longer than the general trend for what gets made, there are surely many factors involved. For one thing, big-name directors and producers have the kind of clout to make us sit through longer movies. Consider “The Lord of The Rings” saga, which collectively was over 9 hours long, with each of its 3 movies lasting around 3 hours apiece.

We don’t want to end up in the bottom, so we’ll take a look at this too:

Bottom 100 Movies (on IMDB)

The mean for the Bottom 100 movies was 92.26 minutes. This curve also shows a distinct upwards trend overall, but with a range in recent years (1990 to the present) of about 90 to 115 minutes. There are fewer outliers, but there is also less data in this grouping.

In Conclusion

Note that for this application about 6000 IMDB records were obtained. In 2017 alone, IMDB recorded 12,193 new releases by July and the year is still young!   Further analysis is indicated and performing the analysis with more data would also be helpful.  Given the small size of the data, findings here should not be taken as an absolute.  Given that, here is a summary of what the data suggests so far:

1. The “Sweet Spot” to target for a feature length screen play to get made is about 110 minutes
2. If we want that movie to make it into the Top 250 on IMDB, the final movie produced is expected to run from 130 to 140 minutes
3. It is also interesting to note that movies in the bottom 100 tend to range between 90 and 115 minutes

With more time and development, a more thorough analysis could be developed from IMDB data.

The Data Collection Process

Three R markdown scripts (available on Git) were used to gather source data. A full workflow from initial input files to what got used in the application is provided in the third script. As this process was experimentally developed and modified while creating the source data for this project, R markdown was used to step through the process and keep notes on what was going on. At a high level:

1. An XML-based R library was used to screen-scrape data from "hot lists" (Top 250, Bottom 100, etc.) directly from the IMDB website in July of 2017. This produced 'csv' files used as the source for all the "Top / Bottom" lists provided in the final application.
2. 5000+ IMDB IDs were extracted using a software process (not documented) from a json file posted with a Kaggle project.  That project last gathered data in 2016.  IDs from steps 1 and 2 were then used with step 3 to extract current data on these IDs.
3. An open API referred to as "IMDB Open API" was used to extract records from the IMDB website one record at a time, with a 3-second delay between each extraction to ensure no harm to the website.  The script was designed to save its work "along the way" and provide helpful output messages, so that if an error occurred that halted the program, the data collected so far would still be available and the script could pick up where it left off to finish the job.
4. Close to 6000 observations in total (5270 after filtering out duplicates) with about 27 variables were obtained from IMDB.
5. Data cleaning of just the 7 variables brought into the Shiny application ensued, with the intent to clean other variables in future development iterations.  The vision was to develop subset tables with Movie_IDs on each one as a unique identifier, and then join them to previous data when bringing in each new set, to expand the app in support of new analysis and features.
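The save-as-you-go, rate-limited loop described in step 3 might be sketched like this (hypothetical code: the fetch_record() helper, file layout, and column names are invented for illustration and are not the project's actual script):

```r
# Hypothetical stub standing in for the real one-record IMDB Open API call
fetch_record <- function(id) data.frame(Movie_ID = id, Runtime = NA)

polite_extract <- function(ids, out_file, delay = 3) {
  # Resume support: skip IDs already saved by an earlier, interrupted run
  done <- if (file.exists(out_file)) read.csv(out_file)$Movie_ID else character(0)
  for (id in setdiff(ids, done)) {
    rec <- tryCatch(fetch_record(id),
                    error = function(e) {
                      message("Failed on ", id, ": ", conditionMessage(e))
                      NULL
                    })
    if (!is.null(rec)) {
      # Save work "along the way" so a crash loses at most one record
      write.table(rec, out_file, sep = ",",
                  append = file.exists(out_file),
                  col.names = !file.exists(out_file), row.names = FALSE)
    }
    Sys.sleep(delay)  # pause between requests to avoid harming the website
  }
}
```

Restarting the function after a failure simply picks up with the first unsaved ID.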

The Initial Plan for this Project

The CEO of a company I used to work for would tell stories of how senior management of various IT areas under him would plan and design solutions on a napkin in a bar, which formed the basis of software project requirements.  For this Shiny app, the "napkin" was actually a piece of notebook paper, captured as a PDF for your amusement.  Only a small piece of this plan has been realized so far.

The post Runtime vs. Success (Using IMDB) appeared first on NYC Data Science Academy Blog.


### Programming with dplyr by using dplyr

Sun, 07/23/2017 - 18:49

(This article was first published on R – Enchufa2, and kindly contributed to R-bloggers)

The title may seem tautological, but since the arrival of dplyr 0.7.x, there have been some efforts at using dplyr without actually using it that I can’t quite understand. The tidyverse has raised passions, for and against it, for some time already. There are excellent alternatives out there, and I myself use them when I find it suitable. But when I choose to use dplyr, I find it most versatile, and I see no advantage in adding yet another layer that complicates things and makes problems even harder to debug.

Take the example of seplyr. It stands for standard evaluation dplyr, and enables us to program over dplyr without having “to bring in (or study) any deep-theory or heavy-weight tools such as rlang/tidyeval”. Let’s consider the following interactive pipeline:

library(dplyr) starwars %>% group_by(homeworld) %>% summarise(mean_height = mean(height, na.rm = TRUE), mean_mass = mean(mass, na.rm = TRUE), count = n()) ## # A tibble: 49 x 4 ## homeworld mean_height mean_mass count ## ## 1 Alderaan 176.3333 64.0 3 ## 2 Aleen Minor 79.0000 15.0 1 ## 3 Bespin 175.0000 79.0 1 ## 4 Bestine IV 180.0000 110.0 1 ## 5 Cato Neimoidia 191.0000 90.0 1 ## 6 Cerea 198.0000 82.0 1 ## 7 Champala 196.0000 NaN 1 ## 8 Chandrila 150.0000 NaN 1 ## 9 Concord Dawn 183.0000 79.0 1 ## 10 Corellia 175.0000 78.5 2 ## # ... with 39 more rows

Let’s say we want to parametrise the grouping variable and wrap the code above into a re-usable function. Apparently, this is difficult with dplyr. But is it? Not at all: we just need to add one line and a bang-bang (!!):

starwars_mean <- function(var) { var <- enquo(var) starwars %>% group_by(!!var) %>% summarise(mean_height = mean(height, na.rm = TRUE), mean_mass = mean(mass, na.rm = TRUE), count = n()) } starwars_mean(homeworld) ## # A tibble: 49 x 4 ## homeworld mean_height mean_mass count ## ## 1 Alderaan 176.3333 64.0 3 ## 2 Aleen Minor 79.0000 15.0 1 ## 3 Bespin 175.0000 79.0 1 ## 4 Bestine IV 180.0000 110.0 1 ## 5 Cato Neimoidia 191.0000 90.0 1 ## 6 Cerea 198.0000 82.0 1 ## 7 Champala 196.0000 NaN 1 ## 8 Chandrila 150.0000 NaN 1 ## 9 Concord Dawn 183.0000 79.0 1 ## 10 Corellia 175.0000 78.5 2 ## # ... with 39 more rows

The enquo() function quotes the name we put in our function (homeworld), and the bang-bang unquotes and uses that name instead of var. That’s it. What about seplyr? With seplyr, we just have to (and I quote)

• Change dplyr verbs to their matching seplyr “*_se()” adapters.
• Add quote marks around names and expressions.
• Convert sequences of expressions (such as in the summarize()) to explicit vectors by adding the “c()” notation.
• Replace “=” in expressions with “:=”.

This is the result:

library(seplyr) starwars_mean <- function(my_var) { starwars %>% group_by_se(my_var) %>% summarize_se(c("mean_height" := "mean(height, na.rm = TRUE)", "mean_mass" := "mean(mass, na.rm = TRUE)", "count" := "n()")) } starwars_mean("homeworld") ## # A tibble: 49 x 4 ## homeworld mean_height mean_mass count ## ## 1 Alderaan 176.3333 64.0 3 ## 2 Aleen Minor 79.0000 15.0 1 ## 3 Bespin 175.0000 79.0 1 ## 4 Bestine IV 180.0000 110.0 1 ## 5 Cato Neimoidia 191.0000 90.0 1 ## 6 Cerea 198.0000 82.0 1 ## 7 Champala 196.0000 NaN 1 ## 8 Chandrila 150.0000 NaN 1 ## 9 Concord Dawn 183.0000 79.0 1 ## 10 Corellia 175.0000 78.5 2 ## # ... with 39 more rows

Basically, we had to change the entire pipeline. If re-usability was the goal, I think we lost some of it here. But, wait, we are still using non-standard evaluation in the first example. What if we really need to provide the grouping variable as a string? Easy enough: we just need to replace enquo() with as.name() to convert the string to a name:

starwars_mean <- function(var) { var <- as.name(var) starwars %>% group_by(!!var) %>% summarise(mean_height = mean(height, na.rm = TRUE), mean_mass = mean(mass, na.rm = TRUE), count = n()) } starwars_mean("homeworld") ## # A tibble: 49 x 4 ## homeworld mean_height mean_mass count ## ## 1 Alderaan 176.3333 64.0 3 ## 2 Aleen Minor 79.0000 15.0 1 ## 3 Bespin 175.0000 79.0 1 ## 4 Bestine IV 180.0000 110.0 1 ## 5 Cato Neimoidia 191.0000 90.0 1 ## 6 Cerea 198.0000 82.0 1 ## 7 Champala 196.0000 NaN 1 ## 8 Chandrila 150.0000 NaN 1 ## 9 Concord Dawn 183.0000 79.0 1 ## 10 Corellia 175.0000 78.5 2 ## # ... with 39 more rows

But we can do even better if we remember that dplyr provides scoped variants (see ?dplyr::scoped) for most of the verbs. In this case, group_by_at() comes in handy:

starwars_mean <- function(var) { starwars %>% group_by_at(var) %>% summarise(mean_height = mean(height, na.rm = TRUE), mean_mass = mean(mass, na.rm = TRUE), count = n()) } starwars_mean("homeworld") ## # A tibble: 49 x 4 ## homeworld mean_height mean_mass count ## ## 1 Alderaan 176.3333 64.0 3 ## 2 Aleen Minor 79.0000 15.0 1 ## 3 Bespin 175.0000 79.0 1 ## 4 Bestine IV 180.0000 110.0 1 ## 5 Cato Neimoidia 191.0000 90.0 1 ## 6 Cerea 198.0000 82.0 1 ## 7 Champala 196.0000 NaN 1 ## 8 Chandrila 150.0000 NaN 1 ## 9 Concord Dawn 183.0000 79.0 1 ## 10 Corellia 175.0000 78.5 2 ## # ... with 39 more rows

That’s it: no bang-bang, just strings and only one change to the original code. Let’s dwell on the potential of the scoped variants with a final example. We can make a completely generic re-usable “grouped mean” function using seplyr and R’s paste0() function to build up expressions:

grouped_mean <- function(data, grouping_variables, value_variables) { result_names <- paste0("mean_", value_variables) expressions <- paste0("mean(", value_variables, ", na.rm = TRUE)") data %>% group_by_se(grouping_variables) %>% summarize_se(c(result_names := expressions, "count" := "n()")) } starwars %>% grouped_mean("eye_color", c("mass", "birth_year")) ## # A tibble: 15 x 4 ## eye_color mean_mass mean_birth_year count ## ## 1 black 76.28571 33.00000 10 ## 2 blue 86.51667 67.06923 19 ## 3 blue-gray 77.00000 57.00000 1 ## 4 brown 66.09231 108.96429 21 ## 5 dark NaN NaN 1 ## 6 gold NaN NaN 1 ## 7 green, yellow 159.00000 NaN 1 ## 8 hazel 66.00000 34.50000 3 ## 9 orange 282.33333 231.00000 8 ## 10 pink NaN NaN 1 ## 11 red 81.40000 33.66667 5 ## 12 red, blue NaN NaN 1 ## 13 unknown 31.50000 NaN 3 ## 14 white 48.00000 NaN 1 ## 15 yellow 81.11111 76.38000 11

And the same with dplyr’s scoped verbs (note that I’ve added the last rename_at() on a whim, just to get exactly the same output as before, but it is not really necessary):

grouped_mean <- function(data, grouping_variables, value_variables) { data %>% group_by_at(grouping_variables) %>% mutate(count = n()) %>% summarise_at(c(value_variables, "count"), mean, na.rm = TRUE) %>% rename_at(value_variables, funs(paste0("mean_", .))) } starwars %>% grouped_mean("eye_color", c("mass", "birth_year")) ## # A tibble: 15 x 4 ## eye_color mean_mass mean_birth_year count ## ## 1 black 76.28571 33.00000 10 ## 2 blue 86.51667 67.06923 19 ## 3 blue-gray 77.00000 57.00000 1 ## 4 brown 66.09231 108.96429 21 ## 5 dark NaN NaN 1 ## 6 gold NaN NaN 1 ## 7 green, yellow 159.00000 NaN 1 ## 8 hazel 66.00000 34.50000 3 ## 9 orange 282.33333 231.00000 8 ## 10 pink NaN NaN 1 ## 11 red 81.40000 33.66667 5 ## 12 red, blue NaN NaN 1 ## 13 unknown 31.50000 NaN 3 ## 14 white 48.00000 NaN 1 ## 15 yellow 81.11111 76.38000 11

Wrapping up, the tidyeval paradigm may seem difficult at first glance, but don't miss the wood for the trees: the new version of dplyr is full of tools that will make your life easier, not harder.

Article originally published in Enchufa2.es: Programming with dplyr by using dplyr.


To leave a comment for the author, please follow the link and comment on their blog: R – Enchufa2. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Data Visualization with googleVis exercises part 8

Sun, 07/23/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Annotation & Sankey Charts

In the eighth part of our series we are going to learn about the features of some interesting types of charts. More specifically we will talk about Annotation and Sankey charts.

Read the examples below to understand the logic of what we are going to do, and then test your skills with the exercise set we prepared for you. Let's begin!

Answers to the exercises are available here.

Package

As you already know, the first thing you have to do is install and load the googleVis package with:
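The code block appears to have been dropped here; the standard commands are:

```r
# Install once (from CRAN), then load the package in each session
install.packages("googleVis")
library(googleVis)
```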

NOTE: The charts are created locally by your browser. In case they are not displayed at once press F5 to reload the page. All charts require an Internet connection.

Annotation chart

It is quite simple to create an Annotation Chart with googleVis. We will use the "Stock" dataset. You can see the variables of your dataset with head().
Look at the example below to create a simple Annotation Chart:
AnnoC <- gvisAnnotationChart(Stock)
plot(AnnoC)

Exercise 1

Create a list named “AnnoC” and pass to it the “Stock” dataset as an annotation chart. HINT: Use gvisAnnotationChart().

Exercise 2

Plot the annotation chart. HINT: Use plot().

Set the variables

As you can see, the annotation chart you built is empty, so we have to fill it with some information. We will use the variables of the "Stock" dataset for this purpose, like this:
datevar="Date",
numvar="Value",
idvar="Device",
titlevar="Title",
annotationvar="Annotation"
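Put together, a full call might look like the following sketch (it mirrors the argument list above and is an illustration, not the exercises' official answer):

```r
library(googleVis)

# Annotation chart of the built-in Stock dataset, mapping each column
# listed above to the corresponding gvisAnnotationChart() argument
AnnoC <- gvisAnnotationChart(Stock,
                             datevar = "Date",
                             numvar = "Value",
                             idvar = "Device",
                             titlevar = "Title",
                             annotationvar = "Annotation")
plot(AnnoC)  # renders the chart in your browser
```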

Learn more about using GoogleVis in the online course Mastering in Visualization with R programming. In this course you will learn how to:

• Work extensively with the GoogleVis package and its functionality
• Learn what visualizations exist for your specific use case
• And much more

Exercise 3

Use the example above to fill your annotation chart with the proper information and plot the chart.

Dimensions

You can use height and width to change the dimensions of the annotation chart.
options=list(width=500, height=300)

Exercise 4

Set height to 700, width to 500 and plot the chart. HINT: Use list().

Colours

You can change the colours of the lines with:
options=list(colors="['green','yellow']")

Exercise 5

Set the line colours to green and red and plot the chart.

With the fill option you can color the relevant areas and adjust how filled these areas will be. Look at the example:
options=list(colors="['green','yellow']",fill=20)

Exercise 6

Set fill to 50 and plot the chart.

Exercise 7

Now set fill to 150 to see the difference and plot the chart.

Sankey Chart

Before creating a sankey chart we have to create a data frame that will serve as an example, so copy and paste this code to create the data frame "exp" first:
exp <- data.frame(From=c(rep("A",3), rep("B", 3)),
To=c(rep(c("X", "Y", "Z"),2)),
Weight=c(6,9,7,9,3,1))

As you can see, we created a data frame with the variables "From", "To" and "Weight". You can use head() in order to see them.
It is quite simple to create a Sankey Chart with googleVis. We will use the "exp" data frame.
Look at the example below to create a simple Sankey Chart:
SankeyC <- gvisSankey(exp)
plot(SankeyC)

Exercise 8

Create a list named “SankeyC” and pass to it the “exp” dataset as a sankey chart. HINT: Use gvisSankey().

Exercise 9

Plot the sankey chart. HINT: Use plot().

You can change the link colours with:
options=list(sankey="{link: {color: { fill: 'red' }}}")
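As a complete call (a sketch, with the JSON brace pairs balanced), colouring the links red:

```r
library(googleVis)

# The example data frame from above
exp <- data.frame(From = c(rep("A", 3), rep("B", 3)),
                  To = c(rep(c("X", "Y", "Z"), 2)),
                  Weight = c(6, 9, 7, 9, 3, 1))

# The sankey option is a JSON fragment, so every opening brace needs a close
SankeyC <- gvisSankey(exp,
                      options = list(sankey = "{link: {color: {fill: 'red'}}}"))
plot(SankeyC)
```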

Exercise 10

Color the links of this sankey chart green and plot it.


To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

### Version 2.2.1 Released

Sun, 07/23/2017 - 06:38

(This article was first published on ggtern: ternary diagrams in R, and kindly contributed to R-bloggers)

It has been a while since any kind of significant update has been released for the ggtern library, and the other day a minor update was submitted to CRAN, which is now available. There were a few minor bug fixes and the inclusion of a few new geometries, which some of you may find of value.

Newly Approved Geometry: geom_encircle

One of our users commented on the ggtern blog that the geom_encircle geometry provided by the ggalt package might be handy. After some investigation, it didn't need any kind of modification, so it has been added to the list of approved protos and can be used as follows:

library(ggtern) library(ggalt) #<<< Install this if it isn't installed! data("Fragments") base = ggtern(Fragments,aes(Qm,Qp,Rf+M,fill=GrainSize,shape=GrainSize)) + theme_bw() + theme_legend_position('tr') + geom_encircle(alpha=0.5,size=1) + geom_point() + labs(title = "Example Plot", subtitle = "using geom_encircle") print(base)

Pretty simple, huh? This produces the following output:

As you can see (and as expected), points have been circled in an ‘approximate’ manner, sort of like someone grabbing a pen and circling on a piece of paper, perfect for highlighting features without any sort of statistical significance. Of course you will need to install the ggalt package, if it is not already installed.

New Text/Label Geometries:

Some other feedback was generously offered: it might be convenient if users could apply text or label elements to the plot region by defining a fractional position in x or y relative to the viewport. This has been taken on board, and two new geometries have been created, geom_text_viewport and geom_label_viewport. These are basically the same as the geom_text and geom_label geometries, except that the user only provides an x and y positional mapping, with values constrained between the limits [0,1] representing fractional coordinates of the plot viewport. They are used as follows:

df.vp = data.frame(x = c(.5,1,0,0,1), y = c(.5,1,0,1,0), label = c('Middle','Top Right','Bottom Left','Top Left','Bottom Right')) base2 = base + theme(legend.position = 'right') + geom_mask() + geom_text_viewport(data=df.vp,aes(x=x,y=y,label=label,color=label),inherit.aes = FALSE) + labs(color = "Label", subtitle = "using geom_text_viewport") print(base2)

base3 = base + theme(legend.position = 'right') + geom_mask() + geom_label_viewport(data=df.vp,aes(x=x,y=y,label=label,color=label),inherit.aes = FALSE) + labs(color = "Label", subtitle = "using geom_label_viewport") print(base3)

The above have the special added benefit, with respect to ternary diagrams, of reducing the mental overhead of having to think in ternary coordinates when you want to manually place a label somewhere. Furthermore, both of these new geometries behave similarly on native ggplot2 objects, so they represent an intuitive way to place labels relative to the plot region without having to consider the axis limits.

Finally, many of you have been citing ggtern in the literature, for which I am grateful. Please continue to do so, and let me know so that I can add your respective publications to the publications ledger.

I have been thinking a lot about the best way to enable piper diagrams, and/or diagrams with cartesian-binary plots perpendicular to the ternary axes. I am hoping to get this functionality happening soon, which will considerably improve the depth of outputs that this package is able to produce.

Some of you have contacted me to produce custom geometries for your respective projects, and as a final plug, this is something I am more than happy to do…

Cheerio,

NH

The post Version 2.2.1 Released appeared first on ggtern: ternary diagrams in R.


To leave a comment for the author, please follow the link and comment on their blog: ggtern: ternary diagrams in R.

### Tutorial: Using seplyr to Program Over dplyr

Sat, 07/22/2017 - 19:43

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

seplyr is an R package that makes it easy to program over dplyr 0.7.*.

To illustrate this we will work an example.

Suppose you had worked out a dplyr pipeline that performed an analysis you were interested in. For an example we could take something similar to one of the examples from the dplyr 0.7.0 announcement.

suppressPackageStartupMessages(library("dplyr")) packageVersion("dplyr") ## [1] '0.7.2' cat(colnames(starwars), sep='\n') ## name ## height ## mass ## hair_color ## skin_color ## eye_color ## birth_year ## gender ## homeworld ## species ## films ## vehicles ## starships starwars %>% group_by(homeworld) %>% summarise(mean_height = mean(height, na.rm = TRUE), mean_mass = mean(mass, na.rm = TRUE), count = n()) ## # A tibble: 49 x 4 ## homeworld mean_height mean_mass count ## ## 1 Alderaan 176.3333 64.0 3 ## 2 Aleen Minor 79.0000 15.0 1 ## 3 Bespin 175.0000 79.0 1 ## 4 Bestine IV 180.0000 110.0 1 ## 5 Cato Neimoidia 191.0000 90.0 1 ## 6 Cerea 198.0000 82.0 1 ## 7 Champala 196.0000 NaN 1 ## 8 Chandrila 150.0000 NaN 1 ## 9 Concord Dawn 183.0000 79.0 1 ## 10 Corellia 175.0000 78.5 2 ## # ... with 39 more rows

The above is colloquially called "an interactive script." The name comes from the fact that we use names of variables (such as "homeworld") that would only be known from looking at the data directly in the analysis code. Only somebody interacting with the data could write such a script (hence the name).

It has long been considered a point of discomfort to convert such an interactive dplyr pipeline into a re-usable script or function. That is, a script or function that specifies column names in some parametric or re-usable fashion. Roughly, it means the names of the data columns are not yet known when we are writing the code (and this is what makes the code re-usable).

This inessential (or conquerable) difficulty is largely due to the preference for non-standard evaluation interfaces (that is, interfaces that capture and inspect un-evaluated expressions from their calling environment) in the design of dplyr.

seplyr is a dplyr adapter layer that prefers "slightly clunkier" standard interfaces (or referentially transparent interfaces), which are actually very powerful and can be used to some advantage.

The above description and comparisons can come off as needlessly broad and painfully abstract. Things are much clearer if we move away from theory and return to our practical example.

Let’s translate the above example into a re-usable function in small (easy) stages. First translate the interactive script from dplyr notation into seplyr notation. This step is a pure re-factoring, we are changing the code without changing its observable external behavior.

The translation is mechanical in that it is mostly using seplyr documentation as a lookup table. What you have to do is:

• Change dplyr verbs to their matching seplyr "*_se()" adapters.
• Add quote marks around names and expressions.
• Convert sequences of expressions (such as in the summarize()) to explicit vectors by adding the "c()" notation.
• Replace "=" in expressions with ":=".

Our converted code looks like the following.

# devtools::install_github("WinVector/seplyr") library("seplyr") starwars %>% group_by_se("homeworld") %>% summarize_se(c("mean_height" := "mean(height, na.rm = TRUE)", "mean_mass" := "mean(mass, na.rm = TRUE)", "count" := "n()")) ## # A tibble: 49 x 4 ## homeworld mean_height mean_mass count ## ## 1 Alderaan 176.3333 64.0 3 ## 2 Aleen Minor 79.0000 15.0 1 ## 3 Bespin 175.0000 79.0 1 ## 4 Bestine IV 180.0000 110.0 1 ## 5 Cato Neimoidia 191.0000 90.0 1 ## 6 Cerea 198.0000 82.0 1 ## 7 Champala 196.0000 NaN 1 ## 8 Chandrila 150.0000 NaN 1 ## 9 Concord Dawn 183.0000 79.0 1 ## 10 Corellia 175.0000 78.5 2 ## # ... with 39 more rows

This code works the same as the original dplyr code. Obviously, at this point all we have done is work to make the code a bit less pleasant looking. We have yet to see any benefit from this conversion (though we can turn this on its head and say all the original dplyr notation is saving us is from having to write a few quote marks).

The benefit is: this new code can very easily be parameterized and wrapped in a re-usable function. In fact it is now simpler to do than to describe.

For example: suppose (as in the original example) we want to create a function that lets us choose the grouping variable? This is now easy, we copy the code into a function and replace the explicit value "homeworld" with a variable:

starwars_mean <- function(my_var) { starwars %>% group_by_se(my_var) %>% summarize_se(c("mean_height" := "mean(height, na.rm = TRUE)", "mean_mass" := "mean(mass, na.rm = TRUE)", "count" := "n()")) } starwars_mean("hair_color") ## # A tibble: 13 x 4 ## hair_color mean_height mean_mass count ## ## 1 auburn 150.0000 NaN 1 ## 2 auburn, grey 180.0000 NaN 1 ## 3 auburn, white 182.0000 77.00000 1 ## 4 black 174.3333 73.05714 13 ## 5 blond 176.6667 80.50000 3 ## 6 blonde 168.0000 55.00000 1 ## 7 brown 175.2667 79.27273 18 ## 8 brown, grey 178.0000 120.00000 1 ## 9 grey 170.0000 75.00000 1 ## 10 none 180.8889 78.51852 37 ## 11 unknown NaN NaN 1 ## 12 white 156.0000 59.66667 4 ## 13 141.6000 314.20000 5

In seplyr, programming is easy (just replace values with variables). For example, we can make a completely generic re-usable "grouped mean" function using R's paste0() function to build up expressions.

grouped_mean <- function(data, grouping_variables, value_variables) { result_names <- paste0("mean_", value_variables) expressions <- paste0("mean(", value_variables, ", na.rm = TRUE)") calculation <- result_names := expressions print(as.list(calculation)) # print for demonstration data %>% group_by_se(grouping_variables) %>% summarize_se(c(calculation, "count" := "n()")) } starwars %>% grouped_mean(grouping_variables = "eye_color", value_variables = c("mass", "birth_year")) ## $mean_mass ## [1] "mean(mass, na.rm = TRUE)" ## ##$mean_birth_year ## [1] "mean(birth_year, na.rm = TRUE)" ## # A tibble: 15 x 4 ## eye_color mean_mass mean_birth_year count ## ## 1 black 76.28571 33.00000 10 ## 2 blue 86.51667 67.06923 19 ## 3 blue-gray 77.00000 57.00000 1 ## 4 brown 66.09231 108.96429 21 ## 5 dark NaN NaN 1 ## 6 gold NaN NaN 1 ## 7 green, yellow 159.00000 NaN 1 ## 8 hazel 66.00000 34.50000 3 ## 9 orange 282.33333 231.00000 8 ## 10 pink NaN NaN 1 ## 11 red 81.40000 33.66667 5 ## 12 red, blue NaN NaN 1 ## 13 unknown 31.50000 NaN 3 ## 14 white 48.00000 NaN 1 ## 15 yellow 81.11111 76.38000 11

The only part that requires more study and practice is messing around with the expressions using paste0() (for more details on the string manipulation please try "help(paste)"). Notice also that we used the ":=" operator to bind the list of desired result names to the matching calculations (please see "help(named_map_builder)" for more details).

The point is: we did not have to bring in (or study) any deep-theory or heavy-weight tools such as rlang/tidyeval or lazyeval to complete our programming task. Once you are in seplyr notation, changes are very easy. You can separate translating into seplyr notation from the work of designing your wrapper function (breaking your programming work into smaller easier to understand steps).

The seplyr method is simple, easy to teach, and powerful. The package contains a number of worked examples both in help() and vignette(package='seplyr') documentation.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

### What’s in our internal chaimagic package at work

Sat, 07/22/2017 - 02:00

(This article was first published on Maëlle, and kindly contributed to R-bloggers)

At my day job I'm a data manager and statistician for an epidemiology project called CHAI, led by Cathryn Tonne. CHAI means "Cardio-vascular health effects of air pollution in Telangana, India" and you can find out more about it in our recently published protocol paper. At my institute you could also find the PASTA and TAPAS projects, so apparently epidemiologists are good at naming things, or obsessed with food… But back to CHAI! This week Sean Lopp from RStudio wrote an interesting blog post about internal packages. I liked reading it and feeling good, because we do have an internal R package for CHAI! In this blog post, I'll explain what's in there, in the hope of maybe providing inspiration for your own internal package!

As posted in this tweet, this pic represents the Barcelona contingent of CHAI, a really nice group to work with! We have other colleagues in India obviously, but also in the US.

The birth of chaimagic: help for data wrangling

Note: part of this post was submitted as an abstract for the useR! conference in Brussels, for which I received feedback from my whole team, as well as from François Michonneau, Dirk Schumacher and Jennifer Thompson. Thanks a lot! I got to present a lightning talk about rtimicropem instead.

Right from the beginning of my time here in October 2015 I had quite a lot of data to clean, which I started doing in R of course. One task in particular was parsing filenames because those were informative in our project, containing for instance a date and a participant ID. I wrote a function for that, batch_parse_filenames, which parses all filenames in a vector and returns a list composed of two data.frames: one with the information contained in the filename (e.g. participant ID, date of sampling) and one with possible parsing mistakes (e.g. a nonexistent ID) and an informative message. It was a very useful function, and I packaged it up together with the data containing the participant IDs for instance. This was the beginning of an internal package!
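As an illustration only, a much-simplified version of such a parser might look like the sketch below (the filename pattern, column names, and function name are invented; the real batch_parse_filenames is internal to chaimagic):

```r
# Hypothetical filename parser: expects names like "CHAI_P001_2016-03-01.csv"
# and returns a list of two data frames, parsed values and parsing problems.
parse_filenames <- function(filenames, valid_ids) {
  pattern <- "^CHAI_(P[0-9]{3})_([0-9]{4}-[0-9]{2}-[0-9]{2})\\.csv$"
  ok    <- grepl(pattern, filenames)
  ids   <- sub(pattern, "\\1", filenames[ok])
  dates <- as.Date(sub(pattern, "\\2", filenames[ok]))
  known <- ids %in% valid_ids  # flag nonexistent participant IDs
  list(
    parsed = data.frame(filename = filenames[ok][known],
                        participant_id = ids[known],
                        date = dates[known],
                        stringsAsFactors = FALSE),
    problems = data.frame(
      filename = c(filenames[!ok], filenames[ok][!known]),
      message  = c(rep("filename does not match expected pattern", sum(!ok)),
                   rep("nonexistent participant ID", sum(!known))),
      stringsAsFactors = FALSE)
  )
}
```

Returning the problems alongside the parsed values means a batch run never silently drops a malformed file.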

I decided to call it chaimagic because it sounded cool, and this despite knowing that there’s absolutely no magic in the code I write.

The growth of chaimagic: support for data analysis

chaimagic evolved with helper functions for analysis, e.g. a function returning the season in India from a date (e.g. monsoon, winter), or functions for common operations on variables from our questionnaire data. chaimagic also got one more contributor.

chaimagic also contains a Shiny dashboard for exploring personal monitoring data that were collected in the project: one selects a participant-day and gets interactive leaflet and rbokeh plots of GPS, air quality, and time-activity data from questionnaires and wearable cameras. I told my boss that the dashboard was brought to us by the biblical Magi but maybe I helped them, who knows.

The maturity of chaimagic: serving scientific communication

Now, our package also supports scientific communication thanks to an RMarkdown template for internal reports which fostered reproducibility of analyses by making the use of RMarkdown even more appealing. Having an RMarkdown template also supports consistent branding, which is discussed in the RStudio blog post.

chaimagic also includes a function returning affiliation information for any possible collaborator we have; and a function returning the agreed-upon acknowledgement sentences to be put in each manuscript. These are pieces of information we can be very happy not to have to go look for in a folder somewhere, we can get them right without leaving R!

Why is chaimagic a really cool package?

chaimagic has been a valuable tool for the work of two statisticians/data managers, one GIS technician, one post-doc and one PhD student. Discussing its development and using it was an occasion to share our R joy, thus fostering best practices in scientific computing in our project. Why use Stata or other statistical software when you have such a nice internal R package? We found that even though our package served a small team of users, an R package provided a robust infrastructure to ensure that everyone was using the same code and following R coding best practices in our team.

So this is very good… but we all know from e.g. Airbnb (see also this post about Airbnb and R) that stickers are a very important aspect of an internal R package. I was over the moon when my colleague Carles handed me these fantastic chaimagic stickers! He used the hexSticker package. I’ll leave the team in September after everyone’s vacations, and this piece of art was the goodbye present I received at my goodbye lunch. Doesn’t it make our internal package completely cool now?


To leave a comment for the author, please follow the link and comment on their blog: Maëlle. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Stan Weekly Roundup, 21 July 2017

Fri, 07/21/2017 - 21:00

(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)

It was another productive week in Stan land. The big news is that

• Jonathan Auerbach reports that

A team of Columbia students (mostly Andrew’s, including myself) recently won first place in a competition predicting elementary school enrollment. I heard 192 entered, and there were 5 finalists….Of course, we used Stan (RStan specifically). … Thought it might be Stan news worthy.

I’d say that’s newsworthy. Jon also provided a link to the “challenge” page, a New York City government-sponsored “call for innovations”: Enhancing School Zoning Efforts by Predicting Population Change. They took home a US$20K paycheck for their efforts! Stan is seeing quite a lot of use these days among demographers and others looking to predict forward from time series data. Jonathan has been very active using government data sets (see his StanCon 2017 presentation with Rob Trangucci, Twelve Cities: Does lowering speed limits save pedestrian lives?). Hopefully they’ll share their code; maybe they already have. I really like to see this combination of government data and careful statistical analysis.

In other news,

• Andrew Gelman has been driving a lot of rethinking of our interfaces because he’s revising his and Jennifer Hill’s regression book (the revision will be two books). Specifically, he’s thinking a lot about workflow and how we can manage model expansion by going from fixed to modeled parameters. Right now, the process is onerous in that you have to move data variables into the parameters block and keep updating their priors. Andrew wants to be able to do this from the outside, but Michael Betancourt and I both expressed a lot of skepticism, since it breaks a lot of our fundamental abstractions (like a Stan program defining the density that’s fit!). More to come on this hot topic. Any ideas on how to manage developing a model would be appreciated. This goes back to the very first thing Matt Hoffman, Michael Malecki and I worked on with Andrew when he hired us, before we’d conceived Stan. You’d think we’d have better advice on this after all this time. I’ve seen people do everything from using the C++ preprocessor to writing elaborate program-generation code in R.

• Breck Baldwin has been working on governance and we’re converging on a workable model that we’ll share with everyone soon.
The goal is to make the governance clear and less of a smoke-filled-room job by those of us who happen to go to lunch after the weekly meetings.

• Jonah Gabry is taking on the ggplot2-ification of the new regression book and trying to roll everything into a combination of RStanArm and BayesPlot. No word yet if the rest of the tidyverse is to follow. Andrew said, “I’ll see what Jonah comes up with,” or something to that effect.

• Jonah has also been working on priors for multilevel regression and poststratification with Yajuan Si (former postdoc of Andrew’s, now at U. Wisconsin); the trick is doing something reasonable when you have lots of interactions.

• Ben Goodrich has been working on the next release of RStanArm. It’s beginning to garner a lot of contributions. Remember that the point has been to convert a lot of common point-estimation packages to Bayesian versions and supply them with familiar R interfaces. Ben has particularly been working on survival analysis lately.

• Our high school student intern (don’t know if I can mention names online—the rest of our developers are adults!) is working on applying the Cook-Gelman-Rubin metric to evaluating various models. We’re doing much more of this method and it needs a less verbose and more descriptive name!

• Mitzi Morris submitted a pull request for the Stan repo to add line numbers to error messages involving compound declare-and-define statements.

• A joint effort by Mitzi Morris, Dan Simpson, Imad Ali, and Miguel A Martinez Beneito has led to convergence of models and fits among BUGS, Stan, and INLA for intrinsic conditional autoregression (ICAR) models. Imad is building the result into RStanArm and has figured out how to extend the loo (leave-one-out cross-validation) package to deal with spatial models. Look for it in a Stan package near you soon. Mitzi is working on the case study, which has been updated in the example-models repo.
• Charles Margossian knocked off a paper on the mixed ODE solver he’s been working on, with a longer paper promised that goes through all the code details. Not sure if that’s on arXiv or not. He’s also been training Bill Gillespie to code in C++, which is great news for the project since Charles has to contend with his first year of grad school next year (whereas Bill is facing a pleasant retirement of tracking down memory leaks). Charles is also working on getting the algebraic and fixed-state solvers integrated into Torsten before the fall.

• Krzysztof Sakrejda has a new paper out motivating a custom density he wrote in Stan for tracking delays in disease incidence in a countrywide analysis for Thailand. I’m not sure where he put it.

• Yuling Yao is revising the stacking paper (for a kind of improved model averaging). I believe this is going into the loo package (or is maybe already there). So much going on with Stan I can’t keep up!

• Yuling also revised the simulated tempering paper, which is some custom code on top of Stan to fit models with limited multimodality. There was some discussion about realistic examples with limited multimodality, and we hope to have a suite of them to go along with the paper.

• Sebastian Weber, Bob Carpenter, Michael Betancourt, and Charles Margossian had a long and productive meeting to design the MPI interface. It’s very tricky trying to find an interface that’ll fit into the Stan language and let you ship off data once and reuse it. I think we’re almost there. The design discussion is on Discourse.

• Rob Trangucci is finishing the GP paper (after revising the manual chapter) with Dan Simpson, Aki Vehtari, and Michael Betancourt.

I’m sure I’m missing a lot of goings-on, especially among people not at our weekly meetings. If you know of things that should be on this list, please let me know.

The post Stan Weekly Roundup, 21 July 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.
To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Modeling, Causal Inference, and Social Science.

### IEEE Spectrum 2017 Top Programming Languages

Fri, 07/21/2017 - 20:41

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

IEEE Spectrum has published its fourth annual ranking of top programming languages, and the R language is again featured in the Top 10. This year R ranks at #6, down a spot from its 2016 ranking (and with an IEEE score — derived from search, social media, and job listing trends — tied with the #5 place-getter, C#). Python has taken the #1 slot from C, jumping from its #3 ranking in 2016. For R (a domain-specific language for data science) to rank in the top 10, and for Python (a general-purpose language with many data science applications) to take the top spot, may seem like a surprise. I attribute this to continued broad demand for machine intelligence application development, driven by the growth of "big data" initiatives and the strategic imperative to capitalize on these data stores by companies worldwide. Other data-oriented languages appear in the Top 50 rankings, including Matlab (#15), SQL (#23), Julia (#31) and SAS (#37).
For the complete announcement of the 2017 IEEE Spectrum rankings, including additional commentary and analysis of changes, follow the link below.

IEEE Spectrum: The 2017 Top Programming Languages

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

### How to create reports with R Markdown in RStudio

Fri, 07/21/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Introduction

R Markdown is one of the most popular data science tools and is used to save and execute code and to create exceptional, easily shareable reports. The documents that R Markdown produces are fully reproducible and support a wide variety of static and dynamic output formats. R Markdown uses markdown syntax, which provides an easy way of creating documents that can be converted to many other file types, while embedding R code in the report, so it is not necessary to keep the report and the R script separately. Furthermore, the report is written as normal text, so knowledge of HTML is not required. No additional files are needed either, because everything is incorporated in the HTML file.
Package Installation

In order to use the R Markdown package we have to install and load it with:

install.packages("rmarkdown")
library(rmarkdown)

Create an R Markdown file

To create a new R Markdown file (.Rmd) in RStudio we follow these steps: select File -> New File -> R Markdown…, then choose the file type you want to create. We choose the default .html, which can be easily converted to other file types later. The .Rmd file we just created comes with default text, but we will create our own file step by step, so we delete everything.

R Markdown Syntax

The YAML Header

At the top of any R Markdown script there is always the YAML header section, enclosed by ---; it is the minimum code chunk that should be put there. By default this includes a title, author, date, and the file type you want as output. Whatever we set in the header section is applied to the whole document. This is an example of a YAML header at the top of an .Rmd script which creates an html document:

---
title: "R Markdown Example"
author: Makis Kasvikis
date: 17/Aug/2017
output: html_document
---

Inserting Code

Code that is included in your .Rmd document should be enclosed by three backticks, like the example below:

{r}
summary(cars)

The space inside the brackets is where you can assign rules for that code chunk. The code chunk above says that the code is R code. You can also click the green button with the “plus” symbol in RStudio to insert a code chunk wherever you want.

Learn more about reporting your results in the online course: R for Data Science Solutions. In this course you will learn how to:

• Build a complete workflow in R for your data science problem
• Get in depth on how to report your results in an interactive way
• And much more

Instructions for Code Chunks

As you have probably understood by now, you can use code to do everything you want.
For example you can generate a plot like this:

{r}
plot(cars)

Or you can create a data frame:

{r}
A <- c("Bob", "Tom", "Bill", "Joe")
B <- c(1.78, 1.86, 1.85, 1.70)
dataframe <- data.frame(A, B)
head(dataframe)

In case you don’t want the code to appear in the report, just its result, you should set echo=FALSE in the code chunk instructions:

{r, echo=FALSE}
A <- c("Bob", "Tom", "Bill", "Joe")
B <- c(1.78, 1.86, 1.85, 1.70)
dataframe <- data.frame(A, B)
head(dataframe)

If you want to load a package like knitr, you must load it in the .Rmd file. In case you do not want the entire code chunk and its output to appear in the report, use include=FALSE:

{r, include=FALSE}
library(knitr)

More code chunk instructions:

• eval=TRUE: Is the code run and the results included in the output?
• include=TRUE: Are the code and the results included in the output (the code is still run)?
• echo=TRUE: Is the code displayed alongside the results?
• warning=TRUE: Are warning messages displayed?
• error=FALSE: Are error messages displayed?
• message=TRUE: Are messages displayed?
• tidy=FALSE: Is the code reformatted to make it look “tidy”?
• results="markup": How are results treated?
• cache=FALSE: Are the results cached for future renders?
• comment="##": What character are comments prefaced with?
• fig.width=7: What width (in inches) are the plots?
• fig.align="left": How are figures aligned ("left", "right", "center")?

Inserting Figures

To manually set the figure dimensions, you can insert an instruction:

{r, fig.width=5, fig.height=5, echo=FALSE}
plot(cars)

By default, figures are rendered as .png files by R Markdown. You can change that to .svg, a vector file format, by adding dev="svg" to the code chunk instruction section.
{r, fig.width=5, fig.height=5, echo=FALSE, dev="svg"}
plot(cars)

Inserting Tables

While R Markdown can print the contents of a data frame easily by enclosing the name of the data frame in a code chunk, this does not seem the best idea:

{r, echo=FALSE}
dataframe

A better solution is to use the table-formatting function kable() in the knitr package, like the example below:

{r, echo=FALSE}
library(knitr)
kable(dataframe, digits=1)

Formatting Text

Markdown syntax can be used to change how text appears in your output file. Here are a few common formatting commands:

• header 1: # header 1
• header 2: ## header 2
• header 3: ### header 3
• header 4: #### header 4
• bold text: **bold** text
• italics text: *italics* text
• code text: code text
• include a link: [link](www.r-exercises.com)
• include a picture: ![R Studio Logo](img/R Studio Logo.png)
• LaTeX equation syntax: $A = \pi \times r^{2}$

You can also manually create small tables using markdown syntax, controlling column alignment with colons:

:----: Centre
:----- Left
-----: Right
------ Auto

For example:

| Teams | Wins | Loses |
|:------|:----:|------:|
| A | 21 | 4 |
| B | 19 | 6 |
| C | 17 | 8 |

Compiling an R Markdown file

Select Knit -> Knit to HTML in RStudio. A preview appears in the Viewer window and the .html file is saved to your working directory.

You can set your working directory location using:

setwd("~/...")

You can find your working directory using:

getwd()

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.
### Power analysis and sample size calculation for Agriculture

Fri, 07/21/2017 - 17:27

(This article was first published on R tutorial for Spatial Statistics, and kindly contributed to R-bloggers)

Power analysis is extremely important in statistics, since it allows us to calculate the chances we have of obtaining realistic results. Sometimes researchers tend to underestimate this aspect and are just interested in obtaining significant p-values. The problem with this is that a significance level of 0.05 does not necessarily mean that what you are observing is real. In the book “Statistics Done Wrong” by Alex Reinhart (which you can read for free here: https://www.statisticsdonewrong.com/) this problem is discussed with an example where we can clearly see that a significance of 0.05 does not mean that we have a 5% chance of getting it wrong, but actually closer to a 30% chance of obtaining unrealistic results. This is because there are two types of errors we can make (for example when performing an ANOVA): type I (rejecting the null hypothesis when it is actually true) and type II (accepting the null hypothesis when it is actually false).

Image taken from: https://datasciencedojo.com/

The probability of making a type I error is indicated by α (the significance level) and usually takes a value of 5%; this means that we are happy to consider a scenario where we have a 5% chance of rejecting the null hypothesis when it is actually true. If we are not happy with this, we can further decrease this probability by decreasing the value of α (for example to 1%).
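As a quick illustration of the type I error rate (this simulation is my own addition, not from the original post), we can simulate many experiments in which the null hypothesis is true and count how often a t-test flags significance at α = 0.05:

```r
set.seed(123)

# Simulate 2000 experiments where the null hypothesis is true:
# both groups are drawn from the same normal distribution.
p_values <- replicate(2000, {
  a <- rnorm(20)
  b <- rnorm(20)
  t.test(a, b)$p.value
})

# The proportion of "significant" results is the empirical type I
# error rate, and should be close to the nominal alpha of 0.05.
mean(p_values < 0.05)
```

Even though no real effect exists, roughly 5% of the simulated experiments come out "significant", which is exactly what α promises.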
On the contrary, the probability of making a type II error is expressed by β, which usually takes a value of 20% (meaning a power of 80%). This means we are happy to work assuming that we have a 20% chance of accepting the null hypothesis when it is actually false. If our experiment is not designed properly, we cannot be sure whether we actually made one of these two errors. In other words, if we run a bad experiment and we obtain an insignificant p-value, it may be that we made a type II error, meaning that in reality our treatment works but its effect cannot be detected by our experiment. However, it may also be that we obtained a significant p-value but made a type I error, and if we repeat the experiment we will find different results. The only way we can be sure to run a good experiment is by running a power analysis. By definition, power is the probability of obtaining statistical significance (not necessarily a small p-value, but at least a realistic outcome). Power analysis can be used before an experiment to test whether our design has good chances of succeeding (a priori) or afterwards to test whether the results we obtained were realistic.

Effect Size

A simple and effective definition of effect size is provided in the book “Power Analysis for Experimental Research” by Bausell & Li. They say: “effect size is nothing more than a standardized measure of the size of the mean difference(s) among the study’s groups or of the strength of the relationship(s) among its variables”. Despite this simple definition, the calculation of the effect size is not always straightforward, and many indexes have been proposed over the years. Bausell & Li propose the following definition, in line with what was proposed by Cohen in his “Statistical Power Analysis for the Behavioral Sciences”:

ES = (Ya - Yb) / SDpooled

where ES is the effect size (in Cohen this is referred to as d). In this equation, Ya is the mean of the measures for treatment A, and Yb is the mean for treatment B.
The denominator is the pooled standard deviation, which is computed as follows:

SDpooled = sqrt((SDa^2 + SDb^2) / 2)

where SDa and SDb are the standard deviations for treatments A and B. This is the main definition, but every software package or function tends to use indexes correlated to this, though not identical. We will see each way of calculating the effect size case by case.

One-Way ANOVA

Sample size

For simple models the power calculation can be performed with the package pwr:

library(pwr)

In the previous post (Linear Models) we worked on a dataset where we tested the impact on yield of 6 levels of nitrogen. Let’s assume that we need to run a similar experiment and we would like to know how many samples we should collect (or how many plants we should use in the glasshouse) for each level of nitrogen. To calculate this we need to do a power analysis. To compute the sample size required to reach good power we can run the following line of code:

pwr.anova.test(k=6, f=0.25, sig.level=0.05, power=0.8)

Let’s start describing the options from the end. The option power specifies the power you require for your experiment; in general, this can be set to 0.8, as mentioned above. The significance level is alpha, and usually we are happy to accept a significance of 5%. Another option is k, the number of groups in our experiment; in this case we have 6 groups. Clearly, if we were considering two treatments, the first with 6 levels and the second with 3, k would have been 6*3=18. Finally we have the option f, which is the effect size. As I mentioned above, there are many indexes to express the effect size, and f is one of them. According to Cohen, f can be expressed as:

f = SDmeans / SDpooled

where the numerator is the standard deviation of the group means for the effects that we want to test and the denominator is the common standard deviation. For two means, as in the equation we have seen above, f is simply equal to d/2. Clearly, before running the experiment we do not really know what the effect size will be.
In some cases we may have an idea, for example from previous experiments or a pilot study. However, most of the time we do not have a clue. In such cases we can use the classification suggested by Cohen, who considered the following values for f: 0.1 for a small effect, 0.25 for a medium effect and 0.4 for a large effect. The general rule is that if we do not know anything about our experiment we should use a medium effect size, so in this case 0.25. This was suggested in the book by Bausell & Li and is based on a review of 302 studies in the social and behavioral sciences; for this reason it may well be that the effect size of your experiment is different. However, if you do not have any additional information, this is the only thing the literature suggests. The function above returns the following output:

> pwr.anova.test(k=6, f=0.25, sig.level=0.05, power=0.8)

Balanced one-way analysis of variance power calculation

k = 6
n = 35.14095
f = 0.25
sig.level = 0.05
power = 0.8

NOTE: n is number in each group

In this example we would need 36 samples for each nitrogen level to achieve a power of 80% with a significance of 5%.

Power Calculation

As I mentioned above, sometimes we have a dataset we collected assuming we could reach good power, but we are not actually sure if that is the case. In those instances what we can do is an a posteriori power analysis, where we basically compute the power for a model we already fitted. As you remember from the previous post about linear models, we fitted the following:

mod1 = aov(yield ~ nf, data=dat)

To compute the power we achieved here we first need to calculate the effect size. As discussed above, we have several options: d, f and another index called partial eta squared.
Let’s start from d, which can be simply calculated using the means and standard deviations of two groups, for example N0 (control) and N5:

numerator = (mean(dat[dat$nf=="N5","yield"]) - mean(dat[dat$nf=="N0","yield"]))
denominator = sqrt((sd(dat[dat$nf=="N5","yield"])^2 + sd(dat[dat$nf=="N0","yield"])^2)/2)
d = numerator/denominator

This code simply computes the numerator (difference in means) and the denominator (pooled standard deviation) and then computes Cohen’s d, which results in 0.38. Again, Cohen provides reference values for d, so that we can determine how large our effect is: 0.2 for a small effect, 0.5 for a medium effect and 0.8 for a large effect. From these values we can see that our effect size is actually low, and not medium as we assumed for the a priori analysis. This is important, because if we run the experiment with 36 samples per group we may end up with unrealistic results simply due to low power. For this reason it is my opinion that we should always be a bit more conservative and maybe include some additional replicates or blocks, just to account for potential unforeseen differences between our assumptions and reality.

The function to compute power is again pwr.anova.test, in which the effect size is expressed as f. We have two ways of obtaining it. The first is by using the d value we just calculated and halving it, so in this case f = 0.38/2 = 0.19. However, this will give you the specific effect size for the relation between N0 and N5, and not for the full set of treatments. At this link there is an Excel file that you can use to convert between indexes of effect size: http://www.stat-help.com/spreadsheets/Converting%20effect%20sizes%202012-06-19.xls

Another way to get a fuller picture is by using the partial eta squared, which can be calculated from the sums of squares:

partial eta squared = SS_treatment / (SS_treatment + SS_residuals)

This will tell us the average effect size for all the treatments we applied, so not only for N5 compared to N0, but for all of them.
To compute the partial eta squared we first need to access the ANOVA table, with the function anova:

> anova(mod1)
Analysis of Variance Table
Response: yield
            Df  Sum Sq Mean Sq F value    Pr(>F)
nf           5   23987  4797.4  12.396 6.075e-12 ***
Residuals 3437 1330110   387.0
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

From this table we can extract the sum of squares for the treatment (i.e. nf) and the sum of squares of the residuals, and then solve the equation above:

> EtaSQ = 23987/(23987+1330110)
> print(EtaSQ)
[1] 0.01771439

As for the other indexes, eta squared also has its own reference values (commonly 0.01 for a small effect, 0.06 for a medium effect and 0.14 for a large effect). The relation between f and eta squared is the following:

f = sqrt(EtaSQ / (1 - EtaSQ))

so to compute the f related to the full treatment we can simply do the following:

> f = sqrt(EtaSQ / (1-EtaSQ))
> print(f)
[1] 0.1342902

So now we have everything we need to calculate the power of our model:

> pwr.anova.test(k=6, n=571, f=f, sig.level=0.05)
Balanced one-way analysis of variance power calculation
k = 6
n = 571
f = 0.1342902
sig.level = 0.05
power = 1
NOTE: n is number in each group

To compute the power we run the function pwr.anova.test again, but this time without specifying the option power, replacing it with the option n, the number of samples per group. As you remember from the previous post, this was an unbalanced design, so the number of samples per group is not constant. We could use a vector as input for n, with the sample size of each group; in that case the function would return a power for each group. However, what I did here was input the lowest number, so that we are sure to reach good power for the smallest sample size. As you can see, even with the small effect size we are still able to reach a power of 1, meaning 100%. This is because the sample size is more than adequate to catch even such a small effect. You could try to run the sample size calculation again to see what the minimum sample requirement for the observed effect size would be.
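Following that suggestion (my own sketch, using base R's noncentral F distribution rather than pwr, so the helper names are mine), we can reproduce the calculation pwr.anova.test performs and search for the smallest group size reaching 80% power:

```r
# Power of a balanced one-way ANOVA with k groups, n samples per group
# and effect size f, via the noncentral F distribution; this mirrors
# the quantity pwr::pwr.anova.test reports.
anova_power <- function(k, n, f, alpha = 0.05) {
  u <- k - 1               # numerator degrees of freedom
  v <- (n - 1) * k         # denominator degrees of freedom
  lambda <- k * n * f^2    # noncentrality parameter
  1 - pf(qf(1 - alpha, u, v), u, v, ncp = lambda)
}

# Smallest group size reaching the target power.
min_n <- function(k, f, alpha = 0.05, target = 0.8) {
  n <- 2
  while (anova_power(k, n, f, alpha) < target) n <- n + 1
  n
}

min_n(k = 6, f = 0.25)  # 36 per group, matching the pwr output above

# Minimum group size for the small observed effect size.
f_obs <- sqrt(0.01771439 / (1 - 0.01771439))
min_n(k = 6, f = f_obs)
```

For the observed effect size the requirement per group is several times larger than the 36 the medium-effect assumption suggested, which is exactly the author's point about being conservative.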
Linear Model

The method we have seen above is only valid for one-way ANOVAs. For more complex models, which may simply be ANOVAs with two treatments, we should use the function specific for linear models.

Sample Size

To calculate the sample size for this analysis we can refer once again to the package pwr, but now use the function pwr.f2.test. Using this function is slightly more complex, because here we start reasoning in terms of degrees of freedom for the F ratio:

F = MS_between / MS_within

where MS_between is the mean square variance between groups and MS_within is the mean square variance within each group (see https://cnx.org/contents/crKQFJtQ@14/F-Distribution-and-One-Way-ANO). These two terms are computed as MS_between = SS_between / (k - 1) and MS_within = SS_within / (n - k), where k is the number of groups and n the total number of samples; the degrees of freedom we need to consider are the denominators of these two equations.

For an a priori power analysis we need to input the option u with the degrees of freedom of the numerator of the F ratio, thus MS_between. As you can see, this can be computed as k-1 for a one-way ANOVA. For more complex models we need to calculate the degrees of freedom ourselves. This is not difficult, because we can generate dummy datasets in R with the specific treatment structure we require, so that R will compute the degrees of freedom for us. We can generate a dummy dataset very easily with the function expand.grid:

> data = expand.grid(rep=1:3, FC1=c("A","B","C"), FC2=c("TR1","TR2"))
> data
   rep FC1 FC2
1    1   A TR1
2    2   A TR1
3    3   A TR1
4    1   B TR1
5    2   B TR1
6    3   B TR1
7    1   C TR1
8    2   C TR1
9    3   C TR1
10   1   A TR2
11   2   A TR2
12   3   A TR2
13   1   B TR2
14   2   B TR2
15   3   B TR2
16   1   C TR2
17   2   C TR2
18   3   C TR2

Working with expand.grid is very simple. We just need to specify the levels for each treatment and the number of replicates (or blocks), and the function will generate a dataset with every combination.
Now we just need to add the dependent variable, which we can generate randomly from a normal distribution:

data$Y = rnorm(nrow(data))

Now our dataset is ready so we can fit a linear model to it and generate the ANOVA table:

> mod.pilot = lm(Y ~ FC1*FC2, data=data)
> anova(mod.pilot)
Analysis of Variance Table
Response: Y
Df Sum Sq Mean Sq F value Pr(>F)
FC1 2 0.8627 0.4314 0.3586 0.7059
FC2 1 3.3515 3.3515 2.7859 0.1210
FC1:FC2 2 1.8915 0.9458 0.7862 0.4777
Residuals 12 14.4359 1.2030

Since this is a dummy dataset all the sum of squares and the other values are meaningless. We are only interested in looking at the degrees of freedom.
To calculate the sample size for this analysis we can refer once again to the package pwr, but now use the function pwr.f2.test, as follows:

pwr.f2.test(u = 2, f2 = 0.25, sig.level = 0.05, power=0.8)
The first option in the function is u, which represents the degrees of freedom of the numerator of the F ratio. This is related to the degrees of freedom of the component we want to focus on. As you probably noticed from the model, we are trying to see if there is an interaction between two treatments. From the ANOVA table above we can see that the degrees of freedom of the interaction are equal to 2, so that it what we include as u.
Other options are again power and significance level, which we already discussed. Moreover, in this function the effect size is f2, which is again different from the f we have seen before; f2 has its own set of reference values for small, medium and large effects.

Since we assume we have no idea about the real effect size we use a medium value for the a priori testing.

The function returns the following table:

> pwr.f2.test(u = 2, f2 = 0.25, sig.level = 0.05, power=0.8)
Multiple regression power calculation
u = 2
v = 38.68562
f2 = 0.25
sig.level = 0.05
power = 0.8

As you can see, what the function actually provides us is the value of the degrees of freedom for the denominator of the F test (v), which results in 38.68, so 39 since we always round up.
If we look at the equation to compute MS_within, we can see that its degrees of freedom are given by n-k, meaning that to transform the degrees of freedom into a sample size we need to add back what we calculated before for the option u. The sample size is then equal to n = v + u + 1, so in this case the sample size is 39 + 2 + 1 = 42.

This is not the number of samples per group but it is the total number of samples.
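As a quick sketch of that arithmetic (the numbers are the ones returned by pwr.f2.test in the output above):

```r
# Total sample size from the output of pwr.f2.test():
# n = v + u + 1, rounding v up first.
u <- 2         # numerator df (the interaction)
v <- 38.68562  # denominator df returned by pwr.f2.test
n <- ceiling(v) + u + 1
n  # 42
```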

Another way of looking at the problem is to compute the total power of our model, and not just the power to discriminate between levels of one of the treatments (as we saw above). To do so we can still use the function pwr.f2.test, but with some differences. The first is that we need to compute u using all the elements in the model, so basically sum the degrees of freedom in the ANOVA table, or count all the coefficients in the model minus the intercept:

u = length(coef(mod3))-1

Another difference is in how we compute the effect size f2. Before we used its relation with the partial eta squared; now we can use its relation with the R2 of the model: f2 = R2 / (1 - R2).

With these additional elements we can compute the power of the model.
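As a sketch of this calculation, with a made-up R2 value for illustration (with a real fitted model you would take it from summary(mod3)$r.squared, and u = length(coef(mod3)) - 1):

```r
# Cohen's f2 from the model R-squared: f2 = R2 / (1 - R2).
# R2 here is an illustrative value, not taken from an actual model.
R2 <- 0.30
f2 <- R2 / (1 - R2)
round(f2, 3)  # 0.429
# this f2, together with u, would then go into pwr.f2.test()
```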

Power Calculation

Now we look at estimating the power for a model we’ve already fitted, which can be done with the same function. We will work with one of the models we used in the post about Linear Models:

mod3 = lm(yield ~ nf + bv, data=dat)

Once again we first need to calculate the observed effect size as the eta squared, using again the sum of squares:

> Anova(mod3, type="III")
Anova Table (Type III tests)

Response: yield
            Sum Sq   Df  F value    Pr(>F)
(Intercept) 747872    1 2877.809 < 2.2e-16 ***
nf           24111    5   18.555 < 2.2e-16 ***
bv          437177    1 1682.256 < 2.2e-16 ***
Residuals   892933 3436
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In this example, I used the function Anova (with option type=”III”) in the package car just to remind you that if you have an unbalanced design, like in this case, you should use the type III sum of squares.
From this table we can obtain the sum of squares we need to compute the eta squared, for example for nf we will use the following code:

> EtaSQ = 24111/(24111+892933)
> EtaSQ
[1] 0.02629209

Then we need to transform this into f2 (or f squared), which is what the pwr.f2.test function uses:

> f2 = EtaSQ / (1-EtaSQ)
> f2
[1] 0.02700203

The only thing we need to do now is calculate the value of v, i.e. the denominator degrees of freedom. This is equal to n (the number of samples) - u - 1, but a quick way of obtaining it is to look at the ANOVA table above and take the degrees of freedom of the residuals, i.e. 3436.

Now we have everything we need to obtain the observed power:

> pwr.f2.test(u = 5, v = 3436, f2 = f2, sig.level = 0.05)

     Multiple regression power calculation

              u = 5
              v = 3436
             f2 = 0.02700203
      sig.level = 0.05
          power = 1

which again returns a very high power, since we have a lot of samples.
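For the curious: I believe pwr.f2.test computes power from the non-central F distribution with Cohen’s noncentrality parameter lambda = f2 * (u + v + 1). If so, the power above can be reproduced in base R alone; the helper f2_power below is my own sketch, not part of the pwr package.

```r
# Power of the F test via the non-central F distribution.
f2_power <- function(u, v, f2, sig.level = 0.05) {
  lambda <- f2 * (u + v + 1)                    # Cohen's noncentrality parameter
  crit <- qf(1 - sig.level, df1 = u, df2 = v)   # critical F value under H0
  1 - pf(crit, df1 = u, df2 = v, ncp = lambda)  # P(reject H0 | H1)
}

f2_power(u = 5, v = 3436, f2 = 0.027)  # essentially 1, as in the pwr output
```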

Generalized Linear Models

For GLM we need to install the package lmSupport:

#install.packages("lmSupport")
library(lmSupport)

Sample Size

For calculating the sample size for GLM we can use the same procedure we used for linear models.

Power Calculation

For this example we are going to use one of the models we discussed in the post about GLM, fitted to the dataset beall.webworms (n = 1300):

dat = beall.webworms
pois.mod2 = glm(y ~ block + spray*lead, data=dat, family=c("poisson"))

Once again we would need to compute effect size and degrees of freedom. As before, we can use the function anova to generate the data we need:

> anova(pois.mod2)
Analysis of Deviance Table

Model: poisson, link: log
Response: y
Terms added sequentially (first to last)

           Df Deviance Resid. Df Resid. Dev
NULL                        1299     1955.9
block      12  122.040      1287     1833.8
spray       1  188.707      1286     1645.2
lead        1   42.294      1285     1602.8
spray:lead  1    4.452      1284     1598.4

Let’s say we are interested in the interaction between spray and lead; its degrees of freedom are 1, so this is our u. Next to it we also have the residual degrees of freedom, so v is 1284.
The other thing we need is the effect size, which we can compute with the function modelEffectSizes from the package lmSupport:

> modelEffectSizes(pois.mod2)
glm(formula = y ~ block + spray * lead, family = c("poisson"),
    data = dat)

Coefficients
           SSR df pEta-sqr dR-sqr
block 122.0402 12   0.0709     NA
spray 142.3487  1   0.0818 0.0849
lead   43.7211  1   0.0266 0.0261

Sum of squared errors (SSE): 1598.4
Sum of squared total  (SST): 1675.9

This function calculates the partial eta squared values, and it also works for lm models. As you can see it does not provide the eta squared for the interaction, but just to be on the safe side we can use the lowest value (0.03) of those provided for spray and lead.
Now that we have the observed eta squared we can use the function modelPower:

> modelPower(u=1, v=1284, alpha=0.05, peta2=0.03)
Results from Power Analysis

pEta2 = 0.030
u = 1
v = 1284.0
alpha = 0.050
power = 1.000

This function can also take the option f2, as we’ve seen before for the package pwr. However, since computing the partial eta squared is generally easier, we can use the option peta2 and supply this index directly.
Once again our power is very high.

Linear Mixed Effects Models

For power analysis with mixed-effects models we need to install the following packages:

#install.packages("simr")
library(simr)
library(lme4)  # for fitting the models below

In this example we will be working with models fitted with the package lme4, but what is discussed here should work also with models fitted with nlme.

Sample Size

A priori power analysis for a mixed-effects model is not easy. There are packages that should provide functions for this (e.g. simstudy and longpower), but they seem geared more towards the medical sciences and I found them difficult to use. For this reason I decided that probably the easiest way to test the power of an experiment that requires a mixed-effects model (e.g. involving clustering or repeated measures) is again a dummy dataset and simulation. However, please be advised that I am not 100% sure of the validity of this procedure.

To create the dummy dataset we can use the same procedure we employed above, with expand.grid:

data = expand.grid(subject=1:5, treatment=c("Tr1", "Tr2", "Tr3"))
data$Y = numeric(nrow(data))

In this case we are simulating a simple experiment with 5 subjects, 3 treatments and a within-subject design, like a crossover I suppose. As you can see, Y has not been drawn from a normal distribution; for the time being it is just a list of zeroes. We need to create data for each treatment as follows:

data[data$treatment=="Tr1","Y"] = rnorm(nrow(data[data$treatment=="Tr1",]), mean=20, sd=1)
data[data$treatment=="Tr2","Y"] = rnorm(nrow(data[data$treatment=="Tr2",]), mean=20.5, sd=1)
data[data$treatment=="Tr3","Y"] = rnorm(nrow(data[data$treatment=="Tr3",]), mean=21, sd=1)

In these lines I created three samples from normal distributions whose means differ by half their standard deviation. This should give an effect size (d) of around 0.5, so medium. Now we can create the model:

mod1 = lmer(Y ~ treatment + (1|subject), data=data)
summary(mod1)

and then test its power with the function powerSim from the package simr. This function runs 1000 simulations and provides a measure of the power of the experiment:

> powerSim(mod1, alpha=0.05)
Power for predictor 'treatment', (95% confidence interval): 25.90% (23.21, 28.73)

Test: Likelihood ratio

Based on 1000 simulations, (84 warnings, 0 errors)
alpha = 0.05, nrow = 15

Time elapsed: 0 h 3 m 2 s

nb: result might be an observed power calculation
Warning message:
In observedPowerWarning(sim) :
  This appears to be an "observed power" calculation

From this output we can see that our power is very low, so we probably need to increase the number of subjects and run the simulation again.

Let’s now look at repeated measures. In this case we not only have the effect size to account for in the data, but also the correlation in time between measures.

library(mvtnorm)

sigma <- matrix(c(1,   0.5, 0.5, 0.5,
                  0.5, 1,   0.5, 0.5,
                  0.5, 0.5, 1,   0.5,
                  0.5, 0.5, 0.5, 1), ncol=4, byrow=T)

data = expand.grid(subject=1:4, treatment=c("Tr1", "Tr2", "Tr3"), time=c("t1","t2","t3","t4"))
data$Y = numeric(nrow(data))

T1 = rmvnorm(4, mean=rep(20, 4), sigma=sigma)
T2 = rmvnorm(4, mean=rep(20.5, 4), sigma=sigma)
T3 = rmvnorm(4, mean=rep(21, 4), sigma=sigma)

data[data$subject==1&data$treatment=="Tr1","Y"] = T1[1,]
data[data$subject==2&data$treatment=="Tr1","Y"] = T1[2,]
data[data$subject==3&data$treatment=="Tr1","Y"] = T1[3,]
data[data$subject==4&data$treatment=="Tr1","Y"] = T1[4,]

data[data$subject==1&data$treatment=="Tr2","Y"] = T2[1,]
data[data$subject==2&data$treatment=="Tr2","Y"] = T2[2,]
data[data$subject==3&data$treatment=="Tr2","Y"] = T2[3,]
data[data$subject==4&data$treatment=="Tr2","Y"] = T2[4,]

data[data$subject==1&data$treatment=="Tr3","Y"] = T3[1,]
data[data$subject==2&data$treatment=="Tr3","Y"] = T3[2,]
data[data$subject==3&data$treatment=="Tr3","Y"] = T3[3,]
data[data$subject==4&data$treatment=="Tr3","Y"] = T3[4,]

modRM = lmer(Y ~ treatment + (time|subject), data=data)
summary(modRM)

powerSim(modRM, alpha=0.05)

In this case we need to use the function rmvnorm to draw, from a normal distribution, samples with a certain correlation. For this example I followed the approach suggested by William Huber here: https://stats.stackexchange.com/questions/24257/how-to-simulate-multivariate-outcomes-in-r/24271#24271

Basically, if we assume a correlation equal to 0.5 between time samples (which is what the software G*Power does for repeated measures), we first need to create a symmetrical matrix in sigma. This will allow rmvnorm to produce values from distributions with standard deviation equal to 1 and 0.5 correlation.
A more elegant approach is the one suggested by Ben Amsel on his blog: https://cognitivedatascientist.com/2015/12/14/power-simulation-in-r-the-repeated-measures-anova-5/

sigma = 1 # population standard deviation
rho = 0.5 # correlation between repeated measures
# create k x k matrix populated with sigma
sigma.mat <- rep(sigma, 4)
S <- matrix(sigma.mat, ncol=length(sigma.mat), nrow=length(sigma.mat))
# compute covariance between measures
Sigma <- t(S) * S * rho
# put the variances on the diagonal
diag(Sigma) <- sigma^2

The result is the same but at least here you can specify different values for SD and correlation.
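As a quick check (my own addition, not from the original post), the construction above does reproduce the matrix we wrote out by hand for rmvnorm earlier:

```r
sigma <- 1   # population standard deviation
rho <- 0.5   # correlation between repeated measures
sigma.mat <- rep(sigma, 4)
S <- matrix(sigma.mat, ncol = length(sigma.mat), nrow = length(sigma.mat))
Sigma <- t(S) * S * rho       # covariance between measures
diag(Sigma) <- sigma^2        # variances on the diagonal

# the matrix written out by hand earlier in the post:
sigma_hand <- matrix(c(1,   0.5, 0.5, 0.5,
                       0.5, 1,   0.5, 0.5,
                       0.5, 0.5, 1,   0.5,
                       0.5, 0.5, 0.5, 1), ncol = 4, byrow = TRUE)

all.equal(Sigma, sigma_hand)  # TRUE
```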

The other elements the function needs are the mean values, for which I used the same as before. This should guarantee a difference of around half a standard deviation between treatments.
The rest of the procedure is the same as the one we used before, with no changes.

As I said before, I am not sure if this is the correct way of computing power for linear mixed effects models. It may be completely or partially wrong, and if you know how to do this or you have comments please do not hesitate to write them below.

Power Analysis

As we have seen with the a priori analysis, computing the power of mixed effects models is actually very easy with the function powerSim.

References

PWR package Vignette: https://cran.r-project.org/web/packages/pwr/vignettes/pwr-vignette.html

William E. Berndtson (1991). “A simple, rapid and reliable method for selecting or assessing the number of replicates for animal experiments”
http://scholars.unh.edu/cgi/viewcontent.cgi?article=1312&context=nhaes

NOTE:
This paper is what some of my colleagues, who deal with animal experiments, use to calculate how many subjects to use for their experiments. I’m not sure if this is still a good method of computing sample size, but I just felt I should include it as additional reading.

NOTE 2:
The article above computes the number of replicates based on the estimated difference between means and the coefficient of variation (CV) for each group. Since a lot of researchers talk about CV, it is worth noting that power analysis does not actually seem to care about CV: one of the assumptions of linear models is that the variance is constant across groups, and if that does not hold then theoretically we should not even attempt an ANOVA.

Berkowitz J. “Sample Size Estimation” – http://www.columbia.edu/~mvp19/RMC/M6/M6.doc

This document gives you some rule of thumb to determine the sample size for a number of experiments.


To leave a comment for the author, please follow the link and comment on their blog: R tutorial for Spatial Statistics.

### Inter-country inequality and the World Development Indicators by @ellis2013nz

Fri, 07/21/2017 - 14:00

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

I recently read the high quality book Global Inequality by Branko Milanovic. When reading this sort of thing, I often find I can increase my engagement with a topic by playing around with the data myself. In this day and age, this is much more likely to be possible than a couple of decades back! I remember when I first studied development economics, typing into Lotus 123 the data from tables in the back of a World Development Report. These days to get the same data we just fire up R and get the data from the WDI package, which speaks directly to the World Bank’s World Development Indicators API.

Re-creating three charts

Chapter 4 of Global Inequality is entitled “Global Inequality in This Century and the Next”. It explores a range of issues to do with inter-country inequality and global inequality. The difference between the two is that “global” compares the income (or wealth, although there usually isn’t adequate data to do this) of all global citizens on an equal basis whereas “inter-country” uses the average income in each country and helps explore the idea of “citizenship rent” ie the bonus one gets from having been born in a wealthier place. I won’t try to summarise the discussion in this excellent chapter, I just recommend reading the original.

I set out to reproduce the first three figures in the chapter, with partial success; partial, as it turns out, for interesting reasons.

Global income inequality among countries

First, here is my version of Milanovic’s Figure 4.1. His original figure looks close to an extended version of just the left panel of mine. This is based on purchasing-power parity (PPP) GDP per capita – that is, adjusted not just for exchange rates but for differing price levels in different countries. However, Milanovic used data back to 1960, and for more countries than I have available. Like mine, his data were drawn from the World Development Indicators. But in 2014, the World Development Indicators were updated with new PPP references, and because of concerns about inaccuracy the data were limited to 1990 onwards. It seems that at some point in the past few years the WDI were also restricted to data about currently existing countries (eg excluding data from the USSR or Yugoslavia).

For these reasons, the left pane of my chart below differs from the original in Global Inequality. And to get the picture over a longer period, in the right pane I had to use a constant US dollar GDP series instead of a PPP one. I think this not only overstates inequality (because a dollar buys more in poor countries than rich ones), it hides some of the patterns; but it’s better than nothing.

If this were important, I’d definitely want to do something about the slowly-changing-dimension problem; that is, the people who have been excluded from counting because their countries have not been in continuous existence over the whole period in question. What to do about this could be a blog post (or a book) in itself…

However, the basic story is still supported even though absolute levels differ from my version to the one based on previous data. Reflecting income growth in China and India, there is a steep decline in the population-weighted Gini coefficient from 1990 or 2000 in my chart; in Milanovic’s original there was a gradual decline from 1980, accelerating from 2000. My unweighted Gini coefficients show similar patterns to his, with inequality declining from around 2000. Average incomes in Latin America, Eastern Europe and Africa failed to catch up with those in wealthier countries in the 1980-2000 period, but have been converging since 2000.
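To make the weighted/unweighted distinction concrete, here is a toy Gini computed from the Lorenz curve in base R, with optional population weights. This is my own sketch with made-up incomes and populations; the analysis below uses ineq::Gini and acid::weighted.gini, whose weighted variant may differ in detail.

```r
# Gini coefficient from the Lorenz curve (trapezoid rule),
# with optional population weights.
gini <- function(x, w = rep(1, length(x))) {
  ord <- order(x)
  x <- x[ord]; w <- w[ord]
  p <- cumsum(w) / sum(w)          # cumulative population share
  L <- cumsum(x * w) / sum(x * w)  # cumulative income share (Lorenz curve)
  1 - sum((L + c(0, head(L, -1))) * diff(c(0, p)))
}

incomes <- c(1000, 2000, 10000)  # hypothetical GDP per capita of three countries
pops    <- c(8, 1, 1)            # hypothetical populations, millions

gini(incomes)        # each country counts once: ~0.462
gini(incomes, pops)  # weighted by population:   ~0.44
```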

Difference in the combined (population-weighted) growth rates

The second chart I tried to reproduce was Figure 4.2, which shows differing growth rates between the advanced economies (treated as a single bloc) and the “principal emerging economies (excluding China)”. Milanovic defined this latter group as India, Brazil, Indonesia, South Africa and Vietnam; I’ve chosen a larger group. When the bar is above 0, the emerging economies have grown faster than the advanced economies.

The data for Vietnam available to me only started in the 1980s, and this alone leads to quite a difference in conclusion and interpretation between my chart above and Milanovic’s original. Vietnam was ravaged by war for decades until 1975. It’s a populous country and if its data were included in the chart above it would definitely drag down the “emerging” economies’ pre-1980 combined growth rate, and up their rate since 1980. Without Vietnam in the dataset, I see considerably more growth in the historical “emerging” economies than is present in the figure in the original book. Playing around with different combinations of the “principal” emerging economies didn’t make much difference. I chose the countries I did on the basis of population size and data availability.

As in the original chart, I still have a generally stronger relative performance by the (non-China) emerging economies from 2000 onwards than previously; but the trend over time isn’t as dramatic when Vietnam is excluded, more emerging economies are added in, and the updated data source used.

Level of GDP per capita in 1970 and average subsequent growth

Finally, I had good success in creating a substantively similar chart to Milanovic’s Figure 4.3, shown below:

The descriptive/historical interpretation of my chart is identical to Milanovic’s. For Asian and Western countries, the countries with lower incomes in 1970 have successfully played catch-up, with higher average annual economic growth for poorer countries. For Eastern European, Latin American and African countries, this hasn’t been the case, with no relationship between income in 1970 and subsequent growth.

I have a little more information in my charts than was in the original. Point sizes now show population, which is important for appreciating the significance of India and China in the right hand panel in particular. I also include both weighted and unweighted regression lines; Milanovic discusses the importance of weighting by population in this sort of analysis but I think in the chart he has presented only the unweighted regression line (the substance of the conclusion stays the same).

If you’re wondering who the tiny “Asian” country is that had negative average GDP per capita growth over this 46 year period (!) it is Kiribati (pronounced K EE R ih b ae s – unlike as spoken when it was mentioned once on the Gilmore Girls). Following Milanovic, I included Pacific countries in the “Asian” grouping. Of countries with data available, only Liberia, Democratic Republic of the Congo, Central African Republic, Madagascar, Niger had worse shrinkage in economic production per capita.

Delving

Here’s how I did this. First, here’s the code to load up R functionality for the whole session, set a colour palette I’m going to use a couple of times, explore the names of the various data collections available in the WDI interface, and download the datasets I need for GDP per capita (constant US dollars), GDP per capita (constant PPP US dollars) and population:

library(tidyverse)
library(scales)
library(WDI)
library(ineq)     # for Gini
library(acid)     # for weighted.gini
library(forcats)
library(RColorBrewer)
library(directlabels)
library(ggrepel)
library(testthat)
library(forecastHybrid)
library(stringr)  # for str_wrap

# Colour palette for making sure charts use the same colours:
gini_palette <- brewer.pal(3, "Set2")
names(gini_palette)[c(1,2)] <- c("Unweighted", "Weighted")

#==============download data===========
tmp1 <- WDIsearch("GDP per capita")
tmp2 <- WDIsearch("population")

# "NY.GDP.PCAP.PP.KD" "GDP per capita, PPP (constant 2005 international $)"
# - this one sounds and looks like what Milanovic used but now it only goes to 1990
# In http://databank.worldbank.org/data/download/WDIrevisions.xls,
# it is stated that from May 2014 PPP data are only provided back to 1990 due
# to concerns about the risk of inaccuracy the further back from the benchmark year

# alternatives:
# "NY.GDP.PCAP.KD" "GDP per capita (constant 2000 US$)" ie not adjusted for inter-country price differences
# "GDP.PC.KD"      "GDP per Capita, constant US$, millions" - doesn't seem to exist
# "NY.GDP.PCAP.CD" "GDP per capita (current US$)" ie not corrected for inflation at all
# "NY.GDP.PCAP.KN" "GDP per capita (constant LCU)" ie local currency
# "SP.POP.TOTL"    Population - at least this is straightforward...

gdp_ppp <- WDI(indicator = c("NY.GDP.PCAP.PP.KD", "SP.POP.TOTL"),
               start = 1960, end = 2016, extra = TRUE)

gdp_constant <- WDI(indicator = c("NY.GDP.PCAP.KD", "SP.POP.TOTL"),
                    start = 1960, end = 2016, extra = TRUE)

I wanted to get a feel for the data in its most basic presentation, so I drew myself a time series line chart to see if China’s economy grew as fast in the data as I expected it to:

#============time series chart=================
p1 <- gdp_constant %>%
   rename(gdp = NY.GDP.PCAP.KD) %>%
   filter(country %in% c("United States", "United Kingdom", "France", "China",
                         "India", "Thailand", "South Africa")) %>%
   ggplot(aes(x = year, y = gdp, colour = country)) +
   geom_line() +
   scale_y_log10("GDP per capita, constant US dollars", label = dollar) +
   labs(x = "", caption = "Source: World Development Indicators") +
   ggtitle("GDP per capita over time, selected countries")

direct.label(p1)

Next I had to sort a problem of which “countries” to include. The WDI data includes a number of regional aggregations, and also countries that are missing quite a bit of data.

I had thought that we might find the number of countries with GDP data (not the PPP data, which I knew started in 1990) might increase to 1990 and then be flat, but we see in the chart below that it’s more complicated than that:

Here’s a chart of the countries that have at least some data, but not for every year from 1960 to 2016. As well as things I expected (Russian Federation and Germany don’t enter the dataset until relatively late, due to changing national boundaries), there are a few surprises; for example, New Zealand’s GDP per capita in this particular dataset doesn’t start until 1977:

Here’s the code for sorting through the country and regional issues:

#============country coverage==========
# first we need to work out which "countries" are actually regional aggregations
country_codes <- gdp_ppp %>%
   dplyr::select(iso2c, country) %>%
   distinct()

# everything beginning with X except XK (Kosovo) is a grouping eg "Euro area"
# everything with a number in it is a group eg T5 is South Asia (IDA and IBRD)
# ZF, ZG, ZQ, ZT, ZJ are regions
# OE is the OECD
# EU is the European Union
codes <- country_codes$iso2c

exes <- codes[grepl("^X", codes)]
exes <- exes[exes != "XK"]

# regs has the iso2c codes for regional aggregations:
regs <- c(exes, codes[grepl("[0-9]", codes)], "ZF", "ZG", "ZQ", "ZT", "ZJ", "OE", "EU")

# how many countries have data each year?
gdp_constant %>%
   filter(!iso2c %in% regs) %>%
   group_by(year) %>%
   summarise(countries = sum(!is.na(NY.GDP.PCAP.KD))) %>%
   ggplot(aes(x = year, y = countries)) +
   ggtitle("Variable GDP coverage in the World Development Indicators") +
   geom_line() +
   labs(y = "Number of countries with \nconstant US prices GDP data in the WDI",
        caption = "Source: World Development Indicators")

country_coverage <- gdp_constant %>%
   filter(!iso2c %in% regs) %>%
   group_by(country) %>%
   summarise(prop_complete = sum(!is.na(NY.GDP.PCAP.KD)) / n()) %>%
   arrange(desc(prop_complete))

# some surprises here. For example New Zealand only has data from 1977 onwards
country_coverage %>%
   filter(prop_complete != 0 & prop_complete != 1) %>%
   mutate(country = fct_reorder(country, prop_complete)) %>%
   ggplot(aes(x = prop_complete, y = country)) +
   geom_point() +
   ggtitle("Countries with partial GDP data in the WDI") +
   scale_x_continuous("Percentage of years 1960-2016 with data on GDP in constant US dollars",
                      label = percent) +
   labs(y = "", caption = "Source: World Development Indicators")

Here’s the code that draws the three main charts. The tidyverse makes it pretty easy to compute summary statistics like the Gini or weighted Gini for arbitrary groupings and turn them into nice charts.
#================Variant of figure 4.1 on Gini over time===========
# constant prices version, which we will then rbind onto the PPP version:
tmp <- gdp_constant %>%
   rename(gdp = NY.GDP.PCAP.KD) %>%
   mutate(var = "USD exchange rates")

gdp_ppp %>%
   rename(gdp = NY.GDP.PCAP.PP.KD) %>%
   mutate(var = "Purchasing Power Parity") %>%
   rbind(tmp) %>%
   filter(country %in% filter(country_coverage, prop_complete == 1)$country) %>%
   # need to knock out the NA values in the PPP data:
   filter(!is.na(gdp)) %>%
   group_by(year, var) %>%
   summarise(Unweighted = Gini(gdp),
             Weighted = weighted.gini(gdp, SP.POP.TOTL)$bcwGini) %>%
   gather(variable, value, -year, -var) %>%
   mutate(variable = fct_reorder(variable, -value, fun = last)) %>%
   ggplot(aes(x = year, y = value, colour = variable)) +
   facet_wrap(~var) +
   geom_line() +
   theme(legend.position = "right") +
   labs(colour = "", y = "Gini or weighted Gini",
        caption = "Source: World Development Indicators") +
   ggtitle("Inter-country inequality over time",
           "GDP per capita, two different methods of comparing across countries
Only includes countries with full coverage of data 1960 to 2016
Inspired by Figure 4.1 from Milanovic, Global Inequality") +
   scale_x_continuous("", limits = c(1960, 2020)) +
   scale_colour_manual(values = gini_palette)

#================Figure 4.2====================
# excluding Vietnam because they don't have any data now in WDI until the 1980s
# adding in some other countries by population - not clear why excluded from the original
emerging <- c("India", "Brazil", "Indonesia", "South Africa", "Pakistan", "Bangladesh",
              "Nigeria", "Mexico", "Thailand", "Philippines", "Egypt, Arab Rep.",
              "Turkey", "Korea, Rep.", "Iran, Islamic Rep.", "Myanmar", "Colombia",
              "Congo, Dem. Rep.")

expect_equal(sum(!emerging %in% gdp_constant$country), 0)

gdp_constant %>%
   rename(gdp = NY.GDP.PCAP.KD, pop = SP.POP.TOTL) %>%
   filter(country %in% c("European Union", "United States", "Japan", emerging)) %>%
   mutate(type = ifelse(country %in% emerging, "emerging", "advanced")) %>%
   group_by(year, type) %>%
   # na.rm needs to be FALSE so bars don't draw when data not present
   summarise(gdp = sum(gdp * pop, na.rm = FALSE)) %>%
   group_by(type) %>%
   mutate(diffgdp = c(NA, diff(gdp)),
          growth = diffgdp / (gdp - diffgdp)) %>%
   dplyr::select(year, type, growth) %>%
   spread(type, growth) %>%
   mutate(difference = emerging - advanced) %>%
   ggplot(aes(x = year, weight = difference)) +
   geom_bar() +
   ggtitle("GDP growth gap between 'emerging' and 'advanced' countries over time",
           "Excluding China for illustrative reasons, and Vietnam for data reasons.
Inspired by Figure 4.2 from Milanovic, Global Inequality") +
   scale_y_continuous("Difference in population-weighted growth rate between emerging and advanced economies, in percentage points",
                      label = percent) +
   labs(x = paste(str_wrap(paste0("Emerging countries defined as ",
                                  paste(emerging, collapse = ", ")), 100),
                  "\n\nAdvanced countries defined as EU, USA and Japan"),
        caption = "Source: World Development Indicators")

# This is much more positive in the late 1960s and 1970s than in the original,
# and perhaps this is all because he had Vietnam (during the war...) and I don't
# have data.

#====================Figure 4.3 scatter plots========
end_year <- 2016

gdp_summary <- gdp_constant %>%
   filter(!iso2c %in% regs & year >= 1970) %>%
   rename(gdp = NY.GDP.PCAP.KD, pop = SP.POP.TOTL) %>%
   arrange(year) %>%
   mutate(country_type = ifelse(income == "High income: OECD", "Western", "Other"),
          country_type = ifelse(region == "East Asia & Pacific (all income levels)" |
                                   country %in% c("India", "Bangladesh", "Nepal", "Pakistan"),
                                "Asian", country_type)) %>%
   group_by(iso2c, country, country_type) %>%
   summarise(gdp1970 = gdp[year == 1970],
             growth = (gdp[year == end_year] / gdp1970) ^ (1 / (end_year - 1970)) - 1,
             pop1970 = pop[year == 1970]) %>%
   filter(!is.na(growth)) %>%
   mutate(plot_label = ifelse(country_type == "Other",
                              "Excluding Asian and Western countries",
                              "Asian and Western countries"))

gdp_summary$plot_label <- relevel(factor(gdp_summary$plot_label), "Excluding Asian and Western countries")
gdp_summary$country_type <- fct_relevel(factor(gdp_summary$country_type), c("Asian", "Western"))

gdp_summary %>%
   ggplot(aes(x = gdp1970, y = growth)) +
   facet_wrap(~plot_label) +
   geom_smooth(method = "lm", se = FALSE, colour = "grey40", linetype = 2) +
   geom_smooth(method = "lm", se = FALSE, colour = "grey40", aes(weight = pop1970)) +
   geom_point(aes(colour = country_type, size = pop1970 / 10^6), alpha = 0.5) +
   geom_point(aes(size = pop1970 / 10^6), colour = "black", shape = 1) +
   # next line would give labels for points but horribly cluttered plot
   # geom_text_repel(aes(label = iso2c)) +
   scale_x_log10("\nGDP per capita in 1970\n\nDotted line is unweighted regression; solid line is population-weighted regression.\n",
                 label = dollar, breaks = c(1000, 5000, 20000)) +
   scale_y_continuous(paste("Average growth between 1970 and", end_year), label = percent) +
   scale_size_area("Population\nin 1970 (m)", max_size = 10) +
   labs(colour = "Region", caption = "Source: World Development Indicators") +
   ggtitle("GDP in 1970 and subsequent growth",
           "Constant prices, US dollars\nInspired by Figure 4.3 from Milanovic, Global Inequality") +
   theme(legend.position = "right")

# Note the pattern is the same as in Milanovic's graphic, even though
# the growth rates seem different

Bonus – forecasting inter-country inequality

An excursus in Milanovic’s book refers to some work by others forecasting trends in global inequality if current trends continue. I’m only working with inter-country data today, nothing that gets at individual-level global inequality, but I was sufficiently interested in the idea to knock together my own very crude forecasts. I basically projected current country-level trends in GDP per capita and population growth with straightforward time series methods (a hybrid of Hyndman’s auto.arima and ets methods, as conveniently pulled together in the forecastHybrid package by David Shaub with minor help from myself). The result shows the expected continued decline in weighted inequality as China and India continue to “catch up”:

Here’s the code for that forecasting exercise:

#================forecasts of global gini=============
# this is only a taster for what is described in Excursus 4.1 in Milanovic's book
# because I only deal with inter-national inequality, and use very crude forecasts!
# but it comes up with a modestly similar result. I see a drop in inter-national inequality
# of about 0.07 whereas they find a drop in global inequality of 0.04. These feel quite
# potentially consistent (because most forecasts for intranational inequality are increases).

# could do the forecasts of Gini by forecasting GDP per capita, and using the population
# forecasts from https://esa.un.org/unpd/wpp/Download/Standard/Population/
# but.... effort!...
# so let's just do our own forecasts

#' Forecast gdp and pop for one country in gdp_constant
#' and combine it with the historical data
#' context-specific function, only works in this session
forecast_gdp_pop <- function(the_country, h = 20){
  # assumes presence in environment of a data.frame called gdp_constant
  country_vars <- gdp_constant %>%
    rename(gdp = NY.GDP.PCAP.KD, pop = SP.POP.TOTL) %>%
    arrange(year) %>%
    filter(country == the_country) %>%
    dplyr::select(gdp, pop) %>%
    ts(start = 1960)

  # a bit of fiddliness is needed because some countries are missing early GDP per person data.
  # maybe they should be knocked out altogether of course... but let's decide that
  # later and explicitly; this function should be able to handle it either way
  years_orig <- time(country_vars)
  rows_include <- which(!is.na(country_vars[ , "gdp"]))

  # some "countries" have virtually no observations, so don't even try to forecast them...
  if(length(rows_include) > 5){
    country_vars <- country_vars[rows_include, ]
    mod_gdp <- hybridModel(country_vars[ , "gdp"], lambda = 0, models = "ae", verbose = FALSE)
    mod_pop <- hybridModel(country_vars[ , "pop"], lambda = 0, models = "ae", verbose = FALSE)
    fc_gdp <- forecast(mod_gdp, h = h)
    fc_pop <- forecast(mod_pop, h = h)
    years <- years_orig[rows_include]
    years <- c(years, max(years + 1):(max(years) + h))

    tmp <- data.frame(
      country = the_country,
      gdp_pp = c(fc_gdp$x, fc_gdp$mean),
      pop = c(fc_pop$x, fc_pop$mean),
      year = years
    )
  } else {
    tmp <- NULL
  }
  return(tmp)
}

# test cases. Is designed to work even with countries with some NA values,
# even though when doing it for real I decided to drop them.
forecast_gdp_pop("Australia", h = 100) %>%
  gather(variable, value, -year, -country) %>%
  ggplot(aes(x = year, y = value)) +
  facet_wrap(~variable, scales = "free_y", ncol = 1) +
  geom_line() +
  ggtitle("Crude forecast of GDP per person and population for Australia")

forecast_gdp_pop("China", h = 5) %>%
  gather(variable, value, -year, -country) %>%
  ggplot(aes(x = year, y = value)) +
  facet_wrap(~variable, scales = "free_y", ncol = 1) +
  geom_line() +
  ggtitle("Crude forecast of GDP per person and population for China")

forecast_gdp_pop("Vietnam") %>%
  gather(variable, value, -year, -country) %>%
  ggplot(aes(x = year, y = value)) +
  facet_wrap(~variable, scales = "free_y", ncol = 1) +
  geom_line()

# vector of all the countries with complete data:
all_countries <- filter(country_coverage, prop_complete == 1)$country

# do the actual forecasts:
all_forecasts <- lapply(all_countries, forecast_gdp_pop)
all_forecasts_df <- do.call("rbind", all_forecasts)

# draw graphic:
all_forecasts_df %>%
  group_by(year) %>%
  summarise(Unweighted = Gini(gdp_pp),
            Weighted = weighted.gini(gdp_pp, pop)$bcwGini) %>%
  gather(variable, value, -year) %>%
  mutate(variable = fct_reorder(variable, -value, fun = last)) %>%
  mutate(type = ifelse(year <= 2016, "Actual", "Forecast")) %>%
  ggplot(aes(x = year, y = value, colour = variable, linetype = type)) +
  geom_line() +
  theme(legend.position = "right") +
  labs(colour = "", linetype = "", x = "", y = "Gini or weighted Gini",
       caption = "Source: World Development Indicators") +
  ggtitle("Inter-country inequality over time, including crude forecasts",
          "GDP per capita, two different methods of comparing across countries\nOnly includes countries with complete data 1960 to 2013") +
  scale_colour_manual(values = gini_palette)

To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Chart golf: the “demographic tsunami”

Fri, 07/21/2017 - 07:52

(This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers)

“‘Demographic tsunami’ will keep Sydney, Melbourne property prices high” screams the headline.

While the census showed Australia overall is aging, there’s been a noticeable lift in the number of people aged between 25 to 32.
As the accompanying graph shows…

Whoa, that is one ugly chart. First thought: let’s not be too hard on Fairfax Media, they’ve sacked most of their real journalists and they took the chart from someone else. Second thought: if you want to visualise change over time, time as an axis rather than a coloured bar is generally a good idea.

Can we do better?

As usual, you can find this project at Github, with an accompanying published document at RPubs. I rarely copy/paste/format code here any more so if you want the details, take a look at the Rmd file. Some of the charts in this post are thumbnails, so click on them for the full-sized version.

First, grab the raw data: in this case, a spreadsheet (Table 59) from the Australian Bureau of Statistics (ABS). It contains counts of males, females and persons from ages 0 – 100+ in one year increments, for the years 1971 – 2016.

For some reason, government departments like to make their data as wide as possible. In this case, 251 columns where 1 = the year, 2 – 102 are male ages, 103-202 female ages and 203 – 251 are persons (male + female) ages 0 – 47. Persons ages 48 – 100+ are in a second sheet in the same file. No, I don’t know why either.

Fortunately, readxl takes care of all this, so all we need to do is give the columns some sensible names followed by the tidyr treatment.

## # A tibble: 11,500 x 5
## Date gender age value Year
##
## 1 1971-06-01 M 0 133347 1971
## 2 1972-06-01 M 0 136987 1972
## 3 1973-06-01 M 0 129492 1973
## 4 1974-06-01 M 0 124803 1974
## 5 1975-06-01 M 0 120678 1975
## 6 1976-06-01 M 0 115457 1976
## 7 1977-06-01 M 0 115298 1977
## 8 1978-06-01 M 0 115249 1978
## 9 1979-06-01 M 0 113443 1979
## 10 1980-06-01 M 0 114488 1980
## # ... with 11,490 more rows
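The reshaping step itself isn’t shown here (it’s in the linked Rmd), but the idea can be sketched on a tiny fake version of the wide layout. The `M_0`-style column names and the separate() step are my assumptions, not the post’s actual code:

```r
library(dplyr)
library(tidyr)

# tiny stand-in for the 251-column ABS sheet: one year column, then
# one column per gender/age combination
pop_raw <- data.frame(
  Year = c(1971, 1972),
  M_0  = c(133347, 136987), M_1 = c(130000, 131000),
  F_0  = c(126000, 129000), F_1 = c(124000, 125000)
)

pop_long <- pop_raw %>%
  gather(key, value, -Year) %>%                           # wide -> long
  separate(key, into = c("gender", "age"), sep = "_") %>% # split "M_0"
  mutate(age = as.integer(age))

head(pop_long)
```

The real data gets the same treatment, just with 250 value columns instead of four.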

Ages 21-42 from 2005 – 2016

The original chart from the SMH focused on ages 21 – 42 and years 2005 – 2016, so we’ll do the same. I thought it would be interesting to animate the changes in population by year. My first attempt posted to Twitter was incorrect, in that the numbers were summed year on year, so here’s a better version. The cumulative colouring gets a bit weird when numbers decrease, but I think it works to some degree.
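For reference, a year-by-year animation can be sketched roughly as below with simulated counts. The post’s actual animation code isn’t shown, and gganimate’s transition_time() is my assumption about the tooling, not necessarily what the author used:

```r
library(ggplot2)
library(gganimate)  # assumed tooling

# simulated counts for ages 21-42, years 2005-2016
pop <- expand.grid(Year = 2005:2016, age = 21:42)
pop$n <- sample(250000:350000, nrow(pop), replace = TRUE)

p <- ggplot(pop, aes(x = age, y = n)) +
  geom_col(fill = "darkred") +
  transition_time(Year) +
  labs(title = "Year: {frame_time}", x = "Age", y = "Persons")

# animate(p)  # renders the frames
```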

Indeed, there does seem to be a recent surge in the “25 – 32 bracket” if that’s what we’re calling it.

Ages 25-32 as a proportion of 21-42
Things get less tsunami-like when we try to visualize age brackets as a proportion of all ages. The 25-32 band grows a little but – tsunami?

Ages 25-32 as a proportion of total

When you put things out on Twitter, be sure that chart nerds will descend to join in the fun.

@PeteWargent @MichaelPascoe01 If we're mainly interested in the "prime house-buying population" (25-32), I'd prob visualise it like this pic.twitter.com/K4qqoL60tQ

— Matt Cowgill (@MattCowgill) July 3, 2017

Matt rightly asks: what’s the proportion of 25-32 year-olds anyway? We can reproduce his line charts like so. Now we see that 25-32 year-olds as a percentage of total population have increased recently after an all-time (since 1971) low and in fact, the rate of increase seems to have slowed.
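The proportion calculation itself is straightforward; a sketch with simulated counts (not the real ABS numbers):

```r
library(dplyr)

# simulated counts per year and single year of age
pop_long <- expand.grid(Year = 1971:2016, age = 0:100)
pop_long$value <- sample(50000:150000, nrow(pop_long), replace = TRUE)

# 25-32 year-olds as a percentage of the total population, by year
prop_2532 <- pop_long %>%
  group_by(Year) %>%
  summarise(pct = 100 * sum(value[age %in% 25:32]) / sum(value))
```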

In conclusion then:

• Years as coloured bars: not great
• Excel: no
• “Tsunami”: hardly

Bonus section: population pyramids
I’ve always liked population pyramids, ever since I first learned about them in high school geography class. Here’s my attempt to animate one. The trick is to subset the data by gender, then create two geoms and set the values for one subset to be negative (but not the labels). More commonly, ages are binned and proportions rather than counts may be used, but I did neither in this case.
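That negative-values trick can be sketched like this (simulated counts; the author’s own code is in the linked Rmd): negate one gender’s counts for plotting, but keep the axis labels positive.

```r
library(ggplot2)

df <- data.frame(
  age    = rep(0:100, 2),
  gender = rep(c("M", "F"), each = 101),
  n      = sample(1e5:2e5, 202, replace = TRUE)
)
# males plotted as negative so the two genders mirror each other
df$plotted <- ifelse(df$gender == "M", -df$n, df$n)

p <- ggplot(df, aes(x = age, y = plotted, fill = gender)) +
  geom_col(width = 1) +
  coord_flip() +
  scale_y_continuous(labels = abs)  # axis labels stay positive
```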

I find it either mesmerising or a bit “too much”, depending on my mood. How about you?

Filed under: R, statistics Tagged: news, smh, sydney, visualisation


To leave a comment for the author, please follow the link and comment on their blog: R – What You're Doing Is Rather Desperate.

### Surprising result when exploring Rcpp gallery

Fri, 07/21/2017 - 06:52

(This article was first published on R – Stat Bandit, and kindly contributed to R-bloggers)

I’m starting to incorporate more Rcpp into my R work, and so decided to spend some time exploring the Rcpp Gallery. One example by John Merrill caught my eye. He provides a C++ solution to transforming a list of lists into a data frame, and shows impressive speed savings compared to as.data.frame.

This got me thinking about how I do this operation currently. I tend to rely on the do.call method. To mimic the example in the Rcpp example:

a <- replicate(250, 1:100, simplify = FALSE)
b <- do.call(cbind, a)

For fairness, I should get a data frame rather than a matrix, so for my comparisons, I do convert b into a data frame. I follow the original coding in the example, adding my method above into the mix. Comparing times:

res <- benchmark(as.data.frame(a),
                 CheapDataFrameBuilder(a),
                 as.data.frame(do.call(cbind, a)),
                 order = "relative", replications = 500)
res[, 1:4]

The results were quite interesting to me

                              test replications elapsed relative
3 as.data.frame(do.call(cbind, a))          500    0.36    1.000
2         CheapDataFrameBuilder(a)          500    0.52    1.444
1                 as.data.frame(a)          500    7.28   20.222

I think part of what’s happening here is that as.data.frame.list expends overhead checking for different aspects of making a legit data frame, including naming conventions. The comparison to CheapDataFrameBuilder should really be with my barebones strategy. Having said that, the example does provide great value in showing what can be done using Rcpp.
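To make that point concrete, here is a barebones builder of my own (an illustration, not Merrill’s C++ code): when the list is already well-formed, a data frame is just a list with names, row names and a class, and skipping all validation is what buys the speed.

```r
a <- replicate(10, 1:100, simplify = FALSE)

# promote an already-well-formed list to a data frame without any checks
cheap_df <- function(x) {
  names(x) <- paste0("V", seq_along(x))
  attr(x, "row.names") <- .set_row_names(length(x[[1]]))
  class(x) <- "data.frame"
  x
}

b <- cheap_df(a)
dim(b)
## [1] 100  10
```

Of course, this is only safe when every element really is the same length and type, which is exactly the checking that as.data.frame.list does for you.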


To leave a comment for the author, please follow the link and comment on their blog: R – Stat Bandit.

### Visualizing Portfolio Volatility

Fri, 07/21/2017 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

This is the third post in our series on portfolio volatility, variance and standard deviation. If you want to start at the beginning with calculating portfolio volatility, have a look at the first post here – Intro to Volatility. The second post on calculating rolling standard deviations is here: Intro to Rolling Volatility.

Today we will visualize rolling standard deviations with highcharter using two objects from that second post.

The charts, which are the fun payoff after all of the equations and functions that we have ground out in the previous posts, should highlight any unusual occurrences or volatility spikes/dips that we might want to investigate.

First, load the .RDat file saved from our previous Notebook (or you can run the scripts from the previous posts).

We now have 2 objects in our Global Environment:

• spy_rolling_sd – an xts object of rolling SPY standard deviations
• roll_portfolio_result – an xts object of rolling portfolio standard deviations

Because both of those are xts objects, we can pass them straight to highcharter with the hc_add_series() function and set a name and a color with the name and color arguments. Nothing too complicated here – we did the hard work in our previous Notebooks.

highchart(type = "stock") %>%
  hc_title(text = "SPY v. Portfolio Rolling Volatility") %>%
  hc_add_series(spy_rolling_sd, name = "SPY Volatility", color = "blue") %>%
  hc_add_series(roll_portfolio_result, name = "Port Volatility", color = "green") %>%
  hc_navigator(enabled = FALSE) %>%
  hc_scrollbar(enabled = FALSE)

It is interesting to note that from late April 2016 to late October 2016, SPY’s rolling standard deviation dipped below that of the diversified portfolio. The portfolio volatility was plunging at the same time, but SPY’s was falling faster. What happened over the 6 preceding months to explain this?

Maybe we should add a flag to highlight this event. We can also add flags for the maximum SPY volatility, maximum and minimum portfolio rolling volatility and might as well include a line for the mean rolling volatility of SPY to practice adding horizontal lines.

We will use two methods for adding flags. First, we’ll hard code the date for the flag as “2016-04-29” using the date when rolling SPY volatility dipped below the portfolio. Second, we’ll set a flag with the date
as.Date(index(roll_portfolio_result[which.max(roll_portfolio_result)]), format = "%Y-%m-%d") which looks like a convoluted mess but is adding a date for whenever the rolling portfolio standard deviation hit its maximum.

This is a bit more ‘dynamic’ because we can change our assets but keep this code the same and it will find the date with the maximum rolling standard deviation. Our first flag is not dynamic in the sense that it is specific to the comparison between SPY and this exact portfolio.

spy_important_date <- as.Date(c("2016-04-29"), format = "%Y-%m-%d")

port_max_date <- as.Date(index(roll_portfolio_result[which.max(roll_portfolio_result)]),
                         format = "%Y-%m-%d")
port_min_date <- as.Date(index(roll_portfolio_result[which.min(roll_portfolio_result)]),
                         format = "%Y-%m-%d")
spy_max_date <- as.Date(index(spy_rolling_sd[which.max(spy_rolling_sd)]),
                        format = "%Y-%m-%d")

highchart(type = "stock") %>%
  hc_title(text = "SPY v. Portfolio Rolling Volatility") %>%
  hc_add_series(spy_rolling_sd, name = "SPY Volatility", color = "blue", id = "SPY") %>%
  hc_add_series(roll_portfolio_result, name = "Portf Volatility", color = "green", id = "Port") %>%
  hc_add_series_flags(spy_important_date,
                      title = c("SPY Vol Dips"),
                      text = c("SPY rolling sd dips below portfolio."),
                      id = "SPY") %>%
  hc_add_series_flags(spy_max_date,
                      title = c("SPY Max "),
                      text = c("SPY max rolling volatility."),
                      id = "SPY") %>%
  hc_add_series_flags(port_max_date,
                      title = c("Portf Max"),
                      text = c("Portfolio maximum rolling volatility."),
                      id = "Port") %>%
  hc_add_series_flags(port_min_date,
                      title = c("Portf Min"),
                      text = c("Portfolio min rolling volatility."),
                      id = "Port") %>%
  hc_yAxis(title = list(text = "Mean SPY rolling Vol"),
           showFirstLabel = FALSE,
           showLastLabel = FALSE,
           plotLines = list(
             list(value = mean(spy_rolling_sd), color = "#2b908f", width = 2))) %>%
  hc_navigator(enabled = FALSE) %>%
  hc_scrollbar(enabled = FALSE)

Hover on the flags and you can see the text we added for explanation.

It’s remarkable how rolling volatility has absolutely plunged since early-to-mid 2016. Since August of 2016, both the portfolio and SPY rolling standard deviations have been well below the SPY mean.

Thanks for sticking with this three-part introduction to volatility. Next time, we’ll port our work to Shiny and play with different assets and allocations.



To leave a comment for the author, please follow the link and comment on their blog: R Views.

### How to make maps with Census data in R

Fri, 07/21/2017 - 02:00

(This article was first published on randomjohn.github.io, and kindly contributed to R-bloggers)

US Census Data

The US Census collects a number of demographic measures and publishes aggregate data through its website. There are several ways to use Census data in R, from the Census API to the USCensus2010 package. If you are interested in geopolitical data in the US, I recommend exploring both these options – the Census API requires a key for each person who uses it, and the package requires downloading a very large dataset. The setups for both require some effort, but once that effort is done you don’t have to do it again.

The acs package in R allows you to access the Census API easily. I highly recommend checking it out, and that’s the method we will use here. Note that I’ve already defined the variable api_key – if you are trying to run this code you will need to first run something like api_key <- "your-key-here" before running the rest of this code.

library(acs)
api.key.install(key = api_key)
# now you are ready to run the rest of the acs code

For purposes here, we will use the toy example of plotting median household income by county for every county in South Carolina. First, we obtain the Census data. The first command gives us the table and variable names of what we want. I then use that table number in the acs.fetch command to get the variable I want.
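Those two commands aren’t reproduced here, so below is a sketch of the usual acs pattern that produces the home_median_price object used in the merge later. The endyear/span values and the county wildcard are my assumptions (and this needs a working API key plus network access to actually run):

```r
library(acs)

# look up the table of interest; B19126 is the table number
# referenced later in the merge code
acs.lookup(endyear = 2015, table.number = "B19126")

# define the geography: all counties in South Carolina
sc_counties <- geo.make(state = "SC", county = "*")

# fetch the estimates; the @estimate slot of the result
# is what gets merged with the map data below
home_median_price <- acs.fetch(endyear = 2015, span = 5,
                               geography = sc_counties,
                               table.number = "B19126")
```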

If you have the maps and ggplot2 packages, you already have the data you need to plot. We use the map_data function from ggplot2 to pull in county shape data for South Carolina. (A previous attempt at this blogpost had used the ggmap package, but there is an incompatibility between that and the latest ggplot2 package at the time of this writing.)

library(ggplot2)
## Want to understand how all the pieces fit together? Buy the
## ggplot2 book: http://ggplot2.org/book/

sc_map <- map_data("county", region = "south.carolina")

ggplot() +
  geom_polygon(aes(x = long, y = lat, group = group),
               data = sc_map, colour = "white", fill = "black") +
  theme_minimal()

Merging the demographic and map data

Now we have the demographic data and the map, but merging the two will take a little effort. The reason is that the map data gives a lower case representation of the county and calls it a “subregion”, while the Census data returns the county as “xxxx County, South Carolina”. I use the dplyr and stringr packages (for str_replace) to make short work of this merge.

library(dplyr)
library(stringr)

merged <- as.data.frame(home_median_price@estimate) %>%
  mutate(county_full = rownames(.),
         county = str_replace(county_full, "(.+) County.*", "\\1") %>% tolower) %>%
  select(county, B19126_001) %>%
  rename(med_income = B19126_001) %>%
  right_join(sc_map, by = c("county" = "subregion"))

knitr::kable(head(merged, 10))

county     med_income      long      lat group order         region
abbeville       44918 -82.24809 34.41758     1     1 south carolina
abbeville       44918 -82.31685 34.35455     1     2 south carolina
abbeville       44918 -82.31111 34.33163     1     3 south carolina
abbeville       44918 -82.31111 34.29152     1     4 south carolina
abbeville       44918 -82.28247 34.26860     1     5 south carolina
abbeville       44918 -82.25955 34.25142     1     6 south carolina
abbeville       44918 -82.24809 34.21131     1     7 south carolina
abbeville       44918 -82.23663 34.18266     1     8 south carolina
abbeville       44918 -82.24236 34.15401     1     9 south carolina
abbeville       44918 -82.27674 34.10818     1    10 south carolina

It’s now a simple matter to plot this merged dataset. In fact, we only have to tweak a few things from the first time we plotted the map data.

ggplot() + geom_polygon(aes(x=long,y=lat,group=group,fill=med_income),data=merged) + theme_minimal()

Discussion

It’s pretty easy to plot U.S. Census data on a map. The real power of Census data comes not just from plotting it, but combining with other geographically-based data (such as crime). The acs package in R makes it easy to obtain Census data, which can then be merged with other data using packages such as dplyr and stringr and then plotted with ggplot2. Hopefully the authors of the ggmap and ggplot2 packages can work out their incompatibilities so that the above maps can be created using the Google API map or open street maps.

It should be noted that while I obtained county-level information, aggregate data can be obtained at Census block and tract levels as well, if you are looking to do some sort of localized analysis.


To leave a comment for the author, please follow the link and comment on their blog: randomjohn.github.io.

### Package bigstatsr: Statistics with matrices on disk (useR 2017)

Fri, 07/21/2017 - 02:00

In this post, I will talk about my package bigstatsr, which I’ve just presented in a lightning talk of 5 minutes at useR!2017. You can listen to me in action there. I should have chosen a longer talk to explain more about this package, maybe next time. I will use this post to give you a more detailed version of the talk I gave in Brussels.

Motivation behind bigstatsr

I’m a PhD student in predictive human genetics. I’m basically trying to predict someone’s risk of disease based on their DNA mutations. These DNA mutations come in the form of large matrices; I’m currently working with a matrix of 15K rows and 300K columns. This matrix would take approximately 32GB of RAM if stored as a standard R matrix.
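For reference, the back-of-envelope arithmetic behind that figure is just 8 bytes per double-precision element:

```r
# rows * columns * 8 bytes per double, in GiB
15e3 * 300e3 * 8 / 1024^3
## [1] 33.52761
```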

When I began studying this dataset, I had only 8GB of RAM on my computer. I now have 64GB of RAM, but copying this matrix just once would make my computer begin swapping and therefore slow down. I found a convenient solution in the big.matrix object provided by the R package bigmemory (Kane, Emerson, and Weston 2013). With this solution, you can access a matrix that is stored on disk almost as if it were a standard R matrix in memory.

Yet, for these special matrices, some useful statistical functions were missing or not fast enough, so I implemented them. It was a good exercise in programming optimized and parallelized algorithms. I’m aware that there are other packages that come with bigmemory, such as biganalytics and bigalgebra, that already implement some of the features I will talk about in this post; I will discuss why I don’t use these packages. However, I do use the work of other packages such as biglasso and RSpectra.

Introduction to bigmemory

# loading package bigstatsr (and bigmemory)
library(bigstatsr)

# initializing some matrix on disk: wrapper to bigmemory::big.matrix()
mat <- FBM(backingroot = "matrix-on-disk", descriptor = FALSE)(5e3, 10e3)
## Creating directory "backingfiles" which didn't exist..

dim(mat)
## [1]  5000 10000

mat[1:5, 1:5]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
## [4,]    0    0    0    0    0
## [5,]    0    0    0    0    0

mat[1, 1] <- 2
mat[1:5, 1:5]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2    0    0    0    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
## [4,]    0    0    0    0    0
## [5,]    0    0    0    0    0

mat[2:4] <- 3
mat[1:5, 1:5]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2    0    0    0    0
## [2,]    3    0    0    0    0
## [3,]    3    0    0    0    0
## [4,]    3    0    0    0    0
## [5,]    0    0    0    0    0

mat[, 2:3] <- rnorm(2 * nrow(mat))
mat[1:5, 1:5]
##      [,1]       [,2]       [,3] [,4] [,5]
## [1,]    2 -0.1611891  3.1218981    0    0
## [2,]    3 -0.6026680 -1.3935111    0    0
## [3,]    3  0.4743703 -0.6566549    0    0
## [4,]    3  0.1514252  0.8247594    0    0
## [5,]    0 -1.5167186  0.7329260    0    0

What we can see is that big matrices (big.matrix objects) can be accessed (read/write) almost as if they were standard R matrices, but you have to be cautious. For example, doing mat[1, ] isn’t recommended. Indeed, big matrices, like standard R matrices, are stored by column: the data is in fact one big vector with columns stored one after the other, contiguously. So, accessing the first row would access elements that are not stored contiguously in memory, which is slow. One should always access columns rather than rows.
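The same column-major effect is easy to see on an ordinary R matrix; a quick illustration of my own (timings vary by machine, so no numbers are claimed here):

```r
m <- matrix(0, 5e3, 5e3)

# reading a column touches contiguous memory...
col_time <- system.time(for (j in 1:500) x <- m[, j])["elapsed"]
# ...reading a row jumps 5000 elements between each value
row_time <- system.time(for (i in 1:500) x <- m[i, ])["elapsed"]

c(column = col_time, row = row_time)  # row access is typically slower
```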

Apply an R function to a big matrix

An easy strategy to apply an R function to a big matrix would be the split-apply-combine strategy (Wickham 2011). For example, you could access only a block of columns at a time, apply a (vectorized) function to this block, and then combine the results of all blocks. This is implemented in function big_apply().

# Compute the sums of the first 1000 columns
colsums_1 <- colSums(mat[, 1:1000])

# Compute the sums of the second block of 1000 columns
colsums_2 <- colSums(mat[, 1001:2000])

# Combine the results
colsums_1_2 <- c(colsums_1, colsums_2)

# Do this automatically with big_apply()
colsums_all <- big_apply(mat, a.FUN = function(X, ind) colSums(X[, ind]),
                         a.combine = 'c')

When the split-apply-combine strategy can be used for a given function, you could use big_apply() to get the results, while accessing only small blocks of columns (or rows) at a time. Package biganalytics, by the creators of bigmemory, provides a way to apply an R function to the margins of a big.matrix. Yet, for example, if the big.matrix has a lot of columns, it would be much slower to loop through all columns than to apply a vectorized function to blocks of columns. You can find more examples of applications for big_apply() there.

colsums_all2 <- biganalytics::apply(mat, 2, sum)
all.equal(colsums_all2, colsums_all)
## [1] TRUE

Use Rcpp with a big matrix

Using Rcpp with a big.matrix is super easy. Let’s use the previous example, i.e. the computation of the colsums of a big.matrix. We will do it in 3 different ways.

1. Using the bigmemory way

// [[Rcpp::depends(bigmemory, BH)]]
#include <Rcpp.h>
#include <bigmemory/MatrixAccessor.hpp>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector bigcolsums(const S4& BM) {

  XPtr<BigMatrix> xpMat = BM.slot("address");
  MatrixAccessor<double> macc(*xpMat);

  int n = macc.nrow();
  int m = macc.ncol();

  NumericVector res(m); // vector of m zeros
  int i, j;

  for (j = 0; j < m; j++)
    for (i = 0; i < n; i++)
      res[j] += macc[j][i];

  return res;
}

colsums_all3 <- bigcolsums(mat)
all.equal(colsums_all3, colsums_all)
## [1] TRUE

2. Using the bigstatsr way

// [[Rcpp::depends(bigstatsr, bigmemory, BH)]]
#include <bigstatsr/SubMatAcc.h>

// [[Rcpp::export]]
NumericVector bigcolsums2(const S4& BM,
                          const IntegerVector& rowInd,
                          const IntegerVector& colInd) {

  XPtr<BigMatrix> xpMat = BM.slot("address");
  // C++ indices begin at 0
  IntegerVector rows = rowInd - 1;
  IntegerVector cols = colInd - 1;
  // An accessor of only part of the big.matrix
  SubMatAcc<double> macc(*xpMat, rows, cols);

  int n = macc.nrow();
  int m = macc.ncol();

  NumericVector res(m); // vector of m zeros
  int i, j;

  for (j = 0; j < m; j++)
    for (i = 0; i < n; i++)
      res[j] += macc(i, j);

  return res;
}

colsums_all4 <- bigcolsums2(mat, rows_along(mat), cols_along(mat))
all.equal(colsums_all4, colsums_all)
## [1] TRUE

In bigstatsr, most of the functions have parameters for subsetting rows and columns because it is often useful. One of the main reasons why I don’t use package bigalgebra is its lack of subsetting options.

3. Use an already implemented function

str(colsums_all5 <- big_colstats(mat))
## 'data.frame':    10000 obs. of  2 variables:
##  $ sum: num  11 139.6 73.2 0 0 ...
##  $ var: num  0.0062 1.0117 1.0164 0 0 ...

all.equal(colsums_all5$sum, colsums_all)
## [1] TRUE

Principal Component Analysis

Let’s begin by filling the matrix with random numbers in a tricky way.

U <- sweep(matrix(rnorm(nrow(mat) * 10), ncol = 10), 2, 1:10, "/")
V <- matrix(rnorm(ncol(mat) * 10), ncol = 10)
big_apply(mat, a.FUN = function(X, ind) {
  X[, ind] <- tcrossprod(U, V[ind, ]) + rnorm(nrow(X) * length(ind))
  NULL
}, a.combine = 'c')
## NULL

Let’s say we want the first 10 PCs of the (scaled) matrix.

system.time(
  small_svd <- svd(scale(mat[, 1:2000]), nu = 10, nv = 10)
)
##    user  system elapsed
##   8.348   0.063   8.429

system.time(
  small_svd2 <- big_SVD(mat, big_scale(), ind.col = 1:2000)
)
##    user  system elapsed
##   1.900   0.083   1.989

plot(small_svd2$u, small_svd$u)

system.time(
  small_svd3 <- big_randomSVD(mat, big_scale(), ind.col = 1:2000)
)
##    user  system elapsed
##   0.353   0.002   0.355

plot(small_svd3$u, small_svd$u)

system.time(
  svd_all <- big_randomSVD(mat, big_scale())
)
##    user  system elapsed
##   2.267   0.001   2.268

plot(svd_all)

Function big_randomSVD() uses Rcpp and package RSpectra to implement a fast Singular Value Decomposition for a big.matrix that is linear in all dimensions (the standard PCA algorithm is quadratic in the smallest dimension), which makes it very fast even for datasets where both dimensions are large.

Some linear models

M <- 100 # number of causal variables
set <- sample(ncol(mat), M)
y <- mat[, set] %*% rnorm(M)
y <- y + rnorm(length(y), sd = 2 * sd(y))

ind.train <- sort(sample(nrow(mat), size = 0.8 * nrow(mat)))
ind.test <- setdiff(rows_along(mat), ind.train)

mult_test <- big_univLinReg(mat, y[ind.train], ind.train = ind.train,
                            covar.train = svd_all$u[ind.train, ])

library(ggplot2)
plot(mult_test) +
  aes(color = cols_along(mat) %in% set) +
  labs(color = "Causal?")

train <- big_spLinReg(mat, y[ind.train], ind.train = ind.train,
                      covar.train = svd_all$u[ind.train, ],
                      alpha = 0.5)
pred <- predict(train, X. = mat, ind.row = ind.test,
                covar.row = svd_all$u[ind.test, ])
plot(apply(pred, 2, cor, y = y[ind.test]))

The functions big_spLinReg(), big_spLogReg() and big_spSVM() all use lasso (L1) or elastic-net (L1 & L2) regularization in order to limit the number of predictors and to accelerate computations thanks to strong rules (R. Tibshirani et al. 2012). The implementations of these functions are based on modifications of packages sparseSVM and biglasso (Zeng and Breheny 2017). Yet, these models give predictions for a range of 100 different regularization parameters, whereas we are only interested in one prediction.

So, that’s why I came up with the idea of Cross-Model Selection and Averaging (CMSA), whose principle is:

1. This function separates the training set in K folds (e.g. 10).
2. In turn,
• each fold is considered as an inner validation set and the others (K – 1) folds form an inner training set,
• the model is trained on the inner training set and the corresponding predictions (scores) for the inner validation set are computed,
• the vector of scores which maximizes feval is determined,
• the vector of coefficients corresponding to the previous vector of scores is chosen.
3. The K resulting vectors of coefficients are then combined into one vector.
train2 <- big_CMSA(big_spLinReg,
                   feval = function(pred, target) cor(pred, target),
                   X. = mat, y.train = y[ind.train],
                   ind.train = ind.train,
                   covar.train = svd_all$u[ind.train, ],
                   alpha = 0.5, ncores = parallel::detectCores() / 2)

mean(train2 != 0) # percentage of predictors
## [1] 0.1806194

pred2 <- predict(train2, X. = mat, ind.row = ind.test,
                 covar.row = svd_all$u[ind.test, ])
cor(pred2, y[ind.test])
## [1] 0.311157

Some matrix computations

For example, let’s compute the correlation of the first 2000 columns.

system.time(
  corr <- cor(mat[, 1:2000])
)
##    user  system elapsed
##  10.822   0.007  10.824

system.time(
  corr2 <- big_cor(mat, ind.col = 1:2000)
)
##    user  system elapsed
##   0.729   0.005   0.737

all.equal(corr2, corr)
## [1] TRUE

Advantages of using big.matrix objects
• you can apply algorithms on 100GB of data,
• you can easily parallelize your algorithms because the data on disk is shared,
• you write more efficient algorithms,
• you can use different types of data, for example, in my field, I’m storing my data with only 1 byte per element (rather than 8 bytes for a standard R matrix).

In a next post, I’ll try to talk about good practices on how to use parallelism in R.

References

Kane, Michael J, John W Emerson, and Stephen Weston. 2013. “Scalable Strategies for Computing with Massive Data.” Journal of Statistical Software 55 (14): 1–19. doi:10.18637/jss.v055.i14.

Tibshirani, Robert, Jacob Bien, Jerome Friedman, Trevor Hastie, Noah Simon, Jonathan Taylor, and Ryan J Tibshirani. 2012. “Strong rules for discarding predictors in lasso-type problems.” Journal of the Royal Statistical Society. Series B, Statistical Methodology 74 (2): 245–66. doi:10.1111/j.1467-9868.2011.01004.x.

Wickham, Hadley. 2011. “The Split-Apply-Combine Strategy for Data Analysis.” Journal of Statistical Software 40 (1): 1–29. doi:10.18637/jss.v040.i01.

Zeng, Yaohui, and Patrick Breheny. 2017. “The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R,” January. http://arxiv.org/abs/1701.05936.
