R news and tutorials contributed by (750) R bloggers

Data viz challenge: Recreating FiveThirtyEight’s ‘Deadest Names’ graphic with ggplot2

Tue, 11/06/2018 - 21:15

(This article was first published on my (mis)adventures in R programming, and kindly contributed to R-bloggers)

I’ve recently begun reading through the book Modern Data Science with R, by Benjamin S. Baumer, Daniel T. Kaplan, and Nicholas J. Horton. It’s quite clear and informative. One of the things I especially appreciate about it is that I’m not finding the math to be too cumbersome. Even someone like me, whose primary background isn’t in math or statistics, can follow along with the book quite easily.

As I’m reading through the book, I’m doing the exercises at the back of the chapters, and I recently worked through chapter 3, which covers ggplot2 basics. One of the exercises at the end of this chapter asks us to recreate this graphic from FiveThirtyEight. The goal of the exercise is to use ggplot2 to make production-quality graphics.

The project makes use of the babynames package, which uses public data on baby names from the Social Security Administration. We then use the make_babynames_dist() function from the mdsr package that the authors developed to add variables relevant to the goals of the exercise. Basically, it takes the data from the lifetables table in the babynames package, adds variables, filters, and returns just the data relevant to 2014.

Truth be told, I was stumped by this exercise when I first read it. So I reached out to Nicholas Horton at Amherst, and he helped me with the basic scripting, which I was then able to tweak to recreate what I was looking for.

So first we load the necessary libraries and inspect the dataset.

library(mdsr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggthemes)
library(babynames)

babynames_dist <- make_babynames_dist()
head(babynames_dist)

# A tibble: 6 x 9
   year sex   name        n   prop alive_prob count_thousands age_today est_alive_today
1  1900 F     Mary    16707 0.0526          0           16.7        114               0
2  1900 F     Helen    6343 0.0200          0            6.34       114               0
3  1900 F     Anna     6114 0.0192          0            6.11       114               0
4  1900 F     Marga…   5304 0.0167          0            5.30       114               0
5  1900 F     Ruth     4765 0.0150          0            4.76       114               0
6  1900 F     Eliza…   4096 0.0129          0            4.10       114               0

So what we need to do is create some new variables that provide the total number of people with a given name who are likely still alive, and from that we can then calculate the percentage who are likely dead. Then (and this is what initially had me stumped) we need to select the top 10 male names and the top 10 female names.

deadest <- babynames_dist %>%
  filter(year >= 1900) %>%
  group_by(name, sex) %>%
  summarise(N = n(),
            total_est_alive_today = sum(est_alive_today),
            total = sum(n)) %>%
  mutate(percent_dead = 1 - (total_est_alive_today / total)) %>%
  filter(total > 50000) %>%
  arrange(desc(percent_dead)) %>%
  group_by(sex) %>%
  top_n(10)

The above gives us the following dataset, which we can then use to create the graphic:

# A tibble: 20 x 6
# Groups: sex [2]
   name     sex     N total_est_alive_today  total percent_dead
 1 Mabel    F     111                20233.  96037        0.789
 2 Gertrude F     111                31360. 145693        0.785
 3 Myrtle   F      99                25491. 108941        0.766
 4 Blanche  F     111                16509.  69524        0.763
 5 Beulah   F     110                15642.  63361        0.753
 6 Opal     F     111                17471.  65821        0.735
 7 Florence F     111                77679. 284945        0.727
 8 Agnes    F     111                37593. 134940        0.721
 9 Viola    F     111                32957. 116666        0.718
10 Bessie   F     111                36824. 130155        0.717
11 Elmer    M     111                35548. 116830        0.696
12 Wilbur   M     111                17881.  54423        0.671
13 Homer    M     111                18809.  55639        0.662
14 Willard  M     111                28576.  74821        0.618
15 Hubert   M     111                21417.  55340        0.613
16 Chester  M     111                44995. 114370        0.607
17 Clarence M     111               113641. 280518        0.595
18 Herbert  M     111                88652. 217291        0.592
19 Harry    M     111               153501. 374524        0.590
20 Horace   M     111                20723.  50340        0.588

So then here is the code to create the actual data viz:

ggplot(deadest, aes(reorder(name, percent_dead), percent_dead, fill = sex)) +
  geom_bar(stat = "identity") +
  geom_text(aes(y = percent_dead + 0.05,
                label = round(percent_dead * 100, 1))) +
  coord_flip() +
  ggtitle("Deadest Names",
          subtitle = "Estimated % of Americans with a given name born since 1900\nwho were dead as of Jan. 1, 2014") +
  scale_x_discrete(NULL) +
  scale_y_continuous(NULL) +
  scale_fill_manual(values = c("#f6b900", "#008fd5")) +
  theme_fivethirtyeight() +
  theme(axis.text.x = element_blank(),
        panel.grid = element_blank(),
        legend.position = "none")

I’ve used the fivethirtyeight theme from the ggthemes package, and apart from the footer that FiveThirtyEight uses, it looks pretty close.

The post Data viz challenge: Recreating FiveThirtyEight’s ‘Deadest Names’ graphic with ggplot2 appeared first on my (mis)adventures in R programming.


To leave a comment for the author, please follow the link and comment on their blog: my (mis)adventures in R programming. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

R-bloggers weekly – top R posts from last week (2018-10-28 till 2018-11-03)

Tue, 11/06/2018 - 20:25

The most-liked R posts from last week, sorted by the number of likes they got on Twitter. Enjoy:



More on sigr

Tue, 11/06/2018 - 19:29

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

If you’ve read our previous R Tip on using sigr with linear models, you might have noticed that the lm() summary object does in fact carry the R-squared and F statistics, both in the printed form:

model_lm <- lm(formula = Petal.Length ~ Sepal.Length, data = iris)
(smod_lm <- summary(model_lm))

##
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = iris)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.47747 -0.59072 -0.00668  0.60484  2.49512
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
## Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8678 on 148 degrees of freedom
## Multiple R-squared:  0.76,   Adjusted R-squared:  0.7583
## F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16

and also in the summary() object:

c(R2 = smod_lm$r.squared, F = smod_lm$fstatistic[1])

##          R2     F.value
##   0.7599546 468.5501535

Note, though, that while the summary reports the model’s significance, it does not carry it as a specific summary() object item. sigr::wrapFTest() is a convenient way to extract the model’s R-squared and F statistic and simultaneously calculate the model significance, as is required by many scientific publications.
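As a concrete sketch of that workflow (re-fitting the model from above so the chunk is self-contained; the exact wording of the rendered string may vary across sigr versions):

```r
library(sigr)

# Refit the linear model from above
model_lm <- lm(Petal.Length ~ Sepal.Length, data = iris)

# wrapFTest() extracts R-squared and the F statistic, and computes
# the model significance, in one step
ft <- wrapFTest(model_lm)

# Render the result as plain text for inclusion in a report
render(ft, format = "ascii")
```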

sigr is even more helpful for logistic regression, via glm(), which reports neither the model’s chi-squared statistic nor its significance.

iris$isVersicolor <- iris$Species == "versicolor"
model_glm <- glm(
  isVersicolor ~ Sepal.Length + Sepal.Width,
  data = iris,
  family = binomial)
(smod_glm <- summary(model_glm))

##
## Call:
## glm(formula = isVersicolor ~ Sepal.Length + Sepal.Width, family = binomial,
##     data = iris)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.9769  -0.8176  -0.4298   0.8855   2.0855
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)    8.0928     2.3893   3.387 0.000707 ***
## Sepal.Length   0.1294     0.2470   0.524 0.600247
## Sepal.Width   -3.2128     0.6385  -5.032 4.85e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 190.95  on 149  degrees of freedom
## Residual deviance: 151.65  on 147  degrees of freedom
## AIC: 157.65
##
## Number of Fisher Scoring iterations: 5

To get the significance of a logistic regression model, call sigr::wrapChiSqTest():

library(sigr)
(chi2Test <- wrapChiSqTest(model_glm))

## [1] "Chi-Square Test summary: pseudo-R2=0.21 (X2(2,N=150)=39, p<1e-05)."

Notice that the fit summary also reports a pseudo-R-squared. You can extract the values directly off the sigr object, as well:

str(chi2Test)

## List of 10
##  $ test          : chr "Chi-Square test"
##  $ df.null       : int 149
##  $ df.residual   : int 147
##  $ null.deviance : num 191
##  $ deviance      : num 152
##  $ pseudoR2      : num 0.206
##  $ pValue        : num 2.92e-09
##  $ sig           : num 2.92e-09
##  $ delta_deviance: num 39.3
##  $ delta_df      : int 2
##  - attr(*, "class")= chr [1:2] "sigr_chisqtest" "sigr_statistic"

And of course you can render the sigr object into one of several formats (LaTeX, html, markdown, and ascii) for direct inclusion in a report or publication.

render(chi2Test, format = "html")

Chi-Square Test summary: pseudo-R2=0.21 (χ2(2,N=150)=39, p<1e-05).

By the way, if you are interested, we give the explicit formula for calculating the significance of a logistic regression model in Practical Data Science with R.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

Adding different annotation to each facet in ggplot

Tue, 11/06/2018 - 15:33

(This article was first published on Emma R, and kindly contributed to R-bloggers)

Help! The same annotations go on every facet!

(with thanks to a student for sending me her attempt).

This is a question I get fairly often, and the answer is not straightforward, especially for those who are relatively new to R and ggplot2.

In this post, I will show you how to add different annotations to each facet. More like this:

This is useful in its own right but can also help you understand ggplot better.

I will assume you have RStudio installed and have at least a little experience with it, but I’m aiming to make this doable for novices. I’ll also assume you’ve analysed your data, so you know what annotations you want to add.

Faceting is a very useful feature of ggplot which allows you to efficiently plot subsets of your data next to each other.
In this example the data are the wing lengths for males and females of two imaginary species of butterfly in two regions, north and south. Some of the results of a statistical analysis are shown with annotation.

1. Preparation

The first thing you want to do is make a folder on your computer where your code and the data for plotting will live. This is your working directory.
Now get a copy of the data by saving this file to the folder you just made.

2. Start in RStudio

Start R Studio and set your working directory to the folder you created:

Now start a new script file:

and save it as figure.R or similar.

3. Load packages

Make the packages you need available for use in this R Studio session by adding this code to your script and running it.

# package loading
library(ggplot2)
library(Rmisc)

4. Read the data into R

The data are in a plain text file where the first row gives the column names. It can be read in to a dataframe with the read.table() command:

butter <- read.table("butterflies.txt", header = TRUE)

Each row in this data set is an individual butterfly, and the columns are four variables:

  • winglen the wing length (in millimeters) of an individual
  • spp its species, one of “F.concocti” or “F.flappa”
  • sex its sex, one of “female” or “male”
  • region where it is from, one of “North” or “South”
5. Summarise the data

Our plot has the means and standard errors for each group and this requires us to summarize over the replicates which we can do with the summarySE() function:

buttersum <- summarySE(data = butter,
                       measurevar = "winglen",
                       groupvars = c("spp", "sex", "region"))
buttersum

##          spp    sex region  N  winglen       sd        se       ci
## 1 F.concocti female  North 10 25.93591 4.303011 1.3607315 3.078189
## 2 F.concocti female  South 10 31.37000 4.275265 1.3519574 3.058340
## 3 F.concocti   male  North 10 23.22876 4.250612 1.3441617 3.040705
## 4 F.concocti   male  South 10 24.97000 4.957609 1.5677337 3.546460
## 5   F.flappa female  North 10 33.18389 4.286312 1.3554509 3.066243
## 6   F.flappa female  South 10 24.67000 3.270423 1.0341986 2.339520
## 7   F.flappa   male  North 10 24.46586 5.492053 1.7367398 3.928778
## 8   F.flappa   male  South 10 23.45000 3.012290 0.9525696 2.154862

A group is a species-sex-region combination.

6. Plot

We have four variables to plot. Three are explanatory: species, sex and region. We map one of the explanatory variables to the x-axis, one to different colours and one to the facets.

To plot North and South on separate facets, we tell facet_grid() to plot everything else (.) for each region:

ggplot(data = buttersum, aes(x = spp, y = winglen)) +
  geom_point(aes(colour = sex), position = position_dodge(width = 1)) +
  geom_errorbar(aes(colour = sex, ymin = winglen - se, ymax = winglen + se),
                width = .2, position = position_dodge(width = 1)) +
  ylim(0, 40) +
  facet_grid(. ~ region)

Build understanding

This section will help you understand why facet annotations are done as they are but you can go straight to 7. Create a dataframe for the annotation information if you just want the code.

We plan to facet by region but in order to understand better, it is useful to first plot just one region. We can subset the data to achieve that:

a) Subset

# subset the northern region
butterN <- butter[butter$region == "North", ]

b) Summarise the data subset for plotting

butterNsum <- summarySE(data = butterN,
                        measurevar = "winglen",
                        groupvars = c("spp", "sex"))
butterNsum

##          spp    sex  N  winglen       sd       se       ci
## 1 F.concocti female 10 25.93591 4.303011 1.360732 3.078189
## 2 F.concocti   male 10 23.22876 4.250612 1.344162 3.040705
## 3   F.flappa female 10 33.18389 4.286312 1.355451 3.066243
## 4   F.flappa   male 10 24.46586 5.492053 1.736740 3.928778

c) Plot the subset

Since we are dealing only with data from the North, we have just three variables to plot. We map one of the explanatory variables to the x-axis and the other to different colours:

ggplot(data = butterNsum, aes(x = spp, y = winglen, colour = sex)) +
  geom_point(position = position_dodge(width = 1)) +
  geom_errorbar(aes(ymin = winglen - se, ymax = winglen + se),
                width = .2, position = position_dodge(width = 1)) +
  ylim(0, 40)

  • data = butterNsum tells ggplot which dataframe to plot (the summary)
    • aes(x = spp, y = winglen, colour = sex) the “aesthetic mappings” specify where to put each variable. Aesthetic mappings given in the ggplot() statement will apply to every “layer” in the plot unless otherwise specified.
  • geom_point() the first “layer” adds points
    • position = position_dodge(width = 1) indicates female and male means should be plotted side-by-side for each species, not on top of each other
  • geom_errorbar() the second layer adds the error bars. These must also be position dodged so they appear on the points.
    • aes(ymin = winglen - se, ymax = winglen + se) The error bars need new aesthetic mappings because they are not at winglen (the mean in the summary) but at the mean minus the standard error and the mean plus the standard error. Since all of that information is inside butterNsum, we do not need to give the data argument again.
d) Annotate this plot (i)

The annotation is composed of three lines – or segments – and some text. Each segment has a start (x, y) and an end (xend, yend) which we need to specify. The text is centered on its (x, y).

The x-axis has two categories which have the internal coding of 1 and 2. We want the annotation to start a bit before 2 and finish a bit after 2.

Note that position_dodge() units are twice the category axis units in this example.

ggplot(data = butterNsum, aes(x = spp, y = winglen, colour = sex)) +
  geom_point(position = position_dodge(width = 1)) +
  geom_errorbar(aes(ymin = winglen - se, ymax = winglen + se),
                width = .2, position = position_dodge(width = 1)) +
  geom_text(x = 2, y = 38, label = "***", colour = "black") +
  geom_segment(x = 1.75, xend = 1.75, y = 36, yend = 37, colour = "black") +
  geom_segment(x = 2.25, xend = 2.25, y = 36, yend = 37, colour = "black") +
  geom_segment(x = 1.75, xend = 2.25, y = 37, yend = 37, colour = "black") +
  ylim(0, 40)

e) Annotate this plot (ii)

Instead of hard coding the co-ordinates into the plot, we could have put them in a dataframe with a column for each x or y as follows:

anno <- data.frame(x1 = 1.75, x2 = 2.25,
                   y1 = 36, y2 = 37,
                   xstar = 2, ystar = 38,
                   lab = "***")
anno

##     x1   x2 y1 y2 xstar ystar lab
## 1 1.75 2.25 36 37     2    38 ***

Then give a dataframe argument to geom_segment() and geom_text() and the aesthetic mappings for that dataframe. We also need to move the colour mapping from the ggplot() statement to the geom_point() and geom_errorbar().

This is because the mappings applied in the ggplot() will apply to every layer unless otherwise specified and if the colour mapping stays there, geom_segment() and geom_text() will try to find the variable ‘sex’ in the anno dataframe.

ggplot(data = butterNsum, aes(x = spp, y = winglen)) +
  geom_point(aes(colour = sex), position = position_dodge(width = 1)) +
  geom_errorbar(aes(colour = sex, ymin = winglen - se, ymax = winglen + se),
                width = .2, position = position_dodge(width = 1)) +
  ylim(0, 40) +
  geom_text(data = anno, aes(x = xstar, y = ystar, label = lab)) +
  geom_segment(data = anno, aes(x = x1, xend = x1, y = y1, yend = y2), colour = "black") +
  geom_segment(data = anno, aes(x = x2, xend = x2, y = y1, yend = y2), colour = "black") +
  geom_segment(data = anno, aes(x = x1, xend = x2, y = y2, yend = y2), colour = "black")

7. Create a dataframe for the annotation information

The easiest way to annotate for each facet separately is to create a dataframe with a row for each facet:

anno <- data.frame(x1 = c(1.75, 0.75), x2 = c(2.25, 1.25),
                   y1 = c(36, 36), y2 = c(37, 37),
                   xstar = c(2, 1), ystar = c(38, 38),
                   lab = c("***", "**"),
                   region = c("North", "South"))
anno

##     x1   x2 y1 y2 xstar ystar lab region
## 1 1.75 2.25 36 37     2    38 ***  North
## 2 0.75 1.25 36 37     1    38  **  South

8. Annotate the plot

Use the annotation dataframe as the value for the data argument in geom_segment() and geom_text(). New aesthetic mappings will be needed too:

ggplot(data = buttersum, aes(x = spp, y = winglen)) +
  geom_point(aes(colour = sex), position = position_dodge(width = 1)) +
  geom_errorbar(aes(colour = sex, ymin = winglen - se, ymax = winglen + se),
                width = .2, position = position_dodge(width = 1)) +
  ylim(0, 40) +
  geom_text(data = anno, aes(x = xstar, y = ystar, label = lab)) +
  geom_segment(data = anno, aes(x = x1, xend = x1, y = y1, yend = y2), colour = "black") +
  geom_segment(data = anno, aes(x = x2, xend = x2, y = y1, yend = y2), colour = "black") +
  geom_segment(data = anno, aes(x = x1, xend = x2, y = y2, yend = y2), colour = "black") +
  facet_grid(. ~ region)

Or, a little more report-friendly:

ggplot(data = buttersum, aes(x = spp, y = winglen)) +
  geom_point(aes(shape = sex), position = position_dodge(width = 1), size = 2) +
  scale_shape_manual(values = c(1, 19),
                     labels = c("Female", "Male")) +
  geom_errorbar(aes(group = sex, ymin = winglen - se, ymax = winglen + se),
                width = .2, position = position_dodge(width = 1)) +
  ylim(0, 40) +
  ylab("Wing length (mm)") +
  xlab("") +
  geom_text(data = anno, aes(x = xstar, y = ystar, label = lab)) +
  geom_segment(data = anno, aes(x = x1, xend = x1, y = y1, yend = y2), colour = "black") +
  geom_segment(data = anno, aes(x = x2, xend = x2, y = y1, yend = y2), colour = "black") +
  geom_segment(data = anno, aes(x = x1, xend = x2, y = y2, yend = y2), colour = "black") +
  facet_grid(. ~ region) +
  theme(panel.background = element_rect(fill = "white", colour = "black"),
        strip.background = element_rect(fill = "white", colour = "black"),
        legend.key = element_blank(),
        legend.title = element_blank())

If you want to add images to each facet you can use the ggimage package. I covered this in a previous blog post, Fun and easy R graphs with images.
You need to add a column to your annotation dataframe.
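For instance, a sketch of that idea (the column name image and the file paths here are hypothetical; ggimage::geom_image() draws one image per row of the data you pass it, and the region column sends each row to its own facet):

```r
library(ggplot2)
library(ggimage)

# Hypothetical annotation dataframe: one image per facet,
# positioned near the bottom of each panel
img_anno <- data.frame(
  region = c("North", "South"),
  x = c(1, 2),
  y = c(5, 5),
  image = c("butterfly_north.png", "butterfly_south.png")  # hypothetical files
)

# Because img_anno has a 'region' column, each image lands on its own facet
ggplot(data = buttersum, aes(x = spp, y = winglen)) +
  geom_point(aes(colour = sex), position = position_dodge(width = 1)) +
  geom_image(data = img_anno, aes(x = x, y = y, image = image), size = 0.1) +
  facet_grid(. ~ region)
```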

Package references

Hope R.M. (2013). Rmisc: Ryan Miscellaneous. R package version 1.5.

Wickham H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4

Yu G. (2018). ggimage: Use Image in ‘ggplot2’. R package version 0.1.7.

R and all its packages are free, so don’t forget to cite the awesome contributors.
How to cite packages in R


To leave a comment for the author, please follow the link and comment on their blog: Emma R.

Can we predict the crawling of the Google-Bot?

Tue, 11/06/2018 - 14:04

(This article was first published on Keyword Hero – The SEO Revolution, and kindly contributed to R-bloggers)

Logfile analysis gives website owners deep insight into how Google and other search engines crawl their pages. How often does a bot visit a page? Which pages are rarely, or never, visited by the bots? For which pages does the bot get errors? All these questions and more can be answered by a log file analysis.

Unfortunately, a log file analysis is extremely time-consuming in many cases, as very few employees have access to the log files, especially in large organizations. If you have managed to access this treasure trove of data, the evaluation of the usually huge files can cause you headaches.

For collecting our logs we used Log Hero, a tool which stores the logs directly in Google Analytics, so we can simply query the data there via the API and compare it directly with user data from the main account as needed in the analysis.

Get the Bot Data from Google Analytics

To get to know the paths the Google crawler travels along our page, we first need the data provided through Log Hero. We can easily access the data through an API call to Google Analytics. We like to use the rga package, but you may prefer googleAnalyticsR. Both deliver the data in the same format, so it’s mostly a matter of personal preference. It is very important, though, to use the rga package from github.com/skardhamar/rga and not its capital-letter cousin RGA.

rga.open(instance = "ga",
         client.id = "",
         client.secret = "xxxxxxxxxxxxxxxxx")
logdata <- ga$getData("177217805",
                      start.date = '2018-06-01',
                      end.date = '2018-10-28',
                      dimensions = "ga:year,ga:date,ga:pagePath",
                      metrics = "ga:hits",
                      filters = "ga:dimension2=@Google;ga:dimension4==200",
                      batch = TRUE)


Follow the Googlebot

We do some data wrangling to put the steps of the crawler into a format that is accepted by the clickstream package. Now we know the paths the crawler is taking and are able to find structure within these paths. A natural model to apply here is a Markov chain, which is easily fitted and plotted.

# Use the first four characters of a SHA-1 hash to make the plot readable
logdata$pagePathMD5 = substr(unlist(lapply(logdata$pagePath, FUN = digest::sha1)), 1, 4)
logdata %>% filter(hits > 10) -> filtered_logdata

# Data wrangling for clickstream format
path <- ddply(filtered_logdata, "year",
              function(logdata) paste(logdata$pagePathMD5, collapse = ","))
newchar <- as.character(path$V1)
csf <- file(description = "clickstream.txt")
writeLines(newchar, csf)
cls <- readClickstreams(csf, header = TRUE)

# Fit and plot MC
mc <- fitMarkovChain(cls)
plot(mc, edge.label = NULL)

But how often will Google visit us again?

With the help of the high-level forecasting package prophet, we can make easy and surprisingly accurate predictions of how many times the crawler will visit the page in the future – by learning from past data. There is, however, some erratic component to the crawler behaviour – or rather, we are missing some information.

prophet_input = logdata[, c("date", "hits")]
colnames(prophet_input) <- c("ds", "y")
m <- prophet(prophet_input)

## Initial log joint probability = -3.33717
## Optimization terminated normally:
##   Convergence detected: relative gradient magnitude is below tolerance

future <- make_future_dataframe(m, periods = 60)
forecast <- predict(m, future)
plot(m, forecast)

Estimating the hits of a single page

Now if we add a new page, or otherwise want to predict how often a single page will be visited by the crawler, we would like to build a prediction model. For this, we will use 4 sources of data:

  1. The Loghero data – Where was the crawler in the past and how often?
  2. The standard Google Analytics Data – Where and how long do human users stay?
  3. Screaming Frog page data – Meta data of each page (text ratio, word count, size…)
  4. Screaming Frog link data – (Unique) in and outlinks to the page

So let us gather all this data. We will be using the googleAnalyticsR package this time.

# GA data
googleAnalyticsR::ga_auth(new_user = TRUE)

# Loghero Data
df_loghero = google_analytics_3(id = ">your_LOGHERO_ID<",
                                start = '2018-05-01', end = '2018-10-15',
                                dimensions = c("ga:dimension2", "ga:dimension1",
                                               "ga:dimension4", "ga:pagePath"),
                                metrics = c("ga:hits", "ga:metric1"),
                                filters = "ga:dimension2=~^Googlebot(\\-Mobile)?",
                                samplingLevel = "WALK")

# Google Analytics data
df_user_ga = google_analytics_3(id = ">your_GA_ID<",
                                start = '2018-05-01', end = '2018-08-15',
                                dimensions = c("ga:pagePath", "ga:hostname"),
                                metrics = c("ga:pageviews", "ga:avgTimeOnPage", "entrances"),
                                samplingLevel = "WALK")

# Screaming Frog data
# Written to csv files in the Screaming Frog application
page_data = fread(input = "internal_html.csv")
page_inlinks = fread(input = "all_inlinks.csv")

To merge all our data together we have to do some basic data wrangling.

library(googleAnalyticsR)
library(igraph)
library(ranger)
library(data.table)
library(dplyr)

# Feature Engineering and Type Casting
df_loghero$avgMetric1 = df_loghero$metric1 / as.numeric(df_loghero$hits)
df_loghero$hits = as.factor(df_loghero$hits)

# Build into a single file
page_data$pagePath = gsub(pattern = "", "", page_data$Address)
df_loghero$pagePath <- gsub("sitemap", "sitemap", df_loghero$pagePath)
df_loghero$pagePath <- gsub(".*sitemap.*", "sitemap", df_loghero$pagePath)
df_loghero$pagePath <- gsub("^\\/$", "Home", df_loghero$pagePath)
df_user_ga$pagePath <- gsub("sitemap", "sitemap", df_user_ga$pagePath)
df_user_ga$pagePath <- gsub(".*sitemap.*", "sitemap", df_user_ga$pagePath)
df_user_ga$pagePath <- gsub("^\\/$", "Home", df_user_ga$pagePath)

# summarise
df_user_ga %>%
  group_by(pagePath) %>%
  summarise(pageviews = sum(pageviews),
            avgTimeOnPage = mean(avgTimeOnPage),
            entrances = sum(entrances)) %>%
  ungroup() -> df_user_ga
df_loghero %>%
  group_by(pagePath) %>%
  summarise(hits = sum(as.numeric(hits)),
            avgMetric1 = mean(avgMetric1)) -> df_loghero

# merge 1
df_merged = merge(page_data, df_loghero, by = "pagePath")
df_merged = merge(df_merged, df_user_ga, by = "pagePath", all.x = TRUE)

# Basic imputation
df_merged$pageviews[is.na(df_merged$pageviews)] = mean(df_merged$pageviews, na.rm = TRUE)
df_merged$avgTimeOnPage[is.na(df_merged$avgTimeOnPage)] = mean(df_merged$avgTimeOnPage, na.rm = TRUE)
df_merged$entrances[is.na(df_merged$entrances)] = mean(df_merged$entrances, na.rm = TRUE)

# Use only R standard variable names
names(df_merged) = make.names(names(df_merged))

We will now use two basic models, which we will train on a subset and test on a test set. First, we will use a random forest to estimate a basic regression, ignoring the ordinal scale of the discrete visit counts but enjoying the non-linearity. In the second model, we will use a generalized linear model with a Poisson link, thus incorporating the fact that we are predicting a count variable but losing non-linearity in the process.

# Modelling
# Train Test Split
train_idx = sample(1:nrow(df_merged), size = round(2 * nrow(df_merged) / 3), replace = FALSE)
train = df_merged[train_idx, ]
test = df_merged[-train_idx, ]

# Ranger basic Model
mod_rf = ranger(formula = hits ~ Word.Count + Unique.Inlinks + Unique.Outlinks +
                  Inlinks + Outlinks + Text.Ratio + Response.Time + Size +
                  pageviews + avgTimeOnPage + entrances,
                data = train, num.trees = 500, importance = "impurity")
mod_glm = glm(formula = hits ~ Word.Count + Unique.Inlinks + Unique.Outlinks +
                Inlinks + Outlinks + Text.Ratio + Response.Time + Size +
                pageviews + avgTimeOnPage + entrances,
              data = train, family = poisson)

# Predict values
test$pred_rf = round(predict(mod_rf, data = test)$predictions)
test$pred_glm = round(predict(mod_glm, newdata = test))

We have trained the models and predicted the number of crawler visits for pages the models have never seen. Let’s compare the predicted values with the actual ones:

test %>%
  dplyr::select(pagePath, hits, pred_rf, pred_glm) %>%
  arrange(desc(hits))

pagePath                                        hits pred_rf pred_glm
/ohrenschmalz/                                   197     126        5
/ohren-reinigen/ohrenkerzen/                     195      74        5
/ohren-reinigen/ohren-reinigen-bei-haustieren/   127     120        6
/ohren-reinigen/ohrenreiniger/                    79      72        4
/otoskop/                                         69      41        4
/reinigungsanleitung/                             63      44        5
/ohren-reinigen/wattestaebchen/                   47      97        4
/testsieger/                                      38      53        2

test %>%
  dplyr::select(pagePath, hits, pred_rf, pred_glm) %>%
  arrange(desc(hits)) %>%
  mutate(rank = rank(1/hits, ties.method = "min"),
         rank_rf = rank(1/pred_rf, ties.method = "min"),
         rank_glm = rank(1/pred_glm, ties.method = "min")) %>%
  dplyr::select(pagePath, rank, rank_rf, rank_glm)

pagePath                                        rank rank_rf rank_glm
/ohrenschmalz/                                     1       1        2
/ohren-reinigen/ohrenkerzen/                       2       4        2
/ohren-reinigen/ohren-reinigen-bei-haustieren/     3       2        1
/ohren-reinigen/ohrenreiniger/                     4       5        5
/otoskop/                                          5       8        5
/reinigungsanleitung/                              6       7        2
/ohren-reinigen/wattestaebchen/                    7       3        5
/testsieger/                                       8       6        8

As we can see, the random forest is far more capable of dealing with the large differences in scale between the pages than the GLM. But if we only consider the rank of the pages, both models actually work pretty well. From a user perspective, the concrete number of predicted visits is not as important as which pages are visited more often. The GLM, of course, lets us take a look at the beta values:


##                      Estimate   Std. Error    z value     Pr(>|z|)
## (Intercept)     11.7771264452 1.884972e+00  6.2479049 4.159949e-10
## Word.Count       0.0063256324 7.856065e-04  8.0519097 8.151201e-16
## Unique.Inlinks   0.0967094083 3.276334e-02  2.9517569 3.159715e-03
## Unique.Outlinks  0.2747712141 1.120261e-01  2.4527436 1.417714e-02
## Inlinks          0.0062746899 6.595578e-03  0.9513481 3.414277e-01
## Outlinks        -0.0886505758 4.462267e-02 -1.9866712 4.695885e-02
## Text.Ratio      -0.4113764679 7.342136e-02 -5.6029539 2.107293e-08
## Response.Time    2.2484413075 1.285903e+00  1.7485306 8.037219e-02
## Size            -0.0002508579 6.675199e-05 -3.7580583 1.712370e-04
## pageviews       -0.0004068211 1.399437e-04 -2.9070336 3.648740e-03
## avgTimeOnPage    0.0059767188 1.986728e-03  3.0083225 2.626942e-03
## entrances        0.0004635700 1.624705e-04  2.8532566 4.327367e-03
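Since the model uses a log link, the coefficients are easiest to read after exponentiation: exp(beta) is the multiplicative change in expected crawler hits per one-unit increase in a predictor. A small sketch, assuming the mod_glm fit from the modelling step above:

```r
# Rate ratios: the multiplicative effect on expected hits per unit
# increase in each predictor, holding the others fixed
rate_ratios <- exp(coef(mod_glm))
sort(rate_ratios, decreasing = TRUE)

# For example, the Word.Count estimate of about 0.0063 corresponds to
# exp(0.0063) ~ 1.0063, i.e. roughly 0.6% more expected crawler hits per word
```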


With this information you can further tune and adjust your homepage and find out what the crawler is actually doing.

Further thoughts on predicting the Google Bot

This article shows the first experiments we can do with the data from Log Hero for individual pages. What's more, we managed to produce viable proof-of-concept models for crawler behaviour prediction within a few dozen lines of code. Investing more time and thought may lead to very sophisticated models directly suited to your needs.

The idea now is to add more and more data points to the prediction and to find out which factors were most important in our prediction: the page speed, the behaviour/traffic of real users, in- and outlinks?

We would be very happy if you shared your results with us, ran similar analyses (or even the same ones with the code above), or had tips about other potentially relevant data points.

The post Can we predict the crawling of the Google-Bot? first appeared on Keyword Hero – The SEO Revolution.


To leave a comment for the author, please follow the link and comment on their blog: Keyword Hero – The SEO Revolution. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

xts 0.11-2 on CRAN

Tue, 11/06/2018 - 13:35

(This article was first published on FOSS Trading, and kindly contributed to R-bloggers)

xts version 0.11-2 was published to CRAN yesterday. xts provides data structures and functions to work with time-indexed data. This is a bug-fix release, with notable changes below:

  • The xts method for shift.time() is now registered. Thanks to Philippe Verspeelt for the report and PR (#268, #273).
  • An if-statement in the xts constructor will no longer try to use a logical vector with length > 1. Code like if (c(TRUE, TRUE)) will throw a warning in an upcoming R release, and this patch will prevent that warning. Thanks to Hugh Parsonage for the report and PR (#270, #272).
  • Fix subset when index(i) and i contain duplicates. Observations were being incorrectly dropped, and behavior is now consistent with zoo. Thanks to Stack Overflow user scs for the report, and Philippe Verspeelt for the help debugging (#275).
  • Make column names for merge() results with unnamed objects shorter and more like zoo (#248). Previously, column names could be hundreds, even thousands, of characters. This change has the added benefit of making na.fill() much faster (#259). NOTE: This may BREAK existing code for integer unnamed objects.
  • The to.period() family of functions now use the index timezone when converting intraday index values to daily values (or lower frequency). Previously, the dates would be calculated as UTC dates, instead of dates in the local timezone (as they are now). Thanks to Garrett See and Gabor Grothendieck for the reports (#53, #277).

As always, I’m looking forward to your questions and feedback! If you have a question, please ask on Stack Overflow and use the [r] and [xts] tags. Or you can send an email to the R-SIG-Finance mailing list (you must subscribe to post). Open an issue on GitHub if you find a bug or want to request a feature, but please read the contributing guide first!


To leave a comment for the author, please follow the link and comment on their blog: FOSS Trading.

Happy 10th Bday, Rcpp – and welcome release 1.0 !!

Tue, 11/06/2018 - 01:55

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Ten years ago today I wrote the NEWS.Rd entry in this screenshot for the very first Rcpp release:

First Rcpp release

So Happy Tenth Birthday, Rcpp!! It has been quite a ride. Nearly 1500 packages on CRAN, or about one in nine (!!), rely on Rcpp to marshal data between R and C++, and to extend R with performant C++ code.

Rcpp would not exist without Dominick Samperi who recognised early just how well C++ template classes could fit the task of seamlessly interchanging data between the two systems, and contributed the first versions to my RQuantLib package. After a few versions he left this and R around 2006. About two years later, as I needed this for what I was working on, I rejuvenated things with the initial “second wave” release whose 10th birthday we celebrate today.

In late 2009 Romain François joined, initially contacted because of the needs of RProtoBuf for some Java tooling. (As an aside, there is a neat story of the very first steps of RProtoBuf here.) And from late 2009 to some time in 2013 or so Romain pushed Rcpp incredibly hard, far and well. We all still benefit from the work he did, and probably will for some time.

These days Rcpp is driven by a small and dedicated core team: JJ Allaire, Kevin Ushey, Qiang Kou, Nathan Russell, Doug Bates, John Chambers and myself – plus countless contributors and even more users via GitHub and the mailing list. Thanks to all for making something excellent together!

And because ten is written as one-oh, I felt the time was right to go to the big one oh release and just committed this:

Release 1.0

I will probably wrap this up as a release in a day or two and send it to CRAN, where it may remain for a few days of testing and checking. If you want it sooner, try the GitHub repo or the Rcpp drat repo where I will push the release once completed.

Again, a very big cheers and heartfelt Thank You! to everybody who helped on this ten year journey of making Rcpp what it is today.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .

‘How do neural nets learn?’ A step by step explanation using the H2O Deep Learning algorithm.

Tue, 11/06/2018 - 01:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

In my last blogpost about Random Forests I introduced the Bootcamp. The next part I published was about Neural Networks and Deep Learning. Every video of our bootcamp will have example code and tasks to promote hands-on learning. While the practical parts of the bootcamp will be using Python, below you will find the English R version of this Neural Nets Practical Example, where I explain how neural nets learn and how the concepts and techniques translate to training neural nets in R with the H2O Deep Learning function.

You can find the video on YouTube but, as before, it is only available in German. The same goes for the slides, which are also currently German only. See the end of this article for the embedded video and slides.

Neural Nets and Deep Learning

Just like Random Forests, neural nets are a method for machine learning and can be used for supervised, unsupervised and reinforcement learning. The idea behind neural nets was developed back in the 1940s as a way to mimic how the human brain learns. That's why neural nets in machine learning are also called ANNs (Artificial Neural Networks).

When we say Deep Learning, we talk about big and complex neural nets, which are able to solve complex tasks, like image or language understanding. Deep Learning has gained traction and success particularly with the recent developments in GPUs and TPUs (Tensor Processing Units), the increase in computing power and data in general, as well as the development of easy-to-use frameworks, like Keras and TensorFlow. We find Deep Learning in our everyday lives, e.g. in voice recognition, computer vision, recommender systems, reinforcement learning and many more.

The simplest type of ANN has only one node (also called a neuron) and is called a perceptron. Incoming data flows into this neuron, where a result is calculated, e.g. by summing up all incoming data. Each of the incoming data points is multiplied by a weight; weights can basically be any number and are used to modify the result that a neuron calculates: if we change a weight, the result changes too. Optionally, we can add a so-called bias to the data points to modify the result even further.
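As a sketch of what a single perceptron computes, a couple of lines of R suffice (the inputs, weights and bias below are arbitrary illustration values, not from the bootcamp material):

```r
# A single perceptron: the weighted sum of the inputs plus a bias term.
perceptron <- function(x, w, b) {
  sum(x * w) + b
}

x <- c(0.5, -1.2, 3.0)   # incoming data points
w <- c(0.8,  0.1, -0.4)  # one weight per input
b <- 0.25                # optional bias

perceptron(x, w, b)      # 0.5*0.8 - 1.2*0.1 + 3.0*(-0.4) + 0.25 = -0.67
```

Changing any weight changes the result, which is exactly the lever the learning process pulls on.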

But how do neural nets learn? Below, I will show with an example that uses common techniques and principles.


First, we will load all the packages we need:

  • tidyverse for data wrangling and plotting
  • readr for reading in a csv
  • h2o for Deep Learning (h2o.init initializes the cluster)
library(tidyverse)
library(readr)
library(h2o)
h2o.init(nthreads = -1)

## Connection successful!
##
## R is connected to the H2O cluster:
##     H2O cluster uptime:         3 hours 46 minutes
##     H2O cluster timezone:       Europe/Berlin
##     H2O data parsing timezone:  UTC
##     H2O cluster version:
##     H2O cluster version age:    1 month and 16 days
##     H2O cluster name:           H2O_started_from_R_shiringlander_jpa775
##     H2O cluster total nodes:    1
##     H2O cluster total memory:   3.16 GB
##     H2O cluster total cores:    8
##     H2O cluster allowed cores:  8
##     H2O cluster healthy:        TRUE
##     H2O Connection ip:          localhost
##     H2O Connection port:        54321
##     H2O Connection proxy:       NA
##     H2O Internal Security:      FALSE
##     H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4
##     R Version:                  R version 3.5.1 (2018-07-02)

Data

The dataset used in this example is a customer churn dataset from Kaggle.

Each row represents a customer; each column contains customer attributes.

We will load the data from a csv file:

telco_data <- read_csv("../../../Data/Churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   SeniorCitizen = col_integer(),
##   tenure = col_integer(),
##   MonthlyCharges = col_double(),
##   TotalCharges = col_double()
## )
## See spec(...) for full column specifications.

head(telco_data)

## # A tibble: 6 x 21
##   customerID gender SeniorCitizen Partner Dependents tenure PhoneService
##
## 1 7590-VHVEG Female             0 Yes     No              1 No
## 2 5575-GNVDE Male               0 No      No             34 Yes
## 3 3668-QPYBK Male               0 No      No              2 Yes
## 4 7795-CFOCW Male               0 No      No             45 No
## 5 9237-HQITU Female             0 No      No              2 Yes
## 6 9305-CDSKC Female             0 No      No              8 Yes
## # ... with 14 more variables: MultipleLines , InternetService ,
## #   OnlineSecurity , OnlineBackup , DeviceProtection ,
## #   TechSupport , StreamingTV , StreamingMovies ,
## #   Contract , PaperlessBilling , PaymentMethod ,
## #   MonthlyCharges , TotalCharges , Churn

Let’s quickly examine the data by plotting density distributions of numeric variables…

telco_data %>% select_if(is.numeric) %>% gather() %>% ggplot(aes(x = value)) + facet_wrap(~ key, scales = "free", ncol = 4) + geom_density() ## Warning: Removed 11 rows containing non-finite values (stat_density).

… and barcharts for categorical variables.

telco_data %>% select_if(is.character) %>% select(-customerID) %>% gather() %>% ggplot(aes(x = value)) + facet_wrap(~ key, scales = "free", ncol = 3) + geom_bar()

Before we can work with h2o, we need to convert our data into an h2o frame object. Note that I am also converting character columns to categorical columns, otherwise h2o will ignore them. Moreover, we will need our response variable to be in categorical format in order to perform classification on this data.

hf <- telco_data %>% mutate_if(is.character, as.factor) %>% as.h2o

Next, I’ll create a vector of the feature names I want to use for modeling (I am leaving out the customer ID because it doesn’t add useful information about customer churn).

hf_X <- colnames(telco_data)[2:20]
hf_X

##  [1] "gender"           "SeniorCitizen"    "Partner"
##  [4] "Dependents"       "tenure"           "PhoneService"
##  [7] "MultipleLines"    "InternetService"  "OnlineSecurity"
## [10] "OnlineBackup"     "DeviceProtection" "TechSupport"
## [13] "StreamingTV"      "StreamingMovies"  "Contract"
## [16] "PaperlessBilling" "PaymentMethod"    "MonthlyCharges"
## [19] "TotalCharges"

I am doing the same for the response variable:

hf_y <- colnames(telco_data)[21]
hf_y

## [1] "Churn"

Now we are ready to use the h2o.deeplearning function, which has a number of arguments and hyperparameters we can set to define our neural net and the learning process. The most essential arguments are

  • y: the name of the response variable
  • training_frame: the training data
  • x: the vector of features to use; this is optional (all remaining columns are used if we don't specify it), but since we don't want to include the customer ID in our model, we need to give this vector here.

The H2O documentation for the Deep Learning function gives an overview of all arguments and hyperparameters you can use. I will explain a few of the most important ones.

dl_model <- h2o.deeplearning(x = hf_X,
                             y = hf_y,
                             training_frame = hf)

Activation functions

Before, when describing the simple perceptron, I said that a result is calculated in a neuron, e.g. by summing up all the incoming data multiplied by weights. However, this approach has one big disadvantage: it would only enable our neural net to learn linear relationships between data. In order to be able to learn (you could also say approximate) any mathematical problem, no matter how complex, we use activation functions. Activation functions normalize the output of a neuron, e.g. to values between -1 and 1 (Tanh), 0 and 1 (Sigmoid), or by setting negative values to 0 (Rectified Linear Units, ReLU). In H2O we can choose between Tanh, Tanh with Dropout, Rectifier (default), Rectifier with Dropout, Maxout and Maxout with Dropout. Let's choose Rectifier with Dropout. Dropout is used to improve the generalizability of neural nets by randomly setting a given proportion of nodes to 0. In H2O, the dropout rate is specified with two arguments: hidden_dropout_ratios, which by default sets 50% of the hidden nodes (more on those in a minute) to 0, and input_dropout_ratio, which sets a dropout rate for the input layer; the latter is deactivated by default and that is how we will leave it. Here, I want to reduce the hidden dropout proportion to 20%, but let's talk about hidden layers and hidden nodes first.
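For reference, the activation functions mentioned above can be written in a few lines of plain R (these are the standard textbook formulas, not H2O's internal implementation):

```r
sigmoid <- function(x) 1 / (1 + exp(-x))  # squashes values to (0, 1)
relu    <- function(x) pmax(0, x)         # negative values become 0
# tanh() is built into base R and squashes values to (-1, 1)

x <- c(-2, 0, 2)
sigmoid(x)  # ~0.119 0.500 0.881
relu(x)     # 0 0 2
tanh(x)     # ~-0.964 0.000 0.964
```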

dl_model <- h2o.deeplearning(x = hf_X,
                             y = hf_y,
                             training_frame = hf,
                             activation = "RectifierWithDropout")

Hidden layers

In more complex neural nets, neurons are arranged in layers. The first layer is the input layer with our data that flows into the neural net. Then we have a number of hidden layers, and finally an output layer with the final prediction of our neural net. There are many different types and architectures of neural nets, like LSTMs, CNNs, GANs, etc. A simple architecture is the Multi-Layer Perceptron (MLP), in which every node is connected to all nodes in the preceding and the following layers; such layers are also called dense layers. We can train such an MLP with H2O by specifying the number of hidden layers and the number of nodes in each hidden layer with the hidden argument. Let's try out three hidden layers with 100, 80 and 100 nodes each. Now we can define the hidden_dropout_ratios for every hidden layer with 20% dropout.

dl_model <- h2o.deeplearning(x = hf_X,
                             y = hf_y,
                             training_frame = hf,
                             activation = "RectifierWithDropout",
                             hidden = c(100, 80, 100),
                             hidden_dropout_ratios = c(0.2, 0.2, 0.2))

Learning through optimisation

Just as in the simple perceptron from the beginning, our MLP has weights associated with every node. And just as in the simple perceptron, we want to find the optimal combination of weights in order to get a desired result from the output layer. The desired result in a supervised learning task is for the neural net to calculate predictions that are as close to reality as possible.

How that works has a lot to do with mathematical optimization and the following concepts:

  • difference between prediction and reality
  • loss functions
  • backpropagation
  • optimization methods, like gradient descent
Difference between prediction and reality

When our network calculates a prediction, the output nodes return a numeric value of any size before an activation function is applied; this value is called the score. Just as before, we now apply an activation function to this score in order to normalize it. For the output of a classification task we would like values between 0 and 1, which is why we most commonly use the softmax function; it converts the scores for every possible outcome/class into a probability distribution with values between 0 and 1 that sum to 1.
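A minimal softmax sketch (hand-rolled, not H2O's implementation; subtracting the maximum before exponentiating is a standard trick for numerical stability):

```r
softmax <- function(scores) {
  exps <- exp(scores - max(scores))  # shift scores for numerical stability
  exps / sum(exps)
}

softmax(c(2.0, 1.0, 0.1))
# ~0.659 0.242 0.099 -- values in (0, 1) that sum to 1,
# with the largest score receiving the largest probability
```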


In order to compare this probability distribution of predictions with the true outcome/class, we use a special format to encode our outcome: one-hot encoding. For every instance, we generate a vector with either 0 or 1 for every possible class: the true class of the instance gets a 1, all other classes get 0s. This one-hot-encoded vector now looks very similar to a probability distribution: it contains values between 0 and 1 and sums to 1. In fact, it is exactly the probability distribution our network would produce if it predicted the correct class with 100% certainty!
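A hand-rolled sketch of one-hot encoding for the two churn classes (H2O does this internally; the helper below is purely illustrative):

```r
# One-hot encode a single label against a fixed set of classes.
one_hot <- function(label, classes) {
  as.integer(classes == label)
}

classes <- c("No", "Yes")
one_hot("Yes", classes)       # 0 1
sum(one_hot("Yes", classes))  # 1 -- like a probability distribution
```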

We can now use this similarity between probability distribution and one-hot-encoded vector to calculate the difference between the two; this difference tells us how close to reality our network is: if the difference is small, it is close to reality, if the difference is big, it wasn’t very accurate in predicting the outcome. The goal of the learning process is now to find the combination of weights that make the difference between probability distribution and one-hot-encoded vector as small as possible.

One-Hot-Encoding will also be applied to categorical feature variables, because our neural nets need numeric values to learn from – strings and categories in their raw format are not useful per se. Fortunately, many machine learning packages – H2O being one of them – perform this one-hot-encoding automatically in the background for you. We could still change the categorical_encoding argument to “Enum”, “OneHotInternal”, “OneHotExplicit”, “Binary”, “Eigen”, “LabelEncoder”, “SortByResponse” and “EnumLimited” but we will leave the default setting of “AUTO”.


This minimization of the difference between prediction and reality for the entire training set is also called minimizing the loss function. There are many different loss functions (and in some cases you will even write your own specific loss function); in H2O we have "CrossEntropy", "Quadratic", "Huber", "Absolute" and "Quantile". In classification tasks, the loss function we want to minimize is usually cross-entropy.
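For a one-hot target, cross-entropy reduces to the negative log of the probability assigned to the true class; a small illustrative sketch:

```r
# Cross-entropy between a one-hot target and a predicted distribution.
cross_entropy <- function(y_true, y_pred) {
  -sum(y_true * log(y_pred))
}

y_true <- c(0, 1)                    # one-hot: the second class is the truth
cross_entropy(y_true, c(0.2, 0.8))   # ~0.223 -- confident and correct
cross_entropy(y_true, c(0.8, 0.2))   # ~1.609 -- confident and wrong
```

A confidently wrong prediction is punished much harder than a confidently correct one, which is exactly the pressure the optimizer responds to.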

dl_model <- h2o.deeplearning(x = hf_X,
                             y = hf_y,
                             training_frame = hf,
                             activation = "RectifierWithDropout",
                             hidden = c(100, 80, 100),
                             hidden_dropout_ratios = c(0.2, 0.2, 0.2),
                             loss = "CrossEntropy")

Backpropagation

With backpropagation, the calculated error (from the cross entropy in our case) will be propagated back through the network to calculate the proportion of error for every neuron. Based on this proportion of error, we get an error landscape for every neuron. This error landscape can be thought of as a hilly landscape in the Alps, for example. The positions in this landscape are different weights, so that we get weights with high error at the peaks and weights with low error in valleys. In order to minimize the error, we want to find the position in this landscape (i.e. the weight) that is in the deepest valley.

Gradient descent

Let’s imagine we were a hiker, who is left at a random place in this landscape – while being blindfolded – and we are tasked with finding our way to this valley. We would start by feeling around and looking for the direction with the steepest downward slope. This is also what our neural net does, just that this “feeling around” is called “calculating the gradient”. And just as we would then make a step in that direction with the steepest downwards slope, our neural net makes a step in the direction with the steepest gradient. This is called gradient descent. This procedure will be repeated until we find a place, where we can’t go any further down. In our neural net, this number of repeated rounds is called the number of epochs and we define it in H2O with the epochs argument.
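The hiker analogy can be made concrete with gradient descent on a one-dimensional toy loss (the quadratic loss, starting point and learning rate here are arbitrary choices for illustration):

```r
# Minimise the toy loss f(w) = (w - 3)^2 by stepping against its gradient.
grad <- function(w) 2 * (w - 3)  # derivative of the toy loss

w <- 0        # starting position of our "hiker"
rate <- 0.1   # step size (learning rate)
for (epoch in 1:100) {
  w <- w - rate * grad(w)  # step in the direction of steepest descent
}
w  # very close to 3, the minimum of the loss
```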

dl_model <- h2o.deeplearning(x = hf_X,
                             y = hf_y,
                             training_frame = hf,
                             activation = "RectifierWithDropout",
                             hidden = c(100, 80, 100),
                             hidden_dropout_ratios = c(0.2, 0.2, 0.2),
                             loss = "CrossEntropy",
                             epochs = 200)

Adaptive learning rate and momentum

One common problem with this simple approach is that we might end up in a place where there is no direction to go further down, even though the deepest valley in the entire landscape is somewhere else. In our neural net, this is called getting stuck in local minima or on saddle points. In order to be able to overcome these points and find the global minimum, several advanced techniques have been developed in recent years.

One of them is the adaptive learning rate. The learning rate can be thought of as the step size of our hiker; in H2O it is defined with the rate argument. With an adaptive learning rate, we can e.g. start out with a big learning rate and reduce it the closer we get to the end of our model training run. If you want an adaptive learning rate in your neural net, you would use the following arguments in H2O: adaptive_rate = TRUE, rho (rate of learning rate decay) and epsilon (a smoothing factor).

In this example, however, I am not using an adaptive learning rate but something called momentum. If you imagine a ball being pushed from some point in the landscape, it gains momentum that propels it in a general direction and keeps it from making big jumps in opposing directions. This principle is applied with momentum in neural nets. In our H2O function we define momentum_start (momentum at the beginning of training), momentum_ramp (number of training samples over which momentum increases) and momentum_stable (final momentum).

The Nesterov accelerated gradient is an addition to momentum in which the next step is guessed before taking it. In this way, the previous error will influence the direction of the next step.
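A sketch of the plain momentum update on the same kind of one-dimensional toy loss (the coefficients are illustrative; Nesterov's variant would evaluate the gradient at the look-ahead position w + momentum * v instead):

```r
grad <- function(w) 2 * (w - 3)  # derivative of the toy loss (w - 3)^2

w <- 0; v <- 0                   # position and velocity
rate <- 0.05; momentum <- 0.9
for (epoch in 1:200) {
  v <- momentum * v - rate * grad(w)  # velocity accumulates past gradients
  w <- w + v                          # move by the accumulated velocity
}
w  # converges to the minimum at 3
```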

dl_model <- h2o.deeplearning(x = hf_X,
                             y = hf_y,
                             training_frame = hf,
                             activation = "RectifierWithDropout",
                             hidden = c(100, 80, 100),
                             hidden_dropout_ratios = c(0.2, 0.2, 0.2),
                             loss = "CrossEntropy",
                             epochs = 200,
                             rate = 0.005,
                             adaptive_rate = FALSE,
                             momentum_start = 0.5,
                             momentum_ramp = 100,
                             momentum_stable = 0.99,
                             nesterov_accelerated_gradient = TRUE)

L1 and L2 Regularization

In addition to dropout, we can apply regularization to improve the generalizability of our neural network. In H2O, we can use L1 and L2 regularization. With L1, many weights are pushed to exactly 0, while L2 shrinks weights towards small values.

dl_model <- h2o.deeplearning(x = hf_X,
                             y = hf_y,
                             training_frame = hf,
                             activation = "RectifierWithDropout",
                             hidden = c(100, 80, 100),
                             hidden_dropout_ratios = c(0.2, 0.2, 0.2),
                             loss = "CrossEntropy",
                             epochs = 200,
                             rate = 0.005,
                             adaptive_rate = FALSE,
                             momentum_start = 0.5,
                             momentum_ramp = 100,
                             momentum_stable = 0.99,
                             nesterov_accelerated_gradient = TRUE,
                             l1 = 0,
                             l2 = 0)

Crossvalidation

Now you know the main principles of how neural nets learn!

Before we finally train our model, we want a way to judge how well the model learned and whether learning improves over the epochs. Here, we want to use validation data to make these performance measures less biased compared to using the training data. In H2O, we can either provide a specific validation set with validation_frame or use cross-validation. The argument nfolds specifies how many folds to generate, and with fold_assignment = "Stratified" we tell H2O to keep the class ratios of our response variable the same in all folds. We also specify to save the cross-validation predictions for examination.

dl_model <- h2o.deeplearning(x = hf_X,
                             y = hf_y,
                             training_frame = hf,
                             activation = "RectifierWithDropout",
                             hidden = c(100, 80, 100),
                             hidden_dropout_ratios = c(0.2, 0.2, 0.2),
                             loss = "CrossEntropy",
                             epochs = 200,
                             rate = 0.005,
                             adaptive_rate = FALSE,
                             momentum_start = 0.5,
                             momentum_ramp = 100,
                             momentum_stable = 0.99,
                             nesterov_accelerated_gradient = TRUE,
                             l1 = 0,
                             l2 = 0,
                             nfolds = 3,
                             fold_assignment = "Stratified",
                             keep_cross_validation_predictions = TRUE,
                             balance_classes = TRUE,
                             seed = 42)

Finally, we are balancing the class proportions and we set a seed for pseudo-number-generation – and we train our model!

The resulting model object can be used just like any other H2O model: for predicting new/test data, for calculating performance metrics, saving it, or plotting the training results:


The crossvalidation predictions can be accessed with h2o.cross_validation_predictions() (returns one data frame for every fold) and h2o.cross_validation_holdout_predictions() (combined in one data frame).

h2o.cross_validation_predictions(dl_model)

## [[1]]
##   predict        No        Yes
## 1      No 0.0000000 0.00000000
## 2      No 0.9784236 0.02157642
## 3      No 0.0000000 0.00000000
## 4      No 0.9808329 0.01916711
## 5     Yes 0.1512552 0.84874476
## 6     Yes 0.1590060 0.84099396
##
## [7043 rows x 3 columns]
##
## [[2]]
##   predict        No        Yes
## 1     Yes 0.2484017 0.7515983
## 2      No 0.0000000 0.0000000
## 3      No 0.8107972 0.1892028
## 4      No 0.0000000 0.0000000
## 5      No 0.0000000 0.0000000
## 6      No 0.0000000 0.0000000
##
## [7043 rows x 3 columns]
##
## [[3]]
##   predict No Yes
## 1      No  0   0
## 2      No  0   0
## 3      No  0   0
## 4      No  0   0
## 5      No  0   0
## 6      No  0   0
##
## [7043 rows x 3 columns]

h2o.cross_validation_holdout_predictions(dl_model)

##   predict        No        Yes
## 1     Yes 0.2484017 0.75159829
## 2      No 0.9784236 0.02157642
## 3      No 0.8107972 0.18920283
## 4      No 0.9808329 0.01916711
## 5     Yes 0.1512552 0.84874476
## 6     Yes 0.1590060 0.84099396
##
## [7043 rows x 3 columns]

Video


sessionInfo()

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS 10.14
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
##  [1] bindrcpp_0.2.2  h2o_3.20.0.8    forcats_0.3.0   stringr_1.3.1
##  [5] dplyr_0.7.7     purrr_0.2.5     readr_1.1.1     tidyr_0.8.1
##  [9] tibble_1.4.2    ggplot2_3.1.0   tidyverse_1.2.1
##
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5 xfun_0.4         haven_1.1.2      lattice_0.20-35
##  [5] colorspace_1.3-2 htmltools_0.3.6  yaml_2.2.0       utf8_1.1.4
##  [9] rlang_0.3.0.1    pillar_1.3.0     glue_1.3.0       withr_2.1.2
## [13] modelr_0.1.2     readxl_1.1.0     bindr_0.1.1      plyr_1.8.4
## [17] munsell_0.5.0    blogdown_0.9     gtable_0.2.0     cellranger_1.1.0
## [21] rvest_0.3.2      evaluate_0.12    labeling_0.3     knitr_1.20
## [25] fansi_0.4.0      broom_0.5.0      Rcpp_0.12.19     scales_1.0.0
## [29] backports_1.1.2  jsonlite_1.5     hms_0.4.2        digest_0.6.18
## [33] stringi_1.2.4    bookdown_0.7     grid_3.5.1       rprojroot_1.3-2
## [37] bitops_1.0-6     cli_1.0.1        tools_3.5.1      magrittr_1.5
## [41] lazyeval_0.2.1   RCurl_1.95-4.11  crayon_1.3.4     pkgconfig_2.0.2
## [45] xml2_1.2.0       lubridate_1.7.4  assertthat_0.2.0 rmarkdown_1.10
## [49] httr_1.3.1       rstudioapi_0.8   R6_2.3.0         nlme_3.1-137
## [53] compiler_3.5.1

To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.

Causal mediation estimation measures the unobservable

Tue, 11/06/2018 - 01:00

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

I put together a series of demos for a group of epidemiology students who are studying causal mediation analysis. Since mediation analysis is not always so clear or intuitive, I thought, of course, that going through some examples of simulating data for this process could clarify things a bit.

Quite often we are interested in understanding the effect of an exposure or intervention on an outcome. Does exposure \(A\) (which could be randomized or not) have an effect on outcome \(Y\)?

But sometimes we are interested in understanding more than whether or not \(A\) causes or influences \(Y\); we might want to have some insight into the mechanisms underlying that influence. And this is where mediation analysis can be useful. (If you want to delve deeply into the topic, I recommend you check out this book by Tyler VanderWeele, or this nice website developed at Columbia University.)

In the example here, I am using the simplest possible scenario of an exposure \(A\), a mediator \(M\), and an outcome \(Y\), without any confounding:

A key challenge of understanding and conducting a mediation analysis is how we should quantify this concept of mediation. Sure, \(A\) has an effect on \(M\), which in turn has an effect on \(Y\), and \(A\) also may have an effect on \(Y\) through other pathways. But how can we make sense of all of this? One approach, which is a relatively recent development, is to use the potential outcome framework of causal inference to define the various estimands (or quantities) that arise in a mediation analysis. (I draw on a paper by Imai, Keele and Yamamoto for the terminology, as there is not complete agreement on what to call various quantities. The estimation methods and software used here are also described in the paper.)

Defining the potential outcomes

In an earlier post, I described the concept of potential outcomes. I extend that a bit here to define the quantities we are interested in. In this case, we have two effects of the possible exposure: \(M\) and \(Y\). Under this framework, each individual has a potential outcome for each level of \(A\) (I am assuming \(A\) is binary). So, for the mediator, \(M_{i0}\) and \(M_{i1}\) are the values of \(M\) we would observe for individual \(i\) without and with exposure, respectively. That is pretty straightforward. (From here on out, I will remove the subscript \(i\), because it gets a little unwieldy.)

The potential outcomes under \(Y\) are less intuitive, as there are four of them. First, there is \(Y_{0M_0}\), which is the potential outcome of \(Y\) without exposure for \(A\) and whatever the potential outcome for \(M\) is without exposure for \(A\). This is what we observe when \(A=0\) for an individual. \(Y_{1M_1}\) is the potential outcome of \(Y\) with exposure for \(A\) and whatever the potential outcome for \(M\) is with exposure for \(A\). This is what we observe when \(A=1\) for an individual. That’s all fine.

But we also have \(Y_{0M_1}\), which can never be observed unless we can intervene on the mediator \(M\) somehow. This is the potential outcome of \(Y\) without exposure for \(A\) and whatever the mediator would have been had the individual been exposed. This potential outcome is controversial, because it is defined across two different universes of exposure to \(A\). Finally, there is \(Y_{1M_0}\). It is analogously defined across two universes, but in reverse.

Defining the causal mediation effects and direct effects

The estimands, or quantities, that we are interested in are defined in terms of the potential outcomes. The causal mediation effects for an individual are

$$
\begin{aligned}
CME_0 &= Y_{0M_1} - Y_{0M_0} \\
CME_1 &= Y_{1M_1} - Y_{1M_0},
\end{aligned}
$$

and the causal direct effects are

$$
\begin{aligned}
CDE_0 &= Y_{1M_0} - Y_{0M_0} \\
CDE_1 &= Y_{1M_1} - Y_{0M_1}.
\end{aligned}
$$

A few important points. (1) Since we are in the world of potential outcomes, we do not observe these quantities for everyone. In fact, we don’t observe these quantities for anyone, since some of the measures are across two universes. (2) The two causal mediation effects do not need to be the same. The same goes for the two causal direct effects. (3) Under a set of pretty strong assumptions related to unmeasured confounding, independence, and consistency (see Imai et al for the details), the average causal mediation effects and average causal direct effects can be estimated using observed data only. Before I simulate some data to demonstrate all of this, here is the definition of the total causal effect (and its decomposition into mediation and direct effects):

$$
\begin{aligned}
TCE &= Y_{1M_1} - Y_{0M_0} \\
&= CME_1 + CDE_0 \\
&= CME_0 + CDE_1
\end{aligned}
$$
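To make the decomposition concrete, here is a quick numeric check using made-up potential outcomes for a single hypothetical individual (the values are purely illustrative, not from the simulation below):

```r
# Toy potential outcomes for one hypothetical individual (illustrative values)
Y0M0 <- 2; Y0M1 <- 4; Y1M0 <- 8; Y1M1 <- 13

CME0 <- Y0M1 - Y0M0   # mediation effect without exposure: 2
CME1 <- Y1M1 - Y1M0   # mediation effect with exposure: 5
CDE0 <- Y1M0 - Y0M0   # direct effect, holding M at its unexposed value: 6
CDE1 <- Y1M1 - Y0M1   # direct effect, holding M at its exposed value: 9

TCE <- Y1M1 - Y0M0    # total causal effect: 11

# Both decompositions recover the total effect
stopifnot(TCE == CME1 + CDE0, TCE == CME0 + CDE1)
```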

Generating the data

I’m using the simstudy package to generate the data. I’ll start by generating the binary potential outcomes for the mediator, \(M_0\) and \(M_1\), which are correlated in this example. \(P(M_1=1) > P(M_0=1)\), implying that exposure to \(A\) does indeed have an effect on \(M\). Note that it is possible that for an individual \(M_0 = 1\) and \(M_1 = 0\), so that exposure to \(A\) has an effect contrary to what we see in the population generally. (We don’t need to make this assumption in the data generation process; we could force \(M_1\) to be 1 if \(M_0\) is 1.)

set.seed(3872672)
dd <- genCorGen(n = 5000, nvars = 2, params1 = c(.2, .6), dist = "binary",
                rho = .3, corstr = "cs", wide = TRUE, cnames = c("M0", "M1"))

Observe treatment:

dd <- trtObserve(dd, 0.6, grpName = "A")

Initial data set:

dd

## id A M0 M1
## 1: 1 0 0 1
## 2: 2 1 0 0
## 3: 3 1 0 1
## 4: 4 0 1 1
## 5: 5 1 0 0
## ---
## 4996: 4996 1 0 1
## 4997: 4997 0 0 0
## 4998: 4998 1 1 1
## 4999: 4999 1 1 0
## 5000: 5000 0 0 0

\(Y_{0M_0}\) is a function of \(M_0\) and some noise \(e_0\), and \(Y_{0M_1}\) is a function of \(M_1\) and the same noise (this is not a requirement). However, if \(M_0 = M_1\) (i.e., the mediator is not affected by exposure status), then I am setting \(Y_{0M_1} = Y_{0M_0}\). In this case, \(CME_0\) for an individual is \(2(M_1 - M_0)\), so \(CME_0 \in \{-2, 0, 2\}\), and the population average \(CME_0\) will depend on the mixture of potential outcomes \(M_0\) and \(M_1\).

def <- defDataAdd(varname = "e0", formula = 0, variance = 1, dist = "normal")
def <- defDataAdd(def, varname = "Y0M0", formula = "2 + M0*2 + e0", dist = "nonrandom")
def <- defDataAdd(def, varname = "Y0M1", formula = "2 + M1*2 + e0", variance = 1, dist = "nonrandom")

The same logic holds for \(Y_{1M_0}\) and \(Y_{1M_1}\), though at the individual level \(CME_1 \in \{-5, 0, 5\}\):

def <- defDataAdd(def, varname = "e1", formula = 0, variance = 1, dist = "normal")
def <- defDataAdd(def, varname = "Y1M0", formula = "8 + M0*5 + e1", dist = "nonrandom")
def <- defDataAdd(def, varname = "Y1M1", formula = "8 + M1*5 + e1", dist = "nonrandom")

The observed mediator (\(M\)) and outcome (\(Y\)) are determined by the observed exposure (\(A\)).

def <- defDataAdd(def, varname = "M", formula = "(A==0) * M0 + (A==1) * M1", dist = "nonrandom")
def <- defDataAdd(def, varname = "Y", formula = "(A==0) * Y0M0 + (A==1) * Y1M1", dist = "nonrandom")

Here is the entire data definitions table:

varname  formula                        variance  dist       link
e0       0                              1         normal     identity
Y0M0     2 + M0*2 + e0                  0         nonrandom  identity
Y0M1     2 + M1*2 + e0                  1         nonrandom  identity
e1       0                              1         normal     identity
Y1M0     8 + M0*5 + e1                  0         nonrandom  identity
Y1M1     8 + M1*5 + e1                  0         nonrandom  identity
M        (A==0) * M0 + (A==1) * M1      0         nonrandom  identity
Y        (A==0) * Y0M0 + (A==1) * Y1M1  0         nonrandom  identity

Based on the parameters used to generate the data, we can calculate the expected causal mediation effects:

$$
\begin{aligned}
E[CME_0] &= E[2 + 2M_1 + e_0] - E[2 + 2M_0 + e_0] \\
&= E[2(M_1 - M_0)] \\
&= 2(E[M_1] - E[M_0]) \\
&= 2(0.6 - 0.2) \\
&= 0.8
\end{aligned}
$$

$$
\begin{aligned}
E[CME_1] &= E[8 + 5M_1 + e_1] - E[8 + 5M_0 + e_1] \\
&= E[5(M_1 - M_0)] \\
&= 5(E[M_1] - E[M_0]) \\
&= 5(0.6 - 0.2) \\
&= 2.0
\end{aligned}
$$

Likewise, the expected values of the causal direct effects can be calculated:

$$
\begin{aligned}
E[CDE_0] &= E[8 + 5M_0 + e_1] - E[2 + 2M_0 + e_0] \\
&= E[6 + 5M_0 - 2M_0] \\
&= 6 + 3E[M_0] \\
&= 6 + 3 \times 0.2 \\
&= 6.6
\end{aligned}
$$

$$
\begin{aligned}
E[CDE_1] &= E[8 + 5M_1 + e_1] - E[2 + 2M_1 + e_0] \\
&= E[6 + 5M_1 - 2M_1] \\
&= 6 + 3E[M_1] \\
&= 6 + 3 \times 0.6 \\
&= 7.8
\end{aligned}
$$

Finally, the expected total causal effect is:

$$
\begin{aligned}
ATCE &= E[CDE_0] + E[CME_1] = 6.6 + 2.0 \\
&= E[CDE_1] + E[CME_0] = 7.8 + 0.8 \\
&= 8.6
\end{aligned}
$$
And now, the complete data set can be generated.

dd <- addColumns(def, dd)
dd <- delColumns(dd, c("e0", "e1"))  # these are not needed
dd

## id A M0 M1 Y0M0 Y0M1 Y1M0 Y1M1 M Y
## 1: 1 0 0 1 0.933 2.93 7.58 12.58 0 0.933
## 2: 2 1 0 0 2.314 2.31 6.84 6.84 0 6.841
## 3: 3 1 0 1 3.876 5.88 9.05 14.05 1 14.053
## 4: 4 0 1 1 5.614 5.61 12.04 12.04 1 5.614
## 5: 5 1 0 0 1.469 1.47 8.81 8.81 0 8.809
## ---
## 4996: 4996 1 0 1 2.093 4.09 8.82 13.82 1 13.818
## 4997: 4997 0 0 0 1.734 1.73 7.28 7.28 0 1.734
## 4998: 4998 1 1 1 3.256 3.26 12.49 12.49 1 12.489
## 4999: 4999 1 1 0 5.149 3.15 12.57 7.57 0 7.572
## 5000: 5000 0 0 0 1.959 1.96 5.23 5.23 0 1.959

Looking at the “observed” potential outcomes

The advantage of simulating data is that we can see what the average causal effects are based on the potential outcomes. Here are the average potential outcomes in the generated data set:

dd[, .(Y0M0 = mean(Y0M0), Y0M1 = mean(Y0M1),
       Y1M0 = mean(Y1M0), Y1M1 = mean(Y1M1))]

## Y0M0 Y0M1 Y1M0 Y1M1
## 1: 2.39 3.2 8.99 11

The four average causal effects based on the data are quite close to the expected values:

dd[, .(ACME0 = mean(Y0M1 - Y0M0), ACME1 = mean(Y1M1 - Y1M0),
       ACDE0 = mean(Y1M0 - Y0M0), ACDE1 = mean(Y1M1 - Y0M1))]

## ACME0 ACME1 ACDE0 ACDE1
## 1: 0.81 2.03 6.6 7.81

And here is the average total causal effect from the data set:

dd[, mean(Y1M1 - Y0M0)]

## [1] 8.62

All of these quantities can be visualized in this figure. The lengths of the solid vertical lines are the mediated effects. The lengths of the dotted vertical lines are the direct effects. And the sums of these vertical lines (by color) each represent the total effect:

Estimated causal mediation effect from observed data

Clearly, the real interest is in estimating the causal effects from data that we can actually observe. And that, of course, is where things start to get challenging. I will not go into the important details here (again, Imai et al provide these), but here are formulas that have been derived to estimate the effects (simplified since there are no confounders in this example) and the calculations using the observed data:

$$\hat{CME_0} = \sum_{m \in \{0,1\}} E[Y | A=0, M=m]\left[P(M=m | A=1) - P(M=m | A=0)\right]$$

# Estimate CME0
dd[M == 0 & A == 0, mean(Y)] *
  (dd[A == 1, mean(M == 0)] - dd[A == 0, mean(M == 0)]) +
dd[M == 1 & A == 0, mean(Y)] *
  (dd[A == 1, mean(M == 1)] - dd[A == 0, mean(M == 1)])

## [1] 0.805

$$\hat{CME_1} = \sum_{m \in \{0,1\}} E[Y | A=1, M=m]\left[P(M=m | A=1) - P(M=m | A=0)\right]$$

# Estimate CME1
dd[M == 0 & A == 1, mean(Y)] *
  (dd[A == 1, mean(M == 0)] - dd[A == 0, mean(M == 0)]) +
dd[M == 1 & A == 1, mean(Y)] *
  (dd[A == 1, mean(M == 1)] - dd[A == 0, mean(M == 1)])

## [1] 2

$$\hat{CDE_0} = \sum_{m \in \{0,1\}} \left(E[Y | A=1, M=m] - E[Y | A=0, M=m]\right) P(M=m | A=0)$$

# Estimate CDE0
(dd[M == 0 & A == 1, mean(Y)] - dd[M == 0 & A == 0, mean(Y)]) *
  dd[A == 0, mean(M == 0)] +
(dd[M == 1 & A == 1, mean(Y)] - dd[M == 1 & A == 0, mean(Y)]) *
  dd[A == 0, mean(M == 1)]

## [1] 6.56

$$\hat{CDE_1} = \sum_{m \in \{0,1\}} \left(E[Y | A=1, M=m] - E[Y | A=0, M=m]\right) P(M=m | A=1)$$

# Estimate CDE1
(dd[M == 0 & A == 1, mean(Y)] - dd[M == 0 & A == 0, mean(Y)]) *
  dd[A == 1, mean(M == 0)] +
(dd[M == 1 & A == 1, mean(Y)] - dd[M == 1 & A == 0, mean(Y)]) *
  dd[A == 1, mean(M == 1)]

## [1] 7.76

Estimation with mediation package

Fortunately, there is software available to provide these estimates (and more importantly measures of uncertainty). In R, one such package is mediation, which is available on CRAN. This package implements the formulas derived in the Imai et al paper.

Not surprisingly, the model estimates are in line with expected values, true underlying effects, and the previous estimates conducted by hand:

library(mediation)

# Note: the model object names were dropped in the original code listing;
# med.fit and out.fit are reconstructed here following the mediation package's
# standard convention
med.fit <- glm(M ~ A, data = dd, family = binomial("logit"))
out.fit <- lm(Y ~ M * A, data = dd)

med.out <- mediate(med.fit, out.fit, treat = "A", mediator = "M",
                   robustSE = TRUE, sims = 1000)
summary(med.out)

##
## Causal Mediation Analysis
##
## Quasi-Bayesian Confidence Intervals
##
## Estimate 95% CI Lower 95% CI Upper p-value
## ACME (control) 0.8039 0.7346 0.88 <2e-16 ***
## ACME (treated) 2.0033 1.8459 2.16 <2e-16 ***
## ADE (control) 6.5569 6.4669 6.65 <2e-16 ***
## ADE (treated) 7.7563 7.6555 7.86 <2e-16 ***
## Total Effect 8.5602 8.4317 8.69 <2e-16 ***
## Prop. Mediated (control) 0.0940 0.0862 0.10 <2e-16 ***
## Prop. Mediated (treated) 0.2341 0.2179 0.25 <2e-16 ***
## ACME (average) 1.4036 1.2917 1.52 <2e-16 ***
## ADE (average) 7.1566 7.0776 7.24 <2e-16 ***
## Prop. Mediated (average) 0.1640 0.1524 0.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Sample Size Used: 5000
##
##
## Simulations: 1000

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Tesseract 4 is here! State of the art OCR in R!

Tue, 11/06/2018 - 01:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Last week Google and friends released the new major version of their OCR system: Tesseract 4. This release builds upon 2+ years of hard work and has completely overhauled the internal OCR engine. From the tesseract wiki:

Tesseract 4.0 includes a new neural network-based recognition engine that delivers significantly higher accuracy (on document images) than the previous versions, in return for a significant increase in required compute power. On complex languages however, it may actually be faster than base Tesseract.

We have now also updated the R package tesseract to ship with the new Tesseract 4 on MacOS and Windows. It uses the new engine by default, and the results are extremely impressive! Recognition is much more accurate than before, even without manually enhancing the image quality.


The binary R package for Windows and MacOS can be installed directly from CRAN:

install.packages("tesseract")
Tesseract 4 uses a new training data format, so if you had previously installed custom training data you might need to redownload these as well:

# If you want to OCR french text:
library(tesseract)
tesseract_download('fra')

Training data for eng (default) is included with the R package already.
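For reference, a minimal call to the engine looks something like this (a sketch only: the file name is a placeholder, and the default English engine is assumed to be installed):

```r
library(tesseract)

# Load the default English engine and OCR an image file
# ("my_scan.png" is a placeholder path, not a file from this post)
eng <- tesseract("eng")
text <- ocr("my_scan.png", engine = eng)
cat(text)
```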


Let’s take a simple example from last month’s blog post about ocr’ing bird drawings from the natural history collection. We use the magick package to preprocess the image (crop the area of interest). The image_ocr() function is a magick wrapper for tesseract::ocr().

library(magick)

image_read("") %>%
  image_crop('1200x300+100+1700') %>%
  image_ocr() %>%
  cat()

H Grenveld. del Watherby & 0° CLIMACTERIS PICUMNUS. (BROWN TREE CREEPER).

Tesseract has perfectly detected the hand-written species name (both the Latin and English names), and has also found and nearly perfectly predicted the tiny author names. These results would be a very good basis for post-processing and automatic classification. For example, we could match these results against known species and authors, as explained in the original blog post.


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

NG "roll returns" – inflection point?

Mon, 11/05/2018 - 22:51

(This article was first published on Commodity Stat Arb, and kindly contributed to R-bloggers)

After more than a decade of consistent losses from rolling a long NG position, the cumulative return has been positive for the past 12 months.  This has only occurred briefly during the ‘polar vortex’ of early 2014 and during the 2008 commodities ‘super cycle’ peak.

For years, long only positions in NG have experienced losses on average in the last month before rolling the position.  The traditional explanation for this has been an oversupplied market relative to storage capacity.  Producers sell into the short term market; storage owners earn a ‘negative convenience yield’ for storing this into the future.
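As a rough illustration of the mechanics (my own toy numbers, not the author's data): in a contango market the expiring contract trades below the next one, so each roll locks in a small loss.

```r
# Toy prices at three roll dates (illustrative only)
f_near <- c(3.00, 2.95, 3.05)  # expiring front-month contract
f_next <- c(3.10, 3.02, 3.01)  # next contract we must roll into

# Per-roll "roll yield": negative in contango, positive in backwardation
roll_yield <- (f_near - f_next) / f_next

# Cumulative return attributable to rolling alone
cumulative <- cumprod(1 + roll_yield) - 1
round(roll_yield, 4)
```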

Fracking costs have increased, partly driven by rising input costs (see the fracking PPI) and rising interest rates. Industrial demand has recovered, according to the EIA.

It seems likely that the natural gas market is in transition towards an equilibrium in which there is no longer an excess of supply being stored into the future relative to demand.

Could this winter be the first test of this phenomenon?  Will we see significant backwardation in the curve as convenience yield accrues to holders of the physical commodity?


To leave a comment for the author, please follow the link and comment on their blog: Commodity Stat Arb.

EARL Houston: Interview with Hadley Wickham

Mon, 11/05/2018 - 18:51

(This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers)

Can you tell us about your upcoming keynote at EARL and what the key take-home messages will be for delegates?

I’m going to talk about functional programming which I think is one of the most important programming techniques used with R. It’s not something you need on day 1 as a data scientist but it gives you some really powerful tools for repeating the same action again and again with code. It takes a little while to get your head around it but recently, because I’ve been working on the second edition of Advanced R, I’ve prepared some diagrams that make it easier to understand. So the take-home message will be to use more functional programming because it will make your life easier!
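As a flavour of what he means (my example, not one from the talk): with purrr, the same model fit can be repeated across groups with a single map() call rather than a hand-written loop.

```r
library(purrr)

# Fit the same model to each subset of a data frame, then pull out one
# summary statistic per fit -- no copy-and-paste loop required
fits <- map(split(mtcars, mtcars$cyl), ~ lm(mpg ~ wt, data = .x))
r2 <- map_dbl(fits, ~ summary(.x)$r.squared)
r2
```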

Writer, developer, analyst, educator or speaker – what do you enjoy most? Why?

The two things that motivate me most in the work that I do are the intellectual joy of understanding how you can take a big problem and break it up into small pieces that can be combined together in different ways – for example, the algebras and grammars of tools like dplyr and ggplot2. I’ve been working a lot in Advanced R to understand how the bits and pieces of R fit together, which I find really enjoyable. The other thing that I really enjoy is hearing from people who have done cool stuff with the things that I’ve worked on which has made their life easier – whether that’s on Twitter or in person. Those are the two things that I really enjoy, and pretty much everything else comes out of that. Educating, for example, is just helping other people understand how I’ve broken down a problem and sharing in ways that they can understand too.

What is your preferred industry or sector for data, analysis and applying the tools that you have developed?

I don’t do much data analysis myself anymore so when I do it, it’s normally data related to me in some way or, for example, data on RStudio packages. I do enjoy challenges like figuring out how to get data from a web API and turning it into something useful but the domain for my analysis is very broadly on data science topics.

When developing what are your go-to resources?

I still use StackOverflow quite a bit and Google in general. I also do quite a bit of comparative reading to understand different programming languages, seeing what’s going on across different languages, the techniques being used and learning about the evolving practices in other languages which is very helpful.

Is there anything that has surprised you about how any of the tools you’ve created has been used by others?

I used to be, but I’ve lost my capacity to be surprised now just because the diversity of uses is crazy. I guess the most notable cases now are when someone uses any of my tools to commit academic fraud (there have been some well-publicised examples). Otherwise, people are using R and data to understand pretty much every aspect of the world, which is really neat.

What are the biggest changes that you see between data science now and when you started?

I think the biggest difference is that there’s a term for it – data science. I think it’s been useful to have that term rather than just Applied Statistician or Data Analyst because I think Data Science is becoming different to what these roles have been traditionally. It’s different from data analysis because data science uses programming heavily, and it’s different from statistics since there’s a much greater emphasis on correct data import and data engineering, and the goal may be to eventually turn the data analysis into a product, web app or something other than a standard report.

Where do you currently perceive the biggest bottlenecks in data science to be?

I think there are still a lot of bottlenecks in getting high-quality data and that’s what most people currently struggle with. I think another bottleneck is how to help people learn about all the great tools that are available, understand what their options are, where all the tools are and what they should be learning. I think there are still plenty of smaller things to improve with data manipulation, data visualization and tidying but by and large it feels to me like all the big pieces are there. Now it’s more about getting everything polished and working together really well. But still, getting data to a place to even start an analysis can be really frustrating so a major bottleneck is the whole pipeline that occurs before arriving in R.

What topic would you like to be presenting on in a data science conference a year from now?

I think one thing I’m going to be talking more about next year is this vctrs package that I’ve been working on. The package provides tools for handling object types in R and managing the types of inputs and outputs that a function expects and produces. My motivation for this is partly because there are a lot of inconsistencies in the tidyverse and base R that vctrs aims to fix. I think of this as part of my mental model because when I read R code, there’s a simplified R interpreter in my head which mainly focuses on the types of objects and predicts whether some code is going to work at all or if it’s going to fail. So part of the motivation behind this package is me thinking about how to get stuff out of my head and into the heads of other people so they can write well-functioning and predictable R code.

What do you hope to see out of RStudio in the next 5 years?

Generally, I want RStudio to be continually strengthening the connections between R and other programming languages. There’s an RStudio version 1.2 coming out which has a bunch of features to make it easy to use SQL, Stan and Python from RStudio. Also, the collaborative work we do with Wes McKinney and Ursa Labs – I think we’re just going to see more and more of that because data scientists are working in bigger and bigger teams on a fundamentally collaborative activity so making it as easy as possible to get data in and out of R is a big win for everyone.

I’m also excited to see the work that Max Kuhn has been doing on tidy modelling. I think the idea is really appealing because it gives modelling an API that is very similar to the tidyverse. But I think the thing that’s really neat about this work is that it takes inspiration from dplyr to separate the expression of a model from its computation so you can express the model once and fit it in R, Spark, tensorflow, Stan or whatever. The R language is really well suited to exploit this type of tool where the computation is easily described in R and executed somewhere that is more suitable for high performance.



To leave a comment for the author, please follow the link and comment on their blog: RBlog – Mango Solutions.

Peter Bull discusses the importance of human-centered design in data science.

Mon, 11/05/2018 - 18:00

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Hugo Bowne-Anderson, the host of DataFramed, the DataCamp podcast, recently interviewed Peter Bull, a data scientist for social good and co-founder of Driven Data.

Here is the podcast link.

Introducing Peter Bull

Hugo: Hi there, Peter, and welcome to Data Framed.

Peter: Thanks, Hugo. I’m happy to be here.

Hugo: I’m happy to have you here. I’m really excited to be talking about human centered design and data science, the role of design in data science, and what data science can tell us about human centered design as well. Before we get into this, I want to find out a bit about you. What are you known for in the data community?

Peter: Primarily I’m known for my work at Driven Data. Driven Data is an organization that runs machine learning competitions that have a social impact. We work with non-profits, NGOs, government groups to help them figure out a way that machine learning can help them to be more effective. Then we put that online so that our global community of data scientists can come up with the best solution to this particular problem. Then after the competition, we help the organization to use that going forward.

Peter: That’s probably one of the things, is just that work at Driven Data that we’ve been doing for the last five years. Outside of that, there are two particular areas of data science that I often talk about and I’m very interested in. The first one is engineering best practices for data science. I’m one of the maintainers of the cookie cutter data science project, which is a project template, that I hope we get some time to talk about because it’s one of my pet projects and I think it makes a big difference in our own work, and I hope it makes a difference for other people.

Peter: The other one is thinking about the social impact that data science can have, and its relationship to the larger data ethics conversation that’s happening. We just released a package called "Deon", D-E-O-N, that’s for building ethics checklists that you can use as part of your data science project. If we get a chance, I’d love to talk about both of those as well, because they’re things that I care deeply about.

Hugo: I would love to jump into all of these things. To recap, we have machine learning competitions with social impact, engineering best practices, which I think is incredibly important, particularly because there is an argument that the idea of best practices in data science is in a woeful state and something that a lot of people are working to correct, and bringing engineering best principles into that will be essential. Then of course, the data ethics aspect of your work. Very recently, Mike Loukides, Hilary Mason, and DJ Patil have started writing their series on data ethics for O’Reilly, where they’ve actually got a post on checklists versus oaths versus codes of conduct. I think all of these things are incredibly topical.

Machine Learning Competitions with Social Impact

Hugo: Let’s just spend a bit of time to go through each of these. In terms of your machine learning competitions with social impact, could you just give us a couple of examples?

Peter: Yeah, sure. I’ll start with one of my favorites, and that was the first competition we ever ran. There’s a nonprofit organization called Education Resource Strategies. Really what they want to do is they want to help school districts to spend their money more effectively. Schools are spending money on things like student transportation, teacher salaries, textbooks. They have a wide range of operational costs. Right now, it’s very difficult for a school district to say, "Am I paying a lot more for textbooks than neighboring districts and getting the same outcomes? Are my investments being effective?" The biggest barrier to doing that kind of benchmarking or comparison is that school budgets come in wildly different formats. There’s no standard for reporting the categories of a budget expenditure so that I can say, "We’re spending more on textbooks than a neighboring school, and we need to look at this."

Peter: Education Resource Strategies gathers all of this budget information from the districts they work with. Ultimately, what their output is is a recommendation for how to think about the school district’s budgeting after they go through this process of benchmarking that district against other districts. The big problem is that they spend hundreds of person hours a year looking at Excel spreadsheets, reading a budget line item, and trying to assign it a standard set of categories.

Peter: As I’m sure your audience will have picked up on from the description, they have a lot of labelled data that they’ve generated through the course of their operations. That label data is a budget line item, a description of it, the cost of that budget item, and then what category it belongs to, whether that is transportation, textbooks, extracurricular activities, administrative salaries. All of those things, they’ve captured over their history of working with school districts.

Peter: So our first competition was how do we help this organization that really cares about the output report and not about this taxonomy process? How do we help them to automate that? So we ran a competition where people built algorithms to parse the natural language in these budgets, to look at these budget costs, and to automatically assign categories for those school budgets to those line items. We took the output of that competition, we turned it into a tool that fits into the workflow that this organization already had. It’s saving their analysts tons of time in terms of just reading through these spreadsheets so that they can focus that time where they can really add value, which is about making recommendations for how those budgets can be changed.

Hugo: That’s great. It seems like it would reduce so much of the manual labor involved.

Peter: Yeah, really it’s … Last time we checked in, it was saving them about 300 person hours a year to automate that process. For a relatively small nonprofit organization, that’s actually a huge amount of labor savings. Really, their goal is to employ those savings more effectively, where their employees actually add value, rather than in the labeling of spreadsheets where it’s just this task that had to happen, any way.

Hugo: Yeah, absolutely. We should mention that if people find this type of stuff pretty exciting and interesting, they can check out all of the competitions at Driven Data, but if they find this particular competition interesting, they can even take the DataCamp course that you’ve built and that I collaborated on, which is learning from the experts: you get to build the winning solution in the end.

Peter: Yeah, that’s right. That course will walk through not only what a baseline solution to a problem like this is, but also how the person who won the competition combined a number of interesting techniques to get to that best-performing solution.

Hugo: I’m not going to spoil the punch line. I don’t want you to either, but I will say that it’s not an LSTM or any crazy deep learning architecture that wins the competition.

Engineering Best Practices for Data Science

Hugo: Now we’ve talked about the types of machine learning competitions you do at Driven Data. Maybe you can tell us a bit about your thoughts on engineering best practices for data science, and in particular your cookie cutter data science project that you maintain?

Peter: Great. Yeah, so my background is in software engineering. One of the things that I think about while I’m working on data science projects is how software and data science go together. I think there are some important differences between the processes, in that data science tends to be more open-ended, tends to be more of a research and development effort. It’s still the case that a lot of what we do in data science is, at its core, building software. Even if that software exists in a Jupyter notebook or in an Rmarkdown file, it’s still a piece of software. A lot of the best practices that come out of software engineering can be employed to improve those products.

Peter: The cookie cutter data science project is what we think of as the first pass at standardizing our process to make ourselves more effective. That’s to have a consistent layout of what a data science project looks like. If you were to look at a web application in Django, which is a Python web framework, or in Ruby on Rails, or in Node.js, all of these are different programming languages but any time you build a web application in one of those frameworks, you have more or less the same structure. What that means is, that anyone who’s a web developer can go into a project like that and have some expectation about where they would find certain kinds of code. The code that talks to the database usually lives in one place. The code that talks to the front end usually lives in another place. Those expectations make it very easy to work together and collaborate on projects.

Peter: The cookie cutter data science project idea is to bring that kind of standardization to our own data science work. We have a defined folder structure where our data lives. We have a set of folders for raw data, data that has had some processing but is in an interim state, and then processed data. We have a particular folder structure for where we keep our Jupyter notebooks or Rmarkdown files that are built in the literate programming style. We have another set of folders for data processing pipelines that may exist as scripts, and then ultimately may get refactored into something like a Python package.

Peter: Because of this consistency, it’s very easy for us to move from project to project, and pick up something and remember where we are, how to reproduce something, and because we work with lots of different clients on lots of different projects, this means that anyone who works on the team can jump into any project without having to spend a lot of time figuring out what happens where.
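
The layout Peter describes can be sketched in a few lines of Python. This is an illustration of the idea, not the actual cookiecutter-data-science template (which lives on DrivenData's GitHub and has more structure); the directory names below follow his description in the interview.

```python
from pathlib import Path

# Core directories Peter mentions; the real template has more
# (docs, models, reports, etc.), but these capture the idea.
PROJECT_DIRS = [
    "data/raw",        # original, immutable data
    "data/interim",    # data with some processing, in an intermediate state
    "data/processed",  # final datasets, ready for analysis
    "notebooks",       # Jupyter notebooks / Rmarkdown files, literate-programming style
    "src",             # pipeline scripts that may later be refactored into a package
]

def scaffold(root):
    """Create the standard project skeleton under `root`."""
    for d in PROJECT_DIRS:
        Path(root, d).mkdir(parents=True, exist_ok=True)
```

Because every project shares this skeleton, anyone on the team knows where raw data, notebooks, and pipeline code live before they even open the repository.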

Building a Package for Data Ethics and Ethical Checklists

Hugo: That’s great. I find all the details very interesting, but as you’ve hammered home, the idea of actually having an overarching, consistent layout to data science projects and a system of best practices is incredibly important. I presume this also plays a role in your approach to building the package for data ethics and ethical checklists, of having something that you can carry across projects. If there are biases or challenges involved, they get addressed systematically, in the sense that catching them won’t depend on which human happens to be working on the project. They’ll be handled in the structure as a whole, so you’ll become aware of them.

Peter: Yeah, so I think the two are related in that we really spend a lot of our time building tools for people who are data scientists. A lot of times they start out as tools that we’re using ourselves, and then we open source those tools so that other people who are working in data science can use them. That’s how the cookie cutter data science project started, and that’s really how this ethics checklist package DEON started as well.

Peter: The idea there was, there are a lot of conversations around data ethics that we found very compelling, from the standpoint of really seeing where things had gone wrong in the process and feeling that we ourselves were vulnerable to some of these things that could go wrong. It’s very, very hard to have perfect foresight and to understand exactly what can go wrong in any given circumstance. Because we kept seeing these examples and feeling like, "Would we necessarily have caught this in our own kind of work?", we wanted to have a more actionable way of engaging with that data ethics conversation, to make sure that we didn’t fall into some of these traps that just exist in the work. If you’re focused on methods, if you’re thinking about data, you can get into these really technical aspects and not have a chance to pull up and look at the ethical implications.

Peter: We wanted to have a really process-driven way of engaging that conversation for our own projects. What this package we built does, is it generates a data ethics checklist that’s really framed around the data science process. It starts with the data collection phase. There are a set of checklist items that ask you questions like, do the participants in your data collection give informed consent for the data that is collected? That’s one example of an item on the checklist.

Peter: What we’ve done is we’ve taken each of these checklist items at different parts in the data science process, and we’ve mapped them to real-world examples where something has gone wrong. We’ve got this collection of news articles and academic papers that explain where data science projects have run into ethical problems and the issues that have arisen.

Peter: There was actually just a really great article about Amazon trying to build a resume filtering algorithm. They get an unbelievable volume of resumes for any position that they open up. They had the belief that they could use all of their historical data to train an algorithm to identify the top candidates applying for a given position. Now, as a framing, that may seem like it makes a lot of sense from a data science perspective. We’ve got a long set of training data, and we want to be able to replicate this with an algorithm.

Peter: They just actually shut down the team that had been working on this project for years, because they found that the algorithm was biased against women. In particular, it wasn’t saying, "Oh, is this person a woman and they’ve indicated that on their application and now I want to use that and discount their application," but it was using things that were a little more implicit than that. In particular, if the applicant had attended a women’s college, then their score went down. They discovered these problems with their algorithm and disbanded the team that was working on this project entirely, because they couldn’t get it over this bump in the process. They’re still having humans review all of these resumes that come in.

Peter: This is just sort of a classic example of something that seems like a great setup for a machine learning problem in the abstract, but if you don’t think about how it’s affecting your outputs, beyond just some measure of accuracy, it can go really, really wrong.


Hugo: These are the types of things that DEON would ask you to check for?

Peter: That’s exactly right. Yeah. It goes through the data science process from data collection to data cleaning to exploratory data analysis to actually building models and that’s where this would come in. Then actually it’s got a section on deploying models. When the model is deployed, what are the questions that you should ask? Really, the goal is not to have all of the answers about what’s right and what’s wrong in a checklist. Given all of the different domains people work in, that’s really an impossible task.

Peter: The goal is to take people who are already conscientious actors that want to be doing the right thing and make sure they’re having the conversations about the ways that things can be misused. Really the workflow we see for the tool is you generate a checklist using the tool, and then you submit a PR to your project that says, "Hey, here’s our ethics checklist, let’s make sure we talk about each of these items as a team." It’s really about spurring that team discussion to make sure you’ve considered the particular implications of your project and you’ve made the right decision about it.
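
The checklist-generation idea can be sketched like this. To be clear, this is a hypothetical illustration of the concept, not DEON's actual source; the checklist items below are paraphrased from the conversation, and the real tool is the `deon` package, run from the command line to drop an ethics checklist into your project.

```python
# Illustrative checklist, keyed by phase of the data science process.
# Items here are paraphrased from the interview, not DEON's real list.
CHECKLIST = {
    "A. Data Collection": [
        "Did participants give informed consent for the data collected?",
    ],
    "B. Modeling": [
        "Have we checked for implicit proxies for protected attributes "
        "(e.g. attendance at a women's college)?",
    ],
    "C. Deployment": [
        "What questions should we ask once the model is deployed, and can we roll it back?",
    ],
}

def render_checklist(checklist):
    """Render the phases and items as a markdown task list, ready to commit in a PR."""
    lines = ["# Data Science Ethics Checklist", ""]
    for phase, items in checklist.items():
        lines.append(f"## {phase}")
        lines.extend(f"- [ ] {item}" for item in items)
        lines.append("")
    return "\n".join(lines)
```

The point of the markdown output is exactly the workflow Peter describes: the generated file goes into a pull request, and each unchecked box is a team conversation waiting to happen.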

Hugo: We’ll include links to the cookie cutter project and the DEON package in the show notes. I think you discussing data ethics in terms of thinking about the variety of stakeholders really dovetails nicely into our conversation about human centered design in data science and why it’s important. As a prelude to human centered design though, I’d just like to ask you a quick question about the role of empathy in data science as a whole. I’m wondering what is the role of empathy in data science for you?

Peter: That’s a great question. I actually feel like empathy is a term that has started to pop up in the data science conversation as a core skill of a data scientist. In my mind, empathy is just one way to get at a particular kind of approach. That approach is to be problem focused rather than method focused. What I mean by that is that as data scientists we’re really in a professional service role. We’re providing a service to different parts of a business or different parts of an organization. We should start with what the problem is that we’re solving, and understand the context for that problem, rather than saying, "Hey, who’s got an NLP problem that I can solve with an LSTM?" Or, "Who’s got a computer vision problem where I can try out the latest neural network methods and get a really cool result?"

Peter: If you start methods first, a lot of times you end up with a solution that’s not going to be really useful in the context in which it operates. When we talk about data science and empathy, what we’re really saying is that you should empathize with how your data science output will be used. You should empathize with what’s the problem we’re solving. When we talk about empathy, I think that’s one way of getting to a perspective that’s problem first rather than method first.

Hugo: And do any concrete examples spring to mind?

Peter: Yeah, so I think that for me, a good example is a project we worked on that tries to automatically identify species in videos. Species of animals, that is. There’s a research organization that has these motion-detecting cameras that they set up in the jungle and they try to record videos of chimpanzees, but they get a lot of videos of other animals as well. Instead of sitting there and watching all of these videos and saying which one has a chimpanzee and which one doesn’t, we were helping them to build an algorithm to automatically identify the animals that appear. We actually ran a competition around this last year. If you’re curious, you can go and look at the results of that competition.

Hugo: What was the name of the competition?

Peter: The name of the competition was "Primate-rix" Factorization.

Hugo: Stop it.

Peter: Thanks for asking.

Hugo: That’s brilliant.

Peter: We really care about our data science puns, and that’s one of my absolutely favorites.

Naïve Bees

Hugo: When talking about the data science puns, we have to mention Naïve Bees, as well.

Peter: Yeah. To be honest Hugo, with your accent I wasn’t even sure if you were saying Bees or Bayes, so it works even better in Australia.

Hugo: Naïve Bees of course, is currently being turned into a series of DataCamp projects as well.

Peter: That’s right, yeah. The Naïve Bees competition was to build an algorithm to tell honey bees apart from bumblebees. We’re working on a set of DataCamp projects to help people work through that problem, to give them that first exposure to a computer vision task that fits into a classification framework, and to look at the more traditional methods and then move on to deep learning, neural networks, and convolutional neural networks.

Role of Empathy

Hugo: After that pun-tastic interlude, back to the role of empathy in identifying primates?

Peter: Back to the role of empathy. Really, this is about going back to the context and understanding the context in which you operate. We were working with this team of ecologists and biologists. They spent a lot of time in the field setting up these cameras, capturing data, watching videos, and then writing papers about what they see in the videos. The output that we ended up working on after the competition was an open source package that let you run, from the command line, "predict, here are my new videos," and it would output a CSV with each of the videos and a probability that a certain kind of animal appeared in the videos. We were pretty pleased with this output. It’d be super useful for us.
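
The shape of that command-line output can be sketched with the standard library. The filenames, column names, and probabilities below are invented for illustration; the real package's interface is not shown in the interview.

```python
import csv
import io

# Sketch of the tool's output format Peter describes: one row per video,
# with the model's probability that a given animal appears in it.
def write_predictions(rows, fileobj):
    """Write (video, probability) pairs as a CSV the research team can open anywhere."""
    writer = csv.writer(fileobj)
    writer.writerow(["video", "prob_chimpanzee"])  # hypothetical column names
    for video, prob in rows:
        writer.writerow([video, f"{prob:.3f}"])

buf = io.StringIO()
write_predictions([("clip_001.mp4", 0.92), ("clip_002.mp4", 0.07)], buf)
```

A plain CSV is a deliberately low-tech interface: it assumes nothing about the consumer's tooling, which matters for exactly the reasons Peter explains next.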

Peter: The first thing we heard from the team we were working with was, "We can’t even get this tool installed. I can’t get XGBoost to install on my machine. I’m having trouble getting the version of TensorFlow installed. I’m having trouble getting GPU drivers installed." All of this stuff that feels like second nature to us as data scientists sort of blinded us to the context in which this tool was actually going to be used: by ecologists who aren’t used to all of this complex machinery around the packaging of data science tools, which can make it really challenging to use the latest methods. That’s just a really concrete example of a place where we weren’t doing the right thing upfront to really understand that context and make sure we built something useful. We were building something that we knew would be useful for us.

Hugo: I’ve got to make clear that, I’m sure a non-trivial proportion of working expert data scientists have a lot of problems getting XGBoost installed occasionally, as well.

Peter: Yeah, if someone wants to take on the initiative to improve the XGBoost installation experience, that’s a really valuable project that someone could do for the open source community.

Human Centered Design

Hugo: Let’s jump into human centered design and why it’s important in data science. I think probably a lot of our listeners wouldn’t necessarily think of design principles as being something which would play a huge role in the data science process. Maybe you can tell us about human centered design and why it’s important in data science?

Peter: Great, yeah. Human centered design is a way of framing the design process. It’s really related to other terms that are in this field that you may have heard of, like design thinking, the double diamond method, and design sprints. Those are all sort of popular terms that people may talk about. It’s really referring to the same set of ideas, which is about having a design process.

Peter: Human centered design in particular is the one that we’re most familiar with through our work with an organization called IDEO.org. IDEO is one of the leading human centered design firms. They helped design the first Apple mouse. They have a long history of being designers, both from an industrial design perspective and from a digital design and then eventually service design perspective. They have a really long history and track record of working with and using these design tools to spur innovation.

Peter: They spun out a nonprofit arm, IDEO.org, that works with NGOs to have the same sort of results. We partnered with that organization to look at digital financial services in Tanzania. Just to take a step back, that’s the context that we’re working in: with this team of human centered designers. I’ll give you the overview of what that human centered design process looks like first, and after that we can dig into the details of that particular project, which I think are pretty enlightening for how data science and design work together … The big picture is that human centered design is about starting with what’s desirable. There’s a perspective that the best solutions to a problem are desirable, in that someone wants to use them; feasible, from a technological perspective; and viable, from a business perspective. The best solutions sit at the intersection of that Venn Diagram. You know you’re not having a good data science conversation until you’re talking about a Venn Diagram.

Hugo: I was waiting for the Venn Diagram.

Peter: Right? This particular Venn Diagram is desirable, feasible, viable. We want the intersection of all three of those things.

Hugo: Of course a lot of data science work maybe will start with feasible, like the newest cutting edge technology and the coolest most efficient algorithms and that type of stuff, right?

Peter: That’s exactly right. I think that that is one of our tendencies as data scientists that I see in myself all the time, where I’m getting excited about lots of these new technologies and I want to find ways to use them. The trick is just to find the balance for really solving a problem where using that is appropriate. The human centered design process starts from this perspective of what’s desirable to a user. It gets there by moving through these three phases of the design process.


Peter: The first is inspiration. Inspiration is about going and observing the people who will be users of your end product. In the case where you’re a data scientist, let’s say that your job is to create a report that’s emailed out to your executive team once a month. What you would actually do is go and talk to the people who get that email and say, "Hey, when you get this email, what do you use it for? What does it go into?" Do they say, "Okay, I need some top-line numbers here that I put into slides"? Or is it something they read to get context, so that they then say to the people that they manage, "We need to change things X, Y, and Z"?

Peter: You would go and you would actually talk to the consumers of your data science process to see how does it fit into the bigger picture. The inspiration part of the phase is really about going broad, brainstorming, and trying to get inspired by everything you might see around you. It’s not about, "Let me see the data," and get inspired by what’s in my data. It’s "let’s get inspired by everything before we even think about the data". That’s the first phase.


Peter: The second phase is ideation. What this means is trying to come up with particular solutions to a problem and then testing those really quickly. One of the core concepts here is having the lowest fidelity prototypes possible, and getting real user feedback on them. It might be the case that you’re working on a model to do some classification and ultimately it’s going to be this big, complex machine learning system that’s deployed in the Cloud. What you might do first is … Let’s say we’re working on this honeybee/bumblebee problem. You might just say, "Okay, here’s a spreadsheet of probabilities for each of these species, what would you do with this?" That’s sort of my lowest fidelity one. Take the most basic method, take the most basic output and say, "Is this useful?" Then you take that and you learn from that.

Peter: The ideation phase is about these iterative cycles of learning from low fidelity prototypes that gradually build up fidelity. It helps to keep your project going in a direction that ensures the output is going to be targeted at a real problem. That it’s actually going to be useful, and that as you come up with new ideas throughout that process, you can see which of those are good ideas and which ones aren’t.

Hugo: It keeps you honest, right, in the sense that you’re not going to end up building something which is useless, or going down the entirely wrong path.

Peter: That’s exactly right. Yeah. I mean, I’ve seen so many times … even work that we’ve done where you build a dashboard that no one ever looks at. Everyone thought the dashboard was what they wanted, but that’s not the right tool for the job. People really cared about answering one specific question, and if they could just go to one page with a thumbs up for a yes and a thumbs down for a no, that would have been better than the dashboard that you created.

Hugo: In this dashboard case, instead of building the dashboard, perhaps the inspiration or ideation phase would involve drawing the type of figure on a whiteboard or on a piece of paper that the dashboard would show and say to the stakeholders, "Hey, is this actually what you want?"

Peter: Yeah, that’s exactly right. Or even mocking it up in PowerPoint or using Microsoft Paint to make a little prototype and say, "Hey, is this graph something that you would need? How would you actually use this in your process?" Trying to get at, not just saying, "Hey, do you like this, is this something you want to see?" More, "How would you use it then?" That question of how you actually use something will change people’s answers.

Peter: I think in a lot of data science work, it’s very clear that if you ask, "Hey, do you want to see this, hey do you want to see that, hey do you want to see something else?" people say yes to all of those questions. How could you not want to have all of the information that you could have? Really, the question of how would you use it helps you to narrow it down on the things that are going to be valuable, and not create this information overload.


Hugo: Tell us about the third phase. We’ve got inspiration, ideation, and the third "I" is …?

Peter: The third "I" is implementation. Implementation here is not just, okay, you finished it, now go build it. Implementation is actually a continuation of this process, to go from prototyping to actually piloting a solution. We think of prototypes as being small scale, low fidelity, something we do with a couple of people to get some feedback. We think of implementation as, okay, how do we become more data driven about this decision that we’re making now? Implementation is about picking a pilot cohort, a set of users that will actually consume this and then saying, "Okay, here’s the version we’re working with now. Here’s a higher fidelity prototype that we have. Let’s put it out there for a particular user group, and let’s do some real testing of if this solution is working and solving the problem that we want to solve." Implementation is this piloting phase to get to the point where not only do you have a lot of great anecdotal and qualitative evidence that you’ve built up from these discussions, but you’re also starting to get this quantitative evidence for how what you built is changing the metrics that you care about.

Digital Financial Services in Tanzania Project

Hugo: You and I have discussed a really interesting project you worked on previously, which I think illuminates all of these steps really wonderfully. It’s a project where you were looking at digital financial services in Tanzania. Maybe you could tell us a bit about this project and how human centered design actually played a pivotal role for you.

Peter: Great, yeah. For this project, I’m going to step back and give you the context that it was done in. For a lot of people, your money has always been digital. What I mean by that is that when I opened my first bank account, my parents brought me to a bank in middle school and said, "You’re going to have a bank account. You have to be responsible for your finances now." I gave that bank some money and they wrote down on a piece of paper how much money I had. That amount of money wasn’t physical cash that I was holding; it was actually in a database that the bank had. That was purely digital. That’s my native experience of what it’s like to interact with money: to have what is really a digital representation of that currency.

Peter: Whereas in lots of places, bank infrastructure is not very good. It’s expensive to build banks, and there are security risks in moving physical money around. In many countries, it’s the case that people do most of their interaction with money in pure cash. That means when they want to save money, they hold cash. When they want to spend money, they have to get enough cash to buy something. That can be very limiting in terms of your ability to save money over time, or your ability to get loans that you might need to buy particular things. There’s a belief in the international development space that one of the ways in which we can help lift people out of poverty is to provide them access to digital financial services. How can we provide the same digital experience that people have had growing up, putting their money in a bank, without having to build all of that physical banking infrastructure?

Peter: One of the approaches to this is to have a digital wallet on your phone that’s associated with your mobile phone account. Mobile phones have been one of these leapfrogging technologies where people who didn’t even have a landline now have access to a mobile phone. The mobile phone providers are now starting to offer services where you can actually save cash on your phone. You can say, "Okay, I’ve got a wallet with $10 in it," and that’s just associated with my phone.

Peter: There’s this transition from a world in which you just work with cash to a world in which you have some digital representation. For people who haven’t had the experience of always having a digital representation, you have to build trust in that system. You have to make sure that the digital financial services actually fit the needs that this community has. That’s the big project background and context: how do we increase uptake of these digital financial services, which offer people in these environments, who have very volatile incomes, some sense of stability for their money, some access to loan infrastructure, some access to savings infrastructure? That’s the context.

Peter: This is a project that’s funded by the Gates Foundation to try to understand what levers we can pull to help engage people and give them access to these tools. One of the particular things about this digital financial services system is that you need a way to exchange your digital currency for cash. There’s always going to be some sort of transaction that’s like working with an ATM. However, ATMs are relatively expensive and there’s a lot of physical infrastructure you’d need for an ATM to work.

Peter: What happens in places where they have digital financial services, and this also goes by the name "Mobile Money", so I’ll use those interchangeably, you have what are called Mobile Money agents. These are usually people that have a small shop somewhere, that are selling home goods, snacks, drinks. They also become Mobile Money agents. What that means is, you can go to them and say, "I want to trade five digital dollars for five dollars in cash," or, "I want to take five dollars in cash and put that in my digital wallet." They’re the interface between physical cash and digital cash.

Peter: The focus of our project was, how do we make that interaction between agents and customers one that can help build trust in this Mobile Money system, and that direction was driven by this human centered design process. When we went out, the first thing we did was talk to people who used Mobile Money and people who didn’t. We went and sat in a number of markets. We watched people buy things, and we asked them, "Why did you buy that with cash? Why did you buy that with Mobile Money? Do you have a Mobile Money account? What’s your experience been like?"

Hugo: It sounds like in this process as well then, you’re getting qualitative data as well as quantitative data.

Peter: That’s right. Really, a big part of what you do is try to gather that qualitative data as well. We’ve been gathering this qualitative data through these interviews, but we also got Mobile Money transactions from one of the mobile network operators for a full nine months. It was tens of gigabytes of data, all of the transactions. Hundreds of millions of Mobile Money transactions in the country. Our goal was to combine that very rich data source about real behaviors with behaviors that we heard about, with these qualitative experiences that people actually had.

Peter: One really great example that I think just highlights the value of human centered design is the fact that we kept talking to agents and saying, "What are your biggest struggles with Mobile Money? How do you see it fitting into the larger context of your business where Mobile Money is one of your revenue streams, but you also sell other goods?" What we kept hearing from these Mobile Money agents is, "Well, it can be really tricky to predict how much money I’m going to make on Mobile Money transactions, because I earn some commission on each transaction, but it’s really opaque what those commissions are and it’s very hard for me to predict on any aggregated timeframe for some given week or some given month how much money I’m going to get in commissions."

Peter: This was particularly interesting to us because we as data scientists had been digging into this huge treasure trove of data thinking, "Oh, this has all the answers in it, this is going to be amazing. We can find so many insights in this huge range of transactions that we have." One of the things that we realized after doing these interviews is, we didn’t know what the commissions for an agent were. That was not data that was in the dataset that we had. We had the amount of the transaction, but the portion that then the agent got as a commission was calculated totally separately by some business logic that existed in another application.

Hugo: Even if you did know how much they got, you may not know whether that was a lot or a little or how that affected them on the ground.

Peter: Oh, absolutely. Yeah. Just the thought that we could have known what was valuable to agents by looking at the dataset and figuring out these patterns, that dataset didn’t even have the most important variable to the agents that we were working with. We wouldn’t have learned that if we didn’t talk to them and just try to learn things from the data alone.

Peter: One of the big things that we try to do in all of our work, and I really encourage all data scientists to do, is to go out and observe how your data gets collected. If you’re working with IoT data in a factory, actually go to the factory floor, watch how things happen. If you’re like us working with digital financial services, go watch people make transactions. See what actions in the real world correspond with something in your data, because that perspective changes how you think about the data itself. It changes where you trust the data and where you don’t trust the data. I really encourage people to get away from their screen, step away from their desk, and go watch the data collection in action. I think in nearly every case, you can go do this and it will have a transformative effect on how you think about that data.

Hugo: I think that’s really important, particularly as we live in an age when people maybe first gain experience with data science through online competitions, for example, such as yours, or platforms like DataCamp, getting their data online, and not actually thinking a lot about the data-generating process.

Peter: Yeah. It’s very easy to just start with the data and say, "Okay, what’s in here?" Until you really understand that data-generating process, you won’t know to ask, "What’s not in this data that I might care about?" Or, "What in this data is not reliable?" For example, we saw a lot of these Mobile Money transactions fail because of network connectivity. For some of those transactions, we wouldn’t have seen that in the data. If the network failed, the transaction never went through, and it doesn’t get recorded in the database. Understanding the limitations of the connectivity and how that affects the experience, and even how we might measure those failed transactions, is something we can only start asking about once we’ve actually observed them.

Other Aspects of the Project in Tanzania

Hugo: We’re going to need to wrap up in a few minutes Peter. I was just wondering, are there any other aspects of the project in Tanzania that you wanted to discuss?

Peter: Yeah, so just to share one other example of where the human centered design approach really made a difference. We were looking at the times of day at which a Mobile Money agent was busiest: when were people coming to trade cash for digital currency, or to do the opposite? We took the data and we looked at it by day of the week and time of day, and we built this really beautiful heat map visualization that you can think of as a checkerboard, where each of the squares is lighter or darker based on how many transactions you have. We made it interactive so that you can hover and see how busy an agent is for a different region, at a different time of day, on a different day of the week, and really get a great sense of the patterns of Mobile Money use that happened.
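
The aggregation behind that heat map is simple to sketch: bucket each transaction timestamp by (day of week, hour of day) and count. The timestamps below are invented, and the real analysis covered hundreds of millions of rows, but the shape of the computation is the same.

```python
from collections import Counter
from datetime import datetime

def busy_matrix(timestamps):
    """Count transactions per (weekday, hour) cell; one cell per square in the heat map."""
    counts = Counter()
    for ts in timestamps:
        counts[(ts.strftime("%a"), ts.hour)] += 1
    return counts

# Made-up transactions: two on a Monday morning, one on a Tuesday evening.
txns = [
    datetime(2018, 11, 5, 9, 0),
    datetime(2018, 11, 5, 9, 30),
    datetime(2018, 11, 6, 17, 15),
]
m = busy_matrix(txns)
```

Each (weekday, hour) count then maps to the lightness of one square in the checkerboard.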

Hugo: It’s also colorblind human friendly.

Peter: It absolutely is. We did build it using viridis, which is colorblind friendly. If you’re not thinking about that for your visualizations, you should be, because that’s a little bit of human centered design.

Hugo: Speaking as a colorblind individual, I’m red-green colorblind, as, you know, 8% of human males are.

Peter: You appreciate viridis even more than the rest of us that think it’s a beautiful color map.

Hugo: I can’t get enough of it.

Peter: Well, I think it’s a really compelling and beautiful color map, which is one of the reasons that we, and the people we worked with, loved this visualization and how interactive it was … but none of the agents that we were working with had access to a computer. They weren’t sitting at a laptop wanting to look at a dashboard that had this beautiful visualization on it. That wasn’t going to be useful to them.

Peter: What we ended up actually building was a text-based visualization that was essentially just a bar chart where it would say "M" for Monday and then it would have three capital "I’s" in a row. Then it would have "T" for Tuesday, and it would have eight capital "I’s" in a row. By building these text-based visualizations, essentially bar charts built out of characters, we could actually give a data visualization experience to these agents that were working on feature phones. That process of taking something that we think of as an amazing data science output, this really compelling interactive visualization, and putting that visualization into a context where it can actually get used, is one of the transformative experiences of that project for us where we started to think about, "Okay, what’s the context for all of our output?" Not, "How do we make the most amazing data visualization?"

Hugo: Yeah, and it’s so reliant on the knowledge of what actual technology, what phones humans have on the ground.

Peter: Yeah, that’s absolutely right.

Call to Action

Hugo: Peter, as a final question, I’m wondering if you have a final call to action to all our listeners out there?

Peter: Yeah. I think there are, in my mind, four core activities of a human centered data scientist, and I think we should all be human centered data scientists. The first one is, go to the field and observe the data being generated. Without understanding what a row in your dataset means, without actually observing that happening, without knowing what gets captured, what doesn’t, what happens when something goes wrong, you’ll be very limited in the output that you can have. Also, if you do go and do that, you’ll be so much better positioned to ask questions that matter of your data. Without talking to those agents, we wouldn’t have asked that commission question of the dataset. Going to the field, observing data being generated, is item number one.

Peter: Item two is design with, not for, by iterating on prototypes. This process of constant iteration, conversation with people who will actually be using the output, getting their buy-in on the decisions that you’re making, means that it’s going to be something that’s useful when you finish the project. Not, "I worked for three months, is this good for you?" "Oh, no, it’s not," or it requires some major changes. It’s how do we keep that process tighter in-sync so that we’re actually building things that are useful. We do that with really low fidelity prototypes that we’re constantly testing.

Peter: The third is to put outcomes, not methods or outputs, first. That’s really saying, what is the outcome we care about? In our case, it was the increase in the adoption of digital financial services. That’s what we cared about, and in particular we thought we could do that by improving the tools that Mobile Money agents had. Our goal was to say, "Okay, the best outcome is for Mobile Money agents to be making more transactions." That’s what we want to measure. It wasn’t, how do we do the most interesting dimensionality reduction on this huge dataset that we have.

Peter: The fourth item is to build consensus on metrics for success. I think this is one of the most difficult but most important ones, is you need to define upfront what success means and you need to get buy-in from everyone on that being successful. I think this is one of the things that people assume they’ve got the same goals if they’re working on the same project from the get-go, but until you have that explicit discussion about what success means and what those metrics are, you won’t be optimizing for exactly the same thing.

Peter: Those, I think … My call to action really is to take those and try to build them into your process as a data scientist. Other than becoming a human centered data scientist, thinking about your users, and using a more collaborative process, I would encourage people to come check out a competition on DrivenData. We’ve got a lot of interesting social impact projects happening there. Or, check out one of the open source projects that came up as part of this discussion: Cookiecutter Data Science, or DEON, the ethics checklist package.

Hugo: Of course, both of those projects, engaging in competitions, and all the other great stuff you do on Driven Data will help any budding data scientist, or established data scientist, to do more human centered data work as well.

Peter: Yeah, that’s the goal.

Hugo: Peter, it’s been such a pleasure having you on the show.

Peter: Hugo, thanks for having me. I loved chatting, as always.


To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Beyond Univariate, Single-Sample Data with MCHT

Mon, 11/05/2018 - 17:00

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)


I’ve spent the past few weeks writing about MCHT, my new package for Monte Carlo and bootstrap hypothesis testing. After discussing how to use MCHT safely, I discussed how to use it for maximized Monte Carlo (MMC) testing, then bootstrap testing. One may think I’ve said all I want to say about the package, but in truth, I’ve only barely passed the halfway point!

Today I’m demonstrating how general MCHT is, allowing one to use it for multiple samples and on non-univariate data. I’ll be doing so with two examples: a permutation test and the test for significance of a regression model.

Permutation Test

The idea of the permutation test dates back to Fisher (see [1]) and it forms the basis of computational testing for differences in means. Let’s suppose that we have two samples with means $\mu_X$ and $\mu_Y$, respectively. Suppose we wish to test

$H_0: \mu_X = \mu_Y$

against the alternative

$H_A: \mu_X > \mu_Y$

using samples $X_1, \ldots, X_{n_X}$ and $Y_1, \ldots, Y_{n_Y}$, respectively.

If the null hypothesis is true and we also make the stronger assumption that the two samples were drawn from distributions that could differ only in their means, then the labelling of the two samples is artificial, and if it were removed the two samples would be indistinguishable. Relabelling the data and artificially calling one sample the $X$ sample and the other the $Y$ sample would produce highly similar statistics to the one we actually observed. This observation suggests the following procedure:

  1. Generate $N$ new datasets by randomly assigning the labels to the combined sample of $X$s and $Y$s.
  2. Compute $N$ copies of the test statistic on each of the new samples; suppose that the test statistic used is the difference in means, $\bar{X} - \bar{Y}$.
  3. Compute the test statistic on the actual sample and compare it to the simulated statistics. If the actual statistic is relatively large compared to the simulated statistics, then reject the null hypothesis in favor of the alternative; otherwise, don’t reject.

In practice step 3 is done by computing a $p$-value representing the proportion of simulated statistics larger than the one actually computed.
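Before turning to MCHT, the procedure is simple enough to sketch in base R. This is a minimal illustration of steps 1–3, not the package’s implementation; the sample sizes and seed are arbitrary.

```r
# Minimal base-R sketch of the permutation procedure above (not MCHT's code)
set.seed(123)
x <- rnorm(5, 2, 1)   # sample with true mean 2
y <- rnorm(10, 0, 1)  # sample with true mean 0
observed <- mean(x) - mean(y)

combined <- c(x, y)
N <- 1000

# Steps 1 and 2: permute the labels and recompute the difference in means N times
simulated <- replicate(N, {
  shuffled <- sample(combined)
  mean(shuffled[1:5]) - mean(shuffled[6:15])
})

# Step 3: p-value is the proportion of simulated statistics at least as large
p_value <- mean(simulated >= observed)
p_value
```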

Permutation Tests Using MCHT

The permutation test is effectively a bootstrap test, so it is supported by MCHT. One may wonder how that’s the case when the parameters test_stat, stat_gen, and rand_gen only accept one parameter, x, representing the dataset (as opposed to, say, t.test(), which has an x and an optional y parameter). But MCHTest() makes very few assumptions about what object x actually is; if your object is a vector or a tabular structure, the MCHTest object should not have a problem with it (it’s even possible a loosely structured list would be fine, but I have not tested this; tabular formats should cover most use cases).

In this case, putting our data in long-form format makes doing a permutation test fairly simple. One column will contain the group an observation belongs to while the other contains observation values. The test_stat function will split the data according to group, compute group-wise means, and finally compute the test statistic. rand_gen generates new datasets by permuting the labels in the data frame. stat_gen merely serves as the glue between the two.

The result is the following test.

library(MCHT)
library(doParallel)

registerDoParallel(detectCores())

ts <- function(x) {
  grp_means <- aggregate(value ~ group, data = x, FUN = mean)
  grp_means$value[1] - grp_means$value[2]
}

rg <- function(x) {
  x$group <- sample(x$group)
  x
}

sg <- function(x) {
  test_stat(x)
}

permute.test <- MCHTest(ts, sg, rg, seed = 123, N = 1000,
                        localize_functions = TRUE)

df <- data.frame("value" = c(rnorm(5, 2, 1), rnorm(10, 0, 1)),
                 "group" = rep(c("x", "y"), times = c(5, 10)))

permute.test(df)

##
##  Monte Carlo Test
##
## data:  df
## S = 1.3985, p-value = 0.036

Linear Regression F Test

Suppose for each observation $i$ in our dataset there is an outcome of interest, $y_i$, and there are variables $x_{i1}, \ldots, x_{ik}$ that could together help predict the value of $y_i$ if they are known. Consider then the following linear regression model (with $\epsilon_i \sim N(0, \sigma^2)$):

$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_k x_{ik} + \epsilon_i$

The first question someone should ask when considering a regression model is whether it’s worth anything at all. An alternative approach to predicting $y_i$ is simply to predict its mean value. That is, the model

$y_i = \beta_0 + \epsilon_i$

is much simpler and should be preferred to the more complicated model listed above if it’s just as good at explaining the behavior of $y_i$ for all $i$. Notice the second model is simply the first model with all the coefficients $\beta_1, \ldots, \beta_k$ identically equal to zero.

The $F$-test (described in more detail here) can help us decide between these two competing models. Under the null hypothesis, the second model is the true model:

$H_0: \beta_1 = \ldots = \beta_k = 0$

The alternative says that at least one of the regressors is helpful in predicting $y_i$:

$H_A: \beta_j \neq 0$ for at least one $j \in \{1, \ldots, k\}$

We can use the $F$ statistic to decide between the two models:

$F = \frac{(SSE_2 - SSE_1)/k}{SSE_1/(n - k - 1)}$

$SSE_1$ and $SSE_2$ are the residual sums of squares of models 1 and 2, respectively.
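To make the formula concrete, here is a quick check, on the full built-in iris data as a stand-in, that computing $F$ from the two residual sums of squares reproduces the statistic reported by summary.lm():

```r
# Compute the F statistic by hand from the two nested models and compare it
# with the value summary.lm() reports; iris is a stand-in dataset
full <- lm(Sepal.Width ~ Sepal.Length, data = iris)  # model 1
null <- lm(Sepal.Width ~ 1, data = iris)             # model 2 (intercept only)

sse1 <- sum(residuals(full)^2)
sse2 <- sum(residuals(null)^2)
k <- 1          # number of regressors
n <- nrow(iris)

F_stat <- ((sse2 - sse1) / k) / (sse1 / (n - k - 1))
all.equal(F_stat, summary(full)$fstatistic[[1]])  # TRUE
```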

This test is called the $F$-test because usually the F-distribution is used to compute $p$-values (as this is the distribution the statistic should follow when certain conditions hold, at least asymptotically if not exactly). What then would a bootstrap-based procedure look like?

If the null hypothesis is true then the best model for the data is this:

$y_i = \bar{y} + \epsilon_i$

$\bar{y}$ is the sample mean of the $y_i$ and $\epsilon_i$ is the residual. This suggests the following procedure:

  1. Shuffle $y_i$ over all rows of the input dataset, with replacement, to generate new datasets.
  2. Compute $F$ statistics for each of the generated datasets.
  3. Compare the $F$ statistic of the actual dataset to the generated datasets’ statistics.
F Test Using MCHT

Let’s perform the test on a subset of the iris dataset. We will see if there is a relationship between the sepal length and sepal width among iris setosa flowers. Below is an initial split and visualization:

library(dplyr)

setosa <- iris %>%
  filter(Species == "setosa") %>%
  select(Sepal.Length, Sepal.Width)

plot(Sepal.Width ~ Sepal.Length, data = setosa)

There is an obvious relationship between the variables. Thus we should expect the test to reject the null hypothesis. That is what we would conclude if we were to run the conventional test:

res <- lm(Sepal.Width ~ Sepal.Length, data = setosa)
summary(res)

##
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length, data = setosa)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.72394 -0.18273 -0.00306  0.15738  0.51709
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -0.5694     0.5217  -1.091    0.281
## Sepal.Length   0.7985     0.1040   7.681 6.71e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2565 on 48 degrees of freedom
## Multiple R-squared:  0.5514, Adjusted R-squared:  0.542
## F-statistic: 58.99 on 1 and 48 DF,  p-value: 6.71e-10

Let’s now implement the procedure I described with MCHTest().

ts <- function(x) {
  res <- lm(Sepal.Width ~ Sepal.Length, data = x)
  summary(res)$fstatistic[[1]]  # only way I know to automatically compute the
                                # statistic
}

# rand_gen's function can use both x and n, and n will be the number of rows of
# the dataset
rg <- function(x, n) {
  x$Sepal.Width <- sample(x$Sepal.Width, replace = TRUE, size = n)
  x
}

b.f.test.1 <- MCHTest(ts, ts, rg, seed = 123, N = 1000)

b.f.test.1(setosa)

##
##  Monte Carlo Test
##
## data:  setosa
## S = 58.994, p-value < 2.2e-16

Excellent! It reached the correct conclusion.


One may naturally ask whether we can write functions a bit more general than what I’ve shown here at least in the regression context. For example, one may want parameters specifying a formula so that the regression model isn’t hard-coded into the test. In short, the answer is yes; MCHTest objects try to pass as many parameters to the input functions as they can.

Here is the revised example that works for basically any formula:

ts <- function(x, formula) {
  res <- lm(formula = formula, data = x)
  summary(res)$fstatistic[[1]]
}

rg <- function(x, n, formula) {
  dep_var <- all.vars(formula)[1]  # get the name of the dependent variable
  x[[dep_var]] <- sample(x[[dep_var]], replace = TRUE, size = n)
  x
}

b.f.test.2 <- MCHTest(ts, ts, rg, seed = 123, N = 1000)

b.f.test.2(setosa, formula = Sepal.Width ~ Sepal.Length)

##
##  Monte Carlo Test
##
## data:  setosa
## S = 58.994, p-value < 2.2e-16

This shows that you can have a lot of control over how MCHTest objects handle their inputs, giving you considerable flexibility.

Next post: time series and MCHT

  1. R. A. Fisher, The design of experiments (1935)

Packt Publishing published a book for me entitled Hands-On Data Analysis with NumPy and Pandas, a book based on my video course Unpacking NumPy and Pandas. This book covers the basics of setting up a Python environment for data analysis with Anaconda, using Jupyter notebooks, and using NumPy and pandas. If you are starting out using Python for data analysis or know someone who is, please consider buying my book or at least spreading the word about it. You can buy the book directly or purchase a subscription to Mapt and read it there.

If you like my blog and would like to support it, spread the word (if not get a copy yourself)!


To leave a comment for the author, please follow the link and comment on their blog: R – Curtis Miller's Personal Website.

Explore Your Dataset in R

Mon, 11/05/2018 - 15:56

(This article was first published on Little Miss Data, and kindly contributed to R-bloggers)

As a person who works with data, one of the most exciting activities is exploring a fresh new dataset. You’re looking to understand what variables you have, how many records the data set contains, how many values are missing, what the variable structure is, what the variable relationships are, and more. While there is a ton you can do to get up and running, I want to show you a few simple commands to help you get a fast overview of the data set you are working with.


Simple Exploratory Data Analysis (EDA)

Download the data set

Before we get rolling with the EDA, we want to download our data set. For this example, we are going to use the dataset produced by my recent science, technology, art and math (STEAM) project.

#Load the readr library to bring in the dataset
library(readr)

#Download the data set
df = read_csv('', col_names = TRUE)

Now that we have the data set loaded, it’s time to run some very simple commands to preview the data set and its structure.


To begin, we are going to run the head function, which allows us to see the first 6 rows by default. We are going to override the default and ask to preview the first 10 rows.

head(df, 10)


dim and Glimpse

Next, we will run the dim function which displays the dimensions of the table. The output takes the form of row, column.

And then we run the glimpse function from the dplyr package. This will display a vertical preview of the dataset. It allows us to easily preview data type and sample data.

dim(df)

#Displays the type and a preview of all columns as a row so that it's very easy to take in.
library(dplyr)
glimpse(df)



We then run the summary function to show each column, its data type and a few other attributes, which are especially useful for numeric attributes. We can see that for all the numeric attributes, it also displays min, 1st quartile, median, mean, 3rd quartile and max values.
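The call itself is a one-liner. A sketch using the built-in iris data as a stand-in for df:

```r
# summary() works on any data frame; numeric columns get min, quartiles,
# median, mean and max, while factor columns get counts
summary(iris)
```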




Next we run the skim function from the skimr package. The skim function is a good addition to the summary function. It displays most of the numerical attributes from summary, but it also displays missing values, more quantile information and an inline histogram for each variable!

library(skimr)
skim(df)


create_report in DataExplorer

And finally the pièce de résistance, the main attraction and the reason I wrote this blog: the create_report function in the DataExplorer package. This awesome one-line function will pull a full data profile of your data frame. It will produce an html file with the basic statistics, structure, missing data, distribution visualizations, correlation matrix and principal component analysis for your data frame! I recently learned about this function in a workshop given by Steph Locke hosted by R Ladies Austin. This function is a game changer!

library(DataExplorer)
DataExplorer::create_report(df)


Thanks for reading along while we explored some simple EDA in R.  Please share your thoughts and creations with me on twitter

Note that the full code is available on my  github repo.  If you have trouble downloading the file from github, go to the main page of the repo and select “Clone or Download” and then “Download Zip”.


To leave a comment for the author, please follow the link and comment on their blog: Little Miss Data.

“A Guide to Working With Census Data in R” is now Complete!

Mon, 11/05/2018 - 15:00

(This article was first published on R –, and kindly contributed to R-bloggers)

Two weeks ago I mentioned that I was clearing my calendar until I finished writing A Guide to Working with Census Data in R. Today I’m happy to announce that the Guide is complete!

I even took out a special domain for the project:

The Guide is designed to address two problems that people who work with R and US Census Data face:

  1. Understanding what data the Census Bureau publishes.
  2. Understanding what CRAN packages are available to help with their project.

The Guide details the five most popular datasets that the Census Bureau publishes. It also describes the seven most popular R packages for working with Census data.

The best part? The Guide is free, and lists ways for you to continue learning about both the datasets and R packages that it mentions.

A special thank you to the R Consortium for funding this project!

I hope that you enjoy reading the Guide!

Read the Guide

The post “A Guide to Working With Census Data in R” is now Complete! appeared first on


To leave a comment for the author, please follow the link and comment on their blog: R –

Suppressed data (left-censored counts) by @ellis2013nz

Mon, 11/05/2018 - 14:00

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

This is a post about dealing with tables that have been subject to cell suppression or perturbation for confidentiality reasons. While some of the data world is only just waking up to issues of confidentiality and privacy, national statistics offices and associated researchers have, over some decades, developed a whole field of knowledge and tools in the area of “statistical disclosure control”. I found Duncan, Elliot and Salazar-Gonzalez’ Statistical Confidentiality: Principles and Practice an excellent text on this subject when I had to get on top of it late in 2017, while working on Stats NZ’s Integrated Data Infrastructure (IDI) output release processes.

Why does one need to suppress data when it has already been aggregated and isn’t sensitive itself? Because we fear “data snoopers” (yes, that really is the correct technical term) who will combine our published data with other datasets – public or private – to reveal sensitive information. For example, if we publish a table that shows there is only one hairdresser in Smallville, that in itself isn’t sensitive; but if we combine it with a separate dataset showing the average income of hairdressers in Smallville we have given away their private information. The thing that makes statistical disclosure control so hard is that we don’t know what other datasets the data snooper may have available to them to use in combination with our data, however innocuous it looks on its own.

Suppression in context

National stats offices rarely rely alone on cell suppression. One reason is that unless analysis is very very tightly controlled, the secondary suppressions that are needed to avoid uncovering the true value of a suppressed cell quickly get out of hand. For example, in this table which I’ll be returning to later, we would have to suppress not only number of tigers and bears in region B, but also the total number of tigers and bears:

region   lions   tigers   bears
A            7        6      13
B            9       <6      <6
C           20       24      33
D           18        8      90

If it were critical that we publish the total number of tigers and bears, then we would need to suppress at least one more region count for each animal. So the numbers of tigers and bears in Region A would probably also be suppressed, not because it is less than six but because revealing it would let a snooper with the marginal totals deduce the counts in Region B. With the hierarchical tables and various slices and dices that official statistics practice usually requires, this procedure soon gets pretty complicated.

There is software to help calculate the best secondary cell suppressions (such as the sdcTable R package), but sometimes there are practical situations in which it would be difficult to apply, and for this or other reasons we need more techniques than just suppression. One key situation is when a significant number of analysts, not under the direct control of the agency, have a degree of access to the original microdata. Two examples I’m aware of are:

  • the Australian Bureau of Statistics’ Census Table Builder is an online tool (with free guest access) that lets you slice and dice granular census microdata, without seeing the original data but delivering aggregate tables of counts. The ABS can’t control which combinations of filters and slices will be applied, so secondary cell suppression isn’t realistic. Instead, results are reported all the way down to zero, but have been perturbed so one never knows exactly what the correct value is (other than within a small margin, of course).
  • the Stats NZ data lab allows access to de-identified microdata including the extraordinary Integrated Data Infrastructure which combines inter alia tax, benefits, health, education and justice data at the individual level. While the data are rigorously de-identified (researchers cannot see names, tax or other identification numbers, or addresses more granular than meshblock), a snooper with full access could probably combine results from the IDI with external datasets to identify sensitive information. Because of this, access is tightly controlled to secure laboratories and approved researchers, who are responsible for confidentialisation of their results. Results have to go through an “output checking” process to be sure appropriate confidentialisation has been applied. There is a 37 page Microdata Output Guide for researchers to follow throughout this process. Many of the rules involve suppression and secondary suppression, but these are not the only tools relied on. For example, counts are perturbed by means of random rounding.

Random rounding is an interesting case. Base 3 random rounding works by rounding a number to the nearest number divisible by 3 with 2/3 probability, and to the second nearest with 1/3 probability. If the number is already divisible by 3 it stays the same. So this means that of all the numbers that might end up being published as a six, “4” has a 1/3 chance of being published as 6; “5” has 2/3; “6” has 3/3; “7” has 2/3; and “8” has 1/3.
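A small function makes the mechanics concrete. This is my own sketch of base 3 random rounding as just described, not Stats NZ’s implementation:

```r
# Sketch of base 3 random rounding: a multiple of 3 is unchanged; otherwise
# round to the nearest multiple of 3 with probability 2/3 and to the second
# nearest with probability 1/3
rr3 <- function(x) {
  remainder <- x %% 3
  if (remainder == 0) return(x)
  below <- x - remainder
  above <- below + 3
  if (remainder == 1) {
    sample(c(below, above), size = 1, prob = c(2, 1) / 3)  # below is nearest
  } else {
    sample(c(below, above), size = 1, prob = c(1, 2) / 3)  # above is nearest
  }
}

table(replicate(10000, rr3(7)))  # roughly two thirds 6, one third 9
```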

Simple application of Bayes’ formula indicates that if we started with equal chances for the real value being anything from 4 to 8, then the probabilities of a published six really meaning 4, 5, 6, 7 and 8 are 1/9, 2/9, 3/9, 2/9, 1/9 respectively. But were the starting chances equal? Not really. In fact, I’m sure the number “8” occurs in the sort of cross tabs that get cell suppression many times more than the number “4”. I don’t have proof of this, just an intuition based on the process of deciding what categories are used in the first place, which tables are thought worth publishing, etc. Obviously if half the numbers in a table were small enough to be suppressed, pretty soon we would stop bothering to publish the table and find other categories with higher cell counts.

If we change our priors from equal chances to something more realistic, we get quite a different picture:

The inference process here can be thought of as: “Before we even see the table, we know it is a table of counts. We know that most published counts of this sort are substantial numbers, at least in double figures. Then we look at this particular cell, and see that it is published as a random rounded 6. We update our prior, with all possibilities lower than 4 and greater than 8 automatically becoming zero. Then we combine our updated prior with the chances of each value being published as 6, and get a posterior distribution for the underlying number that relies on both our starting assumptions and the process of random rounding.”

We’ll come back to this line of thinking when we impute some numbers that are suppressed completely.

Here’s the code that generated the graphic above and sets us up for the rest of the session.

library(tidyverse)
library(scales)
library(knitr)
library(MASS)  # for corresp
library(Cairo)
library(png)
library(VGAM)
library(grid)
library(ggrepel)
library(mice)
library(testthat)

#---------random rounding base 3-----------------
# probability of observing a six, given actual values 4, 5, 6, 7, 8
prob <- c(1, 2, 3, 2, 1) / 3

prior15 <- dpois(4:8, 15)
prior30 <- dpois(4:8, 30)

rr3_data <- data_frame(
  Value = 4:8,
  `Equal probabilities` = prob / sum(prob),
  `Poisson with mean of 15` = prob * prior15 / sum(prior15 * prob),
  `Poisson with mean of 30` = prob * prior30 / sum(prior30 * prob)
) %>%
  gather(prior, post_prob, -Value)

# check our probabilities add up to 1
expect_equal(as.vector(tapply(rr3_data$post_prob, rr3_data$prior, sum)),
             c(1, 1, 1))

rr3_data %>%
  ggplot(aes(x = Value, weight = post_prob)) +
  geom_bar(fill = "steelblue", alpha = 0.8) +
  facet_wrap(~prior) +
  labs(y = "Posterior probability this is the actual value") +
  ggtitle("A value that has been random rounded (base 3) for statistical disclosure control",
          "Distribution of true value when published value is 6, with three different prior expectations")

Simulated suppressed data

Even though national stats offices rarely rely just on cell suppression, agencies that need to safely publish aggregate data but lack the resources and expertise of a stats office will often use it as their main method to provide a certain minimal level of confidentialisation. So it’s of some interest to look at the case of how to analyse a table that has had cells suppressed but no other statistical disclosure control. I am using two such datasets, from completely different sources on different projects, in my work at the moment.

Earlier in this post I introduced this table with two suppressed cells:

Data as it is published

region   lions   tigers   bears
A            7        6      13
B            9       <6      <6
C           20       24      33
D           18        8      90

I generated this from a model with a strong relationship between animal type and region, and I also know both the true observed values (as though I were the statistical office managing suppression and publication) and the expected values from the data generating process (as though I were God).

The actual data

region   lions   tigers   bears
A            7        6      13
B            9        1       5
C           20       24      33
D           18        8      90

The expected values from the data generating process

region   lions   tigers   bears
A          7.4      4.8    13.2
B          6.2      1.6    11.6
C         15.9     22.8    31.4
D         17.8     10.7    79.1

The data was generated by means of a generalized linear model with a Poisson response (a classic and critical model to understand when dealing with tables of count data), this way:

data <- expand.grid(animals = c("lions", "tigers", "bears"),
                    region = LETTERS[1:4],
                    count = 1) %>%
  as_tibble() %>%
  mutate(animals = fct_relevel(animals, "lions"))

n <- nrow(data)
mm <- model.matrix(count ~ animals * region, data = data)

# for reproducibility:
set.seed(123)

# generate true coefficients of the model:
true_coefs <- c(2, runif(n - 1, -1, 1))

# generate the actual data and make the suppressed version:
data <- data %>%
  mutate(expected = exp(mm %*% true_coefs),
         count = rpois(n, lambda = exp(mm %*% true_coefs)),
         censored_count = ifelse(count < 6, "<6", count),
         censored_count_num = as.numeric(censored_count),
         count_replaced = ifelse(count < 6, 3, count))

# data as it is published:
data %>%
  dplyr::select(animals, region, censored_count) %>%
  spread(animals, censored_count) %>%
  kable

# actual underlying data:
data %>%
  dplyr::select(animals, region, count) %>%
  spread(animals, count) %>%
  kable

# actual underlying data generating process expectations:
data %>%
  dplyr::select(animals, region, expected) %>%
  mutate(expected = round(expected, 1)) %>%
  spread(animals, expected) %>%
  kable

Correspondence analysis and tests of independence

The most common statistical test done on this sort of data is a Chi-square test, which tests the plausibility of the row and column variables being independent of each other. It’s equivalent to comparing a generalized linear model with a Poisson response that has an interaction effect between two variables (count ~ animals + region + animals:region) with one that doesn’t (count ~ animals + region). As I generated the data with the fully saturated model, I know that the correct decision is to reject the null hypothesis of “no interaction”. But what’s a general way to perform this test on a table with data that is missing, very much not at random but missing because it is below a known threshold?
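That equivalence is easy to verify directly. Here is a sketch using the true (unsuppressed) counts from the table above; the chi-square statistic and the Poisson deviance comparison won’t match exactly, but they test the same hypothesis:

```r
# Chi-square test of independence vs. comparison of nested Poisson GLMs,
# using the true (unsuppressed) counts from the table above
tab <- matrix(c(7, 6, 13,
                9, 1, 5,
                20, 24, 33,
                18, 8, 90),
              nrow = 4, byrow = TRUE,
              dimnames = list(region = LETTERS[1:4],
                              animals = c("lions", "tigers", "bears")))

chisq.test(tab)  # will warn that some expected counts are small

# the same hypothesis as a likelihood ratio test on nested Poisson models
long <- as.data.frame(as.table(tab))
names(long)[3] <- "count"
full <- glm(count ~ region * animals, family = poisson, data = long)
reduced <- glm(count ~ region + animals, family = poisson, data = long)
anova(reduced, full, test = "LRT")
```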

A second technique I want to look at is correspondence analysis, coming from a more exploratory paradigm of data analysis rather than basing itself on statistical tests. Correspondence analysis works well as a graphic summary of a two-way cross tab. Here’s the biplot from correspondence analysis of the true observed data in our case:

If you’re not familiar with this sort of plot and you deal with count data in cross tabs, I strongly recommend the technique. At a glance we see that bears are distinctively associated with region D, tigers with C, and lions with B and A (although A is a more typical region than the others, sitting close to the cross in the centre of the chart). You can confirm this by looking at the original table and calculating row-wise and column-wise percentages, but that is nowhere near as easy a snapshot as the graphic provides.

As with the Chi-square test, we need a full table of counts to do this analysis.

With only two cells suppressed, there are only 36 possibilities for what they could be (each cell could really be any number from zero to five inclusive). So it’s quite feasible to perform analysis on 36 different possible data sets. If the results are substantively the same for all combinations, we can conclude it doesn’t matter what the suppressed cells really are. This is a nice, common sense and easily explainable approach.

For the correspondence analysis, the logical way to show all 36 possibilities is via an animation:

For the Chi-square tests the results are numeric, so we can summarise them with standard methods including maximum and minimum values. Here are the summaries of the p values from the Fisher exact test for independence (pvalues_f) and the Chi-squared test (pvalues_c):

   pvalues_f            pvalues_c        
 Min.   :0.0004998   Min.   :4.810e-11  
 1st Qu.:0.0004998   1st Qu.:6.381e-09  
 Median :0.0004998   Median :3.989e-08  
 Mean   :0.0004998   Mean   :1.480e-07  
 3rd Qu.:0.0004998   3rd Qu.:1.730e-07  
 Max.   :0.0004998   Max.   :8.438e-07  

For both the correspondence analysis and the tests of independence, the results don’t vary materially when we change the values in the suppressed cells, so we could proceed to inference with confidence.

Here’s the code that did those two types of analysis with a full set of possible values plugged into the suppressed cells:

# make a traditional cross tab matrix of the data:
tab <- data %>%
  dplyr::select(animals, region, count) %>%
  spread(animals, count) %>%
  dplyr::select(lions, tigers, bears) %>%
  as.matrix()
rownames(tab) <- LETTERS[1:4]

# biplot with the actual full values (obviously not usually available to analysts):
par(font.main = 1, col.axis = "grey80", fg = "grey80")
xl <- c(-1.2, 0.5)
yl <- c(-1.1, 0.4)
biplot(corresp(tab, nf = 2),
       main = "Full data including suppressed values\n",
       col = c("orange", "steelblue"),
       xlim = xl, ylim = yl)

# read in image of the table for including in the animated graphic:
table_image <- readPNG("0138-table-image.png")

# set up empty vectors to store the results of tests of independence:
pvalues_f <- numeric()
pvalues_c <- numeric()

# create a temporary folder for storing frames of the animation:
dir.create("tmp")
pd <- setwd("tmp")

# we can use brute force and see if the results are statistically significant
# for all possible values. We'll do chi square tests and correspondence analysis
# at the same time
for(i in 0:5){
  for(j in 0:5){
    tab[2, 2] <- i # tigers in B
    tab[2, 3] <- j # bears in B
    pvalues_f <- c(pvalues_f, fisher.test(tab, simulate.p.value = TRUE)$p.value)
    pvalues_c <- c(pvalues_c, chisq.test(tab)$p.value)

    CairoPNG(paste0(i, j, ".png"), 2000, 2000, res = 300)
    par(font.main = 1, col.axis = "grey80", fg = "grey80", mar = c(2, 1, 4, 1))
    biplot(corresp(tab, nf = 2),
           col = c("orange", "steelblue"),
           main = paste0("Imputed values: tigers in B = ", i, "; bears in B = ", j, ".\n"),
           xlim = xl, bty = "n", ylim = yl)
    grid.raster(table_image, x = 0.66, y = 0.2, width = 0.3)
  }
}

system('magick -loop 0 -delay 40 *.png "0138-corresp.gif"')

# move the result to where I use it for the blog
file.rename("0138-corresp.gif", "../..")
setwd(pd)

# clean up (uncomment the last line if happy to delete the above results):
# unlink("tmp", recursive = TRUE)

# p-values:
summary(cbind(pvalues_f, pvalues_c))

Generalized linear model

For a more general approach, I wanted to be ready for a situation with more than two categorical variables, and where the number of suppressed cells is too large to repeat the analysis with all combinations. As a use case, I fit a generalized linear model where my interest is in the values of the coefficients (i.e. the tiger effect relative to lions, the bear effect, the region B effect relative to region A, and so on, plus their interaction effects, with “count” as the response variable).

One thought is to treat the missing data as a parameter, constrained to be an integer from zero to five, and estimate it simultaneously with the main model in a Bayesian framework. In principle this would work, except the missing data has to be an integer (because it then forms part of a Poisson or negative binomial distribution as part of the main model) and it is difficult to estimate such discontinuous parameters. You can’t do it in Stan, for example.

Another way of looking at the problem is to note that this is censored data, and to use methods developed specifically with that in mind. “Survival analysis” has of course developed methods for dealing with all sorts of censored data, most obviously “right-censored” data such as age at death when some of the subjects are still alive. Left-censored data usually arises when measurement instruments struggle at the lower end of values, but it is also an apt description of our suppressed counts. In R, the VGAM package provides methods for fitting generalized linear models to data that is left, right or interval censored. Caveat (and spoiler): my results from doing this are so bad that I suspect I was using it wrong, so read all references to this method below with even more than the usual caution!

The third family of methods I tried involved substituting values for the suppressed cells. At its simplest, I tried replacing them all with “3”. At the more complex end, I used multiple imputation to generate 20 different datasets with random values from zero to five in each suppressed cell; fit the model to each dataset; and pool the results. This is a method I use a lot with other missing data and is nicely supported by the mice R package. mice doesn’t have an imputation method (that I can see) that exactly meets my case of a number from zero to five, but it’s easy to write your own. I did this two different ways:

  • with a uniform chance for each number
  • with probabilities proportional to the density of a Poisson distribution with mean equal to the trimmed mean of the unsuppressed counts (the reasoning being the same as that used earlier in this post in discussing random rounding)

Here are the results of all those different methods. First, a summary of the mean squared error comparing coefficient estimates to their known true values:

… and second, with the coefficient-by-coefficient detail:

What we see from this single simulated dataset is:

  • the survival analysis method performed really badly. So badly that I think I must be doing it wrong somehow (I can’t see how), so treat this with caution.
  • simply substituting the number “3” for all suppressed values worked fine – in fact, as well as using the original unavailable data or the most sophisticated multiple imputation method
  • multiple imputation with probabilities proportional to the Poisson density out-performed multiple imputation with sampling from a uniform distribution. Sampling the suppressed values from a uniform distribution gives particularly bad estimates for the coefficients related to the suppressed cells (the interactions of tigers with region B, and bears with region B).

Because this is a single simulated dataset, the conclusions above should be treated as only indicative. A good next step would be to generalise this by testing it with many datasets, but that will have to wait for a later blog post as this one is already long enough! My working hypothesis when I get to that is that multiple imputation with probabilities proportional to the Poisson density will be the best method, but I’m open to being wrong on that.
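When I get to that follow-up, the harness might look something like this sketch. It is illustrative scaffolding only: each run simulates a fresh table and compares just two of the treatments (the full data, and replacing suppressed cells with 3), with none of the imputation methods included.

```r
# One simulation: generate a table from the Poisson GLM, suppress small cells,
# fit the model under two treatments and return the MSE of the coefficients.
run_one <- function(seed) {
  set.seed(seed)
  d <- expand.grid(animals = c("lions", "tigers", "bears"),
                   region  = LETTERS[1:4])
  mm <- model.matrix(~ animals * region, data = d)
  true_coefs <- c(2, runif(ncol(mm) - 1, -1, 1))
  d$count <- rpois(nrow(d), lambda = exp(mm %*% true_coefs))
  d$count_replaced <- ifelse(d$count < 6, 3, d$count)

  full_fit <- glm(count ~ animals * region, data = d, family = "poisson")
  repl_fit <- glm(count_replaced ~ animals * region, data = d, family = "poisson")
  c(full     = mean((coef(full_fit) - true_coefs) ^ 2),
    replaced = mean((coef(repl_fit) - true_coefs) ^ 2))
}

results <- t(sapply(1:100, run_one))
colMeans(results)   # average MSE of coefficient estimates per treatment
```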

Here’s the code for the comparison of those different methods:

#====================different approaches to dealing with the missing data===================

#---------------regenerate data-----------
# doing this here to make this chunk of code easier for when I want to generalise
# this whole approach to multiple simulations
set.seed(123)
true_coefs <- c(2, runif(n - 1, -1, 1))
data <- data %>%
  mutate(expected = exp(mm %*% true_coefs),
         count = rpois(n, lambda = exp(mm %*% true_coefs)),
         censored_count = ifelse(count < 6, "<6", count),
         censored_count_num = as.numeric(censored_count),
         count_replaced = ifelse(count < 6, 3, count))

#--------------straightforward methods-----------
mod0 <- glm(count ~ animals * region, data = data, family = "poisson")
mod2 <- glm(count_replaced ~ animals * region, data = data, family = "poisson")

#------------with censored poisson regression-------------------------
# (this method does so badly that I can only assume I am using it wrongly)
data$z <- pmax(6, data$count)
data$lcensored <-$censored_count_num)
mod3 <- vglm(SurvS4(z, lcensored, type = "left") ~ animals * region,
             family = cens.poisson, data = data)

#------------------with multiple imputation----------
#' Imputation function for suppressed data for use with mice - Poisson-based
#'
#' @param y vector to be imputed
#' @param ry logical vector of observed (TRUE) and missing (FALSE) values of y
#' @param x design matrix. Ignored but is probably needed for mice to use.
#' @param wy vector that is TRUE where imputations are created for y. Not sure when
#' this is different to just !ry (which is the default).
mice.impute.desuppress <- function(y, ry, x, wy = NULL, max_value = 5, ...) {
  # during dev:
  # y <- data$censored_count_num; ry <- !
  if (is.null(wy)){
    wy <- !ry
  }
  # What are the relative chances of getting values from 0 to the maximum allowed
  # value, if we have a Poisson distribution which we have estimated the mean of
  # via trimmed mean?
  # (this is very approximate but probably better than just giving equal chances to 0:5)
  probs <- dpois(0:max_value, mean(y[ry], tr = 0.2))
  return(sample(0:max_value, sum(wy), prob = probs, replace = TRUE))
}

#' Imputation function for suppressed data for use with mice - simple
#'
mice.impute.uniform <- function(y, ry, x, wy = NULL, max_value = 5, ...) {
  if (is.null(wy)){
    wy <- !ry
  }
  return(sample(0:max_value, sum(wy), replace = TRUE))
}

m <- 20 # number of imputed/complete datasets to generate

data_imp1 <- mice(data[ , c("censored_count_num", "animals", "region")],
                  method = "desuppress", print = FALSE, m = m)
imp_y1 <- data_imp1$imp$`censored_count_num`
imp_y1 # visual check

data_imp2 <- mice(data[ , c("censored_count_num", "animals", "region")],
                  method = "uniform", print = FALSE, m = m)
imp_y2 <- data_imp2$imp$`censored_count_num`
imp_y2 # visual check

mod_mice1 <- with(data_imp1, glm(censored_count_num ~ animals * region, family = "poisson"))
coef_mice1 <- pool(mod_mice1)$pooled$estimate
mod_mice2 <- with(data_imp2, glm(censored_count_num ~ animals * region, family = "poisson"))
coef_mice2 <- pool(mod_mice2)$pooled$estimate

#---------------results----------------------
# comparison data
d <- data_frame(underlying = true_coefs,
                `Using full data including suppressed values (not usually possible!)` = coef(mod0),
                `Replacing suppressed values with 3` = coef(mod2),
                `Left-censored survival-based method` = coef(mod3),
                `MICE with Poisson proportional probabilities` = coef_mice1,
                `MICE with uniform probabilities` = coef_mice2,
                labels = names(coef(mod0))) %>%
  mutate(labels = gsub("animals", "", labels),
         labels = gsub("region", "", labels)) %>%
  gather(method, value, -underlying, -labels) %>%
  mutate(method = str_wrap(method, 25)) %>%
  mutate(value = ifelse(, 0, value))

# summary data:
d2 <- d %>%
  mutate(square_error = (value - underlying) ^ 2) %>%
  group_by(method) %>%
  summarise(mse = mean(square_error),
            trmse = mean(square_error, tr = 0.2)) %>%
  ungroup() %>%
  arrange(trmse) %>%
  mutate(method = fct_reorder(method, trmse))

# summary graphic:
ggplot(d2, aes(x = mse, y = trmse, label = method)) +
  geom_point(size = 2) +
  geom_text_repel(colour = "steelblue") +
  labs(x = "Mean squared error of coefficient estimates",
       y = "Trimmed mean squared error of coefficient estimates") +
  ggtitle("Comparison of different methods of handling suppressed counts in a frequency table",
          "Summary results of models fitted after different treatment methods")

# coefficients graphic:
d %>%
  mutate(method = factor(method, levels = levels(d2$method))) %>%
  ggplot(aes(x = underlying, y = value, label = labels)) +
  facet_wrap(~method, nrow = 1) +
  geom_abline(slope = 1, intercept = 0, colour = "grey50") +
  geom_point() +
  geom_text_repel(colour = "steelblue") +
  labs(x = "Correct value of coefficient in underlying data generating process",
       y = "Value estimated from a GLM with one particular method of dealing with censored data") +
  coord_equal() +
  ggtitle("Comparison of different methods of handling suppressed counts in a frequency table",
          "Coefficient estimates from models fitted after different treatment methods")

To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Data Science With R Course Series – Week 8

Mon, 11/05/2018 - 09:00

(This article was first published on - Articles, and kindly contributed to R-bloggers)

This is a fun part of the course where you learn how to add value to the business by providing ROI-Driven Data Science!

Week 8 will teach you how to calculate a simple policy change – Implementing a No Overtime Policy. You will calculate the expected savings compared to the current policy of allowing overtime.

Next, develop a more sophisticated policy change – Implementing a Targeted Overtime Policy – wherein you target high-flight risk employees.

We’ll teach you everything you need to know about the Expected Value Framework so you can begin to implement this for ANY BINARY CLASSIFICATION MODEL. This includes Customer Churn, Targeted Advertisements, and more!

Here is a recap of our trajectory and the course overview:

Recap: Data Science With R Course Series

You’re in Week 8: Link Data Science To Business With Expected Value. Here’s our game-plan over the 10 articles in this series. We’ll cover how to apply data science for business with R following our systematic process.

Week 8: Link Data Science To Business With Expected Value

Student Feedback

Week 8: Link Data Science To Business With Expected Value Overview & Setup

This overview will teach you how to quantify the business return on investment (ROI) using the Expected Value Framework.

The Expected Value Framework is a way to apply an expected value to a classification model – it connects a machine learning classification model to ROI for the business.

Learn how to combine:

  1. The threshold,
  2. Knowledge of costs and benefits, and
  3. The confusion matrix converted to expected error rates to account for the presence of false positives and false negatives.

We can use this combination to calculate the business savings for implementing a policy as a result of your data science work.
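As a toy illustration of the three ingredients above (all rates and dollar figures below are invented for illustration, not taken from the course), the expected value is the probability-weighted sum of the four confusion-matrix outcomes:

```r
# Invented per-employee outcome rates at some chosen threshold:
rates <- c(tp = 0.15,  # flagged, and would genuinely have left
           fp = 0.10,  # flagged, but would have stayed anyway
           fn = 0.05,  # missed a leaver
           tn = 0.70)  # correctly left alone

# Invented dollar value of each outcome:
values <- c(tp =  30000,  # benefit of retaining a would-be leaver
            fp =  -5000,  # cost of intervening unnecessarily
            fn = -60000,  # cost of losing an employee we missed
            tn =      0)

expected_value <- sum(rates * values)
expected_value  # per-employee expected savings of the policy
```

With these made-up numbers the expected value works out to 1000 dollars per employee; the same arithmetic applies to any binary classification problem once you can put costs and benefits on the four outcomes.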

Calculating Expected ROI: No Overtime Policy

Over the past few weeks you have created a machine learning model to predict which employees are likely to leave. Through your analysis, you determined the number one cause of employee turnover is working overtime hours.

In this module, you will create a baseline calculation to determine how much the business will save if they completely remove overtime for all employees.

Targeting By Threshold Primer

Targeting by threshold will allow you to target employees above a certain level of turnover risk to pinpoint those who are most likely to leave.

Use the calculation to determine the expected value to find the optimal threshold that will maximize business savings for implementing the policy.
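A sketch of what that search can look like (the risk scores and dollar figures here are invented stand-ins, not the course's data or method): compute the savings at a grid of thresholds and keep the one that maximises them.

```r
set.seed(1)
prob_leave <- runif(200)                  # stand-in for predicted turnover risk
actual     <- rbinom(200, 1, prob_leave)  # stand-in for who actually leaves

# invented economics: each correctly targeted leaver saves money,
# each targeted employee costs an intervention
savings_at <- function(thresh, save_per_catch = 30000, cost_per_target = 5000) {
  targeted <- prob_leave >= thresh
  sum(targeted & actual == 1) * save_per_catch - sum(targeted) * cost_per_target
}

thresholds <- seq(0, 1, by = 0.05)
savings    <- vapply(thresholds, savings_at, numeric(1))
best       <- thresholds[which.max(savings)]
best  # threshold that maximises expected savings in this toy example
```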

Calculating Expected ROI: Targeted Overtime Policy

Using the threshold from the previous module, learn how to apply the expected value to employees above a certain probability to leave.

You will also compare the targeted overtime policy to the previous no overtime policy to calculate the cost difference between each policy.

This will enable you to clearly communicate the savings for implementing overtime policy.

You Need To Learn R For Business

To be efficient as a data scientist, you need to learn R. Take the course that has cut data science project times in half (see this testimonial from a leading data science consultant) and has progressed data scientists more than anything they have tried before. Over 10 weeks you learn what it has taken data scientists 10 years to learn:

  • Our systematic data science for business framework
  • R and H2O for Machine Learning
  • How to produce Return-On-Investment from data science
  • And much more.

Start Learning Today!

Next Up

The next article in the Data Science With R Series covers Expected Value Optimization And Sensitivity Analysis.

This is a really fun series of chapters that teach you the skills to align your data science work with business ROI.

This week’s targeted analysis leads into Week 9, where you will perform two advanced analyses that are critical in the course:

  • Threshold Optimization – A method used to maximize expected saving via iteratively calculating savings at various thresholds

  • Sensitivity Analysis – A method used to investigate how sensitive the expected savings is to various parameter values that were created based on assumptions

Week 9: Expected Value Optimization And Sensitivity Analysis

New Course Coming Soon: Build A Shiny Web App!

You’re experiencing the magic of creating a high performance employee turnover risk prediction algorithm in DS4B 201-R. Why not put it to good use in an Interactive Web Dashboard?

In our new course, Build A Shiny Web App (DS4B 301-R), you’ll learn how to integrate the H2O model, LIME results, and recommendation algorithm building in the 201 course into an ML-Powered R + Shiny Web App!

Shiny Apps Course Coming in October 2018!!! Sign up for Business Science University Now!

Building an R + Shiny Web App, DS4B 301-R

Get Started Today!


To leave a comment for the author, please follow the link and comment on their blog: - Articles.

India vs US – Kaggle Users & Data Scientists

Mon, 11/05/2018 - 01:12

(This article was first published on R programming – Journey of Analytics, and kindly contributed to R-bloggers)


This is an analysis of the Kaggle 2018 survey dataset. In my analysis I am trying to understand the similarities and differences between men and women users from the US and India, since these are the two biggest segments of the respondent population. The number of respondents who chose something other than Male/Female is quite low, so I excluded that subset as well.

The complete code is available as a kernel on the Kaggle website. If you like this post, do login and upvote!   This post is a slightly truncated version of the Kernel available on Kaggle.

You can also use the link to go to the dataset and perform your own explorations. Please do feel free to use my code as a starter script.


Kaggle users – India vs US

Couple of disclaimers:

  • I am NOT intending to say one country is better than the other; I am just exploring the profiles this specific dataset shows.
  • It is very much possible that there is a response bias, and that the differences are due to the nature of the people who are on the Kaggle site and who answered the survey.

With that out of the way, let us get started. If you like the analysis, please feel free to fork the script and extend it further. And do not forget to upvote!


Some questions that the analysis tries to answer are given below:
a. What is the respondent demographic profile for users from the 2 countries – men vs women, age bucket?
b. What is their educational background and major?
c. What are the job roles and coding experience?
d. What is the most popular language of use?
e. What is the programming language people recommend for an aspiring data scientist?

I deliberately did not compare salary because:
a. 16% of the population did not answer and 20% chose “do not wish to disclose”.
b. the lowest bracket is 0-10k USD, so the max limit of 10k translates to about INR 7,00,000 (7 lakhs), which is quite high. A software engineer entering the IT industry probably makes around 4-5 lakhs per annum, and software engineers earn much more than most others in India. So comparing against US salaries feels like comparing apples to oranges. [Assuming an exchange rate of 1 USD = 70 INR.]

Calculations / Data Wrangling:
  1. I’ve aggregated the age buckets into a smaller number of segments, because the number of respondents tapers off in the higher age groups. The segments are quite self-explanatory, as you can see from the ifelse() clauses below:

# age - groupings:
ctry_cmp$age_bucket <- ifelse(
  ctry_cmp$Q2 %in% c("18-21", "22-24"), "age1_[18-24]",
  ifelse(ctry_cmp$Q2 %in% c("25-29", "30-34"), "age2_[25-34]",
    ifelse(ctry_cmp$Q2 %in% c("35-39", "40-44"), "age3_[35-44]",
      ifelse(ctry_cmp$Q2 %in% c("45-49", "50-54"), "age4_[45-54]",
  2. Similarly, cleaned up the special characters in the educational qualifications. Also added a tag to the empty values in the following variables – jobrole (Q6), exp_group (Q8), proj(Q40), years in coding (Q24), major (Q5).
  3. I also created some frequency tables using the sqldf() function. You could use summarise() from the dplyr package instead; it really is a matter of choice.
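For example, a frequency table of the kind used throughout this post can be built with dplyr alone (the data frame below is a toy stand-in, not the actual survey):

```r
library(dplyr)

toy <- data.frame(country = c("US", "India", "US", "India", "US"),
                  gender  = c("F", "M", "M", "F", "F"))

toy %>%
  count(country, gender) %>%                     # respondents per group
  group_by(country) %>%
  mutate(pct = round(100 * n / sum(n), 1)) %>%   # share within each country
  ungroup()
```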
Observations

Gender composition:

As seen in the chart below, many more males (~80%) responded to the survey than women (~20%).
Among women, almost two-thirds (~62%) are from the US, and only ~38% from India.
The men are almost split 50/50 among US and India.


Age composition:

There is a definite trend showing that the Indian respondents are quite young, with both men and women showing >54% in the youngest age bucket (18-24), and another ~28% falling in the (25-29) category. So almost 82% of the population is under 30 years of age.
Among US respondents, the women seem a bit younger, with 68% under 30 years compared to ~57% of the men. However, both US men and women had a larger segment in the 55+ category (~20% for women and ~25% for men). Compare that with the Indian respondents, where the 55+ group is barely 12%.


Educational background:

Overall, this is an educated lot, and most had a bachelors degree or more.
US women were the most educated of the lot, with a whopping 55% with masters degrees and 16% with doctorates.
Among Indians, women had higher levels of education – 10% with a Ph.D. and 43% with a masters degree, compared with men, of whom ~34% had a masters degree and only 4% had a doctorate.
Among US men, ~47% had a masters degree, and 19% had doctorates.
This is interesting because Indians are younger compared to US respondents, so many more Indians seem to be pursuing advanced degrees.

Undergrad major:

Among Indians, the majority of respondents listed Computer Science as their major. Possible reasons:

  1. Indians have to declare a major when they join college, and the choice of majors is not as wide as in the US.
  2. Parents tend to steer kids towards majors known to translate into a decent-paying job, which usually means engineering or medicine.
  3. A case of response bias? The survey came from Kaggle, so non-coding majors may not have bothered to respond.

Among US respondents, the top major is also computer science, followed by maths & stats for women. For men, the second category was a tie between non-compsci engineering and maths & stats.


Job Roles:

Among Indians, the biggest segment is students (30%). Among Indian men, the second-biggest category is software engineer.
Among US women, the biggest category was also “student”, followed quite closely by “data scientist”. Among US men, the biggest category was “data scientist” followed by “student”.
Note: the “other” category is something we created during cleanup, so I am not considering it; it is not the biggest category for any sub-group anyway.
CEOs, not surprisingly, are male, 45+ years old, from the US, with a masters degree.


Coding Experience:

Among Indians, most answered <1 year of coding experience , which correlates well with the fact that most of them are under 30, with a huge population of students.
Among US respondents, the split is even between 1-2 years of coding and 3-5 years of coding.
Men seem to have a bit more coding experience than women, again explained by the fact that the US women were slightly younger overall than the US men.


Most popular programming language:

Python is the most popular language, discounting the number of people who did not answer. However, among US women, R is also popular (16% favoring it).

I found this quite interesting because I’ve always used R at work, at multiple big-name employers (Nasdaq, TD Bank, etc.). Plus, a lot of companies that used SAS seem to have found it easier to move code to R. Again, this is personal opinion.
Maybe it is also because many colleges teach Python as a starting programming language?


  1. Overall, Indians tended to be younger with more people pursuing masters degrees.
  2. US respondents tended to be older with stronger coding experience, and many more are practicing data scientists.
    This seems like a great opportunity for Kaggle, if they could match the Indian students with the US data scientists, in a sort of mentor-matching service.

To leave a comment for the author, please follow the link and comment on their blog: R programming – Journey of Analytics.

RStudio 1.2 Preview – New Features in RStudio Pro

Mon, 11/05/2018 - 01:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

Today, we’re continuing our blog series on new features in RStudio 1.2. If you’d like to try these features out for yourself, you can download a preview release of RStudio Pro 1.2.

We’ve added some great new features to RStudio Pro for v1.2, which includes not only Server Pro, but also the new and improved Pro Desktop. Let’s get started!

RStudio Server Pro The Job Launcher

Perhaps the biggest new change in v1.2 is the Job Launcher. This allows you to run RStudio sessions and ad-hoc R scripts within your already existing cluster workload managers, such as Kubernetes, allowing you to leverage your existing infrastructure instead of provisioning load balancer nodes manually. At release, we will support the following clusters:

The following diagram shows an example of how you can use the Job Launcher with Kubernetes to painlessly scale RStudio Server Pro across potentially hundreds of nodes.

When starting RSP sessions via the Launcher, users will still use the same home page that they are familiar with, but will have additional options for controlling the creation of their sessions within your distributed environment of choice.

We determined that most RSP users were already using Slurm and Kubernetes, so integration with them was added first. However, the Job Launcher is an extensible system that makes it fairly simple to develop plugins to target different cluster types. We plan to develop more plugins in the future, and would love to hear from you about what we should tackle next! At present, we plan to add support for LSF.

For more information on launching ad-hoc jobs, see our upcoming blog post on background jobs. For more information on using the Job Launcher with RStudio Server Pro, see the documentation.

Improved R Version Management

We’ve improved management of various versions of R within your environments, allowing you to:

  • Label each version so users can have a friendly name associated with each version. This makes it easy to differentiate between similar versions for different environments, such as when running parallel versions of Microsoft R and Vanilla R.
  • Execute an arbitrary script when the version is loaded, perhaps to dynamically alter any important environment variables (such as LD_LIBRARY_PATH)
  • Load an arbitrary environment module, if using Environment Modules (see environment modules)

When specifying a label, users will see the label on the home page, as well as within the session itself when accessing the version switch menu. The following screenshots show an example where the R version 3.1.3 was given the label My Special R Version.
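For reference, the version entries live in /etc/rstudio/r-versions, with one block per R version; the field names and paths below are illustrative from memory of the admin guide, so check the documentation for the authoritative syntax:

```ini
# /etc/rstudio/r-versions -- one entry per R version, separated by blank lines
Path: /opt/R/3.1.3
Label: My Special R Version
Script: /opt/R/3.1.3/

Path: /opt/R/3.2.0
Label: Vanilla R
Module: R/3.2.0
```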

For a more detailed guide on configuring R Versions, see the documentation.

Configuration Reload

We’ve added the ability to reload some of Server Pro’s configuration at run-time, without having to stop the server and interrupt users’ sessions. Currently, the following is supported:

  • Reloading /etc/rstudio/load-balancer to add new nodes or remove existing nodes. Note that nodes being removed need their RStudio processes stopped (run sudo rstudio-server stop on the node) before you invoke the configuration reload.
  • Reloading the list of server R versions specified in /etc/rstudio/r-versions.

In order to perform the configuration reload, simply edit the above files as desired and then send the SIGHUP signal to the rserver executable, like so:

pidof rserver | sudo xargs kill -s SIGHUP

RStudio Pro Desktop

With the release of RStudio v1.2 we are excited to announce the RStudio Pro Desktop, a fully licensed platform that provides enterprise users with an enhanced version of RStudio Desktop that comes with professional priority support. The Pro Desktop will be built on over time to include new capabilities and integrations with other RStudio professional products.

Bundled ODBC Drivers

Pro Desktop now adds support for installing the RStudio Pro Drivers for connecting to various ODBC data sources, such as MongoDB, Oracle, and PostgreSQL (just to name a few!).

Connecting to a database is simple – just click on the New Connection button under the Connections pane, and you’ll be greeted with a dialog from which to select your database type.

When connecting to a data source for the first time, you will be prompted to install the ODBC package. Simply click yes, and then you will be able to connect to many of the most popular databases available today!

For more information on database connectivity within RStudio Pro, see the documentation.

If you’re interested in giving the new RStudio Pro features a try, please download the RStudio 1.2 preview. For more detailed documentation on RStudio Pro features, see the admin guide.


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.