R-bloggers – R news and tutorials contributed by 750 R bloggers

New Skill Track: Tidyverse Fundamentals with R

Wed, 09/19/2018 - 18:00

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Here is the track link.

Track Details

In this track, you’ll learn the skills needed to get up and running with data science in R using the tidyverse. The tidyverse is a collection of R packages that share a common design philosophy and are designed to work together seamlessly, benefiting everyone from novices to seasoned data science professionals. You’ll begin with data wrangling and data visualization using the gapminder dataset in the Introduction to the Tidyverse course. Next, in Working with Data in the Tidyverse, you’ll learn about “tidy data” and see how to get data into tidy format using fun datasets from the British reality television series “The Great British Bake Off,” letting you see what’s cooking behind the scenes of so many R data analyses. After that, you’ll ease into general modeling concepts via regression using tidyverse principles in Modeling with Data in the Tidyverse, where you’ll explore Seattle housing prices and how different variables can be used to uncover patterns in those prices.

Throughout these first three courses, you’ll use the dplyr, ggplot2, and tidyr packages – the powerhouses of the tidyverse – and see the power of readable code. In Communicating with Data in the Tidyverse, you’ll learn how to further customize your ggplot2 graphics and use R Markdown to write reproducible reports while working with data from the International Labour Organization in Europe. The track closes with Categorical Data in the Tidyverse, which explores ways to handle the sometimes tricky concept of factors in data science with R, using datasets from Kaggle’s Data Science and Machine Learning Survey and the FiveThirtyEight flights dataset.

The goal of the track is for you to gain experience using the tools and techniques of the whole data science pipeline made famous by Hadley Wickham and Garrett Grolemund as shown below. You’ll gain exposure to each component of this pipeline from a variety of different perspectives in this track. We look forward to seeing you in the track!

Introduction to the Tidyverse

This is an introduction to the programming language R, focused on a powerful set of tools known as the "tidyverse". In the course, you’ll learn the intertwined processes of data manipulation and visualization through the tools dplyr and ggplot2. You’ll learn to manipulate data by filtering, sorting and summarizing a real dataset of historical country data in order to answer exploratory questions. You’ll then learn to turn this processed data into informative line plots, bar plots, histograms, and more with the ggplot2 package. This gives a taste both of the value of exploratory data analysis and the power of tidyverse tools. This is a suitable introduction for people who have no previous experience in R and are interested in learning to perform data analysis.

This course was even approved by Hadley himself.

Working with Data in the Tidyverse

In this course, you’ll learn to work with data using tools from the tidyverse in R. By data, we mean your own data, other people’s data, messy data, big data, small data – any data with rows and columns that comes your way! By work, we mean doing most of the things that sound hard to do with R, and that need to happen before you can analyze or visualize your data. But work doesn’t mean that it is not fun – you will see why so many people love working in the tidyverse as you learn how to explore, tame, tidy, and transform your data. Throughout this course, you’ll work with data from a popular television baking competition called "The Great British Bake Off."

Modeling with Data in the Tidyverse

In this course, you will learn to model with data. Models attempt to capture the relationship between an outcome variable of interest and a series of explanatory/predictor variables. Such models can be used for both explanatory purposes, e.g. "Does knowing professors’ ages help explain their teaching evaluation scores?", and predictive purposes, e.g., "How well can we predict a house’s price based on its size and condition?" You will leverage your tidyverse skills to construct and interpret such models. This course centers around the use of linear regression, one of the most commonly-used and easy to understand approaches to modeling. Such modeling and thinking is used in a wide variety of fields, including statistics, causal inference, machine learning, and artificial intelligence.

Communicating with Data in the Tidyverse

They say a picture is worth a thousand words. Indeed, successfully promoting your data analysis is not only a matter of accurate and effective graphics, but also of aesthetics and uniqueness. This course teaches you how to leverage the power of ggplot2 themes to produce publication-quality graphics that stand out from the mass of boilerplate plots out there. It shows you how to tweak and get the most out of ggplot2 in order to produce unconventional plots that draw attention on social media. In the end, you will combine that knowledge to produce a slick, custom-styled report with R Markdown and CSS – all within the powerful tidyverse.

Categorical Data in the Tidyverse

As a data scientist, you will often find yourself working with non-numerical data, such as job titles, survey responses, or demographic information. This type of data is qualitative: it can be ordinal, if its values have an inherent order, or categorical/nominal, if they don’t. R has a special way of representing such variables, called factors, and this course will help you master working with them using the tidyverse package forcats. We’ll also work with other tidyverse packages, including ggplot2, dplyr, stringr, and tidyr, and use real-world datasets such as the FiveThirtyEight flights dataset and Kaggle’s State of Data Science and ML Survey. Following this course, you’ll be able to identify and manipulate factor variables, quickly and efficiently visualize your data, and effectively communicate your results. Get ready to categorize!
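To make the ordinal-vs-nominal distinction concrete, here is a minimal base-R sketch (the survey responses are made up for illustration, not taken from the course data) of how R encodes an ordered qualitative variable as an ordered factor:

```r
# Hypothetical survey responses: qualitative data with a natural order
responses <- c("Agree", "Neutral", "Agree", "Disagree")

# Encode as an ordered factor so the order is part of the data,
# rather than the default alphabetical ordering of the labels
responses_f <- factor(responses,
                      levels = c("Disagree", "Neutral", "Agree"),
                      ordered = TRUE)

levels(responses_f)              # levels in the defined order
responses_f[1] > responses_f[4]  # TRUE: "Agree" ranks above "Disagree"
```

The forcats package covered in the course builds on this base representation, adding helpers for reordering, collapsing, and relabeling levels.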


To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: data science, big data, R jobs, visualization (ggplot2, boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, web scraping), statistics (regression, PCA, time series, trading) and more...

Official release of shiny.router and its new features

Wed, 09/19/2018 - 14:13

(This article was first published on r – Appsilon Data Science | We Provide End­ to­ End Data Science Solutions, and kindly contributed to R-bloggers)

In a web application, routing is the process of using URLs to drive the user interface. Routing adds flexibility when building complex, advanced web applications, allowing you to divide an app into separate sections.

New features

Contributing to open source is part of Appsilon’s mission. Last week we updated our i18n internationalization package; now it’s time for shiny.router. The shiny.router package provides an easy way to add routing to your Shiny application. Since the last release we have improved it and added some great features. Find them on the list below!

Routing moved fully to R assets

Previously, shiny.router was based on the external page.js library. Thanks to the Shiny session object, we have now moved it fully to R.

Separated server for each bookmark

Now each bookmark can be an isolated, fully working Shiny app. The new feature lets you not only separate the UI for each bookmark – you can also define its own server. Just check the example below!

library(shiny)
library(shiny.router)

# This creates UI for each page.
page <- function(title, content) {
  div(
    titlePanel(title),
    p(content),
    uiOutput("power_of_input")
  )
}

# Part of both sample pages.
home_page <- page("Home page", "This is the home page!")
side_page <- page("Side page", "This is the side page!")

# Callbacks on the server side for the sample pages
home_server <- function(input, output, session) {
  output$power_of_input <- renderUI({
    HTML(paste(
      "I display square of input and pass result to output$power_of_input: ",
      as.numeric(input$int) ^ 2))
  })
}

side_server <- function(input, output, session) {
  output$power_of_input <- renderUI({
    HTML(paste(
      "I display cube of input and also pass result to output$power_of_input: ",
      as.numeric(input$int) ^ 3))
  })
}

# Create routing. We provide routing path, a UI as well as a server-side callback for each page.
router <- make_router(
  route("home", home_page, home_server),
  route("side", side_page, side_server)
)

# Create output for our router in main UI of Shiny app.
ui <- shinyUI(fluidPage(
  shiny::sliderInput("int", "Choose integer:", -10, 10, 1, 1),
  router_ui()
))

# Plug router into Shiny server.
server <- shinyServer(function(input, output, session) {
  router(input, output, session)
})

# Run server in a standard way.
shinyApp(ui, server)

Pass parameters to an app using GET URL variables

library(shiny)
library(shiny.router)

# Main page UI.
home_page <- div(
  titlePanel("Home page"),
  p("This is the home page!"),
  uiOutput("power_of_input")
)

# Creates routing. We provide routing path, a UI as well as a server-side callback for each page.
router <- make_router(
  route("/", home_page, NA)
)

# Create output for our router in main UI of Shiny app.
ui <- shinyUI(fluidPage(
  shiny::sliderInput("int", "Choose integer:", -10, 10, 1, 1),
  router_ui()
))

# Plug router into Shiny server.
server <- shinyServer(function(input, output, session) {
  router(input, output, session)

  component <- reactive({
    if (is.null(get_query_param()$add)) {
      return(0)
    }
    as.numeric(get_query_param()$add)
  })

  output$power_of_input <- renderUI({
    HTML(paste(
      "I display input increased by add GET parameter from app url and pass result to output$power_of_input: ",
      as.numeric(input$int) + component()))
  })
})

# Run server in a standard way.
shinyApp(ui, server)

Operate routing from the server side
  • route_link – changes a bookmark’s URL by adding the hashbang (#!) prefix,
  • change_page – changes the currently displayed page,
  • get_page – extracts the “hash” part of the URL,
  • is_page – checks whether the given page is the one currently displayed.
library(shiny)
library(shiny.router)

# This generates menu in user interface with links.
menu <- (
  tags$ul(
    tags$li(a(class = "item", href = route_link("home"), "Home page")),
    tags$li(a(class = "item", href = route_link("side"), "Side page"))
  )
)

# This creates UI for each page.
page <- function(title, content) {
  div(
    menu,
    titlePanel(title),
    p(content),
    actionButton("switch_page", "Click to switch page!")
  )
}

# Both sample pages.
home_page <- page("Home page", uiOutput("current_page"))
side_page <- page("Side page", uiOutput("current_page"))

# Creates router. We provide routing path, a UI as
# well as a server-side callback for each page.
router <- make_router(
  route("home", home_page, NA),
  route("side", side_page, NA)
)

# Create output for our router in main UI of Shiny app.
ui <- shinyUI(fluidPage(
  router_ui()
))

# Plug router into Shiny server.
server <- shinyServer(function(input, output, session) {
  router(input, output, session)

  output$current_page <- renderText({
    page <- get_page(session)
    sprintf("Welcome on %s page!", page)
  })

  observeEvent(input$switch_page, {
    if (is_page("home")) {
      change_page("side")
    } else if (is_page("side")) {
      change_page("home")
    }
  })
})

# Run server in a standard way.
shinyApp(ui, server)

Styling – Bootstrap and Semantic UI

You can suppress the Bootstrap dependency on a specified bookmark, switch between Bootstrap and Semantic UI pages, or disable styles altogether. This is especially useful when using both the Bootstrap and Semantic UI frameworks in one application.


library(shiny)
library(shiny.router)
library(shiny.semantic)

# Both sample pages.
bootstrap_page <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      sliderInput("obs_bootstrap", NULL, min = 0, max = 100, value = 50, step = 1)
    ),
    mainPanel(
      p("Selected value:"),
      textOutput("value_bootstrap")
    )
  )
)

semanticui_page <- semanticPage(
  slider_input("obs_semantic", min = 0, max = 100, value = 50, step = 1),
  p("Selected value:"),
  textOutput("value_semantic")
)

# Creates router. We provide routing path, a UI as
# well as a server-side callback for each page.
router <- make_router(
  route("bootstrap", bootstrap_page),
  route("semantic", semanticui_page),
  page_404 = page404("You opened non existing bookmark!")
)

# Create output for our router in main UI of Shiny app.
ui <- shinyUI(
  tagList(
    tags$head(
      singleton(disable_bootstrap_on_bookmark("semantic"))
    ),
    router_ui()
  )
)

# Plug router into Shiny server.
server <- shinyServer(function(input, output, session) {
  router(input, output, session)
  output$value_bootstrap <- renderText(input$obs_bootstrap)
  output$value_semantic <- renderText(input$obs_semantic)
})

# Run server in a standard way.
shinyApp(ui, server)

How to get shiny.router?

shiny.router is available both on CRAN and GitHub. If you stumble upon any issues, please file them on GitHub, where our team will reply. Are you already using the shiny.router package in your Shiny projects? Say hello to us and share your story – it will help us make our open source better. Look for us at R events and collect our hex stickers!


Further steps and plans for the package

We plan to keep working on the package to make it more versatile. As next steps, we want to allow passing parameters between separate bookmark servers and to add the ability to save application state. We hope you will appreciate the improvements we have made over the last two years.

Your feedback will be very valuable to us.

The article Official release of shiny.router and its new features first appeared on Appsilon Data Science | We Provide End-to-End Data Science Solutions.


To leave a comment for the author, please follow the link and comment on their blog: r – Appsilon Data Science | We Provide End­ to­ End Data Science Solutions.

LTV prediction for a recurring subscription with R

Wed, 09/19/2018 - 10:54

(This article was first published on R language – AnalyzeCore by Sergey Bryl' – data is beautiful, data is a story, and kindly contributed to R-bloggers)

Customers lifetime value (LTV or CLV) is one of the cornerstones of product analytics, because many of the decisions we make depend on LTV as a necessary, or at least very significant, factor. In this article, we will focus on products/services/applications with recurring subscription payments.

Predicting LTV is a common issue for a new, recently launched product/service/application when we don’t have a lot of historical data but want to calculate LTV as soon as possible. Even though we may have a lot of historical data on customer payments for a product that is active for years, we can’t really trust earlier stats since the churn curve and LTV can differ significantly between new customers and the current ones due to a variety of reasons.

Therefore, regardless of whether our product is new or “old”, we attract new subscribers and want to estimate what revenue they will generate during their lifetimes for business decision-making.

This topic is closely connected to cohort analysis; if you are not familiar with the concept, I recommend reading about it and looking at other articles I wrote earlier on this blog.

As usual, in the article, we will review an example of LTV projection using the R language although this approach can be implemented even in MS Excel.

Ok, let’s start. In order to predict the average LTV for a product/service/application with a constant subscription payment, it suffices to know how many subscribers will churn (or be retained) at the end of each subscription period. I’ve collected four examples: two from my practice and two from research that demonstrate the effectiveness of the approach. These examples come from different businesses and include both monthly and annual subscriptions, but for convenience they are all presented as monthly ones.


library(tidyverse)
library(reshape2)

# retention rate data
df_ret <- data.frame(month_lt = c(0:7),
                     case01 = c(1, .531, .452, .423, .394, .375, .356, .346),
                     case02 = c(1, .869, .743, .653, .593, .551, .517, .491),
                     case03 = c(1, .677, .562, .486, .412, .359, .332, .310),
                     case04 = c(1, .631, .468, .382, .326, .289, .262, .241)) %>%
  melt(.,
       id.vars = c('month_lt'), = 'example', = 'retention_rate')

ggplot(df_ret, aes(x = month_lt, y = retention_rate, group = example, color = example)) +
  theme_minimal() +
  facet_wrap(~ example) +
  scale_color_manual(values = c('#4e79a7', '#f28e2b', '#e15759', '#76b7b2')) +
  geom_line() +
  geom_point() +
  theme(plot.title = element_text(size = 20, face = 'bold', vjust = 2, hjust = 0.5),
        axis.text.x = element_text(size = 8, hjust = 0.5, vjust = .5, face = 'plain'),
        strip.text = element_text(face = 'bold', size = 12)) +
  ggtitle('Retention Rate')


The initial data is as follows:

month_lt – month (or subscription period) of customer’s lifetime,

example – the name of the example,

retention_rate – the percentage of subscribers who have maintained the subscription.

Next, the visualization of four retention curves. Looks very familiar, doesn’t it?

Let’s consider the approach for predicting LTV. As I noted earlier, when we deal with a constant subscription payment, we can predict the average subscriber’s LTV by projecting the retention curve. In other words, starting from the factual points we have, we need to continue “drawing” the curve into future periods.

However, instead of using the traditional approach (i.e. a regression model), we can use an alternative probabilistic method that Peter Fader and Bruce Hardie suggested.

The authors demonstrated the concept which, despite having simplified assumptions, shows excellent results in practice. There are a lot of materials on the Internet about the method, so we won’t go into details, we will just highlight the core ideas.

We assume that each subscriber has a certain constant churn probability, and at the end of each period “flips a coin” with that probability to decide whether to churn or to keep the subscription. Since the probability of a subscriber surviving through a given period is the product of the per-period survival probabilities, the subscriber’s lifetime follows a shifted geometric distribution.

Although the specific probability for each subscriber is unknown to us, we assume it follows a beta distribution, which fits the goal ideally for two reasons. Firstly, its support is the interval from 0 to 1 (like a probability). Secondly, it is flexible: we can change (tune, in our case) the shape of the distribution via its two parameters, alpha and beta.

By combining these two assumptions mathematically, the authors showed that subscribers’ lifetimes can be characterized by the shifted-beta-geometric (sBG) distribution, which in turn is determined by the two parameters alpha and beta. Since alpha and beta are unknown, we estimate them by maximizing the likelihood of the observed subscriber retention data, and then extend the retention curve into future periods.
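For reference, the core sBG quantities can be written recursively; these are exactly the recursions implemented in the churnBG and survivalBG functions used later, and the log-likelihood matches the MLL function:

```latex
% Churn probability in period t (shifted-beta-geometric):
P(T = 1) = \frac{\alpha}{\alpha + \beta}, \qquad
P(T = t) = \frac{\beta + t - 2}{\alpha + \beta + t - 1}\, P(T = t - 1), \quad t > 1

% Survival (retention) through period t:
S(t) = S(t - 1) - P(T = t), \qquad S(0) = 1

% Log-likelihood given lost and still-active subscriber counts:
\log \mathcal{L}(\alpha, \beta) =
  \sum_{t=1}^{T} n^{\mathrm{lost}}_t \log P(T = t)
  + n^{\mathrm{active}}_T \log S(T)
```

Maximizing this log-likelihood (minimizing its negative, as the code does with optim) yields the alpha and beta that best fit the observed retention curve.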

Let’s check the accuracy of the predictions on the available data. We will assume that we only know the first month’s retention, then the first and second months, and so on, each time predicting all subsequent periods through the end of the 7th month. I wrote a simple loop for this, which you can apply to each example:


# functions for sBG distribution
churnBG <- Vectorize(function(alpha, beta, period) {
  t1 = alpha / (alpha + beta)
  result = t1
  if (period > 1) {
    result = churnBG(alpha, beta, period - 1) * (beta + period - 2) / (alpha + beta + period - 1)
  }
  return(result)
}, vectorize.args = c("period"))

survivalBG <- Vectorize(function(alpha, beta, period) {
  t1 = 1 - churnBG(alpha, beta, 1)
  result = t1
  if (period > 1) {
    result = survivalBG(alpha, beta, period - 1) - churnBG(alpha, beta, period)
  }
  return(result)
}, vectorize.args = c("period"))

MLL <- function(alphabeta) {
  if (length(activeCust) != length(lostCust)) {
    stop("Variables activeCust and lostCust have different lengths: ",
         length(activeCust), " and ", length(lostCust), ".")
  }
  t = length(activeCust) # number of periods
  alpha = alphabeta[1]
  beta = alphabeta[2]
  return(-as.numeric(
    sum(lostCust * log(churnBG(alpha, beta, 1:t))) +
      activeCust[t] * log(survivalBG(alpha, beta, t))
  ))
}

df_ret <- df_ret %>%
  group_by(example) %>%
  mutate(activeCust = 1000 * retention_rate,
         lostCust = lag(activeCust) - activeCust,
         lostCust = ifelse(, 0, lostCust)) %>%
  ungroup()

ret_preds01 <- vector('list', 7)
for (i in c(1:7)) {
  df_ret_filt <- df_ret %>%
    filter(between(month_lt, 1, i) == TRUE & example == 'case01')

  activeCust <- c(df_ret_filt$activeCust)
  lostCust <- c(df_ret_filt$lostCust)

  opt <- optim(c(1, 1), MLL)
  retention_pred <- round(c(1, survivalBG(alpha = opt$par[1], beta = opt$par[2], c(1:7))), 3)

  df_pred <- data.frame(month_lt = c(0:7),
                        example = 'case01',
                        fact_months = i,
                        retention_pred = retention_pred)
  ret_preds01[[i]] <- df_pred
}
ret_preds01 <-'rbind', ret_preds01)


The visualization of the predicted retention curves and the mean absolute percentage error (MAPE) is the following:

As you can see, in most cases, the accuracy of the predictions is very high even for the first month’s data only and it significantly improves when adding historical periods.

Now, in order to obtain the average LTV prediction, we need to multiply the retention rate by the subscription price and calculate the cumulative amount for the required period. Let’s suppose we want to calculate the average LTV for case03 based on two historical months with a forecast horizon of 24 months and a subscription price of $1. We can do this using the following code:
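This “multiply and accumulate” step can be written compactly. With a constant subscription price p and a forecast horizon of T periods:

```latex
\mathrm{LTV}(T) = p \sum_{t=0}^{T} S(t)
```

where S(t) is the actual retention rate for the observed periods and the sBG-predicted retention thereafter; for the case03 example, p = \$1 and T = 24.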


### LTV prediction ###
df_ltv_03 <- df_ret %>%
  filter(between(month_lt, 1, 2) == TRUE & example == 'case03')

activeCust <- c(df_ltv_03$activeCust)
lostCust <- c(df_ltv_03$lostCust)

opt <- optim(c(1, 1), MLL)
retention_pred <- round(c(survivalBG(alpha = opt$par[1], beta = opt$par[2], c(3:24))), 3)

df_pred <- data.frame(month_lt = c(3:24),
                      retention_pred = retention_pred)

df_ltv_03 <- df_ret %>%
  filter(between(month_lt, 0, 2) == TRUE & example == 'case03') %>%
  select(month_lt, retention_rate) %>%
  bind_rows(., df_pred) %>%
  mutate(retention_rate_calc = ifelse(,
                                      retention_pred, retention_rate),
         ltv_monthly = retention_rate_calc * 1,
         ltv_cum = round(cumsum(ltv_monthly), 2))


We’ve obtained the average LTV of $9.33. Note that we’ve used actual data for the observed periods (from 0 to 2nd months) and the predicted retention for the future periods (from 3rd to 24th months).

A couple of tips from the practice:

  • To get a retention or LTV prediction as soon as possible – for example, for a monthly subscription – you can create weekly/decadal (or even daily, if the number of subscribers is large enough) cohorts. This way you can obtain a prediction after a month plus a week (or a decade, or a day) instead of waiting two full months.
  • To improve the accuracy of the predictions, it makes sense to create more specific cohorts by taking into account additional subscriber characteristics that can affect churn. For instance, for a mobile application, such characteristics can be the customer’s country, the subscription price, the type of device, etc. Thus, a cohort may look like purchase-date + country + price + device.
  • Update the prediction once you have new factual data. This helps increase the accuracy of the model.


The post LTV prediction for a recurring subscription with R appeared first on AnalyzeCore by Sergey Bryl' – data is beautiful, data is a story.


To leave a comment for the author, please follow the link and comment on their blog: R language – AnalyzeCore by Sergey Bryl' – data is beautiful, data is a story.

Radix for R Markdown

Wed, 09/19/2018 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

Today we’re excited to announce Radix, a new R Markdown format optimized for scientific and technical communication.

Radix is based on the Distill web framework, which was originally created for use in the Distill Machine Learning Journal. Radix combines the technical authoring features of Distill with R Markdown.

Below we’ll demonstrate some of the key features of Radix. To learn more about installing and using Radix, check out the Radix for R Markdown website.

Figure layout

Radix provides many flexible options for laying out figures. While the main text column in Radix articles is relatively narrow (optimized for comfortable reading), figures can occupy a larger region. For example:

For figures you want to emphasize or that require lots of visual space, you can also create layouts that occupy the entire width of the screen:

Of course, some figures and notes are only ancillary and are therefore better placed in the margin:

Citations and metadata

Radix articles support including citations and a corresponding bibliography using standard R Markdown citation syntax.
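Since this is the standard R Markdown (Pandoc) citation syntax, a minimal sketch looks like the following; the bibliography file name and the citation key are hypothetical placeholders:

```markdown
---
title: "My Radix article"
output: radix::radix_article
bibliography: references.bib
---

Radix is based on the Distill web framework [@distill2018].
```

Here `references.bib` is a BibTeX file containing an entry whose key is `distill2018`; Pandoc resolves `[@distill2018]` against it and renders the bibliography automatically.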

In addition, when you provide a citation_url metadata field for your article, a citation appendix that makes it easy for others to cite your article is automatically generated:

Radix also automatically includes standard Open Graph and Twitter Card metadata. This makes links to your article display rich metadata when shared in various places:

Creating a blog

You can publish a series of Radix articles as either a website or a blog. For example, the TensorFlow for R blog is implemented using Radix:

To learn more, see the article on creating a blog with Radix.

Getting started

To create an R Markdown document that uses the Radix format, first install the radix R package:
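A sketch of that installation step, assuming (as at the time of writing) that Radix is installed from GitHub via devtools rather than CRAN:

```r
# Install the development version of the radix package from GitHub
# (assumes the devtools package is available)
devtools::install_github("rstudio/radix")
```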


Using Radix requires Pandoc v2.0 or higher. If you are using RStudio then you should use RStudio v1.2.718 or higher (which comes bundled with Pandoc v2.0). You can download the preview release of RStudio v1.2 at

Next, use the New R Markdown dialog within RStudio to create a new Radix article:

This will give you a minimal new Radix document.

Then, check out the Radix for R Markdown website to learn more about what’s possible. Happy authoring!


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

November 8th & 9th in Munich: Workshop on Deep Learning with Keras and TensorFlow in R

Wed, 09/19/2018 - 02:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

Registration is now open for my 1.5-day workshop on deep learning with Keras and TensorFlow using R.

It will take place on November 8th & 9th in Munich, Germany.

You can read about one participant’s experience in my workshop:

Big Data – a buzz word you can find everywhere these days, from nerdy blogs to scientific research papers and even in the news. But how does Big Data Analysis work, exactly? In order to find that out, I attended the workshop on “Deep Learning with Keras and TensorFlow”. On a stormy Thursday afternoon, we arrived at the modern and light-flooded codecentric AG headquarters. There, we met performance expert Dieter Dirkes and Data Scientist Dr. Shirin Glander. In the following two days, Shirin gave us a hands-on introduction into the secrets of Deep Learning and helped us to program our first Neural Net. After a short round of introduction of the participants, it became clear that many different areas and domains are interested in Deep Learning: geologists want to classify (satellite) images, energy providers want to analyse time-series, insurers want to predict numbers and I – a humanities major – want to classify text. And codecentric employees were also interested in getting to know the possibilities of Deep Learning, so that a third of the participants were employees from the company itself.

Continue reading…

In my workshop, you will learn

  • the basics of deep learning
  • what cross-entropy and loss is
  • about activation functions
  • how to optimize weights and biases with backpropagation and gradient descent
  • how to build (deep) neural networks with Keras and TensorFlow
  • how to save and load models and model weights
  • how to visualize models with TensorBoard
  • how to make predictions on test data

Keras is a high-level API written in Python for building and prototyping neural networks. It can be used on top of TensorFlow, Theano or CNTK. Keras is very convenient for fast and easy prototyping of neural networks. It is highly modular and very flexible, so that you can build basically any type of neural network you want. It supports convolutional neural networks and recurrent neural networks, as well as combinations of both. Due to its layer structure, it is highly extensible and can run on CPU or GPU.

The keras R package provides an interface to the Python Keras library, just as the tensorflow package provides an interface to TensorFlow. Basically, R creates a conda instance and runs Keras in it, while you can still use all the functionality of R for plotting, etc. Almost all function names are the same, so models can easily be recreated in Python for deployment.


To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

I’ll be talking about ‘Decoding The Black Box’ at the Frankfurt Data Science Meetup

Wed, 09/19/2018 - 02:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

I have yet another Meetup talk to announce:

On Wednesday, October 26th, I’ll be talking about ‘Decoding The Black Box’ at the Frankfurt Data Science Meetup.

Particularly cool about this meetup is that they will livestream the event!


And finally we will have with us Dr. Shirin Glander, whom we have been inviting for a long time. Shirin lives in Münster and works as a Data Scientist at codecentric; she has lots of practical experience. Besides crunching data, she trains her creativity by sketching information. Visit her blog and you will find lots of interesting stuff there, like experiments with Keras, TensorFlow, LIME, caret, lots of R and also her beautiful sketches. We recommend it! Besides all that, she is an organiser of the MünsteR – R User Group.

Traditional ML workflows focus heavily on model training and optimization; the best model is usually chosen via performance measures like accuracy or error and we tend to assume that a model is good enough for deployment if it passes certain thresholds of these performance criteria. Why a model makes the predictions it makes, however, is generally neglected. But being able to understand and interpret such models can be immensely important for improving model quality, increasing trust and transparency and for reducing bias. Because complex ML models are essentially black boxes and too complicated to understand, we need to use approximations, like LIME.
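The core idea behind such surrogate approaches can be sketched in a few lines of base R (this illustrates the principle only and is not the lime package): sample points near the instance to explain, query the black box, and fit a weighted linear model locally:

```r
set.seed(1)
black_box <- function(x) sin(x)        # stand-in for a complex model
x0 <- 1                                # instance we want to explain
x_local <- x0 + rnorm(200, 0, 0.1)     # perturbations around the instance
w <- exp(-(x_local - x0)^2 / 0.01)     # proximity weights: closer points matter more
surrogate <- lm(black_box(x_local) ~ x_local, weights = w)
coef(surrogate)["x_local"]             # local slope, close to cos(1)
```

The coefficients of the simple surrogate then serve as a local, human-readable explanation of the black box's behaviour around the chosen instance.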


To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.

Le Monde puzzle [#1066]

Wed, 09/19/2018 - 00:18

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

The second Le Monde mathematical puzzle in the new competition is sheer trigonometry:

When in the above figures both triangles ABC are isosceles and the brown segments are all of length 25cm, find the angle in A and the value of DC², respectively.

This could have been solved by R coding the various possible angles of the three segments beyond BC until the isosceles property is met, but it went much faster by pen and paper, leading to an angle of π/9 in the first case and a square of 1250 in the second case.


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

«smooth» package for R. Intermittent state-space model. Part I. Introducing the model

Tue, 09/18/2018 - 22:52

(This article was first published on R – Modern Forecasting, and kindly contributed to R-bloggers)


One of the features of the functions in the smooth package is the ability to work with intermittent data and data with periodically occurring zeroes.

An intermittent time series is a series that has non-zero values occurring at irregular frequency (Svetunkov and Boylan, 2017). Imagine a retailer who sells green lipstick. Demand for such a product will not be easy to predict, because green is not a popular colour in this case, and thus the sales data will contain a lot of zeroes with occasional non-zero values. Such demand is called "intermittent". In fact, many products exhibit intermittent patterns in sales, especially when we increase the frequency of measurement (how many tomatoes and how often does a store sell per day? What about per hour? Per minute?).
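A quick way to see such a pattern is to draw counts from a Poisson distribution with a low rate in base R (a toy illustration, not data from the package):

```r
set.seed(41)
demand <- rpois(20, 0.3)   # low-rate counts: mostly zeroes, occasional sales
demand
mean(demand == 0)          # proportion of periods with no demand at all
```

With a rate of 0.3, roughly three quarters of the periods contain no sales, which is exactly the intermittent pattern described above.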

The other case is when watermelons are sold in high quantities over summer and are either not sold at all or sold very seldom in winter. In this case the demand might be intermittent or even absent in winter, while behaving like continuous demand during summer.

smooth functions can work with both of these types of data, building upon mixture distributions.

In this post we will discuss the basic intermittent demand statistical models implemented in the package.

The model

First, it is worth pointing out that the approach that is used in the statistical methods and models discussed in this post assumes that the final demand on the product can be split into two parts (Croston, 1972):

  1. Demand occurrence part, which is represented by a binary variable, which is equal to one when there is a non-zero demand in period \(t\) and zero otherwise;
  2. Demand sizes part, which reflects the amount of sold product when demand occurrence part is equal to one.

This can be represented mathematically by the following equation:
\begin{equation} \label{eq:iSS}
y_t = o_t z_t ,
\end{equation}
where \(o_t\) is the binary demand occurrence variable, \(z_t\) is the demand sizes variable and \(y_t\) is the final demand. This equation was originally proposed by Croston (1972), although he never considered a complete statistical model and only proposed a forecasting method.

There are several intermittent demand methods usually discussed in the forecasting literature: Croston (Croston, 1972), SBA (Syntetos & Boylan, 2000) and TSB (Teunter et al., 2011). These are good methods that work well in the intermittent demand context (see, for example, Kourentzes, 2014). The only limitation is that they are "methods" and not "models". Having models becomes important when you want to include additional components and produce proper prediction intervals, need the ability to select the appropriate components, and want to do proper statistical inference. So, John Boylan and I developed the model underlying these methods (Svetunkov & Boylan, 2017), based on \eqref{eq:iSS}. It is built upon the ETS framework, so we call it "iETS". Given that all the intermittent demand forecasting methods rely on simple exponential smoothing (SES), we suggested using the ETS(M,N,N) model for both the demand sizes and demand occurrence parts, because it underlies SES (Hyndman et al., 2008). One of the key assumptions in our model is that demand sizes and demand occurrences are independent of each other. Although this is an obvious simplification, it is inherited from Croston and TSB, and seems to work well in many contexts.

The iETS(M,N,N) model, discussed in our paper, is formulated the following way:
\begin{equation}
\begin{matrix}
y_t = o_t z_t \\
z_t = l_{z,t-1} \left(1 + \epsilon_t \right) \\
l_{z,t} = l_{z,t-1} \left(1 + \alpha_z \epsilon_t \right) \\
o_t \sim \text{Bernoulli}(p_t)
\end{matrix} ,
\end{equation}
where \(z_t\) is represented by the ETS(M,N,N) model, \(l_{z,t}\) is the level of demand sizes, \(\alpha_z\) is the smoothing parameter and \(\epsilon_t\) is the error term. The important assumption in our implementation of the model is that \(\left(1 + \epsilon_t \right) \sim \text{log}\mathcal{N}(0, \sigma_\epsilon^2) \) – something that we discussed in one of the previous posts. This means that the demand will always be positive. However if you deal with some other type of data, where negative values are natural, then you might want to stick with pure additive model.
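To make the model concrete, here is a hedged base-R simulation of these four equations (the parameter values are arbitrary choices for illustration, and the occurrence probability is held fixed for simplicity):

```r
set.seed(42)
n <- 100
alpha_z <- 0.1   # smoothing parameter of the demand sizes part
p <- 0.3         # fixed occurrence probability (illustration only)
sigma <- 0.2     # scale of the log-normal (1 + epsilon) term

l <- 10          # initial level of demand sizes
y <- numeric(n)
for (t in 1:n) {
  eps <- exp(rnorm(1, 0, sigma)) - 1   # (1 + eps) is log-normal, so z_t > 0
  z <- l * (1 + eps)                   # demand size
  l <- l * (1 + alpha_z * eps)         # level update
  o <- rbinom(1, 1, p)                 # Bernoulli demand occurrence
  y[t] <- o * z                        # final intermittent demand
}
mean(y == 0)     # share of zero-demand periods, roughly 1 - p
```

Note how the log-normal multiplicative error guarantees positive demand sizes, exactly as the assumption above implies.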

As a side note, we have advanced notations for the iETS models, which will be discussed in the next post. For now we will stick with the level models and use the shorter names.

Having this statistical model makes the approach extendable, so that one can add a trend, a seasonal component, or exogenous variables. We don't discuss these elements in our paper beyond a brief mention in the conclusions, and we won't discuss these features just yet either – we will cover them in the next post.

Now the main question that remains unanswered is how to model the probability \(p_t\). There are several approaches to that:

  1. iETS\(_f\) – assume that the demand occurs at random with a fixed probability (so \(p_t = p\)).
  2. iETS\(_i\) – the interval-based model, with the principle suggested by Croston (1972). In this case we assume that the probability is constant between the demand occurrences and that the intervals between the occurrences are inversely proportional to the respective probabilities in time. We model and forecast the number of zeroes between the non-zero demands and then invert the forecasted value. This is one of the oldest forecasting approaches for intermittent demand, though not always the most accurate one. Still, this model seems to work well when demand is building up.
  3. iETS\(_p\) – the probability-based model, with the principle suggested by Teunter et al. (2011). In this case the probability is updated directly based on the values of the occurrence variable, using the SES method. This model works well whenever the probability changes over time.

In case (1) there is no model underlying the probability; we just estimate the value and produce forecasts. In cases (2) and (3), we suggest using another ETS(M,N,N) model underlying each of these processes. So when it comes to producing forecasts, in both cases we assume that the future level of the probability will be the same as the last obtained one (a level forecast from the local-level model). After that the final forecast is generated using:
\begin{equation} \label{eq:iSSForecast}
\hat{y}_t = \hat{p}_t \hat{z}_t ,
\end{equation}
where \(\hat{p}_t\) is the forecast of the probability, \(\hat{z}_t\) is the forecast of the demand sizes and \(\hat{y}_t\) is the final forecast of the intermittent demand.
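As an illustration of the probability-based case, the SES update of \(p_t\) and the final forecast can be sketched in base R (this is a simplification of what the package does, with made-up numbers):

```r
o <- c(1, 0, 0, 1, 0, 1, 1, 0)   # observed demand occurrences (made up)
alpha_o <- 0.1                   # smoothing parameter of the occurrence part
p_hat <- mean(o)                 # initialise with the in-sample frequency
for (t in seq_along(o)) {
  p_hat <- alpha_o * o[t] + (1 - alpha_o) * p_hat   # SES on the 0/1 series
}
z_hat <- 5                       # assumed level forecast of demand sizes
y_hat <- p_hat * z_hat           # final forecast of intermittent demand
c(p_hat = p_hat, y_hat = y_hat)
```

Because the probability forecast is flat beyond the sample, the final forecast is simply the demand-size forecast scaled down by the last probability estimate.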
Summarising advantages of our framework:

  1. Our model is extendable: you can use any ETS model and even introduce exogenous variables. In fact, you can use any model you want for demand sizes and a wide variety of models for demand occurrence variable;
  2. The model allows selecting between the aforementioned types of intermittent models (“fixed” / “probability” / “intervals”) using information criteria. This mechanism works fine on large samples, but, unfortunately, does not seem to work well in cases of small samples (which is typical for intermittent demand) with the existing information criteria. Potentially, some modifications of criteria need to be done in order to make the mechanism work better. For example, AICc and BICc need to be adjusted in order to take the demand occurrence part into account;
  3. The model allows producing prediction intervals for several steps ahead and cumulative (over a lead time) upper bound of the intervals. The latter arises naturally from the model and can be used for safety stock calculation;
  4. The estimation of models is done using likelihood function and not some ad-hoc estimators. This means that the estimates of parameters become efficient and consistent;
  5. Although the proposed model is continuous, we show in our paper that it is more accurate than several other integer-valued models. Still, if you want integer numbers as your final forecasts, you can round the point forecasts or the prediction intervals up or down, ending up with meaningful values. This can be done due to a connection between the quantiles of rounded values and the rounding of quantiles of a continuous variable (discussed in the Appendix of our paper).

If you need more details, have a look at our working paper (have I advertised it enough in this post yet?).

Implementation. Demand occurrence

The aforementioned model with different occurrence types is available in the smooth package. There is a special function for the demand occurrence part, called iss() (Intermittent State-Space model), and there is a parameter in every smooth forecasting function called intermittent, which can be one of: "none", "fixed", "interval", "probability", "sba", "auto" or "logistic". We do not cover "logistic" yet; this will be discussed in the next post. We also don't discuss the "sba" (it is based on Syntetos and Boylan, 2001) and "auto" options – we might cover them later at some point.

So, let’s consider an example with artificial data. We create the following time series:

x <- c(rpois(25,5),rpois(25,1),rpois(25,0.5),rpois(25,0.1))

This way we have artificial data where both the demand sizes and the demand occurrence probability decrease stepwise over time, every 25 observations. The generated data resembles something called "demand obsolescence", or "dying out demand". Let's fit three different models to this time series:

issFixed <- iss(x, intermittent="f", h=25)

Intermittent state space model estimated: Fixed probability
Underlying ETS model: MNN
Vector of initials:
level
 0.55
Information criteria:
     AIC     AICc      BIC     BICc
139.6278 139.6686 142.2329 142.3269

issInterval <- iss(x, intermittent="i", h=25)

Intermittent state space model estimated: Interval-based
Underlying ETS model: MNN
Smoothing parameters:
alpha
 0.13
Vector of initials:
level
1.008
Information criteria:
     AIC     AICc      BIC     BICc
127.4387 127.6887 135.2542 135.8299

issProbability <- iss(x, intermittent="p", h=25)

Intermittent state space model estimated: Probability-based
Underlying ETS model: MNN
Smoothing parameters:
alpha
0.101
Vector of initials:
level
0.892
Information criteria:
     AIC     AICc      BIC     BICc
101.4430 101.5667 106.6534 106.9382

By looking at the outputs, we can already say that the iETS\(_p\) model performs better than the others in terms of information criteria. This is due to the flexibility of the model mentioned before – it is able to adapt to changes in probability faster than the other models. Both iETS\(_p\) and iETS\(_i\) have a smoothing parameter close to 0.1 and both start from a high probability (in the case of iETS\(_i\), the initial level is 1.008, which transforms to a probability of \(\frac{1}{1.008} \approx 0.99 \)).

We can also plot the actual occurrence variable and the fitted and forecasted probabilities using the plot function:




Note that the different models capture the probability differently. While iETS\(_f\) averages out the probability, both the iETS\(_i\) and iETS\(_p\) models react to changes in the data, but differently: the interval-based model does so only when demand occurs, while the probability-based one does so on every observation. Given that in this example the demand becomes obsolete, neither iETS\(_f\) nor iETS\(_i\) produces accurate forecasts for the occurrence part.

In addition, in my example the last observation in x is a non-zero demand, so both iETS\(_i\) and iETS\(_p\) react to that, each of them slightly differently: if there were a zero instead, iETS\(_i\) would predict a higher level of occurrence, while iETS\(_p\) would predict a lower one. This is due to the differences in the probability update mechanisms of the two models.

We cannot do much with the occurrence part of the model at the moment, because of limitations that will be discussed in the next post. For now we are stuck with the ETS(M,N,N) model there, so let's move to the demand sizes part of the model.

Implementation. The whole demand

In order to deal with the intermittent data and produce forecasts for the whole time series, we can use es() or any of the other smooth forecasting functions – all of them have the intermittent parameter, which is equal to "none" by default. We will use an example with ETS models. In order to simplify things, we will use the iETS\(_p\) model:

es(x, "MNN", intermittent="p", silent=FALSE, h=25)

The forecast of this model is a straight line close to zero, due to the decrease in both the demand sizes and demand occurrence parts. However, knowing that the demand decreases, we can use a trend model in this case. The flexibility of the approach allows us to do that, so we fit ETS(M,M,N) to the demand sizes:

es(x, "MMN", intermittent="p", silent=FALSE, h=25)

The forecast in this case is even closer to zero and approaches it asymptotically, which means we foresee that demand for our product will, on average, die out.

We can also produce prediction intervals and use model selection for demand sizes. If you know that the data cannot be negative (e.g. selling tomatoes in kilograms), then I would recommend using the pure multiplicative model:

es(x, "YYN", intermittent="p", silent=FALSE, h=25, intervals=TRUE)

Forming the pool of models based on... MNN, MMN, Estimation progress: ... Done!
Time elapsed: 0.2 seconds
Model estimated: iETS(MNN)
Intermittent model type: Probability-based
Persistence vector g:
alpha
0.524
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 0.551
Cost function type: MSE; Cost function value: 1.475
Information criteria:
     AIC     AICc      BIC     BICc
284.9088 285.3794 294.9455 287.8737
95% parametric prediction intervals were constructed

Compare this graph with the one when the pure additive model is selected:

es(x, "XXN", intermittent="p", silent=FALSE, h=25, intervals=TRUE)

Forming the pool of models based on... ANN, AAN, Estimation progress: ... Done!
Time elapsed: 0.14 seconds
Model estimated: iETS(ANN)
Intermittent model type: Probability-based
Persistence vector g:
alpha
0.448
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 1.749
Cost function type: MSE; Cost function value: 2.893
Information criteria:
     AIC     AICc      BIC     BICc
321.9607 322.4313 331.9974 324.9256
95% parametric prediction intervals were constructed

In the latter case the prediction intervals cover the negative part of the plane, which does not make sense in our context. Note also that the information criteria are lower for the multiplicative model, which is due to the changing variance in the sample every 25 observations.

The important thing to note is that although the multiplicative trend model sounds solid from a theoretical point of view (it cannot produce negative values), it might be dangerous in cases of small samples and positive trends. In this situation the model can produce an exploding trajectory, because the forecast grows exponentially. I don't have a universal solution for this problem at the moment, but I would recommend using the ETS(M,Md,N) (damped multiplicative trend) model instead of ETS(M,M,N). The reason why I don't recommend ETS(M,A,N) is that, in cases of a negative trend with the typically low level of intermittent demand, the updated level might become negative, making the model inapplicable to the data.
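The danger is easy to see from the point forecast of a multiplicative trend, which is the last level times the trend component raised to the horizon (the numbers below are purely illustrative):

```r
l <- 100          # last level
b <- 1.05         # multiplicative trend of 5% per period
h <- 1:24         # forecast horizons
fc <- l * b^h     # multiplicative-trend point forecast grows exponentially
fc[c(1, 12, 24)]  # 105, then roughly 180, then over 320
```

A damped trend replaces the plain power \(b^h\) with a damped sum of powers, which tames this exponential growth over longer horizons.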

Finally, the experiments that I have conducted so far show that iETS\(_p\) is the most robust intermittent demand model in many cases. This is due to the flexibility of the probability update mechanism, proposed by Teunter et al. (2011).

Still, there is a question: how can we make the model even more flexible, so that the occurrence part becomes as versatile as the demand sizes part of the iETS model? We will answer this question in the next post…


To leave a comment for the author, please follow the link and comment on their blog: R – Modern Forecasting.

Not Hotdog: A Shiny app using the Custom Vision API

Tue, 09/18/2018 - 21:13

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

I had a great time at the EARL Conference in London last week, and as always came away invigorated by all of the applications of R that were presented there. I'll do a full writeup of the conference later this week, but in the meantime I wanted to share the materials from my own presentation there, "Not Hotdog: Image Recognition with R and the Custom Vision API". I've embedded the slides below:

This is an embedded Microsoft Office presentation, powered by Office Online.

In the presentation, I showed how easy it is to build a Shiny app to perform image recognition with the Custom Vision API in Azure. I've provided all of the R code behind the app in this GitHub repository, and thanks to the httr package it takes just a few lines of R code to interface with the API to upload images, train a transfer learning model, and classify new images. It's also handy to be able to use the web portal to review your tagged images and model performance. 

This was my first time building a Shiny app, and it was really easy to learn how to do it thanks to the RStudio tutorials. One feature that proved particularly useful was reactive expressions, which cache the result of the prediction and prevent a call out to the API every time I touched the "threshold" slider. The prediction API is rate-limited, and this way it was only called when the user actually changed the image URL, a relatively rare event. In all, it took me less than a day to learn Shiny and create the application below. Here it is correctly classifying this image of a hotdog (source), and correctly identifying that this hot dog (source) is not a hotdog.

If you want to try this out yourself, all you'll need is an Azure subscription (new subscribers can also get $200 free credits) and the R code in the repository below.

Github (revodavid): Not Hotdog vision recognition using R and Custom Vision API


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Snakes in a Package: combining Python and R with reticulate

Tue, 09/18/2018 - 11:21

(This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers)

When I first started working as a data scientist (or something like it) I was told to program in C++ and Java. Then R came along and it was liberating: my ability to do data analysis increased substantially. As my applications grew in size and complexity, I started to miss the structure of Java/C++. At the time, Python felt like a good compromise, so I switched again. After joining Mango Solutions I noticed I was not an anomaly; most data scientists here know both Python and R.

Nowadays, whenever I do my work in R there is a constant nagging voice in the back of my head telling me "you should do this in Python". And when I do my work in Python it's telling me "you can do this faster in R". So when the reticulate package came out I was overjoyed, and in this blogpost I will explain why.

re-tic-u-late (rĭ-tĭkˈyə-lĭt, -lātˌ)

So what exactly does reticulate do? Its goal is to facilitate interoperability between Python and R. It does this by embedding a Python session within the R session, which enables you to call Python functionality from within R. I'm not going to go into the nitty-gritty of how the package works here; RStudio have done a great job in providing some excellent documentation and a webinar. Instead, I'll show a few examples of the main functionality.

Just like R, the House of Python was built upon packages. Except in Python you don't load functionality from a package through a call to library() but instead you import a module. reticulate mimics this behaviour and opens up all the goodness from the module that is imported.

library(reticulate)
np <- import("numpy")
# the Kronecker product is my favourite matrix operation
np$kron(c(1,2,3), c(4,5,6))
## [1]  4  5  6  8 10 12 12 15 18

In the above code I import the numpy module which is a powerful package for all sorts of numerical computations. reticulate then gives us an interface to all the functions (and objects) from the numpy module. I can call these functions just like any other R function and pass in R objects, reticulate will make sure the R objects are converted to the appropriate Python objects.

You can also run Python code through source_python if it’s an entire script or py_eval/py_run_string if it’s a single line of code. Any objects (functions or data) created by the script are loaded into your R environment. Below is an example of using py_eval.

data("mtcars")
py_eval("r.mtcars.sum(axis=0)")
## mpg      642.900
## cyl      198.000
## disp    7383.100
## hp      4694.000
## drat     115.090
## wt       102.952
## qsec     571.160
## vs        14.000
## am        13.000
## gear     118.000
## carb      90.000
## dtype: float64

Notice the use of the r. prefix in front of the mtcars object in the Python code. The r object exposes the R environment to the Python session; its equivalent in the R session is the py object. The mtcars data.frame is converted to a pandas DataFrame, to which I then applied the sum function on each column.

Clearly RStudio have put in a lot of effort to ensure a smooth interface to Python, from the easy conversion of objects to the IDE integration. Not only will reticulate enable R users to benefit from the wealth of functionality from Python, I believe it will also enable more collaboration and increased sharing of knowledge.

Enter mailman

So what is it exactly that you can do with Python that you can’t with R? I asked myself the same question until I came across the following use case.

While helping a colleague out with a blogpost it was suggested that I should publish it on a Tuesday. No rationale was given so naturally I wondered if I could provide one using data. The data would have to come from R-bloggers. This is a great resource for reading blogposts about R (and related topics) and they also provide a daily newsletter with a link to the blogposts from that day. At the time the newsletter seemed the easiest way to collect data 1. All I needed to do now is extract the data from my Gmail account.

Therein lies the problem, as I want to avoid querying the Gmail server (it wouldn't make the analysis easy to reproduce). Fortunately, Google have made it easy to download your data (thanks to the Google Data Liberation Front) through Google Takeout. Unfortunately, all the e-mails are exported in the mbox format. Although this is a plain-text-based format, it would take some effort to write a parser in R – something I wasn't willing to do. And then along came Python, which has a built-in mbox parser in the mailbox module.

Using reticulate I extracted the necessary information from each e-mail.

# import the module
mailbox <- import("mailbox")
# use the mbox function to open a file connection
cnx <- mailbox$mbox("rblogs_box.mbox")
# the messages are stored as key/value pairs
# in this case they are indexed by an integer id
message <- cnx$get_message(10L)
# each message has a number of fields with meta-data
message$get("Date")
## [1] "Mon, 12 Dec 2016 23:56:19 +0000"
message$get("Subject")
## [1] "[R-bloggers] Building Shiny App exercises part 1 (and 7 more aRticles)"

And there we have it! I just read an e-mail from an mbox-file with very little effort. Of course I will need to do this for all messages, so I wrote a function to help me. And because we’re living in the Age of R I placed this function in an R package. You can find it on the MangoTheCat github repo, it is called mailman.

To publish or not to publish?

I have yet to provide a rationale for publishing a blogpost on a particular day so let’s quickly get to it. With the package all sorted I can now call the function mailman::read_messages to get a tibble with everything I need.

We can extract the number of blogposts on a particular date from the subject of each e-mail. Aggregating that to day of week will then give us a good overview of which day is popular.

library(dplyr)
library(mailman)
library(lubridate)
library(stringr)

messages <- read_messages("rblogs_box.mbox", type = "mbox") %>%
  mutate(Date = as.POSIXct(Date, format = "%a, %d %b %Y %H:%M:%S %z"),
         Day_of_Week = wday(Date, label = TRUE, abbr = TRUE),
         Number_Articles = str_extract(Subject, "[0-9](?=[\\n]* more aRticles)"),
         # Whenever a regex works you feel like a superhero!
         Number_Articles = as.numeric(Number_Articles) + 1,
         # Ok, sometimes it doesn't work but you're still a hero for trying!
         Number_Articles = ifelse(, 1, Number_Articles)) %>%
  select(Date, Day_of_Week, Number_Articles)

Judging by the graph, weekends would be a good time to publish a blogpost as there is less competition. Then again, not many people might read blogposts in the weekend. The next candidate would then be Monday which has the lowest average among the weekdays. Coming back to my original quest, I can conclude that publishing on a Tuesday is not the best option.

In summary

In my opinion, the reticulate package is a ground-breaking development. It allows me to combine the good parts of R with the good parts of Python (it’s already in use by the tensorflow and keras packages). Also, it allows the data science community to collaborate more easily and focus our energy on getting things done. This is the future, this is R and Python (Rython? PRython? PyR?).

  1. After I had collected all the data, Bob Rudis wrote about the Feedly API and released a dataset of blogposts over a longer time period. I would say his solution is preferable even though my results are slightly different due to the more recent time horizon.

To leave a comment for the author, please follow the link and comment on their blog: RBlog – Mango Solutions.

Case Study: How To Build A High Performance Data Science Team

Tue, 09/18/2018 - 10:04

(This article was first published on - Articles, and kindly contributed to R-bloggers)

Artificial intelligence (AI) has the potential to change industries across the board, yet few organizations are able to capture its value and realize a real return on investment. The reality is that the transition to AI and data-driven analysis is difficult and not well understood. The issue is twofold: first, the necessary technology to complete such a task has only recently become mainstream; and second, most data scientists are inexperienced in their respective industries. However, with all the uncertainty surrounding this topic, one hedge fund has managed to navigate these challenges and accomplish what many companies are failing to do: build a high-performing data science team that achieves real return on investment (ROI).

This is the story of an outlier

Business Science was recently invited inside the walls of Amadeus Investment Partners, a hedge fund that has unlocked the power of artificial intelligence to gain superior results in one of the most competitive industries in the world: investments. Amadeus Investment Partners has spent the last five years building a high-performance data science team. What they have built is nothing short of extraordinary.

In this article, we will discover what makes Amadeus Investment Partners an outlier and why they are unique in the data science space. We will learn the key ingredients that provide Amadeus a recipe that is driving ROI with artificial intelligence and examine what it takes to assemble a high-performance data science team.

Data Science Team Structure, Amadeus Investment Partners

We will then describe how Business Science is using this information to develop best-in-class data science education in the form of both on-premise custom workshops and on-demand virtual workshops. We will show how we are integrating the exact same cutting-edge technology into our data science for business programs.

This is all aimed at one thing: developing a system for creating best-in-class data science teams.

If you are interested in developing a best-in-class data science team, then read on.

Examining An Outlier

Amadeus Investment Partners is a hedge fund that blends traditional fundamental investment principles with cutting-edge quantitative techniques to create “Quantamental” strategies that identify assets that yield excellent returns while minimizing risk for their investors. Their goal is to provide their investors with superior risk-adjusted returns.

Amadeus’ strategy is working. Here’s an overview of backtest results from 2005 in comparison to the S&P 500, which is a difficult benchmark to outperform. Over the backtest period, we can see that Amadeus’ strategy delivered “alpha”, which means the strategy generated excess returns (performance) beyond the returns of the benchmark.

Risk-Return Performance, Amadeus’ Quantamental Long-Short Strategy

From the Growth of $10,000 Starting 2005 chart, Amadeus appears to be a well-performing hedge fund. However, it's not until we dive into the Risk and Return Profile that we begin to see the magic come to light. The Sharpe Ratio, a reward-to-risk ratio commonly referenced in investing, is almost double that of the S&P 500 over this time period. This means that Amadeus is taking less risk per unit of reward than the S&P 500. Furthermore, the Maximum Drawdown, the largest loss from a peak during the time frame, was about half that of the S&P 500 over the same period. Ultimately, this means that Amadeus is delivering exceptional returns while taking on less risk, which is very attractive to investors.
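As a quick aside, both risk metrics mentioned above are easy to compute from a return series. A minimal sketch in R (the return series below is simulated purely for illustration, not Amadeus data, and a zero risk-free rate is assumed):

```r
# Illustration only: Sharpe ratio and maximum drawdown for a simulated
# monthly return series (hypothetical numbers, zero risk-free rate assumed)
set.seed(42)
returns <- rnorm(120, mean = 0.008, sd = 0.04)  # 10 years of monthly returns

# Annualized Sharpe ratio: mean excess return divided by volatility
sharpe <- mean(returns) / sd(returns) * sqrt(12)

# Maximum drawdown: largest peak-to-trough decline in cumulative growth
growth <- cumprod(1 + returns)
max_drawdown <- max(1 - growth / cummax(growth))
```

With these two numbers in hand, the comparison in the chart above reduces to computing `sharpe` and `max_drawdown` on each of the two return series.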

But, how is Amadeus achieving these results?

A Radically Different Organization

In our meetings with Amadeus, we found 3 key components of the high-performance data science team. Each of these is critically important to Amadeus' successful execution of their radically different data-driven strategy. Amadeus:

  1. Finds and trains talent in the most unlikely fashion

  2. Has a well-designed team structure and culture

  3. Provides access to cutting-edge technology

We will step through each of these key ingredients that make up the data-driven recipe for success.

Key 1: Finding and Training Talent in the Most Unlikely Fashion

The first key to the puzzle is finding and developing the talent to execute on the vision. That’s where Amadeus has excelled: finding talent in the most unlikely places.

Over the past several years, Amadeus has tactically been working with the leading educational institutions in Canada to selectively gain access to top students in…

Business Programs

Yes – Students that are top in their classes in Business Programs. If you take a look at the demographics of their team, most don’t have math or physics backgrounds. If you’re familiar with the conventional data science team makeup full of math and computer science Ph.D.’s, this might come as a surprise to you.

This unusual hiring practice is founded on the belief that the subject knowledge and communication skills that top business students bring are critical advantages in Amadeus' data-driven approach. At the end of the day, data science is a tool that people use to answer questions they're interested in, and hiring people with the relevant subject matter expertise ensures that the right questions will be asked. Amadeus subsequently converts these business-minded people into data scientists by augmenting their skill set with math and programming on the job.

“Hiring people with the relevant subject matter expertise ensures that the right questions are asked.”

-Rafael Nicolas Fermin Cota

In terms of training the hired talent, Amadeus has a distinct advantage. One of the founders, Rafael Nicolas Fermin Cota, was a professor at the Ivey Business School at Western University, one of the top schools for business in Canada. In his curriculum, he taught his students how to make business decisions using data science. He states,

“My work entails teaching students how to think. The specific course material, they may forget. But, if they learn to think, they will learn to solve the problems they face in their professional careers.”

-Rafael Nicolas Fermin Cota

It’s this spirit of learning and critical thinking that you experience when meeting with the Amadeus data science team. What you also take away is a structured approach to this intellectual curiosity. Each member told stories of their start at Amadeus. It begins the same – learning to code, studying statistics, and getting a great deal of mentorship. It takes six months of education and training before a new employee is ready to be an integral part of the team. The core curriculum includes the following concepts:

  1. Database management: Obtaining data from various sources and storing it effectively for further access.

  2. Data manipulation: Working with raw data (often in many different formats) and turning it into an organized dataset that can be easily analyzed.

  3. Exploratory data analysis: Exploring the data to determine various characteristics of the dataset (NAs, mean, standard deviation, type, etc.).

  4. Predictive modelling: Using available data to predict future outcomes using machine learning and other artificial intelligence concepts.

  5. Visualization: Presenting the results of the exploratory data analysis and predictive modelling to various audiences.
This core training ensures a common body of knowledge that team members draw from during discussions, making the communication process much more efficient.

To continue the education and professional development of the team members, everyone is free to purchase any books, courses, or other training material as needed.

Key 2: Well-Designed Team Structure and Collaborative Culture

Once the initial training is over, each new hire is ready to be integrated into a functional part of the team. Integration involves finding the role that best suits their skill sets along with Amadeus’ needs. This approach allows the new hire to fill a position they are interested in while benefiting the organization.

The team structure was carefully designed to optimize the talent of the team members and to transparently reflect the desired interaction among the team members. Think of the High Performance Team Structure like the blueprint for success.

Data Science Team Structure, Designed for High Performance

It involves four key roles:

  1. Subject Matter Experts
  2. Data Engineering Experts
  3. Data Science Experts
  4. User Interface Experts
Subject Matter Experts (SME)

Amadeus has four SMEs who are involved at both the beginning and end of the investment strategy development process. At the beginning of the process, the SMEs are responsible for generating initial ideas for new strategies. These ideas are grounded in business fundamentals and meticulously researched before being discussed with the Data Engineering and Data Science teams. The SMEs are also responsible for the end of the process, the execution of the strategies. This ensures that the investment execution is in line with the original design of the strategies.

Relevant Skill-Sets:

  • Accounting and Finance: Deep understanding of financial analysis and capital markets is required to build initial strategy ideas
  • Excel: Excel is used to store initial strategy ideas
  • R: R is used to perform data exploration and efficiently work with data
Data Engineering Experts (DEE)

When the SMEs come up with new strategy ideas, the Data Engineering team is subsequently called to gather and make available the data required for the Data Science team to test the ideas. With petabytes of financial data at hand, the DEEs need to master programming methods that will make data delivery and computation as efficient as possible. Also, Amadeus has focused on data quality since further analysis is only meaningful given good quality data. The financial data is often noisy, contains many missing values, and requires timestamp joins, which is very difficult due to the size of the data and the fact that global data sources rarely align.

Relevant Skill-Sets:

  • C++: C++ is a high performance language at the heart of their data engineering operation. Parallelizing computations and developing distributed systems using C++ enables Amadeus to take full advantage of working with big data
  • SQL: SQL is the language used to directly interact with their databases
  • R: The data.table package is mainly used to scale R for speed when taking strategies from exploration to production
Data Science Experts (DSE)

The DSEs at Amadeus are critical for exploring various properties of ideas generated by the SMEs and developing different algorithms required by the strategy, based on their expertise in statistical analysis, machine learning (supervised and unsupervised), time series analysis, and text analysis. The main challenge they face is being able to iterate through the stream of hypotheses generated by the SMEs and rapidly develop analyses. They are the ones who identify patterns or anomalies in the dataset, produce concise reports for the SMEs to allow fast interpretation of results, and determine when the ROI from a project has diminished and new projects should be started.

Relevant Skill-Sets:

  • R: R is used for exploratory data analysis (EDA) and visualization because of its ease of use for exploration. The tidyverse is predominantly being used for quickly transforming data prior to exploration.
  • Python: Python is used for advanced machine learning and deep learning with high-performance NVIDIA GPUs. All the top deep learning frameworks are available in Python and can be easily deployed through the tools provided in the NVIDIA GPU Cloud.
User Interface Experts (UIE)

Amadeus develops interactive web applications to support internal decision-making and operations. New challenges present themselves when building dashboards. The application needs to be customized to the problem but also perform well when it comes to interactivity. Given these constraints, building a performant application often comes down to selecting the right tools. The UIEs use R + Shiny for lightweight applications or Python, Django and JavaScript when performance and interactivity are major concerns.

Relevant Skill-Sets:

  • Databases: Data-driven web applications start at the database. Knowledge of the appropriate query language (SQL, MongoDB, etc.) is necessary for effectively handling data.
  • Data Analysis: R + Shiny can be used for a quick proof of concept, while Python + Django are used for production level performance.
  • Web Development: HTML, CSS, JavaScript are a necessity when creating sophisticated web-based user interfaces.
Emphasis on Communication

An often overlooked part of a data science team is the team aspect, which requires communicating ideas and analyses throughout the workflow. In most other organizations, departments work in silos, interacting with each other only at the senior management level. This prevents members from seeing the big picture and breeds internal competition, to the detriment of organizational performance.

At Amadeus, a collaborative culture is encouraged: every project is carried out by a cross-functional team involving at least one person from each of the four functional parts described above. This way, projects benefit from the different perspectives of team members, and the research process is streamlined without conflicts between stages.

Also, all-hands weekly meetings are organized to keep each other up to date on individual progress and create a forum for team members to share insights and suggestions.

Key 3: Access to Cutting-Edge Technology

As mentioned above, it takes tremendous effort to find and train talent and have them work collaboratively. All of this effort would be futile if there were a technological bottleneck in the research process.

Data Science Team members have full access to computational infrastructure for both GPU intensive work (DL, NLP), and CPU intensive work (data cleaning, report generation, EDA). Their systems provide all team members immediate access to high-performance computational resources to minimize the time spent waiting for computations to run. This enables the team to quickly iterate through ideas.

At Amadeus, each team has their own computation stack as to not interfere with the work of the other teams. This infrastructure is all connected to allow interaction between teams.

  • Data Engineering: Systems optimized for populating and querying databases. The DEEs provide a custom API that allows all other teams immediate access to data.

  • Data Science: High-performance CPU and GPU systems ideal for training machine learning models and performing EDA.

  • UI/Web Applications: Systems designed specifically for hosting web applications and in-house Shiny/Django applications. The UIEs can use the DSEs’ infrastructure when high-performance computations are required in the backend.

  • Subject Matter Experts: Access to data and high-performance hardware through front-end APIs as well as hardware specifically designed for their execution needs.

Amadeus has partnered with NVIDIA, pioneers of the next generation of computational hardware for Artificial Intelligence research and deployment. The team is actively using high-performance computing with their in-house analytical technology stack that boasts the NVIDIA DGX-1, the world’s fastest deep learning system.

Business Science witnessed Amadeus' data science team train a text classifier on financial news data for predicting article sentiment. The NVIDIA DGX-1 produced results in a matter of minutes that would have taken hours, if not days, on a CPU system or even on a GPU system not optimized for deep learning.

Best-In-Class Data Science Education

Turning Insights Into Education

Business Science has gained the following insights from the Amadeus case study:

  1. Hiring talent with subject matter expertise and subsequently educating them in data science has proven to be effective in building a high performance team

  2. Communication among different teams is important, and education needs to support communication among the different teams

  3. The teams need to be equipped with the latest technology to reach full potential

Unfortunately, data science education is still in its infancy because most educational institutions don’t understand what it takes to do real-world data science. Most programs focus on theory or tools. This doesn’t work. Learning how to do real-world data science only comes from application and integration, and those with an understanding of the business have an advantage.

This is why Business Science is different.

We are building a best-in-class educational program that incorporates learnings:

  • Through studying an outlier – A radically different data science team of the highest caliber that is successfully generating ROI for their organization.

  • Through our own applied consulting experiences that have successfully generated ROI for organizations

  • Through experience building the tools and software needed to solve business problems

The next-generation Business Science education offers two options that integrate this knowledge:

  1. On-Premise Custom Workshops

  2. On-Demand Virtual Workshops

On-Premise Custom Workshops

Workshops are short but potent. In as little as 2 days, we can teach what it normally takes a data scientist years to learn. The key is our approach:

  • We do 6 weeks of preparatory work with your team

  • We use data that is relevant to your business and industry

  • We have best-in-class data science instructors knowledgeable in every facet of data science

  • We are business application focused

Business Science Custom Machine Learning Workshop, Client: S&P Global

On-Demand Virtual Workshops

At Business Science University, we are building two tracks focusing on R and Python, which teach the same tools that Amadeus’ Data Science Experts use for exploratory analysis and machine learning but are available on-demand and self-paced. The Data Science For Business Tracks focus on one real-world business problem, and the students apply many of the most in-demand machine learning tools over several weeks.

DS4B Course Tracks, Business Science University

The Business Science University Course Roadmap with 12-Month Timeline specifically targets the team roles and tool integration that matches the Amadeus team’s skill sets along with the Business Science team’s consulting and software development experience.

Course Roadmap and Timeline, Business Science University

Over the next 12 months we are specifically focusing on Data Science and UI / Web Applications:

  • Data Science For Business With R and Python: DS4B 201-R (available now) and DS4B 201-P (Q4 2018)

  • Web Application Development: DS4B 301-R using R + Shiny (Q3 2018) and DS4B 301-P using Python + Django (TBD)

  • Time Series Analysis: Virtual Workshop on Time Series Fundamentals, Machine Learning, and Deep Learning (H1 2019)

  • Text Analysis: Virtual Workshop on Text Analysis Fundamentals, Deep Learning (H2 2019)

  • Crash Courses (Not Shown on Roadmap): These are short courses that prepare students in R, Python, Spark, and SQL, which are needed for 200-series courses

The program will grow into Data Engineering including high-performance languages (e.g. C++), big data tools (e.g. Spark), and distributed computing.

Build Your Data Science Team Today

Building A Data Science Team? Business Science Can Help.

We are your educational partner. We are here to support your transition by providing best-in-class data science education. No matter what point you are at, we will take you where you need to go. Contact us to learn more about our educational data science capabilities.


Dot-Pipe Paper Accepted by the R Journal!!!

Tue, 09/18/2018 - 02:28

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

We are thrilled to announce our (my and Nina Zumel’s) paper on the dot-pipe has been accepted by the R-Journal!

A huge “thank you” to the reviewers and editors for helping us with this! You can find our article here (pdf here)!


Machine Learning for Insurance Claims

Tue, 09/18/2018 - 02:00

(This article was first published on Posts on Tychobra, and kindly contributed to R-bloggers)

We are pleased to announce a new demo Shiny application that uses machine learning to predict annual payments on individual insurance claims for 10 years into the future.

This post describes the basics of the model behind the above Shiny app, and walks through the model fitting, prediction, and simulation ideas using a single claim as an example.


Insurance is the business of selling promises (insurance policies) to pay for potential future claims. After insurance policies are sold, it can be many years before a claim is reported to the insurer and many more years before all payments are made on the claim. Insurers carry a liability on their balance sheet to account for future payments on claims from previously sold policies. This liability is known as the Unpaid Claims Reserve, or just the Reserve. Since the Reserve is a liability for uncertain future payments, its exact value is not known and must be estimated. Insurers are very interested in estimating their Reserve as accurately as possible.

Traditionally insurers estimate their Reserve by grouping similar claims and exposures together and analyzing historical loss development patterns across the different groups. They apply the historical loss development patterns of the older groups (with adjustments) to the younger groups of claims and exposures to arrive at a Reserve estimate. This method does not accurately predict individual claim behavior, but, in aggregate, it can accurately estimate the expected value of the total Reserve.
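To make the traditional aggregate approach concrete, here is a toy chain-ladder calculation in R (the loss triangle and factors are entirely hypothetical, and real reserving work layers many adjustments on top of this):

```r
# Hypothetical cumulative paid losses by accident year (rows) and age (columns)
triangle <- matrix(c(
  1000, 1500, 1650,
  1100, 1700,   NA,
  1200,   NA,   NA
), nrow = 3, byrow = TRUE)

# Age-to-age development factors estimated from the older, more mature years
f12 <- sum(triangle[1:2, 2]) / sum(triangle[1:2, 1])  # age 1 -> age 2
f23 <- triangle[1, 3] / triangle[1, 2]                # age 2 -> age 3

# Apply the historical patterns to the younger years to project ultimate losses
ultimate <- c(
  triangle[1, 3],              # oldest year is already mature
  triangle[2, 2] * f23,        # develop one more period
  triangle[3, 1] * f12 * f23   # develop two more periods
)

# The Reserve estimate is projected ultimate losses minus payments to date
latest_paid <- c(triangle[1, 3], triangle[2, 2], triangle[3, 1])
reserve <- sum(ultimate - latest_paid)
```

Note how the reserve is only meaningful in aggregate: no individual claim's behavior is modeled, which is exactly the limitation the machine learning approach below addresses.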

Rather than using the traditional grouping methods, we predict payments on individual insurance claims using machine learning in R. Our goal is to model individual claim behavior as accurately as possible. We want our claim predictions to be indistinguishable from actual claims on an individual claim level, both in expected value and variance. If we can achieve this goal, we can come up with expected values and confidence levels for individual claims. We can aggregate the individual claims to determine the expected value and confidence levels for the total Reserve. There are many other insights we can gather from individual claim predictions, but let’s not get ahead of ourselves. Let’s start with an example.

Example – Predict Payments and Status for a Single Claim 1 Year into the Future

We will look at reported Workers’ Compensation claims. We will not predict losses on unreported claims (i.e. for our claim predictions, we already have 1 observation of the claim and we will predict future payments on this existing/reported claim). We will not predict future claims on policies that have not reported a claim to the insurer.

So our simplified model for 1 claim is:

We start with data for a single claim. Our goal is to predict the payments on our claim for 1 year into the future. We do this by feeding the claim to our payment model (which we will train later), and the model predicts the payments. Simple enough, right? But we also want to predict the status ("Open" or "Closed") of the claim. We could do something like this:

Now we first predict the status, and then we predict the payments. We are getting closer to our goal now, but we also want to capture the variability in our prediction. We do this by adding a simulation after each model prediction:

The prediction returned by the above status model is the probability that the claim will be open at age 2 (e.g. our model might predict a 25% probability that our claim will be open at age 2. Since there are only 2 possible status states, “Open” or “Closed”, a 25% probability of being open implies a 75% probability the claim is closed at age 2). We use this probability to simulate open and closed observations of the claim. We then feed the observations (with simulated status) to our trained payment model. The payment model predicts an expected payment, and finally we simulate the variation in this expected payment based on the distribution of the payment model’s residuals.

If this sounds confusing, bear with me. I think a coded example will show it is actually pretty straightforward. Let's get started:

Step 1: Load the data

Each claim has 10 variables. We have 3,296 claims in the training data and we pulled out 1 claim. We will make predictions for this 1 claim once we fit the models.

```r
library(knitr)
library(kableExtra)
library(dplyr)

# This .RDS can be found at
data_training <- readRDS("../../static/data/model_fit_data.RDS")

test_claim_num <- "WC-114870"

# select one claim that we will make predictions for
test_claim <- data_training %>%
  filter(claim_num == test_claim_num)

# remove the one claim that we will predict from the training data
data_training <- data_training %>%
  filter(claim_num != test_claim_num)

knitr::kable(
  head(data_training)
) %>%
  kableExtra::kable_styling(bootstrap_options = "striped", font_size = 12) %>%
  kableExtra::scroll_box(width = "100%")
```

| claim_num | mem_num | cause_code | body_part_code | nature_code | class_code | status | paid_total | status_2 | paid_incre_2 |
| WC-163290 | Member A | Struck Or Injured | Skull | Amputation | 7720 | C | 6893.6947 | C | 451.3185 |
| WC-923705 | Member A | Struck Or Injured | Brain | Amputation | 9101 | C | 60247.1871 | C | 0.0000 |
| WC-808299 | Member A | Fall – On Same Level | Skull | Burn | 9101 | O | 34240.6365 | O | 2201.1637 |
| WC-536953 | Member A | Struck Or Injured | Skull | Contusion | 9015 | C | 599.8090 | C | 0.0000 |
| WC-146535 | Member A | Struck Or Injured | Skull | Contusion | 7720 | C | 265.5697 | C | 0.0000 |
| WC-523720 | Member A | Strain – Lifting | Skull | Concussion | 7720 | C | 85.8200 | C | 0.0000 |

The first column (“claim_num”) is the claim identifier and will not be used as a predictor variable in the model training. The 7 predictor variables (columns 2 through 8) represent the claim at age 1. The last 2 columns (“status_2” and “paid_incre_2”) are the variables we will predict. “status_2” is the claim’s status (“O” for Open and “C” for Closed) at age 2. “paid_incre_2” is the incremental payments made on the claim between age 1 and 2.

Step 2: Fit the Status Model

Our status model is a logistic regression, and we train it with the caret package. We use the step AIC method to perform feature selection.

```r
library(caret)

# train the model
tr_control <- caret::trainControl(
  method = "none",
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

status_model_fit <- caret::train(
  status_2 ~ .,
  data = data_training[, !(names(data_training) %in% c("claim_num", "paid_incre_2"))],
  method = "glmStepAIC",
  preProcess = c("center", "scale"),
  trace = FALSE,
  trControl = tr_control
)

# create a summary to assess the model
smry <- summary(status_model_fit)[["coefficients"]]
smry_rownames <- rownames(smry)
out <- cbind(
  "Variable" = smry_rownames,
  as_data_frame(smry)
)

# display a table of the predictive variables used in the model
knitr::kable(
  out,
  digits = 3,
  row.names = FALSE
) %>%
  kableExtra::kable_styling(bootstrap_options = "striped", font_size = 12) %>%
  kableExtra::scroll_box(width = "100%")
```

| Variable | Estimate | Std. Error | z value | Pr(>|z|) |
| (Intercept) | -6.293 | 27.188 | -0.231 | 0.817 |
| cause_codeStrain | 0.174 | 0.057 | 3.036 | 0.002 |
| cause_codeStrike | 0.268 | 0.111 | 2.423 | 0.015 |
| cause_codeStruck Or Injured | 0.301 | 0.140 | 2.151 | 0.031 |
| body_part_codeSkull | -0.312 | 0.122 | -2.559 | 0.011 |
| nature_codeBurn | 0.266 | 0.123 | 2.168 | 0.030 |
| nature_codeContusion | -0.430 | 0.271 | -1.589 | 0.112 |
| nature_codeLaceration | 0.179 | 0.093 | 1.914 | 0.056 |
| nature_codeStrain | 0.156 | 0.066 | 2.365 | 0.018 |
| class_code8810 | 0.228 | 0.113 | 2.024 | 0.043 |
| class_code9102 | -2.706 | 144.101 | -0.019 | 0.985 |
| class_codeOther | 0.198 | 0.119 | 1.666 | 0.096 |
| statusO | 1.264 | 0.118 | 10.742 | 0.000 |
| paid_total | 0.184 | 0.048 | 3.862 | 0.000 |

The above table shows the variables that our step AIC method decided were predictive enough to keep around. The lower the p-value in the “Pr(>|z|)” column, the more statistically significant the variable. Not surprisingly “statusO” (the status at age 1) is highly predictive of the status at age 2.
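One note on interpreting these estimates: logistic regression coefficients are on the log-odds scale, so a linear predictor has to be passed through the inverse logit to become a probability. A small sketch with made-up numbers (the intercept here is hypothetical, and the real model centers and scales its predictors, so this is only an illustration):

```r
# Hypothetical intercept plus the statusO coefficient from the table above;
# the fitted model uses centered-and-scaled inputs, so this is only a sketch
intercept <- -1.5
beta_statusO <- 1.264

log_odds <- intercept + beta_statusO * 1  # claim open at age 1
prob_open_age2 <- plogis(log_odds)        # inverse logit: 1 / (1 + exp(-log_odds))
```

This conversion is what `predict(..., type = "prob")` does for us later in Step 4.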

Step 3: Fit the payment model

Next we use the xgboost R package along with caret to fit our payment model. We search through many possible tuning parameter values to find the best values to tune the boosted tree. We perform cross validation to try to avoid overfitting the model.

```r
library(xgboost)

xg_grid <- expand.grid(
  nrounds = 200,                       # the maximum number of iterations
  max_depth = c(2, 6),
  eta = c(0.1, 0.3),                   # shrinkage, range (0, 1)
  gamma = c(0, 0.1, 0.2),              # "pseudo-regularization hyperparameter"
                                       # (complexity control), range [0, Inf].
                                       # Higher gamma means more regularization;
                                       # the default is 0, and 20 would be
                                       # extremely high and is not recommended.
  colsample_bytree = c(0.5, 0.75, 1),  # range [0, 1]
  min_child_weight = 1,
  subsample = 1
)

payment_model_fit <- caret::train(
  paid_incre_2 ~ .,
  data = data_training[, !(names(data_training) %in% c("claim_num"))],
  method = "xgbTree",
  tuneGrid = xg_grid,
  trControl = caret::trainControl(
    method = "repeatedcv",
    repeats = 1
  )
)
```

This trained model can predict payments on an individual claim like so:

```r
test_claim_2 <- test_claim

test_claim_2$predicted_payment <- predict(
  payment_model_fit,
  newdata = test_claim_2[, !(names(test_claim_2) %in% c("claim_num", "paid_incre_2"))]
)

knitr::kable(
  test_claim_2
) %>%
  kableExtra::kable_styling(font_size = 12) %>%
  kableExtra::scroll_box(width = "100%")
```

| claim_num | mem_num | cause_code | body_part_code | nature_code | class_code | status | paid_total | status_2 | paid_incre_2 | predicted_payment |
| WC-114870 | Member A | Chemicals Burn | Skull | Burn | Other | O | 5517.135 | C | 1748.148 | 4103.079 |

The predicted payment (shown in the last column named “predicted_payment”) for our above claim was 4,103, and the actual payment between age 1 and 2 was 1,748. Our prediction was not extremely accurate, but it was at least within a reasonable range (or so it seems to me). Due to the inherent uncertainty in workers’ compensation claims, it is unlikely we will ever be able to predict each individual insurance claim with a high degree of accuracy. Our aim is to instead predict payments within a certain range with high accuracy. We use the following simulation technique to quantify this range.

Step 4: Use the probability of being open to simulate the status

In our above prediction we used status at age 2 as a variable to predict the payment between age 1 and 2. Of course the actual status at age 2 will not yet be known when the claim is at age 1. Instead we can use our claim status model to simulate the status at age 2. We can then use this simulated status as a predictor variable.

```r
# use the status model to get the predicted probability that our test claim will be open
test_claim_3 <- test_claim
test_claim_3$prob_open <- predict(
  status_model_fit,
  newdata = test_claim_3,
  type = "prob"
)$O

# set the number of simulations to run
n_sims <- 200

set.seed(1235)

out <- data_frame(
  "sim_num" = 1:n_sims,
  "claim_num" = test_claim_3$claim_num,
  # run a binomial simulation to simulate the claim status `n_sims` times;
  # it returns a 0 for closed and a 1 for open
  "status_sim" = rbinom(n = n_sims, size = 1, prob = test_claim_3$prob_open)
)

# convert 0s and 1s to "C"s and "O"s
out <- out %>%
  mutate(status_sim = ifelse(status_sim == 0, "C", "O"))

kable(head(out)) %>%
  kableExtra::kable_styling(font_size = 12)
```

| sim_num | claim_num | status_sim |
| 1 | WC-114870 | C |
| 2 | WC-114870 | C |
| 3 | WC-114870 | C |
| 4 | WC-114870 | O |
| 5 | WC-114870 | C |
| 6 | WC-114870 | C |

The third column "status_sim" shows our simulated status. These are the values we pass to the payment model (along with the 6 other predictor variables from age 1). We now have 200 observations of our single claim because we simulated the status 200 times. We can look at the number of times our claim was simulated to be open and closed like so:

table(out$status_sim)
##
##   C   O
## 168  32

Step 5: Predict a Payment for each of the simulated statuses

out <- left_join(out, test_claim, by = "claim_num") %>%
  mutate(status_2 = status_sim) %>%
  select(-status_sim)

out$paid_incre_fit <- predict(payment_model_fit, newdata = out)

knitr::kable(
  head(out)
) %>%
  kableExtra::kable_styling(font_size = 12) %>%
  kableExtra::scroll_box(width = "100%")

sim_num claim_num mem_num cause_code body_part_code nature_code class_code status paid_total status_2 paid_incre_2 paid_incre_fit
1 WC-114870 Member A Chemicals Burn Skull Burn Other O 5517.135 C 1748.148 4103.079
2 WC-114870 Member A Chemicals Burn Skull Burn Other O 5517.135 C 1748.148 4103.079
3 WC-114870 Member A Chemicals Burn Skull Burn Other O 5517.135 C 1748.148 4103.079
4 WC-114870 Member A Chemicals Burn Skull Burn Other O 5517.135 O 1748.148 23163.445
5 WC-114870 Member A Chemicals Burn Skull Burn Other O 5517.135 C 1748.148 4103.079
6 WC-114870 Member A Chemicals Burn Skull Burn Other O 5517.135 C 1748.148 4103.079

The predicted payments are in the last column above. These predicted payments differ depending on whether the claim status was simulated to be open or closed. This gives us a little variability in our predicted payment, but it still does not capture the random variability of real-world claims.

Step 6: Simulate Variability Around the Predicted Payment

Next we apply random variation to the predicted payment. For the sake of brevity here, we arbitrarily choose the negative binomial distribution, but in a real world analysis we would fit different distributions to the residuals to determine an appropriate model for the payment variability.

out$paid_incre_sim <- sapply(
  out$paid_incre_fit,
  function(x) {
    rnbinom(n = 1, size = x ^ (1 / 10), prob = 1 / (1 + x ^ (9 / 10)))
  }
)

knitr::kable(
  head(out)
) %>%
  kableExtra::kable_styling(font_size = 12) %>%
  kableExtra::scroll_box(width = "100%")

sim_num claim_num mem_num cause_code body_part_code nature_code class_code status paid_total status_2 paid_incre_2 paid_incre_fit paid_incre_sim
1 WC-114870 Member A Chemicals Burn Skull Burn Other O 5517.135 C 1748.148 4103.079 12706
2 WC-114870 Member A Chemicals Burn Skull Burn Other O 5517.135 C 1748.148 4103.079 4133
3 WC-114870 Member A Chemicals Burn Skull Burn Other O 5517.135 C 1748.148 4103.079 1120
4 WC-114870 Member A Chemicals Burn Skull Burn Other O 5517.135 O 1748.148 23163.445 15793
5 WC-114870 Member A Chemicals Burn Skull Burn Other O 5517.135 C 1748.148 4103.079 2015
6 WC-114870 Member A Chemicals Burn Skull Burn Other O 5517.135 C 1748.148 4103.079 1655

The payment we just simulated above (column “paid_incre_sim”) contains significantly more variability than our previously predicted payments. We can get a better sense of this variability with a histogram:

library(ggplot2)
library(scales)

payment_mean <- round(mean(out$paid_incre_sim), 0)
payment_mean_display <- format(payment_mean, big.mark = ",")

ggplot(out, aes(x = paid_incre_sim)) +
  geom_histogram(color = "darkblue", fill = "white") +
  ylab("Number of Observations") +
  xlab("Predicted Payment") +
  ggtitle("Simulated Predicted Payments Between Age 1 and 2") +
  scale_x_continuous(labels = scales::comma) +
  geom_vline(
    xintercept = payment_mean,
    color = "#FF0000"
  ) +
  geom_text(
    aes(
      x = payment_mean,
      label = paste0("Simulation Mean = ", payment_mean_display),
      y = 10
    ),
    colour = "#FF0000",
    angle = 90,
    vjust = -0.7
  )

In the above plot we see our claim is predicted to have a payment anywhere from 0 to a little less than 80,000. The mean of the simulated predictions is 7,146, and the actual payment between age 1 and 2 now falls within our predicted range of possible payments. From a quick glance, this distribution of possible claim payment values seems to be a reasonable representation of real-world claim development.

There is one more easy alteration that makes our simulated claim behave more like an actual claim. Actual claims have a fairly high probability of experiencing zero incremental payment over the upcoming year. We can count the zero-payment claims in our training data:

zero_paid <- data_training %>%
  filter(paid_incre_2 == 0)

# calculate the probability that a claim does not have any payment
prob_zero_paid <- nrow(zero_paid) / nrow(data_training)

69.9% of our claims have zero payment between ages 1 and 2. Our model currently does not make a clear distinction between claims with payment and claims with zero payment. A next step would be to fit another logistic regression to the probability of zero payment between age 1 and 2, and run a binomial simulation (just like our claim status simulation) to determine if the claim will have an incremental payment of zero or a positive incremental payment. We would also need to retrain our payment model using only the claims that had a positive payment between age 1 and 2 in the training set.

We will leave the zero payment model for you to try out on your own.
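If you want a starting point, here is a hedged sketch of that extension. It uses a simulated toy training frame in place of the article's data_training, so the column generating process, the toy_train name, and the coefficient structure are all illustrative assumptions, not the author's model.

```r
# Hypothetical sketch of the zero-payment extension, using simulated toy
# data in place of the article's `data_training`.
set.seed(42)

toy_train <- data.frame(
  status_2   = sample(c("C", "O"), 500, replace = TRUE, prob = c(0.8, 0.2)),
  paid_total = rlnorm(500, meanlog = 7, sdlog = 1)
)
# in this toy world, claims open at age 2 are less likely to have zero payment
toy_train$zero_paid <- rbinom(
  500, size = 1,
  prob = ifelse(toy_train$status_2 == "O", 0.3, 0.8)
)

# logistic regression for the probability of zero incremental payment
zero_paid_fit <- glm(
  zero_paid ~ status_2 + paid_total,
  data = toy_train,
  family = binomial()
)

# simulate zero / non-zero payment indicators for one claim, 200 times
new_claim <- data.frame(status_2 = "O", paid_total = 5517.135)
p_zero <- predict(zero_paid_fit, newdata = new_claim, type = "response")
zero_sim <- rbinom(n = 200, size = 1, prob = p_zero)

# a positive payment would then be drawn only when zero_sim == 0, from a
# payment model retrained on claims with paid_incre_2 > 0
table(zero_sim)
```

The binomial gate here mirrors the claim status simulation in Step 4; only the retraining of the payment model on positive-payment claims is left out of the sketch.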

Why Run the Simulations

Often models are concerned with predicting the most likely outcome, the probability of an outcome, or the mean expected value. However, in insurance, we often are not as concerned with the mean, median, or mode outcome as we are with the largest loss we would expect with a high degree of confidence (e.g. We expect to pay no more than x at a 99% confidence level). Additionally, insurers are concerned with the loss payments they can expect to pay under various risk transfer alternatives (e.g. what is the expected loss if cumulative per claim payments are limited to 250,000 and aggregate losses per accident year over 1,000,000 are split 50%/50% between the insurer and a reinsurer). And what are the confidence levels for these risk transfer losses? With per claim predictions and simulations we can answer these questions for all risk transfer options at all confidence levels.
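Given a vector of simulated per-claim payments like out$paid_incre_sim, these quantities fall out of the simulation directly. A sketch, where paid_sim is a placeholder simulation vector (not the article's output) and the 250,000 per-claim limit mirrors the example in the text:

```r
# placeholder simulated payments standing in for out$paid_incre_sim
set.seed(1)
paid_sim <- rnbinom(n = 10000, size = 2, mu = 7000)

# largest payment we would expect at a 99% confidence level
q99 <- quantile(paid_sim, probs = 0.99)

# expected payment if cumulative per-claim payments are limited to 250,000
limited_mean <- mean(pmin(paid_sim, 250000))

q99
limited_mean
```

The same pattern (a quantile or a mean of a transformed simulation vector) extends to aggregate limits by summing simulated payments across claims before applying the transformation.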

There is plenty of room for further improvement to this model, and there is a wide range of insights we could explore using individual claim simulations. Let me know your ideas in the comments.


To leave a comment for the author, please follow the link and comment on their blog: Posts on Tychobra. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Building Reproducible Data Packages with DataPackageR

Tue, 09/18/2018 - 02:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Sharing data sets for collaboration or publication has always been challenging, but it’s become increasingly problematic as complex and high dimensional data sets have become ubiquitous in the life sciences. Studies are large and time consuming; data collection takes time, data analysis is a moving target, as is the software used to carry it out.

In the vaccine space (where I work) we analyze collections of high-dimensional immunological data sets from a variety of different technologies (RNA sequencing, cytometry, multiplexed antibody binding, and others). These data often arise from clinical trials and can involve tens to hundreds of subjects. The data are analyzed by teams of researchers with a diverse variety of goals.

Data from a single study will lead to multiple manuscripts by different principal investigators, dozens of reports, talks, presentations. There are many different consumers of data sets, and results and conclusions must be consistent and reproducible.

Data processing pipelines tend to be study specific. Even though we have great tidyverse tools that make data cleaning easier, every new data set has idiosyncrasies and unique features that require some bespoke code to convert them from raw to tidy data.

Best practices tell us raw data are read-only, and analysis is done on some form of processed and tidied data set. Conventional wisdom is that this data tidying takes a substantial amount of the time and effort in a data analysis project, so once it’s done, why should every consumer of a data set repeat it?

One important reason is we don’t always know what a collaborator did to go from raw data to their version of an analysis-ready data set, so the instinct is to ask for the raw data and do the processing ourselves, which involves shipping around inconveniently large files.


How can we ensure:

  • Data consumers aren’t reinventing the wheel and repeating work already done.
  • Data are processed and tidied consistently and reproducibly.
  • Everyone is using the same versions of data sets for their analyses.
  • Reports and analyses are tied to specific versions of the data and are updated when data change.

Solutions to these issues require reproducible data processing, easy data sharing and versioning of data sets, and a way to verify and track data provenance of raw to tidy data.

Much of this can be accomplished by building R data packages, which are formal R packages whose sole purpose is to contain, access, and / or document data sets. By co-opting R’s package system, we get documentation, versioning, testing, and other best practices for free.

DataPackageR was built to help with this packaging process.

Benefits of DataPackageR
  • It aims to automate away much of the tedium of packaging data sets without getting too much in the way, and keeps your processing workflow reproducible.

  • It sets up the necessary package structure and files for a data package.

  • It allows you to keep the large, raw data sets separate and ship only the packaged tidy data, saving space and sparing consumers of your data set the time needed to download and re-process it.

  • It maintains a reproducible record (vignettes) of the data processing along with the package. Consumers of the data package can verify how the processing was done, increasing confidence in your data.

  • It automates construction of the documentation and maintains a data set version and an md5 fingerprint of each data object in the package. If the data changes and the package is rebuilt, the data version is automatically updated.

  • Data packages can be version controlled on GitHub (or other VCS sites), making for easy change tracking and sharing with collaborators.

  • When a manuscript is submitted based on a specific version of a data package, one can make a GitHub release and automatically push it to sites like zenodo so that it is permanently archived.

  • Consumers of the data package pull analysis-ready data from a central location, ensuring all downstream results rely on consistently processed data sets.


The package developed organically over the span of several years, initially as a proof-of-concept, with features bolted on over time. Paul Obrecht kindly contributed the data set autodocumentation code and provided in-house testing. As it gained usage internally, we thought it might be useful to others, and so began the process of releasing it publicly.


DataPackageR was a departure from the usual tools we develop, which live mostly on Bioconductor. We thought the rOpenSci community would be a good place to release the package and reach a wider, more diverse audience.

The onboarding process

Onboarding was a great experience. Maëlle Salmon, Kara Woo, Will Landau, and Noam Ross volunteered their time to provide in-depth, comprehensive and careful code review, suggestions for features, documentation, and improvements in coding style and unit tests, that ultimately made the software better. It was an impressive effort (you can see the GitHub issue thread for yourself here).

The current version of the package has 100% test coverage and comprehensive documentation. One great benefit is that as I develop new features in the future, I can be confident I’m not inadvertently breaking something else.

rOpenSci encourages a number of other best practices that I had been somewhat sloppy about enforcing on myself. One is maintaining an updated NEWS file which tracks major changes and new features, and links them to relevant GitHub issues. I find this particularly useful, as I’ve always appreciated an informative NEWS file to help me decide if I should install the latest version of a piece of software. The online handbook for rOpenSci package development, maintenance and peer review is a great resource to learn about what some of those other best practices are and how you can prepare your own software for submission.

Using DataPackageR

Here’s a simple example that demonstrates how DataPackageR can be used to construct a data package.

In this example, I’ll tidy the cars data set by giving it more meaningful column names. The original data set has columns speed and dist. We’ll be more verbose and rename them to speed_mph and stopping_distance.

An outline of the steps involved:
  1. Create a package directory structure. (DataPackageR::datapackage_skeleton())
  2. Add a raw data set. (DataPackageR::use_raw_dataset())
  3. Add a processing script to process cars into tidy_cars. (DataPackageR::use_processing_script())
  4. Define the tidy data object (named tidy_cars) that will be stored in our data package. (DataPackageR::use_data_object())
Create a data package directory structure.

The first step is to create the data package directory structure. This package structure is based on rrrpkg and some conventions introduced by Hadley Wickham.

datapackage_skeleton(name = "TidyCars", path = tempdir(), force = TRUE)
## ✔ Setting active project to '/private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpXXFG1L/TidyCars'
## ✔ Creating 'R/'
## ✔ Creating 'man/'
## ✔ Writing 'DESCRIPTION'
## ✔ Writing 'NAMESPACE'
## ✔ Added DataVersion string to 'DESCRIPTION'
## ✔ Creating 'data-raw/'
## ✔ Creating 'data/'
## ✔ Creating 'inst/extdata/'
## ✔ configured 'datapackager.yml' file

We get some informative output as the various directories are created. Notice the message “Setting active project to…”. This is what allows the rest of the workflow below to work as expected. Internally it uses usethis::proj_set and usethis::proj_get, so be aware that if you are using that package in your own scripts and mixing it with DataPackageR, there’s a risk of unexpected side-effects.

The directory structure that’s created is shown below.

/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T//RtmpXXFG1L/TidyCars
├── DESCRIPTION
├── R
├── Read-and-delete-me
├── data
├── data-raw
├── datapackager.yml
├── inst
│   └── extdata
└── man

6 directories, 3 files

Add raw data to the package.

Next, we work interactively, following the paradigm of the usethis package.

# write our raw data to a csv
write.csv(x = cars, file = file.path(tempdir(), "cars.csv"), row.names = FALSE)

# this works because we called datapackage_skeleton() first.
use_raw_dataset(file.path(tempdir(), "cars.csv"))

use_raw_dataset() moves the file or path in its argument into inst/extdata under the data package source tree. This raw (usually non-tidy) data will be installed with the data package.

For large data sets that you may not want to distribute with the package, you could place them in a directory external to the package source, or place them in inst/extdata but include them in .Rbuildignore. In fact as I write this, there should be an option to add a data set to .Rbuildignore automatically. That would be a good first issue for anyone who would like to contribute.

Add a data processing script.

We add a script to process cars into tidy_cars. The author and title are provided as arguments. They will go into the yaml frontmatter of the Rmd file.

Note we specify that our script is an Rmd file. This is recommended. Use literate programming to process your data, and the Rmd will appear as a vignette in your final data package.

use_processing_script(
  file = "tidy_cars.Rmd",
  author = "Greg Finak",
  title = "Process cars into tidy_cars."
)
## configuration:
##   files:
##     tidy_cars.Rmd:
##       enabled: yes
##   objects: []
##   render_root:
##     tmp: '103469'

The script file tidy_cars.Rmd is created in the data-raw directory of the package source. The output echoed after the command is the contents of the datapackager.yml configuration file. It controls the build process. Here the file is added to the configuration. You can find more information about it in the configuration vignette.

Edit your processing script.

Our script will look like this:

---
title: Process cars into tidy_cars.
author: Greg Finak
date: September 5, 2018
output_format: html_document
---

```{r}
library(dplyr)
cars <- read.csv(project_extdata_path('cars.csv'))
tidy_cars <- cars %>% rename(speed_mph = speed, stopping_distace = dist)
```

Followed by a description of what we are doing.

An important note about reading raw data from the processing script.

In order to read raw data sets in a reproducible manner, DataPackageR provides an API call:
project_extdata_path() that returns the absolute path to its argument in inst/extdata in a reproducible way, independent of the current working directory. There are also project_path() and project_data_path() that will point to the source root and the data directory, respectively.

NOTE that DataPackageR is not compatible with the here package. Use the APIs above instead.

This script creates the tidy_cars object, which is what we want to store in our final package.

Let DataPackageR know about the data objects to store in the package.

We let DataPackageR know about this:

use_data_object("tidy_cars")
## configuration:
##   files:
##     tidy_cars.Rmd:
##       enabled: yes
##   objects:
##   - tidy_cars
##   render_root:
##     tmp: '103469'

Again, the datapackager.yml configuration is echoed, and we see the data set object has been added.

The build process uses this file to know which scripts to run and what data outputs to expect. More information is in the technical vignette.

It will automatically create documentation stubs for the package and for these data objects.

Build the package (for the first time).

We build the package. It will automatically generate some documentation for the data sets that we’ll need to go in and edit. There’s also some informative output. The output has been cleaned up recently, particularly now that the package has stabilized.

options("DataPackageR_interact" = FALSE)

If you run package_build() in interactive mode, you’ll be prompted to fill in one line of information that will be put in the NEWS file, describing the changes to the data package. This helps you track changes across versions. Setting options("DataPackageR_interact" = FALSE) turns off interactive mode.

package_build(packageName = file.path(tempdir(), "TidyCars"), install = FALSE)
##
## ✔ Setting active project to '/private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpXXFG1L/TidyCars'
## ✔ 1 data set(s) created by tidy_cars.Rmd
## • tidy_cars
## ☘ Built all datasets!
## Non-interactive file update.
##
## ✔ Creating 'vignettes/'
## ✔ Creating 'inst/doc/'
## First time using roxygen2. Upgrading automatically...
## Updating roxygen version in /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpXXFG1L/TidyCars/DESCRIPTION
## Writing NAMESPACE
## Loading TidyCars
## Writing TidyCars.Rd
## Writing tidy_cars.Rd
## '/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file \
##   --no-environ --no-save --no-restore --quiet CMD build \
##   '/private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpXXFG1L/TidyCars' \
##   --no-resave-data --no-manual --no-build-vignettes
##
## Next Steps
## 1. Update your package documentation.
##    - Edit the documentation.R file in the package source data-raw subdirectory and update the roxygen markup.
##    - Rebuild the package documentation with document() .
## 2. Add your package to source control.
##    - Call git init . in the package source root directory.
##    - git add the package files.
##    - git commit your new package.
##    - Set up a github repository for your pacakge.
##    - Add the github repository as a remote of your local package repository.
##    - git push your local repository to gitub.
## [1] "/private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpXXFG1L/TidyCars_1.0.tar.gz"

The argument install = FALSE prevents the package from being automatically installed to the system after building.

Above, you’ll see the message:

1 data set(s) created by tidy_cars.Rmd

This indicates that the build process found the expected data set. It then lists the specific data set(s) it created in that script.
It stores those data sets in the package data subfolder as .rda files.

You’ll see it also created vignettes, wrote a roxygen stub for the documentation of the TidyCars package and the tidy_cars data object. It created provisional Rd files, and built the package archive. It even provides some information on what to do next.

Next edit the data set documentation.

Let’s update our documentation as requested. We edit the documentation.R file under data-raw.

Here are its contents:

#' TidyCars
#' A data package for TidyCars.
#' @docType package
#' @aliases TidyCars-package
#' @title Package Title
#' @name TidyCars
#' @description A description of the data package
#' @details Use \code{data(package='TidyCars')$results[, 3]} to see a list of available data sets in this data package
#' and/or DataPackageR::load_all_datasets() to load them.
#' @seealso
#' \link{tidy_cars}
NULL

#' Detailed description of the data
#' @name tidy_cars
#' @docType data
#' @title Descriptive data title
#' @format a \code{data.frame} containing the following fields:
#' \describe{
#' \item{speed_mph}{}
#' \item{stopping_distace}{}
#' }
#' @source The data comes from________________________.
#' @seealso
#' \link{TidyCars}
NULL

This is standard roxygen markup. You can use roxygen or markdown style comments. You should describe your data set, where it comes from, the columns of the data (if applicable), and any other information that can help a user make good use of and understand the data set.

We’ll fill this in and save the resulting file.

## #' TidyCars
## #' A data package for TidyCars.
## #' @docType package
## #' @aliases TidyCars-package
## #' @title A tidied cars data set.
## #' @name TidyCars
## #' @description Cars but better. The variable names are more meaningful.
## #' @details The columns have been renamed to indicate the units and better describe what is measured.
## #' @seealso
## #' \link{tidy_cars}
## NULL
##
##
##
## #' The stopping distances of cars at different speeds.
## #' @name tidy_cars
## #' @docType data
## #' @title The stopping distances of cars traveling at different speeds.
## #' @format a \code{data.frame} containing the following fields:
## #' \describe{
## #' \item{speed_mph}{The speed of the vehicle.}
## #' \item{stopping_distace}{The stopping distance of the vehicle.}
## #' }
## #' @source The data comes from the cars data set distributed with R.
## #' @seealso
## #' \link{TidyCars}
## NULL

Then we run the document() method in the DataPackageR:: namespace to rebuild the documentation.

# ensure we run document() from the DataPackageR namespace and not document() from roxygen or devtools.
package_path <- file.path(tempdir(), "TidyCars")
DataPackageR::document(package_path)
##
## ✔ Setting active project to '/private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpXXFG1L/TidyCars'
## Updating TidyCars documentation
## Loading TidyCars
## Writing TidyCars.Rd
## Writing tidy_cars.Rd
## [1] TRUE

Iterate…

We can add more data sets, more scripts, and so forth, until we’re happy with the package.

A final build.

Finally we rebuild the package one last time. The output is suppressed here for brevity.

package_build(file.path(tempdir(), "TidyCars"))

Sharing and distributing data packages.

If you are the data package developer you may consider:

  • Place the source of the data package under version control (we like git and GitHub).
  • Share the package archive (yourpackage-x.y.z.tar.gz)
    • on a public repository.
    • directly with collaborators.

We’ve placed the TidyCars data package on GitHub so that you can see for yourself how it looks.

Limitations and future work

Versions of software dependencies for processing scripts are not tracked. Users should use sessionInfo() to keep track of the versions of software used to perform data processing so that the environment can be replicated if a package needs to be rebuilt far in the future.
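One lightweight way to keep such a record (a sketch of a convention, not a DataPackageR feature; the file name is arbitrary) is to write the session info out alongside the processed data:

```r
# record the packages and versions used during processing so the
# environment can be reconstructed later; "session_info.txt" is an
# arbitrary name chosen for this sketch
info_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), info_file)

# the record can be reviewed whenever the package needs to be rebuilt
head(readLines(info_file), 3)
```

Committing such a file with the data package source gives future rebuilds at least a fighting chance of recreating the original environment.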

Tools like packrat and Docker aim to solve these problems, and it is non-trivial. I would love to integrate these tools more closely with DataPackageR in the future.

Using a data package.

If you are a user of a data package:

Install a data package in the usual manner of R package installation.

  • From GitHub:
  • From the command line:
R CMD INSTALL TidyCars-0.1.0.tar.gz

Once the package is installed, we can load the newly built package and access documentation, vignettes, and use the DataVersion of the package in downstream analysis scripts.

library(TidyCars)
browseVignettes(package = "TidyCars")

Typing the above in the R console will pop up a browser window where you’ll see the available vignettes in your new TidyCars package.

Clicking the HTML link gives you access to the output of the processing script, rendered as a vignette. Careful work here will let you come back to your code and understand what you have done.

We can also view the data set documentation:


And we can use the assert_data_version() API to test the version of a data package in a downstream analysis that depends on the data.

data_version("TidyCars")
## [1] '0.1.0'

assert_data_version("TidyCars", version_string = "0.1.0", acceptable = "equal")
assert_data_version("TidyCars", version_string = "0.2.0", acceptable = "equal")
## Error in assert_data_version("TidyCars", version_string = "0.2.0", acceptable = "equal"): Found TidyCars 0.1.0 but == 0.2.0 is required.

The first assertion is true, the second is not, throwing an error. In downstream analyses that depend on a version of a data package, this is useful to ensure updated data don’t inadvertently change results, without the user being aware that something unexpected is going on.

A data package can be built into a package archive (.tar.gz) using the standard R CMD build process. The only difference is that the .rda files in /data won’t be re-generated, and the existing vignettes describing the processing won’t be rebuilt. This is useful when the processing of raw data sets is time consuming (like some biological data), or when raw data sets are too large to distribute conveniently.

To conclude

With DataPackageR I’ve tried to implement a straightforward workflow for building data packages, one that doesn’t get in the way (based on my own experience) of how people develop their data processing pipelines. The new APIs are limited and only minor changes need to be made to adapt existing code to the workflow.

With a data package in hand, data can be easily shared and distributed. In my view, the greatest benefit of building a data package is that it encourages us to use best practices, like documenting data sets, documenting code, writing unit tests, and using version control. Since we are very often our own worst enemies, these are all Good Things.

We have been eating our own dog food, so to speak.
We use data packages internally in the Vaccine Immunology Statistical Center to prepare data sets for analysis, publication, and for sharing with collaborators.
We often do so through the Fred Hutch GitHub Organization (though most data are private).

The RGLab has used data packages (though not built using DataPackageR) to share data sets together with publications:

  • Combinatorial Polyfunctionality Analysis of Single Cells: paper, data
  • Model-based Analysis of Single-cell Transcriptomics: paper, data

In the long run, packaging data saves us time and effort (of the sort expended trying to reproduce results of long ago and far away) by ensuring data processing is reproducible and properly tracked. It’s proved extremely useful to our group and I hope it’s useful to others as well.


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

Access the Internet Archive Advanced Search/Scrape API with wayback (+ links to a new vignette & pkgdown site)

Tue, 09/18/2018 - 00:27

(This article was first published on R –, and kindly contributed to R-bloggers)

The wayback package has had an update to more efficiently retrieve mementos and added support for working with the Internet Archive’s advanced search+scrape API.


The search/scrape interface lets you examine the IA collections and download what you are after (programmatically). The main function is ia_scrape() but you can also paginate through results with the helper functions provided.

To demonstrate, let’s peruse the IA NASA collection and then grab one of the images. First, we need to search the collection then choose a target URL to retrieve and finally download it. The identifier is the key element to ensure we can retrieve the information about a particular collection.

library(wayback)

nasa <- ia_scrape("collection:nasa", count = 100L)
tibble:::print.tbl_df(nasa)
## # A tibble: 100 x 3
##    identifier addeddate            title
##  1 00-042-154 2009-08-26T16:30:09Z International Space Station exhibit
##  2 00-042-32  2009-08-26T16:30:12Z Swamp to Space historical exhibit
##  3 00-042-43  2009-08-26T16:30:16Z Naval Meteorology and Oceanography Command …
##  4 00-042-56  2009-08-26T16:30:19Z Test Control Center exhibit
##  5 00-042-71  2009-08-26T16:30:21Z Space Shuttle Cockpit exhibit
##  6 00-042-94  2009-08-26T16:30:24Z RocKeTeria restaurant
##  7 00-050D-01 2009-08-26T16:30:26Z Swamp to Space exhibit
##  8 00-057D-01 2009-08-26T16:30:29Z Astro Camp 2000 Rocketry Exercise
##  9 00-062D-03 2009-08-26T16:30:32Z Launch Pad Tour Stop
## 10 00-068D-01 2009-08-26T16:30:34Z Lunar Lander Exhibit
## # ... with 90 more rows

(item <- ia_retrieve(nasa$identifier[1]))
## # A tibble: 6 x 4
##   file                       link last_mod          size
## 1 00-042-154.jpg                  06-Nov-2000 15:34 1.2M
## 2 00-042-154_archive.torrent      06-Jul-2018 11:14 1.8K
## 3 00-042-154_files.xml            06-Jul-2018 11:14 1.7K
## 4 00-042-154_meta.xml             03-Jun-2016 02:06 1.4K
## 5 00-042-154_thumb.jpg            26-Aug-2009 16:30 7.7K
## 6 __ia_thumb.jpg                  06-Jul-2018 11:14 26.6K

download.file(item$link[1], file.path("man/figures", item$file[1]))

I just happened to know this would take me to an image. You can add the media type to the result (along with a host of other fields) to help with programmatic filtering.

The API is still not sealed in stone, so you're encouraged to submit questions/suggestions.


The vignette is embedded below and frame-busted here. It covers a very helpful and practical use-case identified recently by an OP on StackOverflow.

There's also a new pkgdown-gen'd site for the package.

Issues & PRs welcome at your community coding site of choice.


To leave a comment for the author, please follow the link and comment on their blog. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Principal Component Momentum?

Mon, 09/17/2018 - 21:14

(This article was first published on R – QuantStrat TradeR, and kindly contributed to R-bloggers)

This post will investigate using Principal Components as part of a momentum strategy.

Recently, I ran across a post from David Varadi that I thought I’d investigate further and translate into code I can display explicitly (as David Varadi doesn’t). Of course, as David Varadi is a quantitative research director with whom I’ve done good work in the past, I find that investigating his ideas is worth the time spent.

So, here’s the basic idea: in an allegedly balanced universe containing both aggressive assets (e.g. equity asset class ETFs) and defensive assets (e.g. fixed income asset class ETFs), principal component analysis, a cornerstone of machine learning, should have some effectiveness at creating an effective portfolio.

I decided to put that idea to the test with the following algorithm:

Using the same assets that David Varadi does, I first use a rolling window (between 6 and 18 months) to create principal components, making sure that the SPY loading of the first PC is always positive (that is, if the loading for SPY is negative, multiply the first PC by -1, as that’s the PC we use). I then create two portfolios: one comprised of the normalized positive weights of the first PC, and one comprised of the negative half.

Next, every month, I use some momentum lookback period (1, 3, 6, 10, or 12 months) and invest in the portfolio that performed best over that period for the next month, then repeat.
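The steps above can be sketched in miniature with base R. Toy random returns and placeholder names are my own; `stats::prcomp()` stands in for the PCA step:

```r
set.seed(42)
# toy monthly returns for four assets; the first column plays the role of SPY
rets <- matrix(rnorm(48 * 4, 0.005, 0.04), ncol = 4,
               dimnames = list(NULL, c("SPY", "EFA", "IEF", "TLT")))

# step 1: first principal component over a rolling window (here, the last 12 months)
pc1 <- prcomp(tail(rets, 12))$rotation[, 1]
if (pc1["SPY"] < 0) pc1 <- -pc1   # force SPY's loading to be positive

# step 2: split the loadings into normalized "plus" and "minus" portfolios
wPlus  <- pmax(pc1, 0);  wPlus  <- wPlus  / (sum(wPlus)  + 1e-16)
wMinus <- pmax(-pc1, 0); wMinus <- wMinus / (sum(wMinus) + 1e-16)

# step 3: hold whichever portfolio had the better trailing cumulative return
look <- tail(rets, 3)                       # 3-month momentum lookback
cumRet <- function(w) prod(1 + look %*% w) - 1
chosen <- if (cumRet(wPlus) >= cumRet(wMinus)) "plus" else "minus"
```

The full implementation below does the same thing with real price data, xts objects, and proper rebalancing via PerformanceAnalytics.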

Here’s the source code to do that (for those who have difficulty following, I highly recommend James Picerno’s Quantitative Investment Portfolio Analytics in R book):

require(PerformanceAnalytics)
require(quantmod)
require(stats)
require(xts)

symbols <- c("SPY", "EFA", "EEM", "DBC", "HYG", "GLD", "IEF", "TLT")

# get free data from yahoo
rets <- list()
getSymbols(symbols, src = 'yahoo', from = '1990-12-31')
for(i in 1:length(symbols)) {
  returns <- Return.calculate(Ad(get(symbols[i])))
  colnames(returns) <- symbols[i]
  rets[[i]] <- returns
}
rets <- na.omit(, rets))

# 12 month rolling PC window, 3 month momentum window
pcPlusMinus <- function(rets, pcWindow = 12, momWindow = 3) {
  ep <- endpoints(rets)

  wtsPc1Plus <- NULL
  wtsPc1Minus <- NULL

  for(i in 1:(length(ep)-pcWindow)) {
    # get subset of returns
    returnSubset <- rets[(ep[i]+1):(ep[i+pcWindow])]

    # perform PCA, get first PC (i.e. pc1)
    pcs <- prcomp(returnSubset)
    firstPc <- pcs[[2]][,1]

    # make sure SPY always has a positive loading
    # otherwise, SPY and related assets may have negative loadings sometimes,
    # positive loadings other times, which creates a chaotic return series
    if(firstPc['SPY'] < 0) {
      firstPc <- firstPc * -1
    }

    # create weight vector for negative values of pc1
    wtsMinus <- firstPc * -1
    wtsMinus[wtsMinus < 0] <- 0
    wtsMinus <- wtsMinus/(sum(wtsMinus)+1e-16) # in case of zero weights
    wtsMinus <- xts(t(wtsMinus),
    wtsPc1Minus[[i]] <- wtsMinus

    # create weight vector for positive values of pc1
    wtsPlus <- firstPc
    wtsPlus[wtsPlus < 0] <- 0
    wtsPlus <- wtsPlus/(sum(wtsPlus)+1e-16)
    wtsPlus <- xts(t(wtsPlus),
    wtsPc1Plus[[i]] <- wtsPlus
  }

  # combine positive and negative PC1 weights
  wtsPc1Minus <-, wtsPc1Minus)
  wtsPc1Plus <-, wtsPc1Plus)

  # get returns of the PC portfolios
  pc1MinusRets <- Return.portfolio(R = rets, weights = wtsPc1Minus)
  pc1PlusRets <- Return.portfolio(R = rets, weights = wtsPc1Plus)

  # combine them
  combine <- na.omit(cbind(pc1PlusRets, pc1MinusRets))
  colnames(combine) <- c("PCplus", "PCminus")

  momEp <- endpoints(combine)
  momWts <- NULL
  for(i in 1:(length(momEp)-momWindow)){
    momSubset <- combine[(momEp[i]+1):(momEp[i+momWindow])]
    momentums <- Return.cumulative(momSubset)
    momWts[[i]] <- xts(momentums==max(momentums),
  }
  momWts <-, momWts)

  out <- Return.portfolio(R = combine, weights = momWts)
  colnames(out) <- paste("PCwin", pcWindow, "MomWin", momWindow, sep="_")
  return(list(out, wtsPc1Minus, wtsPc1Plus, combine))
}

pcWindows <- c(6, 9, 12, 15, 18)
momWindows <- c(1, 3, 6, 10, 12)

permutes <- expand.grid(pcWindows, momWindows)

stratStats <- function(rets) {
  stats <- rbind(table.AnnualizedReturns(rets), maxDrawdown(rets))
  stats[5,] <- stats[1,]/stats[4,]
  stats[6,] <- stats[1,]/UlcerIndex(rets)
  rownames(stats)[4] <- "Worst Drawdown"
  rownames(stats)[5] <- "Calmar Ratio"
  rownames(stats)[6] <- "Ulcer Performance Index"
  return(stats)
}

results <- NULL
for(i in 1:nrow(permutes)) {
  tmp <- pcPlusMinus(rets = rets, pcWindow = permutes$Var1[i], momWindow = permutes$Var2[i])
  results[[i]] <- tmp[[1]]
}
results <-, results)
stats <- stratStats(results)

After a cursory look at the results, it seems the performance is fairly miserable with my implementation, even by the standards of tactical asset allocation models (the good ones have a Calmar ratio and Sharpe ratio above 1).
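For context, the Calmar ratio used here is annualized return divided by worst drawdown. A self-contained base-R sketch with made-up monthly returns:

```r
# toy monthly return series
r <- c(0.02, -0.01, 0.03, -0.06, 0.01, 0.04, -0.02, 0.02, 0.01, -0.03, 0.05, 0.02)

equity <- cumprod(1 + r)                     # growth of $1
annRet <- prod(1 + r)^(12 / length(r)) - 1   # annualized (geometric) return
maxDD  <- max(1 - equity / cummax(equity))   # worst peak-to-trough drawdown
calmar <- annRet / maxDD
```

The stratStats() function in the code above computes the same ratio (plus the Ulcer Performance Index) via PerformanceAnalytics rather than by hand.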

Here are histograms of the Calmar and Sharpe ratios.

These values are generally too low for my liking. Here’s a screenshot of the table of all 25 results.

While my strategy for choosing which portfolio to hold is different from David Varadi’s (momentum instead of whether or not the aggressive portfolio is above its 200-day moving average), there are numerous studies that show these two methods are closely related, yet the results feel starkly different from (and worse than) those on his site.

I’d certainly be willing to entertain suggestions as to how to improve the process, which will hopefully create some more meaningful results. I also know that AllocateSmartly expressed interest in implementing something along these lines for their estimable library of TAA strategies, so I thought I’d try to do it and see what results I’d find, which in this case, aren’t too promising.

Thanks for reading.

NOTE: I am networking, and actively seeking a position related to my skill set in either Philadelphia, New York City, or remotely. If you know of a position which may benefit from my skill set, feel free to let me know. You can reach me on my LinkedIn profile here, or email me.


To leave a comment for the author, please follow the link and comment on their blog: R – QuantStrat TradeR.

Break Down: model explanations with interactions and DALEX in the Bay Area

Mon, 09/17/2018 - 20:50

(This article was first published on English –, and kindly contributed to R-bloggers)

The breakDown package explains predictions from black-box models such as random forests, xgboost, SVMs, or neural networks (it works for lm and glm as well). As a result, you get a decomposition of the model prediction that can be attributed to particular variables.

Version 0.3 has a new function, break_down, which identifies pairwise interactions of variables. So if the model is not additive, then instead of seeing effects of single variables you will see effects of interactions.
It’s easy to use this function. See the example below.
HR is an artificial dataset. The break_down function correctly identifies the interaction between gender and age. Find more examples in the documentation.

# Create a model for classification
library("DALEX")
library("randomForest")
model <- randomForest(status ~ . , data = HR)

# Create a DALEX explainer
explainer_rf_fired <- explain(model,
                 data = HR,
                 y = HR$status == "fired",
                 predict_function = function(m,x) predict(m,x, type = "prob")[,1])

# Calculate variable attributions
new_observation <- HRTest[1,]
library("breakDown")
bd_rf <- break_down(explainer_rf_fired,
                 new_observation,
                 keep_distributions = TRUE)

bd_rf
#>                         contribution
#> (Intercept)                    0.386
#> * hours = 42                   0.231
#> * salary = 2                  -0.216
#> * age:gender = 58:male         0.397
#> * evaluation = 2              -0.019
#> final_prognosis                0.778
#> baseline:  0

plot(bd_rf)

The figure below shows that a single prediction was decomposed into 4 parts. One of them is related to the interaction between age and gender.
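A quick sanity check on the decomposition above: the contributions are additive, so the intercept plus the variable and interaction effects reproduces the final prognosis, up to rounding of the printed values:

```r
# contributions copied from the printed break_down output above
intercept     <- 0.386
contributions <- c(hours = 0.231, salary = -0.216,
                   `age:gender` = 0.397, evaluation = -0.019)

# additivity: intercept + sum of effects should match final_prognosis (0.778)
final_prognosis <- intercept + sum(contributions)
final_prognosis
```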

breakDown is part of the DALEXverse, a collection of tools for visualisation, exploration, and explanation of complex machine learning models.

Until the end of September I am visiting UC Davis and UC Berkeley, and I’m happy to talk about DALEX explainers, XAI, and related stuff.
So, if you want to talk about interpretability of complex ML models, just let me know.

Yes, it’s part of the DALEX invasion.
Thanks to the H2020 project RENOIR.


To leave a comment for the author, please follow the link and comment on their blog: English –.

3D Correspondence Analysis Plots in R Using Plotly

Mon, 09/17/2018 - 18:00

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

Explore 3D Correspondence Analysis!

Back in the “olden days” of the 1970s it was apparently not unknown for statisticians to create 3D visualizations using tinkertoys. For some inexplicable reason, the advent of the PC led to a decline in this practice, with the result that 3D visualizations are now much less common. However, it is possible to create 3D-like visualizations digitally. Although they only show two dimensions at any one time, the user can understand the third dimension by interacting with the visualization. In this post I use Plotly’s excellent plotting package to create an interactive, 3D visualization of a correspondence analysis.

The data

The data table that I use in this example shows perceptions of different cola brands. It is a good example for correspondence analysis, as the table is relatively large and correspondence analysis is thus useful for providing a summary.

The traditional 2D correspondence analysis map

The standard 2D “map” from correspondence analysis is shown below. I’ve created this using the Displayr/flipDimensionReduction package on GitHub, which creates maps that automatically arrange the labels so that they do not overlap. I created it using the following code (if you don’t already have this package, you will first need to install it from GitHub).

library(flipDimensionReduction)

# my.ca and my.table are placeholder names; substitute your own objects
my.ca <- CorrespondenceAnalysis(my.table,
                                normalization = "Column principal (scaled)")

This visualization explains 86% of the variance from the correspondence analysis. This leads to the question: is the 14% that is not explained interesting?
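To make that variance question concrete: in correspondence analysis, the share of inertia explained by the first k dimensions comes from the squared singular values of the standardized residual matrix. A base-R sketch on a made-up contingency table:

```r
# toy brand x attribute contingency table (made-up counts)
tab <- matrix(c(20, 5, 8, 3, 12, 10, 4, 9, 6), nrow = 3)

P <- tab / sum(tab)                  # correspondence matrix
rmass <- rowSums(P)                  # row masses
cmass <- colSums(P)                  # column masses

# standardized residuals: D_r^{-1/2} (P - r c') D_c^{-1/2}
S <- diag(1/sqrt(rmass)) %*% (P - rmass %o% cmass) %*% diag(1/sqrt(cmass))

inertia   <- svd(S)$d^2                      # principal inertias per dimension
explained <- cumsum(inertia) / sum(inertia)  # cumulative share explained
```

For a 3x3 table the first two dimensions capture everything; for a larger table like the cola data, `explained` is exactly the kind of number behind the 86% quoted above.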

Creating the 3D interactive visualization in Plotly

The standard correspondence analysis visualization plots the first two dimensions. The code below uses plotly to create a 3D plot of the first three dimensions. In theory you can encode further dimensions (e.g., using color, font size, markers, and the like), but I’ve never been smart enough to interpret them myself! You can readily repurpose this code for your own correspondence analysis by replacing the correspondence analysis object with your own. If you use a package other than flipDimensionReduction to create the correspondence analysis, you will also need to work out how to extract the coordinates.

rc = my.ca$row.coordinates     # my.ca is a placeholder for your CorrespondenceAnalysis object
cc = my.ca$column.coordinates

library(plotly)

p = plot_ly()
p = add_trace(p, x = rc[,1], y = rc[,2], z = rc[,3],
              mode = 'text', text = rownames(rc),
              textfont = list(color = "red"), showlegend = FALSE)
p = add_trace(p, x = cc[,1], y = cc[,2], z = cc[,3],
              mode = "text", text = rownames(cc),
              textfont = list(color = "blue"), showlegend = FALSE)
p <- config(p, displayModeBar = FALSE)
p <- layout(p, scene = list(xaxis = list(title = colnames(rc)[1]),
                            yaxis = list(title = colnames(rc)[2]),
                            zaxis = list(title = colnames(rc)[3]),
                            aspectmode = "data"),
            margin = list(l = 0, r = 0, b = 0, t = 0))
p$sizingPolicy$browser$padding <- 0
my.3d.plot = p

If you are reading this on a desktop, you will be able to interact with the visualization below.

Explore 3D Correspondence Analysis!

In addition to creating it in Plotly, I’ve published it online using Displayr, which is what allows you to interact with it even though it is in a web page. It’s free to do this; you just click Insert > R Output, paste in the R code, press CALCULATE, and then either choose Export > Public Web Page, or Export > Embed to embed it in another document or page. If you click here you will go into a Displayr document containing the code used to create the analyses and visualizations in this chart. You can modify it however you like to re-use for your own analyses.

Make your own 3D correspondence analysis, or if you want to explore more, get started with Displayr!


To leave a comment for the author, please follow the link and comment on their blog: R – Displayr.

Becoming a Data Scientist (Transcript)

Mon, 09/17/2018 - 15:53

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Here is a link to the podcast.

Introducing Renée Teate

Hugo: Hi there, Renée, and welcome to DataFramed.

Renée: Hi Hugo. Great to be here.

Hugo: It’s great to have you on the show and I’m really excited to talk about all the things we’re gonna talk about today, the podcast that you worked on for so long, the idea of becoming a data scientist and your journey and process there, but before that I’d like to find out a bit about you. Maybe you can tell us a bit about what you’re known for in the data community.

Renée: Sure. Well I think I’m known for the podcast that you mentioned. It’s called Becoming a Data Scientist. I interview people about how they got to where they are in their data science journeys and whether they consider themselves to be a data scientist. I plan to start that back up soon. I think that’s what I originally kind of got known for but a lot of people also follow me on Twitter that may or may not have been an original podcast listener. I have a Twitter account called BecomingDataSci and my name on there is Data Science Renée. I try to help people that are transitioning into a data science career to find learning resources and inspiration. I’ve built a site called, which collects learning resources and people can go on there and rate them. I hope to eventually make that into learning paths and things like that. I have a Twitter account called NewDataSciJobs where I share jobs that require less than three years of experience and I try to share articles about learning data science and getting into this field to help people transition in.

Renée: On top of that, I share my own data science challenges and achievements and try to encourage and inspire others so they can kind of watch what I do. I’m really happy, especially in the last year I feel, to see a wide variety of people with different educational backgrounds that want to enter this field, so I intend to help them become data scientists too because I think the broader the background of people in this field, the better it’s gonna get. I guess that’s what I’m known for, the podcast and Twitter account for the most part.

Hugo: Sure. I think a wonderful through line there that of course we’re very aligned with at DataCamp is lowering the barrier to entry for people who want to engage with analytics and data science. One of your wonderful approaches, I think, is that on the podcast you’ll ask the people you have on not only about their journey but whether they consider themselves to be data scientists, kind of what this term means, and how their practices apply to it. It kind of demystifies data science as a whole, which can be a very unapproachable term, I think, with a lot of gatekeepers around it as well. I think the work you do is very similar to how we think about our approach at DataCamp, so that’s really cool.

Renée: Great. I definitely aim for that.

How did you get into data science?

Hugo: How did you get into data science initially?

Renée: This is my favorite question because this is what we talk about the whole time on my podcast, so hopefully I don’t run too long but I will give a detailed answer. I’ve worked with data my whole career. You might call me a data generalist. Right out of college, I went to James Madison University in Harrisonburg, Virginia, where I still live, and I majored in something called integrated science and technology. It was a very broad major. It gave more breadth than depth in a lot of topics. We covered everything from biotech to manufacturing and engineering to programming, but you kind of get a taste of everything and find out what you like and don’t like. It had a lot of hands-on real-world projects and one thing we learned in the programming courses in the ISAT program was relational database design. This is something I had never done before then but when I was in the class I realized hey I’m pretty good at this. I get this. It makes sense to me. Right out of college, I started doing that type of work. I was designing databases, building data-driven websites, and designing forms and reports to interact with the data. I did a lot of SQL and helped design a reporting data warehouse and building interactive reports where people can interact with the data and I did some analysis on that.

Renée: I wanted to take my career to the next level beyond that. At the time, I thought that a master’s in systems engineering would fill in a lot of the gaps in my knowledge, since in my undergrad program I didn’t have a lot of depth in math, for instance, or coding. I just had some introductory classes. This program (it was at the University of Virginia) had simulation and modeling courses, optimization, statistics, and at the time I was kind of afraid of the math. I had to take linear algebra at the community college in a summer course to even qualify to apply for this master’s program. This was eight years after undergrad. I should have known that it was gonna be more math intensive than I originally thought, but I found out that the title of each of these courses in the systems engineering program is kind of like a code for another type of math. It was very math intensive, but I needed that. That’s something that I wouldn’t have learned as much on my own if I did all self-directed learning.

Hugo: I have a question around that which of course I get a lot as an educator, which is to be an effective data analyst or data scientist, how much linear algebra do people need to know?

Renée: I think it’s good to understand the basics. It gives you a sense of what’s going on behind the scenes of those algorithms, to understand how data is being transformed and processed. However, if you’re really going to be an applied data scientist and not so much a machine learning researcher, you don’t have to really know all those intricacies. I’m glad I got a background in it so I understand how these things work, but I don’t use those skills in my day-to-day work. There are packages that abstract all that away so I don’t have to be doing those types of calculations on a daily basis as a data scientist. I would say it’s good to get a grasp of it and feel like you understand the concepts, but you don’t need to have a mastery of the actual computations yourself. I mean, that’s what computers are for. They can do a lot of that for you.

Hugo: Yeah. I agree completely, and I do think there is a lot of anxiety around learning these types of things, linear algebra and I suppose multivariate calculus in particular. I do also encourage people to push through a bit and persevere a bit, because a big part of the challenge is the language and notation. A lot of the concepts aren’t necessarily that tough, but when you’re writing a whole bunch of matrices and that type of stuff, it gets pretty gnarly pretty quickly.

Renée: Yeah. I still like shudder when I see certain depictions of the … Like you said with multi variable calculus and calculus that’s done in a matrix. It just looks so overwhelming and the notation still gets me so I feel that.

Hugo: Yeah.

Renée: But I’m glad I understand the concepts behind it, even if I still shudder every time I see those.

Hugo: Yeah and you can have some crazy notation that really what it is referring to is the directional flow along a surface or something like that, like something that intuitively is quite easy to grasp but we’ve got this heavy archaic notation around it.

Renée: Yeah and it’s not even consistent. I was in a program that had like professors from different departments at different universities and my husband is a physicist and there was a course where I was just really struggling with this particular type of computation and the notation and he looked at it and he was like you just learned this last semester. I was like I’ve never seen this before. He said no it’s the same concept, it’s just different notation. That’s when I really started to understand like mathematicians and engineers for instance might use different notation for the same thing. It gets complicated. I do think if you’re gonna become like a machine learning researcher or go into like a PhD program or you’re developing things around the cutting edge of data science and really pushing forward the field and building algorithms that other people will use, then you need to really understand that stuff but if you’re mostly applying algorithms that are already built, you don’t have to get as in depth. For statistics I do think you really need a solid statistical foundation. I would kind of say the opposite. Everybody that does data science really needs to understand basic statistics well.

Hugo: Great. So what then happened in your journey while or after you did this program?

Renée: Yeah. While I was in the program, the Data Science Institute got started at UVA. I had been hearing about data science everywhere and I kind of wanted to switch into that program but I couldn’t without completely starting over. They kind of moved as a cohort through their program, so I found out that I could take a machine learning course as an elective and so I started taking that just because I wanted to know what it’s about and how close it is to what I’m already doing. It felt like my whole career up to that point was kind of leading towards data science and I had never heard of it. In this machine learning class, it started with a lot of the math and it moved really fast and I’ll be honest, I bombed that midterm. I really thought I was gonna fail out of the course but I decided to keep going because the first half of the course was the math and the second half of the course was the coding and applied part of it, which was what I was looking forward to, so I thought well even if I get a bad grade I want to learn what I’m supposed to learn in this course so let me stick with it.

Renée: Like you said, with the abstract symbols and things I was having a hard time even understanding the textbook but then the last part of the course we had been building these machine learning algorithms from scratch. Oh and by the way all the examples were in C++ but the professor let us use whatever coding language we wanted to, so I started picking up Python at that point. I didn’t have a very good grasp of C++. I had mostly done visual basic .NET up until that point and SQL and I didn’t know Python at all but I figured that was my chance to learn it so I kind of learned Python as I went as well, which is probably part of the reason I struggled in the class. By the end we had this project. By then I kind of got Python and I kind of got what was going on with machine learning. I was going to school part time while I worked, so I asked my manager can I use this data that we use at work to apply it to this project that I’m doing in school. He said yes that was fine.

Renée: So what I did, I was working in the advancement division at JMU which is basically the fundraising arm of the university. For my project, I predicted which alumni were most likely to become donors in the next fiscal year. The professor loved it and maybe even mentioned this is something I could publish in the future. I guess that project outweighed my performance in the math portion of the course because I ended up getting an A in that class, which just blew my mind.

Hugo: That’s incredible.

Renée: I was like okay now that’s kind of confirmation that this is something I should be doing.

Hugo: Absolutely. I just wanna flag that before you go on, that you’ve actually made an incredible point there which is that you didn’t do a project kind of in a vacuum essentially. You were working on data that was meaningful for you, meaningful to your employer, and actually gave some insight into something important to a bunch of stakeholders.

Renée: Yeah, and in class we had pre-prepared data sets and they were all just lists of numbers. They weren’t even kind of related to the real world at all. The professor chose those data sets because the answer would come out a certain way, and so diving into something that was unknown, that no one had really looked at before, at least in our university, and finding some insights that I could share and actually make a real-world difference, that tied it all together for me.

Hugo: In a learning experience as well, working on something that means something to you and interests you is so important.

Renée: Oh absolutely. I always encourage people to find datasets that are interesting to them and use them throughout their learning journey because it keeps you interested when things get tough and also you’ll understand the output better if it’s something that you’ve had a background in or even interested in. If you’re into sports, use a sports data set because you’ll have a better sense of whether the output of your model even makes sense in context of sports.

Hugo: I always say if you, a lot of people wear fitness trackers these days and they can get their own data with respect to exercise and sleeping patterns and that type of stuff. They can quickly do a brief analysis or visualization of stuff that’s happening physiologically with them.

Renée: Yeah. That’s an awesome idea and definitely something I would encourage.

Hugo: Awesome. So what happened next in your journey?

Renée: For my last class, so most of my program that I did in grad school was online. It was synchronous so I was actually watching lectures over the internet that were live and there was a class there but for the last semester I commuted to campus which was an hour for me. I started listening to a lot of data science podcasts because I knew at that point I’m interested in this thing. Back then I was listening to Partially Derivative and Talking Machines and the O’Reilly Data Show, Linear Digressions, Data Skeptic, so I was just absorbing all of this data science information and I knew that this was what I wanted to do. As soon as I graduated, I started diving into books about data science and teaching myself what I needed to know to get a job in this field and move on from, at the time I was a data analyst and I wanted to move into being a data scientist. That’s what I did next.

Renée: Then I applied to a bunch of different jobs. At the time I was just getting comfortable with data science, so I didn’t necessarily want a data scientist job, but I wanted to make sure it was a job that was moving in that direction, because the job I was in wasn’t giving me a lot of opportunities to really exercise these new skills and do machine learning on the job. I knew I was good with designing analytical reports. I knew I was good with SQL. I had this new master’s degree in systems engineering, but I wanted to grow into a data science role. I started applying to a bunch of different jobs that partially involved data science but had components that I knew I already had the skills to provide value in. I didn’t get any of the first several I applied to, but I was starting to learn by doing those interviews what they were gonna ask and what gaps I had in my knowledge, so I could go back and learn more.

Renée: At the time, there were two different startups, one on each side of the country, that apparently needed that type of generalist that could do both the backend data engineering and SQL stuff and move into the predictive modeling side. I got two offers at the same time. They were both for remote roles that were like a combination of data analytics and entry level data science. I didn’t have to do whiteboard interviews or coding interviews for either of them which was nice because that part, I don’t think I was as good at the time, but they needed somebody with my background and my experience with databases and someone that was good at communicating with the stakeholders. I think that helped me stand out and I think we’re gonna talk a little bit more about that later.

Hugo: Absolutely.

Renée: But one of those two job offers was with people I had worked with before. I worked at Rosetta Stone as a data analyst and a lot of the people at this startup had come from Rosetta Stone. I was more comfortable with that one and took that one and have been able to build my data science and machine learning skills on the job. That company is called HelioCampus. We work with university data and I can tell you more about that if we’re interested, but I’ve been in that role for about two years now as a data scientist.

Hugo: Fantastic. That’s telling that the project you did did involve alumni data initially, when you were first learning.

Renée: Yeah. At HelioCampus we’ve kind of … It’s extended me into a new domain. It’s still university data but we work a lot with the student success data and admissions and things like that. I guess I’ll give a little brief overview of the company. At universities they have databases that are like all kinds of data that you might not even think of when you’re applying and enrolling at this university. There would be a system for admissions and applications. There’s usually a separate system for enrollment and courses and faculty and then there’s another system that they have for payroll and financials and then they’ll have another system for the fundraising and alumni information. They have all these databases across campus and the leaders want kind of a big picture look at the students’ trajectory through this whole experience of applying and then going to college and becoming alumni.

Renée: To get metrics on that whole system, you have to combine that data. We combine it into a data warehouse and we have reports in Tableau that point at that data. We have some canned reports, and then my job is to work with the end users to do analysis that's not already built, to answer questions they have about the students, and to do some predictive modeling. One example is for the admissions team: we'll take a look at all the students that have been admitted to a university and try to predict how many of them will enroll, or which ones might be on the borderline, the type of students that sometimes enroll and sometimes don't. They might need some extra outreach in order for the school to get their attention, or students that need additional financial aid, for instance. We've helped them get some insight by doing predictive modeling into what their student body looks like, what type of students they can expect to come to their university, and what trend we expect in the future for their enrollment. That's just one example of many different aspects of what we do with the universities at HelioCampus, but that's the kind of work I'm doing now.

Hugo: That sounds like very interesting and fulfilling work, particularly with your kind of deep interest and mission as an educator and investing in learners.

Renée: Yeah definitely.

What questions do aspiring data scientists need to think about?

Hugo: It was fantastic to find out once again about your journey to becoming a data scientist, and something that you of course insist on, through your podcast and a lot of different media, is that this is only one journey: there are a lot of different paths to becoming a data scientist, and there isn't a one-size-fits-all approach. Before actually deciding on a path, people need to figure out both where they are and where they need to go, and connect those points somehow. So: what I'd like to know is, what questions do aspiring data scientists need to think about when figuring out where they're starting from on their journey?

Renée: Yeah definitely. That's actually why I started my podcast, because I was listening to all these other podcasts showing what cool stuff data scientists were doing, but none of them focused on how did they get there? What did they do? I started asking questions, and one of the things I realized is that no matter which educational background or career background you have, you have to assess your starting point. The kind of questions you need to ask to map out your data science learning path are: have you coded before? What languages have you coded in? Data scientists typically learn R or Python, and often need to know SQL. How comfortable are you with the mathematics and statistics, and do you need to brush up on those things and get some refreshers? Maybe you need to take it to the next level from where you're at? Have you ever presented a report based on data? Have you done an analysis in a professional setting before? Have you ever answered questions with data? These are like the basics that you need.

Renée: Then, you're gonna probably be working in a particular domain, so within that field, do you know the lingo? Do you know what kind of data-related career paths there are in that domain? How might you focus your data science learning to target one of those career paths? You might want to talk to a data scientist or analyst in that domain and get a sense of the common questions and the state of the art: what problems are they working on and what are they asking, so you get that language. It's kind of this baseline of all the different parts of those common data science Venn diagrams that you see: how many of those pieces do you still need to work on to fill in? You're just assessing your starting point, and then next you'll look at where you wanna go so that you know how to map out that learning path.

Data Science Profiles

Hugo: Yeah. So to recap, essentially we have coding chops, whether you can program, what languages, comfort with maths and stats, then communication skills and actually presenting I was gonna say data-based reports but I really mean reports based on data and then domain knowledge. I think these are definitely very important aspects of your own practice to analyze when figuring out where you’re starting from and then of course, as we both said, you need to have an idea of where you wanna end up. This may be a relatively amorphous, changing, vague notion but what are the typical data science profiles that we’ve seen emerge that people can end up as?

Renée: Yeah. As you mentioned, data science can mean a whole lot of things. I've noticed that there seem to be these groupings of specialties within data science. There's like an analyst type of data scientist: these are people that are usually working with end users or leaders or other people in the business. They're understanding the kind of questions that can be asked, figuring out how to convert those questions into data questions, determining "do you have the data available to answer those questions?", doing the analysis, presenting the results, and probably developing data visualizations for those kinds of things. There are the engineer types of data scientists that are doing a lot of the backend work: the coding, working with databases and data warehouses, probably doing some of the feature engineering, working with big data systems and technologies that can handle massive data sets, building those data pipelines that support the analysis.

Renée: Then there’s what I mentioned earlier, the researcher type of data scientist: they’re improving those cutting edge algorithms and developing new tools and techniques, so that’s a different focus of data science. I’ll say that most people end up doing some combination of these things but you end up specializing either in like the analysis part or the engineering part or the research part. In my current role, I do a lot of the back-end engineering stuff because I have that background but also mostly focusing on the analysis tasks and communicating with people at the universities, the institutional researchers and decision makers that are gonna use the results of what it is that I’m doing.

What paths should individuals take?

Hugo: Yeah great. We’ve identified the three archetypes, the analyst, engineer, and researcher as end points or at least career paths. Knowing kind of the ways we need to think about where we are and knowing where we can end up, what are paths that you would recommend? What do recommended paths look like essentially?

Renée: Yeah I’m hoping to formalize this more in the future with the information I’m gathering at Data Sci Guide but it really depends on the individual. That starting point that you assessed, the ending point of where you want to end up at, and what are you comfortable teaching yourself or taking courses in, learning online, deciding if you need to go back to school. I do think it’s a myth that you need a PhD to be a data scientist. I don’t have one. A lot of data scientists I know don’t have one. I would say go back to school if there’s something like there was for math for me that you would be uncomfortable teaching yourself and you really need someone else to help you understand like the fundamental concepts there. Talk to someone that has a similar background as you and has become a data scientist or find people on Twitter that seem to be following paths that you like and you want to follow that.

Renée: Then do that project-based learning like you talked about. Find a data set that has the information that you're interested in, whether that's sports statistics, or political data, or geospatial imagery or medical data or entertainment data. There are so many different types of data out there that you can find something that's really interesting to you. Ask a question that you can answer with the data and then learn whatever techniques you need to learn in order to answer that question. I think project-directed learning is really valuable, but that exact path and what resources you use, I have a really hard time recommending any one thing because different things work for different people, though I would recommend you keep trying different things until you find out what works for you. Don't get discouraged if you pick up a book that a lot of people say is popular and great and you don't really get it and it's not sinking in for you. Just try something else. Don't give up and say oh, I'm not cut out for this because this popular book doesn't make sense to me.

Hugo: Yeah. There’s a lot of great advice in there. Something I haven’t thought about a lot beforehand is talking to someone that has a similar background, essentially finding people like you. I think this is really cool because after you’ve done the work of identifying where you are and where you wanna go or where you’d like to be in whatever time frame you’re thinking, I think it’s easy to forget or to think that there aren’t people like you out there and that you’re alone in this journey, particularly in a field that’s moving so quickly so to find people at different points in their career who are like you, that type of community to advise or be a mentor or a mentee later on, these types of things, is an incredible idea.

Renée: Yeah. I think another thing that I just thought of that ends up being difficult is just orienting to the terminology. Even when you're out there looking for someone like you, there are a lot of weird words that are used in data science that can be confusing at first, and you don't really know, is that person doing what I think I wanna do? I have an article on my blog about how I used Twitter to do this. Podcasts like yours are great for that, just hearing people talk about data science and learning what kind of things data scientists have to think about. When I was ready to move into this career path I got this book called Doing Data Science by Cathy O'Neil and Rachel Schutt. That was great for me in terms of getting an overview of the big picture of what this stuff is, what I need to learn, and some of the basic terms, and it pointed me to other resources to learn from.

Renée: Yeah, just orienting to how people even talk and what … what matters in data science, and maybe there are things that you actually know already but they're called something else by data scientists. Data science is kind of a combination of fields that have already existed for a while. Yeah, just learning that terminology and listening to data scientists and watching them on Twitter and reading articles to figure out what you don't know yet is an important first step.

Specific Learning Tasks for Beginners

Hugo: In terms of this journey of becoming a data scientist, can you suggest any learning tasks for beginners?

Renée: Yeah. I would say build a report. Like you were saying, maybe use your own data from a Fitbit or something like that. Just explore a dataset and do some basic statistical summaries and then practice communicating those results. As you learn, you're gonna be using different tools and techniques, but you wanna make sure that the outcome is always understandable, so see if you can bridge that gap as you go. Actually, I think when you're learning is a great time to do this, because that's when it's fresh and new to you as well, so you can bridge that gap between the technical analysis and using that information to make decisions, and talk to people that are less technical to get the point across. Constantly blogging is a great way to do this. Talking to friends or people in your field is a good way to do this, just explaining the analysis you did, but in a way that makes people comfortable that you know what you're talking about and then makes that information usable without getting into too much of the nitty gritty of the statistics behind it.
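[Editor's note: a minimal sketch of this kind of starter exercise, in Python using only the standard library. The step counts here are invented stand-ins for the sort of personal data a fitness tracker might export, not real data from the episode.]

```python
import statistics

# Hypothetical daily step counts, a stand-in for data exported from a fitness tracker
steps = [8200, 10450, 6300, 12100, 9400, 4100, 11050]

# Basic statistical summaries: the raw material for a short, readable report
mean_steps = statistics.mean(steps)
median_steps = statistics.median(steps)
spread = statistics.stdev(steps)

print(f"Mean daily steps: {mean_steps:.0f}")      # central tendency
print(f"Median daily steps: {median_steps:.0f}")  # robust to unusual days
print(f"Day-to-day spread: {spread:.0f}")         # variability across the week
```

The communication practice Renée describes is the next step: turning numbers like these into a sentence a non-technical reader can act on.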

Hugo: For sure. I do think working on datasets that are relevant to you is so important. The Titanic and iris datasets don't count, even if you think they're relevant to you.

Hugo: We need to move away. I think you dispelled very importantly the myth that you need a PhD to do this type of stuff. I’m wondering what other potential pitfalls or warnings you have for people along the way on their journey.

Renée: I think there’s some misconceptions about how much you need to learn. A pitfall is that it’s really easy to get discouraged when you’re learning. There’s so many topics under this umbrella of data science that you can easily get really overwhelmed and not know where to go, especially with self-directed learning. You have to kind of balance learning enough to qualify for the type of job you want but then not over planning it or overdoing it to the point where you’re starting to feel totally off track and psyching yourself out and feeling like you’re never gonna make it.

Renée: In a talk I gave, I talked about it like you’re planning a trip. You could plan it out turn by turn and print out the directions and know exactly where you’re gonna turn and what it’s gonna look like at each of those turns, but you still wanna have your GPS handy because if you run into unexpected traffic or road closings you gotta route around that. At some point you’re gonna feel lost in your learning or like you’ve totally hit a roadblock but instead of giving up you might just need to go back and find other resources to get you more comfortable with the topic before you move forward again or decide do I really even need to learn this? Maybe you can skip that part and come back later when you have a better understanding. Instead of just getting stuck and waiting for things to kind of clear up in front of you just be prepared to reroute. There’s a whole lot of different paths to a data science career and just be prepared to change course.

Renée: Also I think a lot of people look at those terrible job postings that are like a wish list of everything that company could ever want a data scientist to be able to do and they’re basically describing a whole data science team in one job posting. People think that they need to learn all of those things in order to get that job so I would say no. Learn a few key things really well. Practice applying that knowledge you have to real world problems so you have experience like overcoming challenges that you’re gonna encounter in a real job and that will also help you have a story to tell in your interviews of how you overcame trouble and ended up having usable results in the end. I guess what I’m trying to say is don’t derail yourself and don’t feel like you have to learn everything you’ve ever heard of in data science in order to be a data scientist. None of us know how to do everything. You just have to know enough of the basics that you feel solid in that understanding and confident that you could pick up other tools and techniques as you need them. I would say learn the basics and then learn a couple specialty items that might set you apart or are particular to the field that you’re trying to get into. Also those communication skills are really important too, not just the tools and techniques.

Hugo: Absolutely. To build on that, something that you hinted at earlier is get out there and do some job interviews as well to find out what the market is like and what interviewers want and ask them questions to figure out what gaps you may have as opposed to learning in the abstract what you think may be needed out in the job market.

Renée: Yeah. It can be discouraging not to get a job, but I remember once I did get a data science job, looking back and saying all those ones that I didn't get weren't right for me anyway, so why should I feel bad about not getting them? I wasn't right for the job or the company wasn't right for me, and once I found one that was the right fit, I felt good about it, and I like my job. Looking back, I realize there are just times when it really gets frustrating or depressing if you keep getting turned down, but there are just so many different kinds of data science jobs out there. I think everybody can find one that matches their skills, even though it might take a while.

Hugo: Yeah, and I do think it's incredibly discouraging and horrifying to not get a bunch of jobs in a row. Advice I give, which I admittedly find difficult to take myself, is that you only need one hit. You're looking for one hit out of a bunch of opportunities, and the ones that don't work out can be really incredible learning experiences as well. That doesn't make it any less brutal to be turned down.

Renée: Yeah. It’s not until after the fact that you look back and realize like how much you learned and how valuable those rejections were.

Hugo: Yeah. Exactly. Talking about what employers are looking for, I think one thing that we can forget about when thinking about data science in the abstract is that a lot of the time it's used to solve business questions. You have a great slide that demonstrates how data analysis and science can be used essentially as an intermediary step to get from a business question to a business answer, so this movement from a business question to a business answer is factored through data science. I'm wondering how key this concept is to your understanding of data science as a whole.

Renée: Yeah I created that for one of my first data science talks in order to illustrate what I think the data analysis process is. I got such good feedback on it and people really like it so I go back to it a lot now. If for anyone that hasn’t seen it, it has four little phrases with arrows between them. It starts with business question and goes to data question and then to data answer and then to business answer. I’ll go through each of those.

Renée: For the business question, I don't necessarily mean like a sales and marketing kind of business, but a domain question, something that a decision maker in your particular field or business might ask. Your job as an analyst is to convert that into a data question. What data is required in order to answer it? Do we have it available? What related questions might we have to answer first to get to that one? What type of analysis needs to be done to get us to a usable answer? Then you have to do the analysis, so that's the data answer piece. The type of analysis will depend on what kind of field you're in, your role and your skills, and what data is available, so the type of analysis differs, but basically to turn that data question into a data answer, you're doing analysis.

Renée: Then you have to take the results of that and turn them into a business answer. There are very few people out there that will want to hear your data answer. You have to be able to communicate it in terms that a non-data scientist can understand, so that they know what the data is telling them and can use that information to make a business decision. You have to be able to convey statistical results and uncertainty in business terms, and explain what your analysis means and does not mean so it's not misused. When we talk about building a report, in the real world the end result is usually not some sort of statistical readout with model evaluation metrics. It's a presentation of the results that is clear and usable by people that are not data scientists.

Hugo: Absolutely, and I do think keeping in mind that we're always attempting to answer business questions or develop business insights in this context is incredibly important. I wanna shift slightly. We have a lot of aspiring data scientists and learners out there. I'm wondering, what's your take on where people can learn, particular places where people can learn the skills and knowledge necessary to become a data scientist?

Renée: Well like I said I have a hard time giving specific recommendations because it’s so personal but I’ve heard great things about DataCamp of course. It’s actually the highest rated course system on DataSciGuide, so people that use DataCamp seem to really love it.

Hugo: That's great. Personally, I'm a huge fan of DataCamp as well. I don't know whether there's any bias involved here.

Renée: I'm not saying that just to suck up. It's really… people love it. Also there's Dataquest. There's Khan Academy for some of those basic skills. There are lots of books out there. People tend to really like the O'Reilly books, and there are some other favorites. Again, I hesitate to give specific recommendations just because they vary so much. People can tweet me if you're looking for a certain resource that will get you started from where you're at, and usually I retweet that and lots of people that follow me will help answer. It's really kind of a personalized answer, but I'll just say there are a ton of resources and it's easy to get overwhelmed by them, so don't be afraid to ask to find what might be best for you. And if someone recommends something and you really don't like it, don't feel bad about that either. Just move on to the next thing.

Renée: So yeah, I mean my site, Data Sci Guide, I'm trying to collect those reviews from data science learners so we can get a sense of what you need to know before you use a resource, because that tripped me up a lot when I was learning: there weren't clear prerequisites for certain resources, and I would start out really excited, like yeah, I'm getting it, and then five lessons in be totally overwhelmed and wanting to give up. I think that's dangerous. Yeah, talk to people that are just ahead of you on the learning path, maybe, and find out what helped them get over that first step from where you are to where they are, and maybe not reach out to people that are already working as data scientists but other data science learners.


Hugo: So something we’ve been talking around, Renée, is Twitter which can be an incredible resource for aspiring data scientists so maybe you can tell me a bit more about that.

Renée: Yeah, so in addition to all the books and courses and tutorials, I really use Twitter a lot to get the lingo of data science. There are these great communities on Twitter and you can usually use them by searching for certain hashtags. I'll give you a few of them. For Python people, there's pydata, pyladies, p4ds. For people learning R, there's rstats and rladies, r4ds. These are all hashtags you can search. A lot of those have Slack channels too. There's a data science learning club Slack channel that some followers of mine started a while back based on my podcast learning activities. There's a Slack called Data for Democracy for people who want to get into political data. There's a hashtag for data ethics. I'm sure there are similar groups like these on other social media like Facebook and LinkedIn, but I'm mostly on Twitter, so I have a whole blog post about using Twitter to learn data science. If you start searching for hashtags related to what you're learning, you'll usually start finding the leaders or the hubs in these communities, and you can learn a whole lot just by following them. Then if you ask a question and use that hashtag you'll usually get an answer. It's pretty cool.

Hugo: That's awesome. We'll link to your article on how to use Twitter to learn data science in the show notes as well. So for learners, how will they know when they're ready to actually be a data scientist or start interviewing?

Renée: Yeah. I think people are ready to start applying for jobs before they feel fully ready to make that jump. Don’t wait too long to start looking. Like we talked about, like doing those interviews is really instructional as well but I’d say that you’re ready when you’re confident enough with those basics so you know how to do exploratory data analysis and do some statistical summaries. You know that basic feature engineering, how to get a dataset into shape that you can use for machine learning. You know how to do some of that pre-processing and clean up. You can build a good report and a data visualization and communicate the results. Maybe you’ve used a few basic commonly used machine learning algorithms like logistic regression and random forest, so you’re confident enough with these basics that you know that you’re not gonna be totally struggling on the job.

Renée: Once you feel that you have that solid understanding of like how machine learning works and you can apply it, you probably want to also add in a few specific techniques that will make you stand out, either something you feel like you’re good at. Maybe you’re really awesome at building pretty visualizations that are easy to read. Maybe you’re really good at that back-end data engineering stuff. Something that you can say is your specialty when you’re applying for the jobs but you don’t need to check off the entire list of every algorithm and every tool and technique out there.

Renée: I've interviewed for jobs that included skills that I already had throughout my career and was confident with, plus some skills that I was still picking up. I knew that I could understand what people wanted, and I was confident enough that I could pick up those new tools and techniques along the way. I got a job before I thought I was ready, and at least I hope, and I've been told, that I've done really well there. A lot of stuff you can pick up as you go if you have the basics down. Don't feel like you have to be an expert in every area. Nobody is. Start applying and you'll get a sense for what it is that you still need to learn in order to get a certain type of job, but yeah, don't wait too long.
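[Editor's note: as one hedged illustration of the pre-processing and feature engineering basics Renée lists, here is a minimal sketch in plain Python. The records and field names are hypothetical, invented for this example rather than taken from HelioCampus or any real admissions system.]

```python
# Hypothetical raw applicant records, with a missing value and a categorical field
raw = [
    {"gpa": "3.4", "state": "VA", "aid_requested": "yes"},
    {"gpa": "",    "state": "MD", "aid_requested": "no"},
    {"gpa": "2.9", "state": "VA", "aid_requested": "yes"},
]

def clean(records):
    """Impute missing GPAs with the mean and one-hot encode the state."""
    gpas = [float(r["gpa"]) for r in records if r["gpa"]]
    mean_gpa = sum(gpas) / len(gpas)
    states = sorted({r["state"] for r in records})
    rows = []
    for r in records:
        row = {
            "gpa": float(r["gpa"]) if r["gpa"] else mean_gpa,
            "aid_requested": 1 if r["aid_requested"] == "yes" else 0,
        }
        for s in states:
            row[f"state_{s}"] = 1 if r["state"] == s else 0
        rows.append(row)
    return rows

features = clean(raw)
print(features[1])  # the record with a missing GPA now carries the imputed mean
```

A table shaped like this is what commonly used algorithms such as logistic regression or random forests consume; the modeling step itself is deliberately left out here.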

Hugo: I think the field is so vast and there are so many techniques and new techniques emerging all the time that if you try to be as comprehensive as possible you’ll always feel there’s more stuff to learn and you’ll never get out there.

Renée: Yeah you’re going to be learning on the job no matter how advanced you are when you apply. There’s a huge demand out there right now for people with data skills, so even if you get kind of a transitional data analyst type of role you might not have the title of data scientist right away, but if it’s a role that offers you the possibility of doing some machine learning yeah you can grow into that as you work.

Biggest Ethical Challenges

Hugo: I wanna shift slightly. Recently you gave a talk called Can a Machine be Racist or Sexist? Using this question you posed as a springboard, can you speak to what you consider the biggest ethical challenges facing data science and data scientists as a community?

Renée: Yeah, so we could do a whole episode just about this. I'll connect you with some people that I think would be great interviews that could talk extensively on this topic, but the main purpose of me doing that talk was to get people to understand that even though you're using these mathematical algorithms and computers to get a result, that doesn't mean that things produced by data science are unbiased. There are so many ways that bias, maybe you'd say racism or sexism, and I'm talking about the systemic kind, so not somebody yelling a word at somebody on the street, but historical racism that's baked into systems, can creep in. I have that master's in systems engineering and I think I've always been kind of a systems thinker, so I picked up on this quickly and I was trying to share it with other people. You can link to my whole talk for all the slides. I really struggled to cram in all the examples I wanted to give because there's really so much to learn here. With machine learning, you're really doing pattern matching. That's what those algorithms are doing, finding patterns in the data, which is a lot like stereotyping. You have to be aware of what data is going into making those decisions and make sure you understand the model outputs, so it's not completely a black box where you don't understand why a particular decision was made by the model when people's lives are being affected. Biases can be introduced at every step along the way in this development process. The data could have been incorrectly recorded in the first place. It might not be representative of the full population. It might be a limited sample, and you're training your model assuming it's gonna generalize, and it might not.

Renée: Your data could contain historic biases. For instance, crime databases are only gonna contain records for crimes in areas that are policed. If a crime at a certain location isn’t observed or isn’t recorded into the system by the police, an algorithm you train on that is gonna think there was no crime there and make predictions accordingly so it’s just you’re encoding not what’s happening in the real world necessarily but you’re capturing what people are capturing about the system that you’re looking at. There’s certain techniques that can amplify bias when you’re doing your pre-processing and model training.

Renée: There’s the question of what are you even optimizing for? For instance, YouTube has this problem where they’re optimizing for viewing time. They want your eyeballs on their ads. If something is like particularly crazy or creepy or exciting, people are gonna watch it a little longer and so those videos that are really extreme will bubble up to the top and be recommended to more people because when you watch them you might be fascinated by them and watch longer. It can kind of radicalize people. People might get to the point, especially kids I think, where you can’t necessarily separate the truth from this fiction that’s constantly in front of you because that fiction is exciting and interesting and makes you watch longer.

Renée: What you're optimizing for and what kind of effects that could have is important. How do you even decide when to stop optimizing, or whether the results of your model are good? That's a decision that requires human input. How do you know if the results of your model are being used properly, and not being misused or misinterpreted? There are people, and people making decisions, at every step along the model development process, so you can't say, oh, it's automated and computerized, there's no bias involved. There can be bias introduced at every single step.

Hugo: A lot of these issues are cultural as well, that as a community of data scientists we’re only now really starting, well there’s been work done on it previously, I don’t wanna dismiss that but we’re really only starting to think collectively about how to approach these problems now.

Renée: Yeah definitely. Yeah and it’s a culture of how the company is run and it really takes us data scientists making decisions about what we’re willing to do as well. So much of this like the models are being built under pressure for deadlines and being rolled out and you might not even know how it’s being used in the end, but just being aware of the impact of these things that we’re building is important. I love this quote from Susan Etlinger in a TED Talk that she gave. She said we have the potential to make bad decisions far more quickly, efficiently, and with far greater impact than we did in the past. We’re really just speeding up these decisions. We’re not necessarily making them better unless we make an effort to do that, so we have to make sure that as data scientists that we’re not causing harm and we’re in high demand right now so we’re lucky we have some choice in what kind of businesses we’re willing to work for and what kind of products we’re willing to contribute to. We can make a difference in our future and hopefully make it a little less dystopian than the entertainment world imagines or that we can imagine just by being aware of this and making conscious decisions of what we’re willing to build.

Call to Action

Hugo: I couldn’t agree more. Renée, do you have a final call to action for our listeners out there?

Renée: Yeah, so I know a lot of the people that listen to this podcast are just getting into data science, but some people have been lurking on Twitter for a long time, listening to podcasts for a long time, reading books, and so my call to action for them is: dig in. Find a data set. Start working with it. Tweet me at @becomingdatasci if you need help, and I’ll connect you with an online community that can help get you started. Don’t delay actually working with real data.

Renée: My call to action for people that aren’t new to data science is: I would encourage you to read up on data ethics, so that you understand how the work that you do in this field can affect real people’s lives. There are lots of great books out there now, so someone remind me when this episode comes out and I will tweet a list and share a bunch of books that I’ve collected, that I’ve either read already or that are in my Kindle waiting to be read. I’m really interested in this topic, it’s important to me, and I think it’s vital for people in our industry to be well aware of. So that would be my call to action for people that are already data scientists.

Hugo: Fantastic. Renée, it’s been such a pleasure having you on the show.

Renée: Great thanks for having me, Hugo. I’ve been listening for a long time and it’s exciting to actually be on here.

Hugo: It’s great to have you on particularly because I was listening to your podcast for so long, so it was a really fun experience.

Renée: Great.


To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Data visualization with statistical reasoning: seeing uncertainty with the bootstrap

Mon, 09/17/2018 - 12:31

(This article was first published on R – Dataviz – Stats – Bayes, and kindly contributed to R-bloggers)

This blog post is one of a series highlighting specific images from my book Data Visualization: charts, maps and interactive graphics. This is from Chapter 8, “Visualizing Uncertainty”.


One of the most common concerns that I hear from dataviz people is that they need to visualise not just a best estimate about the behaviour of their data, but also the uncertainty around that estimate. Sometimes, the estimate is a statistic, like the risk of side effects of a particular drug, or the percentage of voters intending to back a given candidate. Sometimes, it is a prediction of future data, and sometimes it is a more esoteric parameter in a statistical model. The objective is always the same: if they just show a best estimate, some readers may conclude that it is known with 100% certainty, and generally that’s not the case.

I want to describe a very simple and flexible technique for quantifying uncertainty called the bootstrap. This tries to tackle the problem that your data are often just a sample from a bigger population, and so that sample could yield an under- or over-estimate just by chance. We can’t tell if the sample’s estimate is off the true value, because we don’t know the true value, but (and I found this incredible when I first learnt it) statistical theory allows us to work out how likely we are to be off by a certain distance. That lets us put bounds on the uncertainty.

Now, it is worth saying here, before we go on, that this is not the only type of uncertainty you might come across. The poll of voters is uncertain because you didn’t ask every voter, just a sample, and we can quantify that as I’m describing here, but it’s also likely to be uncertain because the voters who agreed to answer your questions are not like the ones who did not agree. That latter source of uncertainty calls for other methods.

The underlying task is to work out what the estimates would look like if you had obtained a different sample from the same population. Sometimes, there are mathematical shortcut formulas that give you this — the familiar standard error, for example — immediately, by just plugging the right stats into a formula. But, there are some difficulties. For one, the datavizzer needs to know about these formulas, which one applies to their purposes, and to be confident in obtaining them from some analytical software or programming them. The second problem is that these formulas are sometimes approximations, which might be fine or might be off, and it takes experience and skill to know the difference. The third is that there are several useful stats, like the median, for which no decent shortcut formula exists, only rough approximations. The fourth problem is that shortcut formulas (I take this term from the Locks) mask the thought process and logic behind quantifying uncertainty, while the bootstrap opens it up to examination and critical thought.
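For the mean, at least, the shortcut formula and the bootstrap agree closely, which makes for a useful sanity check. Here is a minimal R sketch with made-up data (the sample `x` and its size are illustrative assumptions, not from the text):

```r
# Hedged sketch with made-up data: compare the shortcut formula for the
# standard error of a mean, sd(x)/sqrt(n), against a bootstrap estimate.
set.seed(1)
x <- rnorm(100, mean = 10, sd = 2)   # a made-up sample of size 100

se_formula <- sd(x) / sqrt(length(x))

# 1000 bootstrap pseudo-samples, taking the mean of each
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))
se_boot    <- sd(boot_means)

c(shortcut = se_formula, bootstrap = se_boot)   # the two should be close
```

For the median, by contrast, there is no comparably tidy formula to check against, which is exactly where the bootstrap earns its keep.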

The American Statistical Association’s GAISE guidelines for teaching stats now recommend starting with the bootstrap and related methods before you bring in shortcut formulas. So, if you didn’t study stats, yet want to visualise uncertainty from sampling, read on.


If you do dataviz, and you come from a non-statistical background, you will probably find bootstrapping useful. Here it is in a nutshell. If we had lots of samples (of the same size, picked the same way) from the same population, then it would be simple. We could get an estimate from each of the samples and look at how variable those estimates are. Of course, that would also be pointless because we could just put all the samples together to make a megasample. Real life isn’t like that. The next best thing to having another sample from the same population is having a pseudo-sample by picking from our existing data. Say you have 100 observations in your sample. Pick one at random, record it, and put it back — repeat one hundred times. Some observations will get picked more than once, some not at all. You will have a new sample that behaves like it came from the whole population.
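In R, the pick-record-replace step described above is a single call to `sample()` with `replace = TRUE` (the data here are made up for illustration):

```r
# One bootstrap pseudo-sample: pick from the data at random, with
# replacement, the same number of times as there are observations.
set.seed(1)
x <- rnorm(100)                       # stand-in for your 100 observations

pseudo <- sample(x, size = length(x), replace = TRUE)

length(pseudo)            # same size as the original: 100
length(unique(pseudo))    # usually fewer: some values repeat, some are left out
```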

Sounds too easy to be true, huh? Most people think that when they first hear about it. Yet its mathematical behaviour was established back in 1979 by Brad Efron.

Now, work out the estimate of interest from that pseudo-sample, and do this a lot (since the computer’s doing it for you, no sweat, you can generate 1000 pseudo-samples and their estimates of interest). Look at the distribution of those bootstrap estimates. The average of them should be similar to your original estimate, but you can shift them up or down to match (a bias-corrected bootstrap). How far away from the original do they stretch? Suppose you pick the central 95% of the bootstrap estimates; that gives you a 95% bootstrap confidence interval. You can draw that as an error bar, or an ellipse, or a shaded region around a line. Or, you could draw the bootstrap estimates themselves, all 1000 of them, and just make them very faint and semi-transparent. There are other, more experimental approaches too.
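The percentile interval is just a `quantile()` call on the bootstrap estimates. A small sketch, again with made-up data:

```r
# Sketch (made-up data): 1000 bootstrap estimates of the mean, then the
# central 95% of them as a percentile bootstrap confidence interval.
set.seed(1)
x <- rnorm(100, mean = 10, sd = 2)

boot_est <- replicate(1000, mean(sample(x, replace = TRUE)))

ci <- quantile(boot_est, probs = c(0.025, 0.975))   # central 95%
ci   # draw this as an error bar, ellipse, or shaded region
```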

Here, some 2-dimensional data (two variables) are summarised by the mean of x and the mean of y. How uncertain are those means? Are they correlated? Let’s visualise it by bootstrapping 100 times. The bootstrap means are semi-transparent red markers.
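A rough reconstruction of that figure in base R (the data frame here is invented; the one thing to get right is resampling whole rows, so any correlation between x and y is preserved):

```r
# Sketch of the figure: bootstrap the means of two variables 100 times
# and overlay the bootstrap means as semi-transparent red markers.
set.seed(1)
dat <- data.frame(x = rnorm(50), y = rnorm(50))   # made-up data

boot_xy <- t(replicate(100, {
  i <- sample(nrow(dat), replace = TRUE)          # resample whole rows
  c(mean(dat$x[i]), mean(dat$y[i]))
}))

plot(dat$x, dat$y, pch = 16, col = "grey60",
     xlab = "x", ylab = "y")
points(boot_xy[, 1], boot_xy[, 2], pch = 16,
       col = rgb(1, 0, 0, alpha = 0.3))           # semi-transparent red
```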

You can apply the bootstrap to a lot of different statistics and a lot of different data, but use some common sense. If you are interested in the maximum value in a population, then your sample is always going to be a poor estimate. Bootstrapping will not help; it will just reproduce the highest few values in your sample. If your data are very unrepresentative of the population for some reason, bootstrapping won’t help. If you only have a handful of observations, bootstrapping isn’t going to fill in more details than you already have. But, in that way, it can be more honest than the shortcut formulas.

100 bootstrapped splines through the same data; you can see the excessive influence the point at top right, and the two on the far left, exert on the ends of the curves.
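This is not the book’s own code, but the spline figure can be sketched along the same lines: resample the rows, refit a smoother each time, and draw all 100 fits faintly (the simulated data and `smooth.spline` smoother here are my illustrative choices):

```r
# Sketch: 100 bootstrapped smoothing splines through made-up data.
# The faint curves fan out most at the ends, where single points
# exert the greatest influence.
set.seed(1)
n <- 40
x <- sort(runif(n, 0, 10))
y <- sin(x) + rnorm(n, sd = 0.3)

plot(x, y, pch = 16)
xs <- seq(min(x), max(x), length.out = 200)
for (b in 1:100) {
  i   <- sample(n, replace = TRUE)        # resample (x, y) pairs
  fit <- smooth.spline(x[i], y[i])
  lines(predict(fit, xs), col = rgb(0, 0, 1, alpha = 0.1))
}
```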

If you want to read more about bootstrapping, you’ll need some algebra at the ready. There are two key books, one by bootstrap-meister Brad Efron with Stanford colleague Rob Tibshirani, and the other by Davison and Hinkley. They are pretty similar for accessibility. I own a copy of Davison and Hinkley, for what it’s worth.

You could do bootstrapping in pretty much any software you like, as long as you know how to pick one observation out of your data at random. You could do it in a spreadsheet, though you should be aware of the heightened risk of programming errors. I wrote a simple R function for bootstraps a while back, for my students when I was teaching intro stats at St George’s Medical School & Kingston Uni. If you use R, check that out.
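As an alternative to a hand-rolled function, R also ships with the `boot` recommended package, which wraps the resampling loop and the interval calculation; a brief sketch for the median (where, as noted above, no decent shortcut formula exists), with made-up data:

```r
# Sketch using the boot package (a recommended package shipped with R):
# a percentile bootstrap confidence interval for a median.
library(boot)
set.seed(1)
x <- rexp(100)                                 # made-up skewed data

med <- function(data, idx) median(data[idx])   # statistic(data, indices)
b   <- boot(x, statistic = med, R = 1000)      # 1000 pseudo-samples

boot.ci(b, type = "perc")                      # percentile bootstrap CI
```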



To leave a comment for the author, please follow the link and comment on their blog: R – Dataviz – Stats – Bayes.