R news and tutorials contributed by 750 R bloggers

Building an Interactive Globe Visualization in R

Thu, 11/01/2018 - 12:55

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

In this earlier post we analysed the location of meteorite impacts from this dataset, including plotting their fall locations on a globe. Forming part of the analysis was this interactive globe visualization below, which plots the location and age of meteorites.

In this post I am going to show how to create this globe with code. I then go on to describe other options and variations on the same theme.

Preparing and plotting the data

The code below does everything we need. Stepping through each section:

  • First, import the data then convert from the year the meteorite fell to an age in years.
  • Tidy the data frame to collect just the variables that we need.
  • Convert age into a number from 1 to 10 (the lowest 10% of ages map to 1, the next 10% to 2 .. etc), then map those numbers to shades of red.
  • Finally, plot the data on the globe where val (the mass) determines the length of each line and fixed pointsize determines the line thickness.
library(threejs)
library(flipChartBasics)

# Read the data and calculate age in years
x = read.csv("")
current = as.numeric(format(Sys.Date(), "%Y"))
x$age = current - as.numeric(substr(x$year, 7, 10))

# Filter the required information
x = x[ , c("reclong", "reclat", "mass..g.", "age")]
colnames(x) = c("long", "lat", "value", "age")

# Set colors on a scale of 1 to 10 by percentile
colors = as.numeric(cut(x$age,
                        breaks = quantile(x$age, probs = seq(0, 1, 0.1),
                                          include.lowest = TRUE, na.rm = TRUE)))
palette = ChartColors(10, "Reds", reverse = TRUE)
colors = palette[colors]

# Plot the data on the globe
globejs(lat = x$lat, long = x$long,
        val = 2 * log(x$value),
        color = colors,
        pointsize = 0.5,
        atmosphere = TRUE)

The output is slightly different from the original in the meteorites post because we have not excluded any data. You can spin the globe around with your mouse and zoom in or out.

Capital cities

As a second example, we’ll plot capital city data available here. The data is imported with the DownloadXLSX function from the flipAPI package.

The Population (thousands) column is imported as a factor. We amend it to be a number. The code to make the globe is broadly similar to that above.
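As a quick aside, the reason the conversion below goes through as.character() is that calling as.numeric() directly on a factor returns the underlying level codes rather than the labels. A minimal base-R illustration with made-up values:

```r
# Hypothetical values: a numeric column that was read in as a factor
pop <- factor(c("104", "1790"))

as.numeric(pop)                # level codes: 1 2
as.numeric(as.character(pop))  # actual values: 104 1790
```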

library(threejs)
library(flipChartBasics)
library(flipAPI)

# Make a data.frame of the required information
url <- ""
x = DownloadXLSX(url, skip = 15)
x = x[, c("Longitude", "Latitude", "Population (thousands)", "Capital City")]
names(x) = c("long", "lat", "population", "city")

# Convert population to numeric
x$population = as.character(x$population)
x$population[x$population == "\u2026"] = 0 # remove ellipsis
x$population = as.numeric(x$population)

# Set colors according to first letter of the city name
first.letters = sapply(substring(x$city, 1, 1), utf8ToInt) - utf8ToInt("A") + 1
palette = ChartColors(26, "Blues")
colors = palette[first.letters]

# Plot the data on the globe
earth = ""
globejs(img = earth, lat = x$lat, long = x$long,
        val = 10 * log(x$population),
        color = colors,
        pointsize = 5,
        atmosphere = FALSE,
        bg = "white")

A few notable differences are:

  • Addition of a more colorful background world image via the img argument.
  • Line lengths related to population and colors according to the first letter of the city name (light blue = A, dark = Z).
  • Removal of atmosphere, increase of pointsize and amending the default background (bg) color.

Finally, the globe image can be changed. If you wanted to confuse people by presenting the same information with an arbitrary citrus theme, you could make the following, again using the img argument.

The globes are created with the threejs package. You can see and modify the examples in this post by following this link.


To leave a comment for the author, please follow the link and comment on their blog: R – Displayr. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

What R version do you really need for a package?

Thu, 11/01/2018 - 11:26

(This article was first published on r – Jumping Rivers, and kindly contributed to R-bloggers)

At Jumping Rivers we run a lot of R courses. Some of our most popular courses revolve around the tidyverse, in particular, our Introduction to the tidyverse and our more advanced mastering course. We even trained over 200 data scientists at the NHS – see our case study for more details.

As you can imagine, when giving an on-site course, a reasonable question is what version of R is required for the course. We always have an RStudio Cloud back-up, but it’s nice for participants to run code on their own laptops. If participants bring their own laptop, it’s trivial for them to update R. But many of our clients are financial institutions or government bodies, where an upgrade is a non-trivial process.

So, what version of R is required for a tidyverse course? For the purposes of this blog post, we will define the list of packages we are interested in as

library("tidyverse")
tidy_pkgs = c("ggplot2", "purrr", "tibble", "dplyr",
              "tidyr", "stringr", "readr", "forcats")

The code below will work with any packages of interest. In fact, you can set pkgs to all R packages in CRAN, it just takes a while.

Package descriptions

In R, there is a handy function called available.packages() that returns a matrix of details corresponding to packages currently available at one or more repositories. Unfortunately, the format isn’t initially amenable to manipulation. For example, consider the readr package

readr_desc = available.packages() %>%
  as_tibble() %>%
  filter(Package == "readr")

I immediately converted the data to a tibble, as that

  • changed the rownames to a proper column
  • changed the matrix to a data frame/tibble, which made selecting easier

Looking at readr_desc, we see that it has a minimum R version

readr_desc$Depends
## [1] "R (>= 3.0.2)"

but due to the format, it would be difficult to compare against other R versions. Also, the list of imports

readr_desc$Imports
## [1] "Rcpp (>=, tibble, hms, R6"

has a similar problem. For example, with the data in this format, it would be difficult to select packages that depend on tibble.
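To make the difficulty concrete, here is a rough base-R sketch (not the clean_dependencies() function used later in the post) of pulling the package names out of such a string; the version number here is an illustrative placeholder:

```r
# Split a DESCRIPTION-style Imports string on commas, then strip any
# "(>= x.y.z)" version constraint (the version number is made up)
imports <- "Rcpp (>= 0.12.18), tibble, hms, R6"
pkgs <- trimws(strsplit(imports, ",")[[1]])
pkgs <- sub("\\s*\\(.*\\)$", "", pkgs)
pkgs
# [1] "Rcpp"   "tibble" "hms"    "R6"

# Now checking for a dependency on tibble is easy
"tibble" %in% pkgs
# [1] TRUE
```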

Tidy package descriptions

We currently have four columns

  • Imports, Depends, Suggests, Enhances

Each entry in these columns contains multiple packages, with possible version numbers. To tidy the data set I’m going to create four new columns:

  • depend_type: one of Imports, Depends, Suggests, Enhances and LinkingTo
  • depend_package: the package name
  • depend_version: the package version
  • depend_condition: something like equal to, less than or greater than
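The string manipulation involved can be sketched with base R regular expressions. This hypothetical parse_dep() helper is not the post’s clean_dependencies(), just an illustration of the idea on a single entry:

```r
# Extract the condition (e.g. ">=") and the version from a
# dependency string such as "R (>= 3.0.2)"
parse_dep <- function(x) {
  condition <- sub(".*\\(\\s*([<>=]+).*", "\\1", x)
  version   <- sub(".*\\(\\s*[<>=]+\\s*([0-9.]+)\\s*\\).*", "\\1", x)
  c(depend_condition = condition, depend_version = version)
}

parse_dep("R (>= 3.0.2)")
# returns c(depend_condition = ">=", depend_version = "3.0.2")
```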

The hard work is done by the function clean_dependencies(), which is at the end of the blog post. It essentially just does a bit of string manipulation to separate out the columns. The function works on one dependency type at a time, so we iterate over the types using map_df()

pkg_deps = c("Depends", "Enhances", "Suggests", "Imports", "LinkingTo")
av_pkgs = available.packages() %>%
  as_tibble()
av_pkgs = map_df(pkg_deps, clean_dependencies, av_pkgs = av_pkgs) %>%
  select(-pkg_deps)

# Std version: 3.4 -> 3.4.0
av_pkgs = av_pkgs %>%
  mutate(depend_version = if_else(depend_package == "R",
                                  clean_r_ver(depend_version),
                                  depend_version))

# Standardise column names
colnames(av_pkgs) = str_to_lower(colnames(av_pkgs))

After this step, we have a tibble with tidy columns:

av_pkgs %>%
  select(package, depend_type, depend_package, depend_version, depend_condition) %>%
  slice(1:4)
## # A tibble: 4 x 5
##   package depend_type depend_package depend_version depend_condition
##   <chr>   <chr>       <chr>          <chr>          <chr>
## 1 A3      Depends     R              2.15.0         >=
## 2 abbyyR  Depends     R              3.2.0          >=
## 3 abc     Depends     R              2.10.0         >=
## 4         Depends     R              2.10.0         >=

and we can see the minimum R version that the package authors have indicated for their package. However, this isn’t the true minimum version required. Each package imports a number of other packages; for example, readr imports 4 packages

av_pkgs %>%
  filter(package == "readr" & depend_type == "Imports") %>%
  select(depend_package) %>%
  pull()
## [1] "Rcpp"   "tibble" "hms"    "R6"

and each of those packages also imports other packages. Clearly, the minimum R version required to install a package is the maximum of the R versions required across all of its imported packages.
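As an aside, base R’s package_version() can compare version strings directly, which gives a compact way to take the maximum of a set of versions (the post’s max_r_version() helper, shown later, does this by hand):

```r
# numeric_version objects compare and sort correctly, unlike plain strings
vers <- c("3.0.2", "3.1.0", "2.15.0")
as.character(max(package_version(vers)))
# [1] "3.1.0"

# Plain string comparison can get this wrong, e.g. "3.2.0" vs "3.10.0":
max(c("3.2.0", "3.10.0"))                                  # "3.2.0" (lexicographic)
as.character(max(package_version(c("3.2.0", "3.10.0"))))   # "3.10.0"
```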

Some interesting things

Before we work out the maximum R version for each set of imports, we should first investigate how many imports each package has, using a bit of dplyr

imp = av_pkgs %>%
  group_by(package, depend_type) %>%
  summarise(n = length(depend_package)) %>%
  arrange(package, n) %>%
  filter(depend_type != "Enhances" & depend_type != "LinkingTo")

imp %>%
  group_by(depend_type) %>%
  summarise(mean(n))
## # A tibble: 3 x 2
##   depend_type `mean(n)`
##   <chr>           <dbl>
## 1 Depends          1.96
## 2 Imports          4.98
## 3 Suggests         3.63

Using histograms we get a better idea of the overall numbers. Note that here we’re using the ipsum theme from the hrbrthemes package.

library(hrbrthemes)
ggplot(imp, aes(n)) +
  geom_histogram(binwidth = 1, colour = "white", fill = "steelblue") +
  facet_wrap(~depend_type) +
  xlim(c(0, 25)) +
  ylim(c(0, 6000)) +
  theme_ipsum() +
  labs(x = "No. of packages", y = "Frequency",
       title = "Package dependencies",
       subtitle = "An average of 5 imports per package",
       caption = "Jumping Rivers")

Maximum overall imports

As I’ve mentioned, we need to obtain not just the imported packages, but also their dependencies. Fortunately, the tools package comes to our rescue,

tools::package_dependencies("readr",
                            which = c("Depends", "Imports", "LinkingTo"),
                            recursive = TRUE) %>%
  unlist() %>%
  unname()
##  [1] "Rcpp"       "tibble"     "hms"        "R6"         "BH"
##  [6] "methods"    "pkgconfig"  "rlang"      "utils"      "cli"
## [11] "crayon"     "pillar"     "assertthat" "grDevices"  "fansi"
## [16] "utf8"       "tools"

Using the package_dependencies() function, we simply

  • Obtain a list of dependencies for a given package
  • Extract the maximum version of R for all packages in the list

At the end of this post, there are two helper functions:

  • max_r_version() – takes a vector of R versions, and returns a maximum version. E.g.

    max_r_version(c("3.2.0", "3.3.2", "3.2.0")) ## 3.3.2
  • get_r_ver() – calls package_dependencies() and returns the maximum R version out of all of the dependencies.

Also, we have simplified some of the details, so what we’ve done isn’t quite right; it’s more of a first approximation. See the end of the post for details.

Now that that’s done, we can pass in the list of tidyverse packages and compare their stated R version with the actual required R version

# Here we are using just the tidyverse
# But this works with any packages
tidy = map_dfr(tidy_pkgs, get_r_ver, av_pkgs) %>%
  group_by(package) %>%
  slice(1) %>%
  select(package, r_real, r_cur)

We then select the packages where there is a difference

tidy %>%
  filter(r_real != r_cur | (! &
# A tibble: 3 x 3
# Groups:   package [3]
  package r_real r_cur
1 readr   3.1.0  3.0.2
2 tidyr   3.1.2  3.1.0

The largest difference in R versions is for readr (which feeds into the tidyverse). readr claims to only need R version 3.0.2, but a bit more investigation shows that readr depends on the tibble package, which requires R version 3.1.0. It is worth noting, though, that 3.1.0 is fairly old!

Take away lessons

The takeaway message is that dependencies matter. A single change affects everything in the package dependency tree. The other lesson is that the tidyverse team have been very careful about their dependencies. In fact, all of their packages are checked on R 3.1, 3.2, …, devel.

Simplifications: skipping package versions

In this analysis, we’ve completely ignored version numbers and always assumed we need the latest version of a package. This clearly isn’t correct. So to do this analysis properly, we would need the historical DESCRIPTION files for packages and use that to determine versions.

Thanks to Jim Hester who spotted an error in a previous version of this post.

Functions

clean_dependencies = function(av_pkgs, type) {
  no_of_cols = max(str_count(av_pkgs[[type]], ","), na.rm = TRUE) + 1
  cols = paste0(type, seq_len(no_of_cols))
  av_clean = av_pkgs %>%
    separate(type, sep = ",", into = cols, fill = "right", remove = FALSE) %>%
    gather(key = type, pkg, cols) %>%
    drop_na(pkg) %>%
    mutate(type = !!type) %>%
    separate(pkg, sep = "\\(", into = c("package", "version"), fill = "right") %>%
    mutate(version = str_remove(version, "\\)"),
           package = str_remove(package, "\\)")) %>%
    mutate(version = str_remove(version, "\n"),
           package = str_remove(package, "\n")) %>%
    mutate(version = str_trim(version),
           package = str_trim(package)) %>%
    filter(package != "")

  ## Detect where version is
  version_loc = str_locate(av_clean$version, "[0-9]")
  av_clean %>%
    mutate(condition = str_sub(version, end = version_loc[, "end"] - 1),
           version = str_sub(version, start = version_loc[, "end"])) %>%
    rename(depend_type = type, depend_package = package,
           depend_version = version, depend_condition = condition) %>%
    mutate(depend_condition = str_trim(depend_condition))
}

clean_r_ver = function(vers) {
  nas =
  vers[!nas] = paste0(vers[!nas],
                      if_else(str_count(vers[!nas], "\\.") == 2, "", ".0"))
  vers
}

max_r_version = function(r_versions) {
  s = str_split(r_versions, "\\.", n = 3)
  major = map(s, `[`, 1) %>% map(as.integer)
  if (sum(major == 3) > 0) {
    maj_number = 3
  } else if (sum(major == 2) > 0) {
    maj_number = 2
  } else if (sum(major == 1) > 0) {
    maj_number = 1
  } else {
    return("")
  }
  minor = map(s, `[`, 2) %>% map(as.integer)
  min_number = minor[which(major == maj_number)] %>% unlist() %>% max()
  third = map(s, `[`, 3) %>% map(as.integer)
  third_number = third[which(major == maj_number & minor == min_number)] %>%
    unlist() %>% max()
  glue::glue("{maj_number}.{min_number}.{third_number}")
}

get_r_ver = function(pkg, clean_av) {
  r_cur = clean_av %>%
    filter(depend_package == "R" & package == pkg) %>%
    select(depend_version) %>%
    pull()
  if (length(r_cur) == 0) r_cur = NA
  pkg_dep = tools::package_dependencies(pkg,
                                        which = c("Depends", "Imports", "LinkingTo"),
                                        recursive = TRUE) %>%
    unlist() %>%
    unname() %>%
    c(pkg)
  # Optional and recommended packages
  with_r = filter(av_pkgs, ! %>%
    select(package) %>%
    pull() %>%
    unique()
  pkg_dep = pkg_dep[!(pkg_dep %in% with_r)]
  r_ver = clean_av %>%
    filter(package %in% pkg_dep) %>%
    filter(depend_type == "Depends") %>%
    filter(depend_package == "R") %>%
    select(depend_version) %>%
    summarise(max_r_version(depend_version)) %>%
    pull()
  clean_av %>%
    filter(package == pkg) %>%
    mutate(r_real = as.character(r_ver), r_cur = r_cur)
}

The post What R version do you really need for a package? appeared first on Jumping Rivers.



RcppTOML 0.1.5: Small extensions

Thu, 11/01/2018 - 03:38

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Coming on the heels of last week’s RcppTOML 0.1.4 release bringing support for TOML v0.5.0, we have a new release 0.1.5 on CRAN with better encoding support as well as support for the time type.

RcppTOML brings TOML to R. TOML is a file format that is most suitable for configurations, as it is meant to be edited by humans but read by computers. It emphasizes strong readability for humans while at the same time supporting strong typing as well as immediate and clear error reports. On small typos you get parse errors, rather than silently corrupted garbage. Much preferable to any and all of XML, JSON or YAML – though sadly these may be too ubiquitous now. TOML has been making inroads with projects such as the Hugo static blog compiler, or the Cargo system of Crates (aka “packages”) for the Rust language.

The list of changes in this incremental version is below.

Changes in version 0.1.5 (2018-10-31)
  • Escape characters treatment now has toggle (Václav Hausenblas in #27 fixing #26)

  • The TOML 0.5.0 ‘time’ type is now supported (somewhat) by returning a formatted string (#29)

  • Encodings are now supported by defaulting to UTF-8 (Václav Hausenblas in #30 fixing #28)

Courtesy of CRANberries, there is a diffstat report for this release.

More information is on the RcppTOML page. Issues and bug reports should go to the GitHub issue tracker.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



From webscraping data to releasing it as an R package to share with the world: a full tutorial with data from NetHack

Thu, 11/01/2018 - 01:00

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

If someone told me a decade ago (back before I'd ever heard the term "roguelike") what I'd be doing today, I would have trouble believing this…

Yet here we are.

— Josh Ge (@GridSageGames) June 21, 2018


In this post, I am going to show you how you can scrape tables from a website, and then create a package
with the tidied data to share with the world. The data I am going to scrape comes from a NetHack
public server (link). The data I discuss in this blog post is
available in the {nethack} package I created, and I will walk you through the process of releasing
your package on CRAN. However, {nethack} is too large to be on CRAN (75 MB, while the maximum
allowed is 5 MB), so you can install it to play around with the data from GitHub:


And to use it:

library(nethack) data("nethack")

The data contains information on games played from 2001 to 2018; 322485 rows and 14 columns. I
will analyze the data in a future blog post. This post focuses on getting and then sharing the
data. By the way, all the content from the public server I scrape is under the CC BY 4.0 license.

I built the package by using the very useful {devtools} package.


NetHack is a game released in 1987 that is still being played and developed today.
NetHack is a roguelike game, meaning that it has procedurally generated dungeons and permadeath.
If you die, you have to start over, and because the dungeons are procedurally generated, this means
that you cannot learn the layout of the dungeons you explore or know when enemies are going
to attack, or even which enemies are going to attack. Enemies are not the only thing that you have
to be careful about; you can die from a lot of different events, as you will see in this post.
Objects that you find, such as a silver ring, might be helpful in one run, but be cursed in the next.

The latest version of the game, 3.6.1, was released on April 27th 2018, and this is how it looks:

The graphics are… bare-bones to say the least. The game runs inside a terminal emulator and is
available for any platform. The goal of NetHack is to explore a dungeon and go down every
level until you find the Amulet of Yendor. Once you find this Amulet, you have to go all the way
back upstairs, enter and fight your way through the Elemental Planes, enter the final Astral Plane,
and then finally offer the Amulet of Yendor to your god to finish the game. Needless to say,
NetHack is very difficult and players can go years without ever finishing the game.

When you start a new game, you have to create a character, which can have several attributes.
You have to choose a race (human, elf, orc, etc), a role (tourist, samurai, mage, etc) and an
alignment (neutral, law, chaos) and these choices impact your base stats.

If you can’t get past the ASCII graphics, you might want to try
Pathos, which is NetHack with touch screen support
(on smartphones) and tile graphics:

You can install NetHack on your computer or you can play online on a public server, such as
this one. There are several advantages to playing on a public server;
the player does not have to install anything, and we data enthusiasts have access to a mine of
information! For example, you can view the following table
information! For example, you can view the following table
which contains data on all the games played on October 25th 2018. These tables start in the year 2001,
and I am going to scrape the info from these tables, which will allow me to answer several questions.
For instance, what is the floor most players die on? What kills most players?
What role do players choose most often? I will explore these questions in a future blog post, but for
now I will focus on scraping the data and releasing it as a package on CRAN.

Scraping the data

To scrape the data I wrote a big function that does several things:

library("tidyverse")
library("rvest")

scrape_one_day <- function(link){

  convert_to_seconds <- function(time_string){
    time_numeric <- time_string %>%
      str_split(":", simplify = TRUE) %>%
      as.numeric
    time_in_seconds <- time_numeric * c(3600, 60, 1)
    if({
      time_in_seconds <- 61
    } else {
      time_in_seconds <- sum(time_in_seconds)
    }
    return(time_in_seconds)
  }

  Sys.sleep(1)

  date <- str_extract(link, "\\d{8}")

  read_lines_slow <- function(...){
    Sys.sleep(1)
    read_lines(...)
  }

  page <- read_html(link)

  # Get links
  dumplogs <- page %>%
    html_nodes(xpath = '//*[(@id = "perday")]//td') %>%
    html_children() %>%
    html_attr("href") %>%
    keep(str_detect(., "dumplog"))

  # Get table
  table <- page %>%
    html_node(xpath = '//*[(@id = "perday")]') %>%
    html_table(fill = TRUE)

  if(is_empty(dumplogs)){
    print("dumplogs empty")
    dumplogs <- rep(NA, nrow(table))
  } else {
    dumplogs <- dumplogs
  }

  final <- table %>%
    janitor::clean_names() %>%
    mutate(dumplog_links = dumplogs)

  print(paste0("cleaning data of date ", date))

  clean_final <- final %>%
    select(-x) %>%
    rename(role = x_2, race = x_3, gender = x_4, alignment = x_5) %>%
    mutate(time_in_seconds = map(time, convert_to_seconds)) %>%
    filter(!(death %in% c("quit", "escaped")),
           time_in_seconds > 60) %>%
    mutate(dumplog = map(dumplog_links,
                         ~possibly(read_lines_slow, otherwise = NA)(.))) %>%
    mutate(time_in_seconds = ifelse(time_in_seconds == 61, NA, time_in_seconds))

  saveRDS(clean_final, paste0("datasets/data_", date, ".rds"))
}

Let’s go through each part. The first part is a function that converts strings like “02:21:76” to a time in seconds:

convert_to_seconds <- function(time_string){
  time_numeric <- time_string %>%
    str_split(":", simplify = TRUE) %>%
    as.numeric
  time_in_seconds <- time_numeric * c(3600, 60, 1)
  if({
    time_in_seconds <- 61
  } else {
    time_in_seconds <- sum(time_in_seconds)
  }
  return(time_in_seconds)
}

I will use this function on the column that gives the length of the run. However,
before March 2008 this column is always empty, which is why I have the if()...else() statement at
the end: if the time in seconds is NA, then I make it 61 seconds. I do this because I want to keep
runs longer than 60 seconds, something I use filter() for later. But when filtering, if the condition
returns NA (which happens when you do NA > 60), you get an error and the function fails.
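The underlying issue is easy to reproduce in base R: any comparison against NA yields NA rather than TRUE or FALSE, so a filter condition built from a missing time is unusable as-is:

```r
# Comparisons with NA propagate the missingness
NA > 60
# [1] NA

# The 61-second sentinel sidesteps this: the run survives the "> 60"
# filter, and the sentinel is turned back into NA afterwards
time_in_seconds <- 61
ifelse(time_in_seconds == 61, NA, time_in_seconds)
# [1] NA
```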

The website links I am going to scrape all have the date of the day the runs took place. I am going to
keep this date because I will need to name the datasets I am going to write to disk:

date <- str_extract(link, "\\d{8}")

Next, I define this function:

read_lines_slow <- function(...){
  Sys.sleep(1)
  read_lines(...)
}

It is a wrapper around readr::read_lines() with a call to Sys.sleep(1). I will be scraping
a lot of pages, so pausing for one second between pages avoids overloading the server too much.

I then read the link with read_html() and start by getting the links of the dumplogs:

page <- read_html(link)

# Get links
dumplogs <- page %>%
  html_nodes(xpath = '//*[(@id = "perday")]//td') %>%
  html_children() %>%
  html_attr("href") %>%
  keep(str_detect(., "dumplog"))

You might be wondering what dumplogs are. Take a look at this screenshot:


When you click on those d‘s, you land on a page like this one (I
archived it to be sure that this link will not die). These logs contain a lot of info that I want
to keep. To find the right xpath to scrape the links, '//*[(@id = "perday")]//td', I used the
SelectorGadget extension for Chrome. First I chose the table:


and then the links I am interested in:


Putting them together, I get the right “xpath”. But just as with the time of the run, dumplogs are
only available after a certain date. So in case the dumplogs column is empty, I replace it with NA.

if(is_empty(dumplogs)){
  print("dumplogs empty")
  dumplogs <- rep(NA, nrow(table))
} else {
  dumplogs <- dumplogs
}

The rest is quite simple:

# Get table
table <- page %>%
  html_node(xpath = '//*[(@id = "perday")]') %>%
  html_table(fill = TRUE)

final <- table %>%
  janitor::clean_names() %>%
  mutate(dumplog_links = dumplogs)

print(paste0("cleaning data of date ", date))

I scrape the table, and then join the dumplog links to the table inside a new column called “dumplog_links”.

Because what follows is a long process, I print a message to let me know the progress of the scraping.

Now the last part:

clean_final <- final %>%
  select(-x) %>%
  rename(role = x_2, race = x_3, gender = x_4, alignment = x_5) %>%
  mutate(time_in_seconds = map(time, convert_to_seconds)) %>%
  filter(!(death %in% c("quit", "escaped")),
         time_in_seconds > 60) %>%
  mutate(dumplog = map(dumplog_links,
                       ~possibly(read_lines_slow, otherwise = NA)(.))) %>%
  mutate(time_in_seconds = ifelse(time_in_seconds == 61, NA, time_in_seconds))

I first remove and rename columns. Then I convert the “time” column into seconds and also remove
runs that lasted less than 60 seconds or that ended either in “quit” (the player left the game)
or “escaped” (the player left the dungeon and the game ended immediately). There are a lot of runs
like that and they’re not interesting. Finally, and this is what takes the longest, I create a new
list-column where each element is the contents of the dumplog for that run. I wrap read_lines_slow()
inside purrr::possibly() because dumplogs are missing for certain runs, and when I try to read
them I get a 404 error back. Getting such an error stops the whole process, so with purrr::possibly()
I can specify that in that case I want NA back. Basically, a function wrapped inside purrr::possibly()
never fails! Finally, if a game lasted 61 seconds, I convert it back to NA (remember this was
used to avoid having problems with the filter() function).
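The behaviour of purrr::possibly() can be mimicked in a few lines of base R with tryCatch(), which makes it clear what “never fails” means here (possibly_base() is a hypothetical stand-in for illustration, not part of purrr):

```r
# Wrap a function so that any error yields a default value instead
possibly_base <- function(f, otherwise) {
  function(...) tryCatch(f(...), error = function(e) otherwise)
}

safe_log <- possibly_base(log, otherwise = NA_real_)
safe_log(10)      # works as usual
safe_log("oops")  # would normally error; returns NA instead
```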

Finally, I export what I scraped to disk:

saveRDS(clean_final, paste0("datasets/data_", date, ".rds"))

This is where I use the date: to name the data. This is really important because scraping takes
a very long time, so if I don’t write the progress to disk as it goes, I might lose hours of work
if my internet connection goes down, or if my computer freezes.

In the lines below I build the links that I am going to scrape. They all follow the same pattern,
with the date appended at the end, so it’s quite easy to create a list of
dates to scrape, for example, for the year 2017:

link <- ""
dates <- seq(as.Date("2017/01/01"), as.Date("2017/12/31"), by = "day") %>%
  str_remove_all("-")
links <- paste0(link, dates)

Now I can easily scrape the data. To make extra sure that I will not have problems during the
scraping process, for example if on a given day no games were played (and thus there is no table
to scrape, which would result in an error), I use the same trick as above with purrr::possibly():

map(links, ~possibly(scrape_one_day, otherwise = NULL)(.))

The scraping process took a very long time. I scraped all the data by letting my computer run for
three days!

After this long process, I import all the .rds files into R:

path_to_data <- Sys.glob("datasets/*.rds")
nethack_data <- map(path_to_data, readRDS)

and take a look at one of them:

nethack_data[[5812]] %>% View()

Let’s convert the “score” column to numeric. For this, I will need to convert strings that look
like “12,232” to numbers. I’ll write a short function to do this:

to_numeric <- function(string){
  str_remove_all(string, ",") %>%
    as.numeric
}

nethack_data <- nethack_data %>%
  map(~mutate(., score = to_numeric(score)))

Let’s merge the data into a single data frame:

nethack_data <- bind_rows(nethack_data)

Now that I have a nice data frame, I will remove some columns and start the process of making a
package. I remove the columns that I created and that are now useless (such as the dumplog_links column):

nethack_data <- nethack_data %>%
  select(rank, score, name, time, turns, lev_max, hp_max, role,
         race, gender, alignment, death, date, dumplog)

Export this to .rds format, as it will be needed later:

saveRDS(nethack_data, "nethack_data.rds")

Making a package to share your data with the world

As stated in the beginning of this post, I will walk you through the process of creating and
releasing your package on CRAN. However, the data I scraped was too large to be made available
as a CRAN package. But you can still get the data from Github (link is in the abstract at the
beginning of the post).

Making a data package is a great way to learn how to make packages, because it is relatively easy
to do (for example, you do not need to write unit tests). First, let’s start a new project
in RStudio:

Then select “R package”:

Then name your package, create a git repository and then click on “Create Project”:

RStudio will open the hello.R script, which you can now modify. You’ve got to learn from the best, so
I suggest that you modify hello.R by taking inspiration from the babynames package made by Hadley
Wickham, which you can find here.
You do not need the first two lines, and can focus on lines 4 to 13. Then, rename the script to
data.R. This is what {nethack}’s data.R looks like:

#' NetHack runs data.
#'
#' Data on NetHack runs scraped from
#'
#' @format A data frame with 14 variables: \code{rank}, \code{score},
#' \code{name}, \code{time}, \code{turns}, \code{lev_max}, \code{hp_max}, \code{role}, \code{race},
#' \code{gender}, \code{alignment}, \code{death}, \code{date} and \code{dumplog}
#' \describe{
#' \item{rank}{The rank of the player on that day}
#' \item{score}{The score the player achieved on that run}
#' \item{name}{The name of the player}
#' \item{time}{The time the player took to finish the game}
#' \item{turns}{The number of turns the player played before finishing the game}
#' \item{lev_max}{The maximum character level the player achieved}
#' \item{hp_max}{The maximum character health points the player achieved}
#' \item{role}{The role the player chose to play as}
#' \item{race}{The race the player chose to play as}
#' \item{gender}{The gender the player chose to play as}
#' \item{alignment}{The alignment the player chose to play as}
#' \item{death}{The reason of death of the character}
#' \item{date}{The date the game took place}
#' \item{dumplog}{The log of the end game; this is a list column}
#' }
"nethack"

The comments are special: each “#” is followed by a “'”, forming “#'”. These special comments are
parsed by roxygen2::roxygenise() and converted into documentation files.

Next is the DESCRIPTION file. Here is how {nethack}'s looks:

Package: nethack
Type: Package
Title: Data from the Video Game NetHack
Version: 0.1.0
Authors@R: person("Bruno André", "Rodrigues Coelho",
           email = "", role = c("aut", "cre"))
Description: Data from NetHack runs played between 2001 to 2018
    on , a NetHack public server.
Depends: R (>= 2.10)
License: CC BY 4.0
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.1.0

Adapt yours accordingly. I chose the CC BY 4.0 license because this was the license under which
the original data was published. It is also a good idea to add a Vignette:
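The snippet that followed here is missing from this scrape. Assuming the devtools-based workflow used elsewhere in this post, creating the vignette skeleton likely looked something like this (the vignette name is hypothetical):

```r
# create vignettes/ and a skeleton .Rmd file;
# "using-nethack" is only an example name
devtools::use_vignette("using-nethack")
```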


Vignettes are very useful documentation with more details and examples.

It is also good practice to add the script that was used to scrape the data. Such scripts go into
data-raw/. Create this folder with:
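The command itself is missing here; in the devtools workflow of the time, creating the folder was likely done with something like:

```r
# create the data-raw/ folder and add it to .Rbuildignore
devtools::use_data_raw()
```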


This creates the data-raw/ folder, where I save the script that scrapes the data. Now it is time to
put the data in the package. Start by importing the data:

nethack <- readRDS("nethack_data.rds")

To add the data to your package, you can use the following command:

devtools::use_data(nethack, compress = "xz")

This will create the data/ folder and put the data in there in the .rda format. I use the “compress”
option to make the data smaller. You can now create the documentation by running:
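The command is missing from this scrape, but it was already named earlier in the post:

```r
# parse the special #' comments in R/data.R and
# generate the documentation files under man/
roxygen2::roxygenise()
```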


Pay attention to the log messages: you might need to remove files (for example the documentation
hello.R, under the folder man/).

Now you can finally run R CMD check by clicking the Check button in the “Build” pane:

This will extensively check the package for ERRORS, WARNINGS and NOTES. You need to make sure
that the check passes without any ERRORS or WARNINGS and try as much as possible to remove all
NOTES too. If you cannot remove a NOTE, for example in my case the following:

checking installed package size ... NOTE
  installed size is 169.7Mb
  sub-directories of 1Mb or more:
    data  169.6Mb

R CMD check results
0 errors | 0 warnings | 1 note

You should document it in a new file:

## Test environments
* local openSUSE Tumbleweed install, R 3.5.1
* win-builder (devel and release)

## R CMD check results
There were no ERRORs or WARNINGs.

There was 1 NOTE:

* installed size is 169.7Mb
  sub-directories of 1Mb or more:
    data 169.6Mb

The dataset contains 17 years of NetHack games played, hence the size.
This package will not be updated often (max once a year).

Once you have eliminated all errors and warnings, you are almost ready to go.

You now need to test the package on different platforms. This depends a bit on the system you run;
in my case, because I run openSUSE (a GNU+Linux distribution), I have to test on Windows. This can
be done with:

devtools::build_win(version = "R-release")

and:

devtools::build_win(version = "R-devel")

Explain that you have tested your package on several platforms in the file.

Finally you can add a and a file and start the process of publishing the
package on CRAN:
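The final snippet is missing here as well; given the devtools workflow used throughout this post, a likely candidate is:

```r
# re-run the checks and walk through the questions
# for submitting the package to CRAN
devtools::release()
```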


If you want many more details than what you can find in this blog post, I urge you to read
“R Packages” by Hadley Wickham, which you can read for free here.

If you found this blog post useful, you might want to follow me on twitter for blog post updates.


To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Communicating results with R Markdown

Thu, 11/01/2018 - 01:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

In my training as a consultant, I learned that long hours of analysis were typically followed by equally long hours of preparing for presentations. I had to turn my complex analyses into recommendations, and my success as a consultant depended on my ability to influence decision makers. I used a variety of tools to convey my insights, but over time I increasingly came to rely on R Markdown as my tool of choice. R Markdown is easy to use, allows others to reproduce my work, and has powerful features such as parameterized inputs and multiple output formats. With R Markdown, I can share more work with less effort than I did with previous tools, making me a more effective data scientist. In this post, I want to examine three commonly used communication tools and show how R Markdown is often the better choice.

Microsoft Office

The de facto tools for communication in the enterprise are still Microsoft Word, PowerPoint, and Excel. These tools, born in the 80’s and rising to prominence in the 90’s, are used everywhere for sharing reports, presentations, and dashboards. Although Microsoft Office documents are easy to share, they can be cumbersome for data scientists to write because they cannot be written with code. Additionally:

  • They are not reproducible.
  • They are separate from the code you used to create your analysis.
  • They can be time-consuming to create and difficult to maintain.

In data science, your code – not your report or presentation – is the source of your results. Therefore, your documents should also be based on code! You can accomplish this with R Markdown, which produces documents that are generated by code, reproducible, and easy to maintain. Moreover, R Markdown documents can be rendered in Word, PowerPoint, and many other output formats. So, even if your client insists on having Microsoft documents, by generating them with R Markdown, you can spend more time working on your code and less time maintaining reports.

R Scripts

Data science often involves interactive analyses with code, but code by itself is usually not enough to communicate results in an enterprise setting. In a previous post, I explained the benefits of using R Notebooks over R scripts for doing data science. An R Notebook is a special execution mode of R Markdown with two characteristics that make it very useful for communicating results:

  • Rendering a preview of an R Notebook does not execute R code, making it computationally convenient to create reports during or after interactive analyses.
  • R Notebooks have an embedded copy of the source code, making it convenient for others to examine your work.

These two characteristics of R Notebooks combine the advantages of R scripts with the advantages of R Markdown. Like R scripts, you can do interactive data analyses and see all your code, but unlike R scripts you can easily create reports that explain why your code is important.

Shiny


Shiny and R Markdown are both used to communicate results. They both depend on R, generate high-quality output, and can be designed to accept user inputs. In previous posts, we discussed Dashboards with Shiny and Dashboards with R Markdown. Knowing when to use Shiny and when to use R Markdown will increase your ability to influence decision makers.

Shiny Apps:
  • Have an interactive and responsive user experience.
  • Are hosted on a web server that runs R.
  • Are not portable (i.e., users must visit the app).

R Markdown Documents:
  • Are snapshots in time, rendered in batch.
  • Have multiple output types such as HTML, Word, PDF, and many more.
  • Are files that can be sent via email or otherwise shared.

Shiny is great – even “magical” – when you want your end users to have an interactive experience, but R Markdown documents are often simpler to program, easier to maintain, and can reach a wider audience. I use Shiny when I need an interactive user experience, but for everything else, I use R Markdown.

If you need to accept user input, but you don’t require the reactive framework of Shiny, you can add parameters to your R Markdown code. This process is easy and powerful, yet remains underutilized by most R users. It is a feature that would benefit a wide range of use cases, especially where the full power of Shiny is not required. Additionally, adding parameters to your document makes it easy to generate multiple versions of that document. If you host a document on RStudio Connect, then users can select inputs and generate new versions on demand. Many Shiny applications today would be better suited as parameterized R Markdown documents.
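As a sketch of what this looks like (the file name, parameter, and values are hypothetical): a parameter is declared in the Rmd YAML header and then passed to rmarkdown::render().

```r
# YAML header of report.Rmd (shown here as a comment):
#   ---
#   title: "Regional report"
#   output: html_document
#   params:
#     region: "All"
#   ---
# Inside the document, the value is available as params$region.

library(rmarkdown)

# generate one version of the document per region
for (region in c("North", "South")) {
  render("report.Rmd",
         params = list(region = region),
         output_file = paste0("report-", region, ".html"))
}
```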

Finally, Shiny and R Markdown are not mutually exclusive. You can include Shiny elements in an R Markdown document, which enables you to create a report that responds interactively to user inputs. These Shiny documents are created with the simplicity of R Markdown, but have the same hosting requirements as a Shiny app and are not portable.


Using the right tools for communication matters. R Markdown is a better solution than conventional tools for the following problems:

Problem                                        Common tool       Better tool
Share reports and presentations                Microsoft Office  R Markdown
Summarize and share your interactive analyses  R Scripts         R Notebooks
Update results (in batch) based on new inputs  Shiny             Parameterized reports

R For Data Science explains that, “It doesn’t matter how great your analysis is unless you can explain it to others: you need to communicate your results.” I highly recommend reading Part V of this book, which has chapters on using R Markdown as a unified authoring framework for data science, using R Markdown formats for effective communication, and using R Markdown workflows to create analysis notebooks. There are references at the end of these chapters that describe where to learn more about communication.



To leave a comment for the author, please follow the link and comment on their blog: R Views.

RHL’19 St-Cergue, Switzerland, 25-27 January 2019

Wed, 10/31/2018 - 22:06

(This article was first published on - r-project, and kindly contributed to R-bloggers)

(translated from original French version)

The Rencontres Hivernales du Libre (RHL) (Winter Meeting of Freedom) takes place 25-27 January 2019 at St-Cergue. invites the free software community to come and share workshops, great meals and good times.

This year, we celebrate the 5th edition with the theme «Exploit».

Please think creatively and submit proposals exploring this theme: lectures, workshops, performances and other activities are all welcome.

RHL’19 is situated directly at the base of some family-friendly ski pistes suitable for beginners and more adventurous skiers. It is also a great location for alpine walking trails.

Why, who?

RHL’19 brings together the forces of freedom in the Leman basin, Romandy, neighbouring France and further afield (there is an excellent train connection from Geneva airport). Hackers and activists come together to share a relaxing weekend and discover new things with free technology and software.

If you have a project to present (in 5 minutes, an hour or another format) or activities to share with other geeks, please send an email to or submit it through the form.

If you have any specific venue requirements please contact the team.

You can find detailed information on the event web site.

Please ask if you need help finding accommodation or any other advice planning your trip to the region.


To leave a comment for the author, please follow the link and comment on their blog: - r-project.

Gold-Mining Week 9 (2018)

Wed, 10/31/2018 - 21:22

(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

Week 9 Gold Mining and Fantasy Football Projection Roundup now available. Go get that free agent gold!

The post Gold-Mining Week 9 (2018) appeared first on Fantasy Football Analytics.


To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics.

Spooky! Gravedigger in R

Wed, 10/31/2018 - 16:38

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Need something to distract the kids while they're waiting to head out Trick-or-Treating? You could have them try out a Creepy Computer Game in R! Engineer and social scientist Dr Peter Provost translated one of the old BASIC games from the classic book Creepy Computer Games into R: just have them type in this file (and yes, they do need to type it character-by-character to get the full experience) and then source it into R to play the game. The goal is to move your character * to the exit at the SE corner of the graveyard populated by gravestones + while avoiding the skeletons X.

For extra credit, see if they can figure out the bug that causes the move counter to go down, instead of up. It might help to refer to the original BASIC sources; the book is online at the publisher's website as a PDF here.

The Lucid Manager: Celebrate Halloween with Creepy Computer Games in R


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Now "fread" from data.table can read "gz" and "bz2" files directly

Wed, 10/31/2018 - 15:14

(This article was first published on Coastal Econometrician Views, and kindly contributed to R-bloggers)

Dear R Programmers,

For those of you who use data.table to read data, the good news is that fread now supports direct reading of compressed files such as “gz” and “bz2”.

As mentioned several times before, a good way to save space while keeping reads fast is to first save
raw files in “gz” format, read them into R, and then convert them to the fst format for all further reads/loads.
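A minimal sketch of that workflow (file names are hypothetical; direct gz/bz2 reading requires a recent data.table with the R.utils package installed, and the last step uses the fst package):

```r
library(data.table)
library(fst)

# fread() now decompresses gz/bz2 files transparently
dt <- fread("raw_data.csv.gz")

# save once in fst format for fast subsequent loads
write_fst(dt, "raw_data.fst")
dt <- read_fst("raw_data.fst", as.data.table = TRUE)
```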

Happy R Programming!


To leave a comment for the author, please follow the link and comment on their blog: Coastal Econometrician Views.

How to run R from the Task Scheduler

Wed, 10/31/2018 - 13:36

(This article was first published on R – Open Source Automation, and kindly contributed to R-bloggers)


In a prior post, we covered how to run Python from the Task Scheduler on Windows. This article is similar, but it’ll show how to run R from the Task Scheduler, instead. Similar to before, let’s first cover how to run R from the command line, as knowing this is useful for running it from the Task Scheduler.

Running R from the Command Line

To open up the command prompt, just press the windows key and search for cmd. When R is installed, it comes with a utility called Rscript. This allows you to run R commands from the command line.

If Rscript is in your PATH, then typing Rscript into the command line and pressing Enter will not result in an error. Otherwise, you might get a message saying “‘Rscript’ is not recognized as an internal or external command.” If that’s the case, then to use Rscript you will either need to add it to your PATH (as an environment variable), or use the full path to Rscript on your machine. To find the full path, search for where R is installed on your computer. For instance, it may be something like below (this will vary depending on what version of R you have installed):

C:\Program Files\R\R-3.4.3

Within this folder, there should be a sub-folder called bin. This sub-folder contains Rscript.

Thus, the full path is:

C:\Program Files\R\R-3.4.3\bin\Rscript

For appending the PATH variable, see this link.

Using the above we can run an R command like this:

"C:\Program Files\R\R-3.4.3\bin\Rscript" -e "cat('this is a test')"

Or – if you have Rscript in your path, you can run this:

Rscript -e "cat('this is a test')"

This code will run the cat command to print out 'this is a test'. The quotes are necessary around the full path to Rscript in the first option because of the spaces in the path name. The -e flag refers to expression. Thus the expression following -e gets executed.

Hence, if we want to run an R script from the command line, we can call source with Rscript. Suppose the name of the script we want to run is called “C:/path/to/script/some_code.R”

Then, we can run this script from the command line like so:

Rscript -e "source('C:/path/to/script/some_code.R')"

Running R from the Task Scheduler

Now, we can take this notion to run R via the Task Scheduler. Pressing the windows key, followed by typing “task scheduler” should bring the Task Scheduler up. Once it’s open, click on “Action”, and then press “Create Task”.

After this, you will see a place where you need to input the name of your task.

You will also see where you can select “Run whether user is logged on or not.” If you need to run a script in production where you won’t necessarily be logged on all the time, then you should select the “whether user is logged on or not” option. As a note, however, you will need to have the appropriate permissions for this to work for you. This generally means you’ll need to have batch job rights. If you’re an administrator on the machine you’re creating the task, then this shouldn’t be a problem.

If you’re just running a task on a computer where you’ll be logged on when you need it to run, then you can select the “Run only when user is logged on” option.

The next step is to select the “Triggers” tab.

You should get a box that pops up where you can select what dates / times you need your script to run. After you’ve made your selections, you can go to the “Actions” tab, where you’ll create a new action. Clicking “New” brings up the box below.

There are a couple of ways you can add an action here. First, you can copy exactly what we did above to run an R script from the command line into the “Program/Script” box:

Rscript -e "source('C:/path/to/script/some_code.R')"

Or second, you can type Rscript into this box, and put the rest of the line, -e "source('C:/path/to/script/some_code.R')", into the “Add Arguments (optional)” box.

Lastly, just press “OK”, and you’re done!

That’s it for this post! Check out other R articles of mine here:

The post How to run R from the Task Scheduler appeared first on Open Source Automation.


To leave a comment for the author, please follow the link and comment on their blog: R – Open Source Automation.

Multilevel Modeling Solves the Multiple Comparison Problem: An Example with R

Wed, 10/31/2018 - 08:09

(This article was first published on Method Matters, and kindly contributed to R-bloggers)

Multiple comparisons of group-level means is a tricky problem in statistical inference. A standard practice is to adjust the threshold for statistical significance according to the number of pairwise tests performed. For example, according to the widely-known Bonferroni method, if we have 3 different groups for which we want to compare the means of a given variable, we would divide the standard significance level (.05 by convention) by the number of tests performed (3 in this case), and only conclude that a given comparison is statistically significant if its p-value falls below .017 (i.e. .05/3).

In this post, we’ll employ an approach often recommended by Andrew Gelman, and use a multi-level model to examine pairwise comparisons of group means. In this method, we use the results of a model built with all the available data to draw simulations for each group from the posterior distribution estimated from our model. We use these simulations to compare group means, without resorting to the expensive corrections typical of standard multiple comparison analyses (e.g. the Bonferroni method referenced above).

The Data

We will return to a data source that we have examined previously on this blog: walking data from the Accupedo app on my phone. Accupedo is a free step-counting app that I’ve been using for more than 3 years to track how many steps I walk each day. It’s relatively straightforward to import these data into R and to extract the cumulative number of steps taken per hour.

In the data used for this post, I exclude all measurements taken before 6 AM (because the step counts are cumulative we don’t lose any steps by excluding these observations). This results in 18,117 observations of cumulative hourly step counts for 1113 days. The first observation is on 3 March 2015 and the last on 22 March 2018, resulting in just over 3 years of data!

A sample of the dataset, called aggregate_day_hour, is shown below:

       oneday dow hour steps week_weekend
1  2015-03-07 Sat   16 14942      Weekend
2  2015-03-07 Sat   17 14942      Weekend
3  2015-03-07 Sat   18 14991      Weekend
4  2015-03-07 Sat   19 15011      Weekend
5  2015-03-07 Sat   20 15011      Weekend
6  2015-03-07 Sat   21 15011      Weekend
7  2015-03-07 Sat   22 15011      Weekend
8  2015-03-07 Sat   23 15011      Weekend
9  2015-03-08 Sun   11  2181      Weekend
10 2015-03-08 Sun   12  2428      Weekend

Oneday represents the calendar year/month/day the observation was recorded, dow contains the name of the day of the week, hour is the hour of the day (in 24-hour or “military” style), steps gives the cumulative step count and week_weekend indicates whether the day in question is a weekday or a weekend.

Analytical Goal

Our ultimate purpose here is to compare step counts across a large number of days, without having to adjust our threshold for statistical significance. In this example, we will compare the step counts for 50 different days. The standard Bonferroni correction would require us to divide the significance level by the number of possible pairwise comparisons, 1225 in this case. The resulting significance level is .05/1225, or .00004, which imposes a very strict threshold, making it more difficult to conclude that a given comparison is statistically significant.
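The arithmetic above is easy to verify in R:

```r
n_days <- 50
choose(n_days, 2)        # 1225 possible pairwise comparisons
0.05 / choose(n_days, 2) # about 4.08e-05, i.e. the .00004 threshold
```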

As mentioned above, the method adopted in this post is one that Andrew Gelman often advocates on his blog. This advice often comes in the form of: “fit a multilevel model!”, with a link to this paper by Gelman, Hill and Yajima (2012). I have to admit that I never fully understood the details of the method from the blog posts. However, by looking at the paper and the explanations and code presented in Chapter 12 of Gelman and Hill’s (2007) book, I think I’ve figured out what he means and how to do it. This post represents my best understanding of this method.*

In brief (see the paper for more details), we calculate a multilevel model using flat priors (e.g. the standard out-of-the-box model in the lme4 package). We extract the coefficients from the model and the standard deviation of the residual error of estimated step counts per day (also known as sigma hat), and use these values to simulate step count totals for each day (based on the estimated total step count and sigma hat). We then compare the simulations for each day against each other in order to make pairwise comparisons across days.

Sound confusing? You’re not alone! I’m still grappling with how this works, but I’ll gently walk through the steps below, explaining the basic ideas and the code along the way.
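In miniature, and with made-up numbers rather than real model output, the procedure looks like this:

```r
set.seed(1234)
n_sims <- 1000

# hypothetical model-estimated step counts for two days (y hat) ...
y_hat_day_a <- 14000
y_hat_day_b <- 12500
# ... and a hypothetical residual standard deviation (sigma hat)
sigma_hat <- 2900

# simulate plausible step-count totals for each day
sims_a <- rnorm(n_sims, mean = y_hat_day_a, sd = sigma_hat)
sims_b <- rnorm(n_sims, mean = y_hat_day_b, sd = sigma_hat)

# proportion of simulations in which day A exceeds day B;
# values near 0 or 1 suggest a credible difference
mean(sims_a > sims_b)
```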

Step 1: Calculate the Model

In order to perform our pairwise comparisons of 50 days of step counts, we will construct a multi-level model using all of the available data. We’ll employ a model similar to one we used in a previous post on this blog. The model predicts step count from A) the hour of the day and B) whether the day is a weekday or a weekend. We specify fixed effects (meaning regression coefficients that are the same across all days) for hour of the day and time period (weekday vs. weekend). We specify a random effect for the variable that indicates the day each measurement was taken (e.g. a random intercept per day, allowing the average number of estimated steps to be different on each day). We center the hour variable at 6 PM (more on this below).

We can calculate this model using the lme4 package. We will also load the arm package (written to accompany Gelman and Hill’s (2007) book) to extract model parameters we’ll need to compute our simulations below.

# load the lme4 package
library(lme4)
# load the arm package to extract the estimate of the
# standard deviation of our residuals (sigma hat)
library(arm)

# center the hour variable for 6 pm
aggregate_day_hour$hour_centered <- aggregate_day_hour$hour -18

# compute the model
lme_model <- lmer(steps ~ 1 + hour_centered + week_weekend + (1 | oneday),
                  data = aggregate_day_hour)

# view the model results
summary(lme_model)
The estimates of the fixed effects are as follows:

                    Estimate Std. Error t value
(Intercept)         13709.05     143.67   95.42
hour_centered        1154.39       4.55  253.58
week_weekendWeekend -1860.15     268.35   -6.93

Because we have centered our hour variable at 6 PM, the intercept value gives the estimated step count at 6 PM during a weekday (e.g. when the week_weekend variable is 0, as is the case for weekdays in our data). The model thus estimates that at 6 PM on a weekday I have walked 13,709.05 steps. The hour_centered coefficient gives the estimate of the number of steps per hour: 1,154.39. Finally, the week_weekendWeekend variable gives the estimated difference in the total number of steps per day I walk on a weekend, compared to a weekday. In other words, I walk 1,860.15 fewer steps on a weekend day compared to a weekday.

Step 2: Extract Model Output


In order to compare the step counts across days, we will make use of the coefficients from our multilevel model. We can examine the coefficients (random intercepts and fixed effects for hour and day of week) with the following code:

# examine the coefficients
coef(lme_model)

Which gives the coefficient estimates for each day (first 10 rows shown):

           (Intercept) hour_centered week_weekendWeekend
2015-03-03     -184.31       1154.39            -1860.15
2015-03-04    11088.64       1154.39            -1860.15
2015-03-05     9564.42       1154.39            -1860.15
2015-03-06     9301.65       1154.39            -1860.15
2015-03-07    15159.48       1154.39            -1860.15
2015-03-08    10097.27       1154.39            -1860.15
2015-03-09    15163.94       1154.39            -1860.15
2015-03-10    18008.48       1154.39            -1860.15
2015-03-11    15260.70       1154.39            -1860.15
2015-03-12    10892.19       1154.39            -1860.15

We will extract these coefficients from the model output, and merge in the day of the week (week vs. weekend) for every date in our dataset. We will then create a binary variable for week_weekend, which takes on a value of 0 for weekdays and 1 for weekends (akin to that automatically created when running the multilevel model with the lme4 package):

# extract coefficients from the lme_model
coefficients_lme <-[1])
names(coefficients_lme) <- c('intercept', "hour_centered", "week_weekendWeekend")
coefficients_lme$date <- row.names(coefficients_lme)

# create a day-level dataset with the indicator
# of day (week vs. weekend)
week_weekend_df <- unique( aggregate_day_hour[ , c('oneday', 'week_weekend') ] )
# join the week/weekend dataframe to the coefficients dataframe
library(plyr); library(dplyr)
coefficients_lme <- left_join(coefficients_lme,
week_weekend_df[,c('oneday', 'week_weekend')],
by = c("date" = 'oneday'))
# make a dummy variable for weekend which
# takes on the value of 0 for weekday
# and 1 for the weekend
coefficients_lme$weekend <- ifelse(coefficients_lme$week_weekend == 'Weekend', 1, 0)

Our resulting dataframe looks like this:

   intercept hour_centered week_weekendWeekend       date week_weekend weekend
1    -184.31       1154.39            -1860.15 2015-03-03      Weekday       0
2   11088.64       1154.39            -1860.15 2015-03-04      Weekday       0
3    9564.42       1154.39            -1860.15 2015-03-05      Weekday       0
4    9301.65       1154.39            -1860.15 2015-03-06      Weekday       0
5   15159.48       1154.39            -1860.15 2015-03-07      Weekend       1
6   10097.27       1154.39            -1860.15 2015-03-08      Weekend       1
7   15163.94       1154.39            -1860.15 2015-03-09      Weekday       0
8   18008.48       1154.39            -1860.15 2015-03-10      Weekday       0
9   15260.70       1154.39            -1860.15 2015-03-11      Weekday       0
10  10892.19       1154.39            -1860.15 2015-03-12      Weekday       0

These coefficients will allow us to create an estimate of the daily step counts for each day.

Standard Deviation of Residual Errors: Sigma Hat

In order to simulate draws from the posterior distribution implied for each day, we need the estimated step counts per day (also known as y hat, which we’ll calculate using the coefficient dataframe above), and the sigma hat value (e.g. the standard deviation of the residual error from our estimated step counts) from the model. As Gelman and Hill explain in their book, we can simply extract the standard deviation using the sigma.hat command (this function is part of their arm package, which we loaded above). We’ll store the sigma hat value in a variable called sigma_y_hat. In this analysis, the standard deviation of the residual error of our estimated step counts is 2,913.09.

# extract the standard deviation of the residual error of predicted step counts
# store in a variable called "sigma_y_hat"
sigma_y_hat <- sigma.hat(lme_model)$sigma$data
# 2913.093

Step 3: Sample 50 days and plot

For this exercise, we’ll randomly select 50 days for our pairwise comparisons. We’ll then plot the model estimated step count for the selected days. Here, I calculate the estimated step counts at 6 PM (using the formula and coefficients from the model) directly in the ggplot2 code.

# sample 50 random days
sampled_dates <- coefficients_lme[sample(nrow(coefficients_lme), 50), ]

# visualize estimated step count
# for each date
ggplot(sampled_dates, aes(x = date, y = intercept + (week_weekendWeekend*weekend),
color = week_weekend)) +
# point plot
geom_point() +
# set the title
labs(title = "Model Estimated Step Count at 6 PM") +
# set the y axis title
ylab('Estimated Number of Steps at 6 PM') +
# turn off x axis title, make dates on x axis
# horizontal
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 90, vjust = .5)) +
# set the legend title
scale_colour_discrete(name ="Week/Weekend")

Our sampled dates dataframe looks like this (first 10 rows shown):

     intercept hour_centered week_weekendWeekend       date week_weekend weekend
296    3985.94       1154.39            -1860.15 2015-12-25      Weekday       0
414   16165.61       1154.39            -1860.15 2016-04-21      Weekday       0
637    11963.4       1154.39            -1860.15 2016-11-30      Weekday       0
1009  10950.75       1154.39            -1860.15 2017-12-07      Weekday       0
224   14669.36       1154.39            -1860.15 2015-10-14      Weekday       0
996   13273.51       1154.39            -1860.15 2017-11-24      Weekday       0
1046  17883.73       1154.39            -1860.15 2018-01-13      Weekend       1
731      10716       1154.39            -1860.15 2017-03-04      Weekend       1
696   16334.55       1154.39            -1860.15 2017-01-28      Weekend       1
69    15280.03       1154.39            -1860.15 2015-05-12      Weekday       0

And our plot looks like this:

We can see that ‘2015-12-25’ (Christmas 2015) was a day on which I walked very few steps. The highest step count is on ‘2016-08-08’, while the second-highest date is ‘2016-04-23’.

Note that the samples dataframe shown above is no longer in chronological order. However, ggplot2 is smart enough to understand that the x-axis is a date, and orders the values in the plot accordingly.

Step 4: Simulate Posterior Distributions of Step Counts Per Day

We now need to simulate posterior distributions of step counts per day. The method below is, as far as I can understand it, that applied in the Gelman et al. (2012) paper. I have adapted the logic from this paper and the Gelman and Hill book (particularly Chapter 12).

What are we doing?

Our model estimates the overall step counts as a function of our predictor variables. However, the model is not perfect, and the predicted step counts do not perfectly match the observed step counts. This difference between the observed and the predicted step counts for each day is called “residual error” – it’s what your model misses.
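In miniature, with made-up step counts: the residual error is simply observed minus predicted, and its standard deviation is exactly the quantity summarized by the sigma hat value we extracted above. (The numbers below are illustrative, not from the actual model.)

```r
# toy observed and model-predicted daily step counts
# (the numbers are made up for illustration)
observed  <- c(10200, 8500, 12100, 9400)
predicted <- c(9800, 9100, 11500, 9900)

residuals <- observed - predicted  # what the model misses, day by day
residuals
# 400 -600 600 -500
sd(residuals)  # a small-sample analogue of sigma hat
```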

We will make 1000 simulations for each day with step count values that we did not actually see, but according to the model estimate and the uncertainty of that estimate (e.g. sigma hat), we could have seen. To do this, we simply calculate the model-predicted step count for each day, and use this value along with the sigma hat value (the standard deviation of our residual error) to sample 1000 values from each day’s implied posterior distribution.

How do we do it?

I wrote a simple function to do this. It takes our sampled data frame and, for each row (date) in the dataset, calculates the estimated step count using the regression formula and parameter estimates, then passes this value, along with the sigma hat value, to the rnorm function, drawing 1000 samples from the implied posterior distribution for that day.

Note that I’m not including the hour_centered variable in this function. This is because I’m choosing to estimate the step count when hour_centered is zero (i.e. 6 PM, because we centered on this value above).

# this function calculates the estimated step count
# and then draws 1000 samples from the model-implied
# posterior distribution
lme_create_samples <- function(intercept, w_wk_coef, w_wk_dum){
  intercept_f <- intercept + (w_wk_coef * w_wk_dum)
  y_tilde <- rnorm(1000, intercept_f, sigma_y_hat)
  return(y_tilde)
}

# here we apply the function to our sampled dates
# and store the results in a matrix called
# "posterior_samples"
posterior_samples <- mapply(lme_create_samples,
                            sampled_dates$intercept,
                            sampled_dates$week_weekendWeekend,
                            sampled_dates$weekend)
dim(posterior_samples)
# [1] 1000 50

Our resulting posterior_samples matrix has 1 column for each day, with 1000 samples for each day contained in the rows. Below I show a small subset of the entire matrix – just the first 10 rows for the first 5 columns. It is interesting to note that the first column, representing the first day (‘2015-12-25’) in our dataframe of samples above, has smaller values than the second column (representing the second day, ‘2016-04-21’). If we look at the intercepts in the samples dataset, we can see that the first day is much lower than the second. This difference in our model-estimated intercepts is propagated to the simulations of daily step counts.

 3822.43 12517.38 10611.98  7940.35 15046.33
 3532.09 13224.83 10720.97  5916.09 15879.66
  -298.5 18354.48  7002.13  7571.94 18489.92
 2593.04 12354.25  8929.43  6891.52  9846.13
 5203.44 17702.38  14202.8  8032.03 13051.09
  7943.9 14611.36  9792.22 14878.74  9892.48
 3686.51  15005.1 10546.27  5777.57 15511.05
 5115.26 13865.52 10934.97 12029.68 21345.46
  3829.2 15495.18 11781.92 11072.98 17209.84
  -25.57 18720.93 16681.49 10564.03 13591.19

Step 5: Compare the Simulations For Each Day

We can see in the above matrix of simulations that the first day (‘2015-12-25’, contained in the first column) has lower values than the second day (‘2016-04-21’, contained in the second column). In order to formally compare the two columns, we will compare the simulations and see how many times the values in the first column are larger than those in the second column. If values in one column are larger more than 95% of the time, we will say that there is a meaningful (“significant”) difference in step counts between the two dates.
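The decision rule itself is tiny; here is a sketch with two toy simulated days (the means and standard deviation below are illustrative, not taken from the model):

```r
set.seed(123)
# 1000 posterior draws for two hypothetical days
day_a <- rnorm(1000, mean = 4000,  sd = 2913)  # a low-step day
day_b <- rnorm(1000, mean = 16000, sd = 2913)  # a high-step day

# fraction of draws in which day_b exceeds day_a
mean(day_b > day_a) > 0.95
# TRUE: day_b's step count is "significantly" larger
```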

We will apply this logic to all of the possible pairwise comparisons in our sampled 50 dates. The following code accomplishes this (adapted from this tremendous Stack Overflow question and answer):

# do the pairwise comparisons
# first construct an empty matrix
# to contain results of comparisons
comparison_matrix <- matrix(nrow = ncol(posterior_samples),
                            ncol = ncol(posterior_samples))
# give column and row names for the matrix
# (these are our observed dates)
colnames(comparison_matrix) <- sampled_dates$date
rownames(comparison_matrix) <- sampled_dates$date
# loop over the columns of the matrix
# and count the number of times the values
# in each column are greater than the values
# in the other columns
for(col in 1:ncol(posterior_samples)){
  comparisons_counts <- colSums(posterior_samples[, col] > posterior_samples)
  comparisons_counts[col] <- 0 # Set the same column comparison to zero.
  comparison_matrix[, col] <- comparisons_counts
}

# shape of output comparison matrix
dim(comparison_matrix)
# [1] 50 50

The first 10 rows of the first 5 columns of our comparison matrix look like this:

           2015-12-25 2016-04-21 2016-11-30 2017-12-07 2015-10-14
2015-12-25          0        998        970        954        995
2016-04-21          2          0        173        103        360
2016-11-30         30        827          0        395        722
2017-12-07         46        897        605          0        822
2015-10-14          5        640        278        178          0
2017-11-24         13        755        384        299        656
2018-01-13          3        514        173        114        379
2017-03-04        114        962        786        698        915
2017-01-28          7        660        273        223        529
2015-05-12          3        569        223        152        436

The counts in each cell indicate the number of times (out of 1000) that the column value is greater than the row value (we set the diagonals – comparisons of a day with itself – to zero in the above code). The comparison we identified above (‘2015-12-25’ and ‘2016-04-21’) is shown in the first row, second column (and the second row, first column – this matrix contains the same information but in reverse above and below the diagonals). The samples from ‘2016-04-21’ were greater than those from ‘2015-12-25’ 998 times out of 1000!

Step 6: Make the Pairwise Comparison Plot

We can make a heat map using ggplot2 to show which pairwise comparisons are “significantly” different from one another, akin to that used in Gelman et al. (2007).

Because ggplot2 requires data to be in a tidy format, we’ll have to melt the comparison matrix so that it has 1 row per pairwise comparison. The code to do this was based on these very clear explanations of how to make a heatmap with ggplot2:

# load the reshape2 package
# for melting our data
library(reshape2)
# melt the data
melted_cormat <- melt(comparison_matrix)
# rename the variables
colnames(melted_cormat) <- c('x', 'y', 'count')
# identify which comparisons are "significant"
melted_cormat$meaningful_diff = factor(melted_cormat$count>950)
# and set the levels
levels(melted_cormat$meaningful_diff) = c('No Difference',
                                          'Row > Column')
# sort the matrix by the dates
# first turn the date columns
# into date types in R using
# the lubridate package
library(lubridate)
melted_cormat$x <- ymd(melted_cormat$x)
melted_cormat$y <- ymd(melted_cormat$y)
# then arrange the dataset by x and y dates
melted_cormat %>% arrange(x,y) -> melted_cormat

Our final formatted matrix for plotting, called melted_cormat, looks like this (first 10 rows shown):

            x          y count meaningful_diff
1  2015-03-18 2015-03-18     0   No Difference
2  2015-03-18 2015-03-29   222   No Difference
3  2015-03-18 2015-05-12   737   No Difference
4  2015-03-18 2015-06-29   222   No Difference
5  2015-03-18 2015-07-19   515   No Difference
6  2015-03-18 2015-09-15   768   No Difference
7  2015-03-18 2015-09-22   884   No Difference
8  2015-03-18 2015-10-14   683   No Difference
9  2015-03-18 2015-10-18   837   No Difference
10 2015-03-18 2015-10-22   728   No Difference

The count variable here gives the number of times that the y date simulations were greater than the x date simulations. So the second row above indicates that the simulations on ‘2015-03-29’ were greater than those for ‘2015-03-18’ a total of 222 times.

We can make the heatmap with the following code:

# make the heatmap
ggplot(melted_cormat, aes(as.factor(x), as.factor(y),
                          fill = meaningful_diff)) +
  # tile plot
  geom_tile() +
  # remove x and y axis titles
  # rotate x axis dates 90 degrees
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.x = element_text(angle = 90,
                                   vjust = .5)) +
  # choose the colors for the plot
  # and specify the legend title
  scale_fill_manual(name = "Comparison",
                    values=c("#999999", "#800000")) +
  # add a title to the plot
  labs(title = "Pairwise Comparisons of Model-Estimated Step Counts at 6 PM")

Which gives us the following plot:

How can we read this plot? All of our 50 dates are displayed on the x and y axes. When the row step count is meaningfully larger than the column step count (according to our method, e.g. if more than 950 of the simulations for the row date are greater than those for the column date), the cell is colored in maroon. Otherwise, cells are colored in grey.

Rows that have a lot of maroon values are days that have higher step counts. To take a concrete example, the row representing ‘2016-08-08’ has many red values. If we look at the estimated step count for that day on the first graph above, we can see that it has the highest predicted step count in the sampled dates – over 25,000! It therefore makes sense that the estimated step count on this day is “significantly” larger than those from most of the other dates.

Columns that have many red values are days that have especially low step counts. For example, the column representing ‘2015-12-25’ is mostly red. As we saw above, this date has the lowest predicted step count in our sample – under 5,000! It is no wonder then that this date has a “significantly” lower step count than most of the other days.

How Bayesian Am I?

I’m a newbie to Bayesian thinking, but I get the sense that Bayesian statistics comes in many different flavors. The approach above strikes me as being a little bit, but not completely, Bayesian.

That’s So Bayesian

This approach is Bayesian in that it uses posterior distributions of daily step counts to make the comparisons between days. The approach recognizes that observed daily step counts are just observations from a larger posterior distribution of possible step counts, and we make explicit use of this posterior distribution in the data analysis. In contrast, frequentist methods basically only make use of point estimates of coefficients and the standard errors of those estimates to draw conclusions from data.

Not So Fast!

There are some things about this perspective that aren’t so Bayesian. First, we use flat priors in the model; a “more Bayesian” approach would assign prior distributions to the intercept and the coefficients, which would then be updated during the model computation. Second, we use the point estimates of our coefficients to compute the estimated daily step counts. A “more Bayesian” approach would recognize that the coefficient estimates also have posterior distributions, and would incorporate this uncertainty in the simulations of daily step counts.

What Do I Know?

This is my current best understanding of the differences between Bayesian and frequentist perspectives as they apply to what we’ve done here. I’m reading Statistical Rethinking (and watching the accompanying course videos) by Richard McElreath (love it), and these differences are what I’ve understood from the book and videos thus far.

The method used in this blog post is a nice compromise for someone comfortable with frequentist use of multi-level models (like myself). It requires a little more work than just interpreting standard multilevel model output, but it’s not a tremendous stretch. Indeed, the more “exotic” procedures (for a Bayesian newbie) like assigning priors for the model parameters are not necessary here.

Summary and Conclusion

In this post, we used multilevel modeling to construct pairwise comparisons of estimated step counts for 50 different days. We computed the multilevel model, extracted the coefficients and sigma hat value, and computed 1000 simulations from each day’s posterior distribution. We then conducted pairwise comparisons for each day’s simulations, concluding that a day’s step count was “significantly” larger if 950 of the 1000 simulations were greater than the comparison day. We used a heat map to visualize the pairwise comparisons; this heat map highlighted days with particularly high and low step counts as being “significantly” different from most of the other days. This allowed us to conduct these comparisons without the expensive adjustments that come with lowering our significance threshold for multiple comparisons!
Coming Up Next

In the next post, we will move away from classical statistics and talk about machine learning techniques. Specifically, we will use a type of deep learning model to automatically generate band names.

Stay tuned!


* Please correct me if I’m wrong!


To leave a comment for the author, please follow the link and comment on their blog: Method Matters. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Data Analysis with Python Course: How to read, wrangle, and analyze data

Wed, 10/31/2018 - 02:12

(This article was first published on Open Source Automation, and kindly contributed to R-bloggers)


For anyone in the NYC area, I will be holding an in-person training session December 3rd on doing data analysis with Python. We will be covering the pandas, pyodbc, and matplotlib packages. Please register at Eventbrite here:


Learn how to apply Python to read, wrangle, visualize, and analyze data!  This course provides a hands-on session where we’ll walk through a prepared curriculum on doing data analysis with Python.  All code and practice exercises during the session will be made available after the course is complete.  


About the course

During this hands-on class, you will learn the fundamentals of doing data analysis in Python, the powerful pandas package, and pyodbc for connecting to databases.

We will walk through using Python to analyze and answer key questions on sales data, as well as utilizing historical stock price data to learn how to work with time series in Python. We will also cover how to create data visualizations with matplotlib.


What you’ll learn

  • How to import data from multiple sources (Excel, HTML, databases)
  • How to construct DataFrames and Series
  • How to merge datasets
  • The apply and map methods for performing column calculations
  • Rearranging and pivoting data
  • Adding and deleting columns
  • Subsetting and sorting
  • Working with time series data
  • Handling missing and duplicate values
  • Creating visualizations with matplotlib


Who is this for?

Anyone looking to learn how to analyze data with Python or increase their Python skills!



The post Data Analysis with Python Course: How to read, wrangle, and analyze data appeared first on Open Source Automation.


To leave a comment for the author, please follow the link and comment on their blog: Open Source Automation.

Le Monde puzzle [#1072]

Wed, 10/31/2018 - 00:18

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

The penultimate Le Monde mathematical puzzle  competition problem is once again anti-climactic and not worth its points:

For the figure below [not the original one!], describing two (blue) half-circles intersecting with a square of side 20cm, and a (orange) quarter of a circle with radius 20cm, find the radii of both gold circles, respectively tangent to both half-circles and to the square and going through the three intersections.

Although the problem was easily solvable by some basic geometry arguments, I decided to use them a minima and resort to a mostly brute-force exploration based on a discretisation of the 20×20 square, from which I first deduced the radius of the tangent circle by imposing (a) a centre O on the diagonal (b) a distance from O to one (half-)circle equal to the distance to the upper side of the square, leading to a radius of 5.36 (and a centre O at a distance 20.70 from the bottom left corner):

diaz=sqrt(2)*20
for (diz in seq(5/8,7/8,le=1e4)*diaz){ #position of O
  radi=sqrt(diz^2/2+(diz/sqrt(2)-10)^2)-10
  if (abs(20-(diz/sqrt(2))-radi)<3e-3){
    print(c(radi,diz));break()}}

In the case of the second circle I first deduced the intersections of the different circles by sheer comparison of black (blue!) pixels, namely A(4,8), B(8,4), and C(10,10), then looked at the position of the centre O’ of the circle such that the distances to A, B, and C were all equal:

for (diz in seq(20*sqrt(2)-20,10*sqrt(2),le=1e4)){
  radi=10*sqrt(2)-diz
  distA=sqrt((diz/sqrt(2)-4)^2+(diz/sqrt(2)-8)^2)
  if (abs(distA-radi)<4e-4){
    print(c(radi,diz));break()}}

even though Heron’s formula was enough to produce the radius. In both approaches, this radius is 3.54, with the position of the centre O’ at 10.6 from the lower left corner. When plotting these results, which showed consistency at this level of precision, I was reminded of the importance of plotting the points with the option “asp=1” in plot(), as otherwise R does not plot the circles at the right location!


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

Use Pseudo-Aggregators to Add Safety Checks to Your Data-Wrangling Workflow

Tue, 10/30/2018 - 19:47

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

One of the concepts we teach in both Practical Data Science with R and in our theory of data shaping is the importance of identifying the roles of columns in your data.

For example, to think in terms of multi-row records it helps to identify:

  • Which columns are keys (together identify rows or records).
  • Which columns are data/payload (are considered free varying data).
  • Which columns are "derived" (functions of the keys).

In this note we will show how to use some of these ideas to write safer data-wrangling code.

In mathematics the statement "y is a function of x" merely means there is an ideal lookup table with which if you knew the value of x, then you could in principle know the value of y.

For example in the following R data.frame: y is a function of x, as all rows that have the same value for x also have the same value for y.

d1 <- data.frame(
  x = c("a", "a", "b", "b", "c"),
  y = c(10, 10, 20, 20, 20),
  z = c(1, 2, 1, 1, 1),
  stringsAsFactors = FALSE)
print(d1)
##   x  y z
## 1 a 10 1
## 2 a 10 2
## 3 b 20 1
## 4 b 20 1
## 5 c 20 1

Notice if we know the value of x we then, in principle know the value of y. In the same example z is not a function of x, as it does not have this property.

A more concrete example would be: user-name is a function of user-ID. If you know an individual’s user-ID, then you also (if you have the right lookup table) know the individual’s user-name.
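The "y is a function of x" property can be checked mechanically. A minimal base-R sketch (the helper name below is my own invention, not part of wrapr):

```r
# TRUE when every group of rows sharing an x value carries exactly
# one distinct y value, i.e. y is a function of x
is_function_of <- function(x, y) {
  all(tapply(y, x, function(v) length(unique(v)) == 1))
}

d1 <- data.frame(
  x = c("a", "a", "b", "b", "c"),
  y = c(10, 10, 20, 20, 20),
  z = c(1, 2, 1, 1, 1),
  stringsAsFactors = FALSE)

is_function_of(d1$x, d1$y)  # TRUE: y is a function of x
is_function_of(d1$x, d1$z)  # FALSE: x == "a" maps to both z = 1 and z = 2
```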

We first taught these concepts in the context of SQL and SQL grouped aggregation. In SQL once you aggregate on one column, then all other columns in your query must either be the grouping columns, or also aggregated. This is easiest to show in code, but we will use the dplyr package for our example.

library("dplyr")
d1 %>%
  group_by(x) %>%
  summarize(y = max(y)) %>%
  ungroup()
## # A tibble: 3 x 2
##   x         y
##
## 1 a        10
## 2 b        20
## 3 c        20

Notice that only grouping columns and columns passed through an aggregating calculation (such as max()) are passed through (the column z is not in the result). Because y is a function of x, no substantial aggregation is going on; we call this situation a "pseudo aggregation", and we have taught this before.

Our wrapr package now supplies a special case pseudo-aggregator (or in a mathematical sense: projection): psagg(). It works as follows.

library("wrapr")
d1 %>%
  group_by(x) %>%
  summarize(y = psagg(y)) %>%
  ungroup()
## # A tibble: 3 x 2
##   x         y
##
## 1 a        10
## 2 b        20
## 3 c        20

psagg() pretty much worked the same as the earlier max(). However, it documents our belief that y is a function of x (that nothing interesting is going on for this column during aggregation). Where psagg() differs is if our assumption that y is a function of x is violated.

d2 <- data.frame(
  x = c("a", "a", "b", "b", "c"),
  y = c(10, 10, 20, 23, 20),
  stringsAsFactors = FALSE)
print(d2)
##   x  y
## 1 a 10
## 2 a 10
## 3 b 20
## 4 b 23
## 5 c 20

d2 %>%
  group_by(x) %>%
  summarize(y = psagg(y)) %>%
  ungroup()
## Error in summarise_impl(.data, dots): Evaluation error: wrapr::psagg argument values are varying.

The code caught that our assumption was false and raised an error. This sort of checking can save a lot of time and prevent erroneous results.

And that is part of what we teach:

  • document intent
  • check assumptions
  • double-check results
  • use composable small helper-tools.

psagg() also works well with data.table.

library("data.table")
as.data.table(d1)[ , .(y = psagg(y)), by = "x"]
##    x  y
## 1: a 10
## 2: b 20
## 3: c 20
as.data.table(d2)[ , .(y = psagg(y)), by = "x"]
## Error in psagg(y): wrapr::psagg argument values are varying

Of course we don’t strictly need psagg(), as we could insert checks by hand (though this would become burdensome if we had many derived columns).

as.data.table(d1)[ , .(y = max(y), y_was_const = min(y)==max(y)), by = "x"]
##    x  y y_was_const
## 1: a 10        TRUE
## 2: b 20        TRUE
## 3: c 20        TRUE
as.data.table(d2)[ , .(y = max(y), y_was_const = min(y)==max(y)), by = "x"]
##    x  y y_was_const
## 1: a 10        TRUE
## 2: b 23       FALSE
## 3: c 20        TRUE

Unfortunately, this sort of checking does not currently work for dplyr.

packageVersion("dplyr")
## [1] '0.7.7'
d1 %>%
  group_by(x) %>%
  summarize(y = max(y), y_was_const = min(y)==max(y)) %>%
  ungroup()
## # A tibble: 3 x 3
##   x         y y_was_const
##
## 1 a        10        TRUE
## 2 b        20        TRUE
## 3 c        20        TRUE
d2 %>%
  group_by(x) %>%
  summarize(y = max(y), y_was_const = min(y)==max(y)) %>%
  ungroup()
## # A tibble: 3 x 3
##   x         y y_was_const
##
## 1 a        10        TRUE
## 2 b        23        TRUE
## 3 c        20        TRUE

Notice the per-group variation in y was not detected. This appears to be a dplyr un-caught result corruption issue.

If we don’t attempt to re-use the variable name we get the correct result.

d2 %>%
  group_by(x) %>%
  summarize(y_for_group = max(y),
            y_was_const = min(y)==max(y)) %>%
  ungroup()
## # A tibble: 3 x 3
##   x     y_for_group y_was_const
##
## 1 a              10        TRUE
## 2 b              23       FALSE
## 3 c              20        TRUE

However, being forced to rename variables during aggregation is a needless user burden.

Our seplyr package (a package that is a very thin adapter on top of dplyr) does issue a warning in the failing situation.

library("seplyr")
d2 %>%
  group_by_se("x") %>%
  summarize_nse(y = max(y),
                y_was_const = min(y)==max(y)) %>%
  ungroup()
## Warning in summarize_se(res, summarizeTerms, warn = summarize_nse_warn, :
## seplyr::summarize_se possibly confusing column name re-use c('y' =
## 'max(y)', 'y_was_const' = 'min(y) == max(y)')
## # A tibble: 3 x 3
##   x         y y_was_const
##
## 1 a        10        TRUE
## 2 b        23        TRUE
## 3 c        20        TRUE

The result is still wrong, but at least in this situation the user has a better chance at noticing and working around the issue.

The problem may not be common (it may or may not be in any of your code, or code you use), and is of course easy to avoid (once you know the nature of the issue).


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers weekly – most loved R posts from last week (2018-10-21 till 2018-10-27)

Tue, 10/30/2018 - 17:48

What are the best R posts to read from last week? To help you decide, here is a list of the top R posts, sorted based on the number of likes they got on twitter, enjoy :



Data + Art STEAM Project: Final Results

Tue, 10/30/2018 - 14:28

(This article was first published on Little Miss Data, and kindly contributed to R-bloggers)

Look closely at the images above. Do you notice anything? If you zoom in close enough, you will notice that the pictures are composed entirely of small graphs and images. These images are part of the final results for my Data + Art STEAM project with Forest Hill Junior and Senior Public School.

I have to admit that it took me much longer than I had hoped to get this project all packaged up and finalized. A lot of that had to do with being busy, and a lot had to do with the desire to keep analyzing the data set and then being overwhelmed by the amount of analysis that the kids and I ended up producing!

As a reminder, STEAM stands for science, technology, engineering, art and math. I worked on this project with my good friend Katherine Lafranier and her gifted grade 6 students at Forest Hill Junior and Senior Public School. She had just taken her class through a unit on data management and we used this project to teach them how to collect data through surveys and analyze it with Google Sheets. For our research project the students picked the subject: “Self-analysis: Exploring student lifestyles and how it relates to their self-perception”. The final survey format is available here

We created the survey together and then the students set off to survey their school peers. We consolidated their collected data and they each analyzed it with the skills they were taught in class. They were then to review their analysis and submit a range of images to show their findings. The images should include both graphs and pictures which were inspired by their findings. That takes care of the science, technology and math portion of STEAM. For the art portion, all images were bundled up into a large mosaic to produce a higher level graphic relating to students feelings and self perception. I’m really excited that the students produced a large enough volume of analysis to produce such cool mosaics above!

Below I’m going to take you through some of the highlights of the analysis. Please keep in mind that this review only covers a fraction of the analysis performed and that there are some known data quality issues associated with manual survey collection.


Hours of the Day

The first and easiest thing to look at was the relationship between numerical variables. I had a look to see if there was any relationship between a students time spent on their screens, sleeping, being physically active and doing homework.


I began by producing a quick correlation matrix. The thing that immediately jumped out at me was the inverse relationship between sleep time and screen time. This chart implies that the more time a student spends on the screen, the less time they sleep and vice versa.

Of course I wanted to jump at this finding, but it’s important to ensure that there is nothing else going on here. I needed to look at the other factors involved in this type of association. I visualized a variety of other relationships through scatterplots.




If you look at the top left image above, we see the continued trend that as screen time goes up, sleep time goes down. However, we begin to see some of the cracks in the data set. There are a few folks who have listed 2-4 hours of sleep, and I start to think there are some erroneous data points. I then have a look at sleep time and screen time by grade to see if grade is a confounding factor. If you look at the graphs above you will see that screen time goes up by grade and sleep time goes down by grade. Therefore, without the grade level being isolated, it can appear that sleep time goes down when screen time goes up. I tried to get a visual of the relationship between these variables with grade held steady by plotting sleep time vs. screen time by grade. Though it still looks like there is an inverse relationship, the data is really too small to make conclusions from.



Just for fun, I decided to make a little animation which shows how the spread of sleep time changes from grade to grade.

The animation uses a boxplot graph. Box plots use a box to represent the middle 50% of the data (the 25th to 75th percentiles). The middle bolded line shows the median value. The lines (often called whiskers) stretch toward the lowest and highest non-outlying values, and the little dots are outliers.

In this graph you can see that the distribution of screen time typically gets higher as the grades get higher.
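For anyone who wants to reproduce this kind of summary, base R's boxplot() will compute the same five-number statistics without drawing anything; a small sketch with made-up screen-time values:

```r
# Hypothetical hours of screen time; 12 is deliberately an outlier
screen <- c(rep(1:5, 3), 12)

# plot = FALSE returns the statistics without drawing the chart
b <- boxplot(screen, plot = FALSE)

b$stats[3, 1]  # the bold middle line: the median
b$stats[2, 1]  # lower edge of the box (~25th percentile)
b$stats[4, 1]  # upper edge of the box (~75th percentile)
b$out          # points beyond the whiskers, drawn as outlier dots
```

The same statistics drive each frame of an animated box plot; only the subset of data (here, the grade) changes.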


Gender and Self Perception




Since our data set collected both gender and self-description, I wanted to see if the students saw themselves differently by gender. Our students interviewed 92 females, 91 males and two students who don't identify as either.

Self Perception

To get started, I created a fun little word cloud of the words the students used to describe themselves broken down by gender. The female responses are in pink and male in blue. At first blush it looks like most students use positive terms to describe themselves!


Favorite subjects

I then wanted to see if there was any difference between males and females in the sentiment (positive/negative) of the words used for self-description. I used the AFINN lexicon included in the tidytext package to give a sentiment ranking to the words used for self-description. The AFINN lexicon scores words from -5 (most negative) to +5 (most positive); for example, a very negative word like agonizing has a ranking of -3 and a very positive word like affectionate has a ranking of +3. I was very pleasantly surprised that, out of the 54 unique descriptive words used across males and females, the spread of sentiment used for self-description was almost identical! The median score was +2 for both males and females. This is excellent because it means that not only are students describing themselves in a positive tone, the sentiment tones are also roughly equal across genders!
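The mechanics of the scoring are straightforward: join each response word against the lexicon and summarise the scores by gender. A minimal base R sketch with a tiny made-up lexicon and made-up responses (the real analysis would join against tidytext's get_sentiments("afinn") instead):

```r
# Tiny AFINN-style lexicon (scores are illustrative, not the real AFINN values)
lexicon <- data.frame(
  word  = c("awesome", "kind", "funny", "lazy", "mean"),
  score = c(3, 2, 2, -1, -2)
)

# Hypothetical self-description words by gender
responses <- data.frame(
  gender = c("F", "F", "M", "M", "M"),
  word   = c("awesome", "funny", "kind", "funny", "lazy")
)

# Attach a sentiment score to each word that appears in the lexicon
scored <- merge(responses, lexicon, by = "word")

# Median sentiment by gender
tapply(scored$score, scored$gender, median)
```

Words not found in the lexicon simply drop out of the join, which is also what happens with the real AFINN lookup.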


It’s written in the Stars?

Now, I didn’t focus too much of my own efforts on the horoscope connection. But, boy did the students do a great job!

As can be seen in this quad chart, the students analyzed a variety of distributions by horoscope sign. They then looked into how their star sign affects a variety of factors. They analyzed everything from time spent as shown in the graphs below to personality traits like introvert/extrovert and optimist/pessimist. I was very impressed by the variety of angles they explored!

I also loved how a few students were inspired to make an image showing their takeaways from the analysis. The quote "I'm one of seven female Virgos who took the survey" shows that she considered how she personally fit into the broader survey population!


When I grow up

Another aspect that I loved about this survey was seeing what all of the kids wanted to be when they grew up. I looked at this data through a variety of lenses. One of the most interesting views to me was the career breakdown by gender. Doctor, lawyer, scientist and engineer are popular choices for both genders. Athletic careers tended to be more prevalent among the boys' choices. Both genders had a healthy dose of students still unsure of their career path, and that is okay! I loved that a student even submitted an image with the finding that it's totally normal not to know your dream job yet!


Play Time

I didn’t dive too deep into how the students spent their play time. But, because I love a nice little image word cloud created in R, I had to make this one.

Of course TV is a popular way to spend time, but I noticed that these are some well-rounded kids! We can see baking, piano, tennis, Lego, origami and guitar listed among the hobbies.


Introvert vs Extrovert

This is another area that the students really dove into. Like the star sign analysis, the students analyzed how your introvert/extrovert tendencies can impact a variety of factors or how your personality factors could impact these tendencies.

The students also put together a number of killer images with hilarious cartoons about what it's like to be an introvert. Of course I LOVED these. It's extra touches like this that can make finding insights more fun.



This is just the tip of the iceberg when it comes to the work that the students and I did to analyze the data set. Please feel free to have a look at my GitHub repo for the full set of images and code.

Thanks for reading along, and please share your thoughts and creations with me on Twitter.

Note that the full code is available on my GitHub repo. If you have trouble downloading the file from GitHub, go to the main page of the repo and select “Clone or Download” and then “Download Zip”.


To leave a comment for the author, please follow the link and comment on their blog: Little Miss Data. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Site Migration

Tue, 10/30/2018 - 08:00

(This article was first published on Marcelo S. Perlin, and kindly contributed to R-bloggers)

It's been almost two years since I first hosted this blog as an alternative outlet for my writing. Everything is currently built using Jekyll and hosted on GitHub. It works fine, but I want to have more control over the deployment of the site, using my own server and domain.

Also, my new blog is now based on Hugo, using the blogdown package and the Academic template. This format offers far more functionality and it is great! You can check out the new site at

I'll still keep this site up for a couple of months, but it won't be updated any more. All other material, such as the online books, is also going to be migrated to the new domain.


To leave a comment for the author, please follow the link and comment on their blog: Marcelo S. Perlin.

Are petrol prices in Australia fair?

Tue, 10/30/2018 - 07:11

(This article was first published on R – Alex Levashov – eCommerce Consultant (Melbourne, Australia), and kindly contributed to R-bloggers)

Petrol is a product used by most Australians, so people are pretty sensitive to price changes, especially when fuel becomes more expensive. With prices reaching $1.60 per litre for unleaded, the debate is getting more and more heated.

Are greedy petrol traders ripping off Australians, or are the price changes fair and primarily driven by factors such as the FX rate and the crude oil price? The official point of view from the ACCC is that prices are mostly determined by international factors.

Let's check whether that is true using R and some simple data analysis and visualization techniques. We'll check how the retail petrol price correlates with the two major international factors that affect it: the crude oil price and the Australian dollar exchange rate.

Get the data

First we need to collect the data. We need the following pieces of data:

  • Crude oil price, usually quoted in USD per barrel
  • AUD to USD foreign exchange rate
  • Retail petrol prices in Australia

I was able to get the first two pieces of data quite quickly: they are available from FRED back to the 1980s in an easy-to-use CSV format. I just had to add a couple of final rows with the most recent data, which I did manually.

Getting Aussie retail petrol prices was much harder. There is plenty of data available on current prices, but getting monthly historical data turned out, to my surprise, to be a challenge. The best I was able to find was data from a Western Australian government agency starting from the early 2000s. Taking into account that WA has the shortest petrol price cycle, it is probably good enough.

After a bit of data wrangling I produced a data frame with the following columns:

  • Date (month and year)
  • Crude oil price in AUD per barrel
  • Petrol price in AUD per litre

Having the oil price in AUD is important to account for FX rate fluctuations. You can get the full CSV file from my GitHub.
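For reference, converting the FRED series (oil in USD per barrel, plus the AUD/USD rate) into AUD per barrel is a single division; a sketch with made-up numbers:

```r
# Hypothetical monthly figures
oil_usd <- c(70.0, 75.0, 80.0)   # crude oil, USD per barrel
aud_usd <- c(0.77, 0.75, 0.71)   # AUD/USD exchange rate (USD per 1 AUD)

# Oil price in AUD per barrel: divide the USD price by USD-per-AUD
oilaud <- oil_usd / aud_usd
round(oilaud, 2)
```

A weaker Australian dollar (a lower AUD/USD rate) pushes the AUD oil price up even when the USD price is flat, which is exactly the FX effect the analysis wants to capture.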

Let's start making charts. I used the ggplot2 R package for this job.

Crude oil price

The piece of code below produces a chart of the monthly crude oil price in Australian dollars since 2001.

df1 %>% ggplot(aes(date, oilaud)) +
  geom_line() +  # draw the monthly series as a line
  xlab("Date") + ylab("Price per barrel, $AUD") +
  ggtitle("Crude oil price")

Crude oil price, AUD

The dplyr package is loaded to provide the pipe (%>%) syntax.

Petrol retail prices

Next, let's chart the retail petrol prices; the approach is very similar.

The code:

df1 %>% ggplot(aes(date, petrol_price)) +
  geom_line() +
  xlab("Date") + ylab("Petrol, cents per litre") +
  ggtitle("Retail petrol price")

and the chart:

Retail petrol prices, Australia

The two charts look quite similar, don't they?

Let’s put them together.

Crude oil with petrol

ggplot2 makes it quite easy to put two variables on one chart. Here is the code to do it:

df1 %>% ggplot(aes(date)) +
  geom_line(aes(y = oilaud, colour = "Oil")) +
  geom_line(aes(y = petrol_price, colour = "Petrol")) +
  xlab("Date") + ylab("Prices") +
  ggtitle("Retail petrol price vs crude oil price")

Because we have retail prices in cents per litre and oil prices in dollars per barrel, the numbers happen to be on a similar scale, so we don't need anything more complex, such as a secondary axis, which we would need for vastly different value ranges.

Crude oil and retail petrol prices

It looks like the prices correlate, doesn't it?

Let's make another visualization: a scatter plot.

The code:

df1 %>% ggplot(aes(oilaud, petrol_price, color = factor(year(date)))) +
  geom_point() +  # year() comes from the lubridate package
  labs(title = "Petrol vs crude oil\n", x = "Crude oil, $ per barrel", y = "Petrol, cents per liter", color = "Colors for years\n")

Here we use colours to represent the year of each observation, and a slightly different approach (labs()) to set the titles and legend.

Crude oil and retail petrol prices scatter plot

Visually it seems that the correlation between crude oil price and retail petrol price is quite strong indeed.

Let's calculate the correlation with R's cor() function:

cor(df1$petrol_price, df1$oilaud)

The value returned is 0.9256118, which is a very strong correlation.

So it seems that, overall, petrol vendors are just passing on factors beyond their control (the global oil price and the FX rate), and there is nothing to worry about. All good. Or not?

Have a closer look

You have probably noticed that while overall the dots on the scatter plot form a line, some of them sit well above it. And many of those dots are in the colours of recent years (2014-2018). What does that mean?

Dots above the trend line (imaginary in our case) fall outside the general pattern: for them, the retail price is above the level we would expect if the correlation were perfect.
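One way to make that imaginary trend line concrete is to fit a linear model and inspect the residuals: points with large positive residuals are priced above what the oil price alone would predict. A sketch on made-up numbers (with the real data you would fit lm(petrol_price ~ oilaud, data = df1)):

```r
# Hypothetical oil and petrol prices; the last point is priced "above the line"
oilaud       <- c(60, 80, 100, 120, 85)
petrol_price <- c(110, 125, 140, 155, 150)

fit <- lm(petrol_price ~ oilaud)

# Residual = actual price minus what the trend line predicts
res <- residuals(fit)
res[which.max(res)]  # the observation furthest above the trend line
```

Grouping those residuals by year would show directly whether recent years sit systematically above the line.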

You may also notice that the crude oil price is now quite far from its highest point, while the current retail price is very close to its historical maximum.

The code below filters out the subset of the data where the crude oil price was above AU$100:

over100 <- df1 %>% filter(oilaud > 100)
over100 %>% ggplot(aes(oilaud, petrol_price, color = factor(year(date)))) +
  geom_point() +
  labs(title = "Petrol retail in Australia vs crude oil, when crude oil over AU$100\n", x = "Crude oil, $ per barrel", y = "Petrol, cents per liter", color = "Colors for years\n") +
  geom_hline(yintercept = 157.3) +
  geom_vline(xintercept = 104.29)

It produces the next scatter plot:

Crude oil and retail petrol prices, oil over $100

The vertical and horizontal black lines intersect at the most recent price point (beginning of October 2018).

It is clear that:

  • The current price is historically the second most expensive
  • In the past, with the same crude oil price, the retail price was much lower: very often below $1.40 and always below $1.50
  • A similar retail price only existed when the crude oil price was close to $140 per barrel, over 20% more expensive than now

So it looks like at the moment Australian consumers are being ripped off. But based on history it shouldn't last too long, so hopefully prices will soon return to a fairer level.

All the code and data are available on GitHub.

The post Are petrol prices in Australia fair? appeared first on Alex Levashov – eCommerce Consultant (Melbourne, Australia).


To leave a comment for the author, please follow the link and comment on their blog: R – Alex Levashov – eCommerce Consultant (Melbourne, Australia).

Machine Learning Basics – Random Forest

Tue, 10/30/2018 - 01:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

A few colleagues of mine and I are currently working on developing a free online course about machine learning and deep learning. As part of this course, I am developing a series of videos about machine learning basics; the first video in this series was about Random Forests.

You can find the video on YouTube but as of now, it is only available in German. Same goes for the slides, which are also currently German only.

I did, however, translate my script:

Random Forest (RF) is one of the many machine learning algorithms used for supervised learning, which means learning from labelled data and making predictions based on the learned patterns. RF can be used for both classification and regression tasks.

Decision trees

RF is based on decision trees. In machine learning, decision trees are a technique for creating predictive models. They are called decision trees because the prediction follows several branches of “if… then…” decision splits, similar to the branches of a tree. If we imagine that we start with a sample for which we want to predict a class, we would start at the bottom of the tree and travel up the trunk until we come to the first split-off branch. This split can be thought of as a feature in machine learning; let's say it is “age”. We would now make a decision about which branch to follow: “if our sample has an age greater than 30, continue along the left branch, else continue along the right branch”. We repeat this at every split until there are no more branches before us. This endpoint is called a leaf and in decision trees represents the final result: a predicted class or value.

At each branch, the feature threshold that best splits the (remaining) samples locally is found. The most common metrics for defining the “best split” are Gini impurity and information gain for classification tasks, and variance reduction for regression.
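To make Gini impurity concrete: for a node it is 1 - sum(p_k^2), where p_k is the share of class k in the node, and a candidate split is scored by the weighted impurity of the two children. A minimal sketch:

```r
# Gini impurity of a set of class labels: 1 - sum(p_k^2)
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

gini(c("a", "a", "a", "a"))  # pure node: impurity 0
gini(c("a", "a", "b", "b"))  # 50/50 node: impurity 0.5

# Weighted impurity after a candidate split into left/right children
split_gini <- function(left, right) {
  n <- length(left) + length(right)
  (length(left) / n) * gini(left) + (length(right) / n) * gini(right)
}

# A split that separates the classes perfectly has weighted impurity 0
split_gini(c("a", "a"), c("b", "b"))
```

The tree learner simply tries the candidate thresholds and keeps the one with the lowest weighted child impurity.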

Single decision trees are very easy to visualize and understand because they follow a method of decision-making that is very similar to how we humans make decisions: with a chain of simple rules. However, they are not very robust, i.e. they don’t generalize well to unseen samples. Here is where Random Forests come into play.

Ensemble learning

RF makes predictions by combining the results from many individual decision trees, so we call them a forest of decision trees. Because RF combines multiple models, it falls under the category of ensemble learning. Other ensemble learning methods are gradient boosting and stacked ensembles.

Combining decision trees

There are two main ways of combining the outputs of multiple decision trees into a random forest:

  1. Bagging, which is also called Bootstrap aggregation (used in Random Forests)
  2. Boosting (used in Gradient Boosting Machines)

Bagging works the following way: decision trees are trained on randomly sampled subsets of the data, with sampling being done with replacement. Bagging is the default method used with Random Forests. A big advantage of bagging over individual trees is that it decreases the variance of the model. Individual trees are very prone to overfitting and are very sensitive to noise in the data. As long as our individual trees are not correlated, combining them with bagging will make them more robust without increasing the bias. The part about correlation is important, though! We remove (most of) the correlation by randomly sampling subsets of the data and training the different decision trees on these subsets instead of on the entire dataset. In addition to randomly sampling instances from our data, RF also uses feature bagging: at each split in the decision tree, only a random subset of features is considered. This technique reduces correlation even more because it helps reduce the impact of very strong predictor variables (i.e. features that have a very strong influence on predicting the target or response variable).

Boosting works similarly but with one major difference: the samples are weighted for sampling, so that samples which were predicted incorrectly get a higher weight and are therefore sampled more often. The idea behind this is that difficult cases should be emphasized during learning compared to easy cases. Because of this difference, bagging can be easily parallelized, while boosting has to be performed sequentially.

The final result of our model is calculated by averaging the predictions of all sampled trees (for regression) or by majority vote (for classification).
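The whole bagging-plus-voting loop fits in a few lines of base R. In this sketch each “tree” is replaced by a trivial one-threshold classifier so the example stays self-contained; a real Random Forest would grow a full decision tree on each bootstrap sample (and apply feature bagging at each split):

```r
set.seed(1)

# Toy data: one feature, two classes separated around x = 0
x <- c(-2, -1.5, -1, -0.5, 0.5, 1, 1.5, 2)
y <- c("no", "no", "no", "no", "yes", "yes", "yes", "yes")

# A stand-in for a decision tree: threshold at the training sample's mean,
# predicting the majority class on each side of the threshold
train_stump <- function(xs, ys) {
  threshold      <- mean(xs)
  majority_above <- names(which.max(table(ys[xs > threshold])))
  majority_below <- names(which.max(table(ys[xs <= threshold])))
  function(new_x) ifelse(new_x > threshold, majority_above, majority_below)
}

# Bagging: train each model on a bootstrap sample (drawn with replacement)
n_trees <- 25
models <- lapply(seq_len(n_trees), function(i) {
  idx <- sample(seq_along(x), replace = TRUE)
  train_stump(x[idx], y[idx])
})

# Majority vote across all models for a new observation
predict_bagged <- function(new_x) {
  votes <- vapply(models, function(m) m(new_x), character(1))
  names(which.max(table(votes)))
}

predict_bagged(1.2)
predict_bagged(-1.2)
```

Swapping train_stump() for a real tree learner (e.g. rpart) turns this into basic bagged trees.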

Hyperparameters to be tuned

Hyperparameters are the arguments that can be set before training and which define how the training is done. The main hyperparameters in Random Forests are:

  • The number of decision trees to be combined
  • The maximum depth of the trees
  • The maximum number of features considered at each split
  • Whether bagging/bootstrapping is performed with or without replacement

Training Random Forest models

Random Forest implementations are available in many machine learning libraries for R and Python, like caret (R, imports the randomForest and other RF packages), Scikit-learn (Python) and H2O (R and Python).

Examples in R can be found here:

Other tree-based machine learning algorithms

The pros of Random Forests are that they are a relatively fast and powerful algorithm for classification and regression learning. Calculations can be parallelized, they perform well on many problems even with small datasets, and the output returns prediction probabilities.

Downsides of Random Forests are that they are black boxes, meaning that we can't interpret the decisions made by the model because they are too complex. RFs are also somewhat prone to overfitting, and they tend to be bad at predicting underrepresented classes in unbalanced datasets.

Other tree-based algorithms are (Extreme) Gradient Boosting and Rotation Forests.




To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.

Arnaub Chatterjee discusses artificial intelligence (AI) and machine learning (ML) in healthcare.

Mon, 10/29/2018 - 17:00

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Hugo Bowne-Anderson, the host of DataFramed, the DataCamp podcast, recently interviewed Arnaub Chatterjee, Senior Expert and Associate Partner in the Pharmaceutical and Medical Products group at McKinsey & Company.

Here is the podcast link.

Introducing Arnaub Chatterjee

Hugo: Arnaub, I’m really excited to have you here today to talk about the role of AI, data science, and machine learning in health care, what has and hasn’t worked, but before we get there, I’d love for you to set the scene of your own journey and let us know how you got into data science originally.

Arnaub: Thanks, Hugo, and thanks to DataCamp for having me today. The notion of how I got into data science is quite serendipitous and, as with a lot of things in life, a bit of being in the right place at the right time. There's also been a concurrent movement within health care where data science has really taken off within the last 10 years, so, as a lot of perfect storms work, all these factors aligned. I think my career's been a little bit of a zig-zag in the sense that I've held roles in consulting, worked for the previous administration in both the technology and the policy worlds, and then in pharma, and now back in consulting. But the central theme, or ethos, of how I've worked has always had some link to data science as a common thread.

Arnaub: Just to give you the quick background, I started my career after grad school as a consultant, originally focusing on pharma M&A, but then helping states build health data infrastructure in a pre-ACA (Affordable Care Act) time. That actually led me to go work for the Obama administration. I transitioned over to initially work on some data science efforts around health care fraud and abuse, thinking not only about the policy but also about how we utilize that data to predict who may be more likely to commit fraud and defraud the government.

Arnaub: Things transitioned into more of a technology bent from that perspective. I then had the opportunity to work with a few technology officers within HHS, and at the time Todd Park and Bryan Sivak were creating a new movement around open data and building APIs and platforms that accessed the tremendous amount of data the government sat on. So I was very, very fortunate to have run into two Silicon Valley guys who brought their DNA over to government and started a number of initiatives like the Health Data Initiative; we were building platforms and open APIs. A lot of that was also very much right time, right place, and being able to learn from folks who'd done it in the private sector with a technology mindset. Then the pharma aspect of it: I actually ended up transitioning over to pharma because of government, following some folks who had spent time working in government and started a team focused on data science at Merck. What we were doing at Merck was very much around how we utilize and identify novel data sets, which could include claims data, but also clinical, genomic, social media and sensor data. The goal was to utilize those data and think about how we demonstrate the clinical and economic value of Merck products. This was a whole new way of positioning a drug, and of thinking about the advent of new methodologies within data science to support and bolster the value of a drug.

Arnaub: I did that for a number of years and worked with a variety of academic institutions on different machine learning methodologies, different data science methods. All that eventually led me to McKinsey, where I am right now. In that capacity, I do a lot of work with not only pharma clients but technology clients and how they’re entering the health care sector. I feel in that capacity I’ve been fortunate to be at the front lines of how different companies are deploying machine learning, data science in a variety of different settings. Hopefully we’ll cover a lot of that today.

Hugo: Absolutely. And as you say, it took a lot of moving parts, almost a perfect storm of your interests and serendipity, of technology and the emerging data science stack and all the data available, to make this career path what it's become. We'll see that that happened in this space as well: it took a lot of moving parts, the availability and generation of large-scale data, computational power, and statistical insights, to allow data science, in health and otherwise, to emerge as it has.

How did you learn your data skills?

Hugo: I’m also interested in the fact that you started thinking about this type of stuff at grad school, and that’s when you started working with data. The skills you needed then and need now to work with data, are these skills that you learned on the job, or were you trained specifically to do this type of work?

Arnaub: It’s an interesting question. I think in grad school I spent some time in biostatistics and epidemiology. My background’s in health care on the business side and then also health policy bent. The funny thing about data science is if you ask a lot of folks in healthcare they’ll tell you that a data scientist is a statistician in California, and they’ll basically say that that notion and term has changed a lot as methods and the advent of machine learning has really warped what that definition means.

Arnaub: In some capacity I think people who have worked with traditional claims data and have skill sets like epidemiology, like biostatistics, are in their own right data scientists. I think what’s changed now is, like you mentioned, greater volumes and different types of data. There are different ways of processing and understanding how we use it. My training started there, and then now I think, like a lot of people, I’m having to evolve and learn just based on the fact that a lot of disciplines are converging. Computer scientists and the fact that there are actually data science degrees now, programs that are teaching different ways of using data, those are converging with a lot of old school ways that people have used health care data. That’s where I find myself right now.

Hugo: Great, and I love that you mentioned different types of data, because heterogeneity of data is so abundant there. When people ask me where this occurs, the clinical setting is one that I mention first: you can have tabular data from experiments and controls, in addition to imaging data from scans, which we'll get to, in addition to natural language from doctors' notes on patient files, and that type of stuff.

AI in Healthcare

Hugo: I think that’s a great time to move into this conversation about the use of AI in health care. As we know, there’s a great deal of hype around the use of machine learning in AI and health care. I’m wondering from your point of view, what has actually worked in this space?

Arnaub: Yeah, it’s a really important question, and I think when we talk about what’s actually worked, I think it’s important to know that this is a very much evolving space, so in some cases the jury is still out. In other cases, we’re starting to see very promising signs. Just to acknowledge the hype, I guess in my personal opinion, we’re in a golden era of funding for AI in health care. I think the statistic that I saw more recently was that since 2016, there have been 300 startups in AI in health care alone that have emerged. These companies span the gamut from reducing administrative scut work within insurance and billing companies, to actually creating new drugs and empowering compound development. I think what’s important about that is we’re starting to now see where a lot of the attention for AI is being driven towards and where the media attention is going towards. Venture dollars are flowing to companies that are also streamlining a lot of operational and efficiency tasks, administrative efficiency tasks in health care. They’re also flowing towards companies that have these bold aspirations of upending processes that have been happening for decades now.

Arnaub: What we’re trying to discern is what a successful outcome looks like, and how do we think about what the bigger aspiration is, where do we see tangible improvements in health care, how do we think about the advancement of patient outcomes? This conversation, it’s not meant to splash cold water or put a wet blanket on the great work that’s being done. It’s just meant to try and understand, like you mentioned at the very beginning of the conversation, a lot of talk about this has been taken place to where we’ve seen the promise.

Arnaub: Let me go ahead and take on a few different examples in spaces where I think this has worked, and we can have a deeper conversation about this. The very first things you mentioned was around imaging, and particularly within diagnostic imaging and health care, that’s a bedrock. I think it’s really important to remember that pioneering use cases of AI in other industries actually started with the ability to read back pictures and look at faces and patterns and objects in photographs. That is very much similar in health care where a lot of our lighthouse use cases where we’ve seen great success are starting to take place.

Arnaub: To give you a little bit of context on the growth of this market, AI-facilitated diagnostic imaging is expected to be a two billion dollar industry by 2023. That is just a fraction of the whole medical imaging market, which is 57 billion dollars; that 57 billion includes equipment, software and services, so it's a massive, massive marketplace that has been in health care for quite some time. I think what we're seeing now is a consensus from a number of parties, whether it's hospitals or technology companies, that AI is going to transform the diagnostic imaging industry, whether it's enhanced productivity, improved accuracy or personalized treatment planning; all of these functions are up for grabs.

Arnaub: Just to build a little bit more on this: why is imaging sort of the first place we're seeing improvement? For one thing, hospitals are producing around 50 petabytes of data per year, and 90% of that data is from medical imaging. We're talking about MRI scans, PET scans, CT scans, and all of these are also embedded within EHRs. I think that's one reason: the availability and the ubiquity of this data.

Arnaub: I think the second reason is that there are a number of compelling use cases that are actually there now within health care. To pick on some great work that Google has done, Google Brain published a really powerful paper in JAMA, where they worked with 130 thousand patients from an eye institute and looked at retinal fundus images. What they were able to do was come up with a sophisticated convolutional neural network that was able to detect diabetic retinopathy, which is one of the leading causes of blindness globally. Around 400 million people have this disorder.

Arnaub: What they effectively did was come up with a more refined version using a subset of those 130,000 images, and they outperformed a panel of eight ophthalmologists in terms of identifying where the retinopathy took place and characterizing the contextual clues. Their F-score was 0.95, they had an adjusted AUC, and it was published in JAMA. There's quite a strong clinical argument to be made that if we get access to more of this type of data and we're able to build it into the tools and processes ophthalmologists use to see their patients, that's just the beginning, I think. DeepMind has a very similar study within the retinal space. Just those two examples alone, I think, are pretty compelling. Not only are you seeing this in ophthalmology, but you're also starting to see it within dermatology and pathology as your next set of lighthouse use cases.

What types of companies are making the advancements?

Hugo: It strikes me as interesting, and I’m wondering if it’s interesting to you, that it’s companies such as Google in this case, that aren’t traditionally known in the health care space that are making such advancements.

Arnaub: Yeah, I think you're starting to see a lot of interesting collaborations between companies that have world class machine learning and health care institutions. On the other side of Silicon Valley you have Facebook. Facebook just announced their collaboration with the NYU School of Medicine, where they're utilizing their AI to speed up recognition of MRI scans. The project, for those of you who are interested, is called fastMRI. It's looking initially at around three million images of the knee and the brain and the liver, and looking at around 10 thousand different patient cases. That was just recently announced. We'll see what the fruits of that labor are.

Arnaub: I don't think it's all that surprising. Given the computational power, it's now incumbent upon Google to figure out where the applications are, where the use cases are, and I think this is where you're starting to see imaging as that first initial lighthouse for them. They have compellingly done this in other industries, and now they're able to do it with health care data as well.

Use Cases for AI in Healthcare

Hugo: So you were about to go and tell us a bit more about use cases in ophthalmology and dermatology, for example.

Arnaub: Yeah. I think we're starting to see similar … ophthalmology's obviously the retinal disease example. We've started to see different cases with breast cancer. There is a great example of a partnership between Kaggle and Intel and a company called MobileODT, where they developed an algorithm that accurately identified a woman's cervix to better screen and treat women for cervical cancer. The data consisted of around 10 thousand labeled cervix images, covering type one, two and three cervical cancer. This was a 50-layer convolutional neural network deep learning model that accurately segmented different parts of the cervix for type identification. Just another example where, just by leveraging the power of the crowd, it wasn't even academically trained or clinically trained folks, they were able to capture and accurately identify cervix type 75% of the time.

Arnaub: I think something to note that's really important is that these CNNs are actually reproducible. You don't have to rebuild and reinvent the wheel every single time. I think that's where you're going to start to see great refinement, and you're going to start to see a lot of enhancement in terms of how we do image recognition and reproducing these algorithms.

Arnaub: I think the second thing is these major partnerships where you’re starting to see tech companies partner with eye institutes and large corporations that have imaging data. That’s going to be very compelling and powerful.

Hugo: And so when you say reproducible, do you also mean reusable in the transfer learning sense?

Arnaub: Yeah, I think we'll get to this, I think, later on, hopefully, but one of the big challenges with AI is just making it reproducible within health care. The big hurdle there is that data is different in many different parts of the health care system. The patient that you're seeing in California will be very different than the one that you're seeing in Texas or South Carolina or Boston. I think what we're trying to better understand is how you create generalizability for an algorithm that may have been used within a certain subsection of the US population or the global population. Being able to consistently come up with those algorithms is the challenge, because there are also different ways of characterizing this, and I'll spend some time talking about this later.

Arnaub: For radiology specifically, the outcomes that you're looking for are different. You might look at the probability of a lesion, and you might look at the feature of a tumor. You might look at the location of a tumor. You have to consistently do that exercise with different types of imaging data over and over again in order to say the algorithm is reproducible. I think that's where we're starting to see that we have to continuously be able to prove that this algorithm is accurate and identifiable with other data settings.

Other Examples

Hugo: Look, in all honesty, having this conversation makes me realize even more how important it is to de-mystify these things, particularly because there are so many touch points, right, of where AI can have an impact in health, as you’ve mentioned, everything from administrative tasks to the scut work, the insurance industry, to all of these diagnostics. Before we move on, I’m wondering if there are any other examples or super interesting use cases to your mind?

Arnaub: Yeah. Absolutely. The second, I think, bedrock, lighthouse that we're seeing a lot of is around diagnosis prediction. How do you think about new variables that you haven't unearthed within data that might contribute to the progression of treatment?

Arnaub: We’re actually working with several clients in this space right now on coming up with novel prognostic variables that could lead to the progression of a disease, perhaps predicting earlier onset of the disease. I think what makes this compelling is there’s still a great amount of misunderstanding, there’s still a great amount of unmet needs that we’re not characterizing within our patient population. If we’re able to better understand using machine learning methods on who those patients might be, we might be able to do incredible things in how we’re getting them in and out of the hospital, seeing a provider quicker.

Arnaub: A great example of this is: Emory just released a study on sepsis where they looked at 42 thousand patients, and they looked at 65 different measures that might be predictive of the onset of sepsis. They looked at different time intervals, so within four, six, eight, and 12 hours. What was cool about this is that they were able to come up with a model with the same level of accuracy as a physician in predicting sepsis; in a validation cohort, the doctor and the algorithm were basically indistinguishable. This is an example not of machine versus physician. It was more that we have the ability to not only confirm and corroborate what physicians are finding. If we continuously refine this, we might find more measures that are more predictive of sepsis.
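A hedged sketch of the windowed-prediction setup the Emory study suggests: build one snapshot of features per lookback horizon (4, 6, 8, and 12 hours before suspected onset) and score each. The measure names, readings, thresholds, and scoring rule here are invented for illustration, not taken from the study.

```python
# Toy illustration of time-windowed risk prediction. All data is synthetic.
# Each observation is (hours_before_onset, measure_name, value).
observations = [
    (11, "heart_rate", 88), (9, "heart_rate", 95),
    (7, "heart_rate", 104), (5, "heart_rate", 112),
    (10, "temp_c", 37.1), (6, "temp_c", 38.4), (3, "temp_c", 38.9),
]

def window_features(obs, horizon):
    """Latest value of each measure observed within `horizon` hours of onset."""
    feats = {}
    # Sort furthest-from-onset first so readings closest to onset overwrite.
    for t, name, value in sorted(obs, reverse=True):
        if t <= horizon:
            feats[name] = value
    return feats

def risk_score(feats):
    """Invented scoring rule: tachycardia and fever each add a point."""
    score = 0
    if feats.get("heart_rate", 0) > 100:
        score += 1
    if feats.get("temp_c", 0) > 38.0:
        score += 1
    return score

scores = {h: risk_score(window_features(observations, h)) for h in (4, 6, 8, 12)}
# Wider windows capture both the fever and the rising heart rate.
```

A real model would replace the hand-written scoring rule with a trained classifier over all 65 measures, but the data-shaping step (one feature vector per horizon) is the part that makes "predicting sepsis N hours out" a well-posed learning problem.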

Arnaub: The one other example I want to share with you actually just came out last week, and this was in the Journal of the American Medical Association, a pretty top tier publication. This was looking at a randomized trial of 500-some patients with staph infections. They looked at patients over a six year period, and they found that an algorithm did just as well as doctors in suggesting how to treat them with antibiotics. What makes this really compelling is that they were able to say that patients who were on certain antibiotic protocols were on a drug for a certain number of days, or might have been on a certain number of drugs for a fewer number of days. You basically are looking at how we can think about antibiotic protocols and what is the best practice for keeping patients in and out of the hospital. I think that's where you're starting to see a lot of compelling evidence, and given that this is now appearing in top tier medical journals, it's not something that's in the future. These are things that are happening right now.

Hugo: And something you’ve hinted at several times already is that we’re not necessarily … there’s a false dichotomy between humans and machines, essentially, right?

Arnaub: Yeah.

Future of AI in Health Care

Hugo: And what is more interesting, I think, is the idea of human algorithmic interaction. The idea of having artificial intelligence and machine learning models with humans in the loop. Do you see this being part of the future of AI in health care?

Arnaub: I guess this is your robots versus physicians conversation?

Hugo: Sure.

Arnaub: Yeah, I think there are a few … I'll give you two funny anecdotes that are perpetuating this theory. We hear a lot of statements around whether physicians are going to be replaced by machines. One example is medical students are actually reportedly not specializing in radiology because they fear that the job market won't even exist anymore within 10 years.

Arnaub: The other example is there's a really interesting company in China called iFLYTEK, which is a pretty substantial Chinese AI company. It built the first machine to pass a medical exam, and it scored substantially higher than the student body population. When you hear these types of statements, and then you see all the JAMA evidence or New England Journal evidence of doctors being at the same level as a machine, there's going to be a lot of conversation. It also demonstrates how far machine learning has actually come.

Arnaub: I think there are a few things that make me believe that we're not at a place where physicians are anywhere close to being replaced. One is that a lot of these AI systems, like if you take the radiology example, perform what's called narrow AI. These are single tasks, and they're being programmed, and the deep learning models are being set for specific image recognition tasks, so detecting a nodule or looking at a chest CT and looking for a hemorrhage. These are N-of-one tasks, and they're binary, either yes or no. I think if we keep tasks in this narrow detection focus, we're going to find a lot of interesting things, but what that means is that these are going to be augmenting tools. They're going to help physicians improve diagnostic accuracy, but as we all know, physicians do quite a variety of tasks. There is a substantial amount of mental labor that goes into how physicians diagnose patients.

Arnaub: In the short term, I think we’re looking for AI to power a lot of solutions that can reduce costs and improve accuracy and augment physician’s decision making. I don’t see it replacing the tremendous work that docs do or our providers do any time soon.

Hugo: Yeah, and I love that you mention narrow AI which, as you said, is algorithms, AI models, to solve specific tasks. I think in the cultural consciousness when people hear AI, they don’t think of narrow, weak AI. They think of a strong AI, which reflects human cognition in some sense, and that isn’t even necessarily what we want in most places and what we’re working towards. Right?

Arnaub: Mm-hmm (affirmative). Yeah. That's right. Exactly. It has to be more expansive. I think the other thing that's worth mentioning is that there has to be … we talked about this already, but the consistency and the portability of the models have to happen. We're still a long way from integrating this into physician decision making. I think different vendors are focused on different deep learning algorithms and a variety of different use cases. Even certain things, we'll get to this, are being approved by the FDA, but they have completely different focal points. Until we can start to standardize a lot of that, it's going to take some time. To your point of narrow versus much more expansive thinking of AI, that's also part of the equation, and then how do we actually make this reproducible.

How has data science, ML, and AI evolved in the health care space?

Hugo: Something that you've mentioned several times is that a lot of the power we see now is from deep learning. You mentioned the use of convolutional neural networks. I'll just step back a bit, and for those people who want a bit of de-mystification of deep learning, deep learning is … and correct me if I'm wrong at any point … a subclass of machine learning, mostly in supervised learning, where you're trying to predict something. It's a particular type of supervised learning model, inspired loosely by the neural networks in our physiological systems, in our brains.
Hugo: Convolutional neural networks are ones that are quite good at picking out patterns in images, essentially. It uses a convolutional technique to do so. Of course, AI pre-dates convolutional neural networks, and although they’re very strong at the moment, I’m sure you’ve seen trends emerge and dissolve. I’m just wondering if you could speak to how the moving parts of data science, ML and AI, have evolved in the health care space since you’ve been working in it?

Arnaub: Yeah, and I think if we're going to generalize it to health care, there's quite a bit … there's a whole variety of different models and sophistication of those models that are being thrown at different problems. I think at its most basic level, a lot of early applications of AI in health care were focused on the relationships between diagnoses and medications. Some of the more basic techniques, such as association rule mining or supervised learning, were meant to find and extract important associations. There are a lot of limitations to those methods: they were only looking at item-level co-occurrences. They weren't really reaching abstractions at a higher level. There wasn't a lot of usefulness for data exploration or clinical decision support.

Arnaub: And I think if you look at supervised learning techniques, they are tackling these problems from a prediction perspective. If we have the right level of data, we can come up with more non-predictive applications. Things like disease group identification or patient stratification. There are things that can happen as data becomes much more usable, I guess, for lack of a better word. I think that's where we'll actually be able to see supervised learning become much more applicable, going from small data sets with few observations into much more massive examples. There's great work, for example, being done at Stanford and UCSF where they've looked at hundreds of thousands of patients over 10 years of longitudinality, billions of observations, and come up with sophisticated deep learning neural networks. I think that's where you're starting to see the far-reaching applications of AI.

Arnaub: In other cases, we’re still working on the data problem, which is that we’re getting enough data to make this interesting, but the sophistication of certain models or methods may not be there because the data’s, quite frankly, not that good.

What does the future of data science, ML, and AI in health care look like to you?

Hugo: All that being said, what does the future of data science, ML and AI, in health care look like to you?

Arnaub: Yeah, so I think there are a lot of applications that we haven't talked about yet. I think we've picked on the two easy ones in terms of what's happened and what's been working – disease diagnosis prediction and then the imaging. I think there's quite a bit of work in terms of what's happening in drug development. There are exciting startups that are focused on things like drug repurposing, where they're using real world data and machine learning algorithms to explore the relationships between drug molecules and disease. That is extremely compelling. That's where you're starting to see a lot of funding go, especially from biotech and pharma. There are companies like BenevolentAI and Numerate and others that are using deep learning to mine a quite vast amount of data, looking at everything from scientific papers to clinical trials; they're effectively trying to understand which compounds can be more effective at targeting disease.

Arnaub: These are the types of things that I think are getting quite a bit of investment, but we haven't seen the fruits of the labor yet. I mentioned Benevolent. They started identifying hypotheses around ALS treatments, and you know this is just a start, but it's starting to narrow down which drug targets or which compounds to go after. It not only saves a tremendous amount of time for biotech and pharma, it also expedites the drug development process. I think that's one example.

Arnaub: There are really interesting and powerful examples of genomic data that we haven't talked about yet, so DeepVariant, if I go back to Google for one sec. DeepVariant is an open source tool that was about two years of work between Google Brain and Verily, which is Google's life science arm. What they effectively were able to do is come up with a more sophisticated statistical approach to spot mutations and filter out errors. What DeepVariant does is change the whole task of variant calling, which is trying to figure out which base pairs are part of you and which are part of some kind of processing artifact. It turns it into an image classification problem. DeepVariant is starting to replace and outperform the basic bioinformatics tools, like GATK and SAMtools, reducing the amount of error by up to 10 times.
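The pileup-to-image idea can be sketched in a few lines: render the reads over a reference at a candidate site as a small numeric tensor, so a CNN can classify it as variant or artifact. The encoding below is a toy illustration and not DeepVariant's actual pileup format.

```python
# Toy encoding of a read pileup as an "image" for a CNN. Synthetic data.
REF = "ACGTAC"
reads = ["ACGTAC", "ACTTAC", "ACTTAC"]  # two reads carry a G->T mismatch at position 2

BASE_VALUE = {"A": 0.25, "C": 0.50, "G": 0.75, "T": 1.00}

def pileup_image(ref, reads):
    """One row per read, one column per position; two channels per pixel:
    (base intensity, matches-reference flag)."""
    return [
        [(BASE_VALUE[b], 1.0 if b == ref[i] else 0.0)
         for i, b in enumerate(read)]
        for read in reads
    ]

img = pileup_image(REF, reads)

# A downstream CNN would classify such tensors as variant / non-variant.
# Here we just show the mismatch column stands out in the second channel.
mismatch_cols = [i for i in range(len(REF))
                 if any(img[r][i][1] == 0.0 for r in range(len(reads)))]
```

The point of the reformulation is that once the pileup is a tensor, all the machinery of image classification (learned filters, pooling, large labelled training sets) applies directly, instead of hand-tuned statistical filters.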

Arnaub: This is just, I think, in the beginning stages. Even companies like Google will tell you that their genomics work is a couple years out, considering this one took two years to build. But I'm extremely excited about that type of potential. There are other examples around physician burnout and the advent of voice technology within health care, where we're starting to understand that doctors spend an enormous amount of time on EHR (electronic health record) data entry. If we're able to use machine learning and natural language processing and voice technology in the future, then we can start to auto-populate structured fields within records, make the doctor's job less burdensome, reduce physician documentation burden. Those are three use cases that I think are on the frontier. They're areas where there's a lot of hype and interest and really amazing work happening, but that's a short list of where I see the future going.

Questions from the Audience: Liability

Hugo: Great. Before I go on, there are actually a few interesting questions from the crowd, and Gamal has asked a question: what about liability? I actually want to frame that in terms of thinking about the future of data science and ML and AI in healthcare, and in particular the fact that a lot of the algorithms we've discussed are essentially black box algorithms, in that it's difficult to see why they make the predictions that they do. So in terms of interpretability versus black box, maybe you can discuss that with respect to, I suppose, liability for the models that we as data scientists build.

Arnaub: Yeah, I think that's an incredibly important question. One thing I want to talk about is the policy space in terms of where the future is. This notion of FDA-approved algorithms is actually starting to happen. What we're seeing right now is this lack of consistency and transferability in the current models because they focus on different end points, and they are done in a black box setting where it's data in, and we're not really sure what comes out. I think what that means is that regulatory bodies are going to intervene, albeit in a positive way.

Arnaub: As an example, the American College of Radiology is actually helping to provide methodologies for vendors to verify the effectiveness of algorithms before they're taken to market. I think that's one example. The other example is the FDA accepting and approving algorithms as part of diagnosis. They gave a positive vote to a decision support tool that uses an algorithm for neurovascular disease. They did the same thing for diabetic retinopathy in April, and then they did something for a computer-aided tool that helps with wrist fractures in adult patients. These are all examples of the FDA permitting the marketplace to be open. They are allowing algorithms to actually help providers in a regulated way.

Arnaub: There's actually really cool stuff happening within the White House and the House Oversight Information Technology Committee. If you guys are extremely bored, you should read this really ominous report called "Rise of the Machines" that the House Oversight Committee just put out. It's basically about how the NIH is going to ensure that there's standardization within algorithms. The same thing with the White House. They put out a really interesting plan from the government to build up AI in an ethical way. I think the black box problem is going to continue to happen. We've already seen it be problematic for some large companies. We need to be able to address that, and although we don't love government intervention, I think this is one instance where we're actually seeing a lot of positive things come out of it.

Do ethical questions in data science have a much bigger impact on AI in health care, and are there moves towards developing ethical guidelines for researchers in this space?

Hugo: Following up on this, we've actually got a great question from someone in the audience called Daniel about ethical questions in general: Do ethical questions in data science have a much bigger impact on AI in health care, and are there moves towards developing ethical guidelines for researchers in this space? You've spoken to that with respect to top-down regulation. I'm also wondering about practice from within the data science community. What type of stakeholders will hold data scientists accountable? And also, there's the fact that in marketing, for example, or advertising perhaps… if you show someone the wrong ad, that doesn't have as important an impact as giving someone the wrong diagnosis, right? Are there considerations that are particularly important in the health space?

Arnaub: Yeah, so I think what we're seeing is how do we standardize an ontology for a disease, and that's an evolving question. There are academic consortiums that are focused on reproducing these phenotypes. A phenotype is basically how we characterize a patient and their respective disease. If academic groups and organizations come together to say this is a generally accepted algorithm, this is how we can avoid erroneous cancer treatment advice, or this is how we see an unsafe or incorrect treatment recommendation, I think that will actually compel more folks to work within certain parameters and build algorithms that are within guidelines and practice. Otherwise, it's incredibly easy to find signal within a lot of health data noise. I think there are companies that have suffered some hard lessons that way. I think as long as we are working with organizations that are attempting to do this, that's one way of tackling this.

Arnaub: I think the other thing is health data is incredibly inconsistent, and there is a national subcommittee called HL7, which is a health data standards committee. They are really making a strong push for something called FHIR, which is Fast Healthcare Interoperability Resources. It's trying to create a standard where data is no longer somebody's competitive advantage, but something that everyone can use, and something that is standardized for everyone. You're not just looking at inconsistent standards.

Arnaub: The Centers for Medicare/Medicaid services are really trying to push standard ontologies. I think FHIR and these other organizations are trying to create a consistency behind all the chaos and the noise.
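As a concrete illustration of the interoperability point: FHIR resources are exchanged as plain JSON with a fixed `resourceType` and schema, so any conformant system can parse them without a custom mapping. The sketch below shows a minimal, hypothetical Patient resource; real resources carry many more fields.

```python
# Minimal illustrative FHIR Patient resource. The specific patient values
# are invented; the resourceType / name / birthDate structure follows the
# FHIR Patient schema.
import json

patient_json = """
{
  "resourceType": "Patient",
  "id": "example-patient-1",
  "name": [{"family": "Doe", "given": ["Jane"]}],
  "birthDate": "1980-01-01"
}
"""

patient = json.loads(patient_json)

# Because the schema is standard, consuming code needs no site-specific logic.
display_name = f'{patient["name"][0]["given"][0]} {patient["name"][0]["family"]}'
```

This is what "data is no longer somebody's competitive advantage" looks like in practice: the receiving system knows in advance where the name and birth date live, whichever EHR produced the record.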
Hugo: Fantastic. That actually answers the next question that we had from the audience, which is from David, which concerns research into the policy implications of AI in health care, and in particular, will FHIR have any impact on AI implementation. That’s great that you were able to answer that question even before I framed it.
Hugo: Another question. I'll just remind people who are listening and watching today to send through questions in the chat as soon as they come to mind. I've got a really interesting question from William, which we hinted at earlier. William says "What I notice is that a big part of the hype is focused on the R&D side of health care. For example, image analysis, drug discovery. What is the hype, and more important, what are the promising applications on the manufacturing side?"

Arnaub: Yeah, so that’s a good question. I assume that’s in relation to drug development, sorry, within pharmacos?

Hugo: Yeah, exactly.

Arnaub: Yeah, so I think we're starting to see a lot of activity in this space. It's a little bit more nuanced, but in terms of how manufacturing is trying to tackle this, I think we are now in a place where we have the ability to standardize and create better understanding of how drug cycling takes place, where the supply chain can be optimized. There are companies like BERG, for example, that are using AI not only for research applications but for manufacturing. It's something that I come across less, but something that is still popular. I think there are ways to think about unsupervised learning methods in terms of how we're trying to understand drug circulation and where we can improve our supply chain efforts.

Arnaub: There is actually a cool effort from the UK Royal Society that is looking at the role of machine learning in bio-manufacturing. Can we actually help optimize the time factor here, like helping manufacturers reduce the time to produce drugs, lower costs, improve replication? Yes, still very much a popular topic. It’s not something that we’ve avoided, but discovery is where I’ve seen a lot of the money and interest flow to right now.

Given the black box nature of AI and the proprietary nature of industry, how will external validation and reproducibility of algorithms be assessed?

Hugo: We've got several questions coming through about, I suppose, the ethical nature of using AI and ML and data science in health care. The first one I want to ask you is from a listener called James. James asks: given the black box nature of AI and the proprietary nature of industry, how will external validation and reproducibility of algorithms be assessed? He also says, basically, where does open science fit in for the commercial AI space?

Arnaub: Yeah, that's a good question. I think what we need to do is come up with more of an interdisciplinary, multi-stakeholder way of evaluating different algorithms that are entering the marketplace. You have these large top-down institutions like the FDA that are evaluating the ability for a physician to utilize an algorithm in practice. I think there are other organizations at the academic level that are more interdisciplinary. A great one is called OHDSI, which is the Observational Health Data Sciences and Informatics group. What they're attempting to do, and this is actually a collaboration between pharma and academics and startups … one thing that they've done that I think is really important is they've created a common data model for health care. They've looked at disparate observational health care databases and decided that EMRs are really important for supporting clinical care. Databases like claims data are important for reimbursement, but they both serve different purposes. We need to create a common data model that accommodates both of these different types of data.

Arnaub: This CDM, through a partnership called OMOP, which stands for the Observational Medical Outcomes Partnership, basically tries to take all this noise from random coding systems and create one standardized ontology and vocabulary. That's one way of trying to get buy-in from multiple, interdisciplinary players. That's, I think, something that helps with the ethical challenge.
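A minimal sketch of what a common data model buys you: source codes from different vocabularies map onto one standard concept, so the same query runs against any conformant database. The codes and concept IDs below are illustrative, not a real OMOP vocabulary extract.

```python
# Toy concept-mapping table in the spirit of a common data model.
# (vocabulary, source_code) -> standard concept ID. IDs here are illustrative.
CONCEPT_MAP = {
    ("ICD10CM", "E11.9"):  201826,  # type 2 diabetes, one source coding
    ("ICD9CM",  "250.00"): 201826,  # same disease, a different source vocabulary
    ("ICD10CM", "I10"):    320128,  # essential hypertension
}

def to_standard_concept(vocabulary, code):
    """Map a source-coded diagnosis to its standard concept, if known."""
    return CONCEPT_MAP.get((vocabulary, code))

# Records from two hospitals using different coding systems...
records = [("ICD9CM", "250.00"), ("ICD10CM", "E11.9"), ("ICD10CM", "I10")]
concepts = [to_standard_concept(v, c) for v, c in records]
# ...land on the same standard concept when they mean the same disease.
```

In a real OMOP deployment this mapping lives in curated vocabulary tables rather than a hand-written dict, but the principle is the same: analyses are written once against standard concepts, then reused across every database that conforms.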

Arnaub: OHDSI is an organization that actually works on reproducing and publishing all this research. All of it is open source. They've created a number of software tools like Atlas and Achilles that are standardized tools for profiling different data and data quality. This is not something that we're going to solve overnight. I think regulatory bodies are going to be very judicious about what they approve and what they don't approve. What tends to happen in health care is as soon as there's some sort of adverse event, or there's some kind of clinical error, you're going to see the entire industry get slapped. Nobody wants that to happen. I think that's what's stifling for innovation.

Arnaub: We are, hopefully, trying to … It's a weird world that we sit in right now, where we're trying to get as much AI-related work out there as possible, while also being very mindful that a successful deployment of this, once it enters the physician's hands, or it starts to be part of patient care, is incredibly challenging. That's what the next evolution of all of this stuff is: the very careful implementation of these algorithms into a clinical decision model.

Is it or will it be enough to prove the efficacy of an algorithm or model, or is a full, functional description required?

Hugo: I'm glad that you mentioned regulation there because we have a related question from Stephen. The question is: from a regulatory perspective, is it or will it be enough to prove the efficacy of an algorithm or model, or is a full, functional description required?

Arnaub: Yeah, I don't know if we have proper guidance around that. I think there are a lot of organizations that are trying to demystify how the FDA should be thinking about this. A few things are standard in terms of how much you have to prove. One is: what is your benchmark data? Are you using ground truth data? Meaning, is it a trusted claims data source? Is it a standardized ontology for EHR data? I think companies grapple with choosing among a whole variety of different data sources and collection methods, and then realize that their algorithm isn't that good when they're hoping for its approval. There are good definitions of what ground truth is. That's one way of creating a really strong model.
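The ground-truth benchmarking step can be sketched simply: compare model predictions against a trusted labelled set and report the metrics a regulator would ask about. The labels and predictions below are synthetic.

```python
# Toy benchmark against a ground-truth labelled set. All data is synthetic.
ground_truth = [1, 1, 0, 0, 1, 0, 0, 1]   # trusted labels (e.g. adjudicated cases)
predictions  = [1, 0, 0, 0, 1, 1, 0, 1]   # model output on the same patients

# Count the confusion-matrix cells.
tp = sum(1 for g, p in zip(ground_truth, predictions) if g == 1 and p == 1)
fp = sum(1 for g, p in zip(ground_truth, predictions) if g == 0 and p == 1)
fn = sum(1 for g, p in zip(ground_truth, predictions) if g == 1 and p == 0)

sensitivity = tp / (tp + fn)   # how many true cases the model catches
precision   = tp / (tp + fp)   # how many flagged cases are real
```

Which metric matters most depends on the intended use the regulator is evaluating: a screening tool that misses cases (low sensitivity) and one that over-flags (low precision) fail patients in different ways, which is why benchmark data and end points have to be declared up front.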

Arnaub: I think there are other ways of thinking about what’s the intended use of the algorithm. How do we think about this interfacing with a physician? Does it have any downstream patient implications? Is there going to be algorithm bias? Meaning are you going to deny care to a certain patient population? That’s something that the FDA considers in terms of how it considers it to be ethical or not ethical.

Arnaub: Then I think there’s a whole regulatory approach on refitting models so that they’re continuously learning. Scott Gottlieb, who runs the FDA, has talked a lot about how will patients evolve, and how do we think about when companies have to do a refit of the model and how it should be validated. What is the right refitting cadence? Is it every hour? Is it six months? Is it annually? I feel like a few organizations have taken a stab at trying to create these lists of guiding light questions that can help us come up with a good model versus a subpar model and one that’s more likely to be implemented and accepted by the clinical community versus ones that have a great finding but may have some holes in them.

Due to health care being a regulated area, do you feel that AI and health care will be a focus area only for big companies like Google and Facebook to be able to make advances? Or do you feel small companies have a space?

Hugo: In terms of the regulation, we've got a great question from Harsh. Harsh asks: due to health care being a regulated area, do you feel that AI and health care will be a focus area only for big companies like Google and Facebook to be able to make advances? Or do you feel small companies have a space? You mentioned earlier that there have been several hundred startups in this space in particular. Maybe you can speak a bit more about it?

Arnaub: Sure. I think it's a really interesting question. I'll give you one example that I think raised a lot of eyebrows, to your point about big tech companies like Google. Facebook actually announced earlier this year that they are looking at using AI to monitor suicide and understand which of their users might be more likely to commit a suicidal event. That is a highly ethically challenging place for a tech company to play in.

Arnaub: I think that large tech companies, while they have great applications and great computer science, are treading very carefully because they realize that this isn't something you can just wade into. I think they actually take on a tremendous amount of organizational risk as they enter this market, partly because patient care is totally different from selling ads. Their ability to apply their science and computational power comes with great risk. They have to evaluate whether it's worth it or not, but they all want to make the world a better place. That's their aspiration.

Arnaub: As for other companies, there are so many powerful tech companies. I mentioned voice technologies earlier: virtual assistants and voice technology companies are going to be major players in this space. And I don't just mean Amazon; companies like Orbita and others are doing incredible work, and Robin AI is basically trying to help reduce physician documentation burden. These are well-funded, well-capitalized, strongly supported startups that are doing great things. Then there's patient data and risk analytics, where companies like NarrativeDx are doing really compelling work. They're partnering directly with health care systems to do this in a safe and compliant way, so I don't think large tech companies are the only ones that can play in this space.

Arnaub: I think if you are very calculated and methodical about how you enter, and if you engage in the right partnerships, a startup … I mean, there are quite a few startups that have made a lot of headway in this space, in the drug discovery world. These are startups that have literally raised hundreds of millions of dollars and are now functioning quite well and successfully. Pharma companies are making multi-year bets on companies like Numerate and BenevolentAI and investing quite a bit of money. This is not just a large tech company space anymore.

What are the main limitations on the rate of adoption of AI in health care?

Hugo: There are actually an overwhelming number of fantastic questions coming through, but we're going to need to wrap up in the next 10 minutes or so. I've got two more questions. The first is from Christopher, which I'm very interested in. Christopher asks: what are the main limitations on the rate of adoption of AI in health care?

Arnaub: Yeah. That’s a good question. We talked a little bit about what are the policy hurdles, and we talked about how do we think about approaches. I think what is the biggest rate limiting step in health care is going to be the ubiquity of good quality data. This is the biggest challenge that I think has been plaguing health care for decades now, is that as soon as a novel data set is released into the health care world, everybody gets really excited about it. As soon as the government made standards for EHR, electronic health records, EHR became competitive domain for any organization that had the record. The challenge was getting access to that data. It’s the same thing with genomics now. We’re starting to see bio-banks and the ability to have genomically sequenced data, genetically sequenced data. That is the next domain. That’s where people are trying to get to, but none of this matters unless the data is linkable, unless there is a standard, unless there is labeled data. We talk a lot about imaging today, but radiologists suffer from the fact that imaging data is stored within these pax warehouses, pax is the archiving system, and then they’re not labeled. We don’t know what we’re necessarily looking at. What that all goes to show is that the biggest hurdle for AI adoption in health care is good quality data, which is why I mentioned standards like FHIR and others trying to create some kind of harmony and consistency in the data in a very chaotic world.

Arnaub: I think the other thing is that hospitals and other organizations that have the data are very much willing to work with players, but there's quite a bit of overlap in terms of what companies are promising. We're starting to see a lot of companies tread into different spaces and say that they're doing compound development, or looking at molecule identification, or target validation. They're trying to be a jack of all trades, and I think that obfuscates what a company actually does.

Arnaub: My suggestion is to have a very clear, refined focus on what you think you are doing and what you're good at, rather than trying to wade into a lot of other murky waters. With that said, the market is so hot right now that you will see a tremendous amount of partnership and opportunity for startups. The biggest limiting steps are access to the data, finding the right partner, being able to demonstrate a use case, and then the application of that algorithm within clinical practice.

Do you think artificial intelligence will democratize medicine?

Hugo: I’ve got one more question from one of our listeners, and then I’m going ask you one final question. This is from Gamal, and it’s a relatively general question that I’d like you to interpret in whatever way you see fit. Do you think artificial intelligence will democratize medicine?

Arnaub: Oh, interesting. I think that we will get to a place where … I'm going to be liberal about the use of the word democratize. I think what that means is enabling access to care, perhaps; maybe that's the definition we'll choose to use. Patients are increasingly interfacing with the health system in different ways, and the vast majority of patients, almost 90% now, go online to look up health information. There are also a lot of ways for patients to share their data now, with physicians and with technology companies. We all know the work that Apple's doing with HealthKit and ResearchKit to try to get more access to data. I think there is going to be some greater role for AI, and maybe for technology, to help with access to care. Hopefully I'm addressing your question, but feel free to rephrase.

Arnaub: At the same time the US is suffering from tremendous endemic health policy challenges that I don’t think AI will fix. I think AI will enable and help certain things. It will maybe power diagnosis. Maybe it will improve better health outcomes over time. There’s still a good portion of the population that will never be AI enabled, for lack of a better word, or have access to health care resources. I think that’s the greatest hurdle in our system.

Call to Action

Hugo: That definitely does answer the question. My final question for you is: for all our listeners out there, do you have a final call to action?

Arnaub: Yeah. I think we’ve talked a lot about challenges. We’ve also talked a lot about promise and where the industry is going. I think this concept of fixing a lot of low-hanging fruit problems, we’ve picked on a lot of sexier things like drug development, but our health care system suffers from tremendous waste. These are enormous problems, and AI can fix a lot of these things, like insurance and billing claims. My old mentors used to say that a lot of the most lucrative work in health care is also the least sexy and the back office application type stuff.

Arnaub: If we’re able to do things like predict better waste or fraud, or if we’re able to improve billing and documentation processes, these are incredibly important problems to go after, and I think there’s something there. You should use AI and your powers to go fix them.

Arnaub: I think the other thing is that these problems shouldn’t be done in isolation, or fixed in isolation. You are going to see a lot of different and perhaps unique partnerships in health care take place. Hospitals and tech companies and patient groups working with startups. I think that whole model has flipped on its head. I encourage everyone to be very inventive about how they want to work with different parties. There are a lot of non-traditional folks that are getting into the health care space, so think about where the intersections lie and where the cross-functionality lies. That’s where you usually find more inventive solutions as opposed to working through the same channels.

Hugo: Thanks, Arnaub.

Arnaub: Yeah, thank you, Hugo. Thanks to DataCamp for the time. I really appreciate the opportunity.
Hugo: Absolutely. All right.

Arnaub: Thank you.


To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming.